Create Custom Metrics for AWS Glue Jobs

LEVEL: INTERMEDIATE
As you know, CloudWatch lets you publish custom metrics from your applications. These are metrics that are not provided by the AWS services themselves.
Traditionally, applications published custom metrics by calling CloudWatch’s PutMetricData API, most commonly through the AWS SDK for the language of your choice.
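For comparison, here is a minimal Python sketch of that traditional approach. The namespace `Custom/GlueJobs`, the metric name `JobSucceeded`, and the `JobName` dimension are illustrative assumptions, not names taken from this article. The helper only builds the request; the commented-out lines show how it could be passed to boto3's `put_metric_data`.

```python
from datetime import datetime, timezone


def build_put_metric_data_request(job_name: str, succeeded: bool) -> dict:
    """Build keyword arguments for CloudWatch's PutMetricData API."""
    return {
        "Namespace": "Custom/GlueJobs",  # hypothetical namespace
        "MetricData": [
            {
                "MetricName": "JobSucceeded",  # hypothetical metric name
                "Dimensions": [{"Name": "JobName", "Value": job_name}],
                "Timestamp": datetime.now(timezone.utc),
                "Value": 1.0 if succeeded else 0.0,
                "Unit": "Count",
            }
        ],
    }


request = build_put_metric_data_request("my-glue-job", succeeded=True)

# With AWS credentials configured, the request could be sent like this:
# import boto3
# boto3.client("cloudwatch").put_metric_data(**request)
```

Every data point costs an API call here, which is the overhead the log-based approach avoids.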
With the new CloudWatch Embedded Metric Format (EMF), you can instead embed custom metrics, as structured JSON, in the logs that your application sends to CloudWatch, and CloudWatch automatically extracts them from the log data. You can then graph these metrics in the CloudWatch console and set alarms on them, just like the built-in metrics.
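As a rough illustration of what such a log line can look like, the Python sketch below builds one EMF record and prints it as JSON. The `_aws` block follows the EMF layout (a millisecond timestamp plus a `CloudWatchMetrics` directive), while the namespace, metric name, and dimension are assumptions chosen for this example.

```python
import json
import time


def build_emf_record(job_name: str, succeeded: bool) -> dict:
    """Build one log record in CloudWatch Embedded Metric Format."""
    return {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "CloudWatchMetrics": [
                {
                    "Namespace": "Custom/GlueJobs",  # hypothetical namespace
                    "Dimensions": [["JobName"]],
                    "Metrics": [{"Name": "JobSucceeded", "Unit": "Count"}],
                }
            ],
        },
        # Dimension and metric values live at the top level of the record
        "JobName": job_name,
        "JobSucceeded": 1 if succeeded else 0,
    }


# In a Lambda function, anything written to stdout ends up in CloudWatch Logs
print(json.dumps(build_emf_record("my-glue-job", succeeded=True)))
```

The names listed under `Dimensions` and `Metrics` must match top-level keys of the same record, since that is where CloudWatch reads the values from.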
EMF works anywhere you publish CloudWatch logs from: EC2 instances, on-premises VMs, Docker/Kubernetes containers in ECS/EKS, Lambda functions, and so on.
In this case, we focus on custom metrics for AWS Glue job executions. The end goal is a CloudWatch alarm that tells us whether a Glue job run succeeded or not. The proposed solution is shown in the following diagram.

Infrastructure
We will create the infrastructure and permissions needed with Terraform.
# EventBridge rule that matches every Glue job state change event
resource "aws_cloudwatch_event_rule" "custom_glue_job_metrics" {
  name        = "CustomGlueJobMetrics"
  description = "Create custom metrics from glue job events"
  is_enabled  = true
  event_pattern = jsonencode(
    {
      "source" : [
        "aws.glue"
      ],
      "detail-type" : [
        "Glue Job State Change"
      ]
    }
  )
}
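To make the event pattern concrete, here is a rough Python sketch of the rule it expresses: each key in the pattern lists the accepted values for the same key in the event. The sample event is abridged and its values are made up, and `matches` is a hypothetical helper that covers only flat exact-value patterns like this one, not the full EventBridge pattern language.

```python
PATTERN = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
}

# Abridged, illustrative event; real events carry more fields
SAMPLE_EVENT = {
    "source": "aws.glue",
    "detail-type": "Glue Job State Change",
    "detail": {
        "jobName": "my-glue-job",
        "state": "SUCCEEDED",
        "jobRunId": "jr_example",
    },
}


def matches(pattern: dict, event: dict) -> bool:
    """Return True if every pattern key lists the event's value for that key."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


print(matches(PATTERN, SAMPLE_EVENT))
```

Because the pattern does not filter on `detail`, successful and unsuccessful runs both reach the target.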
# Send matching events to the Lambda function; failed deliveries are not retried
resource "aws_cloudwatch_event_target" "custom_glue_job_metrics" {
  target_id = "CustomGlueJobMetrics"
  rule      = aws_cloudwatch_event_rule.custom_glue_job_metrics.name
  arn       = aws_lambda_function.custom_glue_job_metrics.arn
  retry_policy {
    maximum_event_age_in_seconds = 3600
    maximum_retry_attempts       = 0
  }
}
# Lambda function that receives the events; its code is packaged in python/handler.zip
resource "aws_lambda_function" "custom_glue_job_metrics" {
  function_name    = "CustomGlueJobMetrics"
  filename         = "python/handler.zip"
  source_code_hash = filebase64sha256("python/handler.zip")
  role             = aws_iam_role.custom_glue_job_metrics.arn
  handler          = "handler.handler"
  runtime          = "python3.9"
  timeout          = 90
  tracing_config {
    mode = "PassThrough"
  }
}
# Allow EventBridge to invoke the function, but only from this rule
resource "aws_lambda_permission" "allow_cloudwatch" {
  statement_id  = "AllowExecutionFromCloudWatch"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.custom_glue_job_metrics.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.custom_glue_job_metrics.arn
}
# Execution role that the Lambda service is allowed to assume
resource "aws_iam_role" "custom_glue_job_metrics" {
  name = "CustomGlueJobMetrics"
  assume_role_policy = jsonencode(
    {
      Version : "2012-10-17",
      Statement : [
        {
          Effect : "Allow",
          Principal : {
            Service : "lambda.amazonaws.com"
          },
          Action : "sts:AssumeRole"
        }
      ]
  })
}
# Inline policy attached to the execution role
resource "aws_iam_role_policy" "custom_glue_job_metrics" {
  name = "CustomGlueJobMetrics"
  role = aws_iam_role.custom_glue_job_metrics.id
  policy = jsonencode({
    Version : "2012-10-17",
    Statement : [
      {
        Effect : "Allow",
        Action : [
          "logs:CreateLogGroup",
          "logs:CreateLogStream",
          "logs:PutLogEvents"
        ],
        Resource : "arn:aws:logs:*:*:*"