Serverless Operations in AWS, part 3: Schedule and Monitor

Welcome to the continuation of this series about serverless operations in AWS. Our objective is to set up our custom coding project to run automated procedures in AWS. In Part 1, "Prep and Build", we built a container for our project called "hello-aws" and tested its operations. Then in Part 2, "Deploy and Run", we set up a serverless infrastructure to use spot instances in AWS Fargate to run jobs that execute code in our "hello-aws" container repository.

Now with Part 3, "Schedule and Monitor", we will establish a schedule to automatically execute jobs on a regular basis. Then we'll follow through by noting the places where we can monitor the status of those jobs over time, delve into their logs as appropriate, and set up notifications if a problem is encountered.

And finally, check out @FrederikV's post, Running SAS Scoring Runtime Containers through AWS Fargate, where he takes the principles explained in this series and applies them to running analytic models produced by SAS Viya using a serverless approach in AWS.

Serverless architecture in AWS

We're using more of the services in this illustration:

At this point, we've created the Docker container for our project code, "hello-aws", pushed it into Amazon ECR, registered job definitions in AWS Batch, and then used those to submit jobs to run in Fargate. For this post, we'll continue on and utilize Amazon EventBridge for scheduling as well as Amazon CloudWatch and other services for monitoring and notifications.

AWS EventBridge > Schedule the job

Amazon offers the EventBridge service which they describe as a "serverless event bus that helps you receive, filter, transform, route, and deliver events". We're going to use it as an event scheduler to configure appointed time(s) to run our project code.

> IAM permissions to use AWS EventBridge Scheduler

There are all kinds of options for scheduling your job. When do you want it to run? Just once at a specific time in the future? Or repeatedly on some schedule? Perhaps it should be kicked off when some other event occurs? Don't worry, EventBridge likely has your use-case covered. But, with that said, I can only do so much here in this post, so we'll just schedule "hello-aws" to simply run once an hour at :03 minutes past.

To schedule a job:

PROJECT="hello-aws"
QUEUE_NAME="$PROJECT-queue"
 
ROLE_ARN="$(aws iam get-role --role-name ecsTaskExecutionRole --query 'Role.Arn' --output text)"
 
JOBDEF_NAME="${PROJECT}-jobdef"
SCHED_NAME="${JOBDEF_NAME}-schedule"
JSON_FILE="createSched-$SCHED_NAME.json"
 
cat << EOF > $JSON_FILE
{
    "ActionAfterCompletion": "NONE",
    "Description": "executing ${JOBDEF_NAME}",
    "FlexibleTimeWindow": {
        "MaximumWindowInMinutes": 5,
        "Mode": "FLEXIBLE"
    },
    "GroupName": "default",
    "Name": "$SCHED_NAME",
    "ScheduleExpression": "cron(03 * * * ? *)",
    "ScheduleExpressionTimezone": "America/New_York",
    "State": "ENABLED",
    "Target": {
        "Arn": "arn:aws:scheduler:::aws-sdk:batch:submitJob",
        "Input": "{\n  \"JobDefinition\": \"${JOBDEF_NAME}\",\n  \"JobName\": \"${JOBDEF_NAME}\",\n  \"JobQueue\": \"${QUEUE_NAME}\"\n}",
        "RetryPolicy": {
            "MaximumEventAgeInSeconds": 86400,
            "MaximumRetryAttempts": 185
        },
        "RoleArn": "${ROLE_ARN}"
    }
}
EOF
 
echo -e "\nVerify JSON file [$JSON_FILE], then exec this command:"
echo -e "\naws scheduler create-schedule --cli-input-json file://$JSON_FILE\n"

As you can see, EventBridge offers numerous scheduling options (and many more that are not referenced in the JSON above), but I want to call a few things to your attention:

ScheduleExpression:
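Specifies running every hour at :03 minutes past. Because FlexibleTimeWindow is set to FLEXIBLE with a 5-minute maximum, the job may actually be submitted up to 5 minutes after that. (see cron expression syntax)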
Target: Input:
Identifies the job definition we want to run. Remember, the job definition has the ability to override the container's default CMD.
RoleArn:
Determines the role to inherit which provides the permissions and capabilities the job needs to run. We've been using our account's "ecsTaskExecutionRole" consistently throughout this exercise.
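
As a rough illustration of how the six fields of that cron expression line up (minutes, hours, day-of-month, month, day-of-week, year), the sketch below splits the expression into labeled parts. It also shows the two other expression forms EventBridge Scheduler accepts, rate(...) and at(...). The specific values used for those two are assumptions for illustration only and are not part of the "hello-aws" setup.

```shell
# The cron expression used in the schedule above
EXPR="cron(03 * * * ? *)"

# Strip the leading "cron(" and the trailing ")" to leave the six fields
FIELDS="${EXPR#"cron("}"
FIELDS="${FIELDS%")"}"

# Turn off filename globbing so "*" and "?" stay literal while splitting
set -f
set -- $FIELDS
set +f

CRON_SUMMARY="minutes=$1 hours=$2 day-of-month=$3 month=$4 day-of-week=$5 year=$6"
echo "$CRON_SUMMARY"

# Other expression forms (example values are assumptions, for illustration only)
RATE_EXPR="rate(1 hour)"            # repeat at a fixed interval
AT_EXPR="at(2030-01-15T09:00:00)"   # run once at a specific date and time
echo "$RATE_EXPR"
echo "$AT_EXPR"
```

Any of these strings could be dropped into the "ScheduleExpression" value of the JSON file before running create-schedule.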

Successfully creating the schedule should look similar to:

{
    "ScheduleArn": "arn:aws:scheduler:us-east-1:182999999954:schedule/default/hello-aws-jobdef-schedule"
}

You can manage this schedule in the AWS Console > Amazon EventBridge > Schedules > named similarly to "hello-aws-jobdef-schedule".
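
If you'd rather stay in the CLI, something like the following sketch should let you inspect what was stored. It follows the same "print the command, then run it yourself" convention used elsewhere in this post, and it assumes the schedule name and the "default" group from the example above.

```shell
PROJECT="hello-aws"
SCHED_NAME="${PROJECT}-jobdef-schedule"

# Show the stored definition of the one schedule in the "default" group
CMD1="aws scheduler get-schedule --name $SCHED_NAME --group-name default"

# List all schedules whose names start with the project name
CMD2="aws scheduler list-schedules --name-prefix $PROJECT"

printf '\nVerify they look correct, then exec these commands:\n'
printf '%s\n\n%s\n' "$CMD1" "$CMD2"
```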

AWS EventBridge > Schedule the job with overrides

We already showed how to register a job definition that specifies an override to the container's CMD in Part 2, "Deploy and Run". Extending that concept, if you want to schedule multiple jobs that specify different overrides to the "hello-aws" container's CMD, you'll need to register a job definition for each override, and then you can create a schedule for each job definition.

That is:

Create a new job definition with the desired CMD override
Schedule a job that references the new job definition
Repeat as needed.
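
Those steps lend themselves to a loop. The sketch below is one possible way to generate a schedule JSON file per job definition, reusing the settings from the schedule shown earlier. The two job definition names ("-alpha" and "-beta") are hypothetical and are assumed to be registered already, and the ROLE_ARN default is only a placeholder account number so the snippet runs on its own.

```shell
PROJECT="hello-aws"
QUEUE_NAME="$PROJECT-queue"

# Placeholder only: in practice, look this up with "aws iam get-role" as shown earlier
ROLE_ARN="${ROLE_ARN:-arn:aws:iam::111122223333:role/ecsTaskExecutionRole}"

# Hypothetical job definitions, one per CMD override, assumed to exist already
JOBDEFS="${PROJECT}-jobdef-alpha ${PROJECT}-jobdef-beta"

for JOBDEF_NAME in $JOBDEFS
do
  SCHED_NAME="${JOBDEF_NAME}-schedule"
  JSON_FILE="createSched-$SCHED_NAME.json"

  # One schedule definition per job definition
  cat << EOF > "$JSON_FILE"
{
    "Description": "executing ${JOBDEF_NAME}",
    "FlexibleTimeWindow": {
        "MaximumWindowInMinutes": 5,
        "Mode": "FLEXIBLE"
    },
    "Name": "$SCHED_NAME",
    "ScheduleExpression": "cron(03 * * * ? *)",
    "ScheduleExpressionTimezone": "America/New_York",
    "State": "ENABLED",
    "Target": {
        "Arn": "arn:aws:scheduler:::aws-sdk:batch:submitJob",
        "Input": "{\"JobDefinition\": \"${JOBDEF_NAME}\", \"JobName\": \"${JOBDEF_NAME}\", \"JobQueue\": \"${QUEUE_NAME}\"}",
        "RoleArn": "${ROLE_ARN}"
    }
}
EOF

  printf '\nVerify JSON file [%s], then exec this command:\n' "$JSON_FILE"
  printf 'aws scheduler create-schedule --cli-input-json file://%s\n' "$JSON_FILE"
done
```

Every schedule here fires at the same minute. In practice you would probably vary the "ScheduleExpression" per job definition.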

The challenge I see with this is if you have one program that can take a large number of different input parameters that you want to schedule. If you want to schedule 100 different invocations of the one program, then you must register 100 different job definitions for each schedule to reference. It's not a huge dilemma as we've already established an automated coding approach to do so, but it does seem excessive.

Sidebar:

How about creating multiple schedules that all reference the same job definition but where each one specifies a different CMD override for the container?

You can't. At least, I don't think you can. Well, not directly from the schedule itself anyway.

To be honest, this vexes me.

In the last post, we showed that you can interactively submit a job referencing a job definition and override the CMD parameter for the container (that is, overriding the job definition's CMD specification as well as the container's default CMD specification). That's exactly what I'd like to see the scheduler do here. Why can "aws batch submit-job" do it, but "aws scheduler create-schedule" cannot? Grrrrrrrr.

Just for fun, I gave it a shot anyway by providing a slightly different "Input:" line:

"Input": "{\n  \"JobDefinition\": \"${PROJECT}\",\n  \"JobName\": \"${JOBDEF_NAME}\",\n  \"JobQueue\": \"${QUEUE_NAME}\",\n  \"containerOverrides\": [\n    {\n      \"command\": [\"echo\",\"hello, scheduled job in EventBridge\"]\n    }\n  ]\n}\n",

Note that I've specified an override CMD that should cause the job to echo the string, "hello, scheduled job in EventBridge". Unfortunately, when I try to create that schedule, the AWS CLI complains:

An error occurred (ValidationException) when calling the CreateSchedule 
operation: Invalid RequestJson provided. Reason The field 'containerOverrides' 
is not supported by api 'submitJob' for the service 'aws-sdk:batch'.

Referring back to the "aws-sdk:batch:submitJob" API reference documentation, it certainly looks like "containerOverrides" should be allowed. :-\

Ah well, maybe I'm just missing something... please drop me a line if you figured it out.
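
One untested possibility, offered only as a guess: the working schedule earlier in this post spells its Input keys in PascalCase ("JobDefinition", "JobName", "JobQueue"), and the AWS Batch SubmitJob API reference describes containerOverrides as a single object rather than a list. The sketch below builds an alternative "Input:" line on those two assumptions, with "ContainerOverrides" and "Command" in PascalCase. Whether EventBridge Scheduler accepts this is not confirmed here. It also assumes "JobDefinition" should point at the job definition name, as in the working example.

```shell
PROJECT="hello-aws"
QUEUE_NAME="$PROJECT-queue"
JOBDEF_NAME="${PROJECT}-jobdef"

# ASSUMPTION (untested): PascalCase keys, and ContainerOverrides as an object, not a list
INPUT_JSON=$(printf '{"JobDefinition": "%s", "JobName": "%s", "JobQueue": "%s", "ContainerOverrides": {"Command": ["echo", "hello, scheduled job in EventBridge"]}}' \
  "$JOBDEF_NAME" "$JOBDEF_NAME" "$QUEUE_NAME")

# Escape the double quotes so the JSON can be embedded as the string value of "Input"
ESCAPED=$(printf '%s' "$INPUT_JSON" | sed 's/"/\\"/g')
INPUT_LINE="\"Input\": \"${ESCAPED}\","

printf '%s\n' "$INPUT_LINE"
```

If you try this, swap the printed line into the schedule JSON file in place of the original "Input:" line and see whether create-schedule still raises the ValidationException.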

Monitoring jobs in AWS

Wow - there are so many options here depending on what you're doing, the services involved, what you're willing to pay for, and so on. I can't even really scratch the surface, but let's do a few things that will work for the "hello-aws" project and that I hope will make sense such that you can extend them to your own projects in the future.

In Part 2, "Deploy and Run", we already showed that you can access the log of your AWS Batch job directly using the CLI if you know the Job ID. And, of course, you can use the point-and-click UI as well: AWS Console > Batch > Jobs > select your queue > select the job instance. You'll see its final status and will also see a link to its log stream for viewing the stdout captured from its run.

CloudTrail is another useful service offered by AWS. It offers pretty granular tracking of your account's event history so that you can ascertain who's doing what and when. It tracks activity at a lower level than "job" though, breaking that up into events like "submitJob", "runTask", "createLogStream", etc. This is really helpful in situations when - as this author can attest - you've misnamed the job definition for your scheduled job such that it can't run and so nothing shows up in either the AWS Batch jobs or the CloudWatch log stream. Being able to peek inside the "submitJob" item to see the error about "job definition not found" is really helpful.
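
For a quick look at those events from the CLI, a command along these lines may help. This is only a sketch. It assumes the event name is recorded as "SubmitJob" in your account's event history and that the five most recent events are enough to find the failure.

```shell
# Look up the most recent SubmitJob events recorded in the CloudTrail event history
CMD="aws cloudtrail lookup-events --lookup-attributes AttributeKey=EventName,AttributeValue=SubmitJob --max-results 5 --query 'Events[].CloudTrailEvent' --output text"

printf '\nVerify it looks correct, then exec this command:\n'
printf '%s\n' "$CMD"
```

Each returned CloudTrailEvent is a JSON string, so failed calls should show up with error fields you can read or search through.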

I could go on and on and on looking at observability and monitoring services offered by AWS. But for now, let's set up a simple notification to alert us if there's a problem when the job runs.

Use CloudWatch and SNS to set up job alerts

Amazon CloudWatch offers a selection of tools to monitor AWS resources and the applications you run on AWS "in real time". You can use CloudWatch to collect and track metrics, which you define as measures for your resources and applications. As we've seen with the "hello-aws" project, it also provides a repository for logs from your jobs run in AWS Batch (or elsewhere).

Amazon Simple Notification Service (SNS) is a managed publish and subscribe service. The idea is that a publisher (likely associated with your job) can communicate asynchronously with subscribers (or consumers) by sending messages to a topic. For the "hello-aws" project, we'll leverage this to get email alerts about what's happening when the code runs.

Like everything else we've seen, getting this set up requires a few steps:

> SNS: Create a topic

In Amazon SNS, a topic is a communication channel used as a publishing destination and subscription target. We'll create one for the "hello-aws" project to refer to:

PROJECT="hello-aws"

aws sns create-topic --name ${PROJECT}-topic

And in return, you're provided with the "TopicArn" for later reference, similar to:

{
    "TopicArn": "arn:aws:sns:us-east-1:182999999954:hello-aws-topic"
}

> SNS: Subscribe to the topic to get emails

In this scenario, we want an email notification whenever something is published to the "hello-aws" project's topic in SNS. So, we will set up a subscription to the topic that sends an email to the address we specify.

Edit the Endpoint in the JSON below to specify your own email address!
Else you won't get any messages at all.

AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
PROJECT="hello-aws"
JSON_FILE="subscribe-to-${PROJECT}-topic.json"

cat << EOF > $JSON_FILE
{
    "TopicArn": "arn:aws:sns:us-east-1:${AWS_ACCOUNT_ID}:${PROJECT}-topic",
    "Protocol": "email",
    "Endpoint": "your.email@example.com",
    "ReturnSubscriptionArn": true
}
EOF
 
echo -e "\nVerify JSON file [$JSON_FILE], then exec this command:"
echo -e "aws sns subscribe --cli-input-json file://$JSON_FILE\n"

In return, you should see confirmation with the Arn of the new subscription similar to:

{
    "SubscriptionArn": "arn:aws:sns:us-east-1:182999999954:hello-aws-topic:e098b1cf-7922-492b-ba7b-7289c0490530"
}

You will also receive an email at the address you specified which allows you to opt-in to receive the messages:

And when you click the link in the email to accept the subscription, your web browser should show the confirmation:

And you're all set with SNS. Now let's get some content generated that will post messages to that topic so you'll get email about it.
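
Before wiring anything else to the topic, you might want to confirm that email delivery works on its own. The sketch below prints two commands: one to list the topic's subscriptions and one to publish a manual test message. The subject and message text are made up for this example, and the account ID default is only a placeholder so the snippet runs on its own.

```shell
PROJECT="hello-aws"
REGION="us-east-1"

# Placeholder only: in practice, use "aws sts get-caller-identity" as shown above
AWS_ACCOUNT_ID="${AWS_ACCOUNT_ID:-111122223333}"
TOPIC_ARN="arn:aws:sns:${REGION}:${AWS_ACCOUNT_ID}:${PROJECT}-topic"

# Shows each subscriber's endpoint and subscription ARN for the topic
CMD1="aws sns list-subscriptions-by-topic --topic-arn $TOPIC_ARN --query 'Subscriptions[].[Endpoint,SubscriptionArn]' --output text"

# Publishes a one-off test message to the topic
CMD2="aws sns publish --topic-arn $TOPIC_ARN --subject 'Test from $PROJECT' --message 'Manual test message'"

printf '\nVerify they look correct, then exec these commands:\n'
printf '%s\n\n%s\n' "$CMD1" "$CMD2"
```

A subscription that hasn't been confirmed yet typically shows "PendingConfirmation" in place of a real ARN in that listing, in which case the test message won't be delivered.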

> Cloudwatch: Create a metric filter

As you'll recall, the "hello-aws" project is really very simple. And in Part 2, "Deploy and Run", we created a job definition that basically echoes out, "hello, job definition in AWS Batch". But for a more complex project, the log might return all kinds of interesting informational, warning, and/or error messages. So, let's set up a metric filter in CloudWatch that will trigger when a specific keyword or string is found.

PROJECT="hello-aws"
LOGGROUPNAME="/aws/batch/job"
JSON_FILE="put-metric-filter-${PROJECT}.json"

cat << EOF > $JSON_FILE
{
    "logGroupName": "$LOGGROUPNAME",
    "filterName": "${PROJECT}-filter",
    "filterPattern": "\"hello, job\"",
    "metricTransformations": [
        {
            "metricName": "Counting_hello_job",
            "metricNamespace": "$PROJECT",
            "metricValue": "1",
            "defaultValue": 0.0,
            "unit": "Count"
        }
    ]
}
EOF
 
echo -e "\nVerify JSON file [$JSON_FILE], then exec this command:"
echo -e "aws logs put-metric-filter --cli-input-json file://$JSON_FILE\n"
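
Before creating the filter, it can help to sanity-check the pattern against a few sample lines. The sketch below approximates the quoted-phrase match locally with grep, which is only a stand-in for how CloudWatch evaluates filter patterns, and then prints an "aws logs test-metric-filter" command for checking the same pattern against AWS itself. The second and third sample lines are invented for this example.

```shell
# Sample log lines; only the first comes from the real job, the others are made up
SAMPLE_LOG="hello, job definition in AWS Batch
some other log message
hello, job number two"

# Count lines containing the exact phrase (a local approximation of the filter)
MATCHES=$(printf '%s\n' "$SAMPLE_LOG" | grep -cF 'hello, job')
echo "matching log events: $MATCHES"

# The same check using the CloudWatch Logs CLI
CMD="aws logs test-metric-filter --filter-pattern '\"hello, job\"' --log-event-messages 'hello, job definition in AWS Batch' 'some other log message'"
printf '\nVerify it looks correct, then exec this command:\n'
printf '%s\n' "$CMD"
```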

In particular, this metric filter will look for the complete string "hello, job" and count each instance it sees in logs that are placed in the "/aws/batch/job" log group.

> CloudWatch: Create an alarm

A CloudWatch alarm is basically an event that's created/triggered based on the metric it's associated with. In this simple example, we want to know when jobs submitted using the "hello-aws" job definition in AWS Batch write the string, "hello, job" to their log.

Notice that this isn't necessarily a real problem, per se. The "hello-aws" code produces that message by design. In other words, it's OK that "hello, job" is in the log - we expect it to be there. And we want to be notified about it without setting off a lot of alarm bells.

But that said, it helps to understand that when it comes to defining this stuff to AWS, the word "alarm" has two meanings:

a·larm
/əˈlärm/

noun

the resource we're defining for tracking events in our AWS resources.
"We'll use the concept of an alarm to capture any events of interest."

an alert for when a problem that needs attention has been found.
"Raise the alarm! We need to fix something!"

Let's create an alarm in CloudWatch for the normal condition of finding the expected output:

AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
PROJECT="hello-aws"
JSON_FILE="put-metric-alarm-${PROJECT}.json"

cat << EOF > $JSON_FILE
{
    "AlarmName": "${PROJECT}-alarm",
    "AlarmDescription": "Alarm for occurrence of hello job",
    "ActionsEnabled": true,
    "OKActions": [
        "arn:aws:sns:us-east-1:${AWS_ACCOUNT_ID}:${PROJECT}-topic"
    ],
    "AlarmActions": [],
    "InsufficientDataActions": [],
    "MetricName": "Counting_hello_job",
    "Namespace": "$PROJECT",
    "Statistic": "Sum",
    "Period": 300,
    "EvaluationPeriods": 1,
    "DatapointsToAlarm": 1,
    "Threshold": 1.0,
    "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    "TreatMissingData": "missing",
    "Tags": [
        {
            "Key": "owner",
            "Value": "your.email@example.com"
        }
    ]
}
EOF
 
echo -e "\nVerify JSON file [$JSON_FILE], then exec this command:"
echo -e "aws cloudwatch put-metric-alarm --cli-input-json file://$JSON_FILE\n"

Because this is an expected event - and not a problem - we've attached the SNS topic to OKActions, so the notification goes out when the alarm transitions into the OK state. Alternatively, if we were triggering off of "ERROR" messages in the log, we might attach the topic to AlarmActions instead, which fire on the transition into the ALARM state. A third state, INSUFFICIENT_DATA, covers the case where there's not enough data in the metric - that's when you'd use InsufficientDataActions.

We've also established that this alarm will monitor the "Counting_hello_job" metric for at least 1 matching event within a 5-minute period (Period = 300 seconds). Recall that we set up the schedule above to run "hello-aws" every hour at :03 minutes past.
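
If you don't want to wait for the next scheduled run to see whether the notification path works, one option is to check and then temporarily force the alarm's state from the CLI. This is a sketch that assumes the alarm name from the JSON above. The 'manual test' reason text is arbitrary.

```shell
PROJECT="hello-aws"
ALARM_NAME="${PROJECT}-alarm"

# Show the alarm's current state and the reason CloudWatch recorded for it
CMD1="aws cloudwatch describe-alarms --alarm-names $ALARM_NAME --query 'MetricAlarms[0].[StateValue,StateReason]' --output text"

# Force the alarm into ALARM, then back to OK, so the OK transition fires OKActions
CMD2="aws cloudwatch set-alarm-state --alarm-name $ALARM_NAME --state-value ALARM --state-reason 'manual test'"
CMD3="aws cloudwatch set-alarm-state --alarm-name $ALARM_NAME --state-value OK --state-reason 'manual test'"

printf '\nVerify they look correct, then exec these commands:\n'
printf '%s\n\n%s\n\n%s\n' "$CMD1" "$CMD2" "$CMD3"
```

A forced state is temporary. CloudWatch keeps evaluating the metric and will move the alarm back to whatever state the data supports.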

Sit back and enjoy

The next time the hello-aws job definition is referenced to run a job and generates the "hello, job definition in AWS Batch" message, the hello-aws-filter metric filter we defined will match on the "hello, job" string in the log and increment its count. That pushes the hello-aws-alarm we set up into the ALARM state. Once the count drops back below the threshold in a later period, the alarm returns to OK, and that transition is what pushes an event to the hello-aws-topic in Amazon SNS, where the subscription for your email notification will deliver a message similar to:

You are receiving this email because your Amazon CloudWatch Alarm "hello-aws-alarm" in the US East (N. Virginia) region has entered the OK state, because "Threshold Crossed: 1 out of the last 1 datapoints [0.0 (05/03/24 17:51:00)] was not greater than or equal to the threshold (1.0) (minimum 1 datapoint for ALARM -> OK transition)." at "Tuesday 05 March, 2024 18:21:54 UTC".
 
View this alarm in the AWS Management Console:
https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/hello-aws-alarm
 
Alarm Details:
- Name:                       hello-aws-alarm
- Description:                Alarm for occurrence of hello job
- State Change:               ALARM -> OK
- Reason for State Change:    Threshold Crossed: 1 out of the last 1 datapoints [0.0 (05/03/24 17:51:00)] was not greater than or equal to the threshold (1.0) (minimum 1 datapoint for ALARM -> OK transition).
- Timestamp:                  Tuesday 05 March, 2024 18:21:54 UTC
- AWS Account:                182999999954
- Alarm Arn:                  arn:aws:cloudwatch:us-east-1:182999999954:alarm:hello-aws-alarm
 
Threshold:
- The alarm is in the ALARM state when the metric is GreaterThanOrEqualToThreshold 1.0 for at least 1 of the last 1 period(s) of 1800 seconds.
 
Monitored Metric:
- MetricNamespace:                     hello-aws
- MetricName:                          Counting_hello_job
- Dimensions:
- Period:                              300 seconds
- Statistic:                           Sum
- Unit:                                not specified
- TreatMissingData:                    missing
 
 
State Change Actions:
- OK: [arn:aws:sns:us-east-1:182999999954:hello-aws-topic]
- ALARM:
- INSUFFICIENT_DATA:

In this very simple example, we set up notifications on the minimum occurrence of a specific string in the log. Looking at the message generated above, notice that state change ALARM -> OK is what caused AWS to send an event to the hello-aws-topic in SNS. If we want notifications for when this falls into an ALARM state (or INSUFFICIENT_DATA state), then we can specify the same topic or provide a different one.

Tear it all down

If you've gotten this far and experimented with all the examples, remember it's a good idea to delete everything when you're done if you're no longer using it. This is especially true if you're playing in a shared sandbox provided by someone else's subscription in AWS. Not only does this save you money, but it also reduces the administrative overhead of the folks responsible for keeping those sandboxes operational (your IT team).

The following script will generate the commands you can use to delete everything. The idea is that you'll copy-and-paste section-by-section so that you can watch what happens in the environment.

# Common vars
PROJECT="hello-aws"
AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
REGION="us-east-1"
LOGGROUPNAME="/aws/batch/job"

# Colors
RED='\033[0;31m'
PLAIN='\033[0m' 


# DELETE CLOUDWATCH ALARM
# -----------------------
# Watch from: AWS Console > CloudWatch > All Alarms
#
CMD="aws cloudwatch delete-alarms --alarm-names ${PROJECT}-alarm"
echo -e "\nVerify it looks correct, then exec this command:"
echo -e "${RED}${CMD}${PLAIN}\n"


# DELETE CLOUDWATCH METRIC-FILTER
# -------------------------------
# Watch from AWS Console > CloudWatch > Log groups > /aws/batch/job > Metric filters
#
CMD="aws logs delete-metric-filter --log-group-name $LOGGROUPNAME --filter-name ${PROJECT}-filter"
echo -e "\nVerify it looks correct, then exec this command:"
echo -e "${RED}${CMD}${PLAIN}\n"


# DELETE SNS SUBSCRIPTION
# -----------------------
# Watch from AWS Console > Simple Notification Service > Subscriptions
#
SUB_ARN=$(aws sns list-subscriptions-by-topic --topic-arn arn:aws:sns:${REGION}:${AWS_ACCOUNT_ID}:${PROJECT}-topic  --query 'Subscriptions[0].SubscriptionArn' --output text)
CMD="aws sns unsubscribe --subscription-arn $SUB_ARN"
echo -e "\nVerify it looks correct, then exec this command:"
echo -e "${RED}${CMD}${PLAIN}\n"


# DELETE SNS TOPIC
# ----------------
# Watch from AWS Console > Simple Notification Service > Topics
#
CMD="aws sns delete-topic --topic-arn arn:aws:sns:${REGION}:${AWS_ACCOUNT_ID}:${PROJECT}-topic"
echo -e "\nVerify it looks correct, then exec this command:"
echo -e "${RED}${CMD}${PLAIN}\n"


# DELETE ALL HELLO-AWS LOGS
# -------------------------
# Watch from AWS Console > CloudWatch > Log groups > /aws/batch/job > filter="hello-aws"
#
HELLO_LOGS=( $(aws logs describe-log-streams --log-group-name $LOGGROUPNAME  --log-stream-name-prefix $PROJECT --query 'logStreams[].logStreamName' --output json | grep -v "\[\|\]"| awk -F"\"" '{ print $2 }') )
echo -e "\nVerify command(s) looks correct, then exec:"
for log in "${HELLO_LOGS[@]}"
do
  CMD="aws logs delete-log-stream --log-group-name $LOGGROUPNAME --log-stream-name $log"
  echo -e "${RED}${CMD}${PLAIN}\n"
done

# DELETE EVENTBRIDGE SCHEDULE
# ---------------------------
# Watch from AWS Console > EventBridge > Schedules
#
JOBDEF_NAME="${PROJECT}-jobdef"
SCHED_NAME="${JOBDEF_NAME}-schedule"
CMD="aws scheduler delete-schedule --name $SCHED_NAME"
echo -e "\nVerify it looks correct, then exec this command:"
echo -e "${RED}${CMD}${PLAIN}\n"


# DEREGISTER BATCH JOB DEFINITION
# -------------------------------
# Watch from AWS Console > Batch > Job definitions
#
LATEST_REV=$(aws batch describe-job-definitions --status ACTIVE --job-definition-name $JOBDEF_NAME --query 'jobDefinitions[].revision' --output text | awk ' { print $1 } ')
CMD="aws batch deregister-job-definition --job-definition $JOBDEF_NAME:$LATEST_REV"
echo -e "\nVerify it looks correct, then exec this command:"
echo -e "${RED}${CMD}${PLAIN}\n"
# repeat this block as needed for any older, active revisions


# DISABLE, then DELETE BATCH JOB QUEUE
# ------------------------------------
# Watch from AWS Console > Batch > Job queues
#
QUEUE_NAME="$PROJECT-queue"
CMD1="aws batch update-job-queue --state disabled --job-queue $QUEUE_NAME"
CMD2="aws batch delete-job-queue --job-queue $QUEUE_NAME"
echo -e "\nVerify they look correct, then exec these commands:"
echo -e "${RED}${CMD1}\n\n${CMD2}${PLAIN}\n"


# DISABLE, then DELETE BATCH COMPUTE ENVIRONMENT (FARGATE)
# --------------------------------------------------------
# Watch from AWS Console > Batch > Compute environments
#
CE_NAME="$PROJECT-fargate"
CMD1="aws batch update-compute-environment --state disabled --compute-environment $CE_NAME"
CMD2="aws batch delete-compute-environment --compute-environment $CE_NAME"
echo -e "\nVerify they look correct, then exec these commands:"
echo -e "${RED}${CMD1}\n\n${CMD2}${PLAIN}\n"


# DELETE IMAGES and REPOSITORY 
# ----------------------------
# Watch from AWS Console > Elastic Container Registry > Private Repositories > filter="hello-aws"
#
DIGESTLIST=""   # start empty so re-running this section doesn't accumulate digests
IMAGELIST=( $(aws ecr list-images --repository-name $PROJECT --query 'imageIds[].imageDigest' --output text) )
for img in "${IMAGELIST[@]}"
do
  DIGESTLIST="$DIGESTLIST imageDigest=$img"
done

CMD1="aws ecr batch-delete-image --repository-name $PROJECT --image-ids $DIGESTLIST"
CMD2="aws ecr delete-repository --repository-name $PROJECT  --region $REGION"
echo -e "\nVerify they look correct, then exec these commands:"
echo -e "${RED}${CMD1}\n\n${CMD2}${PLAIN}\n"

Coda

I'll go ahead and wrap up here. Hopefully, this series of posts has been helpful to get you started with setting up serverless processes in AWS where you can schedule them to run regularly, monitor their operations, and receive notifications of notable events. While we couldn't cover most of the rich functionality provided by AWS across its numerous services, the examples provided should get you a lot closer to figuring out how to make things happen.

To that end, this series has focused heavily on using the AWS CLI utility along with JSON input files to provide a scriptable approach in support of DevOps procedures. You're also invited to navigate to the AWS Console in your web browser where these same actions can be performed using a point-and-click approach. I found it helpful to see how the web pages were organized (and documented) to make sense of the JSON parameters being fed into the AWS CLI. Many of the "wizard" pages will also show you the underlying JSON they submit which is also useful.

Find more articles from SAS Global Enablement and Learning here.

SAS Communities Library