Interacting with a cloud provider is something you can’t get away from these days, whether that’s Microsoft Azure, Amazon Web Services (AWS), Google Cloud (GCP), or any other provider. We all have a preference for which one we like best, some with more validity than others.
Recently, a colleague and I gave a presentation about AWS, more specifically on why it’s essential for a Data/Analytics Engineer to feel comfortable using cloud storage and serverless functions, in this case AWS S3 buckets and AWS Lambda. After the presentation, we challenged the participants to perform a relatively simple task: retrieve data from the Joke API and place the raw file into an S3 bucket using AWS Lambda. Then, have a second Lambda function trigger whenever something is PUT into that S3 bucket. This second Lambda skims through the retrieved file, splits the safe jokes from the unsafe ones, and places the safe ones into a folder called “safe” and the unsafe ones into another folder called “unsafe”. Relatively straightforward, but it provided some challenges along the way, like an event that was wrongly set up and ended up being exponentially triggered overnight. That resulted in me waving the Free Tier goodbye and having to pay. Let’s get into how you can learn from this mistake and avoid a surprise bill of your own.
Getting started in the cloud
As a data/analytics engineer, you luckily don’t have to know the cloud by heart or set up everything yourself. If you do, I feel for you. If you don’t, we can put our focus on the resources that we will most likely use. This blog will focus primarily on the AWS stack.
The resources that we will be going over are AWS Lambda, S3, CloudWatch, Secrets Manager, SQS, and lastly, an IaC tool.
In this blog, we’re assuming you’re using a separate Data Warehouse solution like Snowflake and a dedicated ETL tool. We will therefore not delve into the data resources AWS has to offer (DynamoDB, RDS, etc.).
Automation between environments
The cloud is primarily UI-based, which is what makes it great in some cases. But as developers, we like to have things dynamic and automated between our development, staging, and production environments. There are a couple of tools that can help with this: Terraform is a big one, as is AWS’s own CDK. Since last autumn, I’ve had the opportunity to gain familiarity and experience with the latter while helping one of our clients.
Starting with AWS CDK, we first have to make sure it is installed on our local machine. A great starting point is to follow AWS’s own guide on how to install the CDK and build your first CDK app. If your company is already fairly mature on AWS, it might have an internal package that makes developing even easier. Make sure you check this beforehand, because it can remove a lot of overhead in the long run.
What AWS CDK essentially does is create the resources and configuration you need. From here we set up our Lambda event triggers, create our S3 buckets and IAM policies, and perhaps even some logging along the way. Make sure your resources are environment-aware by including account information.
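As a rough sketch of what that looks like, here is a minimal CDK stack in Python that creates a bucket and a Lambda and wires the PUT trigger between them. The names (JokesStack, FilterJokesFn, the env_name parameter, and the lambda/filter_jokes.py handler) are assumptions for illustration, not the exact setup from the presentation.

```python
from aws_cdk import (
    Stack,
    aws_lambda as _lambda,
    aws_s3 as s3,
    aws_s3_notifications as s3n,
)
from constructs import Construct


class JokesStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, *, env_name: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Environment-aware bucket name, so dev/staging/prod never clash
        bucket = s3.Bucket(
            self, "JokesBucket",
            bucket_name=f"jokes-raw-{env_name}-{self.account}",
        )

        # Lambda that splits safe/unsafe jokes; handler code assumed to live in ./lambda
        filter_fn = _lambda.Function(
            self, "FilterJokesFn",
            runtime=_lambda.Runtime.PYTHON_3_11,
            handler="filter_jokes.handler",
            code=_lambda.Code.from_asset("lambda"),
            environment={"ENV_NAME": env_name},
        )

        # Trigger the function only on objects PUT under the raw/ prefix,
        # so writing results back to safe/ and unsafe/ can't re-trigger it
        bucket.add_event_notification(
            s3.EventType.OBJECT_CREATED_PUT,
            s3n.LambdaDestination(filter_fn),
            s3.NotificationKeyFilter(prefix="raw/"),
        )
        bucket.grant_read_write(filter_fn)
```

Scoping the notification to a prefix like this is also the simplest guard against the exponential-trigger loop from the intro: a function that writes back into the bucket that triggers it will otherwise keep invoking itself.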
Save, save, save
We love hoarding data these days; the more, the merrier. Since storage has become so cheap these last few years, we have lost track of what we’re actually saving and doing with the data. Now, I’m not diving into how we can improve that; I will be telling you how we can save even more data! And that is cloud storage. No more dusty, messy office basements full of slow, old hardware held together by a random power cord hanging from the ceiling. Now we have an easily scalable storage solution in the cloud, S3 in AWS’s case.
S3 is very easy to set up and work with, but offers a lot of powerful features under the hood, like versioning, intelligent tiering, lifecycle policies, and more. All that data we’re hoarding and not using but have to keep? We simply create a lifecycle policy that archives it into the storage class we want. Now we have more budget to delay the fix of the infinite loop!
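As an illustration, staying in the CDK sketch from earlier (so this snippet belongs inside a stack’s __init__), a lifecycle rule that moves objects to Glacier after 90 days and expires them after a year could look like this. The durations and storage class are assumptions; pick whatever fits your retention needs.

```python
from aws_cdk import Duration, aws_s3 as s3

# Versioned archive bucket with a lifecycle rule:
# objects move to Glacier after 90 days and are deleted after a year
archive_bucket = s3.Bucket(
    self, "ArchiveBucket",
    versioned=True,
    lifecycle_rules=[
        s3.LifecycleRule(
            transitions=[
                s3.Transition(
                    storage_class=s3.StorageClass.GLACIER,
                    transition_after=Duration.days(90),
                ),
            ],
            expiration=Duration.days(365),
        ),
    ],
)
```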
What’s also great about S3, like a lot of other cloud storage solutions, is its broad compatibility with different tools, like your Data Warehouse or your ETL tool. This makes interacting with that data so much easier.
λ
Lambda is amazing for programming and communicating with all your resources inside AWS on an event basis, but also for regular functions. It keeps everything central: there’s no overhead from having to open a line up to the cloud provider, you interact with your resources directly and securely. It’s key that you know how to read the event input and make the most of it; that way you can make things more dynamic and reduce hard-coded variables in your code. You can use AWS’s pre-made events, for example object creations on an S3 bucket, or you can use AWS SQS, where you can customize the body however you please and include as much information as you’d like. A lot of the time, these pre-made events are good enough. But when they won’t suffice, you can use different triggers, like EventBridge with custom events, directly invoking the Lambda from outside your AWS environment if the trigger lives outside AWS, or even a 3rd-party webhook from Shopify or Atlassian.
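To make “reading the event input” concrete, here is a minimal sketch of the second Lambda from the challenge. It assumes the raw file looks roughly like the Joke API’s multi-joke response (a jokes list where each joke carries a safe flag); the field names and the raw/ prefix are assumptions, not a verbatim copy of our solution.

```python
import json

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    # An S3 "object created" event carries one or more records;
    # each record tells us exactly which bucket and key triggered us.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Read the raw jokes file that was just PUT into the bucket
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        jokes = json.loads(body).get("jokes", [])

        # Split on the "safe" flag and write each group to its own folder.
        # Note: the trigger should be filtered to the raw/ prefix, otherwise
        # these writes re-trigger the function and you get the overnight loop.
        filename = key.rsplit("/", 1)[-1]
        safe = [j for j in jokes if j.get("safe")]
        unsafe = [j for j in jokes if not j.get("safe")]

        s3.put_object(Bucket=bucket, Key=f"safe/{filename}", Body=json.dumps(safe))
        s3.put_object(Bucket=bucket, Key=f"unsafe/{filename}", Body=json.dumps(unsafe))
```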
Shhhh
We might need an API key or some credentials for an outside application in Lambda. Definitely don’t hard-code these or add them manually as environment variables via the UI; make use of AWS Secrets Manager instead. You can use your AWS CDK stack to pass the secret’s name down as an environment variable into your Lambda function. This makes it dynamic across environments and safer.
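Inside the function, that could look something like the sketch below. The environment variable name JOKE_API_SECRET_NAME and the api_key field inside the secret are assumptions; the point is that only the secret’s name travels through CDK, never its value.

```python
import json
import os

import boto3

# The secret's name is injected by the CDK stack as an environment variable,
# so the same code works unchanged in dev, staging, and prod.
SECRET_NAME = os.environ["JOKE_API_SECRET_NAME"]

secrets_client = boto3.client("secretsmanager")


def get_api_key() -> str:
    # Fetch the actual value at runtime; it never lives in code or in the UI
    response = secrets_client.get_secret_value(SecretId=SECRET_NAME)
    return json.loads(response["SecretString"])["api_key"]
```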
Logs don’t lie
Logs are not to be forgotten; they tell the story of how well your program or application is running. Within AWS, if set up correctly from the start, all your Lambdas should have CloudWatch Logs. At their core, these logs let us know whenever our function has been triggered, but they can provide so much more if we use them right. As beginner developers, we learn to print everything we do or create. We can apply the same ritual to our Lambda function! Print every essential variable or outcome you want, but make it clear! If set up consistently across functions, you could even build a Lambda monitoring function that scans the logs for keywords and notifies your Teams channel or other communication tool whenever something essential goes wrong.
Of course, the logging library is also available, and it feeds into our CloudWatch logs just as well. It comes down to whatever your team is using, or your own preference.
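A minimal sketch of what that can look like in Python, relying on the fact that the Lambda runtime ships logger output straight to CloudWatch Logs:

```python
import logging

# Lambda's runtime forwards logger output to CloudWatch Logs automatically,
# so a module-level logger is all the setup we need.
logger = logging.getLogger()
logger.setLevel(logging.INFO)


def handler(event, context):
    logger.info("Received event with %d record(s)", len(event.get("Records", [])))
    try:
        # ... the actual work happens here ...
        logger.info("Processing finished successfully")
    except Exception:
        # logger.exception adds the stack trace; a clear keyword like this
        # is easy for a monitoring function or metric filter to alert on.
        logger.exception("Processing failed")
        raise
```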
In conclusion, diving into AWS as a Data/Analytics Engineer can feel like a steep learning curve, but with the right tools and knowledge, it’s an incredibly powerful platform. Whether you’re working with Lambda, S3, or CloudWatch, the key is to stay flexible, understand the events you’re triggering, and always keep an eye on your costs. Mistakes happen, but as long as we learn from them, they’re just stepping stones to becoming more efficient and effective in the cloud. So, take these lessons, avoid the pitfalls I encountered, and start building your cloud infrastructure with confidence!
Get in contact and let me know your biggest fail in the AWS cloud!