New AWS Certification: Data Engineer — Associate

Today (Monday 27th of November) is the first day of a new AWS Certification: the AWS Certified Data Engineer — Associate. This is in a beta period now, and as such, any candidates who sit this certification won’t get a result until up to 13 weeks after the end of the beta period.

That beta period runs November 27, 2023 – January 12, 2024; so sometime in February or March candidates will find out how they went. During that period, AWS will be assessing where the pass mark should be.

I’ve been pretty forthright in attempting most AWS certifications; I’ve always liked to lead from the front, to help demonstrate that even an aging open-source tech geek sysadmin like myself can do this. And to that end, I reserved a quiet meeting room at 0200 UTC today (10am AWST) to do the online-proctored exam for this certification.

As always there are terms & conditions on disclosure, so I can only speak at a high level. It was 85 questions, the vast majority of which were to select one correct answer (key) from four possibilities; only a handful had the “select any TWO” option.

The questions I received focused on Glue, Redshift, Athena, Kinesis, and, in passing, S3 and IAM. I say “I received“, as there is a pool of questions, and I would have only had 85 from a larger pool; your assessment would likely contain different questions.

It took me around one hour fifteen minutes, and I went back to review just 2 questions.

Overall I found this was perhaps a little more detailed and domain-based than the existing Associate-level certifications. There were use cases for Redshift that I’ve not used that stumped me. There were Glue and DataBrew use cases I haven’t used in production.

All in all, I think it’s well placed to ensure that candidates have a solid understanding of data engineering, fault tolerance, cost, and durability of data. For those doing cloud-native data analytics pipelines, I would say this should be on your list of certs to get.

We’ll find out in Q1 if I am up to speed on this. 😉

AWS CodeBuild: Lambda Support

A few days ago, AWS announced Lambda support for their CodeBuild service.

CodeBuild sits amongst a slew of Code* services, which developers can pick and choose from to get their jobs done. I’m going to concentrate on just three of them:

  • CodeCommit: a managed Git repository
  • CodeBuild: a service to launch compute to execute software compile/build/test jobs
  • CodePipeline: a CI/CD pipeline that helps orchestrate the pattern of build and release actions across different environments

My common use case is for publishing (mostly) static web sites. Being a web developer since the early 1990s, I’m pretty comfortable with importing some web frameworks, writing some HTML and additional CSS, grabbing some images, and then publishing the content.

That content is typically deployed to an S3 Bucket, with CloudFront sitting in front of it, and Route53 doing its duty for DNS resolution… times two or three environments (dev, test, prod).
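For context, each environment’s hosting footprint is pretty small. Here’s a minimal CloudFormation sketch of that pattern – an illustration only, with the hostname, certificate and hosted zone being placeholder values rather than my real setup:

SiteBucket:
  Type: AWS::S3::Bucket

SiteDistribution:
  Type: AWS::CloudFront::Distribution
  Properties:
    DistributionConfig:
      Enabled: true
      Aliases:
        - cv.example.com                              # placeholder hostname
      DefaultRootObject: index.html
      Origins:
        - Id: s3origin
          DomainName: !GetAtt SiteBucket.RegionalDomainName
          S3OriginConfig: {}                          # add an OAC/OAI if the bucket is private
      DefaultCacheBehavior:
        TargetOriginId: s3origin
        ViewerProtocolPolicy: redirect-to-https
        CachePolicyId: 658327ea-f89d-4fab-a63d-7e88639e58f6   # managed "CachingOptimized" policy
      ViewerCertificate:
        AcmCertificateArn: !Ref SiteCertificate       # assumed ACM certificate (us-east-1)
        SslSupportMethod: sni-only
        MinimumProtocolVersion: TLSv1.2_2021

SiteDnsRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    HostedZoneName: example.com.
    Name: cv.example.com.
    Type: A
    AliasTarget:
      DNSName: !GetAtt SiteDistribution.DomainName
      HostedZoneId: Z2FDTNDATAQYW2                    # CloudFront's fixed hosted zone ID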

CodePipeline can be automatically kicked off when a developer pushes a commit to the repo. For several years I have used the native CodePipeline S3 deployer to ship this artifact, but there have always been a few niggles.

As a developer, I also like having some pre-commit hooks: I like to ensure my HTML is reasonable, that I haven’t put any credentials in my code, and so on. For this I use the pre-commit framework.

Here’s my “.pre-commit-config.yaml” file, that sits in the base of my content repo:

repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: mixed-line-ending
      - id: trailing-whitespace
      - id: detect-aws-credentials
      - id: detect-private-key
  - repo: https://github.com/Lucas-C/pre-commit-hooks-nodejs
    rev: v1.1.2
    hooks:
      - id: htmllint

There are a few more “dot” files, such as “.htmllintrc”, that also get created and persisted to the repo, but here’s the catch: I want them there for the developers, but I want them purged when the site is published.

Using the original CodePipeline with the native S3 deployer was simple, but didn’t give me the opportunity to tidy up. That would require CodeBuild.

However, until this new announcement, using CodeBuild meant having an entire EC2-based build environment spun up (optionally inside a VPC), and waiting the 20 – 60 seconds for it to start before running your code. The time, and cost, wasn’t worth it in my opinion.

Until now, with Lambda.
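Selecting Lambda compute is a property of the CodeBuild project itself. Here’s a minimal CloudFormation sketch of roughly how that looks – the project name, service role and image tag are placeholders of mine, and the environment type and compute size shown are the Lambda-based options (check the CodeBuild documentation for the exact names and curated image tags):

TidyBuildProject:
  Type: AWS::CodeBuild::Project
  Properties:
    Name: tidy-static-site                             # placeholder name
    ServiceRole: !GetAtt CodeBuildServiceRole.Arn      # assumed IAM role, defined elsewhere
    Source:
      Type: CODEPIPELINE
      BuildSpec: buildspec.yml
    Artifacts:
      Type: CODEPIPELINE
    Environment:
      Type: LINUX_LAMBDA_CONTAINER                     # Lambda-based compute
      ComputeType: BUILD_LAMBDA_1GB                    # smallest Lambda compute size
      Image: aws/codebuild/amazonlinux-x86_64-lambda-standard:python3.11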

I defined (and committed to the repo) a buildspec.yml file, and the commands in the build section show what I am tidying up:

version: 0.2
phases:
  build:
    commands:
      - rm -f buildspec.yml .htmllintrc .pre-commit-config.yaml package-lock.json package.json
      - rm -rf .git

Yes, the buildspec.yml file is one of the files I don’t want to publish.

Time to change the pipeline order, and include a Build stage (sketched below) that creates a new output artifact based upon the tidied content. The above buildspec.yml file then has an additional section at the end:

artifacts:
  files:
    - '**/*'
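For reference, here’s roughly how that Build stage slots into the pipeline definition – a sketch only, with the action name being a placeholder of mine:

- Name: Build
  Actions:
    - Name: TidyUp                           # placeholder action name
      ActionTypeId:
        Category: Build
        Owner: AWS
        Provider: CodeBuild
        Version: '1'
      Configuration:
        ProjectName: !Ref TidyBuildProject   # the CodeBuild project sketched earlier
      InputArtifacts:
        - Name: SourceArtifact               # the artifact produced by the Source stage
      OutputArtifacts:
        - Name: TidySource                   # the new, tidied output artifact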

In the CodeBuild job config, we define a new name for the output artifact; in this case I called it “TidySource”. However, there was an issue with the output artifact from this.

When CodeCommit triggers a build, it makes a single artifact available to the pipeline: the ZIP contents from the repo, in an S3 Bucket for the pipeline. The format of this object’s key (name) is:

s3://codepipeline-${REGION}-${ID}/${PIPELINENAME}/SourceArti/${BUILDID}

The original S3 Deployer in CodePipeline understood that, and gave you the option to decompress the object (zip file) when it put it in the configured destination bucket.

CodeBuild supports multiple artifacts, and its format for the output object defined from the buildspec is:

s3://codepipeline-${REGION}-${ID}/${PIPELINENAME}/SourceArti/${BUILDID}/${CODEBUILDID}

As such, the S3 deployer would then look for an object matching the first syntax, and fail.

Hmmm.

OK, I had one more niggle about the S3 deployer: it doesn’t tidy up. If you delete something from your repo, the S3 deployer does not delete it from the deployment location – it just unpacks over the top, leaving previously deployed files in place.

So my last change was to ditch both the output artifact from CodeBuild and the original S3 deployer itself, and use the trusty aws s3 sync command, plus a few variables supplied from the CodePipeline (more on those below):

- aws s3 --region $AWS_REGION sync . s3://${S3DeployBucket}/${S3DeployPrefix}/ --delete
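Those ${S3DeployBucket} and ${S3DeployPrefix} values arrive as environment variables on the build. One way to supply them is from the pipeline’s Build action configuration – a sketch, with the bucket name and prefix being placeholders of mine:

Configuration:
  ProjectName: !Ref TidyBuildProject
  EnvironmentVariables: >-
    [{"name":"S3DeployBucket","value":"my-site-bucket","type":"PLAINTEXT"},
     {"name":"S3DeployPrefix","value":"prod","type":"PLAINTEXT"}]

With that in place, the deployment happens from within the CodeBuild (Lambda) run itself, and the pipeline no longer needs a separate deploy action.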

So the pipeline now looks like:

You can view the resulting web site at https://cv.jamesbromberger.com/. If you read the footnotes there, it will tell you about some of the pipeline. Now I have a new place to play in – automating some of the framework management via NPM during the build phase, and running a few sed commands to update the resulting paths in the HTML content.

But my big wins are:

  1. You can’t hit https://cv.jamesbromberger.com/.htmllintrc any more (not that there was anything sensitive in there, but I like to be… tidy)
  2. Older versions of frameworks (Bootstrap) are no longer lying around in /assets/bootstrap-${version}/.
  3. It’s not costing me more time or money to do this tidy-up, thanks to Lambda.

IPv6 for AWS Lambda connections (outbound)

Another step forward recently, with the announcement that AWS Lambda now supports IPv6 for connections made from your Lambda-executed code.

It’s great to see another minor improvement like this. External resources that your service depends upon – APIs, etc. – should now see connections over IPv6.

If you host an API, then you should be making it dual-stack in order to facilitate your clients making IPv6 connections, and to avoid things like the small charges and complexity of using up scarce IPv4 addresses.

However, this is also useful if you’re trying to access private resources within an AWS VPC.

VPC subnets can be IPv4-only, dual-stack, or IPv6-only. Taking the IPv6-only approach permits you to provision vast numbers of EC2 instances, RDS databases, and so on. Now your Lambda code can access those services directly, without needing a proxy (a potential bottleneck) to do so.
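For a VPC-attached function, the IPv6 egress is an opt-in flag on the function’s VPC config. A minimal CloudFormation sketch, assuming the Ipv6AllowedForDualStack setting, with the role, subnet and security group references being placeholders of mine:

Ipv6EgressFunction:
  Type: AWS::Lambda::Function
  Properties:
    Runtime: python3.11
    Handler: index.handler
    Role: !GetAtt LambdaExecutionRole.Arn      # assumed IAM role, defined elsewhere
    Code:
      ZipFile: |
        def handler(event, context):
            return "hello over IPv6"
    VpcConfig:
      SubnetIds:
        - !Ref DualStackSubnet                 # a dual-stack subnet
      SecurityGroupIds:
        - !Ref LambdaSecurityGroup
      Ipv6AllowedForDualStack: true            # permit outbound IPv6 from the function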

At some stage, we’ll be looking at VPCs that are IPv6 only, with only API Gateways and/or Elastic Load Balancers being dual stack for external inbound requests.

Presumably Lambda itself will be dual-stack for some time, but perhaps there is a future possibility that IPv6-only Lambda could be a thing – ditching the IPv4 requirement completely for use cases that support it. Even then, having a VPC Lambda connecting to an IPv6-only subnet, but with DNS64 and NAT64 enabled, would perhaps still permit backwards connectivity to IPv4-only services. It’s a few hoops to jump through, but could be useful when the occasional IPv4-only service is being accessed from your code.
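For completeness, here’s a rough sketch of the DNS64 + NAT64 plumbing that setup would rely on – the VPC, route table and NAT gateway references, and the example /64, are all placeholders of mine:

Ipv6OnlySubnet:
  Type: AWS::EC2::Subnet
  Properties:
    VpcId: !Ref Vpc
    Ipv6Native: true                           # IPv6-only subnet
    Ipv6CidrBlock: 2001:db8:1234:1a00::/64     # a /64 carved from the VPC's block (placeholder)
    EnableDns64: true                          # synthesise AAAA records for IPv4-only destinations

Nat64Route:
  Type: AWS::EC2::Route
  Properties:
    RouteTableId: !Ref Ipv6OnlyRouteTable
    DestinationIpv6CidrBlock: 64:ff9b::/96     # the DNS64 well-known prefix
    NatGatewayId: !Ref NatGateway              # the NAT gateway performs the 6-to-4 translation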

AWS SkillBuilder: moving to the new AWS Builder ID

For anyone learning AWS, the online learning platform that is SkillBuilder.aws is a well-known resource. It took over from training.aws as the training platform several years ago, though the latter lives on in a very minor way – as a bridge between SkillBuilder and the AWS Certification portal.

Many individuals hold AWS certifications – personally, I currently hold 9 of them. The majority of people have those certifications in an AWS Training & Certification account that is accessed via a personal login – something not linked to your employer. This is because certifications remain attached to the individual, not the employer.

Historically, this meant using Login with Amazon, i.e. the same credentials you could potentially use on the Amazon.com retail platform. Yes, the same credentials that some people buy underpants with also link to your AWS Certifications.

In the AWS Partner space, individuals often end up needing to acquire AWS Accreditations; however, these are not exposed on SkillBuilder.aws to individuals not linked to a recognised AWS Partner organisation. Instead, individuals must register in the AWS Partner portal, using a recognised email address domain that links to a given AWS Partner entity.

From the partner portal, a user can then access SkillBuilder, using the AWS Partner Portal as a login provider, triggering the creation of a (second) SkillBuilder login – but this time linked as a Partner account.

This is clearly confusing, having multiple logins.

SkillBuilder has evolved, and now also offers paid subscriptions for enterprises. For that set of organisations, federated logins (SSO) are available.

Now, the situation is changing again for individuals. AWS has introduced their separate identity store, called the AWS Builder ID.

Login prompt from Skillbuilder, 2023

Luckily, this new AWS Builder ID does not create a separate identity within SkillBuilder, but adds an additional login method to the account linked to your existing Login with Amazon identity.

Onboarding to this is easy. The process validates that you have the email address, and then you can simply use the AWS Builder ID login option where you used to use Login with Amazon.

I’d suggest if you are learning AWS, start using your AWS Builder ID to access SkillBuilder.

Software License Depreciation in a Cloud World

Much effort is spent on preserving and optimising software licences when organisations shift their workloads to a cloud provider. They’re seen as a “sunk cost” – something that needs to be taken whole into the new world, without question.

However, some vendors don’t like their customers using certain cloud providers, and are making things progressively more difficult for those organisations that want (or are required) to keep their software stack well maintained.

Case in point: one software vendor, which has its own cloud platform, made significant changes to its licensing, progressively removing customers’ rights to choose to run their acquired licences in a competitor’s cloud.

I say progressively, as customers can continue to run the (now older) versions of the software released before the point in time at which the licensing was modified.

The Security Focus

Security in IT is a moving target. There are always better ways of doing something, and previous ways which once were the best, but are now deemed obsolete.

Let me give you a clear example: network encryption in flight. The dominant protocol used to negotiate this is Transport Layer Security (TLS), and it’s something I’ve written about many times. There are different versions (and if you dig back far enough, it even had a different name – SSL, or Secure Sockets Layer).

Older TLS versions have been found to be weaker, and newer versions have been implemented to replace them.

But certain industry regulators have mandated only the latest versions be used.

Support for TLS is embedded in both your computer’s operating system and in certain applications that you run. This permits an application to make outbound connections using TLS, as well as to listen for and accept connections protected with TLS.

Take a database server: it’s listening for connections. Unless you’ve been living under a rock, the standard approach these days is to insist on encryption in flight in each segment of your application. Application servers may access your database, but only if the connection is encrypted – despite them sitting in the same data centre, possibly in the same rack or on the same physical host! It’s an added layer of security, and the optimisations done mean it’s rarely a significant overhead compared to the eavesdropping protection it grants you.
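As a concrete illustration, on a managed database service this “insist on encryption” posture is often a single setting. A sketch for RDS PostgreSQL, assuming the rds.force_ssl parameter and a postgres15 parameter group family:

ForceSslParameters:
  Type: AWS::RDS::DBParameterGroup
  Properties:
    Description: Reject unencrypted client connections
    Family: postgres15                 # assumed engine family
    Parameters:
      rds.force_ssl: 1                 # the server refuses non-TLS connections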

Your operating system from, say, 2019 or before may not support the latest TLS 1.3 – some vendors were pretty slow in implementing support for it, and only did so when you installed a new version of the entire operating system. And then some application providers didn’t integrate the increased capability (or a control to permit or limit the version of TLS) into their software in those older versions from 2019 or earlier.

But in newer versions they have fixed this.

Right now, most compliance programs require only TLS 1.2 or newer, but it is foreseeable that in future, organisations will be required to “raise the bar” (or drawbridge) to use only TLS 1.3 (or newer), at which time, all that older software becomes unusable.
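In practice, “raising the bar” is often a one-line policy change on the listening endpoint. A sketch for an Application Load Balancer listener – the load balancer, certificate and target group references are placeholders of mine, and a TLS 1.3-only policy also exists for when the bar is raised further:

HttpsListener:
  Type: AWS::ElasticLoadBalancingV2::Listener
  Properties:
    LoadBalancerArn: !Ref PublicAlb
    Port: 443
    Protocol: HTTPS
    SslPolicy: ELBSecurityPolicy-TLS13-1-2-2021-06   # TLS 1.2 and 1.3 only
    Certificates:
      - CertificateArn: !Ref SiteCertificate
    DefaultActions:
      - Type: forward
        TargetGroupArn: !Ref AppTargetGroup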

Those licences become worthless.

Of course, the vendor would love you to take a new licence, but only if you don’t use other cloud providers.

Vendor Stickiness

At this time, you may be thinking that this is not a great customer relationship. You have an asset that, over time, will become useless, and you are being restricted from using your licence under newer terms.

The question then turns to “why do we use this vendor?”. And often it is for historical reasons: “we’ve always used XYZ database”, “we already have a site licence for their products, so we use it for everything”. Turns out, that’s a trap. Trying to extract cost savings by forcing technology decisions based on what you already have may preclude you from keeping flexibility in your favour.

For some in the industry, the short-term goal is the only objective; they sign a purchase order to reach an immediate objective, without taking the longer-term view of where that is leading the organisation – even if that’s backing them into a corner. They celebrate the short-term win, get a few games of golf out of it, and then go hunting for their next role elsewhere, using the impressive short-term saving as their report card.

A former colleague of mine once wrote that senior executive bonuses shouldn’t be paid out in the same calendar year, but delayed (perhaps 3 years) to ensure that the longer term success was the right outcome.

Those with more fortitude for change have, over the last decade, been embracing Open Source solutions for more of their software stack. The lack of licence restriction – and licence cost – makes it palatable.

The challenge is having a team who can not only implement potential software changes, but also support a new component in your technology stack. For incumbent operations and support teams, this can be an upskilling challenge; some won’t want to learn something new, and will churn up large amounts of Fear, Uncertainty and Doubt (FUD). Ultimately, they argue it is better to just keep doing what we’ve always done, and pay the financial cost, instead of making the effort to do something better.

Because better is change, and change is hard.

An Example

Several years ago, my colleagues helped rewrite a Java-based application and change the database from Oracle to PostgreSQL. It was a few months from start to finish, with significant testing. Both the Oracle and PostgreSQL databases were running happily on the AWS Relational Database Service (RDS). The database was simple table storage, but the original application developers already had a site licence for Oracle, and since that’s what they had, that’s what they used.

At the end of the project, the cost savings were significant. The return on investment for the project services to implement the change was around 3 months, and now, years later, the client is so much better off financially. It changed the trajectory of the TCO spend.

The coming software apocalypse

So all these licences that are starting to hold back innovation are becoming progressively more problematic. The next time security requirements tighten, you’re going to hear a lot of very large, legacy software licence agreements disintegrate.

Meanwhile, some cloud providers can bundle the software licence into the hourly compute usage fee. If you use it, you pay for it; when you don’t use it, you don’t pay for it. If you want a newer version, then you have the flexibility to move to it. Or perhaps even to stop using it.