AWS Re:Invent Day 1 thoughts

This is going to be a long week of learning how the world has changed. I’m already tired, and I’m not even there. My brain hurts (you’d not believe how many typos I am correcting here).

While (once again) I am not at Re:Invent in Las Vegas, Nevada, I’m tuned in to as many news sources as possible to try and catch which parts of the undifferentiated heavy lifting have changed. I’ve been one of the AWS Cloud Warriors for the last two years (2017-2018), which has meant I’ve been lucky enough to be given a conference ticket, but unfortunately I’ve not been able to get there.

While I may not be physically there, I am in spirit, having been nominated as one of the AWS Ambassadors.

However, the live stream video (which has improved dramatically since 2014), the Tweets from various people, the updates on LinkedIn, RSS feeds, Release Notes, the What’s New page, the AWS Blog (hi Jeff), and indeed the Recent Changes/Release History sections of many documentation pages (such as this Release History page for CloudFormation) have given me plenty of information to trawl through.

It’s now Tuesday night in Perth, Western Australia, and day two of Re:Invent, but it’s only 7am Tuesday morning in Las Vegas (yes, I’m 16 hours in the future). Here are my thoughts on the releases thus far:

100 Gb/s networking in VPC

The ENA network interface was previously limited to 25 Gb/sec per instance on the largest instance types. Indeed, it’s worth noting that most network resources are limited to some degree by the instance size within an instance family. But now a new family – the C5n instances – has interfaces capable of up to 100 Gb/sec (that’s 12.5 GB/s – little b is bits, big B is bytes).

Much has been said about network throughput, the comparison between ENA and SR-IOV in the AWS Cloud, and comparisons to other Cloud environments. 100 Gb/s now sets a new high bar that other vendors are yet to reach.

While it’s wonderful to have that level of throughput, it’s also worth noting that scale-out is still sometimes a good idea. 100 instances at 1 Gb/s each may sometimes provide a better solution, but then again sometimes a problem doesn’t split nicely between multiple server instances. YMMV.

Transit gateway

Managing an enterprise within AWS is usually a case of managing multiple AWS accounts. That ultimate separation at the console/account level, however, eventually raises integration questions around networking, governance and other considerations.

In March 2014 (yes, four and a half years ago), the VPC service team introduced VPC Peering: a non-transitive peering arrangement between VPCs – a non-throttled, no-single-point-of-failure way of meshing two separate VPCs together (including across separate accounts).

This announcement now gives a transitive way (hence the name) of meshing a sprawling enterprise deployment together. There are multiple reasons for doing so:

  • Compliance: all outbound (to Internet) traffic is deemed by your corporate policy to funnel via specific, centralised gateways.
  • Management overhead: organising N VPCs into a full mesh means creating N*(N-1)/2 peering arrangements, and double that number of routing table entries. If we have 4 environments (dev, test, UAT and Production), and 10 applications each in their own VPC, then that’s 40 VPCs, 780 peering relationships and 1560 routing table updates (see the sketch below).
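
A quick sketch of that arithmetic, using the 40-VPC figure above as the assumed input:

```python
# Rough arithmetic for the full-mesh peering overhead described above.
def full_mesh_peering(n_vpcs):
    peerings = n_vpcs * (n_vpcs - 1) // 2   # one peering per pair of VPCs
    route_entries = peerings * 2            # a route table entry on each side
    return peerings, route_entries

print(full_mesh_peering(40))   # (780, 1560)
```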

It’s worth noting that in some organisations, an account’s administrative users may not themselves have access to create an IGW for access to the Internet; a Transit Gateway may be the only permitted form of connectivity, so that it can be centrally managed.

But in taking on central management, you now have a few considerations:

  1. Blast radius: if you stuff up the Transit Gateway configuration, you take down the organisation. With separation and peering, each VPC is its own blast radius.
  2. Cost: the Transit Gateway isn’t free, so you probably still want S3 Endpoints for large-volume object storage rather than sending that traffic through it.
  3. Throughput: 50 Gb/s may seem a lot, but now there are 100 Gb/s instances.
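
For completeness, here’s a minimal sketch of creating a Transit Gateway and attaching a single VPC with boto3; the region, VPC and subnet IDs are hypothetical placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-2")

# Create the central hub, with defaults for route table association/propagation.
tgw = ec2.create_transit_gateway(
    Description="Central hub for the organisation",
    Options={
        "AmazonSideAsn": 64512,
        "DefaultRouteTableAssociation": "enable",
        "DefaultRouteTablePropagation": "enable",
    },
)
tgw_id = tgw["TransitGateway"]["TransitGatewayId"]

# Attach one (hypothetical) VPC; in practice you'd pass one subnet per AZ.
attachment = ec2.create_transit_gateway_vpc_attachment(
    TransitGatewayId=tgw_id,
    VpcId="vpc-0123456789abcdef0",
    SubnetIds=["subnet-0123456789abcdef0"],
)
print(attachment["TransitGatewayVpcAttachment"]["State"])
```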

ARM based A1 Instances

In 2013, when I worked at AWS, I spoke with friends at ARM and with AWS service teams about the possibility of this happening. The attractiveness of the reduced power envelope, and the cost of the chip itself, made it look compelling even then. This was before Windows was compiled for ARM – and that support is only strengthening. It’s heartening now to see this coming out the door, giving customers choice.

Earlier this month we saw an announcement about AMD CPUs. Now we have three CPU manufacturers to choose from in the cloud when looking to run virtual machines. Customers can now vote with their workloads as to what they want to use, and the CPU manufacturers have more reason to innovate and make better, faster or cheaper CPUs available. When you can switch platforms easily (you do DevOps, right? All scripted installs?) then it’s perhaps down to the cost question now.

Now, it was recently announced that there will be a t3a. I wonder if there will be a t3a1?

Compute in the cloud just got even more commodity. Simon Wardley, fire up your maps.

S3 improvements (lots here)

Gosh, so much here already.

Firstly, an admission that AWS Glacier is no longer its own service: it has been folded under S3 and renamed S3 Glacier. There’s a new API for Glacier to make it easier to work with, and the ability to put objects to S3 and have them stored immediately as Glacier objects, without needing zero-day archive lifecycle policies.
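
A minimal boto3 sketch of that direct-to-Glacier PUT (bucket and key names are hypothetical):

```python
import boto3

s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-archive-bucket",
    Key="2018/backup.tar.gz",
    Body=open("backup.tar.gz", "rb"),
    StorageClass="GLACIER",   # stored in the Glacier storage class from day zero
)
```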

SFTP transfers – finally, a commodity protocol for file uploads that simple integrators can use, without having to deploy your own maintained, patched, fault-tolerant, scalable ingestion fleet of servers. This right here is the definition of undifferentiated heavy lifting being simplified; but at a price of 30c/hour, you’re looking at around US$216/month before you include any data transfer charges.
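
The back-of-envelope behind that figure, assuming the endpoint is enabled 24 hours a day for a 30-day month:

```python
# Endpoint-hour cost only; per-GB upload/download charges come on top.
hourly_rate = 0.30            # USD per endpoint-hour
hours_per_month = 24 * 30
print(hourly_rate * hours_per_month)   # 216.0 USD per month
```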

Object Lock: the ability to put objects and not be able to delete them for a period – for when you have strict compliance requirements. Currently this can only be enabled on a bucket at creation time.
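
A boto3 sketch of that create-time requirement, with a default retention rule applied afterwards (the bucket name, region and retention period are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Object Lock must be switched on when the bucket is created.
s3.create_bucket(
    Bucket="my-compliance-bucket",
    CreateBucketConfiguration={"LocationConstraint": "ap-southeast-2"},
    ObjectLockEnabledForBucket=True,
)

# Then a default retention rule can be set for new objects.
s3.put_object_lock_configuration(
    Bucket="my-compliance-bucket",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 365}},
    },
)
```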

S3 events seem to have become a lot more detailed as well, with more trigger types that can be sent to SQS, SNS, or straight to a Lambda function.
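
A sketch of wiring such events straight to a Lambda function (the bucket name and function ARN are placeholders, and the function must already permit S3 to invoke it):

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_notification_configuration(
    Bucket="my-archive-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                # Example event types: object creation, and Glacier restore completion.
                "LambdaFunctionArn": "arn:aws:lambda:ap-southeast-2:123456789012:function:on-restore",
                "Events": ["s3:ObjectCreated:*", "s3:ObjectRestore:Completed"],
            }
        ]
    },
)
```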

KMS with dedicated HSM storage

KMS has simplified the way that key management is done, but some organisations require a dedicated HSM for compliance reasons. Now you can tell KMS to use your custom key store (a single-tenanted CloudHSM cluster in your VPC) as the storage for these keys, but still use the KMS APIs for your key interactions, and use those keys with AWS services.
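
A rough boto3 sketch of the flow (the cluster ID, trust anchor certificate and password are hypothetical, and the CloudHSM cluster must already exist):

```python
import boto3

kms = boto3.client("kms")

# Register the CloudHSM cluster as a custom key store.
store = kms.create_custom_key_store(
    CustomKeyStoreName="my-hsm-key-store",
    CloudHsmClusterId="cluster-0123456789a",
    TrustAnchorCertificate=open("customerCA.crt").read(),
    KeyStorePassword="kmsuser-password",   # the CloudHSM kmsuser credential
)

# Connecting is asynchronous; it must finish before keys can be created.
kms.connect_custom_key_store(CustomKeyStoreId=store["CustomKeyStoreId"])

key = kms.create_key(
    Description="CMK backed by the dedicated HSM cluster",
    Origin="AWS_CLOUDHSM",
    CustomKeyStoreId=store["CustomKeyStoreId"],
)
print(key["KeyMetadata"]["KeyId"])
```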

A dedicated Security Conference

Boston, End of June. Two days.

Not so new (but really recent)

CLI Version 2

Something so critical – the CLI – used by so many poor-man (poor-person) integrations and CI/CD pipelines, now has a version 2 in the works. It’s breaking-changes time – but in the meantime, the v1 CLI continues to get updates.

Predictive AutoScaling

Having EC2 AutoScaling reactively scale when thresholds are breached has been great, but combining that with machine learning based on previous scaling events to make predictive scaling is next-level.

Lambda Support for Python 3.7

You may initially think this is trivial, stepping Lambda up from Python 3.6 to Python 3.7, but it means that Python Lambda code can now make TLS 1.3 requests. Updating from Python 3.6 to 3.7 is mostly trivial; from 2.7 to 3.x normally means re-factoring urllib/requests client code and liberal use of parentheses where previously they weren’t required (eg, for print()).
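
A quick way to verify the TLS 1.3 claim on any Python 3.7 build (assuming it is linked against OpenSSL 1.1.1 or later):

```python
import ssl

print(ssl.OPENSSL_VERSION)   # e.g. "OpenSSL 1.1.1"
print(ssl.HAS_TLSv1_3)       # True when the runtime can negotiate TLS 1.3
```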

S3: Public Access Blocking

Block Public Access finally removes the need for custom bucket policies to prevent accidental uploads with a public ACL (which, when you’re using a third-party S3 client where you can’t see or control the ACL used, may be scary). The downfall of the previous policies that rejected uploads if a public ACL (or anything other than acl:private) was used is that they interfered with the ability to do multi-part uploads (a different API).
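
A minimal sketch of switching on all four settings for a single bucket with boto3 (the bucket name is hypothetical; the same flags can also be applied account-wide via the s3control API):

```python
import boto3

s3 = boto3.client("s3")
s3.put_public_access_block(
    Bucket="my-archive-bucket",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,        # reject PUTs that carry a public ACL
        "IgnorePublicAcls": True,       # treat any existing public ACLs as private
        "BlockPublicPolicy": True,      # reject bucket policies that grant public access
        "RestrictPublicBuckets": True,  # restrict access if a public policy already exists
    },
)
```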

There have been way too many cases of customers leaving objects publicly accessible. This will become a critical control in future. Most organisations don’t want public access to S3; those that do want public, anonymous access should probably be using CloudFront to serve it (with a CloudFront Origin Access Identity, and Lambda@Edge to handle auto-indexing and trailing-slash redirects).

DynamoDB: Encrypted by default

A big step up. In reality, the ‘encryption at rest’ scenario within AWS is something of a formality. As one of the few people in Australia who has actually been inside a US-East-1 facility (hey QuinnyPig, I recall that from your slide two weeks ago at Latency Conf), I can say the physical security is superb; the logical allocation of data and the knowledge of its physical location are the responsibility of separate teams.

So given that someone in the facility doesn’t know where your data is, and someone who knows where it is doesn’t have physical access (and those with physical access can’t smuggle storage devices in or out), we’re already at a high bar (physical devices only leave facilities when crushed into a very fine powder, particularly for SSD-based storage).

So the encrypted-at-rest capability is more of a nice-to-have – an extra protection should the standard storage-wiping techniques (already very robust) have an issue. But given that the bulk of the AES algorithm has been implemented in CPU instruction-set extensions for years, the overhead of encryption is essentially negligible.

Summary

I’ve tried my best to stay aware of so much, but the last 24 months have stretched the definition of what Cloud is so very wide. IoT, Robotics, Machine Learning, Vision Processing, Connect, Alexa, Analytics, DeepLens – the list is so wide before you even dive deep into the details. And the existing stalwarts – EC2, S3, SQS, and even VPC – keep getting richer and richer.

The above are just the services I’ve been interested in – there is definitely a hell of a lot more from the last 24 hours as well.

What’s today (US time) going to bring? I need to get some sleep, because it’s exhausting just trying to keep my brain up to date.

Gartner Magic Quadrant for Cloud 2018: Half the players decimated

It’s as if the left hand side of the 2017 Gartner Cloud MQ just imploded! And just as interesting, most of those that were in the Visionaries quadrant are now relegated to the Niche Players sector.

Only six players remain compared to last year’s 14. Let’s see who has gone this year:

  • Skytap
  • NTT
  • Joyent
  • Interoute
  • Fujitsu
  • Rackspace
  • CenturyLink
  • Virtustream

It’s no surprise to those close to the ground that the only real survivors here thus far are AWS and MS Azure, with Google barely making it into the top-right corner – it actually looks like they just flopped over the line from last year, but that’s progress nonetheless. And while the gap between the top two is closing, AWS is still far above in “Ability to Execute”, and slightly ahead in “Completeness of Vision”.

For those that had chosen one of the departed eight as their Cloud provider, it’s time to question what their strategy is, and what yours is. Gartner may have upped its inclusion requirements, resulting in some of these players being filtered out, and that may have no impact on those providers. But it may resonate poorly with their sales prospects going forward, which could mean a long-term downward trend: increased cost per customer, loss of economies of scale, etc.

Meanwhile, the rate of innovation by service improvement, refinement, or new offerings continues.

AWS VPCs: Calculating Subnets in CloudFormation

Virtual Private Cloud is a construct in AWS that gives the customer their own, er, virtual network for the deployment of network-based resources such as virtual machines and more. It’s been around for nearly a decade, and is a basic construct that helps provide security for those resources within an AWS Region.

CloudFormation is the templating service (text, either YAML or JSON) that takes a definition of the resources you would like configured and executes the creation of those resources for you, saving you the hassle of either navigating the web console for hours or scripting up many API calls (which could be thousands of create calls).

VPCs can be quite complex; they can specify subnets for resources across multiple Availability Zones within a Region, define routing tables, Endpoints to create, and much more. So it probably comes as no surprise that managing a VPC via CloudFormation is a natural desire: the configuration of the virtual network for a workload needs to be as manageable in a CI/CD fashion as the workload that will live inside it.

But there’s often been a limitation in making this simple: mathematics.
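
CloudFormation’s Fn::Cidr intrinsic function can do this kind of carving for you; as an illustration of the arithmetic involved, here’s a sketch in plain Python using the standard ipaddress module (the /21 VPC range and /24 subnet size are assumed figures):

```python
import ipaddress

# Split an assumed /21 VPC range into eight /24 subnets, one per AZ/tier.
vpc_cidr = ipaddress.ip_network("10.0.0.0/21")
subnets = list(vpc_cidr.subnets(new_prefix=24))

for index, subnet in enumerate(subnets):
    print(index, subnet)   # 0 10.0.0.0/24, 1 10.0.1.0/24, ...
```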

AWS Certifications in Perth (II)

I wrote last year about sitting AWS Certifications in Perth. I’ve done another two AWS Certifications in the last month (Networking Specialty, and Cloud Practitioner), and a few things have changed. Gone is Kryterion as the assessment provider, and in has come PSI; this means new venues – and there’s now only one in Perth, at 100 Havelock St, West Perth.

It’s a new-ish building I know well; an old friend was working on the top floor for a while, and I spoke to his teams about AWS several times (they became an AWS reference customer). There’s a small Italian-inspired coffee shop on the ground floor (more on this later).

The booking process for exams is much the same, but now via https://aws.training/ (funky new DNS TLD). The certifications with PSI happen via their custom-rigged Kiosk systems: a PC with two webcams, one mounted on the monitor facing the candidate, and one positioned on a mast protruding above the screen, facing down at the desk. With these two cameras, a remote moderator can view the candidate and the desk at all times to ensure no reference materials are being used; one person monitoring remotely can theoretically proctor multiple candidates in many locations simultaneously (I suspect they are listening too).

With this custom rig, there are only limited seats – in Perth, there are two. The booking process schedules candidates to one of these Kiosks – literally called Kiosk 1 and Kiosk 2 – which are located in a small room on the 1st floor of 100 Havelock St, looked after by the friendly Regus staff.

The exam start time is often 8:30am, and the advice in the booking emails recommends turning up 15 minutes before this. By contrast, some non-AWS exams scheduled with PSI on the same Kiosks recommend arriving 30 minutes beforehand. But there’s a catch: the doors on the ground floor do not unlock until around 8:25am, and the Regus desk often isn’t staffed until 8:30am (Regus checks you in and sets you up at the Kiosk).

Unlike the Kryterion centers, this doesn’t seem to be a big problem – previously, being just a few minutes late was an issue. So if you do get there with plenty of time, the aforementioned cafe on the ground floor is open much earlier (it was open at 8:00am the day I got there early).

Photo ID is critical to have with you; a scanner mounted on the Kiosk rig is used to capture an image of documents like Passports and Drivers’ Licences. You should have two forms of photo ID, but bank cards or other cards can supplement them if needed (just cover some of your card numbers for security’s sake). The moderator watching the camera compares the photo ID with the image of you sitting there in real time.

The assessment interface itself is then very similar, with the addition of a chat window to communicate with the moderator at any time. Feedback comments can be left on questions. I found one question whose multi-choice answers did not reflect a service change made in mid-December (just a few weeks ago), so I left a comment for the AWS certification team on this and followed up with my contacts directly.

I’ve had no problem scheduling certifications with a week’s notice, but I envisage that as demand grows, the lead time to book a slot may become an issue until more Kiosks (or additional venues) are added. But that’s not an issue right now.

AWS GuardDuty: taking on the undifferentiated heavy lifting of network security analytics

GuardDuty is a machine-learning security analytics service for AWS.

Several years ago saw the introduction of AWS CloudTrail, the ‘almost’ audit log of API calls performed by a customer against an AWS account. This was a huge security milestone: the ability for the customer to play back what they had asked for.

I say ‘almost’, as a critical design decision for CloudTrail was that it should in no way inhibit the already-authenticated API call that had been made by the customer. If the internal logging mechanism of CloudTrail were ever to fail, it should not stop the API call that was issued. Other logging mechanisms in computing may place logging in the critical path of call execution, where if logging fails, the API call fails.

With CloudTrail (and the ability to deliver it cross-account, directly from AWS to a trusted, independent account) came the second task: looking at the data. It’s all JSON text, and it has a corresponding chain of check-summed and signed digest files, meaning the set of log files cannot be tampered with, and cannot be removed without breaking the chain.

Numerous solutions were put in place, but they were mostly basic individual pattern matches against single lines of logs: if you see X, then alert with message Y. For example: if there is a ConsoleLogin event, and it doesn’t come from XX.YY.ZZ.AA/32, then alert.
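
A sketch of that style of single-event matching against a CloudTrail log file (the trusted CIDR and file name are hypothetical):

```python
import ipaddress
import json

TRUSTED = ipaddress.ip_network("203.0.113.10/32")   # hypothetical office address

# CloudTrail log files delivered to S3 carry a top-level "Records" array.
with open("cloudtrail-log.json") as f:
    records = json.load(f)["Records"]

for event in records:
    if event.get("eventName") == "ConsoleLogin":
        source = ipaddress.ip_address(event["sourceIPAddress"])
        if source not in TRUSTED:
            print(f"ALERT: console login from {source} at {event['eventTime']}")
```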

Similarly, VPC introduced VPC Flow Logs, tracking the acceptance or rejection of connections through the VPC (no payload content; just payload size, start time, ports, and addresses).

In December, AWS introduced a managed service that takes a private copy of the VPC Flow Logs, a private copy of the CloudTrail log, and the Route 53 query logs, supplements these with centrally managed, maintained and updated threat lists, mixes in customer-defined threat lists and white lists plus a bit of machine learning, and produces much richer alerting.
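
Turning it on is deliberately simple; a minimal boto3 sketch (region chosen arbitrarily, and assuming no detector exists in that region yet) that enables a detector and pulls back whatever findings exist:

```python
import boto3

gd = boto3.client("guardduty", region_name="ap-southeast-2")

# One detector per region; this fails if a detector already exists.
detector_id = gd.create_detector(Enable=True)["DetectorId"]

finding_ids = gd.list_findings(DetectorId=detector_id)["FindingIds"]
if finding_ids:
    findings = gd.get_findings(DetectorId=detector_id, FindingIds=finding_ids)
    for finding in findings["Findings"]:
        print(finding["Severity"], finding["Type"], finding["Title"])
```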

GuardDuty isn’t finished yet. At re:Invent, Tom Stickle showed a graph indicating a slew of additional capability coming shortly to GuardDuty, and now that it’s GA, more customers will have feedback and input into the future direction of the service.

However, this doesn’t replace the need to have your own, secured and trusted copy of your CloudTrail logs, and your own alerting for events that you think are particularly significant, such as a SAML Identity Provider being updated with a new Metadata document!

But between this, and Amazon Macie (for analysing and helping you review and secure your S3 documents), your visibility of security compliance and issues continues to get even higher.