Current AWS Workload recommendations December 2020

There’s a heap of best practice advice about workloads online and in AWS documentation; here are some of my current thoughts as at December 2020. Your mileage may vary, caveat emptor, no warranty expressed or implied, and you may have use cases that justify something different:

| Pattern | Recommendation | Rationale |
|---|---|---|
| Multi-AZ VPC | Design address space for 4 AZs | In an AZ outage, having just one AZ remaining to satisfy demand during a rush is not enough; and with contiguous address space and CIDR masks, the step after 2 AZs is 4. |
| VPC DNSSEC validation | Enable for VPC validation, but be ready for external zones to stuff up their DNSSEC keys | Failing closed may be better than failing open, but new failure modes need to be understood. |
| Route53 Hosted Zone DNSSEC | Hold off until current issues are resolved if you use CloudFront | New service, new failure modes. |
| TLS versions | TLS 1.2 and above only | Older versions have already been removed from many clients; be ready for TLS 1.3 and above only. |
| VPC IPv6 | Enable for all subnets | 33% of traffic worldwide is now IPv6; your external interfaces (ALB/NLB) should all be dual-stack now as a minimum. Don’t forget your AAAA Alias DNS records. |
| VPC external egress for private subnets | Minimise; avoid if possible | You shouldn’t have any boot-time or runtime dependencies apart from the outbound integrations you are explicitly creating. Use VPC Endpoints for S3 and other services. Minimise Internet transit. |
| CloudFront IPv6 | Enable for all distributions | As above, particularly if your origin is only on IPv4. Don’t forget your AAAA Alias DNS records. |
| HTTP interfaces | Only for the apex of the domain, if you think people will type your address by hand into a browser; for all other services, do not listen on port 80 (HTTP) | Avoid convenience redirects; they are a point of weakness. Use HTTPS for everything, including internal services. |
| ACM public TLS certificates | Use DNS validation, and leave the validation records in place for subsequent reissue | Removes the manual work in renewing and redeploying certificates. |
| S3 Block Public Access | Do this for every bucket and, if possible, account-wide as well (see the sketch after this table) | Two levels of this, in case you have to disable the account-wide setting in future. |
| S3 website public (anonymous) hosting | Do not use; look at CloudFront with an Origin Access Identity | You can’t get a custom certificate or control TLS on S3 website endpoints. But beware default document handling and other issues. |
| S3 access logging | Enable, but set a retention (lifecycle) policy on the S3 bucket | No logs means no evidence when investigating issues. |
| CloudFront access logging | Enable, but set a retention (lifecycle) policy on the S3 bucket | No logs means no evidence when investigating issues. |
| VPC Flow Logs | Enable for all VPCs, but set a retention policy on the CloudWatch Log group | No logs means no evidence when investigating issues. |
| Database | Use RDS or Aurora wherever possible | Less operational overhead. |
| RDS maintenance: minor versions | Always adopt the latest minor version proactively, from Dev through to Prod | Don’t wait for the automatic upgrade to happen; that typically only occurs when the version you are on is being decommissioned. |
| RDS maintenance: major versions | After testing, move up to the latest major version | Avoid being on a decommissioned major version; the enforced upgrade may be a bigger jump forward than your application can support. |
| RDS encryption in flight | Enforce | Ensure privacy of the connection credentials regardless of where the client is. Don’t assume the client configuration to use encryption is correct. |
| RDS encryption in flight | Validate | Get the RDS CA certificate(s) into your trust path at application build time. Always automate bringing them in (and validate and log where you get them from). |
| RDS encryption at rest | Enable | KMS is fine. Use a dedicated key for important workloads (and don’t share the key with other accounts). |
| DNS records | Always publish a CAA and an SPF record, even for parked domains | Protect against risk and reputational damage. |
| HTTP security headers | Validate on SecurityHeaders, Hardenize, SSL Labs, Mozilla Observatory, and Google Lighthouse (and possibly more) | This is an entire lesson in itself, but an A grade will stand you in good stead. |
| HTTP security headers: HSTS | Enforce HSTS for a year | We’re never going back to unencrypted HTTP. |
| Public CDNs for libraries in major projects | Avoid; host your own assets | Remove external dependencies. |
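
As one worked example from the table: a minimal sketch of setting S3 Block Public Access at both levels with boto3. The bucket name and account ID below are placeholders for illustration only:

```python
import boto3

# Placeholder identifiers; substitute your own.
ACCOUNT_ID = "123456789012"
BUCKET = "my-example-bucket"

block_all = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}

# Level 1: block public access on the individual bucket.
boto3.client("s3").put_public_access_block(
    Bucket=BUCKET, PublicAccessBlockConfiguration=block_all
)

# Level 2: block public access account-wide as well, so a future
# misconfigured bucket is still covered.
boto3.client("s3control").put_public_access_block(
    AccountId=ACCOUNT_ID, PublicAccessBlockConfiguration=block_all
)
```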

DNSSEC and Route53

DNS is one of the last insecure protocols in use. Since 1983 it has helped identify resources on the Internet, with a name space and a hierarchy based upon a common agreed root.

Your local device – your laptop, your phone, your smart TV, whatever you’re using to read this article – is typically configured with a local DNS resolver; when your device needs to look up an address, it asks that resolver to go and find the answer to the query.

The protocol used by your local device to talk to the resolver, and by the resolver to reach out across the Internet, is unencrypted. It normally runs on UDP port 53, switching to TCP port 53 under certain conditions.
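
A sketch of that behaviour, using the third-party dnspython library (the resolver address here is Google’s public 8.8.8.8, purely as an assumed example): the query goes out over UDP and falls back to TCP when the response comes back truncated.

```python
import dns.flags
import dns.message
import dns.query

# Build a plain A-record query and send it over UDP port 53.
query = dns.message.make_query("example.com", "A")
response = dns.query.udp(query, "8.8.8.8", timeout=2)

# If the TC (truncated) bit is set, retry over TCP port 53,
# the same fallback a real resolver performs.
if response.flags & dns.flags.TC:
    response = dns.query.tcp(query, "8.8.8.8", timeout=2)

print(response.answer)
```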

There is no privacy across either your local network, or the wider Internet, of what records are being looked up or the responses coming back.

There’s also no validation that the response sent back to the resolver IS the correct answer, and malicious actors may try to send spurious, invalid responses to your upstream resolver. For example, I could get my laptop onto the same WiFi as you, and send UDP packets to your configured resolver telling it that the answer for “www.bank.com” is my address, in order to get you to connect to a fake service I am running and try to capture some data from you (like your username and password). Hopefully your bank is using HTTPS, and the certificate warning you would likely get would be enough to stop you from entering information that I would receive.

The solution to this was to use digital signatures (not encryption) to provide verification of the DNS response received by the upstream resolver from across the Internet. And thus DNSSEC was born in 1997 (23 years ago as at 2020).

The take up has been slow.

Part of this has been the need for each component of a DNS name – each zone – to deploy a DNSSEC-capable DNS server to generate the signatures, and then for each domain to be signed.

The public DNS root was signed in 2010, along with some of the original Top Level Domains. Today the Wikipedia page for the Internet TLDs shows a large number of them are signed and ready for their customers to have their DNS domains return DNSSEC results.

Since 2012, US Government agencies have been required by NIST to deploy DNSSEC, but most years agencies opt out of this: it’s been too difficult, or the DNS software or service they are using to host their domain does not support it.

Two parts to DNSSEC

On one side, the operator of the zone being looked up (and its parent domains) all need to support and have established a chain of trust for DNSSEC. If you turn on DNSSEC for your authoritative domain, then clients who are not validating the responses won’t see any difference.

Separately, the client-side DNS resolver (often operated by your ISP, telco, or network provider) needs to understand and validate the DNSSEC response. If DNSSEC validation is turned on for your resolver, then there’s no impact when resolving domains that don’t support DNSSEC.

Both of these need to be in place to offer some form of protection against DNS spoofing, cache poisoning, and other attacks.
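
As a quick illustration of the client side (again with dnspython, and Google’s validating 8.8.8.8 as an assumed example resolver), you can ask for DNSSEC processing and check whether the resolver set the AD (Authenticated Data) flag on the response:

```python
import dns.flags
import dns.resolver

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["8.8.8.8"]        # a resolver known to validate DNSSEC
resolver.use_edns(0, dns.flags.DO, 1232)  # set the DO bit: "DNSSEC OK"

answer = resolver.resolve("example.com", "A")

# AD (Authenticated Data) means the resolver validated the DNSSEC chain.
validated = bool(answer.response.flags & dns.flags.AD)
print(f"DNSSEC validated by the resolver: {validated}")
```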

Route 53 Support for DNSSEC

In December 2020, Route53 finally announced support for DNSSEC, after many years and many customer requests. And this support comes in two ways.

Firstly, there is now a tick box to enable the VPC-provided resolver to validate DNSSEC entries, if they are received. It’s either on or off at this stage.

And separately, for hosted DNS Zones (your domains), you can now enable DNSSEC and have signed responses sent by Route53 for queries to your DNS entries, so they can be validated.

A significant caveat right now (Dec 2020) for hosted zones is that this doesn’t support the custom Route53 ALIAS record type, used for defining custom names for CloudFront distributions.

DNSSEC Considerations: VPC Resolver

You probably should enable DNSSEC for your VPC resolvers, particularly if you want additional verification that you aren’t being spoofed. There appears to be no additional cost for this, so the only consideration is why not?

The largest risk comes from misconfiguration of the domain names that you are looking up.

In January 2018, the US Government had a shutdown due to blocked legislation. Staff walked off the job, and some of those agencies had DNSSEC deployed – and for at least one of them, its DNS keys expired, rendering their entire domain offline (many others let their web site TLS certificates expire, causing warnings for browsers, but email, for example, still worked for them).

So, you should weigh up the improvement in security posture, versus the risk of an interruption through misconfiguration.

In order to enable it, go to the Route53 Console, and navigate to Resolvers -> VPCs.

Choose the VPC resolver, and scroll to the bottom of the page, where you’ll see the below check box.

DNSSEC enabled for a VPC
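
If you prefer to script it rather than click, a minimal boto3 sketch might look like the following; I’m assuming the Route53 Resolver UpdateResolverDnssecConfig API here, with a placeholder VPC ID:

```python
import boto3

# Placeholder VPC ID for illustration.
VPC_ID = "vpc-0123456789abcdef0"

resolver = boto3.client("route53resolver")

# Turn on DNSSEC validation for the VPC-provided resolver.
resolver.update_resolver_dnssec_config(
    ResourceId=VPC_ID,
    Validation="ENABLE",
)
```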

DNSSEC Considerations: Your Hosted Zones

As a managed service, Route53 normally handles all maintenance and operational activities for you. Serving your records with DNSSEC at least gives your customers the opportunity to validate responses (as they enable their validation).

I’d suggest that this is a good thing. However, with the caveat around CloudFront ALIAS records right now, I am choosing not to rush it onto production hosted zones today, but to start with my non-production and non-mission-critical zones.

DNSSEC enabled on a hosted zone
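
Scripted, enabling signing on a hosted zone is two steps: create a Key Signing Key backed by an asymmetric KMS key, then enable DNSSEC. A hedged boto3 sketch, with placeholder zone and KMS identifiers:

```python
import boto3

# Placeholders for illustration; substitute your own.
HOSTED_ZONE_ID = "Z0123456789EXAMPLE"
KMS_KEY_ARN = "arn:aws:kms:us-east-1:123456789012:key/example-key-id"

route53 = boto3.client("route53")

# Step 1: create a Key Signing Key (KSK) from an asymmetric
# ECC_NIST_P256 KMS key (which must live in us-east-1).
route53.create_key_signing_key(
    CallerReference="my-ksk-2020-12",
    HostedZoneId=HOSTED_ZONE_ID,
    KeyManagementServiceArn=KMS_KEY_ARN,
    Name="my_ksk",
    Status="ACTIVE",
)

# Step 2: start signing responses for the zone.
route53.enable_hosted_zone_dnssec(HostedZoneId=HOSTED_ZONE_ID)

# Note: you still need to publish the DS record in the parent zone
# to complete the chain of trust.
```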

I have always said that your non-production environments should be a leading indicator of the security that will (at some stage) get to production, so this approach aligns with that.

The long term impact of Route53 DNSSEC

Route53 is a strategic service that removes the need for customers to allocate their own fixed address space and run their own DNS servers (many of which never receive enough security maintenance and updates). With DNSSEC support, the barriers to adoption are reduced; indeed, I feel we’ll see an uptick in DNSSEC deployment worldwide because of this capability coming to Route53.

Other Approaches

An alternate security mechanism being tested now is called DNS over HTTPS, or DoH. This hides the DNS names being requested from the local network provider (they still see the IP addresses being accessed).
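
As an illustrative sketch, here’s a DoH lookup in Python using the requests library against Cloudflare’s public JSON endpoint (one assumed provider among several):

```python
import requests

# Cloudflare's public DoH endpoint, JSON flavour.
DOH_URL = "https://cloudflare-dns.com/dns-query"

response = requests.get(
    DOH_URL,
    params={"name": "example.com", "type": "A"},
    headers={"Accept": "application/dns-json"},
    timeout=5,
)
response.raise_for_status()

# The query and response travel inside HTTPS, so the local network
# only sees a TLS connection to the DoH provider.
for record in response.json().get("Answer", []):
    print(record["name"], record["data"])
```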

In corporate settings, DoH is frowned upon, as many corporate IT departments want to inspect and protect staff by blocking certain content at the DNS level (e.g., blocking all lookups for betting sites), and hiding lookups inside DoH may prevent this.

In the end, a resolver somewhere knows which client looked up what address.

What does a 2nd AWS Region for Australia mean to the Australian IT Industry?

I was there launching the first AWS Region in Sydney in 2013 as an AWS staff member, and being the AWS Solution Architect with a focus (depth) on security, it was a critical time for customers who were looking to meet strict (or perhaps even stricter) security requirements.

Back in 2013, the question was the US Patriot Act. That concern and question has long gone.

Subsequently came cost effectiveness. Then domestic IRAP PROTECTED status.

And back in 2013, secrecy was everything about a Region. We launched Sydney the day it was announced, as ready-for-service. This made recruiting operational staff, securing data centre space (or even building data centres), and having large amounts of fibre run between buildings by contractors, difficult to keep under wraps. These days, pre-announcements like this help ease the struggle to execute the deployment without the need for code names and secrecy.

So, what is the launch of a second AWS Region in Australia, with three Availability Zones and the general set of AWS services present, going to mean to the domestic and international markets?

Proximity to customers & revenue

Let’s look at some population and revenue statistics for these few cities in Australia (and use NZ as a comparison):

| Locality | Population 2018 | % of Australian population | Gross State Product 2018/19 (USD $B) |
|---|---|---|---|
| All of Australia | 25M | 100% | 1,462 (100%) |
| NSW (Sydney) | 7.5M (5.2M) | 30% (20.8%) | 482 (32%) |
| Victoria (Melbourne) | 6.3M (4.9M) | 25.2% (19.6%) | 344 (23%) |
| New Zealand (for comparison) | 4.8M | 19.2% | 209 (14%) |

Population comparisons in AU/NZ

So with a Melbourne Region launch, we see 55.2% of the Australian population in the same state as a Region, and 40.4% in the same city as a Region. This also represents being close to where 55% of the GDP of Australia comes from.

Moreover, this is coverage for where the headquarters of most Australian national organisations are based, and typically their IT departments are helmed from their national HQ.

Where does Latency matter?

The main industry that will see the latency impact from the new Melbourne Region is probably the video/media production vertical. There’s a sizable video media production industry in Melbourne that will now no longer discount AWS for the 11 ms or so of latency previously seen to Sydney.

Of course, latency doesn’t imply bandwidth.

Melbourne has been a Direct Connect location for some time, with customers able to take a partial port, or a whole 1 Gb/s or 10 Gb/s port, and multiples thereof with Link Aggregation Control Protocol (LACP), to deliver higher throughput.

But the latency remained. And thus the Big Fat Pipe Problem (the bandwidth-delay product) would be a consideration: the amount of data sitting IN the pipe after transmission, before being confirmed as received at the other end. For some devices and TCP/IP stacks, as the bandwidth increases, this becomes a problem.
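
To put a number on it, a quick sketch of the bandwidth-delay product, assuming a 10 Gb/s link and an 11 ms round trip:

```python
# Bandwidth-delay product: bits "in flight" on the link at any instant.
bandwidth_bps = 10 * 10**9   # assumed 10 Gb/s Direct Connect port
rtt_seconds = 0.011          # assumed 11 ms Melbourne-Sydney round trip

bdp_bits = bandwidth_bps * rtt_seconds
print(f"{bdp_bits / 8 / 2**20:.1f} MiB in flight")  # ~13.1 MiB

# A TCP stack needs window scaling and buffers at least this large
# to keep the pipe full.
```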

You canna change the laws of physics

Mr Scott, Enterprise

Then there are applications that make multiple sequential connections from on-cloud to on-premises. An application that opens perhaps 100 SQL connections in series to a remote database over an 11 ms link will see three round trips of TCP handshake, and perhaps another three round trips of TLS 1.2 handshake, for 6.6 seconds of wall time before any actual query data and response is sent and processed.
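
The arithmetic, as a quick sketch with the same assumed numbers:

```python
# Wall time lost to connection setup for serial (unpooled) connections.
connections = 100          # opened one after another, not pooled
rtt_ms = 11                # assumed round-trip time
setup_round_trips = 3 + 3  # assumed: ~3 RTTs TCP + ~3 RTTs TLS 1.2

total_seconds = connections * setup_round_trips * rtt_ms / 1000
print(f"{total_seconds:.1f} s of handshakes")  # 6.6 s

# Connection pooling, TLS 1.3 (fewer round trips), or co-locating in
# the new Region all attack this overhead.
```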

The death of Multi-Cloud in Australia

Despite extremely strong multiple-Availability Zone architectures (see the Well-Architected principles), the noise of “multi-Region” has escalated in the industry. From the AWS-customer perspective, multi-cloud has become recognised as a “how about me” call from desperate competitors.

Of course the complexity of multi-cloud is huge, and not understood by most (any?) CIO who endorses this strategy. It’s far better to get the best out of one cloud provider, rather than try and dilute your talent base across implementing, maintaining and securing more than one.

However, some industry regulators have demanded more than one Region for some workloads, again mostly as a lack of understanding of what a strong Well-Architected design looks like.

With this announcement, multi-Region domestically within Australia will be a reality.

But we’re a Melbourne based infrastructure provider, we’re local!!

Sorry, time’s up.

You’re about to lose your customers, and your staff to an unstoppable force, with easy on-boarding, pay as you go, no commitment required terms.

It’s self-service, so there’s no cost of sale.

It’s commodity.

And it’s innovating at a clip far faster than any infrastructure organisation around. There’s almost nothing special about your offering that isn’t faster, cheaper and better in cloud. It’s time to work out what happens when 1/2 your staff leave, and 1/2 your customers are gone.

Getting customers to sign a 5 year contract at this point is only going to sound like entrapment.

Where next in Australia & NZ?

There’s a lot to consider when planning an AWS Region.

First, there are tax and legal overheads in establishing multiple entities in a country to implement, operate, and own infrastructure. That means that even if New Zealand were next in line by population or GDP, it may fall to another location in Australia to follow Melbourne.

And while the state of Queensland may look like it’s third, the latency it already has between Brisbane and Sydney of around 17ms may be outweighed by the fourth in the pack, Western Australia at 50ms.

Lots of variables and weightings to consider. And despite all of this, we have some time to see what the customer cost for AWS resources will be in the new Melbourne Region when it becomes available.

UniFi: Should I wait for the next Dream Machine Pro?

I switched to a 1 Gb/s NBN connection a few months back, but it soon became apparent that the original UniFi Security Gateway (USG) is no match for a 1 Gb/s link.

While I love the Wi-Fi access points, the management interface, and the rapid firmware updates, the throughput limitations of the USG only became noticeable when the link speed went up. Ubiquiti Networks, the manufacturer of the UniFi range, has released faster products, with throughput being the major selling point. And of course, the pricing goes up accordingly.

But the current top-of-the-line device, the Dream Machine Pro, gives me some pause for consideration.

The device has two “WAN links”; one is an RJ45 gigabit Ethernet port, the other is a 10 Gb/s SFP+ slot for a fibre module. I’d love to have a fail-over Internet connection, but the fibre connection isn’t an advantage to me.

Ubiquiti are not selling their LTE fail-over device in Australia, so I’d have to drop the 10 Gb/s SFP+ port back to a vanilla 1 Gb/s RJ45 copper port to plug into an alternate LTE device. But then again, carrier plans for this pattern are expensive.

However, it could be that I have two RJ45 Internet connections; my NBN connection affords me up to 4 ISP connections on the fibre-to-the-premises service that I have available. Now, the upstream link from my CPE to the Point of Interconnect (PoI) may be limited to 1 Gb/s, but having the ability to fail over to another ISP may be useful. Or I may want to route traffic by port or service to a different link (e.g., VPN traffic over link #2, or web traffic over link #2, or perhaps just streaming video from some specific providers over the alternate link).

The Dream Machine Pro also has a built-in 8-port RJ45 switch, but none of the ports offer Power over Ethernet (PoE). That seems an odd omission: the majority of links going into this device are going to be Wi-Fi access points, which want PoE, so having to hang a separate 8-port PoE switch off it seems a waste. A long tail of customers would find PoE ports here fill their needs without having additional switches to worry about.

I would also have expected more ports here, given the cost of the device: say 16 ports, even if only half of them were PoE.

The inclusion of Protect for video cameras is a neat idea, but having two local disks to RAID together would be nice. I have shied away from on-premises storage, but for large volumes of video, I still like having the highest-bandwidth version not traversing the WAN. So it’s great we have a one-disk option available, but it could be so much more awesome if we just had some local resilience.

Of course, if Japan has residential 2 Gb/s Internet connections, would this device still be usable? I’m guessing Australia will max out at 1 Gb/s for a while…

So, I’m trying to decide if I dive in for the current Dream Machine Pro, or wait until it’s tweaked…

“Well Maintained”

I touched on this in a recent article, but I wanted to dive deeper on it.

AWS makes much of its Well-Architected principles, something I worked on in the early stages circa 2013, and have applied at $work. I strongly recommend anyone deploying to any cloud provider think about these principles and their responsibility in implementing them, or in ensuring they are implemented.

Around the same time (2013/2014), the term DevOps and the rise of CI/CD pipelines was also coming to the fore. Looking back, the biggest advantage that Well-Architected lent on DevOps for was the ability to make rapid, incremental improvements to an architecture.

Poor architectural implementations traditionally went unchanged during the lifecycle of the deployed solution. Poor software would eventually be replaced in a follow-up project; replaced with massive fanfare and budget.

So while Well-Architected starts a project, the concept of Well Maintained is the constant re-application of Well-Architected to a workload post go-live. It’s also the rapid adoption of software patches throughout the stack: the database version, the SDKs and libraries in the code base, the uplift of runtime versions (such as Java 8 -> 11, and beyond), and the enabling of new TLS protocols and sunsetting of the old (TLS 1.3 turned on, TLS <= 1.1 turned off, at this time).
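
For instance, a small sketch using Python’s standard ssl module (the hostname is an assumed placeholder) that checks a service negotiates TLS 1.3 today and refuses TLS 1.1 and below:

```python
import socket
import ssl

HOST = "www.example.com"  # placeholder service to check

def negotiated_version(minimum: ssl.TLSVersion, maximum: ssl.TLSVersion):
    """Try a handshake constrained to [minimum, maximum]; return the version or None."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = minimum
    ctx.maximum_version = maximum
    try:
        with socket.create_connection((HOST, 443), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
                return tls.version()
    except (ssl.SSLError, OSError):
        # Handshake refused (by the server, or by the local OpenSSL policy).
        return None

# Well Maintained today: TLS 1.3 succeeds, TLS 1.1-and-below is refused.
print("TLS 1.3:", negotiated_version(ssl.TLSVersion.TLSv1_3, ssl.TLSVersion.TLSv1_3))
print("<= TLS 1.1:", negotiated_version(ssl.TLSVersion.TLSv1, ssl.TLSVersion.TLSv1_1))
```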

A project that always adopts the current version of SDKs, and remains in good compliance with current best practice over time, is Well Maintained. It’s almost evergreen. It ages well – in fact, it doesn’t really age.

How can you tell if something is Well-Maintained?

Check the versions of its components. Dive deep. Find out what prevents you from updating these items. Find the known vulnerabilities in the versions between what your project has now, and the current released version.
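
As a starting point, here’s a minimal sketch (Python, using the requests library and PyPI’s public JSON API) that compares a few installed package versions against the latest release; the package list is a placeholder for your own dependency manifest:

```python
import requests
from importlib.metadata import version  # Python 3.8+

# Placeholder list: substitute your project's actual dependencies.
PACKAGES = ["boto3", "requests", "urllib3"]

for pkg in PACKAGES:
    installed = version(pkg)
    # PyPI's JSON API returns metadata including the latest released version.
    info = requests.get(f"https://pypi.org/pypi/{pkg}/json", timeout=5).json()
    latest = info["info"]["version"]
    marker = "OK" if installed == latest else "BEHIND"
    print(f"{pkg}: installed {installed}, latest {latest} [{marker}]")
```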