The Move to VPC NAT Gateway

Once more on the move to… this time, the move to RE-move Public and Elastic IP addresses from EC2 Instances…

As previously stated, one of my architectural decisions with AWS VPC using S3 and other critical operational services is to not have Single Points of Failure or anything overly complicated between my service Instances, and the remote services they depend upon. Having addressed private access to S3 via VPC Endpoints for S3, the volume of traffic I have that must traverse out of my VPC has reduced. Additional Endpoints have been indicated by the AWS team, I know this is going to further reduce the requirement my EC2 Instances will have on outbound Internet access in future.

But for now, we still need to get reliable outbound traffic, with minimal SPOFs.

Until recently, the only options for outbound traffic from the VPC were:

Randomly assigned Public IPs & a route to the Internet
Persistently allocated Elastic IPs & a route to the Internet
A route to on-premise (or other – but outside of the VPC), with NAT to the Internet performed there
A SOCKS Proxy that itself had Public or Elastic IP & a route to the Internet
An HTTP Proxy, perhaps behind an internal ELB, that itself had a Public IP or Elastic IP & a route to the Internet

All four of these options would remain with some part of my architecture having an interface directly externally exposed. While Security Groups give very good protection, VPC Flow Logs would continue to remind us that there are persistent “knocks on the door” from the Internet as ‘bots and scripts would test every port combination they could. Of course, these attempts bounce off the SGs and/or NACLs.

We can engineer solutions within the instance (host OS) with host-based firewalls and port knocking, but we can also engineer more gracefully outside in the VPC as well.

Until this year, VPC supported routing table entries that would use another Instance (an a separate subnet) as a gateway. Using IPTables or its nfTables replacement, you could control this traffic quite well, however that’s an additional instance to pay for, and far worse, to maintain. If this NAT Instance was terminated, then it would have to be replaced: something that AutoScale could handle for us. However, a new NAT instance would have to adjust routing table(s) in order to add it’s Elastic Network Interface (ENI) to be inserted as a gateway in its dependent subnet’s routing tales. Sure we can script that, but its more moving parts. Lastly, the network throughput was constrained to that of the NAT Instance.

Then came NAT Gateway as a managed service. Managed NAT, no bandwidth limit, nothing to manage or maintain. The only downside is that managed NAT is not multi-AZ: a NAT Gateway exists in only a single Availability Zone (AZ).

Similar to NAT Instances, a NAT Gateway should be defined within a Public subnet of our VPC (i.e., with a direct route to the Internet via IGW). The NAT Gateway gets assigned an Elastic IP, and is still used as a target for a routing rule.

To get around the single-AZ nature of the current NAT Gateway implementation, we define a new routing table per AZ. Each AZ gets its own NAT Gateway, with its own EIP. Other subnets in the same AZ then use a routing table rule to route outbound via the NAT Gateway.

The biggest downside of Managed NAT is that you can’t do interesting hacks on the traffic as it traverses the NAT Gateway at this time. I’ve previously used Instance-NAT to transparently redirect outbound HTTP (TCP 80) traffic via a Squid Proxy, which would then do URL inspection and white/black listing to permit or block the content. In the NAT Gateway world, you’d have to do that in another layer: but then perhaps that proxy server sits in a NAT-routed subnet.

Having said that, soon the ability to do interception of HTTP traffic will go away, soon to be replaced by an all-SSL enabled HTTP/2 world â€” well, over the next 5 years perhaps. This would require SSL-interception using Server Name Indication (SNI) “sneek & peak” to determine the desired target hostname, then on-the-fly generate a matching SSL certificate issued by our own private CA â€” a CA that would have to be already trusted by the client devices going through the network, as otherwise this would be a clear violation of the chain of trust.

What else is on my wishlist?

All the VPC Endpoints I can dream of: SQS, AutoScaling (for signaling ASG events SUCCESS/FAILURE), CloudWatch (for submitting metrics), DynamoDB, EC2 and CloudFormation API
Packet mangling/redirection on Managed NAT on a port-by-port basis.
Not having to create one NAT Gateway per-AZ, but one NAT Gateway with multiple subnets such that it selects the egress subnet/EIP in the same AZ as the instances behind it if it is healthy, so I don’t need to make one Routing table per NAT Gateway, and auto fail-over of the gateway to other AZs should here be an issue in any AZ
VPC peering between Regions (Encrypted, no SPOFs)

Conclusion

So where is this architecture headed?

I suspect this will end up with a subnet (per AZ) of instances that do require NAT egress outbound, and a subnet of instances that only require access to the array of VPC Endpoints that are yet to come. Compliance may mean filtering that through explicit proxies for filtering and scanning. Those proxies themselves would be in the “subnet that requires NAT egress” — however their usage would be greatly reduced by the availability of the VPC Endpoints.

The good news is, that based upon the diligent work of the VPC team at AWS, we’re sure to get some great capabilities and controls. VPC has now started a next wave of evolution: its launch (back in 2009) had kind of stagnated for a while, but in the last 2 years its back in gear (peering, NAT Gateway, S3 Endpoints).

Looking back over my last three posts – The Move to Three AZs in Sydney, The Move to S3 Endpoints, and now this Move to NAT Gateway, you can see there is still significant improvements to a VPC architecture that needs to be undertaken by the administrator to continue to improve the security and operational resilience in-cloud. Introducing these changes incrementally over time while your workload is live is possible, but takes planning.

More importantly, it sets a direction, a pattern: this is a journey, not just a destination. Additional improvements and approaches will become recommendations in future, and we need to be ready to evaluate and implement them.

The move to S3 Endpoints

And now, continuing my current theme of “the move to…” with the further adventures of running important workloads and continuing the evolution of reliability and securityÂ at scale; the next improvement is the enabling of S3 endpoints for ourÂ VPC.

VPC Perspective: going to S3

Access to S3 for many workloads is critical. Minimising the SPOFs (Single Points of Failure), artificial maximum bandwidth or latencyÂ constraints, and maximising the end-to-end security is often required. For fleets of instances, the options used to be:

Public IPs to communicate directly over the (local) Internet network within the Region to talk to S3 – but these are randomly assigned, so you’d rely on the API credentials alone
Elastic IPs in place of Public IPs, similar to above, and then have to manage the request, release, limits and additional charges associated with EIPs.
NAT Instances in AutoScale groups, with boot time scripts and role permission to update dependent routing tables to recover from failed NAT instances
Proxy servers, or an ASG of proxy servers behind an internal ELB

In all of these scenarios, you’d look to use EC2 Instances with temporary, auto-rotating IAM Role credentials to access S3 over an encrypted channel (HTTPS). TLS 1.2, modern ciphers, and a solid (SHA256) chain of trust to the issuing CA was about as good as it got to ensure end-to-end encryption and validation that your process had connected to S3 reliably.

But S3 Endpoints enhances this, and in more than just a simple way.

Let’s take a basic example: an Endpoint is attached to a VPC with a policy (default, open) for a outbound access to a particular AWS Service (S3 for now), and the use of this Endpoint is made available to the EC2 Instances in the VPC by way of the VPC Routing table(s) and their association to a set of subnets. You may have multiple routing tables; perhaps you’d permit some of your subnets to use the endpoint, and perhaps not others.

With theÂ Endpoint configured as above it permits direct access to S3 in the same Region without traversing the Internet network. The configured S3 Logging will start to reflect the individual Instance Private IPs (within the VPC) and no longer have the Public or Elastic IPs they may have previously used. They don’t need to use a NAT (Instance or NAT Gateway) or other Proxy: the Endpoint providesÂ reliable, high through-put access to S3.

However, the innovation doesn’t stop there. That policy mentioned above on the Endpoint can place restrictions on the APIs and Buckets that are accessible via this Endpoint. For example, a subnet of Instances that I want to ensure they can ONLY access only my named bucket(s) Endpoint policy. As they have no other route to S3, then they can’t access 3rd party anonymously accessible buckets.

I can also limit the API calls via the Endpoint: perhaps permitting on Get, Put, List operations. These instance couldn’t assume another role (sts:assumeRole) that may have s3:DeleteBucket privileges, and use it via this restricted Endpoint.

Let’s make it a little more complex, with a second Endpoint on the VPC. Perhaps I’ll associate this second Endpoint with my administrativeÂ subnet, and permit an open policy on it.

S3 Perspective: Restricting sources

An S3 bucket, once created in a Region, accepts valid signed requests from the Principals you permit in IAM policy. You can add Bucket Policies to them to restrict this to a set of trusted IP CIDR blocks (both IPv4 and IPv6 now – IPv6 only for the S3 public API Service Endpoint, not the optionally enabled S3 website or VPC Endpoint). For example, a DENY policy with a condition of:

"Condition": {
   "NotIpAddress": {
     "aws:SourceIp": [
       "54.240.143.0/24",
       "2001:DB8:1234:5678::/64"
     ]
   },
}

But with VPC Endpoints, you would instead add a DENY role with a condition of:

"Condition": {
  "StringNotEquals": {
    "aws:sourceVpc": "vpc-1234beef"
  }
}

Items in the condition block are AND-ed together at this time, so if you’re writing a policy with both VPC endpoint requirement OR an on-premise IP block, things get interesting: you’re going to want to Boolean OR these two separate Conditions in aÂ Deny block:

"Condition": { 
  "NotIpAddress": { 
    "aws:SourceIp": [ "54.240.143.0/24", "2001:DB8:1234:5678::/64" ],
  },
  "StringNotEquals": {
    "aws:sourceVpc": "vpc-1234beef"
  }
} # FAILS EVERY TIME AS BOTH ARE EVALUATED!!

Luckily there’s a work around. IfExists can conditionally check a Condition key, and skip it if its not defined:

"Condition": {
  "NotIpAddressIfExists": {
    "aws:SourceIp" : [ "54.240.143.0/24", "2001:DB8:1234:5678::/64" ]
  },
  "StringNotEqualsIfExists" : {
    "aws:SourceVpc", [ "vpc-1234beef" ]
  }
}

Thus these two can be ANDED together and still pass if either one is TRUE. Kind of like an OR! Add the Action: DENY to this and we should be looking pretty good.

In summary

So what’s this got usÂ now?

OurÂ S3 logs should only contain IP addresses from within the VPC now, so it’s fairly obvious to pick out any other access attempts.
Our reliance on external Internet access has slightly reduced – but there areÂ other sites and services in use (eg, SQS, CloudWatch for metric submission, or even AutoScale for signaling ASG scaling action results) then these are still requiredÂ to go our the Internet Gateway (IGW) one way or another
Our S3 buckets can have additional constrains to further limit the scope of credentials.
We’ve avoided complex scenarios of lashing together scripts that dynamically adjust routing tables, intercept SSL traffic on proxy servers, or other nasty hacks

The AWS team has publicly indicated more Endpoints are to come, so this shows a clear trajectory: less reliance on “Internet” access for instances.Â All of this is a long, long way fromÂ what VPC looked like back in 2008, when it was S3-backed instances with no IGW – just private subnets with an IPSEC VGW to on-premise.

The underlying theme, however, is that the security model is not set and forget, but to continue this journey as the platform further improves.

So, key recommendations:

Use IAM Roles for EC2 instances (unless youÂ have multiple un-trusted clientsÂ using SSH/RDP to the instance). These credentials auto-rotate multiple times per day, and are transparently used by the AWS SDKs.
Turn on S3 Bucket Logging (to a separate bucket). When setting he bucket logging destination, make sure you end the prefix with a trailing slash (/). Eg, “MyBucket” logs to bucket “MyLoggingBucket”, with prefix “S3logs/MyBucket/”. S3 Logging is a Trusted Advsior recommendation: setting a Lifecycle policy on these logs is my recommendation (dev/test at X days, Production at Y years?).
Create (at least one) VPC S3 Endpoint for the buckets in region, and adjust routing tables accordingly. Perhaps startÂ with an open policy if you’re comfortable (it’s no worse than the previous access to S3 over Internet), and iterate from there.
Consider locking your S3 buckets down to just your VPC, or your VPC and some well known ranges.

The move to 3 AZs in Sydney

I’ve been working on a significant public-sector AWS cloud deployment (as previously mentioned) and surfed our way confidently through the June 2016 storms with a then 2-Availability-Zone architecture, complete with AWS RDS Multi-AZ, ELB and AutoScale services making the incurred AZ failure almost transparent.

However, not resting on ones laurels, I’ve been looking at what further innovations are in place, and what impact these would have.

Earlier this year, AWS introduced a third AZ in Sydney. This has meant that a set of higher order AWS services have now appeared that themselves depend upon 3 AZs. This also gives us a chance toÂ â€” optionallyÂ â€” spread ourselves wider across more AZs. I’ve been evaluating this for some time, and came to the following considerations and conclusions.

RDS Multi-AZ during an AZ outage

Multi-AZ was definitely failing-over and doing its thing in June, but for the following 6-8 hours or so the failed AZ stayed off-line, along with my configured VPC subnets in that AZ that were members of my RDS DB Subnet Group. This meant that during this period I didn’t have multi-AZ synchronous protection.

Sure, this returned automatically when the AZ came back online but I thought about this small window a fair bit.

While AWS deploys its AZs on separate flood plains and power distribution grids, the same storm is passing over many parts of the Region at one time, so there’s a possibility that lightning could strike twice.

In order to alleviate this, I added a third subnet from a third AZ to my DB Subnet Group. Multi-AZ RDS only (at this time) has one replica. During a single AZ failure event, RDS would still have a choice of two operating AZs to provide me with the same level of protection.

The data in my relational database is important, and the configuration to add a third Subnet in AZ C has no real financial overhead (just inter-AZ traffic at around US1c/GB for SQL traffic). Thus the change is mostly configuration.

I thought about not having three AZs for my RDS instance(s), and considered what I would say to my customer if there was a double failure and I didn’t move to this solution. I was picturing the reaction I would get when I tell them it was mostly just a configuration change that could have helped protect against a second failure after an AZ outage.

That was a potential conversation that I didn’t want to have! Time to innovate.

EC2 AutoScale during an AZ outage

I then through about what happens to my auto-scale groups the moment an AZ goes dark. Correctly it reacts, and tries to recover capacity into the surviving AZ(s) that the ASG is configured for. In the case where I have two AZs and two instances (one per AZ normally), then I have lost 50% of my capacity. To return to my minimum configuration (two instances) I need ASG to launch from only the one surviving AZ.

Me, and everyone else who is still configured for the original two AZs in the Region. Meanwhile a third AZ is sitting there, possibly idle, just not configured to be in use.

If I had three AZs configured, and my set of ASGs randomly populating two instances across these three AZs, then I would have 1/3rd of my ASGs with no instances a failing AZ. So my immediate demand from this failure has decreased: I would require replacing just 33% of my fleet. Furthermore, I am not constrained to one AZ to satisfy my (reduced) demand.

Migrating from 2 AZ to 3 AZ

So I had a VPC, created from a CloudFormation template. It was reasonably simple, as when starting out on deploying a reasonably sized cloud-native workload, we had no idea what it would look like before we started it. Here’s a summary:

Purpose	AZ A	AZ B	Total IPs (approx)
Internal ELBs	10.x.0.0/24	10.x.1.0/24	500 (/23)
App Servers	10.x.2.0/24	10.x.3.0/24	500 (/23)
Backend Services	10.x.4.0/24	10.x.5.0/24	500 (/23)
Databases	10.x.6.0/24	10.x.7.0/24	500 (/23)
Misc	10.x.8.0/24	10.x.9.0/24	500 (/23)

My entire VPC design was a /20 constraint (some 4000 IPs), designated from a corporate topology. Hence using a /24 as a subnet, we would have 16 subnets possible in the VPC.

On-premise firewalls (connected by both Direct Connect and VPN) would permit access from on-premise to specific subnets, and from in-cloud subnets to specific on-premise destinations.

It became clear as our architecture evolved, that having some 500 IPs for what ended up being few Multi-AZ databases (plus a few read replicas) was probably overkill. we also didn’t need 500 addresses for miscellaneous instances and services – again overkill that’s only appreciated with 20/20 hindsight.

Moving Instances

This workload was live, so there’s no chance of any extended downtime. So we refer to the rules of the VPC: you can only delete a VPC if there are no interfaces present in it. Thus we have to look at ENIs, and see where they are.

Elastic Network Interfaces are visible in the AWS console and CLI. You’ll note that Instance, ELBs, RDS instances all have ENIs. Anything that is “in” the VPC likely has them. So we need to jostle these around in order to reallocate.

Lets look at the first pair of subnets for the ELBs. I want to spread this same allocation across three AZs. three is an unfortunate number, as splitting subnets doesn’t nicely go into threes. However, four is a good number. Taking the existing pair of /24 networks (a contiguous /23), we would re-distribute this as 4 /25 networks (120 IP addresses apiece). I only have 40 internal ELBs (TCP pass through), so three’s enough room there. This would leave me with a /25 unused – possibly spare should a 4th AZ every come along.

And thus it began. ELBs were updated to remove their nodes from AZ A. This meant that nodes in AZ A were out of service (ELBs must be present in the same AZ as the instances they are serving to). So at the same time, ASGs were updated to likewise vacate AZ A. ELB reacted by deploying two ENIs in AZ B. ASGs reacted by satisfying their minimum requirements all from subnets in AZ B.

While EC2 instances were quick to vacate AZ A, ELB took some time to do so. Partly this is because ELB uses DNS (with low TTLs), and needs to wait until a sufficient amount of time has past that most clients would have refreshed their cached lookups and discovered the node(s) of ELB only in AZ B. In my case (and in more than one occasion) the ELB got stuck shutting down its ENIs in AZ A.

A support call or two later, and the AZs were vacated (but we’re still up!).

At this stage, the template I used to create the VPCs was read for its first update in 18 months. One of the parameters to my VPC is the CIDR range it holds, so the update was going to be as simple as updating this ONLY for the now-vacant subnets.

However, there’s a catch. For some reason, CloudFormation wants to create NEW subnets before deleting old ones. I was taking my existing 10.x.0.0/24 and going to use 10.x.0.0/25 as the address space. However, since the new subnet was to be created before the old was deleted, this caused an address conflict, and the update safely rolled back (of course, this was learnt in lower environments, not production).

The solution was to stage a two-phase update to the CFN stack. The first update was to set a new temporary range that didn’t conflict – from the spare space in the VPC. Anything would be fine to use so long as (a) it was currently unused, and (b) it didn’t conflict with my final requirements.

So my first update was to set the ELB subnets in AZ A to 10.x.15.0/25, and a follow-up a few seconds later to 10.x.0.0/25. Similarly with the other subnets for App servers and back-end servers.

With these subnets redefined (new subnet IDs), we could reverse the earlier shuffle: defined ELBs back in to AZ A, then define ASGs to span the two AZs. next was the move to vacate AZ B. Just as with the ELBs when they left AZ A, three was a few hours wait for the ENIs to finally disappear.

However this time, I was moving from 10.x.1.0/24, to 10.x.0.128/25. This didn’t overlap, and wasn’t in use, so was a simple one step CloudFormation parameter update to apply.

Next was a template update (not just a parameter update) to define the subnets in AZ C, and provide their new CIDR allocations.

The final move here is to update the ELBs and ASGs to now use their third subnets.

Moving RDS

RDS Multi-AZ is a key feature underpinning the databases we use. In this mode, the ENIs for the master and the standby are in place from the moment that Multi-AZ is selected.

My first move was to force a fail-over of any RDS nodes active in AZ A. This is a reboot “with fail-over”, and incurs about a 3 minute outage. My app is durable to this, but its still done outside of peak service hours with notification to the client.

After failing over, we then temporarily modify the RDS instance to NOT be multi-AZ. Sure enough, the ENI from AZ A is duly removed, and the subnet when vacant can be replaced with the smaller allocation (in my case, a /26 per AZ suffices). With the replaced subnet created, I can then update my DB Subnet Group to include this new SubnetID, and re-enable Multi-AZ. Another reboot “with fail-over”, and convert again to Single AZ, and I can re-define the second subnet. Once more we update the DB Subnet Group again, and re-enable Multi-AZ.

The final chess move was to define the third subnet in the third AZ, and include that in the DB Subnet Group.

Purpose	AZ A	AZ B	AZ C	‘Spare’	Total IPs (approx)
Internal ELBs	10.x.0.0/25	10.x.0.128/25	10.x.1.0/25	10.x.1.128/25	500 (/23), only 370 available now
App Servers	10.x.2.0/25	10.x.2.128/25	10.x.3.0/25	10.x.3.128/25	500 (/23), only 370 available now
Backend Services	10.x.4.0/25	10.x.4.128/25	10.x.5.0/25	10.x.5.128/25	500 (/23), only 370 available now
Databases	10.x.6.0/26	10.x.6.64/26	10.x.6.128/26	10.x.6.192/26 and 10.x.7.0/24	500 (/23), only 190 available now
Misc	10.x.8.0/24	10.x.9.0/24	–	–	Same, yet to be re-distributed

Things you can’t easily move

What I found was there are a few resources that once created, actually require deletion. WorkSpaces and Directory services were two that, once present, aren’t currently easy to transfer between subnets. Technically instances aren’t transferable, but since I am in an ASG world (cattle, not pets), I can terminate and instantiate at will.

Closing Thoughts

With spare addressing space available for a fourth subnet, I don’t think I’m going to have to re-organise for a while. My CIDR ranges are still consistent with their original purposes. I have plenty of addressing space to define more subnets in future (perhaps a set of subnets for Lambda-in-VPC).

There’s other VPC improvements I’ve added at the same time, but I’ll save those for my next post.

#TemplateAllTheThings

Note: I also run some of the most advanced security and operation training on AWS. See https://nephology.net.au/ for information.