The new AWS Security Certification

Thursday I sat the new AWS Security Certification introduced at re:Invent 2016, currently in “beta” and apparently over-subscribed to the point that AWS has removed content from the Certification web site and stopped further candidates from sitting this test for now.

Being a beta, the exam is US$150 (it will be US$300 post-beta), and unlike the existing “Generally Available” certifications, which disclose your result immediately, we’ll be waiting until the end of March for a final answer. So perhaps then I’ll be eating my words and keeping quiet! 🙂

Like the rest of the AWS Certifications, it’s proctored via Kryterion and their WebAssessor platform. Over 170 minutes I was presented with just over 100 multiple-choice questions, where the response was either “choose 1”, “choose 2”, “choose 3”, or “choose all that apply” to the question or statement.

This did feel like a beta: one question showed three of its six possible responses as “[Reserved for beta]”. One question was presented to me twice. Errant spaces appeared in one example IAM Policy. There was some inconsistent capitalisation in the text of some questions.

At one stage I tried to submit comments on the above issues into the supplied input box, but somehow this triggered the “security” module for WebAssessor to immediately lock me out and require a Proctor to unlock it. This presented a new learning: the Proctor was given two options to unlock the screen: “End Exam” and “Re-start”. Luckily “Re-start” behaved more like “resume”, and I continued. Kryterion, perhaps some better wording here?

I was pleased with the challenges that the questions presented, and generally felt the mix was a good test of a broad range of AWS Security. It took me the majority of the time to get through and then review my responses, so I wouldn’t say it was easy. Given the length of time, it felt more like a professional-level cert than an associate-level one. I would have liked some more crypto questions — KMS and key sharing, ELB and SSL Policies, etc.

Clearly putting together any certification on a platform like AWS, which is constantly evolving, is like trying to hit a moving target — the Certification team have done a great job bringing these additional Specialty certifications to market.


Footnote: I did reach out to the global head of certification at AWS to submit these comments and try to get some indication of whether these issues are known or being fixed, but after two days I haven’t had a response. Either way, I hope this feedback is helpful.


If you’re interested in AWS and Security, then please check out my training at https://nephology.net.au/, where in a 2-day in-person class we go above and beyond the standard AWS courses to ensure you have the knowledge and are prepared for the agile world of running and securing environments in the AWS Cloud.

The Debian Cloud Sprint 2016

I’m at an airport, about to board the first of three flights across the world, from timezone +8 to timezone -8. I’ll be in transit 27 hours to get to Seattle, Washington state. I’m leaving my wife and two young children behind.

My work has given me a day’s worth of leave under the Corporate Social Responsibility program, and I’m taking three days’ annual leave, to do this. 27 hours each way in transit, for 3 days on the ground.

Why?

Backstory

I started playing in technology as a kid in the 1980s; my first PC was a clone (as they were called) 286 running MS-DOS. It was clunky, and the most I could do to extend it was to write batch scripts. As a child I had no funds for commercial compilers, no network connections (this was pre-Internet in Australia), no access to documentation, and no idea where to start programming properly.

It was a closed world.

I hit university in the summer of 1994 to study Computer Science and French. I’d heard of Linux, and soon found myself installing the Linux distributions of the day. The Freedom of the licensing, and the encouragement to use, modify, and share, was in stark contrast to the world of consumer PCs of the late 1980s.

It was there at the UCC at UWA that I discovered Debian. Some of the kind network/system admins at the University maintained a Debian mirror on the campus LAN, updated regularly and always online. It was fast, and more importantly, free for me to access. Back in the 1990s, bandwidth in Australia was incredibly expensive; the vast distances of the country meant that bandwidth was scarce. Telcos were racing to put fibre between Perth and the Eastern States, and without that in place, IP connectivity was constrained, and thus costly.

Over many long days and nights I huddled down, learning window managers, protocols, and programming and scripting languages. I became… a system/network administrator, web developer, DevOps engineer, etc. My official degree workload (algorithmic complexity, protocol stacks) was interesting, but fiddling with Linux-based implementations was practical.

Volunteer

After years of consuming the output of Debian – and running many services with it – I decided to put my hand up and volunteer as a Debian Developer: it was time to give back. I had benefited from Debian, and I saw others benefit from it as well.

As the 2000s started, I had my PGP key in the Debian keyring. I had adopted a package and was maintaining it – load balancing for Apache web servers. The web was yet to expand to the traffic levels you see today; most web sites were served from one physical web server. Site Reliability Engineering was a term not yet dreamed of.

What became more apparent was the applicability of Linux, Open Source, and (in my line of sight) Debian to a wider community beyond myself and my university peers. Debian was being used to revive recycled computers donated to charities; in some cases the commercial software licenses could not be transferred with hardware no longer required by organisations that had upgraded. It appeared that Debian was being used as a baseline above which society in general had access to fundamental computing and network services.

The removal of subscriptions, registrations, and the encouragement of distribution meant this occurred at rates that could never be tracked, and more importantly, the consensus was that it should not be automatically tracked. The privacy of the user is paramount – more important than some statistics for the Developer to ponder.

When the Bosnia-Herzegovina war ended in 1995, I recall an email from academics there who, having found some connectivity, wrote to ask if they would be able to use Debian as part of their re-deployment of services for the tertiary institutions in the region. The request was unnecessary, as Debian GNU/Linux is freely available, but it was a reminder that for the country to have tried to procure commercial solutions at that time would have been difficult. Instead, those who could do the task just got on with it.

There have been many similar projects where grass-roots organisations – non-profits, NGOs, and even just loose collectives of individuals – have turned to Linux, Open Source, and sometimes Debian to solve their problems. Many fine projects have been established to make technology accessible to all, regardless of race, gender, nationality, class, or any other label society has used to divide humans. Big hat tip to Humanitarian OpenStreetMap Team and the Serval Project.

I’ve always loved Debian’s position as the Universal operating system. Its vast range of packages and the wide range of computing architectures it supports mean that quite often a litmus test of “is project X a good project?” was met with “is it packaged for Debian?”. That wide range of architectures has also meant that system administrators had fewer surprises and a faster adoption cycle when changing platforms, such as the switch from 32-bit to 64-bit x86.

Enter the Cloud

I first laid eyes on the AWS Cloud in 2008. It was nothing like the rich environment you see today. The first thing I looked for was my favourite operating system, so that what I already knew and was familiar with would be available in this environment, minimising the learning curve. However, there were no official images, which was disconcerting.

In 2012 I joined AWS as an employee. Living in Australia, I was hired into the field sales team as a Solution Architect – a sort of pre-sales tech – with a customer-focused depth in security. It was a wonderful opportunity, and I learnt a great deal. It also made sense (to me, at least) to do something about getting Debian’s images blessed.

It turned out that I had to almost define what that meant: images endorsed by a Debian Developer, handed to the AWS Marketplace team. And so since 2013 I have done so, keeping track of Debian’s releases across the AWS Regions, and collaborating with other Debian folk on other cloud platforms to attempt a unified approach to generating and maintaining these images. This included (for a stint) generating them in the AWS GovCloud Region, and still in the AWS China (Beijing) Region – the other side of the so-called Great Firewall of China.

So why the trip?

We’ve had focus groups at DebConf (the Debian conference) around the world, but it’s often difficult to get the right group of people in the same rooms at the same time. So the proposal was to hold a focused Debian Cloud Sprint. Google was good enough to host this for all the volunteers across all the cloud providers. Furthermore, donated funds were found to cover the travel of a set of people who otherwise could not have attended.

I was lucky enough to be given a flight.

So here I am, in the terminal in Australia: my kids are tucked up in bed, dreaming of the candy they just collected for Halloween. It will be a draining week, I am sure, but if it helps set and improve the state of Debian then it’s worth it.

Changing the definition of High Availability

I was recently speaking with someone about high availability. Their approach to high availability was two hand-crafted instances (servers), each in a separate Availability Zone on AWS, behind an ELB.

My approach to high availability is two (or more) instances in two (or more) Availability Zones behind an ELB, auto-created (and replaced upon failure) by AutoScale, with health-checks to ensure the instances are processing their workloads correctly, and replacing those that are not.

I extend this to include:

  • rolling updates, so when the need arises (such as replacing the underlying OS every 6–12 months) I can do so as seamlessly as possible
  • applying critical security updates daily without human intervention
  • ensuring that logs are sent off the instance to persistent storage and analytics where they can’t be overwritten or tampered with
  • ensuring data workloads are on object storage, or managed database platforms

Of course the ASG provisioning approach to EC2 deployment requires that everything during install is scripted and repeatable. And it gives me the freedom to size my ASG to zero instances outside of service hours, and to ensure that the system deploys correctly every single time.

I’ve been resizing ASGs to zero outside of production hours for many years now, across hundreds of instances per day. You can bet it’s pretty well tested, has saved hundreds of thousands of dollars, and has proven that the process works.
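
As a rough sketch of how that zero-sizing can be automated (the ASG name, sizes, and schedule times here are made-up placeholders, and the cron expressions are in UTC), two scheduled actions via boto3 would look something like this:

# Sketch only: scale a hypothetical ASG ("app-asg") to zero after hours and
# back up for service hours. Names, sizes, and times are assumptions.
import boto3

autoscaling = boto3.client("autoscaling", region_name="ap-southeast-2")

# Scale in to zero in the evening (Recurrence is a UTC cron expression)
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="app-asg",
    ScheduledActionName="scale-to-zero-after-hours",
    Recurrence="0 12 * * MON-FRI",   # 20:00 AWST
    MinSize=0,
    DesiredCapacity=0,
)

# Scale back out before the next working day starts
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="app-asg",
    ScheduledActionName="scale-up-for-service-hours",
    Recurrence="0 23 * * SUN-THU",   # 07:00 AWST the following day
    MinSize=2,
    DesiredCapacity=2,
)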

I bake my own AMIs, and re-baseline them semi-regularly using automation – a process that takes 10 minutes of wall time, and about 60 seconds of my time. I treat AMIs as artefacts, as precious and as important as the code they run.
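
The bake itself is little more than automation wrapped around CreateImage; a minimal sketch (the builder instance ID, image name, and tags are placeholders, not my actual pipeline) might be:

# Sketch only: bake an AMI from a prepared builder instance, tag it as an
# artefact, and wait until it is usable. Instance ID, name, and tags are
# placeholders.
import datetime
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-2")
stamp = datetime.datetime.utcnow().strftime("%Y%m%d-%H%M")

image = ec2.create_image(
    InstanceId="i-0123456789abcdef0",   # hypothetical builder instance
    Name="base-debian-" + stamp,
    Description="Re-baselined base image",
    NoReboot=False,                     # reboot the builder for a clean filesystem
)

ec2.create_tags(
    Resources=[image["ImageId"]],
    Tags=[{"Key": "Baked", "Value": stamp},
          {"Key": "Role", "Value": "base"}],
)

# Block until the AMI is available before handing it to launch configurations
ec2.get_waiter("image_available").wait(ImageIds=[image["ImageId"]])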

I have to stop myself sometimes when others rely upon pre-cloud approaches to high availability.

The snowflakes must stop. Cattle, not pets.

The Move to VPC NAT Gateway

Once more on the move to… this time, the move to RE-move Public and Elastic IP addresses from EC2 Instances…

As previously stated, one of my architectural decisions with AWS VPC using S3 and other critical operational services is to not have Single Points of Failure, or anything overly complicated, between my service Instances and the remote services they depend upon. Having addressed private access to S3 via VPC Endpoints for S3, the volume of my traffic that must traverse out of my VPC has reduced. Additional Endpoints have been indicated by the AWS team, and I know this is going to further reduce the requirement my EC2 Instances will have for outbound Internet access in future.

But for now, we still need to get reliable outbound traffic, with minimal SPOFs.

Until recently, the only options for outbound traffic from the VPC were:

  1. Randomly assigned Public IPs & a route to the Internet
  2. Persistently allocated Elastic IPs & a route to the Internet
  3. A route to on-premise (or other – but outside of the VPC), with NAT to the Internet performed there
  4. A SOCKS Proxy that itself had Public or Elastic IP & a route to the Internet
  5. An HTTP Proxy, perhaps behind an internal ELB, that itself had a Public IP or Elastic IP & a route to the Internet

All five of these options leave some part of my architecture with an interface directly exposed to the Internet. While Security Groups give very good protection, VPC Flow Logs would continue to remind us that there are persistent “knocks on the door” from the Internet as ’bots and scripts test every port combination they can. Of course, these attempts bounce off the SGs and/or NACLs.

We can engineer solutions within the instance (host OS) with host-based firewalls and port knocking, but we can also engineer more gracefully outside in the VPC as well.

Until this year, VPC supported routing table entries that would use another Instance (in a separate subnet) as a gateway. Using IPTables or its nftables replacement, you could control this traffic quite well; however, that’s an additional instance to pay for and, far worse, to maintain. If this NAT Instance was terminated, then it would have to be replaced: something that AutoScale could handle for us. However, a new NAT instance would have to adjust the routing table(s) in order to insert its Elastic Network Interface (ENI) as a gateway in its dependent subnets’ routing tables. Sure, we can script that, but it’s more moving parts. Lastly, the network throughput was constrained to that of the NAT Instance.
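
Those “moving parts” typically end up as a boot-time script on each replacement NAT instance, something along these lines (the route table ID is a placeholder, and this is a sketch rather than a hardened script):

# Sketch only: what a replacement NAT instance must do at boot -- turn off
# source/destination checking and point the private route table's default
# route at itself. The route table ID is a placeholder.
import boto3
import requests

ec2 = boto3.client("ec2", region_name="ap-southeast-2")

# Discover this instance's own ID from the metadata service
instance_id = requests.get(
    "http://169.254.169.254/latest/meta-data/instance-id", timeout=2).text

# NAT only works with source/destination checks disabled
ec2.modify_instance_attribute(InstanceId=instance_id,
                              SourceDestCheck={"Value": False})

# Take over the default route of the dependent private subnet's route table
ec2.replace_route(RouteTableId="rtb-0123456789abcdef0",   # placeholder
                  DestinationCidrBlock="0.0.0.0/0",
                  InstanceId=instance_id)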

AWS VPC NAT Gateway

Then came NAT Gateway as a managed service: nothing to manage or maintain, and no NAT Instance to size for throughput. The only downside is that managed NAT is not multi-AZ: a NAT Gateway exists in only a single Availability Zone (AZ).

Similar to NAT Instances, a NAT Gateway should be defined within a Public subnet of our VPC (i.e., with a direct route to the Internet via IGW). The NAT Gateway gets assigned an Elastic IP, and is still used as a target for a routing rule.

To get around the single-AZ nature of the current NAT Gateway implementation, we define a new routing table per AZ. Each AZ gets its own NAT Gateway, with its own EIP. Other subnets in the same AZ then use a routing table rule to route outbound via the NAT Gateway.
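
As a sketch of that per-AZ plumbing (all subnet and route table IDs below are placeholders), the steps are: allocate an EIP, create the NAT Gateway in the AZ’s public subnet, wait for it, then point that AZ’s private route table at it:

# Sketch only: one NAT Gateway per AZ, each with its own EIP, and a default
# route from that AZ's private route table. All IDs are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-2")

az_layout = [
    # (public subnet for the NAT Gateway, private route table in the same AZ)
    ("subnet-0aaa1111", "rtb-0aaa1111"),
    ("subnet-0bbb2222", "rtb-0bbb2222"),
    ("subnet-0ccc3333", "rtb-0ccc3333"),
]

for public_subnet, private_route_table in az_layout:
    eip = ec2.allocate_address(Domain="vpc")
    natgw = ec2.create_nat_gateway(SubnetId=public_subnet,
                                   AllocationId=eip["AllocationId"])
    natgw_id = natgw["NatGateway"]["NatGatewayId"]
    ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[natgw_id])
    ec2.create_route(RouteTableId=private_route_table,
                     DestinationCidrBlock="0.0.0.0/0",
                     NatGatewayId=natgw_id)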

The biggest downside of Managed NAT is that you can’t do interesting hacks on the traffic as it traverses the NAT Gateway at this time. I’ve previously used Instance-NAT to transparently redirect outbound HTTP (TCP 80) traffic via a Squid Proxy, which would then do URL inspection and white/black listing to permit or block the content. In the NAT Gateway world, you’d have to do that in another layer: but then perhaps that proxy server sits in a NAT-routed subnet.

Having said that, the ability to intercept HTTP traffic will eventually go away, replaced by an all-SSL, HTTP/2 world (well, over the next 5 years perhaps). That would require SSL interception using a Server Name Indication (SNI) “sneak and peek” to determine the desired target hostname, then on-the-fly generation of a matching SSL certificate issued by our own private CA. That CA would have to be already trusted by the client devices going through the network, as otherwise this would be a clear violation of the chain of trust.

What else is on my wishlist?

  • All the VPC Endpoints I can dream of: SQS, AutoScaling (for signaling ASG events SUCCESS/FAILURE), CloudWatch (for submitting metrics), DynamoDB, EC2 and CloudFormation API
  • Packet mangling/redirection on Managed NAT on a port-by-port basis.
  • Not having to create one NAT Gateway per AZ, but rather a single NAT Gateway with multiple subnets, such that it selects the egress subnet/EIP in the same AZ as the instances behind it when healthy (so I don’t need one routing table per NAT Gateway), with automatic fail-over of the gateway to other AZs should there be an issue in any AZ
  • VPC peering between Regions (Encrypted, no SPOFs)

Conclusion

So where is this architecture headed?

I suspect this will end up with a subnet (per AZ) of instances that do require NAT egress outbound, and a subnet of instances that only require access to the array of VPC Endpoints that are yet to come. Compliance may mean routing that traffic through explicit proxies for filtering and scanning. Those proxies themselves would sit in the “subnet that requires NAT egress”; however, their usage would be greatly reduced by the availability of the VPC Endpoints.

The good news is that, based upon the diligent work of the VPC team at AWS, we’re sure to get some great capabilities and controls. VPC has now started a next wave of evolution: after its launch (back in 2009) it had kind of stagnated for a while, but in the last two years it’s back in gear (peering, NAT Gateway, S3 Endpoints).

Looking back over my last three posts – The Move to Three AZs in Sydney, The Move to S3 Endpoints, and now this Move to NAT Gateway – you can see there are still significant improvements to a VPC architecture that need to be undertaken by the administrator to continue to improve security and operational resilience in-cloud. Introducing these changes incrementally over time while your workload is live is possible, but it takes planning.

More importantly, it sets a direction, a pattern: this is a journey, not just a destination. Additional improvements and approaches will become recommendations in future, and we need to be ready to evaluate and implement them.

The move to S3 Endpoints

And now, continuing my current theme of “the move to…” with the further adventures of running important workloads and continuing the evolution of reliability and security at scale: the next improvement is enabling S3 Endpoints for our VPC.

VPC Perspective: going to S3

Access to S3 for many workloads is critical. Minimising SPOFs (Single Points of Failure), avoiding artificial bandwidth or latency constraints, and maximising end-to-end security are often required. For fleets of instances, the options used to be:

  1. Public IPs to communicate directly over the (local) Internet network within the Region to talk to S3 – but these are randomly assigned, so you’d rely on the API credentials alone
  2. Elastic IPs in place of Public IPs, similar to above, and then have to manage the request, release, limits and additional charges associated with EIPs.
  3. NAT Instances in AutoScale groups, with boot time scripts and role permission to update dependent routing tables to recover from failed NAT instances
  4. Proxy servers, or an ASG of proxy servers behind an internal ELB

In all of these scenarios, you’d look to use EC2 Instances with temporary, auto-rotating IAM Role credentials to access S3 over an encrypted channel (HTTPS). TLS 1.2, modern ciphers, and a solid (SHA256) chain of trust to the issuing CA was about as good as it got to ensure end-to-end encryption and validation that your process had connected to S3 reliably.

But S3 Endpoints enhance this, and in more than just a simple way.

Let’s take a basic example: an Endpoint is attached to a VPC with a policy (default: open) for outbound access to a particular AWS service (S3 for now), and the use of this Endpoint is made available to the EC2 Instances in the VPC by way of the VPC routing table(s) and their association with a set of subnets. You may have multiple routing tables; perhaps you’d permit some of your subnets to use the Endpoint, and not others.

With the Endpoint configured as above, it permits direct access to S3 in the same Region without traversing the Internet network. The configured S3 logging will start to reflect the individual Instances’ private IPs (within the VPC) and no longer show the Public or Elastic IPs they may have previously used. They don’t need to use a NAT (Instance or NAT Gateway) or other proxy: the Endpoint provides reliable, high-throughput access to S3.

However, the innovation doesn’t stop there. The policy mentioned above on the Endpoint can place restrictions on the APIs and Buckets that are accessible via this Endpoint. For example, I can use the Endpoint policy to ensure that a subnet of Instances can ONLY access my named bucket(s). As they have no other route to S3, they can’t access third-party anonymously accessible buckets.

I can also limit the API calls permitted via the Endpoint: perhaps allowing only Get, Put, and List operations. These instances couldn’t assume another role (sts:AssumeRole) that has s3:DeleteBucket privileges and use it via this restricted Endpoint.
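
A sketch of creating such a restricted Endpoint (the VPC ID, route table ID, and bucket name are placeholders) might look like:

# Sketch only: an S3 VPC Endpoint whose policy allows only Get/Put/List against
# a single named bucket. The VPC ID, route table ID, and bucket are placeholders.
import json
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-2")

endpoint_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": "*",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
        "Resource": ["arn:aws:s3:::my-app-bucket",
                     "arn:aws:s3:::my-app-bucket/*"],
    }],
}

ec2.create_vpc_endpoint(
    VpcId="vpc-1234beef",
    ServiceName="com.amazonaws.ap-southeast-2.s3",
    RouteTableIds=["rtb-0aaa1111"],      # only subnets on this table get the Endpoint
    PolicyDocument=json.dumps(endpoint_policy),
)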

Let’s make it a little more complex, with a second Endpoint on the VPC. Perhaps I’ll associate this second Endpoint with my administrative subnet, and permit an open policy on it.

S3 Perspective: Restricting sources

An S3 bucket, once created in a Region, accepts valid signed requests from the Principals you permit in IAM policy. You can add Bucket Policies to them to restrict this to a set of trusted IP CIDR blocks (both IPv4 and IPv6 now – IPv6 only for the S3 public API Service Endpoint, not the optionally enabled S3 website or VPC Endpoint). For example, a DENY policy with a condition of:

"Condition": {
   "NotIpAddress": {
     "aws:SourceIp": [
       "54.240.143.0/24",
       "2001:DB8:1234:5678::/64"
     ]
   },
}

But with VPC Endpoints, you would instead add a DENY statement with a condition of:

"Condition": {
  "StringNotEquals": {
    "aws:sourceVpc": "vpc-1234beef"
  }
}

Items in the Condition block are AND-ed together at this time, so if you’re writing a policy that should accept either the VPC Endpoint OR an on-premise IP block, things get interesting: you want to Boolean OR these two separate Conditions in a Deny block, but simply combining them fails:

"Condition": { 
  "NotIpAddress": { 
    "aws:SourceIp": [ "54.240.143.0/24", "2001:DB8:1234:5678::/64" ],
  },
  "StringNotEquals": {
    "aws:sourceVpc": "vpc-1234beef"
  }
} # FAILS EVERY TIME AS BOTH ARE EVALUATED!!

Luckily there’s a workaround. IfExists can conditionally check a Condition key, and skip it if it’s not defined:

"Condition": {
  "NotIpAddressIfExists": {
    "aws:SourceIp" : [ "54.240.143.0/24", "2001:DB8:1234:5678::/64" ]
  },
  "StringNotEqualsIfExists" : {
    "aws:SourceVpc", [ "vpc-1234beef" ]
  }
}

Thus these two can be ANDed together and still pass if either one is TRUE. Kind of like an OR! Wrap this in a statement with Effect: Deny and we should be looking pretty good.
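
Pulled together into a full bucket policy statement and applied with boto3 (bucket name and VPC ID are placeholders; test carefully, as a broad Deny can also lock out administrators), a sketch looks like:

# Sketch only: a Deny that blocks requests which are neither from the trusted
# IP ranges nor from the VPC Endpoint. Bucket name and VPC ID are placeholders;
# a broad Deny like this can lock out administrators too, so test carefully.
import json
import boto3

bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyOutsideVpcAndTrustedIps",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": ["arn:aws:s3:::my-app-bucket",
                     "arn:aws:s3:::my-app-bucket/*"],
        "Condition": {
            "NotIpAddressIfExists": {
                "aws:SourceIp": ["54.240.143.0/24", "2001:DB8:1234:5678::/64"]
            },
            "StringNotEqualsIfExists": {
                "aws:SourceVpc": "vpc-1234beef"
            },
        },
    }],
}

boto3.client("s3").put_bucket_policy(Bucket="my-app-bucket",
                                     Policy=json.dumps(bucket_policy))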

In summary

So what’s this got us now?

  1. Our S3 logs should only contain IP addresses from within the VPC now, so it’s fairly obvious to pick out any other access attempts.
  2. Our reliance on external Internet access has slightly reduced – but where other sites and services are in use (e.g., SQS, CloudWatch for metric submission, or even AutoScaling for signalling ASG scaling action results), these are still required to go out the Internet Gateway (IGW) one way or another
  3. Our S3 buckets can have additional constraints to further limit the scope of credentials.
  4. We’ve avoided complex scenarios of lashing together scripts that dynamically adjust routing tables, intercept SSL traffic on proxy servers, or other nasty hacks

The AWS team has publicly indicated that more Endpoints are to come, so this shows a clear trajectory: less reliance on “Internet” access for instances. All of this is a long, long way from what VPC looked like back at its launch in 2009, when it was S3-backed instances with no IGW – just private subnets with an IPsec VGW to on-premise.

The underlying theme, however, is that the security model is not set-and-forget; the journey continues as the platform further improves.

So, key recommendations:

  1. Use IAM Roles for EC2 instances (unless you have multiple un-trusted clients using SSH/RDP to the instance). These credentials auto-rotate multiple times per day, and are transparently used by the AWS SDKs.
  2. Turn on S3 Bucket Logging (to a separate bucket). When setting the bucket logging destination, make sure you end the prefix with a trailing slash (/). E.g., “MyBucket” logs to bucket “MyLoggingBucket”, with prefix “S3logs/MyBucket/”. S3 Logging is a Trusted Advisor recommendation; setting a Lifecycle policy on these logs is my recommendation (dev/test at X days, Production at Y years?) – a sketch follows this list.
  3. Create (at least one) VPC S3 Endpoint for the buckets in region, and adjust routing tables accordingly. Perhaps start with an open policy if you’re comfortable (it’s no worse than the previous access to S3 over Internet), and iterate from there.
  4. Consider locking your S3 buckets down to just your VPC, or your VPC and some well known ranges.
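
For recommendation 2 above, a sketch of wiring up the logging and a lifecycle rule (lowercase placeholder bucket names standing in for the “MyBucket”/“MyLoggingBucket” example, and an arbitrary retention period) might be:

# Sketch only: enable access logging for "my-bucket" into "my-logging-bucket"
# under a trailing-slash prefix, then expire those logs after a placeholder
# retention period. The target bucket must already grant the S3 Log Delivery
# group write access (the log-delivery-write ACL).
import boto3

s3 = boto3.client("s3")

s3.put_bucket_logging(
    Bucket="my-bucket",
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "my-logging-bucket",
            "TargetPrefix": "S3logs/my-bucket/",   # note the trailing slash
        }
    },
)

s3.put_bucket_lifecycle_configuration(
    Bucket="my-logging-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "expire-s3-access-logs",
            "Filter": {"Prefix": "S3logs/"},
            "Status": "Enabled",
            "Expiration": {"Days": 365},           # placeholder retention
        }]
    },
)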