Log4Shell and your apps in AWS

There’s a great XKCD cartoon entitled Dependency that cuts to the heart of today’s software engineering world: developers (and in turn organisations) everywhere love using libraries to accelerate their development efforts, particularly if that library of code is free to use, and typically that means Open Source.

The image depicts large, complex systems that are critical to organisations, all resting on the unpaid, thankless contributors of these libraries, upon whom everything relies.

In the last week, we’ve seen Log4j, a Java logging utility, come under exactly that focus due to a critical remote code execution bug that can trick the server side into making outbound requests. A vast number of Java-based solutions built over the last 15+ years depend on this library to log messages.

It’s led to articles like “The Internet is on Fire” by the Australian Broadcasting Corporation, and thousands of posts on infosec-specific news sites (eg: Kaspersky, PortSwigger, SecurityScorecard, BleepingComputer).

Java is widely used, as Oracle Corporation points out clearly:

3 Billion Devices Run Java – Oracle

There are two sides to this: invalid requests coming in, which should be handled with sensible data validation, and the resulting external requests that servers can be tricked into making.

Now I am not saying everyone should write their own logging library; that would be even more on fire. But we should stand ready to update these dependencies rapidly, and we should help with either code contributions or financial donations (or both) to improve this for the common good.

Untrusted Data Validation

Validating untrusted data sources is critical. The content of a local configuration file is vastly different from a query arriving from the Internet. I’ve often joked about setting my browser User-Agent string to the EICAR test file content, a dummy value designed to trigger antivirus software that matches on this text.

In this case, we have remote attackers stuffing custom-crafted strings, of the form ${jndi:ldap://…}, into HTTP requests (and email, and other sources that accept external traffic/data) to try to trick the Log4j library into processing and interpreting this data instead of just writing it to a log file.

Web servers always accept data from the Internet, and Web Application Firewalls can offer some protection, but in this case, the actual “string to check” can be escaped, making it harder to write simple rules that match.

Restricting outbound traffic

An attacker is often trying to gain better access into the systems they target; their initial foothold may be tenuous. In this example, the ability to trick a target server into fetching additional data (a payload) from an external service is key. There are two main types of external data egress: direct and indirect.

In the direct model, your server, which you installed and thus trust, may be running behind a firewall, but have you checked if you have restrictions on what it can fetch directly from the Internet?

In AWS, the default Security Group egress rule permits all traffic; this is a terrible idea, but it is the element of least surprise for those new to the AWS VPC environment. It is strongly recommended that you pare this down for all applications, ending up with only the minimum network access you need, even when behind a (managed) NAT Gateway or routing rules, and even if you think your server only has internal network access.

I wrote a whitepaper on this topic for Modis in 2019 about Lateral Movement within the AWS VPC, and some of the concepts there are relevant now.

Your VPC-deployed virtual machine instance probably only needs to initiate connections to S3 on 443, and its database server on the local CIDR (address) range. For example, if you have three Subnets for databases:

  • 10.0.0.0/26 (Databases in AZ-A)
  • 10.0.0.64/26 (Databases in AZ-B)
  • 10.0.0.128/26 (Databases in AZ-C)
  • 10.0.0.192/26 (reserved for future expansion of Databases in a yet-to-be-announced AZ-D)

… and are running MySQL (eg, RDS MySQL) in those AZs, then you probably want an egress rule on your application server/instance of 10.0.0.0/24:3306. (Note: be ready to make this all IPv6 in future.) However, your inbound rule on the same group is probably a reference to your managed load balancer’s security group, on port 443.
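As a minimal sketch of what that paring down can look like with boto3 (the security group IDs are placeholders, and leaving HTTPS open to 0.0.0.0/0 rather than an S3 prefix list is a simplification for illustration):

import boto3

ec2 = boto3.client("ec2")

APP_SG = "sg-0123456789abcdef0"   # hypothetical application server security group
ALB_SG = "sg-0fedcba9876543210"   # hypothetical load balancer security group

# Remove the default allow-all egress rule
ec2.revoke_security_group_egress(
    GroupId=APP_SG,
    IpPermissions=[{"IpProtocol": "-1", "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}],
)

# Permit only MySQL to the database subnets, plus HTTPS out (eg, for S3 bootstrap)
ec2.authorize_security_group_egress(
    GroupId=APP_SG,
    IpPermissions=[
        {"IpProtocol": "tcp", "FromPort": 3306, "ToPort": 3306,
         "IpRanges": [{"CidrIp": "10.0.0.0/24", "Description": "MySQL to database subnets"}]},
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
         "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "HTTPS for S3 and bootstrap"}]},
    ],
)

# Inbound: only HTTPS, and only from the load balancer's security group
ec2.authorize_security_group_ingress(
    GroupId=APP_SG,
    IpPermissions=[
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
         "UserIdGroupPairs": [{"GroupId": ALB_SG, "Description": "HTTPS from the ALB"}]},
    ],
)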

What about DNS and Time Sync?

If you have cut down your egress to just those two rules (HTTPS, used to bootstrap from S3 and for cfn-init/cfn-signal during ASG creation, plus database traffic), what about things like DNS and time sync? These are typically UDP based (ports 53 and 123).

Indeed, the typical firewall rule used for NTP, when syncing from external time services, is *:123 inbound and *:123 outbound. Ouch.

AWS Time Sync Service

The good news is you do not need to permit this in your security group rules IF you are using the AWS VPC-provided Time Sync service and DNS resolvers. These are available over the link-local network, and security groups do not restrict this traffic; hence UDP port 123 can be left closed.

This time service is also scalable; you don’t need thousands of hosts pointing at one or two of your own NTP servers. The AWS Time Sync service runs from the hypervisor, so as you add instances, more physical nodes (droplets) are involved in provisioning it, and your time service scales with you.

Managed & Scalable DNS Resolution

DNS can be used for data exfiltration. If you run your own DNS resolver (eg, on Windows Domain Controllers or Linux hosts) and set your DHCP to hand that resolver address to clients, then you may be at risk of not even seeing this happen. This is an indirect way of being exploited: your end server may not have egress access to the Internet, but it can egress to your DNS resolver to… well, look up addresses. If you do run your own DNS server, you should be looking at the log of what is being looked up, matching it against a threat list, and issuing warnings of potential compromise.

Managed DNS Security Checks: GuardDuty

If that’s too much effort, then there is a managed solution for this: Amazon GuardDuty combined with the VPC-provided DNS resolver. For GuardDuty to inspect and warn on this traffic, you must be sending DNS queries via the VPC resolver. Turning on GuardDuty while not sending DNS traffic through the AWS-provided service (for example, running your own root-resolving DNS server) means the DNS-based warnings from GuardDuty will probably never trigger.
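Enabling it is essentially one API call per region; a minimal sketch with boto3 (finding or creating the regional detector, with the notification wiring via EventBridge left out):

import boto3

guardduty = boto3.client("guardduty")

# Enable GuardDuty in this region if no detector exists yet
detectors = guardduty.list_detectors()["DetectorIds"]
if not detectors:
    response = guardduty.create_detector(
        Enable=True,
        FindingPublishingFrequency="FIFTEEN_MINUTES",
    )
    print("Created detector:", response["DetectorId"])
else:
    print("GuardDuty already enabled, detector:", detectors[0])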

By contrast, having your self-managed resolver (eg, an Active Directory server) use the VPC resolver means that it is the one that will be reported upon when any other instance uses it as a resolver for a risky lookup! I’m sure that will cause a mild panic.

Managed DNS Proactive Blocking: DNS Firewall

Going beyond retrospectively telling you that traffic happened is proactively blocking DNS traffic. Route53 Resolver DNS Firewall was introduced in 2021, using managed block lists of malicious domains. This gives some level of protection: clients (instances) will get a failed DNS lookup when trying to resolve these bad domains.
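As a rough sketch of wiring that up with boto3: the managed domain list name, the priorities, and the CreatorRequestId idempotency parameters below reflect my understanding of the route53resolver API and should be checked against current documentation; the VPC ID is a placeholder.

import uuid
import boto3

r53r = boto3.client("route53resolver")

# Find an AWS-managed domain list to block (the name is an assumption; check your region)
managed = r53r.list_firewall_domain_lists()["FirewallDomainLists"]
malware_list = next(d for d in managed if "Malware" in d["Name"])

# Create a rule group containing a single BLOCK rule for that list
group = r53r.create_firewall_rule_group(
    CreatorRequestId=str(uuid.uuid4()),
    Name="block-known-bad-domains",
)["FirewallRuleGroup"]

r53r.create_firewall_rule(
    CreatorRequestId=str(uuid.uuid4()),
    FirewallRuleGroupId=group["Id"],
    FirewallDomainListId=malware_list["Id"],
    Priority=100,
    Action="BLOCK",
    BlockResponse="NXDOMAIN",
    Name="block-malware-domains",
)

# Attach the rule group to the VPC whose resolver traffic should be filtered
r53r.associate_firewall_rule_group(
    CreatorRequestId=str(uuid.uuid4()),
    FirewallRuleGroupId=group["Id"],
    VpcId="vpc-0123456789abcdef0",   # hypothetical VPC ID
    Priority=101,
    Name="vpc-dns-firewall",
)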

My Recommendations

So here’s the approach I tell my teams when using VPCs:

  1. Always use the link-local Time Sync service; it scales, and reduces SPOFs and bad firewall rules.
  2. Always use the link-local DNS resolver; it scales. Use a Resolver Rule if you need to then hook the DNS traffic up to your own DNS server (eg, an AD Domain Controller).
  3. Turn on GuardDuty, and set up notifications for the Findings it generates.
  4. Turn on DNS Firewall to actively BLOCK DNS lookups for bad domains.
  5. Turn on your own Route53 query logs, with some retention period (90 days?).
  6. For inbound Web traffic, use a managed Web Application Firewall with managed rules, and/or scope your application to the country you’re intending to serve traffic to. In particular, block access to administrative URL paths that don’t come from trusted source ranges.
  7. Leverage any additional managed services that you can, so you minimise the hand-crafted solutions in your application.
  8. Template your workload, and implement updates via template automation: no local changes. Deploy changes rapidly using DevOps principles. Socialise with your team/management the importance of full-stack maintenance and least-privilege access, including at the network layer for both ingress and egress, and schedule and prioritise time to include technical debt in each iteration, including the updating of every third-party library in your app.
  9. If you have a DevOps pipeline with something like SonarQube or WhiteSource, have it report on dependencies (libraries), and get reports on how out-of-date those libraries are, and/or whether those out-of-date versions have known CVEs against them. Google Lighthouse (in the browser) does a great job of this for JavaScript web frameworks.

For this exploit you need to go wider than what you run in the cloud: your company printer (MFP), network security cameras, VoIP phones, UPS units, air-conditioners, smart hubs, TVs, home Internet gateways, and other devices will probably have an update. So will your games console, and the games on it (this started from an update in Minecraft to address the issue and has… escalated quickly!). Even the physical on-prem firewalls and virtual appliances themselves – but ensure you don’t just do firewalls and ignore the larger landscape of equipment you have.

PS: I highly recommend my colleague Elliot Segler’s recent post entitled Learnings with AWS WAF and Log4Shell.

PPS: You may want to apply something like this to Apache:

 <IfModule mod_rewrite.c>
 RewriteEngine On
 # Reject any request whose raw request line contains a JNDI lookup string,
 # whether URL-encoded ("%7Bjndi:") or literal ("${jndi:"), case-insensitively.
 RewriteCond %{THE_REQUEST} %7Bjndi: [NC,OR]
 RewriteCond %{THE_REQUEST} \$\{jndi: [NC]
 # Return 403 Forbidden and stop processing further rules
 RewriteRule ^ - [F,L]
 </IfModule>

Thoughts on the IPv6 Transition

I’ve been discussing the IPv6 transition with our customers more recently; for over 3 years we’ve been dual-stack IPv4 and IPv6 for public-facing AWS-Cloud-based solutions and services for our customers.

“So what?”, you’re thinking?

It’s worth noting that, from Google’s numbers, global IPv6 adoption is now approaching 36%, while at home in Australia it’s around 27%, helped by telcos like Telstra enabling IPv6 for their mobile phone subscribers, and advanced ISPs like Aussie Broadband and Internode making IPv6 trivial to enable.

Google IPv6 Adoption, as of 12/Oct/2021

I first had an IPv6 tunnel established to Hurricane Electric in 1999 when I worked for The University of Western Australia. I championed the adoption of IPv6 as a first-class citizen in the cloud when I worked at Amazon Web Services as a Solution Architect, and these days a large majority of AWS public-facing services already support dual-stack approaches, with more on the way.

As the next billion people come online, the lack of remaining IPv4 address space is a limiting factor. The temporary value of IPv4 address space, currently being reallocated (“sold”) between assignees, will presumably peak when a majority of clients (people) and the services they are accessing are all on IPv6.

I have been advising a government body that had two IPv4 “Class B” sized subnets allocated to them. Each of these is a “/16” netblock (65,536 addresses); they had only ever used a handful of /24 ranges from within their first allocation.

Most services they use, both for staff and for public-facing services, now run on the cloud, from cloud-provider address space. They’re unlikely to need all of the address blocks they currently have from the first /16 block, let alone the second.

This netblock has a current value of a couple of million dollars (AUD).

It’s likely that many public sector agencies have IPv4 address netblocks that they’re unlikely to ever use, and could also benefit from reallocating to service providers desperate for their own address space to host solutions from.

Well, desperate until most clients are using IPv6.

I’d urge any public sector organisation to review their plans for using their address space, and if they have large unused, contiguous address space, consider reallocating that. The funds raised can then help with further modernisation of workloads – including those workloads to move to IPv6 addressing.

For any managed service providers, I would urge you to “dual-stack” all public-facing Internet services. You should continue to use strong encryption in flight, modern TLS protocols, and strong authentication, regardless of the network transport protocol version.

If you are using AWS CloudFront as a CDN in front of your origin service, then enable IPv6 in the CloudFront configuration, and then publish the corresponding AAAA DNS record just as you have the A DNS record. Similar approaches work if you’re using Cloudflare, Akamai, Fastly or others.
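If the zone is hosted in Route53, publishing that AAAA record can be a single alias upsert; a minimal sketch with boto3, where the hosted zone ID, record name and distribution domain are placeholders, and Z2FDTNDATAQYW2 is CloudFront’s fixed alias hosted zone ID:

import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z0HYPOTHETICAL",                 # your hosted zone ID (placeholder)
    ChangeBatch={
        "Comment": "Publish AAAA alongside the existing A alias",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "www.example.com.",
                "Type": "AAAA",
                "AliasTarget": {
                    "HostedZoneId": "Z2FDTNDATAQYW2",        # CloudFront's fixed zone ID
                    "DNSName": "d1234abcd.cloudfront.net.",  # your distribution (placeholder)
                    "EvaluateTargetHealth": False,
                },
            },
        }],
    },
)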

For those who use managed service providers for their corporate business networking, ask why your work Internet connection is not dual-stacked already. It’s typically a configuration question, and rarely has any actual cost associated with it. If you have a corporate proxy service and it is dual-stacked, the clients (on your internal corporate network) already get some of the benefit of being able to talk to IPv6 services.

If you have DNS services, check that they can not only serve IPv6 records (AAAA), but are also themselves reachable over IPv6. Services like AWS Route53 have done this for years (see my earlier point about IPv6 being a first-class citizen within AWS).

While you’re looking at DNS, have a look at creating a simple CAA record, to list the Certificate Authorities you obtain certs from.

Blocking outbound HTTP from the Home Network, 2021

With the move to HTTPS as a default, I took a chance and recently blocked outbound (EGRESS) HTTP (TCP 80) traffic from my home network. I’ve got around 30 – 40 devices on the network, and I was intrigued to see what we (my family and I) would experience.

With my UniFi Dream Machine Pro, this was a reasonably easy update: Settings -> Traffic & Security -> Global Threat Management -> Firewall. I added a rule for Internet Out that dropped anything going to port 80:

HTTP reject rule for Internet Out on the UniFi Dream Machine Pro

This is a rule I had in place for two days. I checked my own laptop access for HTTPS, SSH, IMAPS and SMTPS egress, and all was fine.

What transpired over the following two days helped me identify the devices and vendors that still produce products that depend on unencrypted HTTP over the Internet to operate.

Logitech Smart Radio

We have had a streaming radio for some time; we still like to listen to London Capital Radio despite the 7-8 hour timezone offset. Within a few minutes of blocking HTTP, the audio stream stopped.

We purchased this device around 2012. On my network, it identifies as a Squeezebox running RedHat. The manufacturer discontinued it years ago and there have been no firmware updates for a long time. It only supports 802.11g wifi in the 2.4 GHz spectrum (is this WiFi 3?).

I wasn’t prepared to replace the device, so for the moment, a work-around rule to permit HTTP (by MAC address) fixes this for the short term. We’re unlikely to see any updates from Logitech anyway.

GoToWebinar

This one surprised me; I was signing in to a webinar, and the obligatory download tried to execute and stalled. It turned out the installer was doing an HTTP based OCSP check.

Now, for web browsers, OCSP has been mostly relegated to the annals of history, replaced with OCSP Stapling.

OCSP is a network-efficient query that a client can make against a Certificate Authority’s endpoint to get a signed confirmation that the certificate in question has not been revoked recently. However, in doing so, it tells the certificate authority which site you (your source IP address) just visited; this is an information disclosure vulnerability. With stapling, the website in question instead fetches these signed validations at a regular interval and passes them to the clients it is already communicating with, stapling the validation to the certificate during TLS negotiation: “Hi client, here’s my certificate, and here’s a recent verification that my certificate has not been revoked”.

Using HTTP for OCSP isn’t too bad, as the response being downloaded is itself cryptographically signed. But it’s still visible in the clear for all to see.

Update 27/9: No response from GoToWebinar.

Enphase Envoy

Twelve hours later, my solar panel data aggregation service, Enphase Enlighten, alerted me that it was no longer receiving data from my solar panel inverters. With another rule to permit the Envoy controller to make HTTP outbound requests, the data started flowing again.

This one is a genuine concern. The generation and consumption data for the power in my home should not be trundling over the Internet unencrypted.

I raised a support request with Enphase (21 September 2021 at 15:04 AWST, UTC+8) asking them to contact me. I received a ticket auto-response (03164xxx) but no other contact.

Later that night I tried reaching out over Twitter to any security folk at Enphase, but after 24 hours, no response.

Update 27/9: After a few days of no reply, Enphase asked (via automated email) how my support experience was. Unbelievable!

I then tried calling them on their Australian support number as shown on their website. I ended up in a call queue which was quite amusing in itself; every 5 seconds (no, literally, every 5,000 ms) it would announce that all callers were busy, and then it would restart the same audio music clip, only to then interrupt itself… I gave up after 20 minutes in the queue.

Lastly, I have DMed the Enphase Twitter account, and await a reply.

Enphase does not have a security.txt file on their website!

Any customer data should be transmitted to an HTTPS endpoint. The firmware of the Envoy device should have the Root Certificate of the Certificate Authority used to issue that endpoint’s certificate in its trust store. The device should receive updates as that Root CA expires and is replaced (this happens every 10 – 20 years per CA). The embedded firmware also needs to keep step with improving TLS protocols over time; right now TLS 1.3 would be ideal, but in future, who knows.

What also struck me was the lack of IPv6 support on this device; not only should it have picked up an IPv6 address locally, the endpoint it submits its data to should also be dual-stack IPv4 and IPv6.

Apple iPad 14.8 -> 15.0 upgrade

This one was very unusual. Apple had released iOS 15, and our iPads were about to make the jump from 14.8. However, despite full WiFi signal, the devices kept announcing they couldn’t verify the downloaded image because they couldn’t connect to the WiFi!

I’m hoping Apple can address this dependency before the next iOS update.

Update 27/9: Apple responded on Twitter concerned only that I could install the update, not that the security issue was being investigated, resolved or understood.

Nintendo Switch

It turns out the Nintendo Switch won’t join a WiFi network that doesn’t have outbound HTTP access; it must do a call home or validation using HTTP.

Summary

I’ve paused the experiment for the moment, but next month I’ll resume it and find more edge cases where devices we rely upon still use unencrypted channels, exposing our data, without us even knowing…

Browser support for FTP (another sunset)

As Hanno Böck noted in the recent Bulletproof TLS Newsletter, FTP Support in Firefox 90 has been removed. We’ve seen similar messaging from most major browser vendors over the last few years.

I’m going to make a bold prediction and say that in 10 years’ time we’ll be seeing the removal of (plain text) HTTP support as well. Regardless of internal or external networks (an outdated concept aligned to the crunchy-shell model of network security), the move to stronger security for all communications, backed by free TLS Certificate Authorities (such as Let’s Encrypt), means we should be doing end-to-end encryption for everything the common web browser fetches.

For some time, Firefox has had an HTTPS-only mode, with warnings when services try and dip back to unencrypted access. I’ve typically found this warning pops up when various link-shortening services are chained together, and I’m grateful for the awareness that a jump in that chain is poorly implemented.

In the meantime, the distribution of files using FTP needs to stop. If you run an FTP service then you need to think about transitioning to something that permits access using HTTPS as the transport protocol.

Another sunset in the circle of life of a protocol.

Migrating & Operating Public DNS with AWS Route53

DNS is the fundamental directory service of the Internet, and these days, of all corporate systems performing discovery of their various components in digital deployments. It is used at least hundreds of times per day per device, and no one really notices until it breaks.

For the most part, it’s what changes the hostname you have used to navigate to this article – blog.james.rcpt.to – into an IP address of some sort: the numbered address that is supposed to uniquely identify the server or load balancer, connected to the public Internet, to which your browser will then make a network connection.

If your DNS breaks, then your service, or perhaps your entire company, is “off the air”, no longer discoverable by clients wishing to use human-readable labels and names to find your address.

I recently gave a 45-minute presentation at the AWS User Group in Perth, Western Australia, highlighting some of the advantages of using Route53 for DNS, and some of the more modern security protections therein; slides can be found here.

I have helped various organisations transition their authoritative DNS from their existing services to AWS Route53. This migration, in isolation, requires coordination, preparedness, and awareness of the impact of poor planning, and it presents an opportunity for service improvement.

DNS Migration Planning

DNS manages to serve the scale of the planet by effective use of caching. This is done at multiple layers in the service.

When a DNS service handles a query it has not already cached (or whose cache entry has expired), it must start a recursive resolve. For this to happen, it needs to find the Authoritative Name Server for a zone. This Authoritative Name Server is delegated from a parent domain, and so on, until we reach the top of the DNS tree.

A common example in this scenario is the address “www.example.com”. Before answering the address for this exact name, a client must determine the address of the Authoritative Name Server(s) for “example.com”. If this is unknown, then the client must ask the address of the Authoritative Name Server(s) for “.com”.  Again, if “.com” Name Servers are not already known (cached), then the level above that must be queried, commonly called the Public DNS Roots, or “.”.
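You can watch that chain of delegations yourself; here is a small sketch using the third-party dnspython library (it asks your configured recursive resolver at each step, so it is an approximation of the true iterative walk from the roots):

import dns.resolver  # pip install dnspython (2.x; older versions used dns.resolver.query)

# Walk down the delegation chain for www.example.com, printing the
# authoritative name servers at each level, then the final answer.
for zone in ["com.", "example.com."]:
    ns_records = dns.resolver.resolve(zone, "NS")
    print(zone, "is served by:", ", ".join(sorted(str(r) for r in ns_records)))

answer = dns.resolver.resolve("www.example.com.", "A")
print("www.example.com resolves to:", ", ".join(str(r) for r in answer))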

The Global Root DNS Servers

These DNS roots are globally agreed, and every DNS server is given a (relatively) static file of the addresses for these. They are given generic names, A-M, and are operated by a variety of organisations who agree to share the same data for the 1000+ Top Level Domains under the globally agreed root.

These Root DNS Servers are all accessible over both the existing IPv4 Internet and the newer IPv6 Internet. Their addresses exist in a file commonly called the “root hints” file, which is distributed with all DNS server software as the initial glue. It rarely changes.

In a trick of Internet routing (BGP), many of these 13 hosts (A–M) are also each replicated multiple times by a process called Anycast: the same small address segment that the server lives on is “announced” to the world from multiple locations, each running a duplicate server performing the same process and responding with the same answers.

This first layer of scalability helps the root servers deal with billions of devices using the DNS service every second.

The Global Top Level Domains (TLDs): Registry Operator and Registrars

Each of the global TLDs is operated by a Registry Operator, but records are added and removed by multiple Registrar organisations. For example, the “.com” zone is operated by Verisign, but there are many Registrars you can obtain a DNS name from, amongst which is Route53 itself.

These operators have a selection of innovations and policies they apply to their delegation. Some operate their service with just IPv4, and some are dual-stack IPv4 and IPv6. Some operators have their DNS zones cryptographically signed (using DNSSEC) to provide some validation of the DNS queries.

Route53

Route53 launched in December 2010, originally providing support only for hosting DNS zones for customers. The engineering for the service at that time was designed to eclipse what most organisations had in place, providing higher reliability and scalability.

Back in the day, the authoritative references on running DNS services were the BIND Operators Guide (aka The BOG) and the O’Reilly book by Cricket Liu. Most organisations ran just two DNS servers to respond to their customers’ queries, and the most common software for doing so was ISC’s Berkeley Internet Name Domain server, or BIND.

Of course, to have your own DNS server, you needed a fixed IP address that your server would operate from, as this IP address is what the upstream zone would hand out to clients. And thus the initial problem for most organisations was getting a pool of static IP addresses.

Most ISPs only hand out dynamic addresses, and charge substantially more for static ones. Other (typically larger) organisations went through a laborious process of having IP addresses assigned to their organisation (through ARIN, APNIC, or other IP address registries), and then deploying BGP to announce their range to their connected ISP(s), of which there could be several.

This overhead of assigned IP address ranges, setting up corporate BGP (and trying to secure it) all went away with the launch of hosted DNS services, and Route53 turned out to be one of the most well engineered and cost-effective solutions.

With Zones hosted we can delegate from the parent domain to the name servers that Route53 provides us; each individual record (e.g., “www”) can then point to any IP address (i.e., anywhere). The entire need for corporations acquiring large blocks of IP addressing for their organisation went away.

Indeed, I have helped organisations who previously had very large, fixed blocks of IPv4 addresses to relinquish some of these in a commercial market (for millions of dollars).

Route53 has expanded its remit in the AWS Cloud environment. In addition to hosting authoritative DNS zones, it also offers Registrar services for hundreds of TLDs, as well as tuning the use of DNS within the Virtual Private Cloud environment. Each of these functions can be used completely separately; for example, you can:

  • Register a domain with Route53 to handle the re-registration, but delegate to your own (or a 3rd party) DNS servers.
  • Register a domain with another Registrar (e.g., GoDaddy, Verisign) and delegate to a Route53 Hosted Zone.
  • Configure complex routing and protection mechanisms for your Virtual Cloud Environment.
  • Host private DNS for your VPC, invisible to the outside world.

In this article, we are going to concentrate on running Public DNS Zones, and the protections you can put in place.

Route53 Public DNS Zone Hosting: Scalability

By default, Route53 gives the operator a set of 4 DNS servers to pass to the parent domain for delegation. Each of the four servers is given a DNS name from a different TLD, and each of those names resolves to both IPv4 and IPv6 addresses.

The parent domain can then record (and cache) the delegation addresses of these 4 DNS server endpoints, and can instruct end clients doing lookups to also cache this delegation.

Each of those four endpoint addresses is also potentially itself Anycast-announced from multiple locations worldwide. This helps clients reach the closest deployed endpoint for each of the four names, reducing DNS latency.

This set of 4 DNS servers provides much greater reliability than the traditional two, and the multiple Anycast presentation further improves this. The chance of any other AWS Route53 customer having the same set of 4 DNS server endpoints is very small, so a denial of service against the specific set of delegated addresses for another zone is unlikely to affect your zone significantly. This is part of the reason why Route53 offers a 100% availability Service Level Agreement (SLA).

(Note, the control plane, for providing updates to records, is not covered by this SLA)

Route53: Questions

A number of configuration questions arise when planning the migration:

  1. Do you want query logging turned on?
    1. This is delivered to an S3 bucket: what’s the retention policy on it? Look at S3 lifecycle policies, and always set an automated time to delete, perhaps after 12 months (a sketch follows this list).
    2. What analytics processing is done on this data, if any?
    3. Who has access to this log data? Is the bucket marked private, with default encryption and versioning enabled?
  2. Do you want DNSSEC enabled on the zone? Perhaps do this after the service migration if you don’t currently have DNSSEC enabled.
  3. What integrations for automated updates are in place, if any?
  4. Who needs access to the console to see and/or update records?
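For the retention question above, a minimal sketch of an S3 lifecycle rule that expires query logs after roughly 12 months (the bucket name and prefix are placeholders):

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-dns-query-logs",                    # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [{
            "ID": "expire-dns-query-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "query-logs/"},   # placeholder prefix
            "Expiration": {"Days": 365},           # delete objects after ~12 months
        }],
    },
)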

Route53: Migration

The process for migrating to Route53 is relatively simple:

  1. Reduce the parent domain’s cache time (TTL – time to live; see below) for the delegation records that point at your current service: a value of 300 seconds may be reasonable.
  2. Prepare the DNS export from the older service; review the records in there before doing a test import into the new service, to ensure that no records cause any issue. This is perfectly safe, as we have not redelegated yet. You should also review the individual records’ TTL values, and potentially reduce them as part of the export/import. Any web site or load balancer record should run with a TTL no higher than 300 seconds. Once the export/import has been successful, delete the imported records – we will take a fresh export later…
  3. Determine if there are any processes that are automatically updating your DNS. These will have to be integrated to call the Route53 API for those updates after the migration.
  4. Ensure you have access to redelegate with the Registrar. Test username & password to log in.
  5. Schedule a time for the transition, during which we will avoid updates, or update both old and new. You will have to wait for 5 periods of the previous original TTL time to pass since step 1. For example, if the Delegation TTL was a week, then wait 5 weeks before proceeding (see below on TTL)!
  6. At the agreed time:
    1. Record the current (old) delegation addresses the Registrar has configured.
    2. Re-export the values from the old service.
    3. Update the record TTLs in the export.
    4. Import into Route53.
    5. Update the Registrar with the four new delegation addresses (fetched from Route53; see the sketch after this list).
    6. Test the DNS immediately; if it fails, revert the delegation to the addresses recorded in step 1 above.
    7. Watch logs/metrics from other systems that rely upon this domain name, such as web traffic for the zone, or mail traffic.
    8. Test the DNS again after 25 minutes.
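A small boto3 sketch for step 5 above: fetching the four delegation name servers that Route53 assigned to the new hosted zone (to hand to the Registrar), and sanity-checking how many records were imported. The hosted zone ID is a placeholder.

import boto3

route53 = boto3.client("route53")
ZONE_ID = "Z0HYPOTHETICAL"   # placeholder hosted zone ID

# The four name servers Route53 assigned; these go to the Registrar as the new delegation
zone = route53.get_hosted_zone(Id=ZONE_ID)
for ns in zone["DelegationSet"]["NameServers"]:
    print("Delegate to:", ns)

# Quick sanity check of what was imported (paginates through all records)
paginator = route53.get_paginator("list_resource_record_sets")
count = 0
for page in paginator.paginate(HostedZoneId=ZONE_ID):
    count += len(page["ResourceRecordSets"])
print("Records in zone:", count)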

DNS Caching: TTL & Delegation records

A key element of DNS’s scalability is caching as much as possible. Records all have a customer defined Time To Live value, the duration (in seconds) that a record can be kept by a client.

When making changes to records, we typically observe 50% of clients seeing cached values updated after the TTL period; a further 50% of the remainder see the update after another TTL period; in practice, the TTL almost acts like a half-life:

Duration    Cumulative % of clients seeing the update after this time
1 × TTL     50%
2 × TTL     75%
3 × TTL     87.5%
4 × TTL     93.75%
5 × TTL     96.875%
6 × TTL     98.4375%
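Those figures are just the compounding of that 50%-per-TTL-period observation; a quick way to reproduce them, assuming the half-life behaviour described above holds:

# Cumulative fraction of clients that have seen an update after n TTL periods,
# assuming roughly half of the remaining clients refresh in each period.
for n in range(1, 7):
    print(f"{n} x TTL: {100 * (1 - 0.5 ** n):g}%")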

We would typically wait at least 5 periods of the time-to-live value on a delegation before making a change. As many organisations have this value set to a week (or a month), this could take some time; we’d also recommend keeping this value relatively small once migrated, to ensure you have flexibility in re-delegation in future. 24 hours is a reasonable time for delegation records, unless you’re about to do a migration, in which case 300 seconds is reasonable (25 minutes for 96% of clients to see the update).

During this period, however, you could have your new DNS zone hosting the identical records of the old one, and any updates during this period should be applied to both (or updates can be avoided during this window).

However, some network operators choose to override TTL values as they see fit. A certain ISP in Australia would not honour small TTL values, and this would result in at least a 24-hour TTL being enforced. Given the 5-TTL-period duration to get ~96% of clients seeing the update, you may have to adjust your time frame to accommodate a parallel run of the old and new DNS services. Unfortunately, you cannot force-update such ISPs.

Route53: Records to Add and Modify

Route53 will automatically create a Start of Authority (SOA) record for your zone. This standard record type has two fields of interest: the RNAME (the responsible person’s email address), and the default TTL value (used for negative DNS responses, when a query tries to find something not defined). You can leave these as the defaults, but if you adjust them, then the RNAME field must point to a monitored mailbox, and lowering the TTL may result in higher query traffic.

Outside of the SOA, there are a number of other DNS records you should put in place (a sketch of creating some of these follows this list):

  • SPF on the apex, now implemented as a Text (TXT) record, to indicate where your email is permitted to originate from (low-volume lookup traffic). Something like “v=spf1 mx -all” may do. You should still set a record even if your domain does not send email (“v=spf1 -all”), to indicate that any email claiming to originate from it is fraudulent SPAM.
  • CAA on the apex, to indicate to Certificate Authorities who your permitted CAs for your domain are (extremely low-volume lookup traffic). Something like:
    0 issue "letsencrypt.org"
    0 issue "amazon.com"
    may do.
  • DMARC: a TXT record on hostname _dmarc.yourdomain, with value “v=DMARC1; p=quarantine; rua=mailto:youremail@yourdomain”
  • SMTP TLS Reporting: a TXT record on the hostname _smtp._tls.yourdomain with value “v=TLSRPTv1; rua=mailto:youremail@yourdomain”
  • An MTA (Mail Transfer Agent) STS (Strict Transport Security) record: a TXT record for hostname _mta-sts.yourdomain with value “v=STSv1; id=2021021000;” – the number can be a representation of the current datetime in yyyymmddhhmm, incremented on each change. You should also set up a static web site to host your MTA-STS policy document itself at https://mta-sts.yourdomain/.well-known/mta-sts.txt.
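As a rough sketch of creating a few of these records in a Route53 hosted zone with boto3 (the hosted zone ID and domain are placeholders, and the values mirror the examples above; note Route53 requires TXT values to be wrapped in double quotes):

import boto3

route53 = boto3.client("route53")
ZONE_ID = "Z0HYPOTHETICAL"          # placeholder hosted zone ID
DOMAIN = "yourdomain.example."      # placeholder apex domain
DMARC_VALUE = '"v=DMARC1; p=quarantine; rua=mailto:you@yourdomain.example"'

route53.change_resource_record_sets(
    HostedZoneId=ZONE_ID,
    ChangeBatch={"Changes": [
        {   # SPF, published as a TXT record on the apex
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": DOMAIN, "Type": "TXT", "TTL": 3600,
                "ResourceRecords": [{"Value": '"v=spf1 mx -all"'}],
            },
        },
        {   # CAA on the apex, listing the permitted Certificate Authorities
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": DOMAIN, "Type": "CAA", "TTL": 3600,
                "ResourceRecords": [
                    {"Value": '0 issue "letsencrypt.org"'},
                    {"Value": '0 issue "amazon.com"'},
                ],
            },
        },
        {   # DMARC policy record
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "_dmarc." + DOMAIN, "Type": "TXT", "TTL": 3600,
                "ResourceRecords": [{"Value": DMARC_VALUE}],
            },
        },
    ]},
)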

For checking this domain’s security configuration, have a look at Hardenize.com, by Ivan Ristić.

Post-Migration

Ensure that all administrative staff have access to set and update the records they need.

Lastly, don’t forget to decommission the existing DNS service once you are convinced you do not need to go back to it.