Secure your infrastructure

According to the Australian Signals Directorate, the part of the Australian Department of Defence whose mandate covers cyber security support to the Australian Government and the whole of Australian industry, 55% of the cybersecurity incidents reported to them involve a compromised asset, network, or infrastructure.

That dwarfs the #2 item in the list, Denial of Service and Distributed Denial of Service, at 21% of incidents.

Securing infrastructure is critical. While this includes physical security, it’s dominated by virtual access to assets: compromised credentials, flawed firmware with known hard-coded credentials, and other attack vectors.

While network restrictions are useful, strong logging and alerting is also critical, as is actually reading those alerts, triaging them and prioritising them.

Every piece of infrastructure in your environment should have some form of remote logging available. Local logging, on a device, is not sufficient. These logs should be treated with at least the same level of security as your PCI payment credentials or medical information.

Step 0: Authentication

If your device only permits local username and password, then it should be a unique combination for each device. That could be a large list, so you’ll need some sort of password management in place.

Never use default passwords; and change usernames where possible. If I had a dollar for every time I saw “admin/admin” as the default… please use “${mycompany}admin/device-unique-password” or something unusual.
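As a minimal sketch of the "unique combination for each device" idea, the following generates an independent random password per device using Python's standard `secrets` module. The company name, device names, password length and character set are all illustrative assumptions; the output is meant to go straight into whatever password manager you use, not onto the screen or into a file.

```python
import secrets
import string

# Assumed character set; adjust to what your devices actually accept.
ALPHABET = string.ascii_letters + string.digits + "-_.!@#%"


def generate_password(length: int = 24) -> str:
    """Return a random password drawn from ALPHABET using a CSPRNG."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def device_credentials(company: str, device_names: list[str]) -> dict[str, tuple[str, str]]:
    """Map each device name to a (username, password) pair.

    The username follows the "${mycompany}admin" pattern from the text;
    every device gets its own independently generated password.
    """
    username = f"{company}admin"
    return {name: (username, generate_password()) for name in device_names}


if __name__ == "__main__":
    # Hypothetical company and device names, for illustration only.
    creds = device_credentials("examplecorp", ["core-switch-1", "edge-router-1"])
    for device, (user, _password) in creds.items():
        # Print only the username; the password belongs in a password manager.
        print(device, user)
```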

If the device supports MFA, then (with Step 2: Time configured) you should enable that.

If the device supports RADIUS or other network authentication and single sign-on, then consider using that (but more considerations may exist). Even then, a fallback to local credentials may still exist.

Step 1: Restricted network access

Your devices on your network probably don’t need a whole lot of inbound access, nor outbound for that matter. Let’s talk about both.

The admin interface to your device is the most sensitive. It should not face the open internet if possible, and if it does, it should have some level of address range restriction as a rudimentary first step of protection.

IP address range should be on a permit basis: eg, permit only from your trusted range where you expect to admin the device from, including from backup networks in emergencies, and reject everything else. The Internet is full of bots and scripts that scan juicy looking admin ports, testing for zero day exploits, known bad configurations, and hard coded defaults or back doors. Even if you have patched and remediated what you know of, there could be more, as yet undiscovered by you or the vendor, so why take the risk?
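The permit-list logic above (allow known ranges, reject everything else) can be sketched with Python's standard `ipaddress` module. The ranges below are documentation prefixes standing in for your real trusted networks, so treat them as placeholders. On a real device this rule would live in its access control list or firewall configuration; the sketch only illustrates the default-deny decision.

```python
import ipaddress

# Hypothetical trusted ranges (RFC 5737 documentation prefixes); substitute your own.
PERMITTED_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),   # e.g. corporate ISP range
    ipaddress.ip_network("198.51.100.0/25"),  # e.g. backup network for emergencies
]


def is_permitted(source_ip: str, permitted=PERMITTED_RANGES) -> bool:
    """Return True only if source_ip falls inside a permitted range.

    Anything unparseable, or outside every listed range, is rejected.
    """
    try:
        address = ipaddress.ip_address(source_ip)
    except ValueError:
        return False
    return any(address in network for network in permitted)


if __name__ == "__main__":
    for candidate in ["203.0.113.10", "192.0.2.1"]:
        print(candidate, "permit" if is_permitted(candidate) else "reject")
```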

If I have to have public-facing interfaces, then the restrictions that I like to use include reasonably large ranges from the corporate ISP network providers I use, and the well-known ranges for cell/mobile phone providers, so that I can tether in an emergency. You may also wish to include your home ISP range, so that in an emergency, you can WFH to fix things.

This isn’t considered trusted; it’s just more trusted than the open Internet. And even if you have a large internal network where all staff (including admins) work from, it’s worth rearranging your networks to keep those admins in one subnet, and restricting internally as well, particularly if you have a wide area network, and possibly have publicly accessible ethernet ports that can be accessed by untrusted devices. Yes, 802.1X port authentication is a step up here, but why have that exposure in the first place?

Then think about what egress is needed from the device itself. Probably a remote logging destination (Step 3), which may be HTTPS over TCP, for example. Your device may also need to access internal DNS (UDP and TCP), but probably only to a small, possibly internal, set of ranges. And lastly, NTP over UDP (for the next step, Time). Note that UDP traffic typically needs an ALLOW rule on network traffic in both directions.

Step 2: Time

Let’s start with the basics: the time. Every device in your infrastructure should have the correct time. They should all be synced to a very high accuracy, using NTP or similar protocols. It’s imperative for timestamps between systems to line up so that logs can be correlated. You don’t need to run out and buy a stratum 1 atomic clock, but configure NTP sensibly for your network.

Your Cloud provider may have a scalable, reliable time source that you can synchronise virtual machine clocks with. For your colo or private networks, you may want to configure a set of NTP servers that the rest of your environment can depend upon.

And when I say depend upon, you should monitor the time difference between your NTP servers to detect any drift, and detect if any of your NTP servers are offline. Start with having every device use a private DNS resolver on your network, and publish internal DNS entries that list your set of NTP servers:

ntp.internal IN A 10.0.0.6
ntp.internal IN A 10.0.0.7
ntp.internal IN A 10.0.1.6

Your internal DNS has suddenly become a critical vector for compromise, so ensure that it is also in scope for this advice!
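The monitoring of drift and availability between NTP servers mentioned above could look something like this sketch. It assumes you already have some way to measure each server's clock offset (an NTP client library or the output of your NTP daemon, for example); the function only takes those measurements, with `None` meaning the server did not answer. The 50 ms threshold is an arbitrary assumption.

```python
from typing import Optional


def check_ntp_servers(
    offsets: dict[str, Optional[float]],
    max_spread_s: float = 0.05,  # assumed tolerance; tune for your network
) -> list[str]:
    """Return alert messages for offline servers and for excessive disagreement.

    offsets maps server address -> measured clock offset in seconds,
    or None when the server did not answer.
    """
    alerts = []
    for server, offset in sorted(offsets.items()):
        if offset is None:
            alerts.append(f"{server}: offline")

    # Compare only the servers that answered.
    online = [offset for offset in offsets.values() if offset is not None]
    if len(online) >= 2:
        spread = max(online) - min(online)
        if spread > max_spread_s:
            alerts.append(f"drift: servers disagree by {spread:.3f}s")
    return alerts


if __name__ == "__main__":
    # Made-up measurements for the three servers published in DNS.
    print(check_ntp_servers({"10.0.0.6": 0.001, "10.0.0.7": 0.002, "10.0.1.6": None}))
```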

In AWS Cloud, check out the Time Sync service.

Step 3: Logging

Do not log locally. Always send across the network to a logging endpoint.

Your logging endpoint should be scalable so it doesn’t get overwhelmed or limited in how many logs it can ingest.

It must be encrypted in flight for both privacy and integrity, and it must be authenticated to ensure the right device is sending the right logs.

Logs should contain the timestamp of when they are received, as well as when devices sent them; and there should be minimal difference between these times.
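A small sketch of the two-timestamp idea: the receiving side stamps each record with its own receive time and works out the delay against the time the device claims to have sent it. The record layout (a dict with a `sent_at` ISO 8601 string) and the field names are assumptions for illustration, not the format of any particular logging product.

```python
from datetime import datetime, timezone


def stamp_received(record: dict, received_at: datetime) -> dict:
    """Return a copy of record with receive time and delivery delay added.

    record must contain "sent_at" as a timezone-aware ISO 8601 string,
    for example "2025-01-01T00:00:00+00:00".
    """
    sent_at = datetime.fromisoformat(record["sent_at"])
    stamped = dict(record)  # leave the caller's record untouched
    stamped["received_at"] = received_at.isoformat()
    stamped["delay_seconds"] = (received_at - sent_at).total_seconds()
    return stamped


if __name__ == "__main__":
    # Hypothetical record as a device might send it.
    incoming = {
        "device": "edge-router-1",
        "sent_at": "2025-01-01T00:00:00+00:00",
        "message": "login ok",
    }
    print(stamp_received(incoming, datetime.now(timezone.utc)))
```

A large or negative `delay_seconds` is exactly the kind of signal worth alerting on later.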

And lastly, logs should be verbose enough that you do NOT need to go back to the original device to get more information. Get everything off the device, and you (or someone else) should never need to access the device itself directly. This handles the case where the device is compromised, no longer accessible, or has been bricked, deleted or otherwise removed.

Now that logs are in a uniform place, there are two things to do:

  1. Provision authenticated, encrypted access to those logs for the people who need to search them (and log their access to these logs!)
  2. Set up some automated alerts

In AWS, definitely use CloudWatch Logs. And remember, you can use CloudWatch Logs from your on-premises networks, over HTTPS, with authentication.

Step 4: Alerts

This is where the fun happens. How many things can you think of that would be an indicator of compromise (IOC)? Let’s start with the simple: any access that fails to authenticate to the device should be an alert. Your endpoint should not have unfettered public exposure, so the authentication attempts should all be legitimate.

Auth Failure: this could be a bot, even on your internal network, probing for access. Or perhaps it’s just you before a coffee and you mistyped a password. Good to know where these come from as early as possible.

Auth Success: so you know the alerting is working, and have a record of what you are doing, it’s nice to get confirmation to show it’s you on the device. Or it could be compromised credentials being used. An auth success alert at 3am in your local time could be a sign you’re working late, or… something else.

Timestamp mismatch: the log receive time and the log time from the device could be out by a meaningful amount. This could be an indication that submission of logs was delayed for some reason.

Device reboot: why should devices be unstable? Did they just flash a new firmware? Were they replaced/cloned by compromised devices?

Lack of regular log submission: a reliable heartbeat is very useful, but watch out for no logs when you expect at least something.

Config change: for critical components like routers, or other devices that will have a reasonably stable configuration, alerting on this is a nice confirmation of changes you (or someone else) have made.

Local device password change: if you can’t use centralised access control and single sign-on, then you should alert on this. And you should probably alert on this NOT having happened after a year.

Log access: this is becoming a little meta, but having an alert when someone inspects the logging system itself, to view the logs, may be a reason for a notification.
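As one example from the list above, the "lack of regular log submission" check can be sketched as a comparison of each device's most recent log time against the current time. The device names and the 15 minute silence threshold are assumptions; in practice the `last_seen` map would come from a query against your log store.

```python
from datetime import datetime, timedelta, timezone


def silent_devices(
    last_seen: dict[str, datetime],
    now: datetime,
    max_silence: timedelta = timedelta(minutes=15),  # assumed heartbeat tolerance
) -> list[str]:
    """Return the devices whose most recent log is older than max_silence."""
    return sorted(
        device for device, seen in last_seen.items() if now - seen > max_silence
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hypothetical devices and last-seen times.
    last_seen = {
        "switch-1": now - timedelta(minutes=5),
        "router-1": now - timedelta(hours=2),
    }
    for device in silent_devices(last_seen, now):
        print(f"ALERT: no logs from {device} within the expected interval")
```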

Step 5: Alert Destinations and Escalations

Email is a terrible alert destination, but the easiest to set up. Then again, it’s also the easiest to set up a rule to ignore. Some people use Slack or other instant messenger interfaces.

One thing you will want is a way to determine all the alerts that have been triggered historically, filtered by device or device type (all switches), time span (last 7 days, last week), alert type (auth failure & auth success), etc.

Creating a dashboard to show these alerts will help you understand what’s happening.

A single auth failure is an interesting event, but a repeated auth failure, over a relatively small time window (an hour, a day) may be a brute force attack. A repeated reboot may be a device failing.
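The step from "single event" to "repeated event in a time window" can be sketched as a sliding-window count over the auth failure timestamps for one device. The one hour window and threshold of five failures are assumptions to tune, not recommendations.

```python
from collections import deque
from datetime import datetime, timedelta, timezone


def brute_force_suspected(
    failure_times: list[datetime],
    window: timedelta = timedelta(hours=1),  # assumed window
    threshold: int = 5,                      # assumed failure count
) -> bool:
    """Return True if `threshold` or more failures fall within any `window`."""
    recent = deque()
    for moment in sorted(failure_times):
        recent.append(moment)
        # Drop failures that are older than the window ending at `moment`.
        while moment - recent[0] > window:
            recent.popleft()
        if len(recent) >= threshold:
            return True
    return False


if __name__ == "__main__":
    start = datetime(2025, 1, 1, tzinfo=timezone.utc)
    burst = [start + timedelta(minutes=i) for i in range(5)]
    print(brute_force_suspected(burst))
```

The same shape works for repeated reboots: feed it reboot timestamps instead of auth failures.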

When a device (re-)boots, if it gives a firmware revision in its logging, how do you check that against the previously known firmware revision? (Hint: it’s in your logs from the previous boot.) Is that the currently recommended firmware? Is there some form of automatic firmware update in place? Is it lower than the previous revision? That could be a forced downgrade to a known buggy firmware.
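A minimal sketch of the firmware comparison, assuming the device reports a purely numeric dotted revision (such as "2.10.3") with the same number of components each time. Real vendors often use other schemes (build suffixes, letters, dates), in which case the parsing would need to change.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted numeric revision such as "2.10.3" into (2, 10, 3)."""
    return tuple(int(part) for part in text.split("."))


def firmware_change(previous: str, current: str) -> str:
    """Classify a reboot as "unchanged", "upgrade" or "downgrade".

    previous is the revision from the logs of the last boot;
    current is the revision reported on this boot.
    """
    old, new = parse_version(previous), parse_version(current)
    if new == old:
        return "unchanged"
    return "upgrade" if new > old else "downgrade"


if __name__ == "__main__":
    # Comparing numerically avoids the string trap where "2.9" sorts after "2.10".
    print(firmware_change("2.10.0", "2.9.1"))
```

A "downgrade" result is the one to escalate, since it may be a forced rollback to firmware with known bugs.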

Summary

Pretty quickly you start to see the complexity, depth and urgency of having strong logging and alerting in place. Without a trusted base to work from, any workloads in your environment may not be trusted.

Goodbye AWS Ambassador Program

Shortly after I joined my current employer as a cloud engineer, I was invited to participate in the AWS Cloud Warrior program, circa 2016. This was introduced as the AWS program for the best engineers in the AWS partner community, and its inception in Australia meant that it was filled with many significant people in the ecosystem that I knew from my time as one of the first Solution Architects for AWS in Australia & New Zealand.

This program aimed to have these senior engineers collaborate on cloud improvements, as well as engaging those engineers who sought to educate the wider (public) cloud community on how to achieve best practice. It encouraged this collaboration by routinely holding gatherings for the engineers in face-to-face meetings. AWS would fund domestic travel and accommodation, and bring service team members for various AWS Cloud Services to these gatherings. This was the partner community’s chance to provide direct feedback in a workshop-like environment on the improvements and limitations that were being experienced in the real-world, delivering and managing solutions for clients.

Key observations included long-term operations and supportability, continual security uplift, and moving the undifferentiated heavy lifting out from the client’s responsibility and behind the dividing line of the Cloud services. This in turn reduced the cost to the end clients, across the board.

The benefits to AWS were huge. The service improvements were significant in the way that cloud was being shaped.

I was fortunate enough that my employer would fund my time to participate.

As a result, many ways to implement, scale and operate digital solutions in Cloud were championed by the Warrior community, who took to blogging, running user groups, and other activities to share this best practice and set the bar high for delivery across the entire global cloud engineering community.

This program got rolled over into the Ambassador program, and expanded to include the rest of the world. The ‘contributions’ that the participants made were then tracked through a portal for the Ambassadors to submit to, with a simple gamification to encourage participation even further. The reward for this was an annual Ambassador Summit event, held face to face in Seattle, with flights and accommodation covered by Amazon.

I attended this several times, based on the body of blog submissions and community contributions I had made, and was ranked #2 Ambassador in Australia and New Zealand for several years.

Again, my employer generously funded my time to participate. These learnings were obviously shared deeply within the technical community at my employer, and over time, several of my colleagues joined me in this program, having met the eligibility requirements. It was a point of differentiation to claim that my employer had multiple individuals in the AWS Ambassador program.

Then came the change. It wasn’t heralded, but felt. The Australian-domestic Ambassador meetings ceased. The funding for the Global Summit reduced to no longer covering flights to Seattle, but my employer was good enough to cover this for a year.

The following year, no funding at all was available for flights or accommodation, which made it impossible to participate from Australia.

Access to re:Invent, the global cloud conference in Las Vegas, continued as a complimentary ticket to attend – but never with flights or accommodation covered.

Despite all of the benefits drying up, remaining in the program required the continuous investment in community support and education activities by the participants.

I looked to the AWS Ambassador program as being the equivalent of a Distinguished Engineer, or Most Valuable Professional as seen with other vendor ecosystems, but AWS had disengaged this demographic that had helped shape their platform.

Questions were asked regularly of the AWS team assigned (which turned over several times) about what was happening to the program. No answers that instilled confidence were shared.

I found I was explaining the existence and value of the program far more than AWS would; none of my global clients knew there was such a group of senior engineers that we were a part of, and so the entire value evaporated.

When the consulting services partner has to introduce and explain the vendor’s recognition program to the client, something is wrong. This was further highlighted multiple times with the lack of mention of ambassadors at the AWS Summits, replaced with the wider Community Heroes program.

I continued to participate in the program, but there was no interaction, no feedback, and no events across A/NZ since around 2023. Then at the end of 2024, I stopped submitting the cloud blog posts I authored into the portal. I wanted to see what would happen, and what contact would be initiated.

The answer: Nothing.

Last week I found that I was no longer listed in the Ambassador portal. With no contact, no emails, no notice.

10 years: 2016 – 2025.

Will it change my blogging? Well, I’m much less at the coal-face these days, running a global practice in 14 countries. My focus for a long time has not been on my own delivery and training, but that of 1,800 other AWS cloud capable staff that I work with, for them to be at a level of capability that reduces risk and improves value to clients. This is how I scale myself.

I shared a lot of my perspectives with the other Ambassadors and AWS Service teams, and they shared with me. Rowan, Arjen, Ian, Cristian, Elliot and many more in this group across A/NZ: thank you for your insights and perspectives.

As an ex-AWS employee (one of the first few AWS Solution Architects in Australia, see my CV), I am disappointed and somewhat embarrassed at the way this program has been handled for the last few years as it was defunded and provided less value to the participants.

For interest, this is what this community looked like in 2022 at the Ambassador Summit in Seattle, with the size of the text reflecting the number of Ambassadors:

Some names have changed, some have gone away, and there are likely some new ones now.

A ton of IPv6 innovations in AWS

The last three months have seen a large number of IPv6 announcements from AWS. I’ll recount some of them here:

  • Organizations supports IPv6 (link)
  • IPv6 for EC2 public DNS names (link)
  • Transfer family supports IPv6 (link)
  • Managed Service for Apache Flink adds IPv6 (link)
  • Resource Groups adds IPv6 support (link)
  • EFS supports IPv6 (link)
  • Private CA supports IPv6 (link)
  • Site-to-Site VPN supports IPv6 on outer tunnel endpoints (link)
  • DataSync supports IPv6 (link)
  • SNS expands IPv6 support to include VPC endpoints (link)
  • SQS expands IPv6 support to include VPC endpoints (link)
  • CloudWatch adds IPv6 support (link)
  • EventBridge supports IPv6 (link)

Many of the public (Internet) facing endpoints for these services are now dual-stack, supporting both IPv4 and IPv6 (for now).

But have a think about the VPC endpoints that are now either dual-stack or IPv6-only: this increases the direct integrations for potential IPv6-only subnets, of massive sizes, to integration endpoints such as SQS and SNS. These scale-out VM farms can now have these loosely coupled integrations that can support the scale of millions of virtual machines; the rest of the VPC may be on traditional IPv4-only allocations, but having that layer of messaging is now highly valuable.

We’re at a point now where dual-stack endpoints are almost commonplace for most AWS cloud services; and it should be the same for endpoints that customers make on the AWS cloud. There’s very little holding you back – certainly not cost, and in some cases, cost is (or will likely be) the driving factor for the rapid uptake of IPv6, for those that are ready.

Amazon SQS adds IPv6 support

At first glance, this seems like a strange thing to be even mildly excited about.

AWS has been adding “dual stack” (having both IPv4 and IPv6 addresses) to their services for some time, and I have blogged about this many times over.

First, let’s just go read the brief release, from April 21, 2025: https://aws.amazon.com/about-aws/whats-new/2025/04/amazon-sqs-internet-protocol-version-6/.

OK, you’re back. First up, how is this working?

Well, the existing API endpoints, such as service.region.amazonaws.com have been extended with a new TLD. While amazonaws.com still exists in documentation, I discovered that dual-stack endpoints are on a different domain (docs), “api.aws”:

{protocol}://{service-code}.{region-code}.api.aws

While most services do not respond to ping, it’s a handy way of doing a DNS resolution:

> ping -6 sqs.ap-southeast-2.api.aws

Pinging sqs.ap-southeast-2.api.aws [2406:da70:c000:40:e3db:e3b2:7e93:ef41]
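The endpoint pattern and the DNS check above can also be sketched in Python with just the standard library. The helper names here are my own for illustration; `resolve_ipv6` needs working DNS and raises `socket.gaierror` if the name has no IPv6 address records, so treat it as a quick diagnostic rather than anything production-grade.

```python
import socket


def dual_stack_endpoint(service_code: str, region_code: str, protocol: str = "https") -> str:
    """Build a URL following the {protocol}://{service-code}.{region-code}.api.aws pattern."""
    return f"{protocol}://{service_code}.{region_code}.api.aws"


def resolve_ipv6(hostname: str) -> list[str]:
    """Return the IPv6 addresses DNS reports for hostname (needs network access)."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET6, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    url = dual_stack_endpoint("sqs", "ap-southeast-2")
    print(url)
    # Strip the scheme to get the hostname to look up.
    print(resolve_ipv6(url.split("://", 1)[1]))
```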

Your library (eg, boto) may not be up to date with this change, and even then, this new endpoint may not be in use.

Pro Tip: always update your boto library.

So why is this useful?

Let’s say you have a workload that uses SQS, running from your existing data centre, on a traditional IPv4-only network. Your application uses SQS as a fan-out mechanism to despatch jobs to a fleet of worker nodes. Historically, this set of worker nodes, when listening to SQS for messages, would all have had to use IPv4; now they can exist on IPv6-only networks, and still receive their messages.

In effect, SQS as a control mechanism can now also be a bridge between hosts on either IPv4 or IPv6.

I’ve been championing the use of IPv6 with, in and on AWS since 2012; this year (2025) has continued to see additional services – like this – step up to include seamless dual-stack capability. At some stage, this will become table-stakes, required on service launch, and not a future service innovation.

Goodbye Optus

In 2010 my family returned to Australia from the UK to raise our child (now children). I needed a local mobile phone service, and I selected Optus, as their pricing and offering (included data) was about right.

After a few years, I settled in to a $39/month, 30 GB plan. Around 2024, Optus advised me that the $39/month plan was becoming $49/month, with the same inclusions.

This week, another update from Optus advised this was now going to be $55/month, but the included data would increase to 70 GB/month.

These days, I barely use more than 2 GB/month when I am not either at home or in one of my company’s offices… on the WiFi.

Enough.

There’s been very little visible improvement to the Optus network in the 21 years I have been on it. It’s over a decade since their competitor, Telstra, introduced IPv6 for their subscribers, and Optus has done… nothing.

The porting process took less than 30 minutes, and to be fair, the provider I have swapped to doesn’t do IPv6. But they are $25/month for 20 GB of traffic.

So I have just saved $360/year for what is approximately the same service. From complacent customer to ported away in four days end-to-end.