Secure your infrastructure

According to the Australian Department of Defence’s Australian Signals Directorate, whose mandate includes cyber security support to the Australian Government and the whole of Australian industry, 55% of the cyber security incidents reported to them relate to a compromised asset, network, or infrastructure.

That dwarfs the #2 category in the list, Denial of Service and Distributed Denial of Service, at 21% of incidents.

Securing infrastructure is critical. While this includes physical security, it’s dominated by virtual access to assets: compromised credentials, flawed firmware with known hard-coded credentials, and other attack vectors.

While network restrictions are useful, strong logging and alerting are also critical, as is actually reading those alerts, triaging them, and prioritising them.

Every piece of infrastructure in your environment should have some form of remote logging available. Local logging, on a device, is not sufficient. These logs should be treated with the same level of security as your PCI payment credentials, medical information, or other sensitive data.

Step 0: Authentication

If your device only permits local username and password, then it should be a unique combination for each device. That could be a large list, so you’ll need some sort of password management in place.

Never use default passwords, and change usernames where possible. If I had a dollar for every time I saw “admin/admin” as the default… please use “${mycompany}admin/device-unique-password” or something unusual.
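As a minimal sketch of generating those device-unique passwords with nothing but Python’s standard library (the device names and username prefix below are placeholders, and the output belongs in your password manager, not a text file):

import secrets
import string

DEVICES = ["core-switch-01", "edge-router-01", "ap-floor2-03"]  # hypothetical inventory
ALPHABET = string.ascii_letters + string.digits + "-_.!"

for device in DEVICES:
    # 24 characters of cryptographically strong randomness per device
    password = "".join(secrets.choice(ALPHABET) for _ in range(24))
    print(f"{device}: username=mycompanyadmin password={password}")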

If the device supports MFA, then (with Step 2: Time configured) you should enable that.

If the device supports RADIUS or other network authentication and single sign-on, then consider using it (though this brings its own considerations). Even so, a fallback to local credentials may still exist.

Step 1: Restricted network access

The devices on your network probably don’t need a whole lot of inbound access, nor outbound for that matter. Let’s talk about both.

The admin interface to your device is the most sensitive. It should not face the open internet if possible, and if it does, it should have some level of address range restriction as a rudimentary first step of protection.

IP address restrictions should be on a permit (allow-list) basis: e.g., permit only the trusted ranges you expect to administer the device from, including backup networks for emergencies, and reject everything else. The Internet is full of bots and scripts that scan juicy-looking admin ports, testing for zero-day exploits, known bad configurations, and hard-coded defaults or back doors. Even if you have patched and remediated what you know of, there could be more, as yet undiscovered by you or the vendor, so why take the risk?

If I have to have public-facing admin interfaces, then the restrictions I like to use include reasonably large ranges from the corporate ISPs I use, and the well-known ranges for cell/mobile phone providers, so that I can tether in an emergency. You may also wish to include your home ISP’s range, so that in an emergency you can work from home to fix things.

This isn’t considered trusted; it’s just more trusted than the open Internet. And even if you have a large internal network where all staff, including admins, work from, it’s worth rearranging your networks to keep those admins in one subnet and restricting access internally as well, particularly if you have a wide area network with publicly accessible Ethernet ports that untrusted devices can reach. Yes, 802.1X port authentication is a step up here, but why have that exposure in the first place?

Then think about what egress is needed from the device itself. Probably a remote logging destination (Step 3), which may be over TCP HTTPS, for example. Your device may also need access to internal DNS (UDP and TCP), but probably only to a small, possibly internal, set of ranges. And lastly, UDP NTP (for the next step, Time). Note that UDP traffic typically needs an ALLOW rule for network traffic in both directions.
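To make this concrete, here is a minimal sketch, not a definitive implementation, of what those restrictions might look like as an AWS security group built with boto3. The CIDR ranges, VPC ID, and destination networks are hypothetical; on physical networks the same allow-list thinking applies to your firewall or router ACLs.

import boto3

ec2 = boto3.client("ec2")

TRUSTED_ADMIN_CIDRS = ["203.0.113.0/24", "198.51.100.0/24"]  # hypothetical office / tether ranges
LOGGING_CIDR = "10.0.2.0/24"   # hypothetical internal log collector range
DNS_NTP_CIDR = "10.0.0.0/24"   # hypothetical internal resolvers and NTP servers

sg = ec2.create_security_group(
    GroupName="device-admin-restricted",
    Description="Admin from trusted ranges only; minimal egress",
    VpcId="vpc-0123456789abcdef0",  # placeholder VPC
)
sg_id = sg["GroupId"]

# Inbound: HTTPS admin interface, permitted only from the trusted ranges.
ec2.authorize_security_group_ingress(
    GroupId=sg_id,
    IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
        "IpRanges": [{"CidrIp": c, "Description": "trusted admin range"} for c in TRUSTED_ADMIN_CIDRS],
    }],
)

# New security groups start with an allow-all egress rule; remove it first.
ec2.revoke_security_group_egress(
    GroupId=sg_id,
    IpPermissions=[{"IpProtocol": "-1", "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}],
)

# Outbound: remote logging over HTTPS, DNS (UDP and TCP), and NTP (UDP) only.
ec2.authorize_security_group_egress(
    GroupId=sg_id,
    IpPermissions=[
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
         "IpRanges": [{"CidrIp": LOGGING_CIDR, "Description": "log endpoint"}]},
        {"IpProtocol": "udp", "FromPort": 53, "ToPort": 53,
         "IpRanges": [{"CidrIp": DNS_NTP_CIDR, "Description": "internal DNS"}]},
        {"IpProtocol": "tcp", "FromPort": 53, "ToPort": 53,
         "IpRanges": [{"CidrIp": DNS_NTP_CIDR, "Description": "internal DNS"}]},
        {"IpProtocol": "udp", "FromPort": 123, "ToPort": 123,
         "IpRanges": [{"CidrIp": DNS_NTP_CIDR, "Description": "internal NTP"}]},
    ],
)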

Step 2: Time

Let’s start with the basics: the time. Every device in your infrastructure should have the correct time. They should all be synced to a very high accuracy, using NTP or similar protocols. It’s imperative for timestamps between systems to line up so that logs can be correlated. You don’t need to run out and buy a stratum 1 atomic clock, but configure NTP sensibly for your network.

Your Cloud provider may have a scalable, reliable time source that you can synchronise virtual machine clocks with. For your colo or private networks, you may want to configure a set of NTP servers that the rest of your environment can depend upon.

And when I say depend upon: you should monitor the time difference between your NTP servers to detect any drift, and detect if any of your NTP servers are offline. Start by having every device use a private DNS resolver on your network, and publish internal DNS entries that list your set of NTP servers:

ntp.internal IN A 10.0.0.6
ntp.internal IN A 10.0.0.7
ntp.internal IN A 10.0.1.6

Your internal DNS has suddenly become a critical vector for compromise, so ensure that it is also in scope for this advice!
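A minimal sketch of that drift monitoring, assuming the third-party ntplib package is installed (pip install ntplib), using the addresses from the example records above and a made-up tolerance:

import ntplib

NTP_SERVERS = ["10.0.0.6", "10.0.0.7", "10.0.1.6"]  # from the ntp.internal records above
MAX_OFFSET_SECONDS = 0.5  # hypothetical tolerance; tune for your environment

client = ntplib.NTPClient()
for server in NTP_SERVERS:
    try:
        response = client.request(server, version=3, timeout=5)
    except Exception as exc:
        print(f"ALERT: NTP server {server} unreachable: {exc}")
        continue
    # offset is the difference between local clock and the NTP server
    if abs(response.offset) > MAX_OFFSET_SECONDS:
        print(f"ALERT: NTP server {server} offset {response.offset:.3f}s exceeds threshold")
    else:
        print(f"OK: {server} offset {response.offset:.3f}s")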

In AWS Cloud, check out the Time Sync service.

Step 3: Logging

Do not log only locally. Always send logs across the network to a logging endpoint.

Your logging endpoint should be scalable so it doesn’t get overwhelmed or limited in how many logs it can ingest.

It must be encrypted in flight for both privacy and integrity, and it must be authenticated to ensure the right device is sending the right logs.

Logs should contain the timestamp of when they are received, as well as when devices sent them; and there should be minimal difference between these times.

And lastly, logs should be verbose enough that you do NOT need to go back to the original device to get more information. Get everything off the device, and you (or someone else) should never need to access the device itself directly. This handles the case where the device is compromised, no longer accessible, or has been bricked, deleted or otherwise removed.

Now that logs are in a uniform place, there are two things to do:

  1. Provision authenticated, encrypted access to those logs for the people who need to search them (and log their access to these logs!)
  2. Set up some automated alerts

In AWS, definitely use CloudWatch Logs. And remember, you can use CloudWatch Logs from your on-premises networks, over HTTPS, with authentication.
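For example, here is a minimal boto3 sketch of shipping a log event to CloudWatch Logs. The log group and stream names are placeholders, and credentials/permissions are assumed to be in place; note that CloudWatch records its own ingestion time alongside the timestamp you supply, which gives you both of the timestamps discussed above.

import time
import boto3

logs = boto3.client("logs")
GROUP = "/infrastructure/device-logs"   # hypothetical log group
STREAM = "core-switch-01"               # hypothetical per-device stream

# Create the group and stream if they don't exist yet.
try:
    logs.create_log_group(logGroupName=GROUP)
except logs.exceptions.ResourceAlreadyExistsException:
    pass
try:
    logs.create_log_stream(logGroupName=GROUP, logStreamName=STREAM)
except logs.exceptions.ResourceAlreadyExistsException:
    pass

logs.put_log_events(
    logGroupName=GROUP,
    logStreamName=STREAM,
    logEvents=[{
        "timestamp": int(time.time() * 1000),  # milliseconds since epoch, per the API
        "message": "authentication failure for user admin from 203.0.113.45",
    }],
)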

Step 4: Alerts

This is where the fun happens. How many things can you think of that would be an indicator of compromise (IOC)? Let’s start with the simple: any access that fails to authenticate to the device should be an alert. Your endpoint should not have unfettered public exposure, so the authentication attempts should all be legitimate.

Auth Failure: this could be a bot, even on your internal network, probing for access. Or perhaps it’s just you, before a coffee, mistyping a password. Good to know where these come from as early as possible.

Auth Success: so you know the alerting is working, and have a record of what you are doing, it’s nice to get confirmation that it’s you on the device. Or it could be compromised credentials being used. An auth success alert at 3am your local time could be a sign you’re working late, or… something else.

Timestamp mismatch: the log receive time and the log time from the device could be out by a meaningful amount. This could be an indication that submission of logs was delayed for some reason.

Device reboot: why should devices be unstable? Did they just flash new firmware? Were they replaced/cloned by compromised devices?

Lack of regular log submission: a reliable heartbeat is very useful, so watch out for no logs when you expect at least something.

Config change: for critical components like routers, or other devices that should have a reasonably stable configuration, alerting on this is a nice feedback confirmation of changes you (or someone else) have made.

Local device password change: if you can’t use centralised access control and single sign-on, then you should alert on this. And you should probably alert on this NOT having happened after a year.

Log access: this is getting a little meta, but an alert when someone inspects the logging system itself, to view the logs, may well be warranted.
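As a sketch of wiring one of these up in AWS, reusing the hypothetical log group from earlier: a CloudWatch Logs metric filter can count authentication failures, and a CloudWatch alarm can notify an SNS topic when any occur. The filter pattern, names, and topic ARN are all placeholders to adjust for your devices’ actual log format.

import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

GROUP = "/infrastructure/device-logs"  # hypothetical log group
SNS_TOPIC_ARN = "arn:aws:sns:ap-southeast-2:111122223333:infra-alerts"  # placeholder

# Count every log line containing the phrase "authentication failure".
logs.put_metric_filter(
    logGroupName=GROUP,
    filterName="auth-failures",
    filterPattern='"authentication failure"',
    metricTransformations=[{
        "metricName": "AuthFailures",
        "metricNamespace": "Infrastructure",
        "metricValue": "1",
        "defaultValue": 0,
    }],
)

# Alarm as soon as one failure appears in a 5-minute window.
cloudwatch.put_metric_alarm(
    AlarmName="infra-auth-failures",
    Namespace="Infrastructure",
    MetricName="AuthFailures",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=[SNS_TOPIC_ARN],
)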

Step 5: Alert Destinations and Escalations

Email is a terrible alert destination, but the easiest to set up. Then again, it’s also the easiest to set up a rule to then ignore. Some people use Slack or other instant messenger interfaces.

One thing you will want is a way to determine all the alerts that have been triggered historically, filtered by device or device type (all switches), time span (last 7 days, last week), alert type (auth failure & auth success), etc.

Creating a dashboard to show these alerts will help you understand what’s happening.

A single auth failure is an interesting event, but a repeated auth failure, over a relatively small time window (an hour, a day) may be a brute force attack. A repeated reboot may be a device failing.
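One way to ask that question of your logs, sketched here with boto3, a hypothetical log group, and a made-up message format, is a CloudWatch Logs Insights query that buckets authentication failures by hour over the last seven days:

import time
import boto3

logs = boto3.client("logs")
GROUP = "/infrastructure/device-logs"  # hypothetical log group

query = (
    "filter @message like /authentication failure/ "
    "| stats count(*) as failures by bin(1h) "
    "| sort failures desc"
)

now = int(time.time())
query_id = logs.start_query(
    logGroupName=GROUP,
    startTime=now - 7 * 24 * 3600,
    endTime=now,
    queryString=query,
)["queryId"]

# Poll until the query finishes (usually a few seconds for modest volumes).
while True:
    result = logs.get_query_results(queryId=query_id)
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result.get("results", []):
    print({field["field"]: field["value"] for field in row})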

When a device (re-)boots, if it reports a firmware revision in its logging, how do you check that against the previously known firmware revision? (Hint: it’s in your logs from the previous boot.) Is that the currently recommended firmware? Is there some form of automatic firmware update in place? Is it lower than the previous revision, which could be a forced downgrade to known buggy firmware?
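A tiny sketch of that check, assuming a simple dotted version string has been parsed out of the boot log (your devices’ real log formats and versioning schemes will differ):

def parse_version(version: str) -> tuple:
    """Turn '2.4.13' into (2, 4, 13) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))

def check_firmware(device: str, previous: str, current: str) -> None:
    if parse_version(current) < parse_version(previous):
        print(f"ALERT: {device} firmware downgraded from {previous} to {current}")
    elif current != previous:
        print(f"NOTICE: {device} firmware changed from {previous} to {current}")

# Example usage with made-up versions pulled from two successive boot logs:
check_firmware("core-switch-01", previous="2.4.13", current="2.4.9")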

Summary

Pretty quickly you start to see the complexity, depth and urgency of having strong logging and alerting in place. Without a trusted base to work from, the workloads running in your environment cannot themselves be trusted.

Australia to get Top Secret rated AWS Cloud Region

In 2013 I was presenting to representatives of the South Australian government on the benefits of AWS Cloud. Security was obviously a prime consideration, and my role as the (only) AWS Security Solution Architect for Australia and New Zealand meant that this was a long discussion.

Clearly the shared responsibility model for cloud was a key driver, and continues to be so.

But the question came up: “We’re government, we need our own Region”. At that time, AWS had just launched its first GovCloud (US) Region, in August 2011. I knew then that the cost of a private region was around US$600M, before you spun up your first (billed) workload.

The best thing about public cloud is, with the safeguards in place around tenant isolation, there are a whole bunch of costs that get shared amongst all users. The more users, the less cost impact per individual. At scale, many things considered costly for one individual, become almost free.

Private AWS Regions are another story: there is not a huge client base to share these costs across. With a single tenant, that tenant pays 100% of the cost. But then that tenant can demand stricter controls, encryption and security protocols, etc.

This difference will perhaps be reflected in the individual unit costs (eg, per EC2 instance per hour, etc).

Numerous secret regions have been created since 2013, such as the Mercury Veil Project for the CIA’s secret AWS Cloud Region.

We now have two more interesting private Regions being commissioned: the previously announced European Sovereign Region, and, announced today, the Australian Secret Region at an initial AUD$2B cost.

After 11 years, the cost of a private (dedicated) Region has seemingly more than tripled.

If you thought cloud skills were getting passé, there’s a top secret world that’s about to take off.

Software License Depreciation in a Cloud World

Much effort is spent on preserving and optimising software licenses when organisations shift their workloads to a cloud provider. It’s seen as a “sunk cost”, something that needs to be taken whole into the new world, without question.

However, some vendors don’t like their customers using certain cloud providers, and are making things progressively more difficult for those organisations that value keeping (or are required to keep) their software stack well maintained.

Case in point: one software vendor who has their own cloud offering made significant changes to their licensing, progressively removing customers’ rights to choose to run their acquired licences in a competitor’s cloud.

I say progressively because customers can continue to run the (now older) versions of the software released before the point in time the licensing was modified.

The Security Focus

Security in IT is a moving target. There are always better ways of doing something, and previous approaches that once were the best way are now deemed obsolete.

Let me give you a clear example: network encryption in flight. The dominant protocol used to negotiate this is called Transport Layer Security (TLS), and it’s something I’ve written about many times. There are different versions (and if you dig back far enough, it even had a different name: SSL, or Secure Sockets Layer).

Older TLS versions have been found to be weaker, and newer versions have been introduced.

But certain industry regulators have mandated only the latest versions be used.

Support for TLS is embedded in both your computer’s operating system and certain applications that you run. This permits an application to make outbound connections using TLS, as well as to listen for and receive connections protected with TLS.

Take a database server: it’s listening for connections. Unless you’ve been living under a rock, the standard approach these days is to insist on encryption in flight in each segment of your application. Application servers may access your database, but only if the connection is encrypted, despite them sitting in the same data centre, possibly in the same rack or on the same physical host! It’s an added layer of security, and the optimisations done mean it’s rarely a significant overhead compared to the eavesdropping protection it grants you.

Your operating system from, say, 2019 or before may not support the latest TLS 1.3; some vendors were pretty slow to implement support for it, and only did so when you installed a new version of the entire operating system. And some application providers didn’t integrate the increased capability (or a control to permit or limit the TLS version) in those older versions of their software from 2019 or earlier.

But in newer versions they have fixed this.

Right now, most compliance programs require only TLS 1.2 or newer, but it is foreseeable that in future, organisations will be required to “raise the bar” (or drawbridge) to use only TLS 1.3 (or newer), at which time, all that older software becomes unusable.

Those licences become worthless.

Of course, the vendor would love you to take a new licence, but only if you don’t use other cloud providers.

Vendor Stickiness

At this time, you may be thinking that this is not a great customer relationship. You have an asset that, over time, will become useless, and you are being restricted from using your licence under newer terms.

The question then turns to “why do we use this vendor”. And often it is because of historical reasons. “We’ve always used XYZ database”, “we already have a site licence for their products, so we use it for everything”. Turns out, that’s a trap. Trying to squeeze out cost savings by forcing technology decisions based on what you already have may preclude you from having flexibility in your favour.

For some in the industry, the short term goal is the only objective; they sign a purchase order to reach an immediate objective, without taking the longer term view of where that is leading the organisation, even if that’s backing them into a corner. They celebrate the short term win, get a few games of golf out of it, and then go hunting for their next role elsewhere, using the impressive short term saving as their report card.

A former colleague of mine once wrote that senior executive bonuses shouldn’t be paid out in the same calendar year, but delayed (perhaps 3 years) to ensure that the longer term success was the right outcome.

Those with more fortitude for change have, over the last decade, been embracing Open Source solutions for more of their software stack. The lack of licence restrictions (and licence costs) makes it palatable.

The challenge is having a team who can not only implement potential software changes, but also support a new component in your technology stack. For incumbent operations and support teams, this can be an upskilling challenge; some won’t want to learn something new, and will churn up large amounts of Fear, Uncertainty and Doubt (FUD). Ultimately, they argue, it is better to just keep doing what we’ve always done and pay the financial cost, rather than make the effort to do something better.

Because better is change, and change is hard.

An Example

Several years ago, my colleagues helped rewrite a Java-based application and change the database from Oracle to PostgreSQL. It took a few months from start to finish, with significant testing. Both the Oracle and PostgreSQL databases ran happily on the AWS Relational Database Service (RDS). The database was simple table storage, but the original application developers already had a site licence for Oracle, and since that’s what they had, that’s what they used.

At the end of the project, the cost savings were significant. The return on investment for the project services to implement the change was around 3 months, and now, years later, the client is so much better off financially. It changed the trajectory of the TCO spend.

The coming software apocalypse

So all these licences that are starting to hold back innovation are becoming progressively more problematic. The next time security requirements tighten, you’re going to hear a lot of very large legacy software licence agreements disintegrate.

Meanwhile, some cloud providers can bundle the software licence into the hourly compute usage fee. If you use it, you pay for it; when you don’t use it, you don’t pay for it. If you want a newer version, you have the flexibility to move to it. Or perhaps even to stop using it.

More TLS 1.3 on AWS

Earlier this week, AWS posted about their expanded support for TLS 1.3, clearly jumping on the reduced handshake as a speed improvement in their blog post entitled: Faster AWS cloud connections with TLS 1.3.

Back in 2017 (yes, six years ago), we started raising Product Feature Requests for AWS products to enable this support, and at the same time, for customer control to limit the acceptable TLS versions. This makes perfect sense in customer applications (the data plane). Not only do we not want our applications supporting every possible historic version of cryptography; various compliance programs require us to disable them.

Most notable in this was PCI DSS 3.1, the Payment Card Industry (credit card) Data Security Standard, which drove the nail into the coffin of TLS 1.1 and everything before it.

Over time, TLS versions (and SSL before it) have fallen from grace. Indeed, SSL 1.0 was so bad it never saw the light of day outside of Netscape.

And it stands to reason that, in future, newer versions of TLS will come to life, and older versions will, eventually, have to be retired; and between those two lies another transition. However, this transition requires deep upgrades to cryptography libraries, and sometimes to client code, to support the lower-level library’s new capability.

On the server side, we often see a more proactive approach to which TLS versions are currently permitted. Great services like SSLLabs.com, Hardenize.com, and testssl.sh have guided many people towards what today’s state of “acceptable” and “good” generally looks like. And the key feature of those services is their continual uplift as the state of “acceptable” and “good” changes over time.

On the client side, it’s not always been as useful. I may have a process that establishes outbound connections to a server, but as a client I may want to specify some minimum version for my compliance, and not just rely upon the remote party to do this for me. Not many software packages expose this; the closest control you get is an integration that uses HTTPS (or TLS), without the next level down of “which versions are OK to use when I connect outbound?”. Of course, having specified HTTPS (or TLS) and validated the server certificate we were given back during the handshake against our local trust store, we have a degree of confidence that it’s probably the right provider, given that one of my 500 trusted CAs signed that certificate.
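For what it’s worth, in Python this client-side control is a one-line setting on the SSL context. Here is a minimal sketch with a hypothetical endpoint, requiring at least TLS 1.2 and printing what was actually negotiated:

import socket
import ssl

HOST, PORT = "api.example.com", 443  # hypothetical endpoint

context = ssl.create_default_context()            # verifies the certificate against the local trust store
context.minimum_version = ssl.TLSVersion.TLSv1_2  # raise to TLSv1_3 when compliance requires it

with socket.create_connection((HOST, PORT), timeout=10) as sock:
    with context.wrap_socket(sock, server_hostname=HOST) as tls:
        print(f"Negotiated {tls.version()} with cipher {tls.cipher()[0]}")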

This sunrise/sunset is even more important to understand in the case of managed services from hyperscaler cloud providers. AWS speaks of the deprecation of TLS 1.1 and prior in this article (June 2022).

If you have solutions that use AWS APIs, such as applications talking to DynamoDB, then this is part of the technical debt you should be actively and regularly addressing. If you haven’t been including updated AWS SDKs in your application, updating your installed SSL libraries, and updating your OS, then you may not be prepared for this. Sure, it may be “working” fine right now.

One option you have is to look at your application connection logs, and see if the TLS version for connections is being logged. If not, you probably want to get that level of visibility. Sure, you could Wireshark (packet dump) a few sample connections, but it would probably be better not to have to resort to that. Having the right data logged is all part of Observability.
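As a sketch of building that visibility in on the listening side, assuming placeholder certificate paths and port, a Python TLS server loop can log the negotiated protocol version and cipher for every connection it accepts:

import socket
import ssl
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")

context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
context.load_cert_chain("/etc/pki/server.crt", "/etc/pki/server.key")  # placeholder paths
context.minimum_version = ssl.TLSVersion.TLSv1_2

with socket.create_server(("0.0.0.0", 8443)) as server:       # placeholder port
    with context.wrap_socket(server, server_side=True) as tls_server:
        while True:
            conn, addr = tls_server.accept()
            # Record which TLS version and cipher each client actually used.
            logging.info("connection from %s negotiated %s (%s)",
                         addr[0], conn.version(), conn.cipher()[0])
            conn.close()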

June 28 is the (current) deadline for AWS to raise the minimum supported TLS version. That’s a month away from today. Let’s see who hasn’t been listening…

Cyber Insurance: EoL?

The Chief Executive of insurance company Zurich, Mario Greco, recently said:

“What will become uninsurable is going to be cyber,” he said. “What if someone takes control of vital parts of our infrastructure, the consequences of that?” 

Mario Greco, Zurich

In the same article, Lloyd’s is looking for exclusions in cyber insurance for attacks by state-based actors, something that is difficult to prove with certainty.

All in all, part of the reason cyber insurance exists is a risk calculation: the bet that spending on insurance premiums (and having financial recompense to cover operational costs) is cheaper than having competent processes around software maintenance to code securely to start with, detect threats quickly, and maintain (patch/update) rapidly over time.

Most organisations are structured with a “support team” who are responsible for an ever-growing list of digital solutions, goaled on cost minimisation, and not measured on the number of maintenance actions per solution operated.

It’s one of the reasons I like the siloed approach of DevOps and Service Teams. Scope is contained to one (or a small number of similar) solution(s). Same tech base, same skill set. With a remit to have observability and metrics, and a focus on one solution, the team can go deep on full-stack maintenance, focusing on a job well done, rather than on a system that is just turned on.

It’s the difference between a grand painter, and a photocopier. Both make images; and for some low-value solutions, perhaps a photocopier is all they are worth investing in from a risk-reward perspective. But for those solutions that are the digital-life-blood of an organisation, the differentiator to competitors, and those that have the biggest end-customer impact, then perhaps they need a more appropriate level of operational investment — as part of the digital solution, not as a separate cost centre that can be seen to be minimised or eradicated.

If cyber insurance goes end-of-life as a product in the insurance industry, then the war for talent, the push to find those artisans who can adequately provide that capability, intensifies. All companies want the smartest people, as one smarter person may be more cost effective than three average engineers.