Amazon Linux, EC2, S3, Perl, SSL Wildcard Certificates

Amazon Linux, one of the distributions recommended for Amazon EC2 customers, recently had an update: 11.09. This brought an update to a whole raft of libraries, including the Perl LWP (libwww) library in perl-libwww-perl-5.837 (previously 5.833), and other related modules.

One of the changes is to the default for “verify hostname” in the SSL protocol when using LWP::UserAgent: previously, verification of the certificate against the hostname given was disabled by default, and in an effort to improve security, this was turned on. You’ll see this mentioned in the LWP::UserAgent documentation: “The no checks behaviour was the default for libwww-perl-5.837 and earlier releases”. What’s unusual is that the no-checks behaviour is DIFFERENT in Amazon Linux’s package of 5.837 compared to this statement; I suspect this one-line change of default was backported into their 5.837 ‘in the interest of security’.
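To check which libwww-perl version a host is running, a quick one-liner:

perl -MLWP -le 'print $LWP::VERSION'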

Unfortunately, this breaks a lot of scripts and other modules/libraries out there, one of which is the Amazon-issued S3 library. S3 is the Amazon Simple Storage Service (hence S3), in which a user (customer) has their data arranged in “buckets”, with data in objects identified by ‘keys’ (like a file name). All data is put to, and read from, the S3 service over HTTPS; it’s not locally mounted (though some clever FUSE tooling may make it look that way, it is still HTTPS underneath).

A bucket in S3 has a name, and in my example the name looks like a domain name (images.foo.com). When accessing this bucket, the Amazon S3 Perl library connects to an alias hostname (CNAME) made by combining the bucket name with “s3.amazonaws.com”, so our example becomes “images.foo.com.s3.amazonaws.com”. This site is using a wildcard certificate for “*.s3.amazonaws.com” (you can see it as a Subject Alternative Name extension in the SSL certificate). This permits the certificate to be considered valid for any hostname directly under the s3.amazonaws.com domain. However, under RFC 2818, the only thing permitted before “s3.amazonaws.com” is a single name component, not a (seemingly valid) dotted domain name. So “com.s3.amazonaws.com” is OK with a wildcard certificate, but “images.foo.com.s3.amazonaws.com” is not.
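A minimal way to see the failure (the bucket hostname here is illustrative):

use strict;
use warnings;
use LWP::UserAgent;

# With the new default (verify_hostname on), this request fails with an
# SSL hostname-mismatch error, because *.s3.amazonaws.com does not cover
# the dotted bucket name.
my $ua = LWP::UserAgent->new;
my $res = $ua->get('https://images.foo.com.s3.amazonaws.com/');
print $res->status_line, "\n";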

There are several solutions. The easiest is to turn off SSL certificate verification again in your script. A handy environment variable may be set to do this: $ENV{PERL_LWP_SSL_VERIFY_HOSTNAME} = 0. Alternatively, if you are using LWP directly, you can pass an initialisation parameter to LWP::UserAgent of ssl_opts => { verify_hostname => 0 }. Both effectively abandon any certificate verification.
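A minimal sketch of both approaches (same illustrative hostname as above):

use strict;
use warnings;
use LWP::UserAgent;

# Option 1: process-wide, via the environment (set before any request)
$ENV{PERL_LWP_SSL_VERIFY_HOSTNAME} = 0;

# Option 2: per-agent, at construction time
my $ua = LWP::UserAgent->new(ssl_opts => { verify_hostname => 0 });
my $res = $ua->get('https://images.foo.com.s3.amazonaws.com/');
print $res->status_line, "\n";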

Somewhat more complicated, you can define a custom validation callback (procedure) to determine for yourself whether the certificate should be considered valid. Doing so to accept names that RFC 2818 would reject is a contravention of the RFC, and seems like a lot more hassle than it is worth.

Perhaps the easiest solution here is to avoid using a period/dot (‘.’) in S3 bucket names, thereby sidestepping the strict wildcard hostname checking entirely.

The most important realisation is how lax we have been at verifying SSL certificates, and how much we have come to rely on that just working. It is good for scripts to verify that the SSL certificate matches the host: I don’t want to start communicating authentication information over an SSL channel if we can easily see we’ve been duped at the remote end. I was not familiar with wildcard certificates only being valid for one component of a domain name; this reduces their effectiveness in my mind somewhat. They’ve always been more expensive than standard certificates, but being better aware of the FQDNs they will validate against is useful.

I’ve seen several other instances outside of this S3 example where invalid certificates have blindly been accepted by scripts (a CloudWatch example I saw with a redirect ‘hop’ through an SSL site). This default change from lax checking to requiring legitimate certificates may actually encourage better adoption of the security that SSL can give (we’re already paying for SSL certs, after all), or lead us, as developers and architects, to acknowledge when we’re actively ignoring that layer of protection.

It’s early days now, but as this default change filters into Linux distributions (and Perl distributions on other platforms), we’ll start to see a lot of FAQs on this.

File::Pid to catch multiple execution

CPAN carries a module called File::Pid, which implements a PID file object backed by a file on disk. The documentation suggests it be used as:

  use File::Pid;
  my $pidfile = File::Pid->new({
    file => '/some/file.pid',
  });
  $pidfile->write;
  if ( my $num = $pidfile->running ) {
      die "Already running: $num\n";
  }
  $pidfile->remove;

However, if you write() before calling running(), then the PID in the PID file gets overwritten with the current script’s PID, and thus running() always returns the current script’s PID; run the example as given, and it always bombs out.
Instead, you want to put the call to write() after the call to running(). Note too that declaring my $num inside the die condition won’t work under strict (the declaration only takes effect on the following statement), so capture the PID first:

  use File::Pid;
  my $pidfile = File::Pid->new({ file => '/some/file.pid' });
  my $num = $pidfile->running;
  die "Already running: $num\n" if $num;
  $pidfile->write;
  do_something_useful();
  $pidfile->remove;

Easy.

Perl Search and Replace, using variables

Perl is a reasonable scripting language (as are others, so shh!). It has always had strong regular expression support; those regular expressions can also be used to do substitutions, such as:


my $pet = "I have a dog";
$pet =~ s/dog/cat/;

Neat enough. But let’s say I want to look up the parts of the “s///” that define my search text and my replacement text. Easy enough:


my $pet = "I have a dog";
my $search = "dog";
my $replace = "cat";
$pet =~ s/$search/$replace/;

But let’s make our substitution a little more complex: I want to match a URL, and have the host and port lower case, but leave the path in the case it comes in, and I don’t want to be entering expressions as the replacement text! Let’s try:


my $url = 'http://www.FOo.COm/wibbLE';
my $search = '^([^:]+://[^/]+)/?(.*)?$';
my $replace = '\L$1\E/$2';
print $url if $url =~ s/$search/$replace/;

Sadly, while $search matches, the replacement is the literal string “\L$1\E/$2”. It appears that we need the “/e” modifier (evaluate replacement as an expression), doubled to “/ee”, to first expand the string and then evaluate the result. Of course, when we’re doing an eval, we want to ensure we don’t have malicious content in $replace, such as “unlink()” and friends, which could do Bad Things. So my solution was to escape double quotes in my $replace string, wrap it all in double quotes, and pass “/ee”:


my $search = '^([^:]+://[^/]+(:\d+)?)/?(.*)?$'; # From database ($2 is the optional port, inside $1)
my $replace = '\L$1\E/$3'; # From database; the path is now group 3
$replace =~ s/"/\\"/g; # Protection from embedded code
$replace = '"' . $replace . '"'; # Put in a string for /ee
print $url if $url =~ s/$search/$replace/ee;

This will give us:

  • Port and host name in lower case
  • The hostname (and port) will always have a slash after it
  • Simple injected code like unlink() won’t be run (though note that double-quoted interpolation of constructs like @{[ ... ]} can still execute code, so treat this as a guard rather than a sandbox)
  • The expression that we initially set/fetch/got in $replace is just a vanilla replacement term, not arbitrary Perl code
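Putting it together as a self-contained check (the port is illustrative, added to exercise the optional port group; output shown in the comment):

use strict;
use warnings;

my $url = 'http://www.FOo.COm:8080/wibbLE';
my $search = '^([^:]+://[^/]+(:\d+)?)/?(.*)?$';
my $replace = '\L$1\E/$3';
$replace =~ s/"/\\"/g;           # Protection from embedded code
$replace = '"' . $replace . '"'; # Put in a string for /ee
$url =~ s/$search/$replace/ee;
print "$url\n"; # prints: http://www.foo.com:8080/wibbLE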

Log3NF – IPv6 support coming real soon

A few years ago, I converted my earlier idea of logging Apache requests to 3rd normal form into a fully-fledged Mod Perl 2.0 Log Handler, embedding this into Apache. It’s essentially a very simple handler of less than 60 lines, plus a stored procedure inside MySQL that normalises the data. It’s been running on my personal server since February 2009, and in that time it has collected around:

  • 1.4 million hits
  • 27,000 unique useragents
  • 57,000 unique paths
  • 9 basic authentication users
  • 13 HTTP methods
  • 50,000 unique referrer URLs
  • 13 HTTP status codes
  • 192,185.9637 megabytes of (body) data transferred; the average body response is 144 KB

Wow. The log data on disk is 574 MB, or around 410 bytes per request – including the indexes (this is the size of the MySQL directory containing the data).

All well and good. Now time to get it fit for IPv6, and then improve the reporting. The reporting has two phases:

  • Live reporting from the 3rd normal form for data covering the last few seconds/minutes/hours/days.
  • Summary reporting per day or per month, per statistic, pre-calculated (see the sketch below)
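As a rough sketch of what the pre-calculated side could look like (the table and column names here are hypothetical, not the actual Log3NF schema):

CREATE TABLE SummaryDaily (
Day date not null,
Site_ID smallint unsigned not null,
Hits int unsigned not null,
Bytes bigint unsigned not null,
primary key (Day, Site_ID)
);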

Anyway, we’re about to pull in IPv6, again storing this as efficiently as possible, and then improve the currently very basic reporting interface…. stay tuned… and see the SVN repository for code…
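One compact option for the IPv6 storage (a sketch, not the final design; the current schema stores IPv4 as an int unsigned) is 16 raw bytes in a BINARY(16) column, packed on the Perl side:

use strict;
use warnings;
use Socket qw(inet_pton inet_ntop AF_INET6);

# Pack the textual address into 16 raw bytes, suitable for a
# hypothetical BINARY(16) column alongside the existing IPv4 int
my $packed = inet_pton(AF_INET6, '2001:db8::1');
printf "%d bytes: %s\n", length($packed), unpack('H*', $packed);

# ... and back to text for display
print inet_ntop(AF_INET6, $packed), "\n";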

Logging to MySQL in 3rd Normal Form

I’m at it again with my Log3NF! When last I did this, Debian’s Perl packages were in no shape for using MySQL stored procedures, but time has passed and everything is ready….

Any web server software, like Apache, can log requests that come in when people browse sites. Typically people record the accesses and do statistical analysis on it – to see visitor numbers, people stealing graphics, preferred browser versions of the visitors, where people are being linked-to from, etc. All of this data can be quite voluminous, and much of it is repetitive.

For a long time there has existed the ability to log this data to a simple flat MySQL (or other) database. However, most of those implementations have used just one table to store all the records in a log line. This means the data still has to be split apart for analysis.

So, what have I done? Well, I have written a bunch of table structures to handle each component of a standard “combined” log file, and a table that joins each of these components of a log line together. Plus I have written some table structures to hold summary data, so over time I can delete the original log entries and just keep the summaries. Then I have written some stored procedures to parse the incoming log entry, split it into these tables, and update the summary statistics. Here’s the main table that ties everything together – you’ll see it’s indexed in every way possible, so you can see the possibilities for reporting from it…

CREATE TABLE Access (
ID bigint unsigned auto_increment primary key,
IPv4 int unsigned not null,
index index_IP(IPv4),
Ident_ID int unsigned,
User_ID int unsigned,
At datetime not null,
index index_At(At),
Protocol_ID tinyint unsigned,
index index_Protocol_ID(Protocol_ID),
Method_ID tinyint unsigned not null,
index index_Method_ID(Method_ID),
Status_ID tinyint unsigned not null,
index index_Status_ID(Status_ID),
Path_ID bigint unsigned,
index index_Path_ID(Path_ID),
Referer_ID bigint unsigned,
index index_Referer_ID(Referer_ID),
UserAgent_ID bigint unsigned,
index index_UserAgent_ID(UserAgent_ID),
Bytes int unsigned,
index index_Bytes(Bytes),
Server_ID smallint unsigned,
index index_Server_ID(Server_ID),
Site_ID smallint unsigned,
index index_Site_ID(Site_ID),
Timezone_ID tinyint unsigned not null
);

This supports having multiple web sites logging to it (think virtual hosting several sites) and server farms (multiple servers for big web sites, distributed global delivery).
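The dimension tables referenced by all those *_ID columns aren’t shown here, but each follows the same pattern; as an illustrative sketch (names are mine, the real schema is in the repository), the stored procedure’s get-or-insert for a component looks something like:

CREATE TABLE UserAgent (
ID bigint unsigned auto_increment primary key,
Name varchar(255) not null,
unique index index_Name(Name)
);

-- Inside the stored procedure, per component:
INSERT IGNORE INTO UserAgent (Name) VALUES (in_UserAgent);
SELECT ID INTO v_UserAgent_ID FROM UserAgent WHERE Name = in_UserAgent;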

Next up, I wrote a small script to load a pre-existing access log using this stored procedure. But that’s rather slow, so I have written a “Log Handler” for Apache 2 with Mod_Perl 2. This means that as each access is performed, it is logged live to 3rd normal form in MySQL. The handler is very brief:

package JEB::Log3NFHandler;
use strict;
use warnings;
use Apache2::RequestRec ();
use Apache2::Const -compile => qw(OK DECLINED);
use Apache::DBI;
use Time::Zone;

my $dbh;

sub handler {
    my $r = shift;
    $dbh = DBI->connect(
        'dbi:mysql:database=' . $r->dir_config("Log3NFDatabase"),
        $r->dir_config("Log3NFDatabaseUser"),
        $r->dir_config("Log3NFDatabasePassword") || ""
    ) unless $dbh;
    return Apache2::Const::DECLINED unless $dbh;

    my $sql = "call Log3NF(?, ?, ?, from_unixtime(?), ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)";
    my $sth = $dbh->prepare($sql);
    $sth->bind_param(1,  $r->connection->remote_ip);
    $sth->bind_param(2,  "-");                                   # Ident
    $sth->bind_param(3,  $r->user());
    $sth->bind_param(4,  $r->request_time());
    $sth->bind_param(5,  $r->protocol());
    $sth->bind_param(6,  $r->method());
    $sth->bind_param(7,  $r->status());
    $sth->bind_param(8,  $r->uri());
    $sth->bind_param(9,  $r->headers_in->get('Referer') || '-'); # Referer
    $sth->bind_param(10, $r->headers_in->get('User-Agent'));     # Useragent
    $sth->bind_param(11, $r->bytes_sent());                      # Bytes
    $sth->bind_param(12, $ENV{'SERVER_NAME'});                   # Server name
    $sth->bind_param(13, $r->hostname());                        # Site name
    # tz_local_offset()/60
    $sth->bind_param(14, "+0000");                               # Timezone
    $sth->execute();
    $sth->finish;
    return Apache2::Const::OK;
}

1; # modules must return true

You’ll notice the Timezone set to “+0000”; while the TZ variable in Mod_Perl gives a location (“Europe/London”), it doesn’t give an offset from GMT. I’m also always logging ident as “-”, since I can’t see how Mod_Perl makes that available. The database, DB user and password are all taken from the Apache configuration file via the PerlSetVar directive.
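For reference, the corresponding Apache configuration would be along these lines (the database name and credentials are placeholders):

# Load and attach the handler, and tell it where to log
PerlModule JEB::Log3NFHandler
PerlLogHandler JEB::Log3NFHandler
PerlSetVar Log3NFDatabase log3nf
PerlSetVar Log3NFDatabaseUser logger
PerlSetVar Log3NFDatabasePassword secret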

With this data in 3rd normal form, viewing it means several joins, or making use of another of the facilities that saw daylight in MySQL 5.0: views. So a couple of views sit around to make this data easily accessible.
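As a sketch of the idea (the dimension table and column names here are guesses for illustration; the real views live with the rest of the code):

CREATE VIEW AccessLog AS
SELECT a.At, INET_NTOA(a.IPv4) AS IP, m.Name AS Method, p.Name AS Path,
s.Code AS Status, a.Bytes
FROM Access a
JOIN Method m ON m.ID = a.Method_ID
JOIN Path p ON p.ID = a.Path_ID
JOIN Status s ON s.ID = a.Status_ID;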

With this data being stored as it happens, I wrote a CGI script to render it, giving me some graphs of the last 5 minutes of activity in real time. In fact, it’s dynamic, so I can zoom in to the last 5 minutes, or out to the last 800 minutes. This real-time analysis shows HTTP status codes, popular paths being requested (by hits and by bytes), plus per-minute hits and bytes.

But there’s more… let’s do some analysis on where these hits are coming from. MaxMind distributes a free Country CSV database that shows roughly where all these IPs are located. We load this CSV into a normalised form, and start to integrate it into the live and summary tables…
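The shape of that lookup, once the CSV is loaded (table and column names hypothetical; note a BETWEEN range join like this doesn’t use indexes well, so it’s better applied at summary time than per page view):

SELECT g.Country, COUNT(*) AS Hits
FROM Access a
JOIN GeoIPCountry g ON a.IPv4 BETWEEN g.BeginNum AND g.EndNum
GROUP BY g.Country
ORDER BY Hits DESC;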

… at least, that’s where I am up to now.

I’ve been looking at this approach since around 2002, when I had to perform all the normalisation in client-side Perl. But abstracting away the normalisation into the MySQL stored procedure makes this much neater, and less prone to inconsistencies (the client doesn’t have to update the main table and ensure it puts in the correct foreign keys).

I will put this code up for public consumption soon, so if you’re interested in 3rd normal form logging, drop me an email!