Logging to MySQL in 3rd Normal Form

I’m at it again with my Log3NF! When last I did this, Debian‘s Perl packages were in no shape for using MySQL stored procedures, but time has passed and everything is ready….

Any web server software, like Apache, can log requests that come in when people browse sites. Typically people record the accesses and do statistical analysis on it – to see visitor numbers, people stealing graphics, preferred browser versions of the visitors, where people are being linked-to from, etc. All of this data can be quite voluminous, and much of it is repetitive.

For a long time there has existed the ability to log this data to a simple flat MySQL (or other) database. However, most of those implementations have used just one table to store all the records in a log line. This means the data still has to be split apart for analysis.

So, what have I done? Well, I have written a bunch of table structures to handle each component of a standard “combined” log file, and a table that joins each of these components of a log line together. Plus I have written some table structures to hold summary data of this, so over time I can delete the original log entries and just keep the summaries. Then I have written some stored procedures to parse the incoming log entry and split it into these tables, and update the summary statistics. Here’s the main table that ties everything together – you’ll see it’s indexed in every way possible, so you cna see the possibilities for reporting from it…

CREATE TABLE Access (
ID bigint unsigned auto_increment primary key,
IPv4 int unsigned not null,
index index_IP(IPv4),
Ident_ID int unsigned,
User_ID int unsigned,
At datetime not null,
index index_At(At),
Protocol_ID tinyint unsigned,
index index_Protocol_ID(Protocol_ID),
Method_ID tinyint unsigned not null,
index index_Method_ID(Method_ID),
Status_ID tinyint unsigned not null,
index index_Status_ID(Status_ID),
Path_ID bigint unsigned,
index index_Path_ID(Path_ID),
Referer_ID bigint unsigned,
index index_Referer_ID(Referer_ID),
UserAgent_ID bigint unsigned,
index index_UserAgent_ID(UserAgent_ID),
Bytes int unsigned,
index index_Bytes(Bytes),
Server_ID smallint unsigned,
index index_Server_ID(Server_ID),
Site_ID smallint unsigned,
index index_Site_ID(Site_ID),
Timezone_ID tinyint unsigned not null
);

This supports having multiple web sites logging to it (think virtual hosting several sites) and server farms (multiple servers for big web sites, distributed global delivery).

Next up, I wrote a small script to load a pre-existing access log using this stored procedure. But thats rather slow, so I have written a “Log Handler” for Apache 2 with Mod_Perl 2. This means that as each access is performed, it is logged live to 3rd normal form in MySQL. The handler is very brief:

package JEB::Log3NFHandler;
use strict;
use warnings;
use Apache2::RequestRec ();
use Apache2::Const -compile => qw(OK DECLINED);
use Apache::DBI;
use Time::Zone;
my $dbh;

sub handler {
my $r = shift;
$dbh = DBI->connect('dbi:mysql:database=' . $r->dir_config("Log3NFDatabase"),  $r->dir_config("Log3NFDatabaseUser"),  $r->dir_config("Log3NFDatabasePassword")||"") unless $dbh;
return Apache2::Const::DECLINED unless $dbh;
my $sql = "call Log3NF(?, ?, ?, from_unixtime(?), ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)";
my $sth = $dbh->prepare($sql);
$sth->bind_param(1, $r->connection->remote_ip);
$sth->bind_param(2, "-"); # Ident
$sth->bind_param(3, $r->user());
$sth->bind_param(4, $r->request_time());
$sth->bind_param(5, $r->protocol());
$sth->bind_param(6, $r->method());
$sth->bind_param(7, $r->status());
$sth->bind_param(8, $r->uri());
$sth->bind_param(9, $r->headers_in->get('Referer')||'-'); # Referer
$sth->bind_param(10, $r->headers_in->get('User-Agent')); # Useragent
$sth->bind_param(11, $r->bytes_sent()); # Bytes
$sth->bind_param(12, $ENV{'SERVER_NAME'}); # Server name
$sth->bind_param(13, $r->hostname()); # Site name
#tz_local_offset()/60
$sth->bind_param(14, "+0000"); # Timezone
$sth->execute();
$sth->finish;
return Apache2::Const::OK;
}
1; # modules must return true

You’ll notice the Timezone set to “+0000”; while the TZ variable in Mod_Perl says a location (“Europe/London”), it doesn’t give an offset from GMT. I’m also always logging ident as “-“, since I cant see how Mod_Perl makes that available. The configuration of the Database, DB User and Password are all taken from the Apache configuration file from the PerlSetVar directive.

With this data in 3rd normal form, viewing it means several joins, or making use of another of the newer facilities that saw daylight in MySQL 5.1: views. So a couple of views sit around to make this data easily accessible.

With this data being stored as it happens, I wrote a CGI script to render this data – to give me some graphs of the last 5 minutes of activity, in real time. In fact, its dynamic, so I can zoom in to the last 5 mins, or out to the last 800 minutes. This real-time analysis shows HTTP status codes, popular paths being requested (by hits and by bytes), plus a per-minute hits and bytes.

But there’s more… lets to some analysis on where these hits are coming from. MaxMind distribute a free Country CSV database that shows roughly where all these IPs are coming from. We load this CSV into a normalised form, and start to integrate this into the live and summary tables…

… at least, that’s where I am up to now.

I’ve been looking at this approach since around 2002, when I had to perform all the normalisation in client-side Perl. But abstracting away the normalisation into the MySQL stored procedure makes this much neater, and less prone to inconsistencies (the client doesn’t have to update the main table and ensure it puts in the correct foreign keys).

I will put this code up for public consumption soon, so if you’re interested in 3rd normal form logging, drop me an email!