Logging to MySQL in 3rd Normal Form

I’m at it again with my Log3NF! When last I did this, Debian‘s Perl packages were in no shape for using MySQL stored procedures, but time has passed and everything is ready….

Any web server software, like Apache, can log requests that come in when people browse sites. Typically people record the accesses and do statistical analysis on it – to see visitor numbers, people stealing graphics, preferred browser versions of the visitors, where people are being linked-to from, etc. All of this data can be quite voluminous, and much of it is repetitive.

For a long time there has existed the ability to log this data to a simple flat MySQL (or other) database. However, most of those implementations have used just one table to store all the records in a log line. This means the data still has to be split apart for analysis.

So, what have I done? Well, I have written a bunch of table structures to handle each component of a standard “combined” log file, and a table that joins each of these components of a log line together. Plus I have written some table structures to hold summary data of this, so over time I can delete the original log entries and just keep the summaries. Then I have written some stored procedures to parse the incoming log entry and split it into these tables, and update the summary statistics. Here’s the main table that ties everything together – you’ll see it’s indexed in every way possible, so you cna see the possibilities for reporting from it…

CREATE TABLE Access (
ID bigint unsigned auto_increment primary key,
IPv4 int unsigned not null,
index index_IP(IPv4),
Ident_ID int unsigned,
User_ID int unsigned,
At datetime not null,
index index_At(At),
Protocol_ID tinyint unsigned,
index index_Protocol_ID(Protocol_ID),
Method_ID tinyint unsigned not null,
index index_Method_ID(Method_ID),
Status_ID tinyint unsigned not null,
index index_Status_ID(Status_ID),
Path_ID bigint unsigned,
index index_Path_ID(Path_ID),
Referer_ID bigint unsigned,
index index_Referer_ID(Referer_ID),
UserAgent_ID bigint unsigned,
index index_UserAgent_ID(UserAgent_ID),
Bytes int unsigned,
index index_Bytes(Bytes),
Server_ID smallint unsigned,
index index_Server_ID(Server_ID),
Site_ID smallint unsigned,
index index_Site_ID(Site_ID),
Timezone_ID tinyint unsigned not null
);

This supports having multiple web sites logging to it (think virtual hosting several sites) and server farms (multiple servers for big web sites, distributed global delivery).

Next up, I wrote a small script to load a pre-existing access log using this stored procedure. But thats rather slow, so I have written a “Log Handler” for Apache 2 with Mod_Perl 2. This means that as each access is performed, it is logged live to 3rd normal form in MySQL. The handler is very brief:

package JEB::Log3NFHandler;
use strict;
use warnings;
use Apache2::RequestRec ();
use Apache2::Const -compile => qw(OK DECLINED);
use Apache::DBI;
use Time::Zone;
my $dbh;

sub handler {
my $r = shift;
$dbh = DBI->connect('dbi:mysql:database=' . $r->dir_config("Log3NFDatabase"),  $r->dir_config("Log3NFDatabaseUser"),  $r->dir_config("Log3NFDatabasePassword")||"") unless $dbh;
return Apache2::Const::DECLINED unless $dbh;
my $sql = "call Log3NF(?, ?, ?, from_unixtime(?), ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)";
my $sth = $dbh->prepare($sql);
$sth->bind_param(1, $r->connection->remote_ip);
$sth->bind_param(2, "-"); # Ident
$sth->bind_param(3, $r->user());
$sth->bind_param(4, $r->request_time());
$sth->bind_param(5, $r->protocol());
$sth->bind_param(6, $r->method());
$sth->bind_param(7, $r->status());
$sth->bind_param(8, $r->uri());
$sth->bind_param(9, $r->headers_in->get('Referer')||'-'); # Referer
$sth->bind_param(10, $r->headers_in->get('User-Agent')); # Useragent
$sth->bind_param(11, $r->bytes_sent()); # Bytes
$sth->bind_param(12, $ENV{'SERVER_NAME'}); # Server name
$sth->bind_param(13, $r->hostname()); # Site name
#tz_local_offset()/60
$sth->bind_param(14, "+0000"); # Timezone
$sth->execute();
$sth->finish;
return Apache2::Const::OK;
}
1; # modules must return true

You’ll notice the Timezone set to “+0000”; while the TZ variable in Mod_Perl says a location (“Europe/London”), it doesn’t give an offset from GMT. I’m also always logging ident as “-“, since I cant see how Mod_Perl makes that available. The configuration of the Database, DB User and Password are all taken from the Apache configuration file from the PerlSetVar directive.

With this data in 3rd normal form, viewing it means several joins, or making use of another of the newer facilities that saw daylight in MySQL 5.1: views. So a couple of views sit around to make this data easily accessible.

With this data being stored as it happens, I wrote a CGI script to render this data – to give me some graphs of the last 5 minutes of activity, in real time. In fact, its dynamic, so I can zoom in to the last 5 mins, or out to the last 800 minutes. This real-time analysis shows HTTP status codes, popular paths being requested (by hits and by bytes), plus a per-minute hits and bytes.

But there’s more… lets to some analysis on where these hits are coming from. MaxMind distribute a free Country CSV database that shows roughly where all these IPs are coming from. We load this CSV into a normalised form, and start to integrate this into the live and summary tables…

… at least, that’s where I am up to now.

I’ve been looking at this approach since around 2002, when I had to perform all the normalisation in client-side Perl. But abstracting away the normalisation into the MySQL stored procedure makes this much neater, and less prone to inconsistencies (the client doesn’t have to update the main table and ensure it puts in the correct foreign keys).

I will put this code up for public consumption soon, so if you’re interested in 3rd normal form logging, drop me an email!

The intelligent mailbox

Close up of the IR \"beam break\" circuitFor some time I have been looking to create an intelligent mailbox and security entry interface; something that has several functions, and interfaces back to a PC, which will send messages and log data. Now amongst all of this is sitting my Arduino Decimellia microprocessor, which leads me to the electronics that play with these ideas; while this may not be the perfect unit for this, it is an easy start to playing with microprocessors.

Let me start with a dream list of functions to happen at/in/near my mailbox:

  • Detect the presence of snail-mail in the mailbox: give a visual indication on the outside that I should stop and empty the mailbox, and send an alert to the network of the arrival of mail (and an alert when the mailbox is emptied).
  • Optionally count the number of deliveries of mail I have waiting, keeping count until I empty the mailbox
  • Read an RFID card to authenticate someone at the gate, and present this serial number to the network
  • Optionally read a fingerprint scan from someone at the gate, and present this as a unique serial number to the network, just like for the RFID card above
  • Read an RFID card from the vehicle in the driveway, and present this to the network
  • Detect if the gate is fully open or fully closed
  • Close the circuit to activate the gate to move (just like pressing the remote control button)
  • Detect via PIR any movement near the mailbox, and report that to the network

The next bit gets a little tricky to do with a microprocessor, so probably involves putting components in/on the mailbox, but taking the feed back to a PC for processing:

  • Have a microphone, speaker, and video camera capture the audio and video of the person, and essentially hook them up to a SIP phone that will call a SIP destination if the person is not authenticated (ie, they are a visitor pressing a traditional “doorbell”).

In theory, someone who fails authentication or presses a traditional doorbell would be connected (as in, a SIP call with the camera/mic/speaker to a call group that starts with video phones and displays in the house, and then (depending on rules such as time-of-day, etc) cascade to a 3G video cell-phone, potentially. Thus if I am home, I can see who is at the gate, and then give a response to the control system to tell the Intelligent Mailbox to throw the circuit on the gate to open it (if it is not open already), and furthermore, close the gate again.

Indeed, let’s go one better:

  • Detect the presence of an object in the path of the gate, so we know if it can be safely opened/closed
  • Detect any PIR movement inside the gate so it can send an alert to the network

I’m ignoring the “one better” bits for a “phase 2” while I concentrate on the “easier” bits for “phase 1”. So what have I got thus far:

  • An IR transmitter and receiver pair from Maplin; CH10L and CH10M
  • A pair of LEDs to visually debug if the send signal is being sent and the receive path is activating
  • A transistor, a few resistors, and voilà…

… it flashes the IR LED, reads the high values from the IR receiver and averages them a bunch of times, then goes low, reads the low values a few times and averages those, and then looks for significant deviations during the “on” cycle; a simple “beam break detection” circuit; aside from the obvious GND and +5V connections, we have one PIN of the Decimillia being used to turn the IR LED on and off, and one analogue pin used to read the IR receive circuit.

In ignoring the “count” of the mail deliveries, and just opting for “there is mail” or “there is no mail”, I can place the IR emitter no the base of the mailbox (inside, of course) and the IR receiver on the roof (inside, obviously), and any letters that are sitting in the mailbox should obscure the light path. When that happens, the Arduino does a serial write, with an API I am yet to think through, basically saying “You’ve got snail mail”.

The Arduino could live in the mailbox, and have a long, long USB or serial cable going from the mailbox, under the ground, and up the driveway (in conduit), to the house, up to the roof, and across the crawlspace to a silent low power PC (Eee box, Mini ITX, etc). Hence when the driveway comes up for replacement (its sinking all over the place, I am thinking of some heavy duty conduit runs under it (as Anand, one of my mates, has previously done). Perhaps a concrete pipe with multiple plastic conduit runs? Only needs to be around 25 metres long…

What’s next; well, I have just purchased a 555 timer and a 339 voltage comparator IC. Now, this seems geeky, even for me, but it may be possible to make the 555 and the 339 do all the logic of the timing of the flashes and the detection of the beam break, and reduce this down to just one DIGITAL pin on the Microprocessor; if the voltage is high, then the beam is broken. That reduces down the amount of code I will have loaded into the Arduino, which is a bit limited anyway, especially if I want to cram a whole bunch of other functions (see above) on the other pins.

And through all this, while I (and my family) have been having a terrible cold, all over Christmas. Oh well, only human.

Another year ends

So as 2008 winds down, whats been happening in my Tech world?

Well, Open Source electronics got a kick along with the Arduino Decimelia – both the spec of this little microcontroller. With some very easy tutorials, its sure to kick-start a whole generation of hardware hackers. It’s got me re-invigorated in circuitry.

Openoffice.org 3.0 came out, with support for some of the more recent MS Office file formats. Firefox 3 arrived, but not Thunderbird 3 (though Alpha’s of the Firefox 3.1 and Thunderbird are pretty useful already). MySQL 5.1 arrived (after being sold-out to Sun), but not without some controversy over its suitability as a stable and finished release! Python 3000 arrived, and Perl 6 did not as did not Debian Lenny, but there are no surprises there.

The Netbook made a big splash – low budget, but reasonably capable ultra-portable workstations, often powered by Linux, but still with WinXP support — notably not MS Vista support. Vista SP1 turned up, but didn’t fix the Win32 landscape enough to cause businesses to rush to it; indeed, most still stick to WInXP and Office 2003. While a Vista-like episode from any other vendor would have sunk them, MS appears o be riding this out, using their war chest to coast through until Windows 7 later in 2009.

So, what are the hot Open Source projects now, and will make the big impacts for 2009? Well, Asterisk PBX continues to survive well against the incumbent vendors, who all overcharge for their products. With SIP products starting to get more advanced, such as my new Seimens Gigaset SIP and PSTN handsets, we’ll likely continue to see evolution of telephony to reduce costs.Skype continues to be the main video calling solution for the moment, but as Asterisk development continues along the SIP video road, this may slowly change — a SIP to Skype bridge will be a useful step.

Mobile data networks continue to offer more speed, at more reasonable prices. Many domestic tariffs now include unlimited* (obligatory fair usage applies) data plans, but continue to charge like wounded bulls when roaming. In Europe, the European Commission continues to make a level playing field across the region by enforcing fairer charges. On the fixed line, in the UK Virgin Media started to offer Fibre to the home, but at highly restricted speeds (for fibre) of just 50 MB/s. On the WiFi front, 802.11n finally saw the light of day, and several commercial connectivity providers started to offer solutions for business around this — a Disaster Recovery like secondary circuit that does not rely on local cable connectivity.

Solid State hard drives started to enter the mainstream with Apple releasing their MacBook Air with an SSD option. While the pricing was high, it was a start. SSD disks promise to improve throughput greatly, but require a re-think of the IO subsystem as suddenly the disk can do multiple reads and writes simultaneously (whereas spinning disks can only fetch or write one operation at a time). Capacities of SSDs are starting to scale up, and hopefully they will be available in large SAN enclosures so the option of massive-and-blindingly-fast storage can become a reality.

Hard drives in general started to hit 1.5 TB, but with the same MTBF for reads, are starting to show their age. A report showed that in a RAID 5 configuration using 1.5 TB disks, you’re likely to see a secondary failure (read failure) while trying to rebuild a RAID 5 array. If drives continue to grow in size but not reliability, then the time will come that RAID1 will be virtually gauranteed to fail in this way when rebuilding.

HD video from set-top boxes is on the agenda! Via announced its Mini ITX 2.0 spec back in June, but its difficult to see what vendors are meeting this spec. Either this, or the upcoming Asus Eeebox 204/206 units, HDMI will start to be common place on set top boxes. Couple this with a web cam and we start to have almost commercial like video conferencing in the home, and high def video playback. Link it to the arrival of high(er) speed broadband, and we’ll see more iPlayer and downloadable content. This probably spells the end of the DVD, and possibly Blueray (which is struggling to go mainstream in the UK). HD TV is slowly starting to take off with FreeSat in the UK, while Freeview (terrestrial DVB) is struggling to get the bandwidth for HD broadcast.

As the economy worldwide tries to recover, more pressure is put on implementing cost-effective IT solutions. Hence the economics demand Open Source, even if the people who are in control of some of this organisations have pre-conceived ideas of the value or reliability of it (my award for Plonker Of The Month goes to the person who said “Open Source is shit”, and doesn’t realise that over 50% of his organisation’s IT equipment runs Linux). While the Linux desktop still is not as common as speculated last year by many, its making inroads from the bottom end of the market (via the aforementioned Netbooks).

Lastly, to bring back a blunt point, a current campaign for one OS vendor says “Life without Walls“. For over a decade Linux people have been saying “In a world without walls, who needs Windows“. Lets stop paying to reinvent the wheel every 24 months (Vista’s visual improvements have changed nothing in the world), and start moving on to new problems. Cancer, AIDS, Alzheimer’s, space exploration, better education, basic water supplies for the third world, energy efficient transport (electric cars?), the decline of terrorism and war; surely these are more worthy goals for our funds than “buying Larry a new yacht”.

Keep scratching your own itch.

On the variance of price

I want to play with an RFID reader. I found a nice one from Parallax here. So, how much does this cost? Well, the Parallax web site says $39.99. Shipping and tax are all extras, but lets look at base price.

So a UK supplier quotes £37.50 for the unit. Using today’s FX conversion rate, which will vary wildly, the US$ equivalent is US$54.83. That represents a markup of almost 40% compared to the US listed price. Now taxes and delivery included, it looks like I’d be around the same percentage better off if I order in the US and pick it up next time i am there, than get it her in the UK.

*sigh*

Open Source after the apocolypse

Looks like Red Hat’s CEO is making a point about Open Source being dominant in the wake of the global financial crisis.  So what does this mean for the people who write, code, and distribute Free software?

Hiring and Keeping FLOSS Developers

Hiring open source developers is a tricky role. They need the freedom to contribute to their project, which is probably outside of your business goals. If they are unable to gt enough time to do their goals, then two several things happen:

  • Their open source project suffers through lack of time devoted to it
  • Their enthusiasm for their commercial goals suffer, since their open source effort is waining
  • The open source project may indirectly feed back to the company goals, so the company suffers from Free projects it can use commercially
  • The employee leaves, and you have to train a new employee (and pay recruiters, etc).

So, Google has a “20% time” programme, where staff can work on (approved) projects. This avoids the above conflict. Ergo, Google has some very enthusiastic and dedicated staff. And a lot of them.

Clearly there is a cost uplift, but the feedback loop does exist, if the management framework can understand it (and the purse strings can afford the 20% cost overhead of having enough staff to cover the workload).

Where do the current FLOSS developers end up?

Well, with a head start on a large number of technologies, and an understanding of the way projects come / go / survive / fork / die, established FLOSS practitioners end up… leading, from the front. Which means becoming management. Which means reporting, directing, and not actually doing the work, which for some is a motivation in itself.

So, shake the dust off and move away from the coal face. Start pointing at the seams from the back, and watch others try to tackle it.

Where will we start seeing Open Source in the Enterprise?

With major releases in the last few weeks of OpenOffice.org, and Gimp, and a slew of other projects, there is now a set of tools that cover the majority of what the corporate desktop actually needs. And these are on the Win32 platform, which the enterprise is comfortable with.

OpenOffice.org 3 now can read Microsoft Office 2007 file formats. Admittedly its “import” only, but it can save to Microsoft Office 2003 file formats, and thats enough of a bridge. Besides, if someone wants to save to any other format, they can contribute to the project to make it happen.

Gimp is starting to look more polished. What it needs is more examples to get the users who think of Photoshop as a verb to understand that photo editing doesn’t start and end with one commercial product.

Getting Group policy support for these projects on their Win32 is a “nice to have”, but probably wont happen (unless someone feels stronly about it enough to do it themselves), but moving this to the Linux environment should become easier.

Update 2008-10-22

A contrary opinion.