
Analysing Apache log files - Part 1

After getting my CentOS server up and running, it's time to load up my Apache access log files. To do this I needed a Perl script to load the data, plus some SQL queries to answer questions like who's hotlinking our images or hogging our bandwidth.

In setting up CentOS I had already ensured that the MySQL server was installed. The next step was to log in as root and set a MySQL password for root. The MySQL folks have a getting started with MySQL page that can get you going.

Then I created a database to hold the log data, and a table to store the parsed entries:

drop table if exists log;
create table log (
  ip char(15),
  date_time datetime,
  cmd varchar(250),
  rc smallint,
  sz int,
  refer varchar(250),
  agent varchar(250)
);
I decided to use a DATETIME column for the timestamp; it's only 8 bytes instead of 20 characters. The other advantage is being able to use SQL date and time arithmetic for selecting records, and maybe even doing some time-period analysis to see how often our 1.5 Mbps bandwidth pipe gets congested.
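As a rough illustration of that kind of time-period analysis (sketched here in Python with its built-in SQLite rather than MySQL, so the date function differs — SQLite's strftime() where MySQL would use DATE_FORMAT() — and the sample rows are made up):

```python
# Sketch: total bytes served per hour, so a congested hour stands out.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (ip TEXT, date_time TEXT, sz INTEGER)")
rows = [
    ("10.0.0.1", "2006-12-03 12:00:41", 560),
    ("10.0.0.2", "2006-12-03 12:15:02", 91000),
    ("10.0.0.3", "2006-12-03 13:01:10", 4200),
]
conn.executemany("INSERT INTO log VALUES (?,?,?)", rows)

# Truncate each timestamp to its hour, then sum the bytes per hour
for hour, total in conn.execute(
    "SELECT strftime('%Y-%m-%d %H:00', date_time) AS hr, SUM(sz) "
    "FROM log GROUP BY hr ORDER BY hr"
):
    print(hour, total)
```

The same GROUP BY shape works against the MySQL table above once the DATE_FORMAT() syntax is swapped in.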

Then I wrote a Perl script to load the data:

#!/usr/bin/perl -w
# Reads Apache combined-format log lines from STDIN and attempts to insert
# them into the "log" table. Lines that don't match the regular expression
# are printed so they can be examined.
use strict;
use DBI;

my $dsn = 'DBI:mysql:webalizer:localhost';
my $db_user_name = 'some-user';
my $db_password = 'some-password';
my $dbh = DBI->connect($dsn, $db_user_name, $db_password)
 or die "Couldn't connect to database: " . DBI->errstr;
my $insert_handle =
 $dbh->prepare_cached("INSERT INTO log VALUES (?,STR_TO_DATE(?,'%d/%b/%Y:%T'),?,?,?,?,?)");
while (<>) {
 # The (?:...) groups are non-capturing so that $1..$7 line up with the
 # seven placeholders. The timezone (-0800) is hard-coded for my logs.
 if (/^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - - \[(\S+) -0800\] "((?:\\.|[^\\"])*)" (\d*) (\d*) "((?:\\.|[^\\"])*)" "((?:\\.|[^\\"])*)"/) {
  #print "found: $1, $2, $3, $4, $5, $6, $7\n";
  $insert_handle->execute($1, $2, $3, $4, $5, $6, $7)
   or die "Couldn't insert row: " . DBI->errstr;
 } else {
  print $_;   # $_ still ends with its newline
 }
}

I built this up in steps, starting from a Hello World example, then adding lines to read and process the input, then building up the pattern match needed to extract the strings I'll insert into the database, and finally using Perl DBI to actually connect to MySQL and insert the records.

Why not try to load the data directly as raw text? Well, I tried that in Access, and it worked OK. But from that I know there's a lot of data to load, so I don't want to load anything I don't need. I also want the ability to convert some fields into more space-efficient types. Finally, Access complained about a few malformed or oversized fields on loading; I want Perl to kick those records out into an error file.

I loaded up a week's worth of data for a test. It took about half an hour to load a million rows, and my hotlinking query took 5 minutes. I'm a little disappointed; I had thought it might run a bit faster. On the other hand, it works and it's not tying up my main PC, so who cares how long it runs.

I did notice why Access was kicking out records. Turns out there are some escaped double quote characters in some of the data:

 - - [03/Dec/2006:12:00:41 -0800] "GET /activities.shtml\" HTTP/1.0" 302 560 "-" "msnbot-Products/1.0 (+http://search.msn.com/msnbot.htm)"

Notice the \" characters after .shtml. I noticed some referrers had escaped quotes also.

Fortunately I found a solution to parsing escaped quotes in Perl, a pattern along these lines:

"(\\.|[^\\"])*"

I don't read Perl well enough to truly understand this. The outside " characters are obvious; the rest, well, at least it works. I did add another pair of parentheses ( ) inside the quotes so that Perl would remember the string inside the quotes.
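The escaped-quote idea isn't Perl-specific; here's a quick check of the same pattern in Python (the surrounding code and sample line are mine, trimmed from the msnbot entry above):

```python
# Match a quoted field that may contain backslash-escaped quotes:
# the field body is any run of escaped characters (\\.) or
# ordinary characters that are neither a backslash nor a quote.
import re

quoted = re.compile(r'"((?:\\.|[^\\"])*)"')

line = r'"GET /activities.shtml\" HTTP/1.0" 302 560 "-" "msnbot-Products/1.0"'
fields = quoted.findall(line)
print(fields[0])   # the request field, escaped quote and all
```

A plain `"[^"]*"` would have stopped at the `\"` after .shtml and split the request in two; the alternation consumes the escape as a unit.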

Next I thought I'd improve the hotlinking query. I wondered why it used a HAVING clause instead of a WHERE clause; since none of the tests involved an aggregate, I changed it to a WHERE clause. I also added Count(*) to find out how many times each image was requested, not just the bandwidth used, and I excluded what look like Google search referrers. After loading up another week's data, this revised query actually ran faster, at just over one minute elapsed.

SELECT cmd, refer, Sum(sz) as sum_size, Count(*)
FROM log
WHERE (cmd Like "%.jpg %" Or cmd Like "%.gif %")
  AND refer <> "-" AND refer <> ''
  AND refer Not Like "http://www.creativekidsathome.com%"
  AND refer Not Like "http://creativekidsathome.com%"
  AND refer Not Like "http://www.sciencekidsathome.com%"
  AND refer Not Like "http://www.google.%"
GROUP BY cmd, refer
ORDER BY 3 DESC LIMIT 50;
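The WHERE-versus-HAVING point can be checked on any SQL engine. A small sketch using Python's built-in SQLite, with toy data of my own rather than the real log:

```python
# WHERE filters rows before grouping; HAVING filters groups after.
# For a plain column test like refer <> '-' the results are the same,
# but WHERE lets the engine discard rows before the GROUP BY runs.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (cmd TEXT, refer TEXT, sz INTEGER)")
conn.executemany("INSERT INTO log VALUES (?,?,?)", [
    ("GET /a.jpg HTTP/1.0", "http://example.com/", 1000),
    ("GET /a.jpg HTTP/1.0", "http://example.com/", 2000),
    ("GET /a.jpg HTTP/1.0", "-", 500),
])

where_q = ("SELECT cmd, refer, SUM(sz), COUNT(*) FROM log "
           "WHERE refer <> '-' GROUP BY cmd, refer")
having_q = ("SELECT cmd, refer, SUM(sz), COUNT(*) FROM log "
            "GROUP BY cmd, refer HAVING refer <> '-'")

print(list(conn.execute(where_q)) == list(conn.execute(having_q)))
```

Same rows either way, which is why moving the tests into WHERE was safe here.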

The next trick will be to automate this process somehow.
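One obvious route is a nightly cron job that feeds the rotated log through the loader. Something along these lines — the script name and log paths here are placeholders of mine, not from this setup:

```
# m h dom mon dow  command -- run after the nightly log rotation
30 4 * * *  perl /home/mike/bin/load_log.pl < /var/log/httpd/access_log.1 2>> /home/mike/load_log.err
```

Since the script already prints non-matching lines, redirecting stderr (and stdout, if desired) gives the error file of rejected records for free.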

© 2016 Mike Silversides


This page contains a single entry from the blog posted on January 23, 2007 4:34 PM.
