Tuesday, June 12, 2012

Putting your Web hits on the map with libferris and XQuery

This was originally published on LDN back in 2009. Unfortunately the full original article is no longer available, its link now redirects to a generic page instead. Since it's useful content, I've included it here too.

Learn how to take the IP addresses and other information from an apache access.log file and see those IP addresses as place marks with Google Earth and Google Maps. We'll use XQuery to do this instead of Perl or your favourite scripting language. We won't have to worry about breaking up the lines in access.log or opening connections to relational databases to lookup the location of IP addresses in a table -- with the recent advancements in libferris both access.log and relational databases are both already at your fingertips from an XQuery.

Because this article touches on many technologies, I'll omit an introduction to XQuery. You should be able to get the gist of the XQuery code shown at the end of the article, which I'll also describe. But the details of XQuery are left for you to explore in other tutorials.

Because we want to deal with geographic data, specifically IP addresses placed at locations and cities, using KML format XML files makes sense. KML files can be loaded and saved by Google Earth and Google maps, making them a good format for showing graphically on maps where IP addresses are. This is another reason that XQuery works well, you can easily generate an XML document directly from XQuery.

The libferris project started about 10 years ago with the goal of creating a powerful virtual filesystem. Along the way it has gained the ability to mount XML, relational databases, Berkeley db and other ISAM files, and anything else that makes sense. While having everything available as a virtual filesystem is nice, at times you might prefer to use a different interface than a filesystem. So you can now access any libferris filesystem as a virtual XML Document Object Model (DOM), an XQilla document, an SQLite virtual table, or through XSLT.

The traditional "mounting" filesystem operation is mostly handled for you by libferris. If you access a directory and libferris knows what to do to mount it in a meaningful way then it does so. For example, listing postgresql://localhost/mydb will list all of the tables in the mydb database on localhost. No mounting is needed, just grab the data you want. This works well when using libferris with XQuery because you can just peek into any libferris filesystem directly.

I'll talk about how to map an IP address to city and a latitude, longitude pair first, then turn attention to getting at the data contained in apache's access.log files from libferris, and finally turn to XQuery to bring it all together.

IP to Location


To resolve an IP address to the city, country and digital longitude and latitude values I'll use the free IP address geolocation SQL database. The free IP address database download is targeted to a MySQL database. I used the mysql2pgsql tool to convert the SQL file and import into a PostgreSQL database. The main reasons for this are that I prefer PostgreSQL and libferris has better support for mounting PostgreSQL as a filesystem than other relational database servers. Since I am the author of libferris, these two reasons are not unrelated. Of particular interest for this article, you can call a PostgreSQL database function through the libferris filesystem interface which is not currently implemented for other databases.

The IP address database uses the MySQL inet_aton() function. A replacement for PostgreSQL can be found here. The final touch is to wrap up the IP to geographic information into a custom PostgreSQL function as shown below. First the return type of the function is defined, cc is the country code, reg is the region, ipn is the numerical version of the dotted IP address and ip is the dotted IP address. The first select statement illustrates this, 69.90.118.0 is the dotted IP address and 1163556352 would be the numerical version (ipn).

The ipgeo function takes a dotted IP address and returns the location information if available as shown in the final example query.

select inet_aton('69.90.118.0') as ipn
------------
 1163556352

create type ipgeoresult as 
  ( cc varchar, 
   reg varchar, 
  city varchar, 
   lat double precision, 
  long double precision, 
   ipn bigint, 
    ip text );

CREATE OR REPLACE FUNCTION ipgeo( ip varchar ) RETURNS ipgeoresult AS '
 select country_code as cc,
    region_code as reg,
    city,latitude as lat,longitude as long,
    ip_start as ipn,inet_ntoa(ip_start) as ip  
 from ip_group_city 
 where ip_start <= inet_aton($1) 
 order by ip_start desc 
 limit 1;
'
LANGUAGE 'SQL';

select city,lat,long from ipgeo('69.90.118.0');
   city   |  lat  |  long  
----------+-------+--------
 Edmonton | 53.55 | -113.5

The ferrisls command should now be able to find geoinfo from an IP address by calling the ipgeo() PostgreSQL function through the filesystem interface as shown below. The ferrisls command works like the normal ls(1) command with some extensions. As you can see, using --xml tells ferrisls to print its output as an XML document.

$ ferrisls --xml  "pg://localhost/ipinfodb/ipgeo('69.90.111.0')"
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<ferrisls>

  <ferrisls cc="" city="" ip="" ipn="" lat="" long="" name="ipgeo('69.90.111.0')" primary-key="cc-reg-city-lat-long-ipn-ip" reg="" url="pg:///localhost/ipinfodb/ipgeo('69.90.111.0')">
    
     <context cc="US" city="New York" ip="69.90.111.0"
     ipn="1163554560" 
     lat="40.6888" long="-74.0203" name="US-36-New York-40.6888--74.0203-1163554560-69.90.111.0" primary-key="cc-reg-city-lat-long-ipn-ip" reg="36"/>
  </ferrisls>

</ferrisls>


Breaking access.log
Now that we have IP address to geoinfo stuff sorted out, we need to be able to get at the information in an access.log file. XQuery is not well suited to breaking up plain text files though. Luckily we can shift that work onto libferris, getting a much more XML agreeable version of the information in access.log for free.

As detailed in a previous article on libferris and SQLite, libferris can mount log files as filesystems. And as mentioned above, once you have a libferris filesystem, you also have a virtual document that can accessed from an XQuery. The below commands tell libferris that the access_log file in the current directory has the structure of an apache access.log file and that it should try to mount it that way if you treat the file like it was a directory.

$ echo -n apache_access_log \
   | ferris-redirect -T -x --ea=ferris-recordfile-metadata-url access_log
$ echo -n recordfile \
   | ferris-redirect -T --ea=ferris-type access_log
The ferrisls command can be used to peek into the log file to make sure your libferris is setup correctly and mounting the log file as expected. This is shown in the command below. Notice that the pieces of information from a line in access.log are represented as XML attributes and form a single XML element. The XML attributes all have useful names, complements of libferris using the metadata associated with the apache_access_log file type.

Also notice that there is a date-epoch field in the output which is not explicitly stored in the access.log. As part of the apache_access_log settings, libferris recognizes the date field as a timestamp and creates date-epoch which contains the value of date as the number of seconds since 1970. Having the epoch as an integer can be convenient when you are importing the data into another tool.

$ ferrisls --xml access_log 
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<ferrisls>
...
    <context date="17/Mar/2009:12:24:15 +0100" date-epoch="1237256655"
      ip="78.111.111.111"
      logname="-" name="4" referer="-" req="GET /.../" 
      response="400" sz="385" user="-" user-agent="-"/>
...
For more details on mounting log files, see the previous article on libferris and SQLite on linux.com.

XQuery -- Bringing it all Together

To access a document from an XQuery, you use the doc() function. If you are resolving your XQuery with libferris, there is the ferris-doc() function which makes the named libferris filesystem available as though it was XML.

The entire XQuery to generate a KML file from an access_log is shown below. The core of the query is the first 4 lines, the rest just picks out data from the results that were found in the first lines and creates the correct KML file structure.

The first for loop iterates over each line in the log file, for each of these lines the $ip is set to just the value of the dotted IP address from the log file. Using the data() function casts away the fact that $res/@ip is an XML attribute in the source. After the data() function is applied, $ip is just a string containing the IP address. The next line uses the PostgreSQL database to lookup the geoinfo for the IP address. Notice that we are actually grabbing an entire tuple into $ipinfo from the relational database here, including the city, country, longitude and latitude. The compare line silently ignores IP addresses where we do not have the valid geoinfo. You might like to flag these IP addresses with another XQuery because if they are unknown they might be worth investigating.

<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
 {
  for $res in ferris-doc( "/.../access_log" )/*
	let $ip := data($res/@ip)
	let $ipinfo := ferris-doc( concat( "pg://localhost/ipinfodb/ipgeo('", $ip, "')" ) )/*
	where compare($ipinfo/@city,"") != 0
       return
    <Placemark> 
	<name>{ $ip }</name>
	<description>
		A hit from {data($ipinfo/@city)}
		<br/>
			{ $ip }
		<br/>
			At: { data($res/@date) }
		<br/>
			user-agent: { data($res/@user-agent) }
		<br/>
		{
			let $r := data($res/@referer)
			where $r != "-"
			return <a href="{ $r }">{ $r }</a>
		}
	</description>
	<Point>
		<coordinates>
			{concat(data($ipinfo/@long),",",data($ipinfo/@lat))}
		</coordinates>
	</Point>
    </Placemark>
 }
</Document>
</kml> 

You might like to improve these examples to take the path of the access_log file as a parameter instead of hard coding it. To transform the access_log file referenced in the above XQuery into a KML file simply issue the below command.

ferris-xqilla apache-access-log-to-kml.xq >| access_log.kml
The above XQuery will generate a valid KML file when used against an apache access.log file. When you load the KML file into Google Earth, after exploring around and clicking on place markers a few times you might notice the flaw to the above query. Each individual access by an IP address causes a new place marker in the KML file. Thinking about this with our SQL hat on, what we would like to do is "group by" IP address first.

This SQL "group by" is expressed in the XQuery below. Firstly we pick out all the unique IP addresses and perform a single geoinfo lookup for each of those IP addresses. As you can see in the creation of the description element, the query searches over the log again to make the click bubble for a place marker show information about how often and when each IP address has accessed our site.

<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
 {
  for $ip in distinct-values( ferris-doc( "/.../access_log" )/*/@ip )
  let $ipinfo := ferris-doc( concat( "pg://localhost/ipinfodb/ipgeo('", $ip, "')" ) )/*
  where compare($ipinfo/@city,"") != 0
  return
    <Placemark> 
	<name>{ $ip }</name>
	<description>
		A hit from {data($ipinfo/@city)}
		<br/>
			{ $ip }
		<br/>
		{
		for $res in ferris-doc( "/.../access_log" )/*[@ip = $ip]
		return <p>At: { data($res/@date) }
			{
				let $r := data($res/@referer)
				where $r != "-"
				return <a href="{ $r }">{ $r }</a>
			}
			</p>
		}
	</description>
	<Point>
		<coordinates>
			{concat(data($ipinfo/@long),",",data($ipinfo/@lat))}
		</coordinates>
	</Point>
    </Placemark>
 }
</Document>
</kml> 

Wrap up

In the past, the focus has been on giving libferris the ability to see and interact with more data sources as filesystems. Recently that focus has also extended to making any libferris filesystem available to other systems, be it SQLite, XQuery, FUSE, or as a massive, virtual, Xerces-C document. The ability of libferris to get its hands on data from a wide array of places frees you from having to pick things apart in a script, perform temporary storage management for the data and bring the data together yourself in a that script. Just use SQL or XQuery and let libferris do the heavy lifting of data access for you.

Other possibilities include extending the scripts to include details of user logins and session times and possibly combine the Timeline widget with Google maps to investigate changes over time. For example, using timemap. Work on the XQilla integration for resolving XQueries in libferris is still under development. In particular, query times are likely to improve as bottlenecks are found and fixed.



No comments: