Dr. MonkeyIQ: October 2009

Saturday, October 31, 2009

White lightning in triplicate

Recently I started hacking on a memory mapped, multi_index soprano backend. While adding triples and using listStatements() should work fine, implementing SPARQL is making for interesting times.

I started out allowing a single triple match with a filter(regex()) to restrict results. And this worked rather well, making the first one free as they say. So, noticing the little white rabbit that seemed to disappear into the SPARQL bushes, I decided to join in the high tea and mercury sniffing that so induces sanity. Over the course of version 0.0.1 to 0.0.5 the SPARQL code is becoming better, little by little. The code is up at my sf.net page. But don't blame me if your SPARQL is not implemented yet or your triples somehow disappear.

Anyway, here is a little benchmark session. I'm using the data set generator and queries found here. To make the data I use


$ cd /usr/local/java/bsbmtools
$ cat run.sh
#!/bin/bash
java -cp bin:lib/ssj.jar benchmark.generator.Generator "$@"
$ ./run.sh -fc -pc 1000 -s nt
$ mv dataset.nt  thousand-prods.nt
$ mkdir -p /tmp/RDFBENCH
$ cd /tmp/RDFBENCH
$ mkdir mmap redland

Queries are run multiple times to ensure a hot disk cache. This is on a 3 disk RAID-5 and an Intel Q6600 with 8gb RAM.
The last query is not optimized properly in boostmmap yet, so it's far slower than it rightly should be. For benchmarking the boostmmap backend...


$ cd /tmp/RDFBENCH/mmap
$ time sopranocmd --backend boostmmap \
  --serialization ntriples \
  import /usr/local/java/bsbmtools/thousand-prods.nt >|out 2>&1

real    1m49.642s
210M     triples.mmap*

$ time sopranocmd \
  --backend boostmmap \
  list "" '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>' \
  '<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/product>' \
   >| /tmp/out 2>&1

real    0m0.103s
grep Product /tmp/out | wc -l
1001

## based on Query 6
$ time sopranocmd \
  --backend boostmmap query \
"
select ?what ?lab
where
{
  ?what http://www.w3.org/2000/01/rdf-schema#label ?lab .
  filter( regex( str( ?lab ), 'excites' ))
}"
?lab -> <yawned%20excites%20deflower>;
  ?what -> <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/productfeature295>
?lab -> <goofs%20excites%20enigmata>;
  ?what -> <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/productfeature3276>

real    0m0.091s


$ time sopranocmd --backend boostmmap query \
"
prefix bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix dc: <http://purl.org/dc/elements/1.1/>
select ?offer ?price
where {
    ?offer  bsbm:product http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer1/Product5 .
    ?offer  bsbm:vendor ?vendor .
    ?vendor bsbm:country http://downlode.org/rdf/iso-3166/countries#ES .
    ?offer  dc:publisher ?vendor .
    ?offer  bsbm:price ?price .
}"
0.93sec

Note that this 0.9 seconds is shameful and needs to be optimized back to <0.1 seconds.

For redland,


$ cd /tmp/RDFBENCH/redland
$ time sopranocmd --backend redland \
 --serialization ntriples \
 import /usr/local/java/bsbmtools/thousand-prods.nt \
 >|/tmp/out 2>&1

real    38m34.735s
480mb

$ time sopranocmd --backend redland \
  list "" \
  '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>' \
  '<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/product>'  \
  >| /tmp/out 2>&1

real    0m0.096s
grep Product /tmp/out | wc -l
1000

So for just listStatements() redland and mmap are fairly equal in performance. Which, for a single indexed lookup, you might expect. In libferris I had restricted RDF usage to raw triple probes like this because I used redland directly prior to version 1.4.x of libferris.
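
To make the "single indexed lookup" point concrete, here is a toy sketch of a (predicate, object) index in bash 4+. It is not how boostmmap or redland store anything; the function names and the ex: identifiers are invented for illustration. The idea is only that once triples are bucketed by predicate and object, a probe like the list command above is one hash lookup rather than a scan.

```bash
#!/bin/bash
# Toy sketch only: a (predicate, object) -> subjects index held in a
# bash 4+ associative array. All names here are hypothetical.
declare -A po_index

add_triple() {
    local s="$1" p="$2" o="$3"
    # Append the subject to the bucket for this predicate/object pair.
    po_index["$p $o"]+="$s"$'\n'
}

list_subjects() {
    # A single hash probe returns every subject stored for the pair.
    printf '%s' "${po_index["$1 $2"]}"
}

add_triple ex:product1 rdf:type ex:Product
add_triple ex:product2 rdf:type ex:Product
add_triple ex:offer1   rdf:type ex:Offer

# Prints ex:product1 and ex:product2, one per line.
list_subjects rdf:type ex:Product
```

A real store keeps several such indexes (by subject, by predicate, and so on) so that whichever parts of the triple are given, some index can answer the probe directly.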

So for SPARQL,


## based on Query 6
$ time sopranocmd --backend redland query \
"
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?what ?lab
where
{
  ?what rdfs:label ?lab .
  filter( regex( str( ?lab ), 'excites' ))
}"
what -> <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/productfeature295>;
   lab -> "yawned excites deflower"
what -> <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/productfeature3276>;
   lab -> "goofs excites enigmata"
real    0m3.855s

Gah, and I didn't slip up and put the 3 on the left side of the dot there. We are talking about 0.1 seconds for boostmmap against 3.86 seconds for redland.


$ time sopranocmd --backend redland query \
"
prefix bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix dc: <http://purl.org/dc/elements/1.1/>
select ?offer ?price
where {
      ?offer bsbm:product <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer1/Product5> .
      ?offer bsbm:price ?price .
      ?offer bsbm:vendor ?vendor .
      ?offer dc:publisher ?vendor .
      ?vendor bsbm:country <http://downlode.org/rdf/iso-3166/countries#ES> .
}"
real    0m7.134s

Since this query doesn't work well on boostmmap it only goes from 1 to 7 seconds. But I think I can resolve it in much, much less time than 1 second. This is not meant to make redland look bad; its SPARQL implementation is much more complete than boostmmap's will likely be any time soon. Creating an optimal query plan for the full SPARQL language will be an interesting challenge.
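
The figures above come from repeated runs with a hot disk cache. A small helper along these lines can script that; it is a sketch, the name time_runs is made up, and it relies only on bash's own time keyword and TIMEFORMAT variable.

```bash
#!/bin/bash
# Sketch: run a command N times and print the wall-clock seconds of each run.
# Early runs warm the disk cache; the last line is the "hot" figure.
time_runs() {
    local n="$1" i
    shift
    local TIMEFORMAT='%R'   # make the time keyword print elapsed seconds only
    for (( i = 1; i <= n; i++ )); do
        printf 'run %d: ' "$i"
        # The command's own output is discarded; the timing line, which
        # bash writes to stderr, is folded back into stdout.
        { time "$@" > /dev/null 2>&1; } 2>&1
    done
}

# Harmless demo. For the benchmark, substitute one of the sopranocmd
# invocations shown above as the command to run.
time_runs 3 ls /
```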

Development might be bursty as I don't know what time I can spare for improving the SPARQL completeness in the short term.

I give you the CLAW!

Clawmotia, an edje MythTV remote control.
Sources hot off the presses: clawmotia-0.0.1. You'll also notice the diablo binary for clawmotia on there so you don't need a scratchbox to get going. I put the clawmotia.edj up too so you don't even need to compile that if you don't wanna.

Dependencies on the device: Qt, qedje. You probably have most of the requirements for the latter installed if you have canola2 on the device. There are deb files for qedje on its web site. Qt is also packaged for the n810. I don't have debs for clawmotia yet, feel free to send the debian directory to do that.

You'll obviously need a MythTV server (and client) to use the remote. It uses the web interface to mythtv so you'll want that installed on the server machine too. The various configuration settings are passed in using environment variables in version 0.0.1. See the script start-clawmotia.sh for these settings. You will also need to be in the same directory as the .edj file when you start clawmotia as it expects that at this point. I install the script, binary and edj into /usr/local/bin on the device.

Setting CLAWMOTIA_FS=1 will start clawmotia full screen. There is a bug in version 0.0.1 where it will start but not be shown if started full screen. You have to switch to another app and back to clawmotia to see it. I was planning to fix this before release, but it seems some folks want to have a tinker with it so I'll fix that bug soon.

If you want to change the buttons or layout, you only need to edit the clawmotia.edc file. If you are good with the gimp, you might like to create a theme at 800x480. As long as you use layers it should be a fairly mechanical process to create an edc file for your theme.

To make a button, define its image in the images section and call the three macros COMMAND_BUTTON, COMMAND_BUTTON_ICON, and COMMAND_PROGRAMS. COMMAND_BUTTON sets up the area of the screen that will react to your button, with an optional text label for the button as arg2. COMMAND_BUTTON_ICON links a COMMAND_BUTTON with an image file. And COMMAND_PROGRAMS links a button with the exact, HTTP encoded string to send to the MythTV server. HTTP encoding doesn't affect single keys much, but Ctrl+w becomes Ctrl%2Bw. The first argument to all three macros is the button name and should be the same for the same button. Taking a look at what is there and remembering to use all three macros to give (a) initial setup, (b) image file, (c) action should get you on the way.
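
If you are unsure what the encoded form of a key combination looks like, a few lines of bash can work it out. This is a generic percent-encoder sketch for ASCII input, not code from clawmotia; the function name urlencode is made up, and exactly which characters the MythTV web interface insists on having escaped is an assumption to check against your server.

```bash
#!/bin/bash
# Sketch: percent-encode every character outside the unreserved set
# (letters, digits, '-', '.', '_', '~'). ASCII input is assumed.
urlencode() {
    local LC_ALL=C
    local s="$1" out="" c i
    for (( i = 0; i < ${#s}; i++ )); do
        c="${s:i:1}"
        case "$c" in
            [a-zA-Z0-9.~_-]) out+="$c" ;;
            *)  # "'x" makes printf use the character's numeric code
                printf -v c '%%%02X' "'$c"
                out+="$c" ;;
        esac
    done
    printf '%s\n' "$out"
}

urlencode 'Ctrl+w'   # prints Ctrl%2Bw
urlencode 'p'        # single keys come through unchanged: p
```

The output of the first call is the string that would go into COMMAND_PROGRAMS for that button.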

You can update the theme by compiling it with edje_cc and copying it to the device. The application itself shouldn't need any changes.

Note that the bevel buttons are separate from the actual button function image. So if you like you can remove the bevel by setting 255 to 0 in the lines
color, 0 0 0 255
in the COMMAND_BUTTON definition in clawmotia.edc

I'll probably move to cmake or autofools at some point. In the odd case that there are many patches flowing in, I'll throw the code into subversion somewhere.

The Rasterman has recommended cloning a few hardware remotes as themes. I'm not sure when I'll have the time for that but it should be interesting.

Oh yeah, and if you like this then feel free to contribute toward a memory card or IR transmitter for my future tinkering. Or send any old n900s you have lying around ;) Hey... I can dream right?

Thursday, October 29, 2009

I want my MythTV: Clawmotia

I noticed that there are many packages which turn a maemo device into a remote control. Unfortunately, the MythTV ones I saw either didn't install or were not what I was looking for. Thus clawmotia was born yesterday.

I'm using edje for the interface and qedje to actually create the UI. The program logic uses Qt to talk to a MythTV server and send the commands to it. This relies on knowing where your server is and what client you want to control, both of which are passed in as environment variables. Really simple stuff, but it works well.

You'll need the Web interface on the MythTV server machine too. The upshot is that you only need a wifi connection and you can control any MythTV client you want. No bluetooth or IR dongles needed.

As you can see, there are some nasty graphical artifacts on the buttons. I'm not sure if it's the evas engine or something to do with qedje. The black parts on the buttons are not there on the desktop.

As it uses edje, you can create different themes and layouts, compile the edje on your desktop machine and just scp it to the device to change the layout and button functionality. Including cute little slide in and out panels for the more rarely used controls. Hopefully I can convince somebody with more artistic ability than me to do just that. Source will be released in the next few days.

Welcome to the machine...

Warning: this is a hello p.kde post, so now I can read minds; the page down key never looked so appealing!

OK, so I've been working on filemanagers for "a long time now". Back in the days when the Amiga was hot and hardware memory managers were a wonderful thing that would make a segv not take down the whole machine. Over the last roughly ten years I've been hacking on libferris. In that time it has expanded from a virtual filesystem to include index and search, virtualized extended attributes, RDF support, FUSE and rsync support and other party tricks.

Those with a keen eye will be asking why I am on this KDE planet when libferris "competes" with KIO, Strigi, Soprano etc. I have quoted compete because there is no clear winner and loser in projects that you do mainly for technical enjoyment. And there is no reason that libferris cannot be used in a mixed fashion with other projects.

The last major release of libferris moved its RDF handling from redland to soprano, so at least the triples can be freely intermixed between projects now. On that note, I've recently created a new backend for soprano to use a memory mapped file with multi indexing as the triple store. See the ferris sf.net page for the boostmmapbackend tarball. The backend is in early days, but I'll take it for a walk on the n810 soon to see how well it works in that environment.

I am planning on adding spatial indexing and other tricks to the soprano backend which I've used for my maemo libferris metadata index module:

libferris maemo audio search by regex on URL from Ben Martin on Vimeo.

Friday, October 23, 2009

RDF on the device

I've started work on a memory mapped soprano RDF backend. Given that mobile devices use flash for their permanent storage, an RDF backend designed using primary storage algorithms should work well on maemo.

While the SPARQL implementation is "far from complete", it can already evaluate some common queries very, very quickly. I have some triple matching and the ability to have multiple regex filter statements. Other filters and more complete SPARQL language support should follow in time, as time permits... patches accepted etc and so on.

For those interested, see soprano-boostmmapbackend on my main sf.net page.

Wednesday, October 14, 2009

Libferris, Soprano, Extended Attributes... the Ménage à trois

If you want to store metadata in a filesystem, there are Extended Attributes (EA). The kernel interface allows you to store key=value metadata in these EAs for each file on your filesystem. The catch is that kernel EA are limited in size, sometimes performance is poor, and some systems do not support EA either natively or in the kernels that Linux distributions ship. The classic example of the latter is NFS, which can have EA support patched in, but many distros do not do that.

In libferris the EA interface is virtualized along with the filesystem. So if you are using XFS (or ext3/4 with the right options) then libferris will let you read and write EA to the kernel filesystem. For other filesystems, libferris stores the EA behind the scenes in RDF for you. The difference is not seen by applications, it's all just EA... magically every filesystem supports read/write EA.

Version 1.4.0 and above of libferris use Soprano and optionally Nepomuk for RDF support. To take RDF for a spin from the filesystem let's use FerrisFUSE and the normal console tools...

$ mkdir -p /tmp/RDFTESTING/backing /tmp/RDFTESTING/fs
$ date >| /tmp/RDFTESTING/backing/df1.txt
$ ferrisfs -u /tmp/RDFTESTING/backing /tmp/RDFTESTING/fs

As you can see, backing is where the filesystem is and fs is where you can access backing through libferris. You could just as easily use an HTTP server or emacs as your backing filesystem; anything that libferris can see is up for grabs.

The below uses the attr command to set and get an Extended Attribute, assuming that the /tmp kernel filesystem does not allow EA to be set by users. If in doubt, use an NFS directory and you'll almost certainly not be able to attr -s directly on the backing filesystem.

$ cd /tmp/RDFTESTING/fs
$ cat df1.txt
Wed Oct 14 22:33:04 EST 2009
$ attr -s foo -V bar df1.txt
Attribute "foo" set to a 3 byte value for df1.txt:
bar
$ attr -g foo df1.txt
Attribute "foo" had a 3 byte value for df1.txt:
bar

So, you might ask where does all this metadata go and come from. And what does the RDF schema look like... The best solution would be to use SPARQL to query the data, but the default store is still a redland one with libferris 1.4.0 and its SPARQL is very, very slow: 1+ minutes for a simple query on this data vs <2 seconds using the sesame2 soprano backend. So the fastest way to explore the redland RDF store is to export the whole RDF store and grep it for now. Hopefully virtuoso and/or my own soprano backend will save the day in the future :/

$ cd ~/.ferris/rdfdb
$ time sopranocmd --backend redland \
--settings name=myrdf export t

I have changed the URIs to use prefixes in the grep output... as you would expect, there is data attached to the df1.txt URL.

$ grep RDFTE t
ferris:uuid ferris:93f22bd8-b8be-11de-8e06-001bfc4f043c .

That UUID node has an mtime and an out-of-band-ea bnode.

$ grep 93f22bd8-b8be-11de-8e06-001bfc4f043c t
ferris:93f22bd8-b8be-11de-8e06-001bfc4f043c ferris:mtime "1255523953"^^ .
ferris:93f22bd8-b8be-11de-8e06-001bfc4f043c ferris:out-of-band-ea _:r1255523601r5448r1 .

And the bnode has the EA foo=bar set on it.

$ grep r1255523601r5448r1 t
_:r1255523601r5448r1 ferris:user.foo "bar"^^ .

As you see, the UUID node has an mtime associated with it. This way libferris can tell if you have updated any RDF values for a file, and it becomes like an additional ctime check available to libferris, used for example when indexing files.
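
The three greps above can be chained into one helper. This is a sketch against a prefix-shortened export like the one shown, with one whitespace-separated triple per line; the function name ea_for_uuid is made up, datatype suffixes are left off the sample, and values containing spaces would need a real N-Triples parser.

```bash
#!/bin/bash
# Sketch: follow UUID node -> out-of-band-ea bnode -> EA triples.
ea_for_uuid() {
    local uuid="$1" file="$2" bnode
    # Step 1: find the bnode hanging off the UUID node.
    bnode=$(awk -v s="$uuid" \
        '$1 == s && $2 == "ferris:out-of-band-ea" { print $3 }' "$file")
    [ -n "$bnode" ] || return 1
    # Step 2: print every predicate/value pair stored on that bnode.
    awk -v b="$bnode" '$1 == b { print $2, $3 }' "$file"
}

# Sample data copied from the grep output above, datatypes omitted.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ferris:93f22bd8-b8be-11de-8e06-001bfc4f043c ferris:mtime "1255523953" .
ferris:93f22bd8-b8be-11de-8e06-001bfc4f043c ferris:out-of-band-ea _:r1255523601r5448r1 .
_:r1255523601r5448r1 ferris:user.foo "bar" .
EOF

# Prints: ferris:user.foo "bar"
ea_for_uuid ferris:93f22bd8-b8be-11de-8e06-001bfc4f043c "$sample"
rm -f "$sample"
```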

The gain of having the UUID node use a bnode is that many files can share the same RDF metadata. This is useful if you can access the same file from multiple paths or if files get moved on file servers and you want to relink the old RDF metadata to the file with the new path. I mention file servers here because libferris will track the metadata for you if you use ferrismv/ferriscp, but if somebody moves a file on the file server you've got to have a way to tell libferris about that change.

Fileserver movements are a common enough thing that libferris can automatically relink RDF nodes for you. There are also smushing tools available to help with the task. But that's a story for another post.

Tuesday, October 13, 2009

libferris 1.4.x - Nepomuk!

So finally RDF handling is done using Soprano (The core of KDE4's nepomuk RDF library). Initially, ferris defaults to using redland in-process for RDF, so the code paths are very similar to what 1.3.x of libferris was using.

But the huge gain is that you can use the nepomukserver or the sesame2 soprano backends if you like. I'd also dearly love to write my own soprano backend, and am bashing out a design for it, but that's a story for another post.

I've run some initial benchmarks, and because many of the speed-dependent code paths hit triples directly, giving S+P and looking up Object, I suspect I'm not hitting the main benchmarked paths. For example, I've found that some SPARQL queries that perform set based S+P->O lookups run much faster on the sesame2 soprano backend.

Now if only OBS would hurry up, I'd have Fedora 11 packages of ferris 1.4.x available... but in the meantime, the steaming source tarball is on sf.

Thursday, October 8, 2009

User overlay virtual softlinks

This was introduced in libferris 1.3.8. The feature lets you have libferris create virtual softlinks for you when a directory is read. Overlay links are set up in the config file ~/.ferris/user-overlay-links.xml. The file sets up regex matches; when a regex matches a directory, a virtual link is created to a target location. The target location must actually exist at the moment, no dangling links are permitted. There is an example XML file in the dot-files subdirectory of the tarball.

For the gphoto:// filesystem I have this which creates the link G7 -> Canon PowerShot G7 (PTP mode) in the root of the gphoto:// filesystem.


<user-overlay-links>
  <link-by-regex>
    <vlink>
        <match>^gphoto:[/]+$</match>
        <target>Canon PowerShot G7 (PTP mode)</target>
        <link-name>G7</link-name>
    </vlink>
...

If the camera has a stable subdirectory that new images are written to, then a link right into that can be made in the root. Here, "latest" takes me right to the images subsubdirectory.


...
    <vlink>
        <match>^gphoto:[/]+$</match>
        <target>gphoto://G7/store_00010001/DCIM/102CANON</target>
        <link-name>latest</link-name>
    </vlink>
...

It seemed like a good idea to abstract this out of any specific filesystem like upnp and gphoto so virtual links can be made anywhere. In particular, upnp a/v servers sometimes like to use an insanely long name for themselves. With virtual soft links, the client machine can give the server a nice short name of its choosing.
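
To check a match pattern before putting it in the XML file, bash's =~ operator is a handy stand-in. The sketch below reuses the two match/link-name pairs from the vlink entries above; the function name is made up, and bash uses POSIX extended regular expressions, which may differ in corner cases from whatever regex engine libferris uses.

```bash
#!/bin/bash
# Sketch: print the link names whose regex matches a directory URL.
overlay_links_for() {
    local url="$1" i
    # Patterns and names copied from the two <vlink> entries in the post.
    local regexes=('^gphoto:[/]+$' '^gphoto:[/]+$')
    local names=('G7' 'latest')
    for i in "${!regexes[@]}"; do
        # The unquoted right-hand side is treated as a regular expression.
        if [[ $url =~ ${regexes[i]} ]]; then
            printf '%s\n' "${names[i]}"
        fi
    done
}

overlay_links_for 'gphoto://'          # prints G7 and latest
overlay_links_for 'gphoto://G7/DCIM'   # prints nothing: not the root
```

Anchoring with ^ and $ is what keeps the links confined to the root of the gphoto:// filesystem rather than appearing in every subdirectory.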