Fork me on GitHub
Blog

Efficient Time-based Range Queries in CouchDB using GeoCouch

Original article in german: ioexception.de

When it comes to queries, it is important to repress the mental model of any SQL databases you might have used so far. CouchDB provides a completely different way where you have to write MapReduce-based functions that create ordered lists as view to your data. For more complex queries, including some join-like requests, view collation provides a powerful yet tricky solution. One weakness of CouchDB that can’t be solved even using view collation is the absence of any range queries in continuous spaces. We faced this problem while working on the diretto service implementation and eventually found a proper solution.

In diretto, all documents have a moment where they have been created, or – for continuous media like videos – a moment where the creation has been started. However, it is hard to tell the exact time without properly synchronized clocks in cameras, recorders etc. When a document is received in hindsight by a third person, telling the correct moment might even be harder. So beside of accepting an exact moment, we also provide support for periods, like “that photo has been taken between 9:11 and 9:23″. Now users of the platform can not just suggest alternative values, they can also vote for the information they most likely believe in. Eventually, this should provide a collaboratively created approximation of the correct information.

… with CouchDB

Sounds good in theory, but when it comes to the queries like “give me all documents that have been presumably taken between 9:00 and 9:30″, you’re stranded with regular CouchDB. There is a workaround that divides the time into distinct periods and then emits a key for each segment the document is inside. This might work in our example, but it is obvious that this will emit a lot of keys and might not scale when there are long durations and fine-grained partitionings.

Here is an example document:

{
   "_id": "s-ffc0b6b0-59d4-4a3b-ad36-7ec05e7db1de",
   "begin": "2010-08-05T09:11:52.156Z",
   "end": "2010-08-05T09:23:13.457Z"
}

Map function for a key emission every 5 minutes:

//length of time segment (here 5 min)
var periodLength = (60*5);

function(doc)
{
	if(doc.begin && doc.end)
	{
		//start and end time as UNIX timestamps (seconds, not milliseconds)
		var begin =  Math.round(new Date(doc.begin).getTime()/1000);
		var end =  Math.round(new Date(doc.end).getTime()/1000);

		//calculate first matching segment of period
		var p = (begin - (begin%periodLength));

		//emit key for each matching period
		while(p<=end)
		{
			emit([p, doc._id], null);
			p = p + periodLength;
		}
	}
}

The view then correctly returns:

http://localhost:5984/entries/_design/entries/_view/docsByPeriodList?startkey=[1280998800]&endkey=[1281000193,{}]

{"total_rows":3,"update_seq":2,"offset":0,"rows":[
{"id":"s-ffc0b6b0-59d4-4a3b-ad36-7ec05e7db1de","key":[1280999400,"s-ffc0b6b0-59d4-4a3b-ad36-7ec05e7db1de"]15,"value":null},
{"id":"s-ffc0b6b0-59d4-4a3b-ad36-7ec05e7db1de","key":[1280999700,"s-ffc0b6b0-59d4-4a3b-ad36-7ec05e7db1de"],"value":null},
{"id":"s-ffc0b6b0-59d4-4a3b-ad36-7ec05e7db1de","key":[1281000000,"s-ffc0b6b0-59d4-4a3b-ad36-7ec05e7db1de"],"value":null}
]}

There are the matching 5 minute slots beginning at 9:10, 9:15 and 9:20.

The following sketch illustrates the B-tree containing entries emitted by period (note that the grayed-out key is an exemplary entry for other documents in the view and is not shown in other example outputs.):

… and way better with GeoCouch

We have the same problem for locations, where we also have various suggestions and inaccuarte data. Due to the fact that two-dimensional data (like a WGS84-encoded position) also can’t be indexed efficiently in CouchDB’s B-Trees, we switched to a fork by Volker Mische that is additionally backed by a R-Tree that enables multi-dimensional indexing (currently, it only supports two dimensions, but hopefully it will support more in the future). As a nice side-effect, GeoCouch also allows for spatial searches using bounding boxes. So you can search for spatial features within a limited area. This gave us the idea to apply these functions to our temporal indexing problem. !GeoCouch’s R-Tree is not bound to geographical values (however to two dimensions), so an entry could be any point or rectangle (bounding box) indexed by numerical positions.

GeoCouch’s spatial index expects GeoJSON entries (which are also not bound to any distinct format like WGS84), so we emit our time periods as positions with bounding boxes. The longitudinal values are ignored and set to 0 and the latitudinal values represent the temporal start and end point encoded as time stamps. Thus, we are now able to efficently query all entries within a given period.

Period spatial-based query:

function(doc)
{
	if(doc.begin && doc.end)
	{
		//start and end time as UNIX timestamps (seconds, not milliseconds)
		var begin =  Math.round(new Date(doc.begin).getTime()/1000);
		var end =  Math.round(new Date(doc.end).getTime()/1000);

		emit(
			{
				type: "Point",
				bbox : [0,start,0,end]
			}, null
		);
	}
}

Spatial query using start and end timestamp as range delimiters:

http://localhost:5984/entries/_design/entries/_spatial/entriesByPeriod?bbox=0,1280998800,0,1281000600

{"update_seq":16,"rows":[
{"id":"s-ffc0b6b0-59d4-4a3b-ad36-7ec05e7db1de","bbox":[0,1280999512,0,1281000193],"value":null}
]}

As you can see, GeoCouch does not only allow us to store such entries with ranging values, it also also to query them easily.

 

The technologies behind the diretto Platform – Part 4: PubSubHubbub

HTTP is an exemplary request-response protocol based upon a client/server architecture. Clients, for instance web browsers, dispatch requests to identifiable resources (URIs). In most cases, they want to obtain a representation of a resource as a response from the server. But clients can also submit data to these resources or even create resources as a result.

But one important restriction to that model is that there is no other message exchange pattern than request/response. Also, such an interaction can only be initiated by the client. As long as there is no incoming request from a client, the server is not able to transer any data to this client. There are several workarounds where a server can push content to a client within an existing application session like long-held unclosed HTTP requests. Also the upcoming WebSocket standard aims at this problem by introducing an easier two-way communication between client and server.

In the real world, there are often scenarios where a server knows that a resource has changed and it might even know which of its clients might be interested in this change, but it cannot publish the information. The only way to this with plain HTTP so far is to configure clients to poll actively. Each client interested in changes must periodically poll the server for changes. A task that almost all of our RSS feed reader do all the time. It is obvious that this neither is performant nor scales very well.

Other technologies like XMPP or most of the message oriented middlewares don’t restrict nodes to the strict roles of requestors and replier. These technologies are predestinated for publish/subscribe flows.

For diretto, we have been facing the problem that on the one hand, we wanted to build our whole architecture on HTTP. On the other hand, we wished real-time notifications for our clients. Thanks to a Google-initiated project called PubSubHubbub, there is a simple, open, publish/subscribe protocol only based on vanilla HTTP.

The trick to PubSubHubbub is the fact that all nodes – publisher, subscriber and relaying hubs are both – HTTP server and client. Thus, the distinction between clients and servers is overcome. Details to the PubSubHubbub technology can be found here on the project’s site.

In the diretto architecture, we implement a publisher service in our feed node, which publishes the creation of new documents and the completed uploads of attachments. And a special client extension provides access to the subscriber functionality, which triggers callbacks for such events.

The following video from Google describes the PubSubHubbub idea in short: http://www.youtube.com/watch?v=B5kHx0rGkec.

 

GPS and Java: Parsing NMEA Data

So, we got our images and are ready to upload, if it were not for one thing: Where the hell are we anyway? Introducing: The GPS receiver and Java. A romantic drama in three acts.

Step 1: Acquiring Data

Practically all GPS receivers I know share their data with the rest of the world via a serial interface — the more modern ones with integrated serial-to-usb-converters, of course. Though your mileage may vary, omitting the java.io package and going straight for gnu.io.rxtx will prove likely to save you quite some frustration.

Since you want your GPS information in the NMEA 0183 format, you will need to set the serial interface properties accordingly:

setSerialPortParams(4800, SerialPort.DATABITS_8,
  SerialPort.STOPBITS_1, SerialPort.PARITY_NONE);

Theoretically, the NMEA data now rushing through your console is all you need to parse. To make things more orderly, though, I can recommend the Java NMEA API, which allows targeting of specific NMEA sentences. Reading their documentation is recommended at any rate, if you’re new into GPS data acquisition.

Side note: Should you experience troubles with USB GPS receivers being recognized by your Linux system, but !RxTx shelling out errors like “No such port”, make sure to set the permissions accordingly — chmod 666 /dev/ttyUSB0 and chgrp tty /dev/ttyUSB0 might help.

Step 2: What do we need, anyway?

Once connected, your console will be flooded with NMEA data, probably looking like this:

$GPRMC,101836.000,A,4825.1856,N,00956.8032,E,0.23,226.38,230710,,*07
$GPVTG,226.38,T,,M,0.23,N,0.4,K*68
$GPGGA,101837.000,4825.1857,N,00956.8023,E,1,04,1.4,567.4,M,48.0,M,,0000*5E
$GPGSA,A,3,17,05,08,18,,,,,,,,,8.6,1.4,8.5*36
$GPRMC,101837.000,A,4825.1857,N,00956.8023,E,0.57,292.18,230710,,*09

Intimidating as this might look, it’s all you need to determine your position. Anything starting with $GP is GPS data according to the NMEA 0183 specification, and the following three characters further specify what kind of information the sentence contains.

All we will look into from now is the GGA sentence, since it contains everything we need. Seperated by commas, we find:

  • Current time in UTC
  • Latitude and identifier N or S
  • Longitude and identifier E or W
  • Quality of measurement (0 == invalid, 1 == valid GPS, 2 == valid DGPS)
  • Horizontal Dilution of Precision
  • Height above sea level, with unit
  • Height over geoid, minus height over ellipsoid, with unit
  • Checksum

Apart from the date, the GGA sentence delivers everything we want. All that’s left to do for us is splitting and translating the data.

Step 3: Translating the coordinates

Translating? Yep. The NMEA sentences contain information like 4825.1857,N,00956.8023,E, which translate to 48° 25.1857′ north latitude and 9° 56.8023′ east longitude. We, however, do not want to calculate in degrees, minutes and seconds, but decimal fractions, like 8.4197616 and 9.946705. Hence, translating. Yes.

First of all, we need to split our sentences into their respective parts. Some tutorials recommend the use of !StringTokenizer to do so, which is not the best of ideas. When your GPS receiver lost its fix — for instance in a tunnel — the GGA sentence looks like this:

$GPGGA,103927.819,,,,,0,00,,,M,0.0,M,,0000*58

To avoid having to handle situations like this, the sentence can be split along all the commas with split(",").

After this operation, the position of all information bits is well defined. Should we operate without a valid GPS fix, sentence_parts[6].equals("0") is true.

Let’s assume we have a valid GPS fix and want to translate our coordinates. Despite their using StringTokenizer, this tutorial helps us with the calculation. “raw_latitude” is sentence_parts[2], “lat_direction” sentence_parts[3], and so on.

So, now we have our position. An approximation, at least. More to come about how to find out the precision of our fix. Stay tuned ;)

 

The technologies behind the diretto Platform – Part 3: Restlet & client helper libraries

Beside the reference server implementation, we also provide a client library that facilitates the development of client applications. It is written in Java and also compatible with the Android runtime environment. Thus, the client core can be re-used for several implementation scenarios using a high-level Java API.

The client library comes in two flavors – a basic version and an extended one that is PubSubHubBub-enabled. The latter requires a public HTTP server endpoint but allows to register callback functions. These functions will be executed almost in real-time as soon as the server publishes certain events, like the complete upload of a new document.

While most of the regular client applications – including mobile clients – will only need the basic client library, there are also usages for the !PubSubHubBub-aware version. For instance, dedicated headless clients might subscribe for new documents and directly download and process them in some way. This takes off a lot of work from the server and provides a clear path for feature extensions. Also our rich web application takes advantage of this mechanism in order to display new documents on the maps in real time.

The Java client is basically a client-side implementation of our RESTful web service. Therefore, we used Restlet, a REST framework for Java. It provides useful abstractions of the low-level HTTP communication and maps the representations/resources of a RESTful architecture to appropriate classes. For the underlying HTTP communication, we use Apache’s HttpClient library. The JSON serialization/deserialization relies on Jackson, a high performance library that allows various mappings, including a mapping between unnotated POJOs and JSON structs. Beside that, we use a lot of other libraries, like many of the Apache Commons libraries and JodaTime for handling time data.

For the subscriber-side implementation of our PubSubHubBub-enabled client, we developed our own subscriber library, because the existing one did not fit our requirements. It uses Jetty as lightweight internal webserver for receiving notification pings. This library has been developed apart from the main client source and has been released on github: java-sub-pubsubhubbub

 

The technologies behind the diretto Platform – Part 2: CouchDB & GeoCouch

Although documents are not stored directly in a database, the diretto platform needs a robust persistence layer for saving metadata information and user-generated content. Therefore, we have chosen to use a non-relational database in our platform reference implementation due to several reasons. The strong focus on documents in diretto almost implies directly a document-oriented database. We also searched for a database that scales well (both vertically and horizontally), which is highly available and provides easy replication. Furthermore the database had to match our HTTP-only approach as well as to fit seamlessly in our !JavaScript-coined architecture.
We finally have taken CouchDB, an open source, eventual consistent, document-oriented database written in Erlang. In a nutshell, CouchDB, is a storage where you can save arbitrary JSON entities, that are uniquely indexed via a key. Thanks to MapReduce-based view functions, you can create complex lists (in fact, ) of the documents or partial document data that are incrementally updated on changes. Although not exactly comparable to SQL, these view functions thus provide a substitute for regular database queries. The database access is provided via a HTTP RESTful web interface, like it is in the diretto service itself.

One major problem we ran into was the difficulty of mapping our geo-data into B-Tree-based views for location searches and proximity lookups. It is not just nasty to use for this, but also highly inefficient. Thankfully, there were already other people having this issue and , a student from the University of Augsburg, is currently developing a CouchDB extension for spatial queries as part of his diploma thesis. In his fork, he has implemented a spatial query mechanism based upon a R-Tree, a highly efficient index for multidimensional queries, and range queries. GeoCouch and diretto both follow the GeoJSON specification for representing positioning data in JSON. We are looking forward to see the cool spatial features in the main branch of CouchDB soon!