Efficient Time-based Range Queries in CouchDB using GeoCouch
Original article in german: ioexception.de
When it comes to queries, it is important to repress the mental model of any SQL databases you might have used so far. CouchDB provides a completely different way where you have to write MapReduce-based functions that create ordered lists as view to your data. For more complex queries, including some join-like requests, view collation provides a powerful yet tricky solution. One weakness of CouchDB that can’t be solved even using view collation is the absence of any range queries in continuous spaces. We faced this problem while working on the diretto service implementation and eventually found a proper solution.
In diretto, all documents have a moment where they have been created, or – for continuous media like videos – a moment where the creation has been started. However, it is hard to tell the exact time without properly synchronized clocks in cameras, recorders etc. When a document is received in hindsight by a third person, telling the correct moment might even be harder. So beside of accepting an exact moment, we also provide support for periods, like “that photo has been taken between 9:11 and 9:23″. Now users of the platform can not just suggest alternative values, they can also vote for the information they most likely believe in. Eventually, this should provide a collaboratively created approximation of the correct information.
… with CouchDB
Sounds good in theory, but when it comes to the queries like “give me all documents that have been presumably taken between 9:00 and 9:30″, you’re stranded with regular CouchDB. There is a workaround that divides the time into distinct periods and then emits a key for each segment the document is inside. This might work in our example, but it is obvious that this will emit a lot of keys and might not scale when there are long durations and fine-grained partitionings.
Here is an example document:
{
"_id": "s-ffc0b6b0-59d4-4a3b-ad36-7ec05e7db1de",
"begin": "2010-08-05T09:11:52.156Z",
"end": "2010-08-05T09:23:13.457Z"
}
Map function for a key emission every 5 minutes:
//length of time segment (here 5 min)
var periodLength = (60*5);
function(doc)
{
if(doc.begin && doc.end)
{
//start and end time as UNIX timestamps (seconds, not milliseconds)
var begin = Math.round(new Date(doc.begin).getTime()/1000);
var end = Math.round(new Date(doc.end).getTime()/1000);
//calculate first matching segment of period
var p = (begin - (begin%periodLength));
//emit key for each matching period
while(p<=end)
{
emit([p, doc._id], null);
p = p + periodLength;
}
}
}
The view then correctly returns:
http://localhost:5984/entries/_design/entries/_view/docsByPeriodList?startkey=[1280998800]&endkey=[1281000193,{}]
{"total_rows":3,"update_seq":2,"offset":0,"rows":[
{"id":"s-ffc0b6b0-59d4-4a3b-ad36-7ec05e7db1de","key":[1280999400,"s-ffc0b6b0-59d4-4a3b-ad36-7ec05e7db1de"]15,"value":null},
{"id":"s-ffc0b6b0-59d4-4a3b-ad36-7ec05e7db1de","key":[1280999700,"s-ffc0b6b0-59d4-4a3b-ad36-7ec05e7db1de"],"value":null},
{"id":"s-ffc0b6b0-59d4-4a3b-ad36-7ec05e7db1de","key":[1281000000,"s-ffc0b6b0-59d4-4a3b-ad36-7ec05e7db1de"],"value":null}
]}
There are the matching 5 minute slots beginning at 9:10, 9:15 and 9:20.
The following sketch illustrates the B-tree containing entries emitted by period (note that the grayed-out key is an exemplary entry for other documents in the view and is not shown in other example outputs.):
… and way better with GeoCouch
We have the same problem for locations, where we also have various suggestions and inaccuarte data. Due to the fact that two-dimensional data (like a WGS84-encoded position) also can’t be indexed efficiently in CouchDB’s B-Trees, we switched to a fork by Volker Mische that is additionally backed by a R-Tree that enables multi-dimensional indexing (currently, it only supports two dimensions, but hopefully it will support more in the future). As a nice side-effect, GeoCouch also allows for spatial searches using bounding boxes. So you can search for spatial features within a limited area. This gave us the idea to apply these functions to our temporal indexing problem. !GeoCouch’s R-Tree is not bound to geographical values (however to two dimensions), so an entry could be any point or rectangle (bounding box) indexed by numerical positions.
GeoCouch’s spatial index expects GeoJSON entries (which are also not bound to any distinct format like WGS84), so we emit our time periods as positions with bounding boxes. The longitudinal values are ignored and set to 0 and the latitudinal values represent the temporal start and end point encoded as time stamps. Thus, we are now able to efficently query all entries within a given period.
Period spatial-based query:
function(doc)
{
if(doc.begin && doc.end)
{
//start and end time as UNIX timestamps (seconds, not milliseconds)
var begin = Math.round(new Date(doc.begin).getTime()/1000);
var end = Math.round(new Date(doc.end).getTime()/1000);
emit(
{
type: "Point",
bbox : [0,start,0,end]
}, null
);
}
}
Spatial query using start and end timestamp as range delimiters:
http://localhost:5984/entries/_design/entries/_spatial/entriesByPeriod?bbox=0,1280998800,0,1281000600
{"update_seq":16,"rows":[
{"id":"s-ffc0b6b0-59d4-4a3b-ad36-7ec05e7db1de","bbox":[0,1280999512,0,1281000193],"value":null}
]}
As you can see, GeoCouch does not only allow us to store such entries with ranging values, it also also to query them easily.




2013 diretto Project Team
[...] which actually maps well to this data model. (I didn’t come up with this approach on my own: see Efficient Time-based Range Queries in CouchDB using GeoCouch for the [...]