Basic analytics on our data set
In part one of this series we covered setting up our Couchbase node and populating it with a sample of documents. As a quick refresher, here is what one of our sample documents looks like.
The query we wrote in the last article allowed us to group new users by year, month, day and so on, and also to retrieve a count of how many users had joined between specific dates.
We are going to do something similar in this article while expanding slightly on the complexity of our queries.
Active users by origin and date query
Our latest imaginary requirement from our imaginary bosses is that we want to report back active users within a timeframe broken down by origin.
So let's count all our active users by origin. Here is our first take on the query:
As ever, we check that the document is of the correct type and that any fields we'll be emitting exist (such as the origin field here). We then emit the origin field. You'll need to set the reduce to count (the built-in _count), group to true and group_level to 1.
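As a rough illustration of the description above, here is a minimal sketch of what such a map function could look like. The field name `type`, the value "user" and the sample documents are assumptions made for this example, and the `emit` and `countByKey` helpers are local stand-ins so the snippet runs outside Couchbase; inside a real view only the `map` function is needed, with `_count` as the reduce.

```javascript
// Local stand-in for the emit() that Couchbase provides inside a view.
const rows = [];
function emit(key, value) {
  rows.push({ key, value });
}

// Sketch of the map function: only user documents that have an origin
// are emitted. `type` and "user" are assumed names, not taken from the data.
function map(doc, meta) {
  if (doc.type === "user" && doc.origin) {
    emit(doc.origin, null);
  }
}

// Hypothetical sample documents, for illustration only.
const sampleDocs = [
  { type: "user", origin: "Brazil" },
  { type: "user", origin: "UK" },
  { type: "user", origin: "Brazil" },
  { type: "order", origin: "UK" }, // wrong type, skipped
  { type: "user" },                // no origin, skipped
];
sampleDocs.forEach((doc, i) => map(doc, { id: "user::" + i }));

// Emulates the built-in _count reduce with grouping on the whole key.
function countByKey(emittedRows) {
  const counts = {};
  for (const row of emittedRows) {
    counts[row.key] = (counts[row.key] || 0) + 1;
  }
  return counts;
}

const originCounts = countByKey(rows);
console.log(originCounts);
```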
This will produce a breakdown of our user base like so:
It's pretty useful as it gives us an idea of the global distribution of our user base, but we can make this query better! At the moment we are limited to retrieving the count for ALL active users since our application began. It'd be much more useful if we could filter the data based on time.
Often, as queries become more complex, you'll need to do finer filtering in the application layer. Where possible, however, we should strive to do as much processing as we can in the map/reduce phase.
Enter the Couchbase compound key
To filter our data more intelligently we are going to make use of a compound key. Basically, this is where we emit multiple elements together as a single key, so that they can be filtered via the start and end keys within the query. Let's look at our adapted query now!
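A minimal sketch of an adapted map function along these lines is shown here. It assumes the fields are named `origin` and `last_active`, that `last_active` is stored as an ISO 8601 string, and that user documents carry `type: "user"`; the `emit` helper and the sample documents are stand-ins so the snippet runs outside Couchbase.

```javascript
// Local stand-in for the emit() that Couchbase provides inside a view.
const compoundRows = [];
function emit(key, value) {
  compoundRows.push({ key, value });
}

// Sketch of the compound-key map function: the key is an array holding
// the origin first and the last-active timestamp second.
// Field names are assumptions for this example.
function map(doc, meta) {
  if (doc.type === "user" && doc.origin && doc.last_active) {
    emit([doc.origin, doc.last_active], null);
  }
}

// Hypothetical sample documents, for illustration only.
const docs = [
  { type: "user", origin: "Brazil", last_active: "2013-03-01T10:15:00Z" },
  { type: "user", origin: "UK", last_active: "2013-03-02T08:00:00Z" },
  { type: "user", origin: "Brazil" }, // no last_active, skipped
];
docs.forEach((doc, i) => map(doc, { id: "user::" + i }));

const compoundKeys = compoundRows.map((row) => row.key);
console.log(JSON.stringify(compoundKeys));
```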
As you can see, we now emit both the origin and the last_active field together, surrounded by square brackets ([ ]). With no reduce this view produces keys that look like this:
Again we select count for the reduce function and set group to true and group_level to 1. If we run the query now we'll see that the output is identical to the first iteration of our query. This is because grouping works from the left-hand side of the compound key to the right: level 1 groups on origin, and level 2 would group on origin and then last_active. Sadly, with the key in this shape we can no longer group at a finer grain on the last_active field (by minute, for example); as far as we can tell this is a limitation of Couchbase. (If anyone knows a way to achieve this, feel free to comment!)
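One possible way around this, offered as a sketch rather than something verified against this data set, is to flatten the date into the key itself so that each date part becomes its own top-level element. Couchbase views provide a built-in `dateToArray()` for this; the version below is a local stand-in so the snippet runs outside Couchbase, and `countAtLevel` is a hypothetical helper that imitates `_count` with a given group_level. Field names and sample documents are assumptions.

```javascript
// Stand-in for the dateToArray() built-in available inside Couchbase views:
// returns [year, month, day, hour, minute, second] in UTC.
function dateToArray(value) {
  const d = new Date(value);
  return [
    d.getUTCFullYear(), d.getUTCMonth() + 1, d.getUTCDate(),
    d.getUTCHours(), d.getUTCMinutes(), d.getUTCSeconds(),
  ];
}

// Local stand-in for the emit() that Couchbase provides inside a view.
const flatRows = [];
function emit(key, value) {
  flatRows.push({ key, value });
}

// Flattened key: [origin, year, month, day, hour, minute, second].
function map(doc, meta) {
  if (doc.type === "user" && doc.origin && doc.last_active) {
    emit([doc.origin].concat(dateToArray(doc.last_active)), null);
  }
}

// Hypothetical helper imitating _count with group_level=level:
// rows are grouped on the first `level` elements of the key.
function countAtLevel(rows, level) {
  const counts = new Map();
  for (const { key } of rows) {
    const prefix = JSON.stringify(key.slice(0, level));
    counts.set(prefix, (counts.get(prefix) || 0) + 1);
  }
  return [...counts].map(([k, v]) => ({ key: JSON.parse(k), value: v }));
}

// Hypothetical sample documents, for illustration only.
[
  { type: "user", origin: "Brazil", last_active: "2013-03-01T10:15:00Z" },
  { type: "user", origin: "Brazil", last_active: "2013-03-01T10:15:30Z" },
  { type: "user", origin: "Brazil", last_active: "2013-03-02T08:00:00Z" },
  { type: "user", origin: "UK", last_active: "2013-03-01T10:15:00Z" },
].forEach((doc, i) => map(doc, { id: "user::" + i }));

const byOrigin = countAtLevel(flatRows, 1); // origin only
const byMinute = countAtLevel(flatRows, 6); // origin down to the minute
console.log(JSON.stringify(byOrigin));
console.log(JSON.stringify(byMinute));
```

The trade-off is that start and end keys then have to be written in the same flattened form, with the date parts spelled out as separate array elements.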
Where our new query really shines is in finer-grained, more specific searches. Let's count all the users from Brazil that have been active in the last 2 days. To achieve that we would use a start key like so:
and an end key of:
which will output something like this (remember that the Ruby script we used to set up our test data generates dates randomly, so your result may differ from mine):
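To make the start and end keys concrete, here is a small sketch of how an application might build them. It assumes the view key has the shape [origin, last_active] and that last_active is stored in the same ISO 8601 UTC format that `toISOString()` produces; `activeRange`, `viewQuery` and `inRange` are hypothetical helper names, and the sample keys are made up. `startkey`, `endkey` and `group_level` are the view query parameters being set.

```javascript
// Builds the start and end keys for "users from `origin` active in the
// last `days` days", relative to the supplied `now` Date.
function activeRange(origin, days, now) {
  const msPerDay = 24 * 60 * 60 * 1000;
  const start = new Date(now.getTime() - days * msPerDay).toISOString();
  return { startkey: [origin, start], endkey: [origin, now.toISOString()] };
}

// Turns a range into a view query string; keys must be JSON-encoded.
function viewQuery(range) {
  const params = new URLSearchParams({
    startkey: JSON.stringify(range.startkey),
    endkey: JSON.stringify(range.endkey),
    group_level: "1",
  });
  return "?" + params.toString();
}

// Simplified local emulation of the range filter, for illustration only:
// same origin, and timestamp between the two bounds (inclusive).
function inRange(key, range) {
  return key[0] === range.startkey[0] &&
    key[1] >= range.startkey[1] &&
    key[1] <= range.endkey[1];
}

const now = new Date("2013-03-03T12:00:00.000Z"); // fixed date for the example
const range = activeRange("Brazil", 2, now);
const query = viewQuery(range);

// Hypothetical emitted keys.
const keys = [
  ["Brazil", "2013-03-02T09:30:00.000Z"],
  ["Brazil", "2013-02-20T18:00:00.000Z"], // too old
  ["UK", "2013-03-02T09:30:00.000Z"],     // wrong origin
  ["Brazil", "2013-03-03T11:59:00.000Z"],
];
const activeBrazil = keys.filter((key) => inRange(key, range)).length;

console.log(JSON.stringify(range));
console.log(query);
console.log(activeBrazil);
```

Passing a smaller window, such as 5 / (24 * 60) days for five minutes, gives the kind of short polling interval a periodic job could use.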
This means we can get whatever granularity we need on our active user base, both by time and by origin. Perhaps your application queries the view once every five minutes for each of the origins you are interested in, with the time part of the keys covering five-minute intervals...
In the next article we'll look at persisting view data allowing us to graph out trends over time on more complex queries.
If you have any comments, corrections or suggestions, feel free to post them on this article or reach out to us on Twitter.