Why We Chose Druid to Power CoolaData’s Real-Time Analytics

Yair Harel, Senior Software Architect at CoolaData

In this dynamic digital age, when we all need to keep the pace and move fast, businesses are becoming more agile, and are chasing quick answers, immediate results, and fast growth.

It all needs to happen fast – in real time, actually.

Businesses might need to handle big data infrastructure and time series analysis, and to facilitate quick insights about customer behavior. We realized that CoolaData, a modern behavioral analytics platform, must support these requirements.

In addition, businesses need to answer the more complex business questions that rise above the level of sessions, pageviews, or clicks, to the more granular level of questions like: “How many users that came from campaign 1, and registered on the same day, viewed at least three product pages, added two items to the cart, and then abandoned in the same session?”

Such critical insights need to occur in real time. Analysts can’t afford to spend time on performing long queries or on data management. Event user data should stream in from all sources and be ready for immediate querying. For example: “select action where event_name = “install” and time is within last 1 hour and group by os_type”. Or in plain English: How many install events were sent in the past hour? Return the result by OS type groups.

To enable this type of real-time data that's ready for immediate querying, we at CoolaData needed a real-time OLAP solution. The base requirement was to digest all our customers’ event data, which meant about 500M events daily, in real-time, and to provide the proper visualization.

Real-time counters were not a barrier, as we already implement this technology for our internal monitoring (using ElasticSearch). So we considered two additional options: either to develop a solution consisting of existing components, or to take the shortcut and use an existing third-party product.

We chose the fast track again. And found two, very similar solutions: Druid and Pinot.

Druid was better documented and friendlier, so we picked it.

Meet the Druid

Druid is an open-source analytics data store written in Java, designed for performing business intelligence queries on event data. Druid has 3 key important characteristics, which matched our needs perfectly:

Druid is column oriented, and it enables complex multi-dimensional filtering and scanning
Druid has real time streaming ingestion
Druid is scalable

In addition, we asked ourselves the following questions while analyzing Druid:

Does it allow real-time aggregations queries? (Yes)
Does it has a single point of failure? (No)

Our Alpha

So we went on with the Alpha product, which was quite simple. We only wanted to show our users two types of metric counters: an events counter, and a unique users counter.

We decided to include only a few limited properties of our customer events: {userId, eventName, os, device, country, browser}. In addition, we allowed for each customer to define up to four more personal properties.

Because we only cared about the past 24 hours, this meant we could drop any data beyond that (historical segments). We needed to allow the system to accept events even if they are older than 24 hours in the past (window period).

Because we cared about one-minute granularity, we couldn't query Druid on milliseconds/seconds, because our data would be aggregated in ingestion time (query granularity).

For the alpha version, we had a single DataSource ( = table) and we did not create multiple data sources per customer or customer group. In addition, our segment granularity is one day (which means that each segment file contains data of a single day).

Our Druid Cluster

To meet all the requirements while keeping the solution cost-effective, we added the granularity and property limitations so our Druid cluster size could be kept relatively small.

The impact of the number of properties and their cardinality on Druid segments is huge. In a sentence, a segment is the data structure(s) which Druid uses to save the data. The recommendation is that a segment is not bigger than 700MB.

Therefore, the limitation we introduced into the solution impacted our cluster in the following ways:

bigger query granularity = smaller segments
bigger segment granularity = less segments

This meant more index service (real time ingestion) resources

bigger window period = more “open segments”

This meant more index service (real time ingestion) resources

single datasource = bigger segment size

This meant less index service (real time ingestion) resources

How Many Segments Did We have On Our Cluster in a Single Moment?

24H window + 1 day segment granularity = 3 segments (1 for the last day, 1 for the current day and 1 for the next day).
We used a replication factor = 2 and a partition factor = 4. This means that every single segment is replicated for 2 and partitioned to 4.
Therefore, a single segment is actually 4*2=8 segments.
In total, we had 8*3 = 24 'live segments'.
A 'closed' segment is a segment which cannot accept more data and exists only on 'historical' nodes. Segments are moved to historical nodes once they are out of the 'window period'.

How Does Druid Perform the Rollup in Real Time?

Here is a simple example:

In a single minute, we got the following 3 events:

{timestamp: 1490184436429, country : “israel”, os : “linux”, eventName: “coola”, userId : “1234”}
{timestamp: 1490184436430, country : “israel”, os : “linux”, eventName: “coola”, userId : “1234”}
{timestamp: 1490184436431, country : “israel”, os : “linux”, eventName: “coola”, userId : “5678”}

Druid rolls up those 3 events in real time (assuming that those 3 events went to the same partition):

{country : “israel”, os : “linux”, eventName: “coola”, userId : “1234”, events : 2}
{country : “israel”, os : “linux”, eventName: “coola”, userId : “5678”, events: 1}

If the query granularity was defined for one millisecond, the segment was bigger by one event, as no roll up was made by the index service.

Some downsides:

Lack of developer community, which makes it hard to tackle issues in development time.

Lack of any monitoring and managing tools. The coordinator and overload tools are far from advanced.

Summary and Further Reading

Real time analytics is crucial for many businesses. We were able to easily implement Druid to provide a real time infrastructure for our own behavioral analytics platform. This powers some cool features, including real-time behavioral insights on product engagement, content virality and purchasing patterns, and a Customer/Partner Portal enabling our users to embed real-time dashboards and data widgets for each of their customers.

Learn more about Druid and real time analytics:

Real time analytics patterns, case studies and tools on our BI & Analytics Wiki
A high level overview of Druid
Druid alternatives and related services, such as this Druid vs Redshift comparison, or other servers with OLAP capabilities, such as Redshift or PostgreSQL.

Written by Yair Harel, Senior Software Architect at CoolaData

Yair is a Senior Software Architect at Cooladata, With over 10 years in the tech industry, Yair has primarily been involved in designing and implementing high scale microservices SaaS applications, and big data solutions.

Analytics