Data brings an intentional, methodical, and systematic approach to decision-making.
At Confluera, we continually identify and collect signals from customers’ environments to protect them from potential threats. We wanted to create Security Insights and provide observability into customers’ infrastructure, surfacing novel patterns that allow them to tighten their security posture.
To enable this feature, we went looking for a datastore that could power such insights in a timely manner, which eventually led us to the Apache Pinot project. This blog post covers the architectural needs that led us to select Pinot and how it helped us build our product’s required capabilities.
To give a high-level overview, Confluera offers a threat-detection and response platform. It maps activities happening in an infrastructure environment into graphs by consuming system-level events from the hosts that are part of the infrastructure. The graph is further partitioned into subgraphs based on individual intents, which are then fed into our detection engine to find malicious activities.
This proactive real-time graph building helps our detection engine in a couple of ways:
The above capabilities help reduce noise and save time by pre-building the attack narrative and providing a surgical response.
See blog post to understand more about how we do this.
Once Confluera identifies a threat, we wanted to aid the investigation with a threat-hunting platform that lets users search through all the events and, at the same time, answer questions by slicing and dicing those events.
Below is a screenshot of our product summarizing a multi-stage attack into a coherent storyline. In addition to this suspicious behavior, there’s a lot of benign activity that is not flagged. We wanted to close the gap by giving users the ability to dig through the entirety of potential attack activity.
In addition to the above threat-hunting feature, which is mainly suited for “war-time,” we wanted to build a view into the infrastructure’s “peace-time” behavior by surfacing interesting security insights through analytics. Example insights are infrastructure-wide behaviors such as DNS requests, programs connecting from outside, login failures, etc.
With the above product goals in mind, we wanted a datastore that supports the following features:
Based on this high-level understanding of our requirements, we found that OLAP datastores satisfy most of them. We spent time comparing Apache Druid and Apache Pinot for our use case.
We chose Apache Pinot primarily because of the availability of the star-tree index, low latency on the set of queries we were interested in, and a very responsive and active community.
In this section, we present the benchmarks we ran to compare Apache Pinot and Apache Druid in terms of latency and throughput. Apache Druid was set up with 2 data nodes (Historicals and MiddleManagers), 1 query node, and 1 master node (Coordinator and Overlord), with historical and broker caches disabled. Apache Pinot was set up on 3 nodes: 2 server nodes and 1 node containing the Broker and Controller. Each node is an AWS m5a.2xlarge instance (8 vCPUs and 32 GB of RAM). We loaded both systems with the same dataset of 700 million rows and enabled similar indices.
Below is a latency comparison for some of the queries we tried on Apache Pinot vs. Druid. The latency numbers shown below are taken from the query metadata that both systems report in their dashboards. Results were captured once the latency had stabilized on both systems, to avoid the effect of cold data or indices.
As shown above, Apache Pinot was able to support lower latencies for all these queries.
The throughput test was done using the Python clients for Pinot (pinotdb) and Druid (pydruid). The following graph displays the throughput observed on Pinot vs. Druid with a basic setup (i.e., no tuning). The query used for this throughput test is as follows: “Select count(*) from table where col_A = X”
As seen in the graph, Pinot is able to sustain a higher throughput, albeit with increased p99 latencies.
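The kind of throughput test described above can be sketched roughly as follows. This is a minimal illustration, not the harness we actually used: the function names, the default concurrency, and the number of queries are assumptions made for the example. The Pinot-specific part uses pinotdb's DB-API style `connect`, with the broker host and port as placeholders.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def measure_throughput(run_query, sql, num_queries=200, concurrency=8):
    """Run `sql` num_queries times across worker threads.

    Returns (queries_per_second, p99_latency_seconds).
    """

    def timed_call(_):
        start = time.perf_counter()
        run_query(sql)
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, range(num_queries)))
    wall_seconds = time.perf_counter() - wall_start

    # Nearest-rank style p99 over the sorted per-query latencies.
    p99 = latencies[min(len(latencies) - 1, int(0.99 * len(latencies)))]
    return num_queries / wall_seconds, p99


def make_pinot_runner(host="localhost", port=8099):
    """Return a run_query callable backed by pinotdb (host/port are placeholders)."""
    from pinotdb import connect  # imported lazily so the harness runs without pinotdb

    local = threading.local()  # one connection per worker thread

    def run_query(sql):
        if not hasattr(local, "conn"):
            local.conn = connect(host=host, port=port, path="/query/sql", scheme="http")
        cursor = local.conn.cursor()
        cursor.execute(sql)
        return cursor.fetchall()

    return run_query
```

With a running broker, something like `measure_throughput(make_pinot_runner(), "select count(*) from table where col_A = 'X'")` would return the observed queries per second and p99 latency. A similar runner could be written around pydruid for the Druid side.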
The star-tree index provides a way to trade off storage for latency by doing smart pre-aggregation over a set of columns of interest. On top of that, enabling this index is just a config change away, and reloading the table makes the index available for older segments.
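As a rough illustration of what such a config change can look like, the sketch below builds the star-tree portion of a Pinot table config as a Python dictionary and prints it as JSON. The column names col_A and col_B are the placeholders used elsewhere in this post, and the dimension order and `maxLeafRecords` value are assumptions for the example, not our production settings. A query that also filters on time would need a suitable time column among the dimensions, which is omitted here.

```python
import json

# Hypothetical star-tree definition: placeholder columns, illustrative values.
star_tree_index_config = {
    # Dimensions the tree is split on, in order.
    "dimensionsSplitOrder": ["col_B", "col_A"],
    "skipStarNodeCreationForDimensions": [],
    # Pre-aggregate count(*) for every dimension combination.
    "functionColumnPairs": ["COUNT__*"],
    # Upper bound on records under a leaf node (illustrative value).
    "maxLeafRecords": 10000,
}

# The star-tree definition sits under tableIndexConfig in the table config.
table_config_fragment = {
    "tableIndexConfig": {
        "starTreeIndexConfigs": [star_tree_index_config],
    }
}

print(json.dumps(table_config_fragment, indent=2))
```

Only this fragment is shown; the rest of the table config (schema name, stream settings, and so on) is left out.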
The queries behind our security insights and analytics are pre-defined, and by using this index we were able to bring latency down to the levels we wanted. The chart below shows the speedups we achieved on one such aggregate query over data corresponding to different periods.
Query latencies with the star-tree index are ~60x faster than with the inverted index. The query is of the format “Select col_A, count(*) from table where <time period> and col_B filter group by col_A order by count(*)”, with col_A and col_B part of the star-tree index.
Please note: the latencies shown above are on a log scale.
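To give some intuition for why pre-aggregation helps a query of this shape, here is a toy sketch in plain Python. It is not Pinot's star-tree data structure, only the underlying idea: counts are computed once per distinct (col_B, col_A) pair, so a later query touches a handful of pre-aggregated records instead of every raw row. The event values are made up for the example.

```python
from collections import Counter

# Toy raw rows as (col_A, col_B) pairs; values are invented for illustration.
events = [
    ("ssh", "host-1"), ("ssh", "host-1"), ("ssh", "host-1"),
    ("dns", "host-1"), ("dns", "host-1"),
    ("dns", "host-2"), ("ssh", "host-2"),
]


def scan_count(rows, col_b_value):
    """Answer the group-by count by scanning every raw row."""
    counts = Counter(a for a, b in rows if b == col_b_value)
    return counts.most_common()  # ordered by count, descending


def build_preaggregate(rows):
    """Pre-aggregate once: one count per distinct (col_B, col_A) pair."""
    return Counter((b, a) for a, b in rows)


def preaggregated_count(preagg, col_b_value):
    """Answer the same query from the much smaller pre-aggregated records."""
    counts = Counter({a: n for (b, a), n in preagg.items() if b == col_b_value})
    return counts.most_common()


preagg = build_preaggregate(events)
print(scan_count(events, "host-1"))
print(preaggregated_count(preagg, "host-1"))
```

Both calls return the same answer. The difference is that the second one reads at most one record per distinct pair, which is where the storage-for-latency trade-off comes from.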
The star-tree index has been handy for introducing new aggregate queries into the product in a relatively short period of time: tuning latency to meet our requirements is just a config change, provided the data has already been consumed by and is in Pinot.
Currently, our Pinot setup consumes data from the graph-building engine through Apache Kafka, and we use S3 as a deep store. Data correction or backfilling is done using an offline Spark job. We use a TTL on the data stored in Pinot, which automatically cleans up older data and keeps the resources we need in check. If Pinot requires additional resources (for example, a new server), we introduce a new server, tag it with the appropriate tenant, and trigger a rebalance that distributes the consuming and rolled-out segments across the servers, including the new one.
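The retention and scale-out steps above can be sketched as follows. This is an illustration under assumptions rather than our actual tooling: the 30-day retention, controller address, table name, instance name, and tenant tag are all made up, and the helper functions are hypothetical. The endpoint paths and parameters follow the Pinot controller REST API as we understand it and should be checked against the documentation for the Pinot version in use.

```python
from urllib.parse import quote, urlencode

# Retention ("TTL") is set in the table's segmentsConfig; 30 days is illustrative.
segments_config_fragment = {
    "segmentsConfig": {
        "retentionTimeUnit": "DAYS",
        "retentionTimeValue": "30",
    }
}


def update_tags_url(controller, instance_name, tags):
    """Build the PUT target for tagging a server instance with tenant tags."""
    query = urlencode({"tags": ",".join(tags)})
    return f"{controller}/instances/{quote(instance_name)}/updateTags?{query}"


def rebalance_url(controller, table_name, table_type="REALTIME", dry_run=True):
    """Build the POST target for rebalancing a table's segments."""
    params = {
        "type": table_type,
        "dryRun": str(dry_run).lower(),
        "includeConsuming": "true",  # also move consuming segments
    }
    return f"{controller}/tables/{quote(table_name)}/rebalance?{urlencode(params)}"


# Hypothetical names, for illustration only.
print(update_tags_url("http://localhost:9000", "Server_10.0.0.5_8098", ["secTenant_REALTIME"]))
print(rebalance_url("http://localhost:9000", "events"))
```

Running the rebalance with `dry_run=True` first is a cautious default in this sketch: it lets you inspect the proposed segment assignment before anything moves.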
With this setup, we closed gaps in our data infrastructure and enabled the following capabilities for our product and our team:
We started actively using Pinot a few months ago, and in this relatively short period, it has been part of multiple feature rollouts. Our experience so far has been great. We have found Pinot to be operationally simple to manage, scaling to our event ingestion rate. Indices made available by Pinot have been useful for us in experimenting and rolling out new features relatively quickly and at the same time meeting our latency needs. We just started using Pinot in our infrastructure and are excited about expanding usage in our product and inside Confluera.
We at Confluera are passionate about security and infrastructure. If you would like to learn more about the product, see what Apache Pinot has enabled us to build into our system, or just talk about cybersecurity, send an email to email@example.com.