Using Streaming Integration to Drive Real-Time Analytics in Apache Kudu


Apache Kudu was created, as part of the Apache Hadoop family, to enable fast analytics on fast data. But how exactly do you get that fast data into Apache Kudu in the first place?

Apache Kudu was designed specifically for use cases that require low-latency analytics on rapidly changing data, including time series, machine data, and data warehousing. Its architecture combines rapid inserts and updates with column-based queries, enabling real-time analytics on a single, scalable, distributed storage layer.
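That write-then-query pattern is easy to see in Kudu's Java client. The following is a minimal sketch, assuming a pre-created `metrics` table, its columns, and the master address shown here; none of those names come from Kudu itself.

```java
import java.util.Arrays;
import org.apache.kudu.client.*;

public class KuduQuickstart {
  public static void main(String[] args) throws KuduException {
    // Assumed master address and a pre-created "metrics" table
    // with columns (host STRING, ts BIGINT, cpu DOUBLE).
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    try {
      KuduTable table = client.openTable("metrics");

      // Rapid insert: the row is visible to scans as soon as it is applied.
      KuduSession session = client.newSession();
      Insert insert = table.newInsert();
      PartialRow row = insert.getRow();
      row.addString("host", "web01");
      row.addLong("ts", System.currentTimeMillis());
      row.addDouble("cpu", 0.42);
      session.apply(insert);
      session.close();

      // Column-based query: project only the columns the analysis needs.
      KuduScanner scanner = client.newScannerBuilder(table)
          .setProjectedColumnNames(Arrays.asList("host", "cpu"))
          .build();
      while (scanner.hasMoreRows()) {
        RowResultIterator rows = scanner.nextRows();
        while (rows.hasNext()) {
          RowResult r = rows.next();
          System.out.println(r.getString("host") + " cpu=" + r.getDouble("cpu"));
        }
      }
    } finally {
      client.shutdown();
    }
  }
}
```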

However, the latency and relevance of the analytics are only as good as the latency of the data. To get the most out of Kudu's speed, you need to deliver data to it in real time, as soon as possible after the data is originally created. That is the very essence of streaming integration.

Whether the data is coming from databases, machine logs, applications, or IoT devices, it needs to be collected in real time, within microseconds to milliseconds of its creation. This means using the right technology and techniques to achieve very low-latency continuous data collection, regardless of the source. The data may also need to be processed, enriched, correlated, and formatted before delivery, to further optimize the analytics in Apache Kudu.

The Striim platform can help with all of these requirements and more. Our database adapters support change data capture, or CDC, from enterprise and cloud databases. CDC directly intercepts database activity and collects every insert, update, and delete as it happens, ready to be streamed into Apache Kudu. Adapters for machine logs and other files read from the tail of multiple files in parallel, streaming out data as it is written and removing the inherent latency of batch collection. Data from devices and messaging systems can be collected just as easily, independent of its format, through TCP, UDP, HTTP, MQTT, AMQP, JMS, and Kafka adapters.
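The difference between batch reads and continuous collection shows up clearly in code. The sketch below is a conceptual illustration of file tailing, not Striim's actual adapter implementation: it follows the end of a log file and emits each new line as soon as it is written, instead of waiting for a scheduled batch window.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.function.Consumer;

public class LogTailer {
  // Conceptual sketch of continuous collection: follow the end of a
  // file and hand each newly written line to a consumer immediately.
  public static void tail(String path, Consumer<String> emit)
      throws IOException, InterruptedException {
    try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
      long position = file.length();           // start at the end, like `tail -f`
      while (!Thread.currentThread().isInterrupted()) {
        if (file.length() > position) {
          file.seek(position);
          String line;
          while ((line = file.readLine()) != null) {
            emit.accept(line);                  // stream the record downstream
          }
          position = file.getFilePointer();
        } else {
          Thread.sleep(50);                     // brief poll; no batch-window latency
        }
      }
    }
  }
}
```

Running one such tailer per file, each on its own thread, gives the parallel, multi-file behavior described above.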

After being collected continuously, the streaming data can be delivered directly into Apache Kudu with very low latency, or pushed through a data pipeline where it is pre-processed through filtering, transformation, enrichment, and correlation using SQL-based queries before delivery into Apache Kudu. This enables denormalization, change detection, deduplication, and quality checking before the data is ever stored.
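In Striim those steps are expressed as SQL-based continuous queries, but the shape of such a pipeline stage can be sketched in plain Java against the Kudu client. The event fields, the filtering rule, and the `sensor_readings` table below are all assumptions made up for this example.

```java
import java.util.HashSet;
import java.util.Set;
import org.apache.kudu.client.*;

public class PipelineSketch {
  private final Set<String> seenIds = new HashSet<>();  // naive in-memory dedup state

  // Filter, deduplicate, and deliver one event to a "sensor_readings" table.
  // Assumes the session was created with FlushMode.AUTO_FLUSH_BACKGROUND so
  // writes are batched in the background for low-latency delivery.
  void process(KuduSession session, KuduTable table,
               String eventId, String sensor, double reading) throws KuduException {
    if (reading < 0) return;                 // quality check: drop invalid readings
    if (!seenIds.add(eventId)) return;       // deduplication: skip repeated events

    Upsert upsert = table.newUpsert();       // upsert covers both inserts and updates
    PartialRow row = upsert.getRow();
    row.addString("sensor", sensor);
    row.addLong("ts", System.currentTimeMillis());
    row.addDouble("reading", reading);
    session.apply(upsert);
  }
}
```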

In addition, because Striim is an enterprise-grade platform, it can scale with Kudu and guarantee reliable delivery of source data, while also providing built-in dashboards and data pipeline verification for operational monitoring.

Striim can change the way you do analytics in Apache Kudu, with Kudu providing real-time insight into the real-time data Striim delivers. After all, if you want fast analytics on fast data, you need fast data in the first place.

To learn more about how you can use Striim for real-time analytics in Apache Kudu, read the press release, schedule a demo with a Striim technologist, or download the Striim platform.