Making the Most of Apache Kafka – Ingestion into Kafka

Steve Wilkes
August 29, 2017

Ingestion into Kafka

In Part 1 of this blog series, we highlighted how Striim’s SQL-based platform makes it easy to deliver processing and analytics of Apache Kafka data. We will now turn our focus toward real-time data ingestion into Kafka from a wide variety of enterprise sources.

Integrate Data into Kafka

When you are considering how to connect data sources to Kafka, you need to determine how to collect the data continuously in a streaming fashion, and how to “massage” and process that data into the required format. Neither step should require any coding, yet both should be flexible enough to cover a wide range of use cases.

Streaming Data Collection

The Striim platform ingests real-time streaming data from a variety of sources out of the box, including databases, files, message queues and devices. All of these are wrapped in a simple, easy-to-use construct – a data source – that is configured through a set of properties. This can be done through our TQL scripting language or through the UI. We also provide wizards to simplify creating data flows from popular sources to Kafka.

The way Striim collects data varies depending on the source, and databases require special treatment.

Most people think of databases as a record of what has happened in the past, with access to that data through querying. However, you can change that paradigm using a technology known as Change Data Capture (CDC). CDC non-intrusively listens to the transaction log of the database and sees each insert, update and delete as it happens. Striim uses CDC for its database sources and, through configuration, can stream out each database operation in real time.
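To make the idea of a stream of database operations more concrete, here is a minimal Python sketch of what a consumer of CDC events might do with them. The ChangeEvent shape and the apply_change and replay helpers are hypothetical names invented for this illustration; they are not Striim's event format or API. The sketch only shows that replaying inserts, updates and deletes in log order is enough to reconstruct the current state of a table.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One row-level operation as it might be read from a transaction log.

    Hypothetical shape for illustration only, not a real CDC wire format.
    """
    op: str                                  # "INSERT", "UPDATE" or "DELETE"
    table: str                               # name of the affected table
    key: Any                                 # primary key of the affected row
    after: Optional[Dict[str, Any]] = None   # row image after the change (None for DELETE)


def apply_change(replica: Dict[Any, Dict[str, Any]], event: ChangeEvent) -> None:
    """Apply a single change event to an in-memory copy of one table."""
    if event.op in ("INSERT", "UPDATE"):
        replica[event.key] = dict(event.after or {})
    elif event.op == "DELETE":
        replica.pop(event.key, None)
    else:
        raise ValueError(f"unknown operation: {event.op}")


def replay(events: Iterable[ChangeEvent]) -> Dict[Any, Dict[str, Any]]:
    """Rebuild the table state by applying events in log order."""
    replica: Dict[Any, Dict[str, Any]] = {}
    for event in events:
        apply_change(replica, event)
    return replica


if __name__ == "__main__":
    log = [
        ChangeEvent("INSERT", "orders", 1, {"status": "new"}),
        ChangeEvent("UPDATE", "orders", 1, {"status": "shipped"}),
        ChangeEvent("INSERT", "orders", 2, {"status": "new"}),
        ChangeEvent("DELETE", "orders", 2),
    ]
    print(replay(log))
```

Because every operation arrives as its own event, a downstream system never has to query the source database to find out what changed.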

Striim takes a similar approach with files. The file reader does not wait for files to be complete before processing them in a batch-oriented fashion. Instead, the reader waits at the end of the file and streams out new data as it is written to the file. As such, it can turn any set of log files into a real-time streaming data source.
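The "wait at the end of the file" behavior can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Striim's file reader: the TailReader class and its poll method are hypothetical names, and the sketch ignores file rotation, truncation and multi-file directories, which a production reader would need to handle.

```python
import os
from typing import List


class TailReader:
    """Stream lines appended to a file after the reader was opened.

    Hypothetical helper for illustration; it handles a single file and
    does not deal with rotation or truncation.
    """

    def __init__(self, path: str) -> None:
        self._file = open(path, "r", encoding="utf-8")
        # Skip existing content and wait at the end of the file.
        self._file.seek(0, os.SEEK_END)
        # Holds a line whose trailing newline has not been written yet.
        self._partial = ""

    def poll(self) -> List[str]:
        """Return every complete line written since the previous poll."""
        lines: List[str] = []
        while True:
            chunk = self._file.readline()
            if not chunk:
                # Nothing more to read right now; the caller polls again later.
                return lines
            self._partial += chunk
            if self._partial.endswith("\n"):
                lines.append(self._partial.rstrip("\n"))
                self._partial = ""

    def close(self) -> None:
        self._file.close()
```

A caller would invoke poll() in a loop with a short sleep between calls and forward each returned line downstream. Buffering the partial line means a record is emitted only once the writer has finished it.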

A range of other sources are also available, including support for IoT and device data through TCP/UDP/HTTP/MQTT/AMQP, network information through NetFlow and PCAP, and other message buses like JMS, MQ Series, and Flume.

As we will discuss later, the data from all these sources can be delivered “as-is,” or go through a series of transformations and enrichments to create exactly the data structure and content you need. Data can even be correlated and joined across sources, before delivery to Kafka.

Data Formatting 

Different Kafka consumers may have different requirements for the data format. Since Kafka deals with data at the byte level, it is not aware of the format of the data at all, but consumers may need a specific representation. This could range from plain text or delimited data (think CSV) to structured XML, JSON or Avro.
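As an illustration of why the format is the producer's choice rather than Kafka's, the following Python sketch serializes the same record into three different byte payloads using only the standard library. The function names and the sample record are hypothetical and are not Striim's formatters; Avro is left out because it requires a schema and a third-party library.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET
from typing import Any, Dict


def to_delimited(record: Dict[str, Any], delimiter: str = ",") -> bytes:
    """Encode the record's values as one delimited line (no header row)."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter=delimiter).writerow(record.values())
    # Drop the line terminator that csv.writer appends.
    return buffer.getvalue().rstrip("\r\n").encode("utf-8")


def to_json(record: Dict[str, Any]) -> bytes:
    """Encode the record as a compact JSON object."""
    return json.dumps(record, separators=(",", ":")).encode("utf-8")


def to_xml(record: Dict[str, Any], root_tag: str = "record") -> bytes:
    """Encode the record as a flat XML element with one child per field."""
    root = ET.Element(root_tag)
    for name, value in record.items():
        ET.SubElement(root, name).text = str(value)
    return ET.tostring(root, encoding="unicode").encode("utf-8")


if __name__ == "__main__":
    sample = {"id": 42, "status": "shipped"}  # hypothetical record
    for encode in (to_delimited, to_json, to_xml):
        print(encode.__name__, encode(sample))
```

Each function returns plain bytes, which is all a Kafka message value is. A consumer has to know which of these encodings was used in order to decode the payload.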

When writing to Kafka in Striim, you can choose the data format through a simple drop-down and optional configuration properties, without writing a single line of code.

For more information on Striim’s latest enhancements relating to Kafka, please read this week’s press release, “New Striim Release Further Bolsters SQL-based Streaming and Database Connectivity for Kafka.” Or download the Striim platform for Kafka and try it for yourself.

Continue reading this series with Part 3: “Making the Most of Apache Kafka – Delivering Kafka Data.”