The Power of Streaming SQL for Real-Time Data Solutions

In this video, Striim Founder and CTO, Steve Wilkes, discusses streaming integration, the need for stream processing and streaming SQL, and why they’re essential to real-world real-time solutions.

This was originally published as a blog post here.

To learn more about the Striim platform, go here.

 

Unedited Transcript:

You’ve heard about streaming integration, the need for stream processing, and often hear the term streaming SQL. But what is streaming SQL, and why is it so essential to real-world real-time solutions?

IBM created the Structured Query Language, or SQL, in the 1970s as a declarative mechanism for working with relational data. It has been used for four decades as a way of creating, modifying and querying data in almost every database on the planet. However, because databases store data before it is available for querying, this data is invariably old.

In the world of real-time data and streaming systems there is also a need to work with data, and Striim chose 5 years ago to use a variant of SQL for stream processing. This streaming SQL looks very much like the static database variant, but needs new constructs to deal with the differences between stored and real-time continuous data.

Database SQL works against an existing set of data and produces a result set. If the data changes, the SQL needs to be run again. Streaming SQL receives a continuous and never-ending amount of data, and continually produces new results as new data arrives.

The simplest things that can be done with this data are filtering and transformation. These operations work event-by-event with every input potentially creating zero or one output.

For example, if we want to limit data moving from one stream to another to a certain location, we could write a simple WHERE clause.

SELECT *
FROM OrderStream
WHERE zip = 94301

And if we want to combine first and last names into full name, we can use concatenation, with other, more complex, functions of course available.

SELECT *,
       FirstName + ‘ ‘ + LastName as FullName
FROM OrderStream
WHERE zip = 94301

However, because streaming queries receive events one-by-one, additional constructs are required for aggregate queries that work against a set of data, so windows and event tables need to be introduced.

A window contains a set of events bounded by some criteria. This could be the last 5 minutes worth of data, last 100 events, or hold events until no more arrive within a certain time. Windows can also be partitioned, so the sets are based on the criteria per some data value, for example last 100 actions carried out per customer. Event tables hold the last event that occurred for some key, for example the last temperature reading per room.

Streaming SQL can work against windows and event tables and will output results whenever there is any change. Aggregate queries against windows will recalculate whenever the window is updated, giving running counts, sums over micro-batches, or activity within a session.

For example to create a running count and sum of purchases per item in the last hour, from a stream of orders, you would use a window, and the familiar group by clause.

CREATE WINDOW OrderWindow
OVER OrderStream
KEEP WITHIN 1 HOUR
PARTITION BY itemId
 
SELECT itemId, itemName,
       COUNT(*) as itemCount,
       SUM(price) as totalAmount
FROM OrderWindow
GROUP BY itemId

Enriching data is just as easy, it uses the standard notion of a JOIN. The Striim platform supports all types of joins familiar to database users including inner, outer, cross and self-joins through nested queries.  Striim enables users to load large amounts of data into in-memory caches and event tables from databases, files, hdfs and other sources. This can be reference, context or historical data, and can be updated through the incorporation of CDC.

For example, if we want to enrich the orders stream to include details about customer and location, we can join with reference data loaded into caches from the customer table and location database.

SELECT o.orderid, o.itemname,
       o.custid, o.price, o.quantity,
       c.name, c.age, c.gender, c.zip,
       z.city, z.state, z.country
FROM OrderStream o,
     CustInfo c, ZipInfo z
WHERE o.custid = c.id
AND   c.zip = z.zip

Of course, this just scratches the surface of what can be achieved through Streaming SQL. Production queries can be much more complex, utilizing case statements and even pattern matching syntax.