insideBIGDATA: Streaming Integration with Intelligence
insideBIGDATA podcast with Steve Wilkes, Co-founder and CTO at Striim presenting: Striim – Streaming Integration with Intelligence.
Welcome to The Rich Report, a podcast with news and information on the world of big data. Today my guest is from Striim. We have the CTO and cofounder of the company, Steve Wilkes. Steve, welcome to the show today.
It’s great to be here, Rich.
You know, I was looking at your bio and I see we have something in common. We both worked for companies that were acquired by Oracle in past years, in the area of data analytics of course. But, Steve, why don’t we start at the beginning. Can you tell me about the problem that you were trying to solve when you cofounded Striim?
Yeah, absolutely. And you know, being acquired by Oracle doesn’t really narrow it down too much. So, yeah, the four of us that founded Striim were all executives at GoldenGate, and GoldenGate was, as you say, acquired by Oracle in 2009. The business of GoldenGate was really moving data from one database to another at high speed to enable high availability solutions, et cetera. It was very innovative in what it actually did, but the customers would ask us at GoldenGate, you know, if you’re moving this data, can we look at it while it’s moving? Can it be analyzed? Can we get value out of it then, rather than just after it’s landed somewhere? So it kind of stuck in our minds, and when we were thinking about what we wanted to do next, in late 2011, that idea resurfaced.
And so the whole goal of Striim was to enable organizations, with a single platform, to get value out of high-speed, real-time data, and not limited to collecting change data from databases like GoldenGate, but any data that is created at high speed, being able to get that in real time and work with it. So the goal of the company was to build a platform that enables you to collect data as it is being created, to process that data in memory, to integrate it with other data, to deliver it to whatever targets you want, and to do analytics on it and visualize the results of those analytics. So rather than customers having to use multiple products and try to piece them together, or build stuff from open source building blocks and piece all that together, we give them everything they need out of the box in one platform. That’s what we set out to do, and that’s what we’ve actually achieved with Striim.
Well, that’s great, Steve. You know, I love the slogan you guys are using: getting value from the data now, when it’s born, right? And not some time later.
Oh, that’s right. Yeah.
So Steve, I brought your slides up. Why don’t we go through the deck and we’ll do a Q and A at the end.
Oh, awesome. That sounds great, Rich. So today I’m going to be talking about the latest 3.8 release, but in order to give you some context behind that, I’ll first start off by introducing you to Striim and talking about the platform in general. Then I’ll go through the release and what the main features and benefits are, and then how you can get started with Striim. So as I mentioned, there are four founders of Striim. We were with GoldenGate, which was acquired by Oracle, and we founded Striim in 2012. It’s backed by some really great investors and we have customers in most industries, and that obviously is continually growing. It’s the nature of integration software.
And the space that we’re in is really to provide a streaming integration platform that has this added intelligence and analytics on top of it. And that software is something that’s genuinely useful, and it is used, as I say, by customers in most industries. So, you know, what is streaming integration? Well, it’s all about continuously moving any enterprise data from wherever it’s created to wherever it’s delivered, while also being able to handle large volumes of data at scale, to scale out and handle really high throughput, and to process and analyze that data in flight, in memory, as it’s moving, and also to correlate it together. And that is all to be able to get value out of it: to make that data valuable, to have a process that is verifiable so you can trust the results it has given you, and to visualize it, to have dashboards that give you visibility into your data in real time.
Not tomorrow, not at the end of the day, but as the data’s being created, you can get value out of it immediately. So some of the uses of this streaming integration and analytics cross all industries. Some are around the notion of technology adoption and data modernization, but then also obtaining that value and doing next-generation real-time analytics and real-time alerting. So we have customers, for example, that have adopted Kafka or adopted Hadoop, and need to be able to feed that continuously to have real-time information in Kafka and Hadoop. And that information could be coming from databases. So part of what we provide within our platform is change data capture, which enables you to see inserts, updates and deletes on databases as they’re happening, and to push that out in real time. For example, some are using Kafka as a distribution bus, and other customers of ours are pushing real-time security data out onto Kafka.
We have other customers that are using us to adopt cloud technologies. That could be cloud databases, it could be cloud storage, it could be cloud data warehouses or even message buses. Other customers are expanding their database infrastructure, so being able to create reporting instances or read-only instances in real time and keep those up to date. So it really varies. Streaming integration is a very useful technology that has a lot of different use cases associated with it. And so the platform itself starts off with continuous data collection, and that means being able to get to data as soon as it’s born. And some data is already pushed to you. Sensors might push you data, and if you’re utilizing Kafka or a message bus, it can push you data.
So those things are already inherently event driven. But then with other things like log files, for example, a lot of people would just ship them to Hadoop and then read and analyze them afterwards. But if you have adapters like ours that read at the end of the file, you can stream that out in real time. So as new entries are written to the log, you stream them out. And similarly with databases, instead of running SQL statements against the database, you use change data capture and you see all these inserts, updates and deletes in real time as they’re happening. So there’s continuous data collection independent of the source. This is the fundamental number one thing that you have to do in order to get streaming integration. And so, you know, we talk about streaming first. This is streaming first. You can’t have a streaming technology, a real-time infrastructure, unless you can get to your data as it’s being created, wherever it’s created.
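To make the idea of reading at the end of a file concrete, here is a minimal Python sketch. It is not Striim's adapter; the function name `read_new_lines` and the byte-offset bookkeeping are assumptions made for illustration. Each call returns only the complete lines appended since the previous call, which is the behavior a polling loop would need in order to stream new log entries out as they are written.

```python
import os
import tempfile


def read_new_lines(path, offset):
    """Return (lines, new_offset) for complete lines appended since offset.

    Hypothetical helper: offset is a byte position remembered by the caller.
    A trailing partial line (no newline yet) is left for the next call.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    last_newline = data.rfind(b"\n")
    if last_newline == -1:
        return [], offset
    complete = data[: last_newline + 1]
    return complete.decode("utf-8").splitlines(), offset + len(complete)


# Demonstration with a temporary log file standing in for a real log.
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "app.log")
    with open(log_path, "w", encoding="utf-8") as f:
        f.write("first entry\nsecond entry\n")
    first_batch, offset = read_new_lines(log_path, 0)

    with open(log_path, "a", encoding="utf-8") as f:
        f.write("third entry\n")
    second_batch, offset = read_new_lines(log_path, offset)

print(first_batch)
print(second_batch)
```

In a real collector the second call would sit in a loop or be driven by file-system notifications, and the offset would be persisted so collection can resume after a restart.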
So once you have real-time sources of data, then you can do lots of interesting things with them. The simplest thing you can do is push them somewhere else. So you can take real-time data coming from a database using change data capture and push that out onto Kafka, or push that into blob storage in the cloud, or push it into Hadoop. And we have customers that are doing this very simple type of integration where you’re just going from one source and delivering into a totally different target or, in certain cases with a number of customers, to multiple targets at the same time. We have subsets of data going into Kafka, into, say, Amazon Redshift, into Azure SQL DB, into blob storage and into Hadoop, all in a single data flow. So those are the simplest things you can do, but pretty often you need to be able to manipulate the data before it lands somewhere.
So that’s where being able to do real-time processing of the data through SQL-based continuous queries, and utilizing time windows, really comes in useful. And the types of things that you can do with this are very varied. You can do basic stream processing, where you filter, transform and aggregate the data in real time, and also enrich it: load reference data into memory and join that in real time. But on top of that you can do more intelligence. So you can do things like anomaly detection and correlation of data across multiple data streams, joining data together based on some identifiers within the data to correlate it. And we have customers that are doing that with security data, for example correlating data across lots of different log files by, say, IP address, to see if multiple things are happening to the IP.
And the things happening within the antivirus log, within the firewall, within the VPN, within the network routers, they all point to this IP address behaving badly. And then there’s pattern matching, which is complex event processing, where you are looking for sequences of events over time that indicate something interesting. And so that’s the type of processing that you can do with streaming integration and analytics. You can also integrate with machine learning software. So you deliver preprocessed features into machine learning so that it can learn, and then take the results, the models that are built in your third-party machine learning, and promote those to real time to operationalize machine learning. So we have customers that are doing that today as well. On top of that, you need to be able to visualize the data, to make it visible. So you can have dashboards.
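The cross-log correlation by IP address mentioned above can be sketched in a few lines of Python. This is an illustration only, not Striim's SQL-based query language: the function `correlate_by_ip`, the tuple layout of the events, and the thresholds are all assumptions. It flags an IP address once it has appeared in a minimum number of distinct log sources within a time window.

```python
from collections import defaultdict


def correlate_by_ip(events, window_seconds, min_sources):
    """Return the IPs seen in at least min_sources distinct logs within the window.

    events: iterable of (timestamp_seconds, source_name, ip), in timestamp order.
    All names here are hypothetical and chosen for illustration.
    """
    recent = defaultdict(list)  # ip -> [(timestamp, source)] still inside the window
    flagged = set()
    for ts, source, ip in events:
        # Drop entries for this IP that have fallen out of the time window.
        kept = [(t, s) for t, s in recent[ip] if ts - t <= window_seconds]
        kept.append((ts, source))
        recent[ip] = kept
        if len({s for _, s in kept}) >= min_sources:
            flagged.add(ip)
    return flagged


events = [
    (0, "firewall", "10.0.0.5"),
    (5, "vpn", "10.0.0.9"),
    (12, "antivirus", "10.0.0.5"),
    (20, "vpn", "10.0.0.5"),
    (500, "router", "10.0.0.9"),
]
suspicious = correlate_by_ip(events, window_seconds=60, min_sources=3)
print(suspicious)
```

Here 10.0.0.5 shows up in three different logs within 20 seconds and is flagged, while 10.0.0.9 appears in two logs that are far apart in time and is not.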
We have a built-in dashboard builder within the product that generates alerts and triggers external workflows. And the key thing here is to do all of it in a real-time, continuous and heterogeneous fashion, using an enterprise-grade platform that is inherently distributed, scalable, reliable, and secure. And talking of integration, this is a bit of an eye chart, but this represents the things we can work with. So we have a large number of data collectors across lots of different technologies, and a lot of different ways you can deliver data to lots of different targets. The goal is to make this easy to do by having a drag-and-drop UI where you can build data flows very easily, and also to enable you to evaluate your data by visualizing it as well. So that’s streaming integration with Striim.
That’s the platform in a nutshell. And from there we can go on and talk about some of the new innovations that we’ve added in our latest release. So as with any release of our platform, we are focusing on four major areas. One is the enterprise-grade nature of the platform. So we’re always continually improving the performance of our software and getting more performance out of the things we integrate with, and in this release the focus was on Kafka and getting really high performance out of Kafka. Then there is ease of integration, the broad support for different things that we can integrate with. Then there is the fact that we are an end-to-end platform, so always making us more end to end and adding additional capabilities that you might find in other software, to preclude the need to integrate with a lot of other things.
And then on the ease of use side, we just generally make things even easier to use, add additional features that allow users to get value very, very quickly, and improve the manageability of things. So if we start with Kafka, we added quite a few additional features around Kafka that were driven by experience with customers and some of the issues that customers were facing around performance. The first thing we did was add multithreaded delivery. Basically that means that if you have a single writer to Kafka, then within that single writer, where there used to be a single thread, there can now be multiple threads that are partitioned and work in parallel to get more performance when you’re writing to Kafka. And similarly, when you’re reading from Kafka, we now offer automatic mapping of partitions. What that means is that if you have a single Striim node and you’re reading from a Kafka topic that has, say, eight partitions, that single node would read from all eight.
But now if you add another Striim node to get more performance or scale out, it will automatically map the partitions so that four will be read on one node and four on the other. If you add another two Striim nodes, then you’ll have two partitions read on each node. And as you add nodes or remove them, that automatic mapping happens immediately. So you don’t have to do a lot of additional configuration. You just add more Striim nodes if you’re not seeing the scalability that you need when you’re reading from Kafka. We recognize that things don’t always work out and you might have performance bottlenecks, maybe even in your configuration of Kafka, so we’ve added a lot of monitoring metrics that enable you to pinpoint performance bottlenecks. So by looking at those metrics, or by sending the metrics to us so we can analyze them, we can tell people how they need to tweak their Kafka configuration in order to get the performance that they’re looking for.
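The arithmetic of the partition mapping described above (eight partitions on one node, four each on two nodes, two each on four nodes) can be reproduced with a simple round-robin assignment. This Python sketch only illustrates that arithmetic; how Striim actually maps partitions is not stated here, and the function `assign_partitions` and the node names are hypothetical.

```python
def assign_partitions(partitions, nodes):
    """Spread partition ids across nodes as evenly as possible (round-robin).

    Hypothetical helper for illustration; recomputing it whenever a node
    joins or leaves mimics an automatic remapping.
    """
    if not nodes:
        raise ValueError("at least one node is required")
    assignment = {node: [] for node in nodes}
    for i, partition in enumerate(sorted(partitions)):
        assignment[nodes[i % len(nodes)]].append(partition)
    return assignment


partitions = range(8)
one = assign_partitions(partitions, ["node1"])
two = assign_partitions(partitions, ["node1", "node2"])
four = assign_partitions(partitions, ["node1", "node2", "node3", "node4"])

for label, result in (("1 node", one), ("2 nodes", two), ("4 nodes", four)):
    # Show how many partitions each node reads in each cluster size.
    print(label, {node: len(parts) for node, parts in result.items()})
```

With a partition count that does not divide evenly by the node count, this scheme leaves some nodes with one more partition than others.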
And we are always continually adding support for the latest Kafka releases as customers demand them. So once customers start to see them in production, we will add support for those additional releases. The API changed a lot prior to Kafka going GA. We hope it won’t change as much now that it is GA, but there’s no guarantee there. And the goal there is really to get high performance and scalability and to optimize resources. This is just a quick view of some of the screens around Kafka monitoring, so you can see a lot of the additional metrics that we’re gathering here. And we also expanded the things we deal with in the cloud. So we added an Amazon S3 reader. We could already write to Amazon S3, and now we can read from it in real time. So you can build solutions that utilize blob storage.
You can also have multithreaded delivery, like with Kafka, into Amazon S3. We also added support for integration into Amazon Kinesis. If people are using Amazon Kinesis in the cloud, then we can write into that as well. And we have direct integration into the Azure HDInsight offering. For those of you not aware, HDInsight is the big data offering on Azure that includes Hadoop and Kafka and lots of nice things like that. So people are utilizing that in the cloud and we now have direct integration into it through Azure configuration. And, you know, it’s always important to us to be able to support the technologies our customers are using, so we will obviously be continually adding new integrations as things become popular. Additional integration features include a data masking capability that allows you to mask data in real time, to pseudonymize it.
And that is seeing some uptake in GDPR-type projects. We support OPC UA, which is an IoT protocol, so it’s an additional source that we can read from. We previously supported MQTT and AMQP and TCP and things like that, but now we support OPC UA directly. We made deployment and delivery into MemSQL even easier. So it’s now a fully fledged target within our platform. You could previously write to it using JDBC, and now it’s more optimized and tested, so you can deliver data straight into MemSQL. And we also have log-based change data capture from MariaDB, in addition to Oracle, SQL Server, MySQL and HP NonStop. We’ve expanded the types of analysis you can do with new visualization capabilities. And these are some things that you see in full-blown visualization products.
So now, in addition to just looking at real-time data, you can rewind and set date ranges and look at data generated in the past. You have very granular filtering at a chart level and page level so you can see exactly what you’re looking for, and you can search across the entire dashboard, and it goes back into the underlying store and does that searching and filtering on real-time data and also on the rewound data. It just makes it easier for people to work with the dashboards without having to get into the SQL queries that are behind them. So when people develop the dashboards, they build visualizations that use SQL to talk to our backend server, and now people can just search in the dashboard without having to even touch the SQL that someone else may have built.
This is just what the rewind feature looks like. This is the searching and filtering. If you look at the previous slide, it’s a dark dashboard. That was our inherent theme. We’ve added a light theme because some people like white dashboards, and you can now embed the Striim dashboards directly in other pages. So there’s a way of getting an embed tag that you can then put into your own web pages, and so you can see our dashboards as part of your portals, for example. Just a quick zoom through some other new features. We now monitor the data flow pipelines in real time and highlight any areas that are backlogged, so you can spot bottlenecks and understand which processes might benefit from scaling out. For example, if the queries are taking a lot of time to process data, you may need to run them on multiple nodes in order to get more throughput.
We’ve put in APIs that allow you to create Striim applications and data flows and deploy them, and do all that without even using our command line or UI. That really allows you to integrate into other solutions, and allows third parties to build stream processing as part of their platforms. We’ve incorporated file lineage and management. So for any of our sources and targets that work with files, for every output file we record the ranges of the source data that were used to generate that file, so you can prove lineage of data. It also gives you insight into file usage and storage, so you can better manage things. And we also added an additional windowing type. We already had the ability to do time windows that were jumping, so they would fill up and then all the data would go into a query and be processed, and then they’d fill up again.
You keep on doing that. And we had sliding windows, where every time the window changed the query would run and output new results. And now we have session-based windows that are akin to the way a web session works, where people come and go and are usually there just for a short time, but that time is unknown and you want to keep everything they’re doing together until they leave. So session-based windows are akin to real-life scenarios where the window size is unpredictable and variable, based on some actions that the users are taking. We’ve enhanced our platform so you can now upgrade it without pausing or stopping the applications that are running within Striim. And that allows our customers 24/7 availability, because the Striim cluster can now be completely up and running even while you upgrade it to the latest version.
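As a rough illustration of the session-based windows described above, the following Python sketch groups events per user and closes a session after a period of inactivity. The inactivity gap is an assumption made for this example; the description above only says the window ends based on actions the users take. The function `session_windows` and the event layout are hypothetical, not Striim's API.

```python
from collections import defaultdict


def session_windows(events, gap_seconds):
    """Group (user, timestamp_seconds) events into per-user sessions.

    Assumption for this sketch: a session ends when the same user has been
    inactive for more than gap_seconds, so window length varies per session.
    """
    sessions = defaultdict(list)  # user -> list of sessions (lists of timestamps)
    for user, ts in sorted(events, key=lambda e: e[1]):
        user_sessions = sessions[user]
        if user_sessions and ts - user_sessions[-1][-1] <= gap_seconds:
            user_sessions[-1].append(ts)  # still inside the open session
        else:
            user_sessions.append([ts])    # start a new session
    return dict(sessions)


events = [
    ("alice", 0), ("alice", 20), ("bob", 25),
    ("alice", 45), ("alice", 400), ("bob", 30),
]
result = session_windows(events, gap_seconds=60)
print(result)
```

Unlike a jumping or sliding window, neither the start nor the length of each window is fixed in advance: alice's first session covers 45 seconds, and her second is opened by a single event much later.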
Okay. We’ve improved the way load balancing happens within applications. So if you have a Striim cluster, Striim will now check the resources being used across Striim nodes, optimize the performance, and balance out where applications and processing are running. We’re Docker certified. So we have a Docker image that you can very easily get to and download, and creating a Striim cluster in a Docker cluster is really simple now. It really improves the productivity of people that are using us and Docker together, and it increases the different places we can be deployed. We can already be deployed on servers through various installs, on VMs in the cloud, and through Docker. We have a marketplace entry on AWS, and we have images and marketplace entries on Azure as well.
So it really increases the ways people can get to Striim. Then finally, we have this ability to preview data going in and out of continuous queries so you can debug them, check that they’re working, and diagnose any issues. That’s a very quick run through of the Striim platform and the new release. This is a summary of everything that I’ve just talked about. The changes really range from performance and scalability of the platform, making it even more enterprise grade, to more support for clouds and other systems, increasing the integration capabilities of the platform. In addition, there is data exploration, where you can rewind the dashboard, filter things at a chart and page level, and search. And on the usability side, we are adding in all these different capabilities, from file lineage to data masking, that really make the platform more usable. And if people want to get started with Striim, you can go to our website, click on free Striim, download the product today, and start testing it out. It’s really simple and easy to use. So that’s the end of my presentation. Any questions?
Great, thanks for that, Steve. You know, the question that came to mind when you were setting this up: you have these two elements, right? You have the data integration, and you have the real-time analytics on the real-time side. What happens when one of those streams becomes stale? Does it break down the whole system, or how do you deal with that?
So now obviously the data is only as up to date as the sources, right? So what customers will actually do on top of integration, and there’s a lot of our customers that are using us for integration, moving data from one place to another, correlating it, et cetera, is build small analytics applications, not to analyze the content of the data, but to analyze the data flows themselves. They do things like a count of how much data is moving from one place to another, do some analysis on that, work out, say, a moving average and standard deviation, and then trigger an alert if the data flow speed drops below, say, a standard deviation below the mean, or something like that. So something very simple. And there are other use cases more complex than that, where they’re comparing current data rates with historical data rates, even taking into account daily fluctuations based on hour of the day, day of the week, holidays, et cetera, and asking: is my transaction rate looking normal? And with change data capture types of scenarios, they will do that.
Not just at a database level but even at a table level. Are we seeing the normal amount of changes coming from this table? Has it suddenly increased way beyond what we normally expect, or has it suddenly dropped off to nothing? So building those types of analytics applications, where you are monitoring the data flows, is actually pretty common, and it does help you spot things like that, if, you know, one of these streams suddenly stops giving you data.
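The simple flow monitor described above, a moving average and standard deviation over event counts with an alert when the rate drops too far below the mean, might look like this in Python. This is a sketch under assumptions: the class name `FlowRateMonitor`, the per-interval counts, the history length, and the minimum sample size before alerting are all illustrative choices, not details given above.

```python
from collections import deque
from statistics import mean, stdev


class FlowRateMonitor:
    """Alert when an interval's event count drops well below the recent average."""

    def __init__(self, history=30, threshold=1.0, min_samples=5):
        self.counts = deque(maxlen=history)  # recent per-interval counts
        self.threshold = threshold           # standard deviations below the mean
        self.min_samples = min_samples       # wait for a baseline before alerting

    def observe(self, count):
        """Record one interval's count and return True if it should alert."""
        alert = False
        if len(self.counts) >= self.min_samples:
            average = mean(self.counts)
            deviation = stdev(self.counts)
            alert = count < average - self.threshold * deviation
        self.counts.append(count)
        return alert


monitor = FlowRateMonitor()
per_interval_counts = [100, 102, 98, 101, 99, 100, 0]  # the stream stalls at the end
alerts = [monitor.observe(c) for c in per_interval_counts]
print(alerts)
```

The more complex variants mentioned above would compare against a baseline keyed by hour of day or day of week rather than a single rolling window, and one monitor per table would give the table-level view.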
So Steve, what does a typical customer engagement look like? You know, say I’m a new customer and I’ve tried it out. It seems like there’s a lot of moving parts here. Is this more like a consultant kind of role for Striim, or how do you go about it?
It really depends a lot on the use case, and I guess also on the confidence of the people doing the downloading. I mean, we’ve had customers that downloaded the product and built applications as a way of justifying budget for real-time analytics. The engineers were convinced this would be a great thing to bring value to the company, but they couldn’t get budget for it just by doing a PowerPoint. So they thought, well, why don’t we just show some results and then get budget that way, which is great. But then, you know, other customers do need more assistance in getting started. We do a POC with customers if that’s necessary, a proof of concept showing that everything they want to do with the platform they can actually do. Some of those are very straightforward, while some are more in-depth. It really depends on the sophistication of the customer, but in general it is a platform that you can get up and running yourself pretty easily.
And what we have tried to do is embed more and more help within the product in order to guide people through things. An example is with change data capture. When we first provided change data capture capabilities in the product, you would go into the data flow, you’d drag in, say, an Oracle CDC reader, and then you’d have to configure it. Then you’d start it up and it would go wrong, and you’d have to try and work out what was going wrong, et cetera. In one of our previous releases, we added a wizard that steps you through that process. So now you start off with a wizard, you enter some connection information for the database, and if anything’s wrong, it’ll tell you. It connects to the database, and it might see that you don’t have the right privileges or that, say with Oracle, supplemental logging is not turned on.
So you can’t do change data capture. For those and the other things that could go wrong, we will check for all of them and not only tell you if they’re not going to work, but also what you can do to fix it. Obviously all of that information is in the manual, but not everyone reads manuals, right? So by moving more and more stuff into the wizards and the UI and helping people through the process, you don’t completely eliminate the need for support and for engineers to go out and help customers in all cases, but you can allow customers to be a lot more self-sufficient. We have lots of samples of writing stream processing through queries, and I think there’s only one really new thing for most people, because most of the queries are just like SQL.
Anyone that’s familiar with SQL can do this, right? But the only really new thing is windows, and how you best utilize time windows over data streams to get the value that you need. That can take a little bit of thought, so people might need some help there sometimes, but we’ve had customers that built full analytics apps on the platform without any help at all. And we first hear from them when they actually want to go ahead and buy it. So it can be easy to use if you have the right mindset.
Well, Steve, I can’t let you go without asking: what’s next? I mean, you mentioned AI, for example. Where do you go beyond real time? What’s the next step? You know, more and more things are becoming real time and there are always new technologies to integrate with.
That’s the great thing about having a streaming integration platform: there’s no end of things that people want to integrate. So almost every release contains new sources and new targets. There’s always new stuff that people want to work with. Machine learning is definitely becoming more and more important to customers. And while we’re not a machine learning platform, we can integrate with machine learning, especially from the perspective of operationalizing it. And we aim to make that even simpler down the road as well. The hot things that people are talking about right now are cloud adoption, and almost everyone is doing IoT. And we have solutions for that. That’s something we’re seeing more and more of.
As more industries start adopting IoT, machine learning is something that, again, we’re seeing a big uptick in as well. And then there are newer technologies that everyone has started thinking about, things like blockchain and distributed ledgers, that may end up becoming important as well. So, you know, as a technology company, as an integration company, we always have to look at what technologies are being adopted and what technologies are likely to be adopted, and incorporate those into the platform, while continually making it more enterprise grade and even easier to use.
Well, great. Steve, you know, this has been fascinating, and I’d really like to thank you once again for coming on the show today.
It’s been a pleasure, Rich. You bet.
Okay, folks, that’s it for the Rich Report. Stay tuned for more news and information on the world of big data.