The Critical Role of a Streaming First Data Architecture
Steve Wilkes, Striim Co-founder and CTO, discusses the need for a “streaming first” data architecture and walks through a demo of Striim’s enterprise-grade streaming integration and streaming analytics platform.
To learn more about the Striim platform, go here.
Today we're going to talk about what a streaming-first architecture is and why it is important to your data modernization efforts. We'll talk about the Striim platform and give you some examples of the solutions customers are building using our platform, take some time to give you a demonstration of how it all works, and then open things up at the end for questions and answers.
Most of you are probably aware of this: the world moves fast, and it's not just the world, it's the data that moves fast. Data is being generated largely by machines now, and so businesses need to run at machine speed. You need to be able to understand what's happening right now and react immediately, before it's too late. At the same time, customer and employee expectations are rapidly rising. A big reason for this is the smartphone boom and people's access to real-time information: the ability to see what's happening to your friends and what's happening in the world, instant access to news, instant access to messages and communication. Because the consumer world is instant, and those consumers are also employees and executives, they expect instant responses and insight into what's happening within the enterprise. The quality of the applications they are used to is also driving a desire for similar quality in business applications.
The other side of the coin is that businesses need to compete, and technology has always been a source of that competition. As technology rapidly changes and we get more and more data, businesses that are more data-driven have a competitive edge, and this now applies to almost all departments, ranging from engineering and manufacturing all the way through to marketing. The survival of most businesses depends on the ability to innovate and utilize new technologies, and data itself is also massively increasing. Almost everything can generate data: trucks, your refrigerator, TVs, the wearable devices that you have, healthcare devices that are becoming more and more portable. Even food making its way from where it was caught to a restaurant is tracked and produces large amounts of data, and this data is growing exponentially.
IDC did a study a couple of months ago estimating that the roughly 16 zettabytes of data that exist today will increase ten-fold by 2025. Around 5% of that data is real time now, increasing to 25% by 2025, and by real time they mean it is produced and needs to be processed and analyzed in real time. That's around 40 zettabytes (a zettabyte is a 1 followed by 21 zeros) of data that will need to be processed in real time, and of that, 95% will be generated by devices. The kicker is that only a small percentage of this data can ever be stored; there are physically not enough hard drives being produced to store it all. So if you can't store it, what can you do with it? The only logical conclusion is that you need to process and analyze this data in memory, in a streaming fashion, close to where the data is generated. Maybe you're turning the raw data, say a thousand data points a second, into aggregated data that is less frequent but still contains the same information content. That kind of thing is what people talk about as edge processing, which is really intended to handle these huge volumes of data that people see coming down the line.
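To make the aggregation idea concrete, here is a minimal Python sketch of what collapsing a burst of raw sensor readings into one summary record might look like. This is purely illustrative, not the platform's API; the function and field names are my own.

```python
from statistics import mean

def aggregate_window(readings):
    """Collapse a burst of raw sensor readings into one summary record.

    Keeps the information content (count, min, max, mean) while cutting
    the event rate from thousands of points per second down to one
    record per window.
    """
    return {
        "count": len(readings),
        "min": min(readings),
        "max": max(readings),
        "mean": mean(readings),
    }

# e.g. many raw temperature points in one second become a single summary
raw = [20.0, 21.5, 19.8, 22.1]
summary = aggregate_window(raw)
```

The downstream consumer then sees one record per window instead of thousands, which is the data reduction edge processing is after.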
And it's not just IoT data that arrives in streams. Every piece of data is generated because something happened, some kind of event: someone was working in an enterprise application, someone was doing things on a website or using a web application, and machines were generating logs based on what they were doing. Applications generate logs, databases generate logs, network devices generate logs, but they're all based on what's happening, based on events. So if the data is created based on events, in a streaming fashion, then it needs to be processed and analyzed in a streaming fashion. If you collect things in batches, there's no way you'll ever get to a real-time architecture and real-time insights into what's happening. But if you collect things as streams, you can still do the other things: you can do batch processing on the streaming data, you can deliver it somewhere else. At a minimum, the data collection needs to be streaming.
So streaming-first has emerged as a major infrastructure requirement and is helping drive enterprise modernization. Moving to a streaming-first data architecture means transitioning your data collection to real-time collection. You're not doing batch collection of data; you're doing real-time collection, whether it's from devices, files, databases, or wherever the data originates, and you're doing this incrementally. You're not trying to boil the ocean and replace everything in one go; you're doing it use case by use case as part of your data modernization projects. That means the things that have high priority become real time first, to give you real-time insights and a potential competitive edge, or better support for your customers, or reduced manufacturing costs through improved product quality. Any of these things can drive data modernization, and by doing it use case by use case, replacing pieces of it, you bridge the old and new worlds of data.
Some of the things our customers are telling us: they have these legacy systems, and by legacy that can mean anything installed over a year ago, and those systems can't keep up, or aren't predicted to keep up, with the large volumes of data they're expecting to see, with the requirements for low-latency, real-time insights into data, and with the need to rapidly innovate and rapidly produce new types of analysis and processing that give new insights into what's happening in the business. They're also telling us that they can't just rip and replace these systems; they need the new systems and the old systems to work together, potentially with failover from one to the other, while they're doing this replacement. Striim has been around for about five years now.
We are the providers of a platform, the Striim platform, that does streaming integration and analytics. The platform is mature; it's been in production with customers for more than three years now, with customers in a range of industries from financial services to telco, healthcare, and retail, and we're seeing a lot of activity in IoT. Striim is a complete end-to-end platform that does streaming integration and analytics across the enterprise, cloud, and IoT. We have a very flexible architecture that allows you to deploy data flows bridging enterprise, cloud, and IoT. You can deploy pieces of an application at the edge, close to where the data is generated, and that doesn't have to be just IoT data; it can be any data, processed close to where it is generated. Other pieces run on premise, doing some processing, and other pieces in the cloud.
It's very flexible how you can deploy applications using our platform, because applications consist of continuous real-time data collection. That can be from things you may think of as naturally real time: sensors sending events, message queues, et cetera. There are also things like files that you may think of as batch. We can read to the end of a file and, as new records are written, stream them immediately, handling file rollover, et cetera, turning files into a source of streaming data. Similarly with databases: most people think of databases as a historical record of what happened in the past, but by using a technology called change data capture, you can see the inserts, updates, and deletes, everything that's happening in that database, in real time. So you can collect that non-intrusively from the database as a stream.
So now you have a stream of all the changes happening in the database. All of the applications built with the platform use some form of continuous data collection. On top of that, you can do real-time stream processing, and this is through SQL-based queries. There's no programming involved: no Java, no C#, no JavaScript. You can build everything using SQL, and this allows you to do filtering, transformation, and aggregation of the data by utilizing data windows. So you can ask what's happened in the last minute, you can look for a change in data and only send that out, et cetera. And then there's enrichment of data, which is also very important: the ability to load large amounts of reference data into memory across the distributed cluster and join it in real time with streaming data to add additional context.
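A time-based data window like "what's happened in the last minute" can be sketched in a few lines of Python. This is a conceptual model of the windowing semantics, not the platform's implementation; the class and method names are illustrative.

```python
from collections import deque

class SlidingWindow:
    """Keep only events from the last `span_seconds`, similar to a
    time-based data window behind a streaming SQL query."""

    def __init__(self, span_seconds):
        self.span = span_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first

    def add(self, ts, value):
        self.events.append((ts, value))
        # evict anything older than the window span
        while self.events and self.events[0][0] <= ts - self.span:
            self.events.popleft()

    def total(self):
        """An aggregate over the window, e.g. SUM of the last minute."""
        return sum(v for _, v in self.events)

w = SlidingWindow(60)          # a one-minute window
w.add(0, 5)
w.add(30, 10)
w.add(70, 1)                   # the event at t=0 falls out of the window
```

Each new event both joins the window and evicts expired ones, so the aggregate always reflects exactly the last minute of data.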
An example would be device data coming in that says device XYZ has value 123. That doesn't mean much to things downstream that might be trying to analyze it. But if you join it with some context and say that device XYZ is this sensor on this particular motor on this particular machine, now you have more context, and if you include that data, downstream analysis can do much better. On top of the stream processing you can do streaming analytics. That can be correlating data together, joining data from multiple different data streams and looking for things that match in some way. Maybe you're using web logs and network logs and you're trying to join by IP address, looking for things that have happened on either side in the last 30 seconds. Beyond that kind of correlation there is complex event processing, which looks for sequences of events over time that match some kind of pattern.
So if this happens, followed by this, followed by this, then it's important. You can also do statistical analysis for anomaly detection and integrate with third-party machine learning. We can generate alerts, trigger external systems, and build really rich streaming dashboards to visualize the results of your analytics. Any of the data that's initially collected, the results of processing, and the results of analytics can all be delivered somewhere, and you can deliver to lots of different targets in a single application. So you can push data to enterprise and cloud databases, files, Hadoop, Kafka, et cetera. As a new breed of middleware that supports streaming integration and analytics, it's very important that we integrate with your existing software choices. We have lots of data collectors and data delivery adapters that work with systems you may already have, whether it's big data systems, enterprise databases, or open source pieces, and we do all of this in an enterprise-grade fashion that is inherently clustered, distributed, scalable, reliable, and secure, as a general-purpose piece of middleware.
We support lots of different types of use cases, from real-time data integration to analytics and being able to build dashboards and monitor things, and these use cases span many different industries. They can range from building your data lake and preparing the data before you land it, to doing migrations, to doing IoT edge processing. Then on the analytics and patterns side there are things like fraud detection, predictive maintenance, and anti-money-laundering, some of the things we've seen from customers. And if you want to build dashboards and monitor things in real time, to see whether things are, for example, meeting SLAs or meeting expectations, we've done things like call center quality monitoring, SLA monitoring, looking at the network from a customer perspective, and being able to alert when things aren't running normally. There's a lot of text on this slide, but the takeaway is that we have use cases in a lot of different industries.
One of the examples is using Striim for hybrid cloud integration, which is really where you have a database on premise and you want to move it or copy it to the cloud. It's one thing to just take the database and put it in the cloud, but that will miss anything that happens while you're doing it, or anything that has happened since you did it. So it's really important to include change data capture to continually feed your hybrid cloud database with new information. By using a set of wizards, you can build this really quickly, connecting an on-premise Oracle database, for example, and delivering real-time data from it into, say, Azure SQL DB. You now have an exact copy of the on-premise database that is always up to date.
Another totally different example is using us for security monitoring, where you have lots of different logs being produced by VPNs, firewalls, network routers, individual machines, essentially anything that can produce a log, and you recognize that unusual behavior is most often seen affecting multiple systems. Security analysts get a lot of alerts from all these logs and all these systems all the time, but a lot of those are false positives. So the goal here was to identify the things that are really high priority to look at first, by spotting activity that affects multiple systems. For example, if you see a port scan against a network router and the same actor is looking at other machines, is there any activity on those other machines? Are they doing port scans too?
Are they connecting to external sites and downloading malware? By doing this correlation in memory in real time, you can spot threats and assign them a higher priority. And by pre-correlating all the data together and providing it to the analysts, they can immediately see the information they need rather than having to manually look for it across a whole bunch of different logs, which really increases the analysts' productivity. A couple of other examples from our customers. One is a very simple real-time data movement, where data from HP NonStop and SQL Server databases is being pushed out into multiple targets, whether it's Hadoop HDFS, Kafka, or HBase, and they're using that as an analytics hub. Basically they're ensuring that wherever they want to put the data, it's always up to date and always contains the real-time information they can see in the source databases.
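The kind of in-memory, time-bounded correlation described for the security use case can be sketched like this in Python. It is a deliberately naive pairwise join on IP address within a 30-second window; a real streaming engine would do this incrementally with windows, and all names here are illustrative.

```python
def correlate(web_events, net_events, window=30):
    """Join two event streams on IP address where the events occurred
    within `window` seconds of each other - the sort of correlation
    used to surface real threats instead of isolated alerts.

    Each event is an (ip, timestamp) pair. Returns (ip, web_ts, net_ts)
    for every correlated pair.
    """
    matches = []
    for w_ip, w_ts in web_events:
        for n_ip, n_ts in net_events:
            if w_ip == n_ip and abs(w_ts - n_ts) <= window:
                matches.append((w_ip, w_ts, n_ts))
    return matches

web = [("10.0.0.1", 100)]
net = [("10.0.0.1", 120), ("10.0.0.2", 110)]
threats = correlate(web, net)  # only 10.0.0.1 appears in both within 30s
```

An alert backed by a match in two independent log sources is a much stronger signal than either alert on its own, which is why pre-correlating raises analyst productivity.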
Then a glucose monitoring company is using us to see events coming in from implantable devices that do real-time monitoring of glucose, and it's really important that these things work. They are looking at whether a device is having any errors or suddenly going offline, and being able to see in real time if any device is not working properly. This is really important to their patients, because the patients rely on these devices to check their glucose levels. This has really reduced the time to detect that there's an issue and has improved patient safety massively. We are recognized by a lot of the analysts in both the in-memory computing and streaming analytics landscapes, and we're also getting a lot of recognition from various publications and trade show organizers. Also, very importantly, we won a Best Places to Work award, which is really vindication of us being a really great company.
A key differentiator is that Striim's end-to-end platform does everything from collection and processing through analytics, delivery, and visualization of streaming data, and that it is easy to use, with a SQL language for building the processing and analysis that lets you build and deploy applications in days. We're also enterprise grade, which means we are inherently scalable in a distributed architecture, reliable, and secure, and we're easy to integrate with your existing technology choices. Those are the key things to remember about why we're different. With that, we're going to go into a demonstration. In the first part, rather than typing a lot of things, we're basically going to walk through how to do the integration: how to build change data capture into Kafka, do some processing on that, and then deliver into other targets.
This is a pure integration play. You start off by doing change data capture from a database, in this case MySQL. We build the initial application and then configure how we get data from the source, so we configure the connection information for MySQL. When you do this, we'll check and make sure everything is going to work, that you already have change data capture configured properly, and if you don't, we'll tell you how to fix it. You then select the tables you're interested in collecting the change data from, and this is going to create a data stream. That data stream will then go out to Kafka, so we configure how we want to write into Kafka, which is basically setting up the broker configuration, the topic, and how we want to format the data.
In this case we're going to write it out as JSON. When we save this, it creates a data flow, and the data flow is very simple: two components, going from a MySQL CDC source into a Kafka writer. We can test this by deploying the application, which is a two-stage process: you deploy first, which puts all the components out over the cluster, and then you run it. Now we can see the data that's flowing in between, so if I click on this, I can actually see the real-time data. You see there's the data and there's the before image; for updates you get the before image as well, so you can see what actually changed. So this is real-time data flowing through from the MySQL application. But it doesn't usually end there.
The raw data may not be that useful. One of the pieces of data in here is a product ID, and that probably doesn't carry enough information on its own. So the first thing we're going to do is extract the various fields from the record; those fields include the location ID, the product ID, how much stock there is, et cetera. This is an inventory monitoring table, and we've just turned it from a raw format into a set of named fields, which makes it easier to work with later on. You can see the structure of what's in the data stream is very different now. Then, since we want to add additional context, we'll join this data with something else. First, though, we'll configure the flow so that instead of writing the raw data to Kafka, we write the processed data out, and you can see all we have to do is change the input stream. That changes the data flow to write the processed data into Kafka.
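Turning a positional CDC record into named fields is conceptually just zipping the values with the table's column names. A minimal Python sketch, with an illustrative column list that stands in for the inventory table's real schema:

```python
def to_named_fields(cdc_row):
    """Turn a positional CDC record (a list of column values) into a
    record with named fields, assuming a known column order.

    The column names below are illustrative placeholders for the
    inventory monitoring table described in the demo.
    """
    columns = ["location_id", "product_id", "stock_amount"]
    return dict(zip(columns, cdc_row))

record = to_named_fields([10, "XYZ", 42])
# downstream queries can now refer to record["product_id"] by name
```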
Now we're going to add a cache, which is a distributed in-memory data grid that will contain the additional information we want to join with the raw data. In this case it's product information: for every product ID there's a description, a price, and some other attributes. First we create a data type that corresponds to our database table and configure what the key is; the key in this case is the product ID. Then we specify how we're going to get the data. It could be from files, it could be from HDFS; here we're going to use a database reader to load it from a MySQL table, so we specify the connection details and the query we're going to use. We now have a cache of product information, and to use it, we modify the SQL to join in the cache.
Anyone who has ever written any SQL before knows what a join looks like; we're just joining on the product ID. So now, instead of just the raw data, we have these additional fields that we're pulling in in real time from the product information. If we start this and look at the data again, you'll see the additional fields, like description, brand, category, and price, that came from that other source, all joined in memory. There are no database lookups going on, so it's really fast. So that's streaming into Kafka. If you already have data on Kafka, or another message bus, or anywhere else for that matter, such as files, you may want to read it and push it to other targets. So what we're going to do now is take the data that we just wrote to Kafka.
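The cache join amounts to an in-memory key lookup merged into each streaming event. Here is a minimal sketch, modeling the cache as a plain Python dict; in the platform it would be a distributed in-memory data grid loaded from a database table, and all names here are illustrative.

```python
# Illustrative stand-in for the product cache, keyed by product ID.
product_cache = {
    "XYZ": {"description": "Widget", "brand": "Acme", "price": 9.99},
}

def enrich(event, cache):
    """Join a streaming event with cached reference data.

    No database lookup happens per event - just an in-memory key
    access - which is why this stays fast at streaming rates.
    """
    context = cache.get(event["product_id"], {})
    return {**event, **context}

enriched = enrich({"product_id": "XYZ", "stock": 5}, product_cache)
# enriched now carries description, brand, and price alongside the raw fields
```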
We're going to use a Kafka reader in this case, so we search for that, drag in the Kafka source, and configure it with the properties to connect to the broker we just used. Because we know it's JSON data, we're going to use a JSON parser that will break it up into a JSON object structure and then create a data stream. When we deploy and start this application, it starts reading from that Kafka topic, and we can look at the data and see that this is the data we were writing previously, with all the information in it, in JSON format; you can see the JSON structure. The other targets we're going to write to might not work with the JSON structure, though. So what are we going to do now?
We're going to add in a query that pulls the various fields out of that JSON structure and creates a well-defined data stream with individual fields in it. So we write a query that directly accesses the JSON data and save it. Now, instead of the original data stream with the JSON in it, when we deploy this, start it up, and look at the data (and this is incidentally how you would build applications, looking at the data all the time as you build and add components), we can see that we have those individual fields, which is what we had before on the other side of Kafka. But don't forget, it doesn't have to be a stream written into Kafka; it could be anything else. And if you were doing something like we just did, CDC into Kafka and then Kafka into additional targets, you don't have to have Kafka in between; you can just take the CDC and push it out to the targets directly.
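Pulling individual fields out of a JSON message is straightforward to illustrate with Python's standard `json` module. A hedged sketch, with field names that are illustrative rather than taken from the demo's actual topic:

```python
import json

def parse_message(raw_bytes):
    """Parse a JSON-encoded message (as read off a topic) into a
    well-defined record with individual, typed fields."""
    obj = json.loads(raw_bytes)
    return {
        "product_id": obj["product_id"],
        "stock": int(obj["stock"]),  # coerce to a concrete type
    }

record = parse_message(b'{"product_id": "XYZ", "stock": "5"}')
```

Downstream targets that can't ingest raw JSON (a CSV file, a relational table) then receive a flat record of typed fields instead of a nested document.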
What we're going to do now is add a simple target that writes to a file. We do this by choosing the file writer and specifying the format we want. We're going to write this out in CSV format, which we actually call DSV because it's delimiter-separated: the delimiter can be anything, it doesn't have to be a comma. We save that, and now we have something that will write to the file. If we deploy this and start it up, we'll be creating a file with the real-time data.
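The DSV idea, CSV with an arbitrary delimiter, maps directly onto Python's standard `csv` module, which accepts any single-character delimiter. A small sketch (writing to a string buffer here rather than a real file, for illustration):

```python
import csv
import io

def write_dsv(records, delimiter="|"):
    """Render records as delimiter-separated values.

    The delimiter doesn't have to be a comma - here it defaults to a
    pipe - which is the point of calling the format DSV rather than CSV.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter)
    writer.writerows(records)
    return buf.getvalue()

text = write_dsv([["XYZ", "42"], ["ABC", "7"]])
```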
After a while it has some data in it, and then we can use something like Microsoft Excel to view the data and check that it's what we wanted. So let's take a look in Excel, and you can see the data that we initially collected from MySQL being written to Kafka, read from Kafka, and then written back out into this CSV file. But you don't have to have just one target in a single data flow; you can have multiple targets if you want. We're going to add two: writing into Hadoop and into Azure Blob Storage. In the case of Hadoop, we don't want all the data to go there, so we'll add a simple CQ (continuous query) to restrict the data, and we'll do this by location ID.
Only location 10 is going to be written to Hadoop, so there's some filtering going on there. Now we add in the Hadoop target; we're going to write to HDFS, so we drag that into the data flow. There are many ways of working with the platform, by the way; we also have a scripting language that enables you to do all of this from vi or emacs or wherever your favorite text editor is. We're going to write to HDFS in Avro format, so we specify the schema file, and when this is started up, we'll be writing into HDFS as well as to the local file system. Similarly, if we want to write into Azure Blob Storage, we can take the adapter for that, search for it, and drag it in from the targets. We want to do that on the original source data, not the query output, so we drag it onto the original data stream.
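The continuous query restricting what reaches the Hadoop target is, in essence, a predicate applied to every event. A minimal Python sketch of that per-target filtering (field and function names are illustrative):

```python
def location_filter(events, location_id=10):
    """Continuous-query-style filter: pass through only the events for
    one location, so a target (e.g. the Hadoop writer) sees a subset
    of the stream while other targets still get everything."""
    return [e for e in events if e["location_id"] == location_id]

stream = [
    {"location_id": 10, "stock": 5},
    {"location_id": 2, "stock": 7},
]
for_hadoop = location_filter(stream)  # only location 10 survives
```

The same source stream can feed several targets, each behind its own filter, which is how one data flow serves multiple destinations with different subsets of the data.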
Now we just configure this with information from Azure. You need to find out what the server URL is, and you should know your account key, username, and password, so you collect that information if you don't have it already and add it into the target definition for Azure Blob Storage. We're going to write that out in JSON format. So that, very quickly, is how you can do real-time streaming data integration with our platform, and all of that data was streaming; it was being created by making changes in MySQL. Now let's see some analytics. I have a couple of applications I'll show you very quickly. The applications are defined through data flows, and data flows typically start with a data source.
They then do a whole bunch of processing, and you can organize them into subflows as well, where each subflow can do reasonably complex things with nested data flows. If I deploy this application and we take a look at a dashboard, you'll see how you can start visualizing some of this data. This data is coming from ATM machines and other cash points, other ways of getting cash, and the goal of the application is to spot whether the decline rate for credit card transactions and so on is going up. It takes the raw transactions, slices and dices them by a whole bunch of different dimensions, and tries to spot whether the decline rate has increased by more than 10% in the last five minutes, both overall and across each of the different dimensions.
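The core check the application performs can be sketched as two small functions: compute a decline rate over a batch of transactions, then compare the current window against the previous one. This is an illustrative reading of "increased by more than 10% in the last five minutes" (interpreted here as 10 percentage points); the real application's exact definition may differ.

```python
def decline_rate(transactions):
    """Fraction of transactions in this window that were declined."""
    if not transactions:
        return 0.0
    return sum(1 for t in transactions if t["declined"]) / len(transactions)

def rate_alert(previous_window, current_window, threshold=0.10):
    """Flag when the decline rate has risen by more than `threshold`
    between the previous window and the current one.

    In the demo this check runs overall and per dimension (card type,
    ATM, region, ...), each dimension getting its own pair of windows.
    """
    return decline_rate(current_window) - decline_rate(previous_window) > threshold

prev = [{"declined": False}] * 9 + [{"declined": True}]   # 10% declined
curr = [{"declined": True}] * 3 + [{"declined": False}] * 7  # 30% declined
alert = rate_alert(prev, curr)  # 20-point jump trips the alert
```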
Nothing is hard-coded in these visualizations; it was all built using our dashboard editor, where you can drag and drop visualizations into the dashboard. Each visualization is configured with a query that tells it how to get the data from the back end, a set of properties that tell it how to get the data from the query into the visualization, and obviously other configuration information. So that's an example of one analytics application built using our platform. Let's take a look at a totally different one that does something completely different; I'll just stop this one. This one is tracking passengers and employees at an airport, and the data is coming from location monitoring devices that track WiFi. If we take a look at a dashboard for this, you can see it's a rich dashboard with lots of information on it.
The data here is location information joined with zones that have been set up. These zones represent different airline ticketing areas, and what we're doing is tracking the number of passengers and employees in the different areas. If the number of passengers goes up too much and additional employees are needed, it flags that by turning red and sends out a request for more employees. The red dots here are the employees and the white dots are the passengers, so as more red dots arrive in this location, it notices that new employees were routed there, and the alert goes away, because things are actually okay now. The other thing it is tracking is the individual locations of all the passengers, and this passenger over here just walked into a prohibited zone.
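The zone logic boils down to a point-in-zone test applied to each tracked position. A minimal sketch, modeling zones as axis-aligned rectangles; the real application's zone shapes and field names may well differ, so treat these as illustrative.

```python
def in_zone(point, zone):
    """Is a tracked (x, y) position inside a rectangular zone?
    Zones are (x_min, y_min, x_max, y_max) tuples."""
    x, y = point
    x_min, y_min, x_max, y_max = zone
    return x_min <= x <= x_max and y_min <= y <= y_max

def zone_alerts(passengers, prohibited_zone):
    """Return the ids of passengers currently inside the prohibited
    zone, so an alert can be raised for each one."""
    return [pid for pid, pos in passengers.items()
            if in_zone(pos, prohibited_zone)]

positions = {"p1": (5, 5), "p2": (50, 50)}
alerts = zone_alerts(positions, (0, 0, 10, 10))  # flags p1 only
```

Running this check continuously against the streaming location data is what lets the dashboard raise an alert the moment a passenger crosses into the zone.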
That sends out an alert, and now an employee will try to track that person down and remove them from the prohibited zone. So that's a totally different analytics application, but again, this one was also built using our dashboard builder. I hope that gives you an idea of the variety of things you can do using the Striim platform. To finalize things: Striim can be easily fitted into your existing data architecture. You can start to take some of the load away from what may be existing ETL jobs and move those into real time, and we can integrate with all of the sources you may be using for existing ETL. But we can also integrate with a lot of other sources as well, and pull data out of things you may not have had access to before.
We can also integrate with the things you have already. Maybe you have an operational data store or enterprise data warehouse, or maybe you're already writing data into Hadoop; we can integrate with that data, maybe use it for context information. But we can also write to those, so you can use us to populate your operational data store, enterprise data warehouse, or Hadoop. And we can integrate with your machine learning choices as well. So these real-time applications play a really important part. They're a new class of application that gives you real-time insights into things, but we can also be part of driving your big data applications and legacy applications.
As I mentioned before, the platform consists of real-time data collection across very many different types of data sources and then real-time data delivery into a lot of different targets. The example you saw originally, CDC into Kafka, was just that, very simple, with no processing. We then added in processing, and that's all done through in-memory continuous queries written in a SQL-like language. They can incorporate time-series analysis through windowing, which allows you to do things like the transformation of data that you saw, filtering, enrichment of data by loading large amounts of data into memory as external context, and aggregation of data, in order to ask what happened in the last minute, et cetera, or aggregate by dimensions as you saw in the transaction analysis. The other piece is the analytics, where we can do anomaly detection, pattern matching through complex event processing, and, very importantly, the correlation that was so central to the security application. On top of this you can trigger alerts and external workflows, you can run ad hoc queries against the platform if you want to see what's going on right now in a data stream, and you can build real-time dashboards. We can also integrate with your choices for machine learning and do real-time scoring very close to where the data is generated.
We integrate with most of your existing enterprise software choices, being able to connect to a whole bunch of different sources in a whole bunch of different formats, and deliver to a whole bunch of different targets. We also integrate with your choice of big data platform, whether that's MapR, Hortonworks, or Cloudera, and your choice of cloud, whether it's Amazon, Microsoft, or Google, and we run on operating systems and virtual machines such that we can run on premise, at the edge, and in the cloud. So we are well suited for IoT: we have a separate, cut-down edge server that's designed to run on gateway hardware that may not be as powerful as what you'd run for a Striim cluster. Processing and analytics can happen at the edge, and we can deliver that data directly into the cloud or into a Striim server on premise.
And you can have applications that span edge processing, on-premise analytics through the Striim server, and the cloud. Typically the amount of data and the speed of the data is greater on the left-hand side, and as you move right you're reducing that data down to the important information, but covering a lot more territory. So you may have a single edge device for a single machine in the factory, then a Striim server that covers the whole factory, and then the cloud covering all the factories an organization may own.
And we believe that IoT is not a siloed technology. It shouldn't be thought of separately, especially IoT data. IoT data needs to be integrated with your existing enterprise data and is part of your enterprise data assets. So as part of your data modernization, as you're thinking about IoT, don't think about it separately. Think about how to integrate that IoT data and get the most value out of it, because it will be much more valuable if it has more meaning, and it can have more meaning when you correlate it and join it with your other enterprise data. We are one consistent, easy-to-use platform. Not only do we have a converged in-memory architecture that combines in-memory caches, high-speed messaging, and in-memory processing, but we also have a single UI that allows you to design, analyze, deploy, visualize, and monitor your streaming applications.
The key takeaway from this, I hope, is that you really need to start thinking about a streaming first architecture right now. You need to start thinking about how to get continuous data collection from your important data sources, and consider that from the point of view of what you require immediate insight into: how do I increase the competitiveness of my company, or operational efficiency, or whatever reasons you may have, whatever pushes you might have, for real-time applications. How do I go about doing this on a piece-by-piece basis? Start by streaming those important data sources. Also consider sources where your data volumes are going to grow and you may need to do pre-processing in flight before you store any data; that's another area where streaming first is absolutely essential. We believe that Striim is the right solution for this because we have a streaming architecture that addresses both of these concerns, and other concerns you may have as well, especially being enterprise grade and able to run mission-critical applications, so you shouldn't have to rip and replace everything.
You know, this has to be use case driven, right? Probably everyone out there has a use case where they need to get some real-time insight into something, and that's a really good place to start. So if you want to find out more about Striim, go to the striim.com website. You can contact us via email or the support form there, tweet us, or check out our Facebook and LinkedIn pages as well. And with that I will open it up for any questions.
Thanks so much, Steve. I'd like to remind everyone to submit your questions via the Q&A panel on the right-hand side of your screen. While we're waiting, I'll mention that we will be sending out a follow-up email with a link to this recording within the next few days. Now let's turn to our questions. Our first question is: what's the difference between streaming first and an event-driven architecture?
Okay, that’s a great question. A venture [inaudible] has been talked about for a long time. Um, I know they were kind of like gotten a category all the way back in like 2002, 2003. The emphasis there. It was on the data movement, it was on the enterprise data bus of that move things around. And so that was where you put your events. And so the whole kind of SOA, event driven architecture, um, that bus was the crucial thing. Um, that technology has matured. Now we have had messaged presses around for a long time. We have new ones coming up, have come out recently like cafca that really caught everyone’s imagination. I’ll technology kind of is found. The importancy we talking about with streaming first is the data collection piece that you want to put old as much of your data as you can into data streams so that you can get to real time insights.
If you want to, you could just take those data streams that you've collected in real time and deliver them out, say into your data warehouse or a cloud database, where they just become storage, if that's what you want to do with the data. But as long as you're collecting the data in real time, recognize that you're not going to get real-time insights after you put it in a database or in Hadoop; there's always going to be some latency involved in reading it back from storage. As you start to identify applications where you do need real-time insights, you can move them into acting in memory, straight on the data streams. So what we really mean by streaming first is: you are employing an event-driven architecture, but you're focusing on ensuring that you at least do the data collection first.
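The "collect first, add insight later" pattern can be sketched as a simple fan-out: one collected stream feeds storage delivery from day one, and in-memory analysis can later subscribe to the same stream without re-plumbing the sources. All names here are illustrative, not Striim APIs.

```python
class Stream:
    """A minimal publish/subscribe stream: collect once, fan out to many consumers."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, fn):
        self.subscribers.append(fn)

    def publish(self, event):
        for fn in self.subscribers:
            fn(event)

archive = []  # stands in for warehouse / Hadoop delivery
alerts = []   # stands in for a real-time insight added later

events = Stream()
# Day one: just deliver everything collected into storage.
events.subscribe(archive.append)
# Later: add an in-memory insight on the very same stream, no source changes.
events.subscribe(lambda e: alerts.append(e) if e["amount"] > 1000 else None)

events.publish({"amount": 50})
events.publish({"amount": 5000})  # archived AND alerted on
```

The point of the sketch is the ordering: because collection happened first, adding the real-time consumer is a new subscription, not a new integration.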
That’s great. Thanks so much Steve. Um, the second question is how do you work with machine learning? So there are a couple of different ways in which we, we’ve worked with machine learning. I can do this kind of by way of example, the we, we actually have a, a customer who is integrating machine learning is part of the overall data flow and use of Striim. And the first piece is essential is being able to prepared data and continuously feed some storage that you go into. Do machine learning on. So machine learning requires data and it requires a lot of data but also needs that data to be refreshed so that if you’d need to rebuild the model, if you started to identify the model’s no longer working, then you need to be able to have up to the minute data. And so with Striim you can collect the data, you can maybe uh, join it, it um, and Richard, um, you can pivot it so that you can end up with uh, a multivariable a structure they suitable for training, machine learning before you write that ad to files to do, um, database.
Then, outside of Striim, you can use your usual machine learning software. For this customer, the data they were collecting was web activity, VPN activity, and other data around users. They pushed all of this into a file store and then used H2O, their choice of machine learning software, to build a model. The model captured user behavior: what do users normally do, what is the usual pattern of activity for each one of our users? When do they log in, what are they accessing, what applications are they using, and in what order? They built this machine learning model, which expressed all of that, then exported the model as a jar file and incorporated it into a data flow fed straight from the raw data in our platform. So the raw data was going through processing into the store, but we were also taking that raw data in memory, in a streaming fashion, pushing it through the model, checking whether it matched the model, and alerting on any anomalous behavior. So the two places where we really work with machine learning are delivering data into the storage that the machine learning trains on, and then, once it's trained, taking the model and doing real-time inference, or scoring, in order to detect anomalies, make predictions, et cetera.
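The workflow described above, train offline and score in flight, can be illustrated with a toy stand-in for the customer's H2O model. Here the "model" is just a per-user profile of the applications each user normally touches, and scoring flags any in-flight event that falls outside that profile; a real deployment would export a trained model and invoke it in the stream instead.

```python
from collections import defaultdict

def train(history):
    """Offline step: build a per-user profile of the apps they normally use."""
    profile = defaultdict(set)
    for user, app in history:
        profile[user].add(app)
    return profile

def score(profile, event):
    """In-flight step: flag activity outside the learned profile as anomalous."""
    user, app = event
    return app not in profile.get(user, set())  # True means raise an alert

# Offline training on historical activity, then streaming-side scoring.
model = train([("alice", "mail"), ("alice", "crm"), ("bob", "mail")])
score(model, ("alice", "crm"))  # within alice's normal behavior -> no alert
score(model, ("bob", "crm"))    # bob has never used crm -> anomalous, alert
```

The refresh point from the passage maps to simply re-running `train` on up-to-the-minute history whenever the deployed profile stops matching reality.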
That’s a great example. Thanks. Um, we, I think we have time for just one more question. Uh, this one is a confluent just announced SQL on Kafka. How does that compare to what you do?
Okay. Well, SQL on data streams is a great idea, and obviously we've been doing that almost since the inception of the company.
I like it, but it's brand new, right? So it's not going to be mature for a while, and it's only looking at a small part of everything you need in order to actually do the things I showed you today. Being able to run SQL against a stream with a window is one thing, but there are other capabilities we need to incorporate. For example, we can incorporate data in a distributed data grid, so caches of information; data and results storage; feeding results back into processing; and a whole bunch of other things. The primary thing I see is that it's focusing on interactive ad hoc queries against streams, and that's good, being able to just see what's going on in a stream and analyze it, but the power of our platform is combining your sources,
queries, targets, caches, and results into a data flow that becomes an application that you can deploy as a whole. And so it's going to take a while to catch up on all of the things we spent the last five years working out: security, with a role-based security model for these types of queries; how you integrate them into a whole application; how you deploy the application across a cluster; all of those things that are essential for the mission-critical applications that we support for our customers utilizing SQL. So, being an end-to-end platform, we can do all of that and combine all the pieces together; the KSQL that was announced earlier this week may be useful, but combining all those pieces with it might be harder.
Great, thanks so much, Steve. I regret that we're out of time. If we did not get to your specific question, we will follow up with you directly within the next few hours. On behalf of Steve Wilkes and the Striim team, I'd like to thank you again for joining us for today's discussion. Have a great rest of your day.