Streaming-First Strategies for Managing the Oncoming Tsunami of Data
In this webinar, join Alex Woodie, Editor-in-Chief of Datanami, and Steve Wilkes, founder and CTO of Striim, for an in-depth look at some strategies that companies can begin to implement today to address these needs. Learn ways to start the migration toward a data architecture capable of handling the oncoming tsunami of data through a “streaming-first” approach. Topics include data processing and integration at the edge; migrating existing systems and processes toward a streaming architecture; and strategies for avoiding data storage altogether, without losing information or intelligence.
To learn more about the Striim platform, visit the platform overview page.
Welcome and thank you for joining us for today’s webinar. My name is Katherine and I will be serving as your moderator. The presentation today is entitled Strategies for Managing the Oncoming Tsunami of Data. We all know that over the next several years, data volumes will skyrocket. What has not been made clear until recently, though, is that there simply won’t be enough storage on earth to store it all. Over the next 55 minutes, we will discuss strategies for addressing this data deluge by leveraging a streaming data architecture. We are honored to have as our first speaker today Alex Woodie, managing editor of Datanami. Alex has covered the high tech and IT industry as a technology journalist for more than a decade, focusing on emerging trends in systems, storage, software, business intelligence, cloud and mobility.
Joining Alex is Steve Wilkes, co-founder and CTO of Striim. Steve has served in several technology leadership roles prior to founding Striim, including heading up the advanced technology group at GoldenGate Software and leading Oracle’s cloud data integration strategy. Throughout the event, please feel free to submit your questions in the Q and A panel located on the left hand side of your screen. Alex and Steve will be answering questions throughout the presentation and we’ll address as many as possible after the presentation wraps up. With that, it is my pleasure to introduce Alex Woodie.
Thank you for that introduction, Katherine. We’re all tired of the term big data, but it accurately describes the problem as well as the technology solutions that we’ve created to deal with it. Striim asked me to provide some background on the history of where we are today, and that’s what I intend to do. So it all started about 25 years ago, in the mid nineties, with the growth of the Internet. It was clear at the time that it would eventually have a major effect on our lives. We didn’t know exactly how it would all transpire, but we knew that it would be big. Looking back on the past 25 years, a pattern has emerged. As more people built things on the Internet, it drove a need for bigger and better technology: bigger disks to hold all the data, better processors to process the data, and faster networks to move all the data.
By 2014, that’s where we got to. In retrospect, this time presented a great flowering of technology, and it was, in my view, the origin of today’s big data tools. As the e-commerce engine started to move ahead, it exposed gaps in the technology of the day. The web giants of Silicon Valley saw it first, and it all started with an extension to Doug Cutting’s flagship search engine. Doug Cutting and his colleague Mike Cafarella set out to build an automated web crawler to index the Internet to improve search results. The resulting product, called Nutch, could crawl a hundred pages a second. That was lightning fast at the time, but because it was limited to running on a single machine with about a terabyte of disk and a hundred gigs of RAM, which was pretty beefy for the time, it had a hard limit of about a hundred million web pages. It soon became clear that that wouldn’t be nearly enough.
So Cutting and Cafarella decided to parallelize it. They managed to expand Nutch to four nodes before the system became too complex to handle, and it still wasn’t enough to handle the expected growth of the web. The developers weren’t sure how to proceed until they finally stumbled across an obscure paper written by Google that described the Google File System. Here they found the blueprints for solving the same problem that they were dealing with. In 2004, using the paper as their guide, Cutting and Cafarella developed a Java version of the Google File System, which they called the Nutch Distributed File System. Guided by another Google paper, which described a system for parallel processing called MapReduce, Cutting and Cafarella created the first processing engine to work with the new file system, which Cutting would soon rename Hadoop. When Yahoo caught wind of the developers’ work in 2006, it hired Cutting to help it compete against the upstart Google.
Yahoo finally went live with Hadoop in 2007. Hadoop’s early days were a mix of successes and failures. The new system could scale like nothing before it, but it was a different beast entirely and it required constant attention. While there were other distributed computing frameworks in active development at the time, Hadoop gained the lion’s share of the attention from prospective users. It also had a liberal license at the Apache Software Foundation, which probably helped it spread. By late 2007, most of the web giants in Silicon Valley had heard of Yahoo’s success with Hadoop and they were starting to use it: Facebook, LinkedIn, Twitter. They all innovated atop Hadoop and developed products that would soon become top-level Apache projects of their own, including Hive, HBase and Storm. Facebook also developed Cassandra, a document-oriented database based on Dynamo, a key-value database created by Amazon.com in 2004 to handle holiday shopping traffic.
The development of NoSQL databases has largely paralleled the rise of Hadoop. It flourished in the ensuing years. Cloudera was founded in 2008, and with it, the big data concept of bringing the compute to the data migrated out of Silicon Valley and into enterprises around the world. Cutting would join Cloudera a year later to help guide the development of the fledgling computer system that was so green and yet had so much potential. The Hadoop dream of providing a central data store for a variety of engines was a tantalizing one. Since the dawn of time, we’ve been struggling with the need to move data to the computing resources. Storage servers remained separate from processing servers, and never the two would meet, except perhaps over expensive 10 gig networks or, if you happened to work in the HPC field, InfiniBand networks. Hadoop appeared on the scene like a silver bullet, prepped to tackle every data storage and processing task we could throw at it.
Its appearance in 2007 occurred just in the nick of time to handle the huge uptick in unstructured data driven by the mobile web, which occurred just after Apple introduced the iPhone in 2007, and the surging success of social media sites like Facebook, Twitter, and LinkedIn. Or perhaps its presence enabled this explosion of video, chat logs, media files and cat pictures. It’s really a chicken and egg problem, and nobody will ever know if the explosion of data and consumer-based information sharing would ever have happened had we not suddenly been graced with the ideal combination of cheap storage and compute embodied by Hadoop. In any event, while the web giants were the biggest early buyers of Hadoop-style computing and the biggest contributors to the growing Hadoop stack, they were soon followed by their enterprise brethren. Banks, retailers, travel companies, insurance companies, and manufacturers soon wanted to process data like the web giants did.
After all, why should Google and Facebook get the consumer data and therefore drive all the transactions? The idea that we’re all software companies now soon took hold, and the big data explosion continued to roll. Armed with nearly unlimited data storage, measuring into the petabytes, and data scientists trained on machine learning techniques, companies were determined to find hidden insights in the form of unexpected correlations or anomalies that they could turn into business advantage. Harvard Business Review declared data scientist the sexiest job of the 21st century. Hadoop, for better or worse, was basically synonymous with big data. If you were “doing big data,” then you were probably using it. With its novel schema-on-read approach, Hadoop was hailed as a new data warehouse that didn’t penalize you for ingesting and processing huge sums of semi-structured and unstructured data that would never fit into a relational data warehouse like the ones from old-guard companies like Teradata, Oracle, Microsoft, IBM, and HP.
However, chinks started to appear in the Hadoop armor. Although the platform could ingest huge sums of unstructured data like nothing else and survive failures of multiple nodes, the design of HDFS did not support real-time or near-real-time workloads. It was, for better or for worse, a batch-only paradigm. Apache Hive and other SQL-based warehouses for Hadoop introduced interactive SQL processing into the ball game, but that didn’t eliminate all performance concerns in the new data lakes, and early interactive processing simply didn’t work for certain types of workloads. Fraud detection on credit card transactions, for example, requires sub-second responses to queries. Surfacing a product recommendation on an e-commerce site also requires an answer within a certain amount of time. Early versions of Hive, with their MapReduce batch dependency, were ill suited for these types of workloads.
The demand for fast data analysis spurred the creation of a new class of data processing engine separate from Hadoop. The pack was led by Apache Storm, which Twitter developed to analyze huge numbers of tweets in something closer to real time. LinkedIn had its own real-time processing engine called Samza, which it developed alongside its real-time data pipeline called Kafka. Yahoo developed its own projects, called SAMOA and S4. The parallel development of these real-time big data engines created a problem, of course. If Hadoop was the single version of the truth, as it was advertised to be, then how do you manage the separate streaming platform? The problem gave rise to a new architecture by Nathan Marz, the creator of Storm, called lambda. Put simply, the lambda architecture splits all data and feeds it simultaneously into two separate systems: one that flows into the real-time streaming system for real-time decision making, and another that flows into Hadoop for end-of-day batch processing, to account for late-arriving files and any errors that cropped up in the real-time system. Needless to say, stitching together two separate systems based on different frameworks that use different programming paradigms increased the complexity level immensely. Despite that fact, lambda was seen as the only way to satisfy the competing requirements of processing data in a way that is simultaneously fast, thorough and efficient.
Any history of big data storage would be remiss if NoSQL databases weren’t mentioned. NoSQL databases have some of the same characteristics as Hadoop, in that both give developers flexible schemas and query mechanisms other than SQL, while helping administrators by being distributed and fault-tolerant and running on cheap clusters of x86 computers. NoSQL databases, however, provide functions above and beyond what you get in a flat-file storage system, which Hadoop mostly is. NoSQL databases are mainly used as operational data stores for structured and semi-structured data like JSON, as opposed to the data lake for semi-structured and unstructured data, which is Hadoop’s biggest use case.
Just as we’ve seen a proliferation of engines that plug into Hadoop, we’ve seen the rise of specialized NoSQL databases designed to handle specific tasks. We’ve seen key-value stores like Memcached and Redis excel at serving read-heavy workloads such as travel websites, while document stores like MongoDB and Couchbase excel at serving as back ends to many of the world’s most popular web and mobile apps. Wide-column stores like Cassandra dominate the most intensive scale-out use cases, while graph databases like Neo4j and TitanDB provide an entirely new twist on data processing through degrees of connectedness. At the same time, other NoSQL databases such as Splunk, Elasticsearch and MarkLogic serve even more specialized use cases. And in addition to the Hadoop stack, the stream processing systems and the NoSQL databases, you have object storage to contend with as well. An object store treats every piece of data as an object, as opposed to a file system like Hadoop or a block storage method like SAN storage.
It is sometimes considered the simplest and most scalable data storage method. Each record is given an identifier, which is stored in a metadata store, while the object itself is stored in the cluster. Amazon’s S3 today is the most dominant object storage system by far. The S3 API is the standard adopted by other storage systems as well. Together, the NoSQL databases, the Hadoop stack, the stream processing systems and object stores seemed poised to upend the status quo in the $2 trillion IT industry. While NoSQL databases process transactions, Hadoop provides the insights through machine learning workloads. Stream processors deliver the instantaneous insights, while huge data lakes filled with videos and pictures would be efficiently stored in object stores. With all the big data pieces in place, it was just a matter of putting them all together. And yet the innovation didn’t end. As the big datasets continued to grow, so did the number of big data projects designed to help process it all.
The hype surrounding Hadoop peaked around 2014, which incidentally is the same year that Apache Spark emerged from incubator status to replace large parts of the Hadoop stack. The death of MapReduce as the primary engine of Hadoop was relatively swift, and today Spark is the de facto standard processing engine for a range of big data workloads, including stream processing and machine learning. Spark’s graph and SQL processing capabilities are also quickly maturing. And yet innovation still hasn’t slowed. Today we’re seeing the rise of other frameworks like Apache Flink and Apache Beam that advocate a stream-first approach to big data. Instead of running separate Hadoop infrastructure, architectures like Flink and Beam propose that we process all data, even batch-oriented data, as if it were real time. This approach has a number of benefits, including the elimination of the lambda architecture and the simplification of the stack. And considering the huge amount of data that the Internet of Things is predicted to generate in the coming years, it may be the only way to keep our heads above water. As much as people seem to hate the term big data, the term still has legs. And why? Because the term aptly describes the core of the problem we face, because managing the data, including storing it, accessing it, governing it, cleansing it, securing it, and ultimately turning it into useful information, is a big data problem.
Consider the growth of data that we’ve experienced up to this point. In 2003 the world generated on the order of five exabytes of data. By 2006, two years after Facebook ignited the social media revolution, data generation exploded to 161 exabytes, according to IDC. By 2010, three years after the first iPhone reached consumers’ hands, people and their devices cracked the zettabyte barrier for the first time. The exponential growth continued in 2014, when we created 4.4 zettabytes of data. That was about as many bits generated as there are stars in the physical universe. According to IDC, by 2016 the world was generating 16 zettabytes of data per year. Those are huge numbers to be sure, but here’s the rub: most of that data is never stored. According to IDC, our ability to manufacture storage capacity trails far behind our ability to generate data.
The key number here is 15%. That’s the fraction of data that we generate that ultimately gets written to disk or tape or flash drive or optical or cloud or any other permanent storage mechanism. Most of the data we have created up to this point has been ephemeral: the bits that make up our telephone calls, the TV and radio signals that are broadcast and never written down, the HTTP requests that fetch data into our apps and browsers that we eventually close and forget about. In many ways, our world has always resembled Snapchat. Data is fleeting, appearing and disappearing in our lives to suit our whims and needs. The Snapchat-like trend will continue with the Internet of Things. The IoT is widely expected to supercharge our data generation capability to a mind-boggling 160 zettabytes per year by 2025, according to IDC’s most recent Data Age report. Storage manufacturers are doing their best to keep up with the huge growth. Hard drive makers are building bigger and better hard drives, including five terabyte drives today and 30 terabyte drives on the roadmap, while tape manufacturers are shipping six terabyte LTO cartridges today with 48 terabyte cartridges on the roadmap.
Despite the innovation spanning disk, solid state, tape formats and cloud storage, IDC predicts our storage capacity will hold steady at roughly 15% of the data we generate. That number of 15% appears to be some sort of magic number, reflecting in a way the ratio between the raw data of questionable value and the actionable information that we’re willing to invest in.
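To put the 15% figure in concrete terms, here is a small back-of-the-envelope calculation. It uses only the numbers quoted in the talk (16 zettabytes per year in 2016, 160 zettabytes per year projected for 2025, roughly 15% stored); the function and variable names are illustrative, and the result is only as good as those quoted estimates.

```python
# Figures as quoted in the talk (attributed there to IDC); treated as given.
STORED_FRACTION = 0.15                  # share of generated data that is stored
GENERATED_ZB = {2016: 16, 2025: 160}    # zettabytes generated per year


def split_stored(generated_zb, stored_fraction=STORED_FRACTION):
    """Return (stored, never_stored) in zettabytes for one year's output."""
    stored = generated_zb * stored_fraction
    return stored, generated_zb - stored


results = {year: split_stored(zb) for year, zb in GENERATED_ZB.items()}

for year, (stored, never_stored) in results.items():
    print(f"{year}: about {stored:.1f} ZB stored, "
          f"about {never_stored:.1f} ZB never written to permanent storage")
```

Under these assumptions, the amount of data that is never stored in 2025 is far larger than everything generated in 2016, which is the gap the rest of the talk is concerned with.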
The question then becomes how best to extract that 15% kernel of value from the other 85% of chaff. Will we be writing 100% of it temporarily to disk, with the hopes of winnowing it down to the 15% that we value through some sort of MapReduce method? That approach seems unlikely. While it had a shot of working in 2004, when Google researchers Jeffrey Dean and Sanjay Ghemawat published the seminal MapReduce paper, there’s almost no chance of it working with the large amount of data today. While Google was the source of inspiration for both core components of early Hadoop, the Google File System that Cutting would implement in Java as HDFS and the original MapReduce method, it has long since moved beyond that form of computing. So what did the mighty Google replace MapReduce with? Stream processing, of course. In 2014, the company revealed Cloud Dataflow, a new big data software development and execution paradigm designed to enable developers to build data flows, or pipelines, that process exabytes’ worth of data.
At this point, I think I would like to ask our first polling question. We would like to know what the state of stream processing is in your organization. Do you have any plans to implement it? Do you have plans to implement it within a year, within one to two years, or beyond two years? Do you have no plans to implement it, or have you implemented it already? We’re going to wait just a little bit here while everybody takes the poll and we can collect the results. All right, well, it looks like about 15% of you have already implemented stream processing. About 15 and a half percent plan to implement within a year, about 23 percent within two years. Nobody has longer-term implementation plans, and roughly half of you have no current plans. One of the core tenets of Google Dataflow, which has turned into Apache Beam, is the unification of batch and real-time data processing.
While Dataflow ostensibly enables developers to process data as soon as it comes off the wire, usually Apache Kafka or Amazon Kinesis, Google realized that the same technique can also be used for batch processing, thereby eliminating the need to build and maintain separate systems, as with the lambda architecture. With Apache Beam, developers can write batch and streaming applications using a single API. What’s more, through the notion of runners, developers can access other execution frameworks from within the Beam API, including Flink, Spark and Apex, as well as the Cloud Dataflow engine living on the Google cloud.
This brings us up to the present. The state of big data is still in flux and evolving at a tremendous pace. The data generated by the IoT is exploding. Instead of petabytes of data, which we used to think was big data, individual companies are now talking about storing more than an exabyte of data, while worldwide storage of data is measured in the zettabytes. To keep up with this massive generation, we’re moving beyond the batch-based methods embodied by Hadoop and its file system. The future is clearly focused squarely on real-time data processing methods at the edge, upon mobile devices, and using new hardware form factors, because it’s really the only chance that we have to keep our heads above the digital tsunami. That completes my presentation and I’m going to hand it on over to Steve.
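The idea of writing one pipeline and running it over both batch and streaming input can be illustrated with a minimal sketch. This is plain Python using generators, not the Apache Beam API; the record format, the function names and the `live_feed` stand-in for a message source are all assumptions made for illustration.

```python
from typing import Iterable, Iterator


def parse(lines: Iterable[str]) -> Iterator[dict]:
    """Turn 'user,amount' text records into dictionaries, one at a time."""
    for line in lines:
        user, amount = line.split(",")
        yield {"user": user, "amount": float(amount)}


def large_purchases(events: Iterable[dict], threshold: float = 100.0) -> Iterator[dict]:
    """Keep only events at or above the threshold."""
    for event in events:
        if event["amount"] >= threshold:
            yield event


def pipeline(source: Iterable[str]) -> Iterator[dict]:
    """One pipeline definition, independent of whether the source is bounded."""
    return large_purchases(parse(source))


# Batch: a bounded collection that already exists in full.
batch = ["ann,250.0", "bob,12.5", "cy,100.0"]
batch_result = [e["user"] for e in pipeline(batch)]


# Streaming: records arrive one at a time (a stand-in for a message queue).
def live_feed() -> Iterator[str]:
    yield "dee,5.0"
    yield "eve,999.0"


stream_result = [e["user"] for e in pipeline(live_feed())]

print(batch_result, stream_result)
```

The point of the sketch is only that `pipeline` never needs to know which kind of source it was given. A real framework adds what this leaves out, such as distribution, windowing and fault tolerance.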
Thank you, Alex. So I just want to reiterate what Alex was saying and show it to you in graphical form, so it really hits home, and hopefully those of you who aren’t thinking about stream processing will be convinced that you may need to. So today we’re at around 16 zettabytes of data annually, as Alex mentioned, and by 2025 IDC is estimating 160 zettabytes of data. Just to put that into perspective, in an exponential graph, that means that in every two-year period in this graph, more data is generated during those two years than was generated in the entirety of mankind’s life on earth up until that point. So every two years represents more data than was ever generated before, which to me is just quite amazing. Right now IDC is saying around 5% of it needs to be dealt with in real time.
By 2025, around 25% of that data will need to be real time, and of that real-time data, 95% will be generated by IoT. That’s a massive increase. You can see the exponential curve. It wasn’t really starting to hit for the real-time data until now, and by 2025 we’re going to be overwhelmed with the amount of data being generated. As Alex mentioned, only a small percentage of that data can be stored. Now, if only a small percentage of that data can be stored, then you have no choice. The only logical conclusion is that you have to process and analyze data in memory, in a streaming fashion, in order to deal with the huge amount of data being generated. Now, it’s not just IoT data that is streaming. As was mentioned, there is this trend to move towards a streaming-first architecture, and if you think about it, it’s quite natural, and batch is actually quite artificial.
Everything that happens within the enterprise happens because of some event, because something happens. It could be a user typing something into a form that puts stuff into a database. It could be a web application that’s writing to a database. It could be web applications that generate web logs. It could be machines doing stuff that generates logs. It could be devices that are sending things as events as those things happen. But all of it is happening one event at a time. Log files aren’t created a whole file at a time. They don’t wait until everything’s done and then write everything to a log. Databases don’t always work on huge numbers of rows at a time. They typically work row by row, insert by insert, update by update. So databases can be streaming through change data capture. Logs can also be streaming, by reading at the end of the log and taking things as they are written.
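Reading at the end of a log and taking things as they are written can be sketched in a few lines. This is a minimal polling example that assumes a UTF-8, newline-delimited log file; the function name and the demo file are hypothetical, and a production reader would also have to handle log rotation and truncation.

```python
import os
import tempfile


def read_new_lines(path, offset):
    """Return the complete lines appended since `offset`, plus the new offset.

    A trailing partial line (no newline yet) is left for the next call.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    end = data.rfind(b"\n") + 1          # 0 if no complete line has arrived
    lines = data[:end].decode("utf-8").splitlines()
    return lines, offset + end


# Demo: simulate an application appending to a log between two polls.
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "app.log")
    with open(log_path, "wb") as f:
        f.write(b"start\n")
    first, position = read_new_lines(log_path, 0)

    with open(log_path, "ab") as f:
        f.write(b"login ok\npartial")    # second entry is not finished yet
    second, position = read_new_lines(log_path, position)

print(first, second, position)
```

Calling `read_new_lines` in a loop with a short sleep turns the file into a stream of events, each one delivered shortly after it is written.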
And devices obviously can also be streaming. So hopefully that helps convince you that stream processing is a major infrastructure requirement, and it can also be the precursor to everything else you’re doing. And there’s only one real limitation on taking a streaming architecture and applying batch concepts within it, because some things have to be batch, right? Your end-of-day report needs to be done at the end of the day. The only real limitation on that is memory. The more RAM you have, the more data you can contain in a streaming fashion, and the larger the in-memory batches that you’re doing through stream processing can be. But if you have enough memory, there’s no real requirement to store all of this stuff on disk. So for some of the things that you might be thinking about doing, we’ll talk about how you can think about them in a streaming fashion, right?
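The “in-memory batch” idea above corresponds to what stream processing systems usually call a window. Below is a minimal sketch of a tumbling (fixed-size, non-overlapping) window that sums values as events arrive. It assumes events come in timestamp order as `(timestamp_seconds, value)` pairs; the names are illustrative and not taken from any particular product.

```python
def tumbling_window_sums(events, window_seconds):
    """Yield (window_start, total) each time a fixed-size window closes.

    Only the running total for the current window is held in memory.
    """
    current_start = None
    total = 0.0
    for timestamp, value in events:
        start = timestamp - (timestamp % window_seconds)
        if current_start is None:
            current_start = start
        elif start != current_start:
            # The event belongs to a later window, so emit the finished one.
            yield current_start, total
            current_start, total = start, 0.0
        total += value
    if current_start is not None:
        yield current_start, total       # flush the last open window


events = [(0, 1.0), (30, 2.0), (61, 5.0), (119, 1.0), (120, 4.0)]
window_totals = list(tumbling_window_sums(events, window_seconds=60))
print(window_totals)
```

An end-of-day report is the same pattern with a 24-hour window. The memory cost depends on what is kept per window: a running aggregate, as here, stays small, while retaining every raw event in the window grows with the window length.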
If you’re creating a data lake, the first thing you have to think about is what the end goal is. What is the point of creating this data lake? What’s the information that’s going in it? Alex has reported, and other people have reported, on some of the failures that have happened with Hadoop, and some of the approaches that have been taken that were just completely wrong. And the overall completely wrong approach is to throw everything in there in a raw fashion and hope for the best, hope you’re going to get some value out of it later. So thinking about your end goal is the first thing. And the second thing is, how does it scale? Hadoop queries are slow. If you compare Hadoop queries with a well-architected data warehouse, the queries are incredibly slow.
In order to speed that up, you need to think about what the data looks like when you put it in Hadoop. Typically, raw data isn’t the best thing to put into Hadoop from a querying perspective. You need to be able to preprocess and enrich that data, and denormalize it, in database terminology, so that when you query it, you can get your query results back fast. You also have to think about whether you need all of the raw data. Is it the best form? Should you instead be computing aggregates and writing aggregates into Hadoop, which also facilitates fast queries? So you think, is the raw data actually useful? And then, how does it scale? How do you scale feeding it, putting things into Hadoop? You need to think about the overall architecture.
The use of big data really does need to be considered. And how do you scale storing it? Is this something that you want to do on premise, or do you want to do it in the cloud? It turns out that AWS is currently the top Hadoop distribution, and it’s in the cloud. So that’s also something to consider. The other thing you might be considering is providing streaming data as a service. Organizations are investing in Kafka, for example. If you’re investing in Kafka, again you ask yourself the question, what is the end goal? Because Kafka is just a message queue, and message queues have been around for a long time. MQSeries has been around for decades. So if you’re putting stuff onto a message queue, why are you doing it? Do you want to do real-time analytics? Is it something that you’re using to feed your data lake? Are you expecting people to do self-service analytics on the stuff that you’ve written into Kafka?
And again, you ask yourself the question, is the raw data itself useful? If you’re putting data onto Kafka, imagine you’re doing change data capture from a database and you’re feeding that raw data into Kafka. If you have a nicely normalized database, then the majority of your data is going to consist of IDs. Say your customer order detail table changed because someone added a row to it. The change is going to contain your customer ID, order ID, item ID, some timestamps, something else. All of those IDs are not going to be useful for someone doing analytics on a message queue, or even when delivered into a data lake. So you need to do enrichment of that data before it even makes sense. And then how do you integrate? If Kafka is just a message queue, how do you get data in and out of it? How do you integrate it with your existing investments like databases, and how do you get value out of it?
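The enrichment step described above, where a change event made mostly of IDs is joined with reference data held in memory, can be sketched as follows. The event layout, the field names and the lookup tables are all hypothetical; a real change data capture record would have whatever shape the source database and capture tool produce.

```python
# Hypothetical reference data, loaded into memory ahead of time.
CUSTOMERS = {17: {"name": "Acme Corp", "region": "EMEA"}}
ITEMS = {501: {"description": "Widget", "unit_price": 9.99}}


def enrich(change_event, customers=CUSTOMERS, items=ITEMS):
    """Return a copy of the event with readable fields joined in by ID.

    Unknown IDs produce None rather than an error, so the stream keeps flowing.
    """
    customer = customers.get(change_event["customer_id"], {})
    item = items.get(change_event["item_id"], {})
    return {
        **change_event,
        "customer_name": customer.get("name"),
        "region": customer.get("region"),
        "item_description": item.get("description"),
    }


# A raw change event from a normalized order detail table: mostly IDs.
raw_event = {
    "op": "INSERT",
    "order_id": 9001,
    "customer_id": 17,
    "item_id": 501,
    "ts": "2017-06-01T12:00:00Z",
}

enriched_event = enrich(raw_event)
print(enriched_event)
```

The enriched record is denormalized, so a downstream consumer of the queue or the data lake can use it without going back to the source database for lookups.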
How do you perform the in-memory analytics and processing, and even create dashboards from it, to actually give value to your end audience? So we recommend that if you try to do any of these things, you don’t boil the ocean. You don’t try to do everything in one go. You don’t throw all of your raw data into Hadoop. You don’t throw all of your raw data onto a message queue. You literally take things one stream at a time and do this as part of your overall business goal. You identify business use cases that make sense for these types of architectures. And the streaming piece is part of your overall data architecture. It doesn’t immediately replace everything that you have; it coexists. You probably already have an operational data store and an enterprise data warehouse. You probably already have ETL jobs that run in batch mode that are taking things from databases and/or files and putting those into a data warehouse, putting those into Hadoop, right?
So this streaming architecture, yes, eventually it can supplant a lot of those things, but that’s not something you want to do in one go, because these things are already working and they’re already part of your enterprise, already part of your decision-making process. So streaming integration can be applied one stream at a time. You can connect it into your databases using change data capture. You can read log files in real time. You can also access other machine data, message queues, sensors, etc., and do all of this in a streaming architecture. And those can drive real-time applications. Those can allow you to do real-time analytics. They can also feed data to machine learning, so the machine learning can learn, and then they can use machine learning models and do real-time scoring as part of the streaming integration piece. They can also read and write data from data warehouses and Hadoop.
And also use data in data warehouses, databases and Hadoop for enrichment purposes. So load all the reference data that you may have lying around in these things into memory and make that part of your streaming architecture. And so this is kind of an enhanced version of lambda. Other people have talked about a kappa architecture, where you’re keeping time windows in memory that kind of replace batch. But if you are moving everything towards a streaming architecture, we may as well call it the Omega architecture, because you can’t go higher than that in the Greek alphabet. Striim is a complete end-to-end platform. I’ll just introduce what Striim does a little bit to give you some context around how some of these things can be done. We do both streaming integration and analytics, and all these things apply to Striim.
These are things you should be considering if you’re looking at moving towards streaming. So the first thing is you need to do continuous data collection, and not just from IoT devices but also from message queues you may have. That could be Kafka, it could be Flume, it could be MQSeries, it could be JMS. You probably already have investments in those things. Log files, as I mentioned, can also be read in a streaming fashion. Sensors typically are inherently streaming. They’re sending data continually. But databases can also be streaming through change data capture. Stream processing is an essential part of this architecture. You need to be able to do in-memory filtering, transformation and aggregation of data. And as I mentioned, enrichment is key. You need to be able to load context and reference data into memory and join that with the streaming data in real time.
In order to try to get value out of your streaming data, you need to do stream analytics and anomaly detection, what used to be called complex event processing: looking for patterns, sequences of events over time that might indicate something interesting is going on. Do statistical analysis and compare real-time data with statistics, integrate machine learning and do that in real time, and be able to visualize all of this and create dashboards around it to actually get value out of it. And then, of course, you need to be able to deliver the results of this processing. That could be putting aggregated data or enriched data into Hadoop or into Kafka, or delivering into the cloud, putting it into S3 or Redshift or Azure SQL. So you need to think about all of these pieces if you’re embarking on a streaming architecture, and you need to think about how all these pieces fit together and how they make sense.
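Looking for sequences of events over time can be as simple as counting matching events per key inside a sliding time window. The sketch below flags a key, for example a card number, when it produces a given number of failures within a given number of seconds. The event shape `(timestamp, key, kind)`, the thresholds and the names are assumptions for illustration, and events are assumed to arrive in time order.

```python
from collections import defaultdict, deque


def detect_bursts(events, threshold=3, window_seconds=60):
    """Yield (timestamp, key) when `threshold` failures for one key
    fall within `window_seconds` of each other."""
    recent = defaultdict(deque)          # key -> timestamps of recent failures
    for timestamp, key, kind in events:
        if kind != "fail":
            continue
        failures = recent[key]
        failures.append(timestamp)
        # Drop failures that have slid out of the time window.
        while failures and timestamp - failures[0] > window_seconds:
            failures.popleft()
        if len(failures) >= threshold:
            yield timestamp, key
            failures.clear()             # start counting afresh after an alert


events = [
    (0, "card-1", "fail"),
    (10, "card-2", "fail"),
    (20, "card-1", "fail"),
    (25, "card-1", "ok"),
    (50, "card-1", "fail"),              # third failure within 60 seconds
    (200, "card-2", "fail"),
    (210, "card-2", "fail"),             # only two recent failures for card-2
]
alerts = list(detect_bursts(events))
print(alerts)
```

Because only the recent timestamps per key are kept, the pattern is evaluated entirely in memory as events arrive, without writing the raw events to disk first.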
There are different categories of business use case that you can solve using a streaming architecture. And we break those down into real-time data integration, detecting patterns, and monitoring and building metrics and KPIs. Here are just some examples. You could be feeding your data lake in real time, or feeding your Kafka or other message queue infrastructure in real time. You might be migrating data into the cloud or setting up a real-time hybrid cloud where data on premise is mirrored in the cloud for reporting or scalability purposes. Or you might be looking at IoT edge processing: doing preprocessing of data, doing aggregation, redundancy removal, extracting the signal from the noise, turning data into information and doing that at the edge. On pattern detection and anomalies, there’s a lot of things you could be doing, whether it’s fraud detection or active cyber security monitoring, anti-money laundering, doing things based on locations, doing predictive maintenance in real time, and a lot of the IoT analytics.
And then, building on that, if you’re looking at building monitoring, metrics and KPIs, think about real-time call center or network quality monitoring, sales monitoring, or in general, if you’re providing APIs or other SLAs to your customers, monitoring those in real time and seeing if you or your customers are breaking them. And there’s a lot of things that you could be doing here as well. So I’m going to take a quick pause here and we’re going to ask the next survey question, which is around what type of use cases you would think of as being useful for streaming data architectures.
We’ll wait a short time for those survey results to come in. And I would encourage you to answer this. It’s interesting to see, and it’s going to help target some of the things we talk about later as we go through the rest of the presentation. So while those results are coming in: you hear a lot of talk about stream processing. You hear it coming from a lot of different vendors, you know, even Kafka vendors and people that focus on talking about stream processing. Well, there are a lot of different components to a stream processing architecture. Yes, a message queue like Kafka is an important aspect of that, but it’s only one tiny sliver of the entire architecture. You need to consider how you load large amounts of data into memory; that’s an in-memory data grid that you need in order to actually do that side of things.
You need to think about how you do your stream processing. How do you do your analytics? How do you integrate with machine learning? How do you visualize? Where do you store the results? Where do you push them? That is something you need to consider as well. And then of course, how do you collect the data, whether it’s real time from devices, message queues, log files, and also databases through change data capture, and how do you deliver that data somewhere, whether it’s in the cloud or on premise? So yes, Kafka is an important aspect of your stream processing, but there’s a lot of other pieces of technology that you’ll need as well. So we’ve got a lot of the results through now for what you are considering stream processing for and, interestingly but not unexpectedly, IoT and machine data seems to be the top. People are also looking at it for log intelligence, real-time monitoring of machine logs, recommendations and fraud detection, and then we have some others, which we would obviously be interested in drilling down into.
If you have one of those others, then you can email us and let us know what the others are. That’d be great. So let’s go back to the slides. I’m going to zoom through the next one and get on to the IoT portion. This is just a reiteration of the pieces that you need in a streaming architecture. And this is the Striim platform. One of the things that we do that’s key to success, I think, is we lower the bar drastically for people wanting to build streaming data flows. And we do that by doing all of the stream processing through SQL. So you can write continuous queries, continuous processing in memory, without having to learn how to code in JavaScript or Java or C# or any of the other programming languages that you may need.
And we’re also an end-to-end platform that contains all of these pieces, so you don’t have to choose and evaluate and work out which message infrastructure you’re going to use, which cache system you’re going to use, which storage you’re going to use, how you get data in and out of all of this, and then how you do the processing and analytics. So those are the key aspects of a streaming architecture, and this is how we put them together at Striim. We also add this overall consistent UI that allows you to build things very easily utilizing our platform, and also allows you to build dashboards to visualize your streaming data in real time. This is an example of a data flow. When you think about stream processing, it is not just a single query running on a stream. It is an entire pipeline, and those pipelines contain multiple data processing steps, and it may well be that streams are generated at each point.
Those intermediate streams are useful for other purposes as well. And typically we find that organizations that start with a source of streaming data for one particular application use that for a lot of other use cases as well. So not just the raw streams, but the intermediate streams that you are creating as part of this processing, can also be reused. And as I mentioned, we also allow you to build dashboards and visualizations. Let’s move on to IoT. IoT is crucial to a lot of industries already. It is going to be crucial to almost every industry going forward. That huge growth in IoT data doesn’t just affect one industry; it affects everyone. And it amazes me when you think of an industry as old and established as the insurance industry, so established that they used 500 years’ worth of data to determine where to build their headquarters, the safest place in the country.
That industry is adopting IoT to monitor your driving habits, and I’m sure they’ll use it to monitor other aspects of your life in order to provide the most tailored, most suitable policy going forward. And if it’s affecting insurance, it’s going to affect almost everyone. Agriculture, I think, will take a big advantage out of IoT. It’s essential to have a smart architecture, and a smart architecture means that you are collecting, processing, storing and analyzing the data where it makes sense. That could be within the device. It could be at the edge, so a collection of devices together, a small number of devices managed by an edge server. It could be on premise. And it could be in the cloud, and all of these pieces have to integrate together. The scale is obviously larger.
On this side you have a huge number of devices, but the depth and the view that you have into things increase as you move over to the right side, because you are aggregating and looking at the combined value of a large number of devices. That could involve correlating device data together, but most likely it’s going to involve correlating device data with other enterprise data and doing that in real time to get the best value. Whether it’s through enrichment of that device data by adding context to it, like you have a device ID and you want to know what that device is and where it comes from, so maybe you have to look that up in your asset database or integrate with the ERP system in real time. Or it could be that you are correlating it: you are looking for events that are happening concurrently with your devices and logs from other machines.
Maybe security logs or other logs that might indicate something interesting is happening. So you need this kind of generic, scalable IoT architecture that incorporates edge processing. And one of the things you have to remember with IoT is that a lot of the devices out there right now are not Internet enabled. You know, there’s a world of things, and a large number of the things still need connecting. So pretty often you need a physical gateway, a box that you have to plug the wires into in order to actually connect to these devices. A lot of manufacturing devices, a lot of things like air conditioning systems, hospital and healthcare medical monitoring devices, they need to be wired. So you have to ask how you get that data. And that’s where you need a protocol translation gateway to turn the communication with those old-school wired bus, BACnet or serial port devices into something that works for the Internet.
You need to do edge processing and analytics, and you may also want to do machine learning model scoring at the edge: build a model in the cloud or on premise in your own data center, then move that model to the edge and do real-time scoring, and maybe do real-time predictive maintenance or quality monitoring or any other aspect that you can model, where you can ask: is this behaving normally? Is this behaving as my model would predict? You may also want to do processing and analytics on premise and move some of that through a hub into the cloud, where you do more processing and analytics and maybe feed your machine learning, moving that model back. And we built something like this recently with some partners, where we used the Dell EMC gateways, which are pretty beefy gateways.
They have quite a lot of processing power and memory. We integrated with the Azure IoT Gateway SDK to act as a translation gateway that was talking to Bluetooth and OPC UA devices. We had the Striim edge server that was actually doing edge processing and analytics, and we used Statistica for machine learning, where we built a model in the cloud using Statistica that was predicting product quality and then scored that model at the edge to do real-time analytics. So that’s an example of the architecture. There’s lots of benefits to an architecture like this. If you can switch out your protocol translation gateway, you can connect to anything. You can react immediately because you’re doing it at the edge. You can do aggregation and turn data into information, separating the signal from the noise, doing all of that at the edge.
You limit the data sent to the cloud or into your big data storage. You can scale it as required by adding more edge nodes, more on-premise nodes, more cloud nodes, and control everything centrally. So how do you handle this oncoming tsunami of data? Just to reiterate some of the guidelines that we’ve given you: first is transition away from batch. It is an artificial construct. The world is not batch. The world is event driven. Batch was a limitation of technology. So you need to move towards a streaming-first architecture where you are at least collecting data in a streaming fashion, but move towards in-memory processing and analytics, especially edge processing for IoT. Don’t store all of your data. Data is not information. If you have a Nest thermostat that is sending the temperature in the room every second, that’s about three and a half thousand data points in an hour. If your room stays at 70 degrees, that’s one piece of information: it was 70 degrees for an hour.
You don’t need those three and a half thousand data points. So process at the edge: do filtering and aggregation, and remove redundancy before sending that to the cloud. And you do need a complete end-to-end platform. Open source provides lots and lots of pieces, but it doesn’t provide an overall solution. You’ll still have to glue all that together. And most enterprises that we’ve spoken to focus on solving their business problems. I know Alex said that all businesses are software companies. That is true, but you have to choose how much of it you want to build yourself. So I think we can go into questions and a discussion.
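The thermostat example lends itself to a short sketch. Assuming one reading per second (3,600 in an hour) arriving as `(timestamp, value)` pairs, the hypothetical `collapse_runs` helper below shows one simple way redundancy could be removed at the edge: consecutive readings that stay within a tolerance of the first value in a run are collapsed into a single record. This is only an illustration of the idea, not how any particular product does it.

```python
from typing import Iterable, Iterator, Optional, Tuple


def collapse_runs(readings: Iterable[Tuple[int, float]],
                  tolerance: float = 0.0) -> Iterator[Tuple[float, int, int]]:
    """Collapse consecutive readings that stay within `tolerance` of the
    first value of a run into one (value, first_ts, last_ts) record."""
    run_value: Optional[float] = None
    first_ts = last_ts = 0
    for ts, value in readings:
        if run_value is not None and abs(value - run_value) <= tolerance:
            last_ts = ts  # same information as before: just extend the run
            continue
        if run_value is not None:
            yield (run_value, first_ts, last_ts)
        run_value, first_ts, last_ts = value, ts, ts
    if run_value is not None:
        yield (run_value, first_ts, last_ts)  # flush the final run


# One reading per second for an hour, room steady at 70 degrees.
hour = [(t, 70.0) for t in range(3600)]
summary = list(collapse_runs(hour))
```

Here the hour of readings reduces to a single record saying the room was at 70 degrees from second 0 to second 3599, which is the one piece of information worth sending on.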
Thanks so much, Steve. I’d like to remind everyone to submit your questions via the Q and A panel on the left hand side of your screen. At this time I’ll mention that the slides from today’s presentation as well as a recording of the webinar will be made available for download. We will be sending a follow-up email with links to these assets within the next few days. So let’s turn to our questions. Steve, you mentioned Statistica. One of our participants was asking: do we have a machine learning capability in Striim, or do we have to bring in something from a third party to use machine learning?
So we have some degree of machine learning, focusing only on real time. For example, we have real-time linear, polynomial and multivariable regression algorithms in our platform that you can train over a training window and then use to predict out into the future. And similarly, real-time clustering, though you’re limited by the amount of data that you can store in memory. That’s why we’ve partnered with companies like Statistica, and we’ve had customers also work with machine learning software like H2O, where we gather and enrich data, getting it into a form for machine learning, and then trained analysts work their magic and derive value out of that data. They create a model, which they export, and typically they’ve been exporting them as JAR files, which we can then incorporate directly into our SQL. So you can then basically write queries within Striim that reference the machine learning functions and do real-time scoring. And that’s how we’ve gone about it today. And if you have any particular machine learning software in mind, we can talk about ways that we can integrate with that.
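As a rough picture of what training a regression over a window and predicting into the future can look like, here is a minimal Python sketch of ordinary least squares over a sliding window. The `WindowedLinearModel` class is invented for this illustration; it is not Striim’s implementation, and it refits from scratch on each prediction for clarity rather than speed.

```python
from collections import deque
from typing import Deque, Tuple


class WindowedLinearModel:
    """Least-squares fit of y = a + b*t over the most recent `size` points."""

    def __init__(self, size: int) -> None:
        self.points: Deque[Tuple[float, float]] = deque(maxlen=size)

    def observe(self, t: float, y: float) -> None:
        # With maxlen set, the oldest point drops out automatically.
        self.points.append((t, y))

    def predict(self, t: float) -> float:
        n = len(self.points)
        if n < 2:
            raise ValueError("need at least two points to fit a line")
        mean_t = sum(p[0] for p in self.points) / n
        mean_y = sum(p[1] for p in self.points) / n
        var_t = sum((p[0] - mean_t) ** 2 for p in self.points)
        if var_t == 0:
            raise ValueError("timestamps must not all be equal")
        cov = sum((p[0] - mean_t) * (p[1] - mean_y) for p in self.points)
        slope = cov / var_t
        intercept = mean_y - slope * mean_t
        return intercept + slope * t


model = WindowedLinearModel(size=5)
for t in range(10):
    model.observe(float(t), 2.0 * t + 1.0)  # synthetic, perfectly linear data
forecast = model.predict(12.0)              # extrapolate beyond the window
```

The window size is the memory limit mentioned above in miniature: the model only ever sees the last five points, so its forecast tracks recent behaviour and forgets older history.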
Excellent. Thank you. This question is also for you, Steve. You mentioned change data capture. One of our participants is asking, can you compare Oracle GoldenGate streaming with Striim?
Well, we are very fond of GoldenGate, because the four of us that founded Striim were on the executive team of GoldenGate prior to the acquisition in 2009 by Oracle. GoldenGate is great at doing database replication; it’s great at moving data from one place to another. And they also do have this big data adapter that can write the raw change data somewhere else. We are a full streaming integration platform, and what that means is that you can do quite complex data flows, quite complex processing of the data, and very importantly we can enrich the data by loading large amounts of reference data into memory, and do that in real time. And I mentioned earlier why that would be important for, say, normalized data coming from a database. So I think it’s crucial to recognize that you need different software for different things. We are a full integration platform that can also do analytics on that streaming data and can also build dashboards and visualizations over that streaming data. We’re not limited to replicating data from one place to another.
Perfect. Thank you. This next question is for Alex. Many of the computing enhancements of the last decade have been made in software. Do you think we’re entering a period where hardware innovation will again impact the industry?
Oh, thanks Katherine. I do. We’re starting to see some smaller processors come out. Google came out with its TPU last week. Intel has had its knuck quadrant. Qualcomm is building Snapdragon. People want to be able to do the scoring part of the machine learning out on the edge. And that’s a big part of the stream processing story here that Steve’s been talking about. A lot of the work that people are trying to do on big clusters in the data lake, on Hadoop and other platforms, they’re realizing they need to do out on the edge. And we’re seeing a wave of innovation now with some of these smaller processors that are going into devices out on the edge. So I definitely think that we’re going to see more hardware innovation out there, especially with deep learning and artificial intelligence becoming such a big driver of technology these days.
Excellent. And in fact, Alex, there was a related question to that, and that was: what will be the long-term impact of deep learning on the big data space, and what’s the intersection of deep learning and stream processing?
That’s a great question. I’m not exactly sure what the answer is going to be. Deep learning right now, from what I can tell, is mostly focused on solving a few problems: image recognition, you know, pulling still photos off of video feeds and trying to determine, is that a sidewalk? Is that a stop sign? Right. It’s being driven by a lot of the autonomous car developments that are going on. And also voice recognition and natural language processing. Those, from what I can tell, are the main uses for deep learning. But I don’t know, maybe Steve can talk about how, if at all, any of that stuff is going to make its way into stream processing.
This also segues into another question that came up, which was: how might data streaming be utilized in the healthcare industry, specifically our hospital? Well, I don’t know which hospital is your hospital, but in general there’s a lot of possibilities around streaming data. Part of it would be around patient monitoring. And there’s an interesting story about the commonality between hospitals and airports. It sounds like one of those bad jokes; the answer is wheelchairs. Wheelchairs have a big impact on how airlines actually run on schedule and how hospitals move patients in and out. So doing real-time tracking of wheelchairs and other equipment, you know, like crash carts and portable blood pressure monitors and all those things inside the hospital, could also be very, very useful. So you know where these things are immediately, and you can optimize your flow. Then there’s monitoring patients.
You put a device on them when they enter, and you know where they are at all times. Whether they’ve wandered out onto the fire escape or into the restroom, where you’ve lost them, you can find them. So real-time tracking is important, but the other piece would be, imagine a world where you combine real-time biometric monitoring, you know, all of the things you’re hooked up to when you’re in a hospital. You can anonymize that data. You can send that into a cloud, and you can also enrich that data with additional context like the patient’s symptoms and treatment and other things. Again, it’s all anonymized; there’s no patient-specific information in that. Do deep learning on that en masse, and you can then apply what you’ve learned in that deep learning to the real-time signals that you’re getting from patients.
Maybe that machine learning can spot something that might be a potential risk that isn’t just an obvious sign that a single biometric monitor would spot. So I think there’s a lot of potential, a lot of things that people are looking at to use real-time data and real-time data streaming in the healthcare industry, and that goes across all industries. Imagine you’re streaming all of the data from your fields in agriculture: the soil quality, water content, sunlight, a whole bunch of things. Even monitoring with video cameras for pests and sending drones to zap them with lasers. So there’s a lot of things that you could do in almost every industry, and it all relies on having up-to-the-second information and being able to react to it immediately.
Great. That was a very thorough response. Thank you both. Steve, this next question is for you: what data sources does Striim work with? A couple of examples the person gave were Mongo and CouchDB.
So we don’t, today, support real-time data collection for Mongo and CouchDB. On the database side, we support change data capture, so real-time streaming of database changes, the inserts, updates and deletes as they happen, for Oracle, SQL Server, MySQL and HP NonStop databases. That said, you can source data from Mongo and CouchDB through other means, which would basically be in the form of queries. You might have to build something to work with some of these things. In addition, we read from log files in real time, HDFS and Hive. We also work with message queues like JMS, Kafka, AMQP and Flume, and with devices through a variety of protocols: TCP, UDP, HTTP, AMQP, MQTT, all those kinds of wire protocols. And we have some adapters for the protocol translation gateways as well.
So sources of data means different things. We can source data, for example, from JDBC databases in the form of queries, but typically that’s for loading in-memory data, our in-memory data grid, for enrichment purposes. You wouldn’t want to do that for real-time data, because a JDBC query isn’t real time; it’s a single result set. So you have to differentiate between loading static data, which we use for reference and context information, or if you want to do batch in a streaming fashion, and real streaming data, which would go through change data capture. So as things are happening within a database, you are streaming that.
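The distinction between a one-time query result and a change stream can be sketched in a few lines of Python. In this illustration a snapshot is loaded once into an in-memory cache, and a stream of change events then keeps it current. The event format (`op` plus `row`, keyed by `id`) is a hypothetical shape chosen for the example; real change data capture tools each define their own.

```python
from typing import Dict, Iterable


def load_snapshot(rows: Iterable[dict]) -> Dict[int, dict]:
    """One-time load of a query result set into a cache keyed by id."""
    return {row["id"]: dict(row) for row in rows}


def apply_changes(cache: Dict[int, dict], changes: Iterable[dict]) -> None:
    """Apply insert/update/delete events to the cache in arrival order."""
    for change in changes:
        op, row = change["op"], change["row"]
        if op in ("insert", "update"):
            cache[row["id"]] = dict(row)
        elif op == "delete":
            cache.pop(row["id"], None)
        else:
            raise ValueError(f"unknown operation: {op}")


# Static reference data from a single query (a point-in-time result set).
cache = load_snapshot([
    {"id": 1, "name": "pump"},
    {"id": 2, "name": "valve"},
])

# Changes streamed afterwards, as they happen in the source database.
apply_changes(cache, [
    {"op": "update", "row": {"id": 1, "name": "pump-v2"}},
    {"op": "insert", "row": {"id": 3, "name": "sensor"}},
    {"op": "delete", "row": {"id": 2}},
])
```

Re-running the query would also bring the cache up to date, but only as of the moment it ran and at the cost of re-reading everything; the change stream carries just what happened, in order.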
Excellent. Shall we squeeze in just one more question? How does today’s stream processing relate to older messaging techniques developed by TIBCO, Software AG, IBM, SAP and Oracle that are still in widespread use? Want to take a stab at that one, Alex? You know, I think you might be better for this one, Steve. Obviously, with your history at GoldenGate, you’re pretty well steeped in this stuff.
So there’s definitely messaging technologies out there, right? JMS has been around for years. MQSeries has been around for even more years, and a lot of those messaging technologies were typically applied to application integration and SOA, service-oriented architectures. And so they are inherently not necessarily designed with the throughput in mind, or some of the other requirements in mind, from scalability, clusterability and other things, that will take the load of huge amounts of IoT data. And that’s possibly part of the reason why Kafka has had such a sudden rise in popularity. But that’s just half the story, right? The message queue is half the story. The other piece is the stream processing and, as I mentioned, something like complex event processing has been around for quite some time, and it used to be that it was, you know, not designed for specific purposes and the barriers to entry were really high.
It was difficult to get data in, it was difficult to get data out, and it was difficult to visualize and build dashboards over it and to integrate it with other data, with reference data, for example, that you load in memory. So I think one of the major things that is happening, and we are seeing that more and more, is the integration of multiple in-memory components. So the combination of a high-speed message infrastructure, in-memory data grid, in-memory stream processing and analytics, in-memory databases, in-memory visualization, in-memory machine learning scoring, in-memory transaction processing, all those things coming together to do a lot more stuff in memory. And part of the reason why that is possible now is because memory is getting cheaper and more available. And as we start to see new, interesting forms of memory, like Intel’s 3D XPoint, which marries the speed of RAM with persistent storage, the amount of in-memory processing that you’ll be able to do will increase astronomically, almost exponentially. Which should help that keep up with the growth in data that we saw earlier.
Thank you so much Steve and Alex. Unfortunately we are out of time. If we did not get to your specific question, we will follow up with you directly within the next few hours. On behalf of Alex and Steve, I would like to thank you again for joining us for today’s discussion. Have a great rest of your day.