Many corporations are taking the first steps on their Big Data journey by building an infrastructure that supports capturing and storing huge volumes of data from disparate sources. These Big Data infrastructures, largely based on Apache Hadoop, were developed with the mindset that data mining and after-the-fact analytics would be the primary activity. Hadoop made it simple to store data efficiently across cheap commodity hardware, so companies began to store every log stream from every machine and sensor. Now stored data volumes are exploding.
Just as companies are getting comfortable storing Big Data and processing it later for analytics, other companies are now looking for competitive advantage by using their real-time data for decision making, milliseconds after the data is generated. It does little good to discover an insight that would have been useful 15 minutes ago. As streaming data becomes the norm, reacting immediately to streaming events, and selectively filtering and adding rich history and context to streaming data, requires a new architectural approach. To achieve this goal, you need a streaming analytics infrastructure that integrates well with your existing data management frameworks, including Hadoop.
Many open source projects support pieces of an end-to-end streaming data analytics infrastructure: data ingestion, message queuing, search, in-memory grids, visualization, distributed processing, and so on. Piecing these projects together into a working application is a delicate job, requiring sophisticated development expertise and ongoing management talent that is typically hard to retain. Worse, as requirements change, the same specialized expert must revisit the application and architect an update, taking into account not only your new requirements but also whatever has changed in the open source projects that underpin the solution. With increasing data volumes and application complexity, this simply doesn't scale.
Another streaming analytics option is to use an end-to-end platform, and let the platform handle the nuances to make the streaming data available for analytics and applications without requiring deep technical skills. This platform should also leverage all of your existing investments in Big Data infrastructure.
With native HDFS read and write, and HBase support, Striim lets Hadoop shine as Hadoop was intended: for "distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware" (Wikipedia). Striim adds a true continuous streaming analytics layer to your Big Data infrastructure that lives in harmony with Hadoop, augmenting a Hadoop ecosystem in several ways.
For streaming data inbound to Hadoop, Striim offers continuous real-time data integration with HDFS, capturing heterogeneous streams and flowing them into Hadoop. During ingestion, Striim can pre-process streams before they land in Hadoop, quickly filtering out excess data that will never be used (such as timestamps precise to the 12th decimal place when minutes are the required granularity) and adding rich context to make the stored data immediately usable. Reading from HDFS and HBase data sources allows rich context to be pulled from Hadoop into stream caches, so streaming data can be correlated milliseconds after it is generated. This deep context data can also feed Striim's predictive algorithms as baseline information that is continuously refined with streaming data.
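To make the filter-and-enrich idea concrete, here is a minimal Python sketch, independent of Striim's own tooling. It assumes events arrive as dictionaries and that device context (the kind of data that might be cached from HDFS or HBase) is available in a hypothetical in-memory lookup table; the field names are illustrative, not part of any real API.

```python
from datetime import datetime, timezone

# Hypothetical context cache, as might be pre-loaded from HDFS/HBase.
DEVICE_CONTEXT = {
    "sensor-7": {"site": "Plant A", "line": "3"},
}

def preprocess(event):
    """Filter and enrich one raw event before it lands in Hadoop.

    Drops events with no reading, truncates an over-precise epoch
    timestamp to minute granularity, and joins in device context.
    """
    if event.get("value") is None:
        return None  # filter out useless records up front
    ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
    enriched = {
        "minute": ts.strftime("%Y-%m-%dT%H:%M"),  # minutes, not nanoseconds
        "device": event["device"],
        "value": event["value"],
    }
    # Join in cached context so the stored record is immediately usable.
    enriched.update(DEVICE_CONTEXT.get(event["device"], {}))
    return enriched

raw = {"device": "sensor-7", "ts": 1700000000.123456789, "value": 42.5}
print(preprocess(raw))
```

The payoff is the same one described above: less excess data written to HDFS, and records that already carry the context needed to act on them.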
Another use for Striim is exploratory analytics on large time-series data sets stored in Hadoop. The platform is optimized to correlate and aggregate over time-series data and provides significant performance gains over traditional methods of processing large time-series data sets, such as reprocessing a stored log file. Finally, for a quick preview of what exists in Hadoop, the Striim Source Preview functionality lets you quickly connect to a Hadoop cluster, browse its files, open them immediately to inspect their contents, and begin building dashboards.
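The kind of time-series aggregation described above can be sketched as a tumbling-window average: a rough, batch-style stand-in for what a streaming platform computes incrementally. This is an illustrative Python sketch, not Striim's implementation.

```python
from collections import defaultdict

def tumbling_window_avg(events, window_secs=60):
    """Aggregate (timestamp, value) pairs into fixed, non-overlapping
    time windows and return the per-window average, keyed by window start."""
    sums = defaultdict(lambda: [0.0, 0])
    for ts, value in events:
        bucket = int(ts // window_secs) * window_secs  # window start time
        sums[bucket][0] += value
        sums[bucket][1] += 1
    return {b: total / count for b, (total, count) in sorted(sums.items())}

# Three readings: two fall in the first minute, one in the second.
events = [(0, 10.0), (30, 20.0), (61, 40.0)]
print(tumbling_window_avg(events))  # {0: 15.0, 60: 40.0}
```

A streaming engine maintains these running sums as events arrive rather than rescanning a stored log file, which is where the performance gain over reprocessing comes from.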
Striim users get more out of their Hadoop investment. Filtering streams and adding context up front, before they land in Hadoop, reduces the data storage footprint and makes stored records more actionable. Populating stream caches with data from Hadoop allows real-time correlation of deep context with streaming data. Finally, exploratory analysis is enhanced with simple point-and-click HDFS access via Striim Source Preview: find files, preview their contents, and prepare the data for ingestion into the streaming analytics platform, all from the GUI.
Upgrade to analytics at the speed of reality, with no perceptible delay between event and action. A new era of computing is upon us.