Key Takeaways from Strata+Hadoop World, San Jose

Or should it have been Strata[minus]Hadoop World?

One of the most interesting takeaways from last week’s Strata+Hadoop World in San Jose was that the event isn’t about Hadoop anymore. With the announcement that the conference will be renamed “Strata Data Conference” going forward, and a flurry of negative press about Hadoop, people may be left scratching their heads, wondering what went wrong.

It’s also amazing to see just how much attention real-time and streaming data architectures are getting. Maybe O’Reilly should have cut to the chase and renamed the event “Strata+Streaming World.”

Predictions of this outcome have been around for some time. In 2014, Gartner famously declared:

“Through 2018, 90% of deployed data lakes will be useless as they are overwhelmed with information assets captured for uncertain use cases.”

In 2015, Bill Schmarzo of EMC Global Services said,

“You start with your technology, bring in some Hadoop, throw some data in there and you kind of hope magic stuff happens. It’s really a process fraught with all kind of misdirection.”

He indicated the problem was that there was no process around Data Lakes, and the whole thing needed to be use-case-driven.

I, in turn, followed up on this theme, suggesting that part of the solution was to move to streaming data ingestion and to prepare your data as it moves, before landing it in your lake. Importantly, I recommended doing this only on a use-case-by-use-case basis.

“A Data Lake can be a powerful ally if it is built in the correct way. By utilizing streaming data collection and processing, you can feed your Lake with data designed to answer your questions, and ensure it is current and relevant. After all, what is better for a lake than a supply of good, healthy streams?”

Now the chickens have come home to roost, and people are realizing three things:

  • The premise of storing all your data for later (as yet unknown) analysis does not make sense
  • The investment in Hadoop has not yielded the expected return on investment
  • Real data is created fast and needs to be streamed and processed in-memory to get value from it

We can thank IoT for helping to hammer home that last point. If more IoT data is created than can be stored, you are forced to think about streaming data integration and in-memory processing. Once you have the concepts of data streams and real-time continuous data integration in mind, it’s easier to make the leap to streaming all your data sources. The only question that remains is “How?”
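To make the idea of in-memory stream processing concrete, here is a minimal Python sketch. It is illustrative only and does not represent any particular product's API: the `Reading` type, the `rolling_averages` and `alerts` functions, and the window and threshold values are all hypothetical. It shows readings being processed one at a time as they arrive, with only a small per-sensor window held in memory rather than every raw event being stored.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Iterator, NamedTuple, Tuple


class Reading(NamedTuple):
    """A hypothetical IoT sensor reading."""
    sensor_id: str
    value: float


def rolling_averages(
    readings: Iterable[Reading], window: int = 3
) -> Iterator[Tuple[str, float]]:
    """Yield (sensor_id, rolling average) for each reading as it arrives.

    Only the last `window` values per sensor are kept in memory.
    """
    recent: Dict[str, Deque[float]] = {}
    for reading in readings:
        buf = recent.setdefault(reading.sensor_id, deque(maxlen=window))
        buf.append(reading.value)  # the oldest value drops out automatically
        yield reading.sensor_id, sum(buf) / len(buf)


def alerts(
    readings: Iterable[Reading], threshold: float, window: int = 3
) -> Iterator[Tuple[str, float]]:
    """Pass downstream only the rolling averages that exceed `threshold`."""
    for sensor_id, average in rolling_averages(readings, window):
        if average > threshold:
            yield sensor_id, average


if __name__ == "__main__":
    stream = [
        Reading("a", 10.0),
        Reading("a", 20.0),
        Reading("b", 5.0),
        Reading("a", 30.0),
        Reading("a", 40.0),
    ]
    for sensor_id, average in alerts(stream, threshold=18.0):
        print(sensor_id, average)
```

Because both functions are generators, `stream` could just as well be an unbounded source such as a socket or a message-queue consumer; nothing in the sketch requires the full data set to exist up front.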

At Strata, several companies cited IoT data as an important part of their data integration strategy. Others were concerned with more tactical questions, such as how to continually upload on-premises data to the cloud.

Striim’s ability to continuously collect data from files and databases (through change data capture), as well as from message queues like Kafka and from sensors, enables companies to focus on what they want to move and what processing or enrichment that data needs. Since Striim takes care of the collection and plumbing, and also enables them to target their existing investments, such as cloud or on-premises data stores, the speed of development and the ROI are much easier to understand.

For more information on how Striim can help companies address the shortfalls of Hadoop, check out this post on how to integrate streaming data management into an existing Hadoop infrastructure.