
Introduction
Big Data is booming, and many businesses have started embracing it to stay competitive. Yet there is still a widespread misconception that Hadoop, a term often used interchangeably with Big Data, is the silver bullet and the only solution. It’s hard to talk about Big Data without the 3 V’s (Volume, Velocity, and Variety), but we will keep it short: Big Data, sometimes called Hyper Data or Hybrid Data, involves one or a combination of these V’s. It’s important for businesses to look beyond the Hadoop hype and figure out the tools and tech stack that fit their specific business problems. Not all businesses have Facebook- or Yahoo-scale volume problems.
Unfortunately, the complexity of the Hadoop ecosystem and the associated skills shortage are the two main impediments to Big Data adoption for many businesses. This blog post lays out some facts about Hadoop, explains why it is not the only Big Data solution and why businesses should consider a tech stack that doesn’t need a Zoo, and presents the stack we have architected to address the Big Data variety challenge.
Hadoop Facts
“Apache Hadoop, by all means, has been a huge success on the open source front. Thousands of people have contributed to the codebase at the Apache Software Foundation, and the Hadoop project has spawned off into dozens of happy and healthy Apache projects like Hive, Impala, Spark, HBase, Cassandra, Pig, Tez, Ambari, and Mahout. Apart from the Apache Web Server, the Apache Hadoop family of projects is probably the ASF’s most successful project ever.”
“It’s not that Hadoop is just an immature technology – rather, it’s unsuitable for many mainstream Big Data projects.”
“There is no doubt that some companies have gotten great results out of Hadoop and are using it to hammer petabytes of less-structured data into usable insights. But these success stories are predominantly relegated to either the biggest firms in their respective industries, or well-funded startups looking to leverage new Internet business models to disrupt existing industries. By and large, it hasn’t trickled down into the marketplace as a whole, at least not yet.”
“Hadoop will disappear just like other underlying database technologies have disappeared, Cloudera’s chief strategy officer Mike Olson says. Hadoop distributors hope that predictive analytics via machine learning becomes a requirement and that Hadoop-powered analytics get built in and integrated with other offerings. Fast new frameworks, like Apache Spark, can abstract away the complexity and allow organizations to use big data analytic systems without becoming data scientists or brilliant architects themselves. But even if Spark and the rest help abstract away some of the underlying complexity, the complexity is still there under the covers.”
“Despite considerable hype and reported successes for early adopters, 54% of survey respondents report no plans to invest at this time,” said Nick Heudecker, research director at Gartner. “Furthermore, the early adopters don’t appear to be championing for substantial Hadoop adoption over the next 24 months. In fact, there are fewer who plan to begin in the next two years than already have.”
“Only 26 percent of respondents claim to be deploying, piloting or experimenting with Hadoop, while 11 percent plan to invest within 12 months and seven percent are planning investment in 24 months. Responses pointed to two interesting reasons for the lack of intent. First, several responded that Hadoop was simply not a priority. The second was that Hadoop was overkill for the problems the business faced, implying the opportunity costs of implementing Hadoop were too high relative to the expected benefit.”
According to Gartner, “Skills gaps continue to be a major adoption inhibitor for 57 percent of respondents, while figuring out how to get value from Hadoop was cited by 49 percent of respondents. The absence of skills has long been a key blocker. While tools are improving, they primarily support highly skilled users rather than elevate the skills already available in most enterprises.”
According to Paradigm4 research, 71% of data scientists feel that taming data variety is proving to be more important than volume, and Hadoop only takes you so far. Hadoop was unrealistically hyped as a universal, disruptive Big Data solution; it is a technology, not a solution. This is causing companies concern about embarking on Big Data opportunities.
No Elephant Big Data Tech Stack
| Component | Description |
| --- | --- |
| Mediation/Routing/Messaging | Apache Camel’s DSL implements the mediation/routing logic, invoking the appropriate input adapter in the ingestion layer via Kafka and ActiveMQ message queues |
| Ingestion | Selenium with Jsoup scrapes interactive websites across multiple depth levels to get data. Camel protocol and data adapters ingest JSON/XML/CSV/TSV, and multiple custom plugin-based adapters process Mainframe/DB2 data files |
| Aggregation/Transformation | Talend performs most of the transformation processing, R/Python handle statistical aggregation calculations, and other custom adapters execute user-defined functions (UDFs) |
| Security | All layers go through LDAP authentication for tighter access control |
| Data Stores | Elasticsearch serves as the metadata repository containing ontologies, mappings, and transformation rules. Neo4j holds nodes and relationships between business entities and data records. MongoDB stores unstructured scraped content and geospatial documents produced by the processing layer. Cassandra holds time-series tuples of pre-computed data |
| Product | A Spring-based Java REST API interacts with the polyglot data stores via Spring XD and other custom adapters to retrieve graph relations, metadata, and time series. A frontend app built with AngularJS, Highcharts, and D3.js provides visualization and other user preferences |
| Data Science Tools | A revamped version of OpenRefine lets analysts explore the raw source data. A custom analytical app built with Shiny + R provides visualizations and statistical calculations so that analysts and data scientists can decide which computations to apply to the raw source data |
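The mediation/routing layer above makes one core decision: inspect an incoming message and dispatch it to the right ingestion adapter. Camel’s Java DSL expresses this declaratively with content-based routing; the sketch below shows the same idea in plain Python. The adapter functions and the format-detection heuristics are hypothetical stand-ins, not the stack’s actual code:

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical input adapters: each normalizes a raw payload into a
# list of record dicts. In the real stack these would be Camel
# endpoints fed via Kafka/ActiveMQ; here they are plain functions.
def ingest_json(payload):
    data = json.loads(payload)
    return data if isinstance(data, list) else [data]

def ingest_xml(payload):
    root = ET.fromstring(payload)
    return [{child.tag: child.text for child in rec} for rec in root]

def ingest_csv(payload, delimiter=","):
    return list(csv.DictReader(io.StringIO(payload), delimiter=delimiter))

def route(payload):
    """Content-based router: pick an adapter by inspecting the message,
    the decision Camel's choice()/when() DSL expresses declaratively."""
    stripped = payload.lstrip()
    if stripped.startswith(("{", "[")):
        return ingest_json(payload)
    if stripped.startswith("<"):
        return ingest_xml(payload)
    if "\t" in payload.splitlines()[0]:
        return ingest_csv(payload, delimiter="\t")
    return ingest_csv(payload)
```

For example, `route('{"id": 1}')` yields `[{"id": 1}]`, while a CSV payload flows through `csv.DictReader`. The point of the pattern is that producers never need to know which adapter handles their data.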
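The data-stores row mentions that Cassandra holds pre-computed time-series tuples. A minimal sketch of that pre-computation step, rolling raw events up into per-day aggregates, is shown below; the field names and daily granularity are illustrative assumptions, not the stack’s actual schema:

```python
from collections import defaultdict
from datetime import datetime

def precompute_daily(events):
    """Roll raw (iso_timestamp, metric, value) events up into one
    aggregate tuple per (metric, day) -- the pre-computed shape a
    wide-row store like Cassandra would serve at read time."""
    buckets = defaultdict(list)
    for ts, metric, value in events:
        day = datetime.fromisoformat(ts).date().isoformat()
        buckets[(metric, day)].append(value)
    # Each bucket collapses to count/min/max/mean for fast reads.
    return {
        key: {
            "count": len(vals),
            "min": min(vals),
            "max": max(vals),
            "mean": sum(vals) / len(vals),
        }
        for key, vals in buckets.items()
    }
```

Writing aggregates at ingest time like this trades storage for query speed: the REST API can serve a day’s statistics with a single key lookup instead of scanning raw events.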
Conclusion
- Hadoop is a great technology and one of the main catalysts that enabled businesses to adopt Big Data to solve interesting problems. However, it is not THE ONLY solution for all Big Data challenges and opportunities.
- More and more businesses are facing challenges with taming data variety and should think about the tech stack that best suits their needs.
- Treselle Systems has been voted one of the top 25 most promising Big Data vendors of 2015 by Outsourcing Gazette Magazine and has strong Big Data expertise, from strategy to design to implementation to deployment.