Commercial Hadoop supplier Cloudera is adding more methods to extract data stored in the Hadoop Distributed File System by rolling up the Solr search engine and hooking it into its CDH distro.
The company is also banging the drum that we are entering a new era of computing, one in which old-style relational databases will still have a role in transaction processing and analytics – but a much more diminished one.
Solr is an Apache project, like many of the elements of the Hadoop stack, and is built on top of the Lucene search engine created by Doug Cutting, who was also one of the creators of the Nutch web crawler fifteen years ago. Cutting was working at search engine giant Yahoo! (back when it still had its own search engine, before it decided to piggyback on Microsoft Bing for search) when he read the Google paper on MapReduce, and extended Nutch with Tom White to create Hadoop. Solr was created by Yonik Seeley (not Cutting as this article originally said).
For the past several years, Cutting has been chief architect at Cloudera, and the wonder is what took so long for search to be added to the Hadoop stack officially. For whatever reason, there is no better person than Cutting to do the job.
Cloudera is not going to be the first commercial Hadoop distributor or large NoSQL data store supplier to adopt Solr as a search engine for unstructured data. DataStax, which has commercialized the Cassandra NoSQL data store originally created by Facebook, added Solr search for Cassandra back in March 2012, and MapR Technologies added Solr search to its M7 Hadoop distro last month, but as is the case with Cloudera, it is only in beta at the moment.
Cloudera CEO Mike Olson says that Cloudera Search, as the feature will be known in the CDH stack, has been in private beta for a number of months and is being made available for public beta testing now. Cloudera Manager 4.6, the control freak for the CDH stack, has been tweaked to install Solr search and to monitor it as well.
Cloudera Search is being distributed as a separate download, but the next release of CDH will have Solr search rolled up alongside MapReduce batch and Impala SQL query methods for tickling data stored in HDFS. General availability is expected sometime in the third quarter, but Olson says it is subject to change. And like the Interactive Query (Impala) feature, Cloudera Search (Solr) will have an additional support fee above and beyond the base CDH support fee.
“The key benefit is that anybody can now use this platform,” says Olson. “When Hadoop first appeared on the market, the knock against it by the existing analytics vendors was that you had to learn this new MapReduce thing and you have got to be a Java programmer. We have added SQL, but there are people who don’t know that language, either. People want to search for data they know exists in their cluster, but with a petabyte of data, there is no set of folders that makes sense any more. What we have learned from Google is that we just want to type terms into a search box.”
The Solr search engine can be used to index data as it is being ingested into HDFS or HBase and then embed it into HDFS for future searching. In some cases, using Solr to search through data will be sufficient to the task, and in others, end users will just use Solr to do data exploration before they write a MapReduce routine in Java, kick off a query against an HBase table, or even run Impala SQL queries against HDFS.
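To give a flavour of what "typing terms into a search box" looks like underneath, here is a minimal sketch that builds a query URL for Solr's standard `/select` handler. The host, port, and the `hdfs_logs` collection name are hypothetical, and how Cloudera Search actually exposes its Solr endpoints may differ from a stock Solr install.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, collection, terms, rows=10):
    """Build a query URL for Solr's standard /select handler.

    base_url and collection are placeholders for illustration only;
    they are not taken from any Cloudera documentation.
    """
    params = {"q": terms, "rows": rows, "wt": "json"}
    return f"{base_url.rstrip('/')}/{collection}/select?{urlencode(params)}"


# Hypothetical local Solr instance and collection name.
url = solr_select_url(
    "http://localhost:8983/solr", "hdfs_logs", "error AND timeout", rows=5
)
print(url)
```

Fetching that URL from a running Solr instance would return the matching documents as JSON, which is the sort of quick exploration an end user might do before committing to a MapReduce job or an Impala query.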
Providing access to data in HDFS doesn’t end at MapReduce, HBase, Impala, and Solr, says Olson. “Watch this space, because we will add other engines over time because what companies want is to access the same data without making copies.”
With all of the expanded capabilities of Hadoop, which are speeding up query times as the system moves from batch to near-realtime processing, and the significantly lower cost of storing data in Hadoop compared to traditional data warehouses powered by parallel relational databases, Olson thinks that the center of gravity for analytics is shifting away from relational tools to Hadoop.
“If you are paying by the terabyte, then these numbers on data warehouses get pretty scary pretty fast,” says Olson.
Moreover, customers have different kinds of data than these warehouses were designed to store, and they are asking different kinds of questions of a mix of data types from varied data sources. The street price of a data warehouse is something on the order of $20,000 per terabyte, according to Olson, while it is on the order of $500 per terabyte for a Hadoop cluster. And so, performing data cleansing and doing extraction/transform/load operations on data in a traditional warehouse can be very pricey indeed.
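As a back-of-the-envelope illustration of those two figures, the sketch below works out what a petabyte would cost at each rate. The per-terabyte prices are Olson's rough street prices quoted above; the 1,000TB capacity is an assumption picked purely for the example.

```python
# Olson's order-of-magnitude street prices, in US dollars per terabyte.
WAREHOUSE_USD_PER_TB = 20_000
HADOOP_USD_PER_TB = 500


def storage_cost(terabytes, usd_per_tb):
    """Total cost of storing the given capacity at a flat per-terabyte rate."""
    return terabytes * usd_per_tb


capacity_tb = 1_000  # assumed capacity: one (decimal) petabyte

warehouse = storage_cost(capacity_tb, WAREHOUSE_USD_PER_TB)
hadoop = storage_cost(capacity_tb, HADOOP_USD_PER_TB)
ratio = WAREHOUSE_USD_PER_TB / HADOOP_USD_PER_TB

print(f"Data warehouse: ${warehouse:,}")  # $20,000,000
print(f"Hadoop cluster: ${hadoop:,}")     # $500,000
print(f"Warehouse costs {ratio:.0f}x as much per terabyte")  # 40x
```

At those rates the gap is a factor of 40 regardless of capacity, which is the arithmetic behind Olson's pitch.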
“On a data warehouse, every workload you have is flying first class,” he quips. “Hadoop is not only cheaper, but you get a faster time to insight. And, you can move data transformation and analysis to Hadoop and free up capacity on the warehouse to do other work without spending more money there.”
It will be many years before most corporations are ready to give up their data marts and data warehouses, but the economics of the situation and the improving query and analytics tools in Hadoop are certainly going to make them stop and think. This is why Teradata and Oracle should probably have their own Hadoop distros at some point, like IBM has, instead of partnering with Hortonworks and Cloudera, respectively.
You want to be the next Red Hat more than you want to be its reseller. ®