Hadoop Tutorials season II: 7. How to index and search Yelp data with Solr

In the previous episode we saw how to use Pig and Hive with HBase. This time, let’s see how to make our Yelp data searchable by indexing it and building a customizable UI with the Hue Search app.



Indexing data into Solr


This tutorial is based on SolrCloud. Here is a step-by-step guide to installing it, and the list of required packages:

  • solr-server

  • solr-mapreduce

  • search


The next step is to deploy and configure SolrCloud. We follow the documentation.


After this, we create a new collection and index named ‘reviews’. We use our predefined schema, which needs to be copied from the Hadoop tutorial GitHub repository.


cp solr_local/conf/schema.xml solr_configs/conf/schema.xml

solrctl instancedir --create reviews solr_local

solrctl collection --create reviews -s 1

We replace the field definitions in the schema with a mapping that corresponds to our Yelp data. The schema declares each data field that will be available in the search index. You can read more about schema.xml in the Solr wiki.

  <field name="business_id" type="text_en" indexed="true" stored="true" />
  <field name="cool" type="tint" indexed="true" stored="true" />
  <field name="date" type="text_en" indexed="true" stored="true" />
  <field name="funny" type="tint" indexed="true" stored="true" />
  <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />  
  <field name="stars" type="tint" indexed="true" stored="true" />
  <field name="text" type="text_en" indexed="true" stored="true" />
  <field name="type" type="text_en" indexed="true" stored="true" />         
  <field name="useful" type="tint" indexed="true" stored="true" />
  <field name="user_id" type="text_en" indexed="true" stored="true" />
  <field name="name" type="text_en" indexed="true" stored="true" />
  <field name="full_address" type="text_en" indexed="true" stored="true" />
  <field name="latitude" type="tfloat" indexed="true" stored="true" />
  <field name="longitude" type="tfloat" indexed="true" stored="true" />
  <field name="neighborhoods" type="text_en" indexed="true" stored="true" />
  <field name="open" type="text_en" indexed="true" stored="true" />
  <field name="review_count" type="tint" indexed="true" stored="true" />
  <field name="state" type="text_en" indexed="true" stored="true" />
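To make the role of the field types concrete, here is a minimal Python sketch that parses a few of the field definitions above and checks a sample record against them. It is only an illustration and not part of Solr or Hue: the `CONVERTERS` table, the `load_fields` and `check_record` helpers, and the sample record are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# A few of the field definitions from the schema above, wrapped in a root
# element so the fragment parses as XML.
SCHEMA_FRAGMENT = """
<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
  <field name="stars" type="tint" indexed="true" stored="true" />
  <field name="text" type="text_en" indexed="true" stored="true" />
  <field name="latitude" type="tfloat" indexed="true" stored="true" />
</fields>
"""

# Assumed mapping from the Solr field type names used here to Python converters.
CONVERTERS = {"tint": int, "tfloat": float, "string": str, "text_en": str}


def load_fields(schema_xml):
    """Return {field name: (type name, required flag)} from a schema fragment."""
    root = ET.fromstring(schema_xml)
    return {
        f.get("name"): (f.get("type"), f.get("required") == "true")
        for f in root.iter("field")
    }


def check_record(record, fields):
    """Convert a dict of raw strings to typed values, or raise ValueError."""
    missing = [n for n, (_, req) in fields.items() if req and n not in record]
    if missing:
        raise ValueError("missing required fields: %s" % missing)
    typed = {}
    for name, raw in record.items():
        if name not in fields:
            raise ValueError("field not in schema: %s" % name)
        # int() and float() raise ValueError on values that do not parse.
        typed[name] = CONVERTERS[fields[name][0]](raw)
    return typed


fields = load_fields(SCHEMA_FRAGMENT)
doc = check_record({"id": "r1", "stars": "4", "latitude": "33.5"}, fields)
print(doc)
```

A check like this on a handful of exported rows can catch a non-numeric value in a tint or tfloat column before you launch the indexing job.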

Then, we retrieve and clean a subset of our Yelp data with a Hive query, download the result as CSV, and index it with the MapReduce indexer tool using this command:

hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  -D 'mapred.child.java.opts=-Xmx500m' \
  --log4j /usr/share/doc/search*/examples/solr-nrt/log4j.properties \
  --morphline-file solr_local/reviews.conf \
  --output-dir hdfs://localhost:8020/tmp/load \
  --verbose --go-live \
  --zk-host localhost:2181/solr \
  --collection reviews \
  hdfs://localhost:8020/tmp/query_result.csv

The command will use our morphline file to map the Yelp data to the fields defined in our index schema.xml.
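Conceptually, this mapping turns each CSV row into a document keyed by schema field names. The Python sketch below illustrates the idea only. It is not morphline syntax and not what the indexer runs: the column order in `COLUMNS`, the sample rows, and the `rows_to_docs` helper are all hypothetical.

```python
import csv
import io

# Hypothetical column order of the exported CSV; the real order depends on
# the Hive query and must match what the morphline file declares.
COLUMNS = ["id", "business_id", "stars", "text"]

# Made-up sample rows, not real Yelp data.
SAMPLE_CSV = 'r1,b42,4,"Great tacos, slow service"\nr2,b42,2,Too noisy\n'


def rows_to_docs(csv_text, columns):
    """Yield one dict per CSV row, keyed by index field name."""
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != len(columns):
            raise ValueError(
                "expected %d columns, got %d" % (len(columns), len(row))
            )
        yield dict(zip(columns, row))


docs = list(rows_to_docs(SAMPLE_CSV, COLUMNS))
for doc in docs:
    print(doc)
```

The column-count check matters because a row with a missing or extra column would otherwise assign values to the wrong fields without any error.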

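While debugging the morphline, the --dry-run option will save you some time: it runs the morphline locally and prints the resulting documents instead of loading them into Solr.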


Customize the search result

The administration panel lets you tweak the look & feel and features of the search page. This is explained in the second part of the video.


Conclusion

Cloudera Search is great for opening Hadoop to a wider user base and for quick data retrieval. Other articles describe some use cases in detail, such as email or customer data search.

Cloudera Morphline is also an interesting tool for facilitating the indexing of your data. You can learn more about it on its project website.

As usual, feel free to comment on the hue-user list or @gethue!


Troubleshooting

1. If you see this error:

org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error CREATEing SolrCore 'reviews_shard1_replica1': Unable to create core: reviews_shard1_replica1 Caused by: Could not find configName for collection reviews found:null

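You might have forgotten to upload the configuration (the instance directory) for the collection before creating it. Use the directory that holds your conf/schema.xml:

solrctl instancedir --create reviews solr_configs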

2. If you see this error:

ERROR - 2013-10-10 20:01:21.383; org.apache.solr.servlet.SolrDispatchFilter; Could not start Solr. Check solr/home property and the logs
ERROR - 2013-10-10 20:01:21.409; org.apache.solr.common.SolrException; null:org.apache.solr.common.SolrException: solr.xml not found in ZooKeeper
       at org.apache.solr.core.ConfigSolr.fromSolrHome(ConfigSolr.java:109)
Server is shutting down

You might need to force Solr to reload the configuration. Beware: this might break the ZooKeeper state, in which case see error #3.


3. If you see this error:

KeeperErrorCode = NoNode for /overseer/collection-queue-work
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /overseer/collection-queue-work


It probably comes from error #2. You might need to re-upload the config and recreate the collection.


This article was originally posted 10 months ago.

Tags: search video tutorial season2

