Welcome to season 2 of the Hue video series. In this new chapter, we are going to demonstrate how Hue simplifies Hadoop usage and lets you focus on the business rather than the underlying technology. In a real-life scenario, we will use various Hadoop tools within the Hue UI to explore some data and extract competitive insights from it.
Let’s go surf the Big Data wave, directly from your browser!
We want to open a new restaurant. To optimize our future business, we would like to learn more about the existing restaurants: which tastes are trending, and what diners are looking for or feel positive or negative about. To answer these questions, we are going to need some data.
Luckily, Yelp provides some datasets of restaurants and reviews, so we download them. What’s next? Let’s move the data into Hadoop and make it queryable!
The current format is JSON, which is easy to save but difficult to query, as each row is one big record and requires a more sophisticated loader. We are also going to clean up the data a bit in the process.
All the code is available on the Hadoop Tutorial GitHub.
Pig natively provides a JsonLoader. We load our data and map it to a schema, then explode the votes into 3 columns. Notice the clean-up of the text of the reviews.
Here is the script:
-- Load each JSON review and map it to a schema (votes is an untyped map)
reviews = LOAD 'yelp_academic_dataset_review.json'
  USING JsonLoader('votes:map[],user_id:chararray,review_id:chararray,stars:int,date:chararray,text:chararray,type:chararray,business_id:chararray');

-- Explode the votes into 3 columns and strip newlines and tabs from the review text
tabs = FOREACH reviews GENERATE
  (INT) votes#'funny', (INT) votes#'useful', (INT) votes#'cool',
  user_id, review_id, stars,
  REPLACE(REPLACE(text, '\n', ''), '\t', ''),
  date, type, business_id;

STORE tabs INTO 'yelp_academic_dataset_review.tsv';
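To see what this transformation does to a single record before running it on the cluster, the same steps can be sketched locally in plain Python. This is only an illustration under stated assumptions: the sample record and the `review_to_tsv` helper are hypothetical, not part of the tutorial code, and the field values are made up.

```python
import json

# Hypothetical sample record shaped like one line of the review file;
# the values are invented for illustration.
SAMPLE_LINE = json.dumps({
    "votes": {"funny": 0, "useful": 2, "cool": 1},
    "user_id": "u1",
    "review_id": "r1",
    "stars": 4,
    "date": "2012-01-01",
    "text": "Great tacos.\nWould come\tback.",
    "type": "review",
    "business_id": "b1",
})


def review_to_tsv(line):
    """Mirror the Pig FOREACH: three vote columns first, then the other fields."""
    review = json.loads(line)
    votes = review.get("votes", {})
    # Same clean-up as the nested REPLACE calls: drop newlines and tabs.
    text = review["text"].replace("\n", "").replace("\t", "")
    columns = [
        votes.get("funny"), votes.get("useful"), votes.get("cool"),
        review["user_id"], review["review_id"], review["stars"],
        text, review["date"], review["type"], review["business_id"],
    ]
    # Missing vote counts become empty fields.
    return "\t".join("" if c is None else str(c) for c in columns)


if __name__ == "__main__":
    print(review_to_tsv(SAMPLE_LINE))
```

Because the newlines and tabs are removed rather than replaced with a space, words on either side of them get joined together, which is worth keeping in mind if the text is analyzed later.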
Note: if the script fails with a ClassNotFound exception, you might need to log in as ‘oozie’ or ‘hdfs’ and upload /usr/lib/pig/lib/json-simple-1.1.jar into /user/oozie/share/lib/pig on HDFS with File Browser.
Let’s convert the business data to TSV with a great Pig feature: Python UDFs. We are going to process each row with a UDF that loads the JSON records one by one and prints them with tabs as the delimiter.
As Pig currently uses Jython 2.5 for executing Python UDFs and there is no built-in json lib, we need to download jyson from http://downloads.xhaus.com/jyson/. Grab the jyson-1.0.2 version, extract it and upload jyson-1.0.2.jar to /user/oozie/share/lib/pig with File Browser.
We need to import our Python UDF into Pig. Open up the Pig Editor and upload a file resource named converter.py [link]. You can also create the file directly on HDFS with FileBrowser, then edit it and add this script:
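from com.xhaus.jyson import JysonCodec as json

@outputSchema("business:chararray")
def tsvify(line):
    # Parse one JSON record and convert every value to a unicode string
    business_json = json.loads(line)
    business = map(unicode, business_json.values())
    # Join with tabs and flatten embedded newlines so each record stays on one line
    return '\t'.join(business).replace('\n', ' ').encode('utf-8')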
Go to ‘Properties’, ‘Resource’ and specify the path to converter.py on HDFS.
You are then ready to type the following Pig script:
REGISTER 'converter.py' USING jython AS converter;

reviews = LOAD '/user/romain/yelp/yelp_academic_dataset_business.json' AS (line:CHARARRAY);

tsv = FOREACH reviews GENERATE converter.tsvify(line);

STORE tsv INTO 'yelp_academic_dataset_business.tsv';
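Before uploading converter.py, the logic of the UDF can be tried on a workstation. The sketch below is an approximation, not the UDF itself: it assumes a regular Python 3 interpreter and the standard `json` module instead of Jython and jyson, drops the Pig-specific `@outputSchema` decorator, and uses a hypothetical sample record whose field names are illustrative only.

```python
import json


def tsvify(line):
    """Turn one JSON record into a single tab-separated line."""
    record = json.loads(line)
    # Like the UDF, emit the values in the order the JSON parser yields them.
    values = [str(value) for value in record.values()]
    # Flatten embedded newlines so each record stays on one line.
    return "\t".join(values).replace("\n", " ")


# Hypothetical record; the field names are illustrative, not the full schema.
sample = (
    '{"business_id": "b1", "name": "Taco Place", '
    '"full_address": "1 Main St\\nPhoenix, AZ", "stars": 4.5}'
)

if __name__ == "__main__":
    print(tsvify(sample))
```

The column order follows whatever order the parser returns the keys in, so it is worth comparing a few output rows against the columns you expect before building a table on top of the file. Values that themselves contain tabs are not escaped in this sketch.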
Pig is a powerful tool for processing terabytes of data, and the Hue Pig Editor makes it easier to play around with. Python UDFs will become part of the editor when HUE-1136 is finished. In episode 3, we will see how to convert to even better formats.
In the next episode, let’s see how to query the data and learn more about the restaurant market!