This blog post is about accessing the Hive Metastore from Hue, the open source Hadoop UI, and clearing up some confusion about HCatalog usage.
Apache HCatalog is a project enabling non-Hive scripts to access Hive tables. You can then directly load tables with Pig or MapReduce without having to worry about re-defining the input schemas, caring about the data location or duplicating it.
Hue comes with an application for accessing the Hive Metastore within your browser: Metastore Browser. Databases and tables can be browsed, created or deleted with wizards.
The wizards were demonstrated in the previous tutorial about how to Analyse Yelp data. Hue uses HiveServer2 for accessing the Hive Metastore instead of HCatalog. This is because HiveServer2 is the new secure, concurrent server for Hive and it already includes a fast Hive Metastore API.
HCatalog connectors are however useful for accessing Hive data from Pig. Here is a demo about accessing the Hive example tables from the Pig Editor.
Here is a video summary of the new features:
First you need to install HCatalog from here or Cloudera Manager. If you are using a non-pseudo-distributed cluster (e.g. not on a demo VM), make sure that the Hive Metastore is remote or you will get an error like the one shown further down. Then, upload the 3 jars from /usr/lib/hcatalog/share/hcatalog/ and all the Hive ones from /usr/lib/hive/lib to the Oozie Pig sharelib in /user/oozie/share/lib/pig. This can be done in a few clicks while being logged in as ‘oozie’ or ‘hdfs’ in the File Browser. Beware that these jars will then be included in every future Pig script, which might be unnecessary.
In Hue 3.6 or CDH5, there is no need to copy the jars anymore. Just include the hive-site.xml file as a File in the Properties of the script, e.g. /user/test/hive-site.xml
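For those who prefer the command line to the File Browser, the jar upload can be sketched roughly as below. The three directories are the ones named above; running as the local ‘hdfs’ user through sudo, the `*.jar` glob standing in for "the 3 jars", and the `DRY_RUN` switch are assumptions made for illustration. The script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch: copy the HCatalog and Hive jars into the Oozie Pig sharelib on HDFS.
# Paths are taken from the post; DRY_RUN and 'sudo -u hdfs' are assumptions.

HCAT_JAR_DIR=/usr/lib/hcatalog/share/hcatalog
HIVE_JAR_DIR=/usr/lib/hive/lib
SHARELIB_DIR=/user/oozie/share/lib/pig

# Print the command by default; execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  else
    echo "$*"
  fi
}

# $1: local directory holding the jars to upload
upload_jars() {
  run sudo -u hdfs hadoop fs -put "$1"/*.jar "$SHARELIB_DIR"
}

upload_jars "$HCAT_JAR_DIR"
upload_jars "$HIVE_JAR_DIR"
```

Review the printed commands first, then re-run with `DRY_RUN=0` once they look right for your cluster.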
-- Load table 'sample_07'
sample_07 = LOAD 'sample_07' USING org.apache.hcatalog.pig.HCatLoader();
-- Compute the average salary of the table
salaries = GROUP sample_07 ALL;
out = FOREACH salaries GENERATE AVG(sample_07.salary);
DUMP out;
As HCatalog needs to access the metastore, we need to specify the hive-site.xml. Go to ‘Properties’, ‘Resources’ and add a ‘File’ pointing to the hive-site.xml uploaded on HDFS.
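If the hive-site.xml is not on HDFS yet, something along these lines should put it there. The local path /etc/hive/conf/hive-site.xml is an assumption (a common location on packaged installs), and /user/test/hive-site.xml is only the example destination; adjust both to your setup.

```shell
#!/bin/sh
# Sketch: upload hive-site.xml to HDFS so the Pig script can reference it as a 'File'.
# LOCAL_HIVE_SITE is an assumed location; HDFS_HIVE_SITE is an example destination.
LOCAL_HIVE_SITE=${LOCAL_HIVE_SITE:-/etc/hive/conf/hive-site.xml}
HDFS_HIVE_SITE=${HDFS_HIVE_SITE:-/user/test/hive-site.xml}

if command -v hadoop >/dev/null 2>&1 && [ -f "$LOCAL_HIVE_SITE" ]; then
  # Copy the file, then list it to confirm it landed where expected.
  hadoop fs -put "$LOCAL_HIVE_SITE" "$HDFS_HIVE_SITE"
  hadoop fs -ls "$HDFS_HIVE_SITE"
else
  echo "skipped: hadoop CLI or $LOCAL_HIVE_SITE not found" >&2
fi
```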
Then submit the script by pressing CTRL + ENTER! The result (47963.62637362637) will appear at the end of the log output.
Notice that we don’t need to redefine the schema as it is automatically picked up by the loader. If you use the Oozie App, you can now freely use HCatalog in your Pig actions.
If you are getting this error, it means that your metastore belongs to the Hive user and is not remote.
Cannot get a connection, pool error Could not create a validated object, cause: A read-only user or a user in a read-only database is not permitted to disable read-only mode on a connection.
2013-07-24 23:20:04,969 [main] INFO DataNucleus.Persistence - DataNucleus Persistence Factory initialised for datastore URL="jdbc:derby:;databaseName=/var/lib/hive/metastore/metastore_db;create=true" driver="org.apache.derby.jdbc.EmbeddedDriver" userName="APP"
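A workaround is to remove the Derby lock files and open up the permissions of the metastore directory:
sudo rm /var/lib/hive/metastore/metastore_db/*lck
sudo chmod 777 -R /var/lib/hive/metastore/metastore_db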
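To tell whether a given hive-site.xml points at a remote metastore, one rough test is whether it sets a non-empty hive.metastore.uris property. The helper below is a grep-based heuristic rather than a real XML parser: it assumes the `<name>` and `<value>` elements sit on adjacent lines, and both the function name and the default file path are made up for this sketch.

```shell
#!/bin/sh
# Rough check: does this hive-site.xml configure a remote metastore?
# Heuristic only: assumes <name> and <value> are on adjacent lines.

# $1: path to a hive-site.xml; prints the configured thrift URI(s), if any
metastore_uris() {
  grep -A 2 '<name>hive.metastore.uris</name>' "$1" \
    | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

# Assumed default location; override with HIVE_SITE=/path/to/hive-site.xml
HIVE_SITE=${HIVE_SITE:-/etc/hive/conf/hive-site.xml}
if [ -f "$HIVE_SITE" ]; then
  metastore_uris "$HIVE_SITE"
fi
```

An empty result suggests the metastore runs in embedded mode, which is the situation that produces the Derby error above.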
Similarly to HCatLoader, use HCatStorer to write data back to the table, e.g.:
STORE alias INTO 'sample_07' USING org.apache.hcatalog.pig.HCatStorer();
We saw that Hue makes the Hive Metastore easy to access and supports the HCatalog connectors for Pig. Hue 3.0 will simplify it even more by automatically copying the required jar files and making the table names auto-completable!
As usual, we welcome any feedback on the user group!