Tutorial: Executing Hive or Impala Queries with Python

This post talks about Hue, a UI for making Apache Hadoop easier to use.

Hue uses a various set of interfaces for communicating with the Hadoop components. This post describes how Hue is implementing the Apache HiveServer2 Thrift API for executing Hive queries and listing tables. The same interface can also be used for talking to Cloudera Impala.

 

Hive

This code requires Hue 2.3 or later in order to successfully work. We are using the Beeswax examples as data. They can be installed in the second step of the Hue Quick Start wizard. Obviously, Hive Server 2 needs to be running.

First we log in in the Hue shell. HUE_HOME is the path where was Hue installed:

$HUE_HOME/build/env/bin/hue shell

Create a user under which the queries are going to be sent:

from beeswax.server import dbms
from django.contrib.auth.models import User
hue, created = User.objects.get_or_create(username='hue')

1. List the tables of the default database:

db = dbms.get(hue)
db.get_tables()
>
['sample_07', 'sample_08']

2. Execute a statement.

Here we are doing a SELECT in order to calculate the average salaries of the employes but it could be any SQL statements (e.g. CREATE, ALTER, SHOW…):

query = db.execute_statement('select avg(salary) from sample_07')

We then check for the query completion:

db.get_state(query.get_handle())
> EnumValue(<enum.Enum object at 0x29d8b50>, 1, 'running')

db.get_state(query.get_handle())
> EnumValue(<enum.Enum object at 0x29d8b50>, 1, 'running')

db.get_state(query.get_handle())
> EnumValue(<enum.Enum object at 0x29d8b50>, 2, 'available')

Here we fetch the result and show its structure:

result = db.fetch(query.get_handle())
> dir(result)
['__doc__',
'__init__',
'__module__',
'cols',
'columns',
'data_table',
'has_more',
'ready',
'rows',
'start_row']

The result is a generator, this is a way to print its content:

for row in result.rows():
 print row
 
> [47963.62637362637]

Query logs can be retrieved like this:

print db.get_log(query.get_handle())

> 13/04/22 17:32:50 INFO ql.Driver: <PERFLOG method=compile>
13/04/22 17:32:50 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
13/04/22 17:32:50 INFO parse.SemanticAnalyzer: Completed phase 1 of Semantic Analysis
...
13/04/22 17:32:50 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/04/22 17:32:50 WARN conf.Configuration: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
13/04/22 17:32:50 INFO exec.Task: Starting Job = job_201304170951_0028, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201304170951_0028
13/04/22 17:32:50 INFO exec.Task: Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_201304170951_0028
13/04/22 17:32:52 INFO exec.Task: Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
13/04/22 17:32:52 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
13/04/22 17:32:52 INFO exec.Task: 2013-04-22 17:32:52,927 Stage-1 map = 0%,  reduce = 0%
13/04/22 17:32:55 INFO exec.Task: 2013-04-22 17:32:55,937 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.66 sec
13/04/22 17:32:56 INFO exec.Task: 2013-04-22 17:32:56,942 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.66 sec
...

Note

HiveServer2 by default uses ThriftSASL transport. You can run it in non-sasl mode by adding the following to the hive-site.xml:

<property>
  <name>hive.server2.authentication</name>
  <value>NOSASL</value>
</property>

 

Impala

Impala is using the same API as HiveServer2. We reuse the same client by just specifying Impala as the server.

from beeswax.server.dbms import get_query_server_config

impala_config = get_query_server_config(name='impala')
db = dbms.get(hue, impala_config)

We can then perform the same operations as with HiveServer2:

db.get_tables()
query = db.execute_statement('select avg(salary) from sample_07')
...

Note

Impala needs to be configured for the HiveServer2 interface, as detailed in the hue.ini.

 

How it works

Here are the steps done in order to send the queries from Hue:

  1. Grab the HiveServer2 IDL.

  2. Generate the python code with Thrift 0.9. Hue does it with this script regenerate_thrift.sh.

  3. Implement it. This is hive_server2_lib.py.

  4. An extra layer (dbms.py) has been added in order to simplify the use of the raw API and keep compatibility with Beeswax (ancestor of Hive Server 2). This is what we use in the above example. This is the list of all the possible operations.

 

Sum-up

Hue 2.3 supports most of the functionalities of HiveServer2 and Impala interfaces. The full implementation will be available in Hue 2.4 or in the upcoming CDH4.3 release (and is already in Hue master). User sessions are saved in the Hue DB and are transparent in the use of the API.

Hue’s Beeswax and Impala apps are now based on the HiveServer2 interface which offers a more robust alternative than Beeswax. Hue’s implementation can be reused for building new apps or clients. Feel free to post comments or question on the hue group!


This article was originally posted 1 year ago.

Tags: hive impala tutorial sdk


comments powered by Disqus

Blog Archive

Browse archive

Blog Tags

loading...