Hadoop Tutorials II: 3. Schedule Hive queries with Oozie coordinators

In the previous episode we saw how to create a Hive action in an Oozie workflow. These workflows can then be repeated automatically with an Oozie coordinator. This post describes how to schedule Hadoop jobs (e.g. run this job every day at midnight).

Oozie Coordinators

Our goal: compute the 10 coolest restaurants of the day, every day for 1 month:


From episode 2, we now have a workflow ready to be run every day. We create a ‘daily_top’ coordinator and select our previous Hive workflow. Our frequency is daily, and we run it from November 1st 2012 12:00 PM to November 30th 2012 12:00 PM.


The most important part is to recreate a URI that represents the date of the data. Note that there are more efficient ways to do this, but this example is easier to understand.


As our data is already present, we just need to create an output dataset named ‘daily_days’ (which, contrary to an input dataset, won’t check whether the data is available). We pick the URI of the dataset to match the date format used in episode one (e.g. $YEAR-$MONTH-$DAY). These parameters are going to be filled in automatically in our workflow by the coordinator.
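To make the URI template more concrete, here is a minimal Python sketch of how a $YEAR-$MONTH-$DAY pattern could be resolved for one coordinator instance. This is only an illustration, not Oozie's actual code: the function name resolve_uri is hypothetical, and the sketch assumes month and day are zero-padded to two digits.

```python
from datetime import datetime
from string import Template


def resolve_uri(uri_template, nominal_time):
    """Fill $YEAR, $MONTH and $DAY from the nominal time of an instance.

    Hypothetical helper for illustration; values are zero-padded.
    """
    return Template(uri_template).substitute(
        YEAR="{:04d}".format(nominal_time.year),
        MONTH="{:02d}".format(nominal_time.month),
        DAY="{:02d}".format(nominal_time.day),
    )


# The instance scheduled for November 5th 2012 at 12:00 PM
print(resolve_uri("$YEAR-$MONTH-$DAY", datetime(2012, 11, 5, 12, 0)))
# prints 2012-11-05
```

The resolved string is what ends up in the workflow variable for that day.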


We now link our ‘daily_days’ dataset to our workflow variable ‘date’ and save the coordinator.


Notice that Step 5 has the ‘Oozie parameters’ list, which is the equivalent of the coordinator.properties file. The values will appear in the submission pop-up and can be overridden. There are also ‘Workflow properties’ for filling in workflow parameters directly (these can themselves be parameterized by ‘Oozie parameters’, EL functions or constants). We will have more on this in the upcoming Oozie bundle episode.
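The layering of the two lists can be pictured with a small Python sketch. It is a simplified model, not how Hue or Oozie implement it: the function resolve_parameters and the names output_dir and output are hypothetical, and EL functions are not evaluated here.

```python
from string import Template


def resolve_parameters(defaults, overrides, workflow_properties):
    """Model of the two parameter lists (hypothetical helper).

    1. Values typed in the submission pop-up override the saved defaults.
    2. Workflow properties may reference those parameters as ${name}.
    """
    oozie_params = dict(defaults)
    oozie_params.update(overrides)
    resolved = {
        name: Template(value).safe_substitute(oozie_params)
        for name, value in workflow_properties.items()
    }
    return oozie_params, resolved


params, props = resolve_parameters(
    defaults={"output_dir": "/tmp/daily_top"},           # saved with the coordinator
    overrides={"output_dir": "/user/demo/daily_top"},    # typed at submission time
    workflow_properties={"output": "${output_dir}/results"},
)
print(props["output"])
# prints /user/demo/daily_top/results
```

Leaving the overrides empty keeps the saved defaults, which corresponds to submitting the pop-up unchanged.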


Now submit the coordinator and see the 30 instances (one for each day of November) being created and triggering the workflow with the Hive query for the corresponding day. Coordinators can also be stopped and rerun through the UI. Each workflow can be accessed individually by simply clicking on the date instance.
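As a rough mental model, the list of instances looks like the output of the Python sketch below. It only mimics the one-instance-per-day behavior described above: the function daily_instances is hypothetical, and the count of 30 is taken from this post rather than computed from the coordinator's start and end times.

```python
from datetime import date, timedelta


def daily_instances(first_day, count):
    """Return one date string per daily instance (hypothetical helper)."""
    return [(first_day + timedelta(days=i)).isoformat() for i in range(count)]


instances = daily_instances(date(2012, 11, 1), 30)
print(len(instances), instances[0], instances[-1])
# prints 30 2012-11-01 2012-11-30
```

Each of these strings is the value that the ‘date’ workflow variable receives for the matching run.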


Sum-up

With their input and output datasets, coordinators are great for scheduling repetitive workflows in a few clicks. Hue offers a UI and wizard that let you avoid any Oozie XML. At some point, Hue will make it even simpler by automating the creation of the workflow and coordinator: HUE-1389.

Next, let’s do fast SQL with Impala!


This article was originally posted 7 months ago.

Tags: hive oozie video tutorial season2

