Hadoop Tutorial: Schedule your Hadoop jobs intuitively with the new Oozie crontab!

Hue is taking advantage of a new way to specify the frequency of a coordinator in Oozie (OOZIE-1306). Here is how to put it into practice:


The crontab syntax requires Oozie 4. To keep using the previous Frequency drop-down from Oozie 3, the feature can be disabled in hue.ini:


[oozie]

 # Use Cron format for defining the frequency of a Coordinator instead of the old frequency number/unit.

 enable_cron_scheduling=false
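The values accepted in the new frequency field follow the familiar five-field crontab format (minute, hour, day of month, month, day of week). A few hedged examples of frequencies you could enter:

```
0 0 * * *      every day at midnight
30 14 * * *    every day at 2:30 PM
0 0 1 * *      at midnight on the first day of each month
```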

As usual feel free to comment on the hue-user list or @gethue!


This article was originally posted 5 months ago.

Tags: oozie tutorial video


Hadoop Tutorial: Submit any Oozie jobs directly from HDFS

With HUE-1476, users can submit Oozie jobs directly from HDFS. Just upload your configuration or browse an existing workspace and select a workflow, coordinator or bundle. A submit button will appear and let you execute the job in one click!

File Browser supports:

  • Parameters from workflow.xml, coordinator.xml, bundle.xml

  • Parameters from job.properties
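For instance, a minimal job.properties that File Browser could pick parameters up from might look like this (the host names and paths are purely illustrative):

```
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
oozie.wf.application.path=${nameNode}/user/hue/workflows/my_workflow
date=2012-12-01
```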

Oozie Dashboard supports:

  • Dynamic progress and log report

  • One click MapReduce log access

  • Stop, Pause, Rerun buttons


Here is the workflow tutorial used in the video demo.


Of course, the Oozie Editor is still recommended if you want to avoid any XML :)


This article was originally posted 8 months ago.

Tags: oozie video tutorial HDFS


Hadoop tutorial: Bundle Oozie coordinators with Hue

Hue provides a great Oozie UI in order to use Oozie without typing any XML. In Tutorial 3, we demonstrated how to use an Oozie coordinator for scheduling a daily top 10 of restaurants. Now let's imagine that we also want to compute a top 10 and a top 100. How can we do this? One solution is to use Oozie bundles.



Workflow and Coordinator updates

Bundles are a way to group coordinators together into a set. This set is easier to manage as a single unit and can be parameterized too.

The first step is to replace 10 by a variable ${n} in our Hive script:

CREATE TABLE top_cool AS
SELECT r.business_id, name, SUM(cool) AS coolness, '${date}' as `date`
FROM review r JOIN business b
ON (r.business_id = b.business_id)
WHERE categories LIKE '%Restaurants%'
AND `date` = '${date}'
GROUP BY r.business_id, name
ORDER BY coolness DESC
LIMIT ${n}

Then, in the workflow, we add a parameter in the Hive action: n=${n}. You can test the workflow by submitting it and providing 10 as the value of n.
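For reference, the parameter ends up as a <param> element of the Hive action in the generated workflow.xml. A rough sketch (the script name and node names are illustrative, and the exact schema version may differ):

```xml
<action name="top-n">
  <hive xmlns="uri:oozie:hive-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <script>top_cool.sql</script>
    <!-- the coordinator or bundle supplies ${n} and ${date} -->
    <param>n=${n}</param>
    <param>date=${date}</param>
  </hive>
  <ok to="end"/>
  <error to="kill"/>
</action>
```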


We now need to tell the coordinator to fill in this value. For testing purposes, going to Step #5 of the editor and adding a 'Workflow property' named 'n' with value '10' would produce the same result as in Tutorial 1. In practice, these properties are mostly used for entering constants and EL functions that directly provide a value to the workflow.


Bundle Editor

Let's create a new bundle named 'daily_tops' with a kick-off date of 20121201. On the left panel, click 'Add' in the Coordinator section. Select our 'daily_top' coordinator and add a property named 'n' with value '10'.


Add the same coordinator again and this time pick '10' for the value of 'n'. Repeat with 'n' set to '100'.
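Conceptually, the resulting bundle is just a list of coordinators, each with its own properties. A simplified sketch of what the generated bundle.xml could look like (app paths are illustrative and the kick-off controls are omitted):

```xml
<bundle-app name="daily_tops" xmlns="uri:oozie:bundle:0.2">
  <coordinator name="daily_top-10">
    <app-path>${nameNode}/user/hue/coordinators/daily_top</app-path>
    <configuration>
      <property><name>n</name><value>10</value></property>
    </configuration>
  </coordinator>
  <coordinator name="daily_top-100">
    <app-path>${nameNode}/user/hue/coordinators/daily_top</app-path>
    <configuration>
      <property><name>n</name><value>100</value></property>
    </configuration>
  </coordinator>
</bundle-app>
```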


Bundle Dashboard

You are now ready to go and submit the bundle! You can follow the overall progress in the Bundle dashboard. Bundles can be stopped, killed and re-run. Clicking on an instantiation will link to the corresponding coordinator which is also linking to its generated workflows.



Sum-up

Of course, more efficient solutions exist than those in our simplified example. In practice, Bundles are great for parameterizing non-date variables like market names (e.g. US, France). Another use case is to group together a series of coordinators in order to make them easier to manage (e.g. start, stop, re-run). Notice that the latest version of Hue, which contains HUE-1546, was used in the video.


Hue comes with a full set of Workflow/Coordinator/Bundle examples, ready to be submitted or copied. Hue can even be used with only its Oozie Dashboard, making it a breeze to manage Oozie in your browser.


Next, we will see how to browse our Yelp data in HBase! As usual feel free to comment on the hue-user list or @gethue!


This article was originally posted 10 months ago.

Tags: oozie video tutorial season2


Hadoop Tutorials II: 3. Schedule Hive queries with Oozie coordinators

In the previous episode we saw how to create a Hive action in an Oozie workflow. These workflows can then be repeated automatically with an Oozie coordinator. This post describes how to schedule Hadoop jobs (e.g. run this job every day at midnight).

Oozie Coordinators

Our goal: compute the 10 coolest restaurants of the day, every day for one month:


From episode 2, we now have a workflow ready to be run every day. We create a 'daily_top' coordinator and select our previous Hive workflow. Our frequency is daily, running from November 1st 2012 12:00 PM to November 30th 2012 12:00 PM.


The most important part is to recreate a URI that represents the date of the data. Notice that there are more efficient ways to do this, but this example is easier to understand.


As our data is already present, we just need to create an output dataset named 'daily_days' (which, contrary to an input dataset, won't check whether the data is available). We pick the URI of the dataset to match the date format of episode one (e.g. ${YEAR}-${MONTH}-${DAY}). These parameters are going to be automatically filled in our workflow by the coordinator.


We now link our ‘daily_days’ dataset to our workflow variable ‘date’ and save the coordinator.
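Put together, the dataset and its link to the workflow variable correspond roughly to this fragment of the generated coordinator.xml (the workflow path is illustrative; note that the URI template is just the date pattern, so the resolved value is a plain date string):

```xml
<coordinator-app name="daily_top" frequency="${coord:days(1)}"
                 start="2012-11-01T12:00Z" end="2012-11-30T12:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
  <datasets>
    <!-- an output dataset: availability of the data is not checked -->
    <dataset name="daily_days" frequency="${coord:days(1)}"
             initial-instance="2012-11-01T12:00Z" timezone="UTC">
      <uri-template>${YEAR}-${MONTH}-${DAY}</uri-template>
    </dataset>
  </datasets>
  <output-events>
    <data-out name="daily_out" dataset="daily_days">
      <instance>${coord:current(0)}</instance>
    </data-out>
  </output-events>
  <action>
    <workflow>
      <app-path>${nameNode}/user/hue/workflows/daily_top</app-path>
      <configuration>
        <!-- fills the 'date' workflow variable with e.g. 2012-11-01 -->
        <property>
          <name>date</name>
          <value>${coord:dataOut('daily_out')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
```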


Notice on Step 5 the 'Oozie parameters' list, which is the equivalent of the coordinator.properties file. The values will appear in the submission pop-up and can be overridden. There are also 'Workflow properties' for filling in workflow parameters directly (which can themselves be parameterized by 'Oozie parameters', EL functions or constants). We will have more on this in the upcoming Oozie bundle episode.


Now submit the coordinator and see the 30 instances (one for each day of November) being created and triggering the workflow with the Hive query for the corresponding day. Coordinators can also be stopped and re-run through the UI. Each workflow can be individually accessed by simply clicking on the date instance.


Sum-up

With their input and output datasets Coordinators are great for scheduling repetitive workflows in a few clicks. Hue offers a UI and wizard that lets you avoid any Oozie XML. At some point, Hue will also make it even simpler by automating the creation of the workflow and coordinator: HUE-1389.

Next, let’s do fast SQL with Impala!


This article was originally posted 11 months ago.

Tags: hive oozie video tutorial season2


Hadoop Tutorials II: 2. Execute Hive queries and schedule them with Oozie

In the previous episode, we saw how to transfer some file data into Apache Hadoop. In order to easily query the data, the next step is to create some Hive tables. This will enable quick interaction with high-level languages like SQL and Pig.

 

We experiment with the SQL queries, then parameterize them and insert them into a workflow in order to run them together in parallel. Including Hive queries in an Oozie workflow is a pretty common use case with recurrent pitfalls as seen on the user group. We can do it with Hue in a few clicks.

Get prepared

First, based on the data of the previous episode we create two tables in the Hive Metastore. We use the Metastore app and its create table wizard. Then, it is time to study the data!


Hive

Goal: we want to get the 10 coolest restaurants for a day.


Let’s open Beeswax Hive Editor and explore the range of dates that we have:

SELECT DISTINCT `date` FROM review ORDER BY `date` DESC;

Notice that you need to use backticks in order to use date as a column name in Hive.


The data is a bit old, so let’s pick 2012-12-01 as our target date. We can join the two tables in order to get the name of the restaurant and its average ‘cool’ score of the day. Submit this parameterized query and enter 2012-12-01 when prompted for the date:


SELECT r.business_id, name, AVG(cool) AS coolness
FROM review r JOIN business b
ON (r.business_id = b.business_id)
WHERE categories LIKE '%Restaurants%'
AND `date` = '$date'
GROUP BY r.business_id, name
ORDER BY coolness DESC
LIMIT 10


We have a good Hive query. Let’s create a result table ‘top_cool’ that will contain the top 10:


CREATE TABLE top_cool AS
SELECT r.business_id, name, SUM(cool) AS coolness, '$date' as `date`
FROM review r JOIN business b
ON (r.business_id = b.business_id)
WHERE categories LIKE '%Restaurants%'
AND `date` = '$date'
GROUP BY r.business_id, name
ORDER BY coolness DESC
LIMIT 10

And later replace 'CREATE TABLE top_cool AS' with 'INSERT INTO TABLE top_cool' in the Hive script, as we want to create the table only the first time:


INSERT INTO TABLE top_cool
SELECT r.business_id, name, SUM(cool) AS coolness, '${date}' as `date`
FROM review r JOIN business b
ON (r.business_id = b.business_id)
WHERE categories LIKE '%Restaurants%'
AND `date` = '${date}'
GROUP BY r.business_id, name
ORDER BY coolness DESC
LIMIT 10



Hive action in Apache Oozie

The video also starts here.

First we create a new workflow and add a Hive action. We need to specify which SQL script we want to run. The script needs to be uploaded to HDFS. In our case, we open up the 'workspace' of the workflow, create a new file and copy-paste the query. We then upload it and pick the query file as the 'Script name'.


Important

Then comes an important step: our Hive action needs to talk to the Hive Metastore and so needs to know its location. This is done by including /etc/hive/conf/hive-site.xml as a 'File' resource and telling Oozie to use it as the 'Job XML' configuration.


Note: when using a demo VM or a pseudo distributed cluster (everything on one machine), you might hit the error explained in the ‘Warning!’ section of the HCatalog post.


Note: when using a real cluster, as the workflow is going to run somewhere in the cluster, we need the Metastore to be remote. A remote Metastore can be contacted from any other host.


Let's specify that we are using a 'date' parameter in the Hive script. In our case we add the parameter in the Hive action:

date=${date}
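Putting the pieces together, the generated Hive action would look roughly like this sketch (the script name and node names are illustrative, and the exact schema version may differ):

```xml
<action name="hive-top-cool">
  <hive xmlns="uri:oozie:hive-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <!-- tells Hive where the Metastore is -->
    <job-xml>hive-site.xml</job-xml>
    <script>top_cool.sql</script>
    <param>date=${date}</param>
    <!-- ships hive-site.xml to the task as a 'File' resource -->
    <file>hive-site.xml#hive-site.xml</file>
  </hive>
  <ok to="end"/>
  <error to="kill"/>
</action>
```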


Then we save the workflow, fill in the date when prompted and look at the dynamic progress of the workflow! The output of the query will appear when you click on the 'View the logs' button on the action graph. In practice, INSERT or LOAD DATA would be used instead of a plain SELECT in order to persist the calculation.


You can now monitor the workflow in the dashboard and stop or rerun it.


Note:

If you are seeing this error, it means that the input file or destination directory of the table is not writable by your user, or by the 'hive' user if you are using HiveServer2:


 Failed with exception copyFiles: error while moving files!!!
 FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
 


Sum-up

Hive queries can be simply tested in the Beeswax Hive Editor before being inserted into an Oozie workflow, all without touching the command line.


One of the Hue 3 goals is to remove the duplication of the Hive script on HDFS and the manual creation of the Hive action. With the new document model, one will be able to refer to the Hive query saved in Beeswax and create the action with just a click.


Creating a workflow lets you group other scripts together and run them atomically. Another advantage is to then execute the workflow repetitively (e.g. run a query every day at midnight) with an Oozie coordinator.

This is what we will cover in the next episode!


This article was originally posted 11 months ago.

Tags: tutorial video hive oozie metastore season2


Tutorial: A new UI for Oozie

Apache Oozie is a great tool for building workflows of Hadoop jobs and scheduling them repeatedly. However, the user experience could be improved. In particular, all the job management happens on the command line, and the default UI is read-only and requires a non-Apache-licensed JavaScript library that makes it even more difficult to use.


[Image: Current Oozie UI]


[Image: New Oozie UI]

 

Here is a short video demo:

 

The UI just sits on top of Oozie like the current Oozie UI. You can download a release here.

The README is available online as well as the source code on github and details how to install and start the UI.


Feature list

  • Workflows, Coordinators, Bundles dashboards
  • Built with standard and current Web technologies
  • Filtering, sorting, progress bars, XML highlighting
  • Kill, suspend, and re-run jobs from the UI
  • One click access to Oozie logs or MapReduce launcher logs
  • One click access to the HDFS outputs of the jobs
  • Spotlight-style search of Oozie instrumentation/configuration

 

We hope that you will give this new standalone UI a try. In the next version, we will look at providing packages for a quicker install. As a side note, Oozie users who would like to try a Workflow/Coordinator/Bundle editor could have a look at the Hue Oozie app.


As usual, we are welcoming any feedback!


This article was originally posted 1 year ago.

Tags: oozie video tutorial release


What’s New in Hue 2.3

We’re very happy to announce the 2.3 release of Hue, the open source Web UI that makes Apache Hadoop easier to use.

Hue 2.3 comes only two months after 2.2 but contains more than 100 improvements and fixes. In particular, two new apps were added (including an Apache Pig editor) and the query editors are now easier to use.

Here’s a video demoing the major changes:

Here’s the new features list:

  • Pig Editor: new application for editing and running Apache Pig scripts with UDFs and parameters
  • Table Browser: new application for managing Apache Hive databases, viewing table schemas and samples of content
  • Apache Oozie Bundles are now supported
  • SQL highlighting and auto-completion for Hive/Impala apps
  • Multi-query and highlight/run a portion of a query
  • Job Designer was totally restyled and now supports all Oozie actions
  • Oracle databases (11.2 and later) are now supported

We would like to thank everybody who worked on this release. New features and feedback are continuously being integrated!


This article was originally posted 1 year ago.

Tags: hive pig oozie video release


Hue 2.1.0 - Oct 2nd, 2012

Hue now provides an Apache Oozie application for creating workflows of Apache MapReduce, Apache Pig, Apache Hive, Apache Sqoop, Java, Shell, Ssh and Streaming jobs and scheduling them repetitively. Hue is now available in German, Spanish, French, Japanese, Korean, Portuguese, Brazilian and simplified Chinese.


This article was originally posted 1 year ago.

Tags: oozie pig hive sqoop release


What’s new in Hue 2.1

Hue is a Web-based interface that makes it easier to use Apache Hadoop. Hue 2.1 (included in CDH4.1) provides a new application on top of Apache Oozie (a workflow scheduler system for Apache Hadoop) for creating workflows and scheduling them repetitively. For example, Hue makes it easy to group a set of MapReduce jobs and Hive scripts and run them every day of the week.

In this post, we’re going to focus on the Workflow component of the new application.

Workflow Editor

Workflows consist of one or multiple actions that can be executed sequentially or in parallel. Each action will run a program that can be configured with parameters (e.g. output=${OUTPUT} instead of hardcoding a directory path) in order to be easily reusable.
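For example, rather than hardcoding the output directory in a MapReduce action, its configuration can reference a parameter that is prompted at submission time (a sketch; the property name depends on your job):

```xml
<property>
  <name>mapred.output.dir</name>
  <value>${OUTPUT}</value>
</property>
```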

The current types of programs are:

  • MapReduce
  • Pig
  • Hive
  • Sqoop
  • Java
  • Shell
  • Ssh
  • Streaming jobs
  • DistCp

The application comes with a set of examples:


Workflows can be shared with other users and cloned. Forks are supported and enable actions to run at the same time. The Workflow Editor lets you compose your workflow.

Let’s take the Sequential Java (aka TeraSort) example and add a Hive action, HiveGen, that will generate some random data. TeraGen is a MapReduce job doing the same thing, and both actions will run in parallel. Finally, the TeraSort action will read both outputs and sort them together. You can see how this would look in Hue via the screenshot below.

Workflow Dashboard

Our TeraSort workflow can then be submitted and controlled in the Dashboard. Parameter values (e.g. ${OUTPUT}, the output path of the TeraSort action) are prompted when clicking on the submit button.

Jobs can be filtered/killed/restarted and detailed information (progress, logs) is available within the application and in the Job Browser Application.

Individual management of a workflow can be done on its specific page. We can see the active actions in orange below:

Summary

Before CDH4.1, Oozie users had to deal with XML files and command line programs. Now, this new application allows users to build, monitor and control their workflows within a single Web application. Moreover, the Hue File Browser (for listing and uploading workflows) and Job Browser (for accessing fine grained details of the jobs) are leveraged.

The next version of the Oozie application will focus on improving the general experience, increasing the number of supported Oozie workflows and prettifying the Editor.

In the meantime, feel free to report feedback and wishes to hue-user!


This article was originally posted 1 year ago.

Tags: oozie


Hue 2.0

Hue 2.0.1 has just been released. 2.0.1 represents a major improvement over the Hue 1.x series. To list a few key new features:

  • Frontend has been re-implemented as full screen pages.
  • Hue supports LDAP (OpenLDAP and Active Directory). Hue can be configured to authenticate against LDAP. Additionally, Hue can import users and groups from LDAP, and refresh group membership from LDAP.
  • Hue supports per-application authorization. Administrators can grant or limit group access to applications.
  • Hue has a new Shell application, which allows access to the HBase shell, Pig shell, and more.
  • The Job Designer now submits jobs through Oozie, which is more secure.
Please see the release notes for a complete reference.
 
A New Frontend

In particular, I am really excited about the new frontend. The Hue 1.x frontend renders application UIs via JavaScript in desktop-like windows, which coexist in a single browser window. This desktop-like frontend turned out to be hard to maintain, as well as inconvenient and inflexible for third-party application developers. In Hue 2.0, each application gets its own browser window or tab:

 Applications in multiple tabs

For end users, this means that every page view has its own URL and can be bookmarked. Users also have better control of the windowing behaviours (maximize, minimize, alt-tab) and browsing history. And for enterprise users, Hue 2.0 works on Internet Explorer, which is plagued by memory reclamation issues with Hue 1.x.

For third party application developers, this greatly reduces the complexity of writing an application frontend. Developers also have full control of the rendered HTML, and can therefore employ their favourite JavaScript and CSS libraries (jQuery, Bootstrap, Knockout.js, Highcharts, etc.). Hue 2.0 itself uses jQuery and Bootstrap extensively, which has sped up our own frontend development cycles.

Compatibility

Applications written for Hue 1.x are not compatible with Hue 2.0. Fortunately, the transition is straightforward and is documented in the SDK guide. For example, Hue 1.x provides an “HtmlTable” widget that supports banding, column sorting and more. In Hue 2.0, the same functionality is provided by DataTables.

Hue 2.0.1 is compatible with (and included in) CDH4.

  • It works with HA NameNode, since it communicates with HDFS via the HttpFS REST API.
  • It can submit jobs to YARN, since job submission is executed via Oozie. But it cannot browse any YARN jobs.
  • It supports Hive 0.8.1.

Acknowledgement

This release is possible thanks to the contributions from the team. Your feedback is greatly appreciated. Drop us a note in our user list.


This article was originally posted 2 years ago.

Tags: hdfs pig yarn hive oozie filebrowser

