Negotiators versus workflow managers
It is important to differentiate between a resource scheduler (often referred to as a ‘negotiator’) and a workflow manager. Negotiators are core components of Hadoop. The scheduler launches processes on different nodes, allocating resources based on application requirements and cluster capacity. Hadoop’s Yet Another Resource Negotiator (YARN) operates transparently to the user, and you generally do not have to deal with it directly.

Workflow managers, meanwhile, coordinate complex Hadoop tasks, for example multiple jobs that run sequentially, in parallel or in response to event triggers. A job can be many things, such as running an individual Java application, accessing the Hadoop file system or other data stores, or running various Hadoop applications.

The differences don’t end there. Hadoop workflow managers also differ from one another in programming model and language, code complexity, property and parameter description format, supported applications, scalability, documentation and support.

Open source workflow tools: What’s available?

Apache Oozie
When a group of Yahoo! engineers met around a table in Bangalore, India, to find a way to perform more complex, multistage Hadoop processing, the result was the Oozie framework. This open source project, based on Java technology, simplifies the creation of workflows and manages coordination among jobs. Apache Oozie (as it subsequently became known) enables developers to combine multiple jobs sequentially into one logical unit of work.

Oozie has three main advantages. First, it is fully integrated with the Apache Hadoop stack and supports Hadoop jobs for Apache MapReduce, Pig, Hive, and Sqoop. Second, it can be used to schedule system-specific jobs, such as Java programs. Third, Hadoop administrators can build complex data transformations that combine the processing of individual tasks and even sub-workflows. The result? More control over complex jobs and the ability to repeat those jobs whenever needed.

Azkaban
Azkaban is an open source workflow engine aimed at the Hadoop ecosystem. Developed by LinkedIn and written in Java, Azkaban resolves execution order through job dependencies and provides an easy-to-use web user interface to maintain and track Big Data workflows.

Azkaban and Oozie have several features in common. Both are open source workflow engines for Hadoop job scheduling, and both are written in Java. There, however, the similarities end. Azkaban is simple to use, with easy-to-define workflow schedules, whereas defining workflows in Oozie is more complex. Azkaban supports only time-based scheduling, while Oozie supports both time-based and input-data-based scheduling. Azkaban also keeps the state of all running workflows in memory, whereas Oozie holds a workflow's state in memory only while performing a state transition.

Airflow
The accommodation rental service Airbnb recently open-sourced Airflow, its own data workflow management framework, under the Apache license. Airflow is used internally at Airbnb to build, monitor, and adjust data pipelines. The platform is written in Python, as are the workflows that run on it.

Airflow enables workflow developers to author, maintain, and run workflows on a periodic schedule. The platform interacts with Hive, Presto, MySQL, HDFS, Postgres and S3, and it provides hooks to make the system more extensible. Airflow offers a command line interface as well as a web-based user interface that allows users to visualize pipeline dependencies, monitor progress and trigger tasks.

How does Airflow differ from Oozie or Azkaban? Airflow pipelines are defined as code, rather than in a markup or configuration format as in Oozie (XML workflow definitions) or Azkaban (job property files). Moreover, tasks are instantiated dynamically, as opposed to creating tasks by deriving classes in Luigi (see below). As a result, Airflow is well suited to situations where pipelines are generated dynamically from configuration files or metadata of any form.
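
To make the last point concrete, here is a minimal sketch in plain Python of a pipeline whose tasks are created dynamically from configuration and ordered by their dependencies. It deliberately does not use Airflow's API; it only illustrates the idea with the standard library. The names `PIPELINE_CONFIG`, `build_tasks`, `execution_order` and `run_pipeline`, as well as the task names, are hypothetical and chosen for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline description. In practice this could be loaded
# from a configuration file or from metadata held in a database.
PIPELINE_CONFIG = {
    "extract_orders": {"depends_on": []},
    "extract_users": {"depends_on": []},
    "join_tables": {"depends_on": ["extract_orders", "extract_users"]},
    "build_report": {"depends_on": ["join_tables"]},
}


def build_tasks(config):
    """Create one callable per config entry, without hand-writing a class per task."""
    def make_task(name):
        def task(log):
            # Stand-in for real work (a Hive query, a file copy, ...).
            log.append(name)
        return task

    return {name: make_task(name) for name in config}


def execution_order(config):
    """Return task names ordered so that every dependency runs before its dependents."""
    graph = {name: set(spec["depends_on"]) for name, spec in config.items()}
    # static_order() raises graphlib.CycleError if the dependencies are circular.
    return list(TopologicalSorter(graph).static_order())


def run_pipeline(config):
    """Build the tasks from config, run them in dependency order, return the run log."""
    tasks = build_tasks(config)
    log = []
    for name in execution_order(config):
        tasks[name](log)
    return log


if __name__ == "__main__":
    print(run_pipeline(PIPELINE_CONFIG))
```

Adding a step to this pipeline means adding an entry to the configuration rather than writing a new class or editing a workflow definition by hand, which is the property that makes code-defined pipelines convenient when the set of tasks is not known in advance.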