
Welcome to the AzkabanWorkflow wiki!

The insight provided by Big Data has become a prerequisite for companies to remain competitive. As an open source software project, Hadoop has grown from a small distribution into an integral part of a company's IT ecosystem. The speed at which a company can generate business insights from Hadoop determines how current its data is when important business decisions are made. The challenge is how to integrate new Big Data applications and processes into existing IT processes, enabling data scientists and analysts without causing major disruption to business-as-usual operations.

This book explores why organizations need a workflow engine to support their Big Data Hadoop environment. It highlights the main open source workflow solutions available today and exposes their limitations, then demonstrates the importance of a workflow engine for Hadoop and Big Data processing. It introduces native integration that simplifies and accelerates the delivery of enterprise Hadoop applications and insight, tying Hadoop into larger business processes and exposing it to the broader business user rather than just the data scientist or data engineer.

Why a workflow engine is an imperative for Hadoop

If you're operating in the Big Data space, you have an abundance of technologies to consider, which can make standing up an environment seem complex. In reality, though, the basic principles of a smaller, standard environment still apply. You still need to integrate traditional data from your relational data structures into Big Data systems, and conversely you need data from those Big Data systems to flow back into the traditional environment to produce reports. Big Data workflows therefore typically combine Big Data technologies and legacy applications in a single business process.

So why is a workflow engine an imperative component in a Hadoop development environment? When it comes to data processing, Hadoop developers frequently struggle to process Big Data in its raw format. Multiple, overlapping and, above all, time-consuming pre-processing operations, such as standard extract, transform, load (ETL), need to take place before the actual processing. To remain agile and meet the timely needs of the business for new services and updates, Hadoop developers need to automate this process by organizing these steps into reusable workflows that can be run over and over again to industrialize and accelerate development. Put simply, automation eliminates the need to write new code for each run.
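In Azkaban, a workflow of this kind is typically described as a set of job files whose dependencies define the execution order. The sketch below illustrates the idea with a hypothetical ETL flow: the job names, connection strings and scripts are placeholders, while the `type=command` and `dependencies` properties follow Azkaban's classic job file format.

```
# extract.job -- hypothetical step: pull raw order data from a relational source into HDFS
type=command
command=sqoop import --connect jdbc:mysql://db-host/sales --table orders --target-dir /data/raw/orders

# transform.job -- hypothetical step: clean and reshape the raw data (standard ETL pre-processing)
type=command
command=hive -f clean_orders.hql
dependencies=extract

# report.job -- hypothetical step: export curated results back to the traditional reporting environment
type=command
command=sqoop export --connect jdbc:mysql://db-host/reports --table daily_orders --export-dir /data/curated/orders
dependencies=transform
```

Because the flow is declared once and then scheduled and re-executed by the engine, the same ETL steps can be applied to every new batch of data without writing fresh glue code each time, which is the automation argument made above.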
