Apache Tez is an alternative to traditional MapReduce that lets jobs meet demands for fast response times and extreme throughput at petabyte scale.
Higher-level data processing applications, such as Hive and Pig, need an execution framework that can express their complex query logic efficiently and then execute it with high performance. Tez achieves this goal by modeling data processing not as a single job, but rather as a data flow graph.
... with vertices in the graph representing application logic and edges representing movement of data. A rich dataflow definition API allows users to express complex query logic in an intuitive manner, and it is a natural fit for query plans produced by higher-level declarative applications like Hive and Pig ... [The] dataflow pipeline can be expressed as a single Tez job that will run the entire computation. Expanding this logical graph into a physical graph of tasks and executing it is taken care of by Tez.
The Data Processing API in Apache Tez blog post describes a simple Java API used to express a DAG of data processing. The API has three components:
• DAG. This defines the overall job. The user creates a DAG object for each data processing job.
• Vertex. This defines the user logic and the resources and environment needed to execute it. The user creates a Vertex object for each step of the job and adds it to the DAG.
• Edge. This defines the connection between producer and consumer vertices. The user creates an Edge object and uses it to connect the producer and consumer vertices.
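The three components can be pictured with a small self-contained model. This is only a sketch: `SketchDag`, `SketchVertex`, and `SketchEdge` are hypothetical stand-ins written for illustration, not the classes shipped with Tez, and the step names are invented. It shows one vertex being created per step, edges connecting producers to consumers, and the graph refusing an edge that would make it cyclic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a vertex: one step of the job.
final class SketchVertex {
    final String name;
    SketchVertex(String name) { this.name = name; }
}

// Hypothetical stand-in for an edge: connects a producer vertex to a consumer vertex.
final class SketchEdge {
    final SketchVertex producer;
    final SketchVertex consumer;
    SketchEdge(SketchVertex producer, SketchVertex consumer) {
        this.producer = producer;
        this.consumer = consumer;
    }
}

// Hypothetical stand-in for the DAG: the overall job.
final class SketchDag {
    final String name;
    private final Map<String, SketchVertex> vertices = new HashMap<>();
    private final List<SketchEdge> edges = new ArrayList<>();

    SketchDag(String name) { this.name = name; }

    SketchDag addVertex(SketchVertex v) {
        if (vertices.putIfAbsent(v.name, v) != null) {
            throw new IllegalArgumentException("duplicate vertex: " + v.name);
        }
        return this;
    }

    SketchDag addEdge(SketchEdge e) {
        if (!vertices.containsKey(e.producer.name) || !vertices.containsKey(e.consumer.name)) {
            throw new IllegalArgumentException("both vertices must be added to the DAG first");
        }
        // producer -> consumer closes a cycle if the consumer already reaches the producer.
        if (reaches(e.consumer, e.producer)) {
            throw new IllegalArgumentException("edge would create a cycle");
        }
        edges.add(e);
        return this;
    }

    int vertexCount() { return vertices.size(); }
    int edgeCount() { return edges.size(); }

    // Depth-first search along the edges added so far (vertices are matched by name).
    private boolean reaches(SketchVertex from, SketchVertex to) {
        if (from.name.equals(to.name)) return true;
        for (SketchEdge e : edges) {
            if (e.producer.name.equals(from.name) && reaches(e.consumer, to)) return true;
        }
        return false;
    }
}

// Usage: a two-step job in which "tokenizer" feeds "summation" (invented step names).
final class SketchDagDemo {
    static SketchDag build() {
        SketchVertex tokenizer = new SketchVertex("tokenizer");
        SketchVertex summation = new SketchVertex("summation");
        return new SketchDag("word-count")
                .addVertex(tokenizer)
                .addVertex(summation)
                .addEdge(new SketchEdge(tokenizer, summation));
    }
}
```

Checking for cycles when an edge is added is a design choice of this sketch; it keeps the "acyclic" part of DAG true at every step instead of validating the whole graph at submission time.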
The edge properties defined by Tez enable it to instantiate user tasks, configure their inputs and outputs, schedule them appropriately, and determine how to route data between tasks. Tez also lets you define the parallelism of each vertex execution by specifying user guidance, data size, and resources.
• Data movement. Defines how data is routed between tasks.
◦ One-To-One: data from the ith producer task is routed to the ith consumer task.
◦ Broadcast: data from a producer task is routed to all consumer tasks.
◦ Scatter-Gather: producer tasks scatter data into shards and consumer tasks gather the shards. The ith shard from all producer tasks is routed to the ith consumer task.
• Scheduling. Defines when a consumer task is scheduled.
◦ Sequential: the consumer task may be scheduled after a producer task completes.
◦ Concurrent: the consumer task must be co-scheduled with a producer task.
• Data source. Defines the lifetime and reliability of a task's output.
◦ Persisted: the output remains available after the task exits, but may be lost later.
◦ Persisted-Reliable: the output is stored reliably and will always be available.
◦ Ephemeral: the output is available only while the producer task is running.
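The edge properties can be made concrete with a short sketch. Everything in it is an assumption made for illustration: the enums and the `EdgeProps` and `EdgeRouting` classes are defined locally rather than imported from Tez, their constants are simply named after the property values (one-to-one, broadcast, scatter-gather; sequential, concurrent; persisted, persisted-reliable, ephemeral), and the labels such as `p0/shard1` are invented. The `route` method lists what each consumer task would receive under a given data movement type.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative enums for this sketch; defined here, not imported from the Tez library.
enum DataMovement { ONE_TO_ONE, BROADCAST, SCATTER_GATHER }
enum Scheduling { SEQUENTIAL, CONCURRENT }
enum DataSource { PERSISTED, PERSISTED_RELIABLE, EPHEMERAL }

// Hypothetical bundle of the three properties carried by one edge.
final class EdgeProps {
    final DataMovement movement;
    final Scheduling scheduling;
    final DataSource source;
    EdgeProps(DataMovement movement, Scheduling scheduling, DataSource source) {
        this.movement = movement;
        this.scheduling = scheduling;
        this.source = source;
    }
}

final class EdgeRouting {
    private EdgeRouting() {}

    // Returns, for each consumer task index, the labels of the data it receives.
    // "pN" stands for the whole output of producer task N, "pN/shardM" for its M-th shard.
    static List<List<String>> route(DataMovement movement, int producers, int consumers) {
        if (producers < 1 || consumers < 1) {
            throw new IllegalArgumentException("task counts must be positive");
        }
        if (movement == DataMovement.ONE_TO_ONE && producers != consumers) {
            throw new IllegalArgumentException("one-to-one needs equal task counts");
        }
        List<List<String>> inbox = new ArrayList<>();
        for (int c = 0; c < consumers; c++) {
            List<String> received = new ArrayList<>();
            for (int p = 0; p < producers; p++) {
                switch (movement) {
                    case ONE_TO_ONE:
                        // Only the producer task with the same index feeds this consumer.
                        if (p == c) received.add("p" + p);
                        break;
                    case BROADCAST:
                        // Every producer's full output reaches every consumer.
                        received.add("p" + p);
                        break;
                    case SCATTER_GATHER:
                        // Consumer c gathers shard c from every producer.
                        received.add("p" + p + "/shard" + c);
                        break;
                }
            }
            inbox.add(received);
        }
        return inbox;
    }

    // Ephemeral output exists only while the producer runs; both persisted kinds outlive it.
    static boolean readableAfterProducerExits(DataSource source) {
        return source != DataSource.EPHEMERAL;
    }
}
```

With two producer tasks and two consumer tasks, `EdgeRouting.route(DataMovement.SCATTER_GATHER, 2, 2)` yields `[[p0/shard0, p1/shard0], [p0/shard1, p1/shard1]]`, while `EdgeRouting.route(DataMovement.BROADCAST, 2, 3)` gives each of the three consumers `[p0, p1]`.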
For more information on the Tez architecture, see the Apache Tez Design Doc.