Apache Tez Architecture Explanation

I was trying to understand what makes Apache Tez with Hive so much faster than MapReduce with Hive, but I cannot grasp the concept of the DAG.
Does anyone have a good recommendation for understanding the Tez architecture?

+5
5 answers

A presentation at the Hadoop Summit (slide 35) discussed how the DAG approach is superior to the MapReduce paradigm:

http://www.slideshare.net/Hadoop_Summit/murhty-saha-june26255pmroom212

In essence, it allows higher-level tools (such as Hive and Pig) to define their overall processing steps (the workflow, also called a Directed Acyclic Graph) before the job starts. A DAG is a graph of all the steps needed to complete the task (a Hive query, a Pig job, etc.). Because all stages of the work can be determined before execution time, the system can cache intermediate job results in memory. In MapReduce, by contrast, all intermediate data between the Map and Reduce phases must be written to HDFS (disk), which adds latency.

YARN also allows container reuse for Tez tasks. For instance, each server is divided into several "containers" rather than fixed "map" or "reduce" slots. At any given point in a job's execution, this lets Tez use the entire cluster for the map phases or the reduce phases as needed, whereas in Hadoop v1 (before YARN) the number of map slots (and reduce slots) was fixed/hardcoded at the platform level. Better utilization of all available cluster resources tends to speed up job execution.
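As a hedged illustration of container reuse (not part of the original answer): the sketch below assumes Tez's Java client API and the tez.am.container.reuse.enabled setting; the application name and the omitted DAG are placeholders.

```java
// A minimal sketch, assuming Tez's Java client API (TezConfiguration, TezClient)
// and the "tez.am.container.reuse.enabled" setting; everything else here is
// illustrative only.
import org.apache.tez.client.TezClient;
import org.apache.tez.dag.api.TezConfiguration;

public class ContainerReuseSketch {
  public static void main(String[] args) throws Exception {
    TezConfiguration conf = new TezConfiguration();
    // Let the Tez ApplicationMaster hand a finished container to the next
    // waiting task instead of releasing it back to YARN.
    conf.setBoolean("tez.am.container.reuse.enabled", true);

    TezClient client = TezClient.create("reuse-demo", conf);
    client.start();
    // ... build and submit a DAG here; reused containers skip JVM startup ...
    client.stop();
  }
}
```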

+8

I am not using Tez yet, but I have read about it. I think the two main reasons that make Hive run faster on Tez are:

  • Tez shares data between the map and reduce stages in memory whenever possible, avoiding the overhead of writing to / reading from HDFS
  • With Tez, you can run multiple map/reduce DAGs defined in Hive in a single Tez session without having to launch a new ApplicationMaster each time (as sketched below).
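A minimal sketch of that session reuse, assuming Tez's Java client API (TezClient, DAG, DAGClient); buildDagForQuery() is a hypothetical stand-in for whatever compiles a Hive query into a Tez DAG:

```java
// A hedged sketch of Tez session mode, assuming Tez's Java client API.
// buildDagForQuery() is a hypothetical helper, not a real Tez method.
import org.apache.tez.client.TezClient;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.TezConfiguration;
import org.apache.tez.dag.api.client.DAGClient;

public class TezSessionSketch {
  public static void main(String[] args) throws Exception {
    TezConfiguration conf = new TezConfiguration();
    // Session mode: one ApplicationMaster serves many DAG submissions.
    conf.setBoolean(TezConfiguration.TEZ_AM_SESSION_MODE, true);

    TezClient session = TezClient.create("shared-session", conf);
    session.start();
    try {
      for (String query : new String[] {"q1", "q2", "q3"}) {
        DAG dag = buildDagForQuery(query);          // hypothetical helper
        DAGClient running = session.submitDAG(dag); // no new ApplicationMaster per DAG
        running.waitForCompletion();
      }
    } finally {
      session.stop();
    }
  }

  private static DAG buildDagForQuery(String name) {
    // A real implementation would add the vertices and edges of the query plan.
    return DAG.create(name);
  }
}
```

Hive's Tez integration does roughly this behind hive.execution.engine=tez, keeping a warm session so successive queries skip ApplicationMaster startup.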

You can find a list of links to help you better understand Tez here: http://hortonworks.com/hadoop/tez/

+3

Apache Tez is an alternative to traditional MapReduce that allows jobs to meet demands for fast response times and extreme throughput at petabyte scale.

Higher-level processing applications such as Hive and Pig need an execution framework that can express their complex query logic efficiently and then execute it with high performance; this is what Tez provides. Tez achieves this by modeling data processing not as a single job, but rather as a data flow graph.

... with vertices in the graph representing application logic and edges representing movement of data. A rich dataflow definition API allows users to express complex query logic in an intuitive manner, and it is a natural fit for query plans produced by higher-level declarative applications like Hive and Pig ... [The] dataflow can be expressed as a single Tez job that will run the entire computation. Expanding this logical graph into a physical graph of tasks and executing it is taken care of by Tez.

The Apache Tez blog post on the Data Processing API describes a simple Java API used to express a DAG of data processing. The API has three components:

DAG: this defines the overall job. The user creates a DAG object for each data processing job.

Vertex: this defines the user logic plus the resources and environment needed to execute it. The user creates a Vertex object for each step of the job and adds it to the DAG.

Edge: this defines the connection between producer and consumer vertices. The user creates an Edge object and connects the producer and consumer vertices with it.
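For illustration, a minimal Java sketch of those three components, assuming Tez's org.apache.tez.dag.api classes; the processor class names are hypothetical placeholders, and the Edge is added in a later sketch:

```java
// An illustrative sketch only. The processor class names are hypothetical
// placeholders for user logic, not real Tez classes.
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.Vertex;

public class DagSketch {
  public static DAG buildDag() {
    // One DAG object per data processing job.
    DAG dag = DAG.create("word-count-like-job");

    // One Vertex per step of the job, each wrapping the user logic to run.
    Vertex tokenizer = Vertex.create("tokenizer",
        ProcessorDescriptor.create("com.example.TokenizerProcessor")); // placeholder class
    Vertex summer = Vertex.create("summer",
        ProcessorDescriptor.create("com.example.SumProcessor"));       // placeholder class

    dag.addVertex(tokenizer).addVertex(summer);
    // The Edge connecting the two vertices is sketched after the
    // edge-property descriptions below.
    return dag;
  }
}
```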

The edge properties defined by Tez enable it to instantiate user tasks, configure their inputs and outputs, schedule them appropriately, and determine how to route data between the tasks. Tez also allows the parallelism of each vertex's execution to be determined by user guidance, data size, and available resources.

Data movement: defines the routing of data between tasks.
  ◦ One-to-one: data from the ith producer task routes to the ith consumer task.
  ◦ Broadcast: data from a producer task routes to all consumer tasks.
  ◦ Scatter-gather: producer tasks scatter data into fragments and consumer tasks gather the fragments; the ith fragment from all producer tasks routes to the ith consumer task.

Scheduling: defines when a consumer task is scheduled.
  ◦ Sequential: a consumer task may be scheduled after a producer task completes.
  ◦ Concurrent: a consumer task must be co-scheduled with the producer task.

Data source: defines the lifetime/reliability of a task output.
  ◦ Persisted: the output will be available after the task exits, but may be lost later on.
  ◦ Persisted-Reliable: the output is reliably stored and will always be available.
  ◦ Ephemeral: the output is available only while the producer task is running.
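A hedged sketch of how those properties map onto Tez's EdgeProperty and Edge classes; the descriptor class names are placeholders, and the enum choices shown correspond to a shuffle-style edge (scatter-gather, sequential, persisted):

```java
// Illustrative only: connects a producer and consumer vertex with a
// shuffle-like edge. Descriptor class names are hypothetical placeholders.
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.EdgeProperty;
import org.apache.tez.dag.api.EdgeProperty.DataMovementType;
import org.apache.tez.dag.api.EdgeProperty.DataSourceType;
import org.apache.tez.dag.api.EdgeProperty.SchedulingType;
import org.apache.tez.dag.api.InputDescriptor;
import org.apache.tez.dag.api.OutputDescriptor;
import org.apache.tez.dag.api.Vertex;

public class EdgeSketch {
  public static void connect(DAG dag, Vertex producer, Vertex consumer) {
    EdgeProperty shuffle = EdgeProperty.create(
        DataMovementType.SCATTER_GATHER, // ith fragment of every producer goes to consumer i
        DataSourceType.PERSISTED,        // output outlives the task but may be lost later
        SchedulingType.SEQUENTIAL,       // consumers start after producers complete
        OutputDescriptor.create("com.example.PartitionedOutput"), // placeholder class
        InputDescriptor.create("com.example.ShuffledInput"));     // placeholder class

    dag.addEdge(Edge.create(producer, consumer, shuffle));
  }
}
```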

For more information on the Tez architecture, see the Apache Tez Design Doc.

+3

The main difference between MR and Tez is that MR writes intermediate data to local disk, while in Tez the mapper/reducer functions are executed within a single instance on each container, keeping intermediate data in memory. In addition, Tez performs operations such as transformations and actions, as in Spark.

0

Tez has a DAG (Directed Acyclic Graph) architecture. A typical MapReduce job involves the following steps:

  • Reading data from a file -> first disk access
  • Running the mappers
  • Writing the map output -> second disk access
  • Running shuffle and sort -> reading the map output -> third disk access
  • Writing the shuffled and sorted data for the reducers -> fourth disk access
  • Running the reducers, which read the sorted data -> fifth disk access
  • Writing the reducer output -> sixth disk access

Tez is very similar to Spark (Tez was created by Hortonworks long before Spark):

  • Build the execution plan, but do not read data from disk yet.
  • Once it is ready to perform some calculations (similar to actions in Spark), it gets the data from disk, performs all the steps in one go, and produces the output.

Only one read and one write.

Note the efficiency gained by not going to disk multiple times. Intermediate results are kept in memory (not written to disk). In addition, there is vectorization (processing batches of rows instead of one row at a time). All of this adds up to faster query execution.
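As a small illustration (assuming a HiveServer2 JDBC endpoint; the connection URL, table, and query are placeholders), switching a Hive session to Tez with vectorized execution looks roughly like this:

```java
// Illustrative sketch: the URL, table, and query are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveOnTezSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:hive2://hs2-host:10000/default");
         Statement stmt = conn.createStatement()) {
      stmt.execute("set hive.execution.engine=tez");              // run queries as Tez DAGs
      stmt.execute("set hive.vectorized.execution.enabled=true"); // process rows in batches
      stmt.execute("SELECT category, count(*) FROM sales GROUP BY category"); // placeholder query
    }
  }
}
```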

Links:
http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey
https://community.hortonworks.com/questions/83394/difference-between-mr-and-tez.html

0

Source: https://habr.com/ru/post/1201210/

