What does cpu time mean for a Hadoop job?

I'm afraid I don't understand the timing results of a Map-Reduce job. For example, a job I run gives me the following results in the job tracker.

Finished in: 1mins, 39sec

CPU Time (ms) 150,460 152,030 302,490

The CPU time values (ms) are for Map, Reduce, and Total, respectively. But how is "CPU time" measured, and what does it mean? Is it the total time spent across all the mappers and reducers assigned to the job? Is it possible to measure other times from the framework, such as time spent shuffling, sorting, partitioning, etc.? If so, how?

My second question: I have seen a few posts (Link1, Link2) that suggest using getTime() in the driver class:

long start = new Date().getTime();
boolean status = job.waitForCompletion(true);
long end = new Date().getTime();
System.out.println("Job took " + (end - start) + " milliseconds");

Doesn't that just duplicate what the first entry in the Job Tracker report ("Finished in") already provides? Is it even necessary? What is the best way to time a Hadoop job, especially when I want to account for I/O time and calculate the per-node time for each step?

1 answer

The map phase consists of: record reader, map, combiner, and partitioner.

The reduce phase consists of: shuffle, sort, reduce, and output.

The CPU time that you see is for the entire map phase and the entire reduce phase, not just the map and reduce functions themselves. The terminology is a bit confusing, because the map function and the reduce function are each only one part of the map phase and the reduce phase, respectively. It is also the total CPU time across all nodes in the cluster.

CPU time is very different from wall-clock (real) time. CPU time is how much time was actually spent executing on processors, while wall-clock time is what you and I experience as humans. Think about it: suppose you run the same job on the same data, first on a 20-node cluster and then on a 200-node cluster. Roughly the same total amount of CPU time will be consumed on both clusters, but the 200-node cluster will finish about 10 times sooner in wall-clock time. CPU time is a useful metric when you have a shared system with many jobs running at the same time.
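The 20-node versus 200-node comparison above can be sketched with a little arithmetic, using the total CPU time from the question's own report. The even-division model here is my simplification (it ignores scheduling overhead, stragglers, and I/O waits), but it illustrates why the CPU figure stays constant while wall-clock time shrinks:

```java
public class CpuVsWallClock {
    // Total CPU work for the job, in ms (the Total figure the
    // JobTracker reports). Roughly constant regardless of cluster size.
    static final long TOTAL_CPU_MS = 302_490;

    // Naive wall-clock estimate: CPU work divided evenly across the
    // tasks running in parallel. A deliberate simplification.
    static long estimatedWallClockMs(long totalCpuMs, int parallelNodes) {
        return totalCpuMs / parallelNodes;
    }

    public static void main(String[] args) {
        long on20Nodes  = estimatedWallClockMs(TOTAL_CPU_MS, 20);
        long on200Nodes = estimatedWallClockMs(TOTAL_CPU_MS, 200);
        System.out.println("20 nodes:  ~" + on20Nodes  + " ms wall clock");
        System.out.println("200 nodes: ~" + on200Nodes + " ms wall clock");
        // Same CPU time consumed either way; the 200-node cluster
        // simply finishes about 10x sooner in wall-clock terms.
    }
}
```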

I don't know how you would drill down to get the CPU time spent in each individual phase. Using a date-based timer is probably not what you are looking for.
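That said, the total CPU time (and a few related per-job metrics) can be read programmatically from the job's built-in counters after it finishes. This is only a sketch, assuming the Hadoop 2.x `org.apache.hadoop.mapreduce` API; it does not break anything down per phase, but it shows where the JobTracker's number comes from:

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class JobCpuTime {
    // Call after job.waitForCompletion(true) has returned.
    static void printCpuCounters(Job job) throws Exception {
        // Total CPU time across all map and reduce tasks (ms) --
        // the same figure shown in the JobTracker report.
        long cpuMs = job.getCounters()
                        .findCounter(TaskCounter.CPU_MILLISECONDS)
                        .getValue();
        // Garbage-collection time is another built-in task counter.
        long gcMs = job.getCounters()
                       .findCounter(TaskCounter.GC_TIME_MILLIS)
                       .getValue();
        System.out.println("CPU time (ms): " + cpuMs);
        System.out.println("GC time (ms):  " + gcMs);
    }
}
```

This requires a running Hadoop cluster (or local runner) to execute, so treat it as an API sketch rather than a standalone program.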


Source: https://habr.com/ru/post/921698/

