Hadoop Performance

I installed Hadoop 1.0.0 and tried the word count example on a single-node cluster. It took 2 minutes 48 seconds. Then I ran the standard Linux word count program (wc), which finishes in about 10 milliseconds on the same data set (180 kB). Am I doing something wrong, or is Hadoop really this slow?

time hadoop jar /usr/share/hadoop/hadoop*examples*.jar wordcount someinput someoutput

12/01/29 23:04:41 INFO input.FileInputFormat: Total input paths to process : 30
12/01/29 23:04:41 INFO mapred.JobClient: Running job: job_201201292302_0001
12/01/29 23:04:42 INFO mapred.JobClient:  map 0% reduce 0%
12/01/29 23:05:05 INFO mapred.JobClient:  map 6% reduce 0%
12/01/29 23:05:15 INFO mapred.JobClient:  map 13% reduce 0%
12/01/29 23:05:25 INFO mapred.JobClient:  map 16% reduce 0%
12/01/29 23:05:27 INFO mapred.JobClient:  map 20% reduce 0%
12/01/29 23:05:28 INFO mapred.JobClient:  map 20% reduce 4%
12/01/29 23:05:34 INFO mapred.JobClient:  map 20% reduce 5%
12/01/29 23:05:35 INFO mapred.JobClient:  map 23% reduce 5%
12/01/29 23:05:36 INFO mapred.JobClient:  map 26% reduce 5%
12/01/29 23:05:41 INFO mapred.JobClient:  map 26% reduce 8%
12/01/29 23:05:44 INFO mapred.JobClient:  map 33% reduce 8%
12/01/29 23:05:53 INFO mapred.JobClient:  map 36% reduce 11%
12/01/29 23:05:54 INFO mapred.JobClient:  map 40% reduce 11%
12/01/29 23:05:56 INFO mapred.JobClient:  map 40% reduce 12%
12/01/29 23:06:01 INFO mapred.JobClient:  map 43% reduce 12%
12/01/29 23:06:02 INFO mapred.JobClient:  map 46% reduce 12%
12/01/29 23:06:06 INFO mapred.JobClient:  map 46% reduce 14%
12/01/29 23:06:09 INFO mapred.JobClient:  map 46% reduce 15%
12/01/29 23:06:11 INFO mapred.JobClient:  map 50% reduce 15%
12/01/29 23:06:12 INFO mapred.JobClient:  map 53% reduce 15%
12/01/29 23:06:20 INFO mapred.JobClient:  map 56% reduce 15%
12/01/29 23:06:21 INFO mapred.JobClient:  map 60% reduce 17%
12/01/29 23:06:28 INFO mapred.JobClient:  map 63% reduce 17%
12/01/29 23:06:29 INFO mapred.JobClient:  map 66% reduce 17%
12/01/29 23:06:30 INFO mapred.JobClient:  map 66% reduce 20%
12/01/29 23:06:36 INFO mapred.JobClient:  map 70% reduce 22%
12/01/29 23:06:37 INFO mapred.JobClient:  map 73% reduce 22%
12/01/29 23:06:45 INFO mapred.JobClient:  map 80% reduce 24%
12/01/29 23:06:51 INFO mapred.JobClient:  map 80% reduce 25%
12/01/29 23:06:54 INFO mapred.JobClient:  map 86% reduce 25%
12/01/29 23:06:55 INFO mapred.JobClient:  map 86% reduce 26%
12/01/29 23:07:02 INFO mapred.JobClient:  map 90% reduce 26%
12/01/29 23:07:03 INFO mapred.JobClient:  map 93% reduce 26%
12/01/29 23:07:07 INFO mapred.JobClient:  map 93% reduce 30%
12/01/29 23:07:09 INFO mapred.JobClient:  map 96% reduce 30%
12/01/29 23:07:10 INFO mapred.JobClient:  map 96% reduce 31%
12/01/29 23:07:12 INFO mapred.JobClient:  map 100% reduce 31%
12/01/29 23:07:22 INFO mapred.JobClient:  map 100% reduce 100%
12/01/29 23:07:28 INFO mapred.JobClient: Job complete: job_201201292302_0001
12/01/29 23:07:28 INFO mapred.JobClient: Counters: 29
12/01/29 23:07:28 INFO mapred.JobClient:   Job Counters
12/01/29 23:07:28 INFO mapred.JobClient:     Launched reduce tasks=1
12/01/29 23:07:28 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=275346
12/01/29 23:07:28 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/01/29 23:07:28 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/01/29 23:07:28 INFO mapred.JobClient:     Launched map tasks=30
12/01/29 23:07:28 INFO mapred.JobClient:     Data-local map tasks=30
12/01/29 23:07:28 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=137186
12/01/29 23:07:28 INFO mapred.JobClient:   File Output Format Counters
12/01/29 23:07:28 INFO mapred.JobClient:     Bytes Written=26287
12/01/29 23:07:28 INFO mapred.JobClient:   FileSystemCounters
12/01/29 23:07:28 INFO mapred.JobClient:     FILE_BYTES_READ=71510
12/01/29 23:07:28 INFO mapred.JobClient:     HDFS_BYTES_READ=89916
12/01/29 23:07:28 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=956282
12/01/29 23:07:28 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=26287
12/01/29 23:07:28 INFO mapred.JobClient:   File Input Format Counters
12/01/29 23:07:28 INFO mapred.JobClient:     Bytes Read=85860
12/01/29 23:07:28 INFO mapred.JobClient:   Map-Reduce Framework
12/01/29 23:07:28 INFO mapred.JobClient:     Map output materialized bytes=71684
12/01/29 23:07:28 INFO mapred.JobClient:     Map input records=2574
12/01/29 23:07:28 INFO mapred.JobClient:     Reduce shuffle bytes=71684
12/01/29 23:07:28 INFO mapred.JobClient:     Spilled Records=6696
12/01/29 23:07:28 INFO mapred.JobClient:     Map output bytes=118288
12/01/29 23:07:28 INFO mapred.JobClient:     CPU time spent (ms)=39330
12/01/29 23:07:28 INFO mapred.JobClient:     Total committed heap usage (bytes)=5029167104
12/01/29 23:07:28 INFO mapred.JobClient:     Combine input records=8233
12/01/29 23:07:28 INFO mapred.JobClient:     SPLIT_RAW_BYTES=4056
12/01/29 23:07:28 INFO mapred.JobClient:     Reduce input records=3348
12/01/29 23:07:28 INFO mapred.JobClient:     Reduce input groups=1265
12/01/29 23:07:28 INFO mapred.JobClient:     Combine output records=3348
12/01/29 23:07:28 INFO mapred.JobClient:     Physical memory (bytes) snapshot=4936278016
12/01/29 23:07:28 INFO mapred.JobClient:     Reduce output records=1265
12/01/29 23:07:28 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=26102546432
12/01/29 23:07:28 INFO mapred.JobClient:     Map output records=8233

real    2m48.886s
user    0m3.300s
sys     0m0.304s

time wc someinput/*
  178  1001  8674 someinput/capacity-scheduler.xml
  178  1001  8674 someinput/capacity-scheduler.xml.bak
    7     7   196 someinput/commons-logging.properties
    7     7   196 someinput/commons-logging.properties.bak
   24    35   535 someinput/configuration.xsl
   80   122  1968 someinput/core-site.xml
   80   122  1972 someinput/core-site.xml.bak
    1     0     1 someinput/dfs.exclude
    1     0     1 someinput/dfs.include
   12    36   327 someinput/fair-scheduler.xml
   45   192  2141 someinput/hadoop-env.sh
   45   192  2139 someinput/hadoop-env.sh.bak
   20   137   910 someinput/hadoop-metrics2.properties
   20   137   910 someinput/hadoop-metrics2.properties.bak
  118   582  4653 someinput/hadoop-policy.xml
  118   582  4653 someinput/hadoop-policy.xml.bak
  241   623  6616 someinput/hdfs-site.xml
  241   623  6630 someinput/hdfs-site.xml.bak
  171   417  6177 someinput/log4j.properties
  171   417  6177 someinput/log4j.properties.bak
    1     0     1 someinput/mapred.exclude
    1     0     1 someinput/mapred.include
   12    15   298 someinput/mapred-queue-acls.xml
   12    15   298 someinput/mapred-queue-acls.xml.bak
  338   897  9616 someinput/mapred-site.xml
  338   897  9630 someinput/mapred-site.xml.bak
    1     1    10 someinput/masters
    1     1    18 someinput/slaves
   57    89  1243 someinput/ssl-client.xml.example
   55    85  1195 someinput/ssl-server.xml.example
 2574  8233 85860 total

real    0m0.009s
user    0m0.004s
sys     0m0.000s
+4
8 answers

It depends on a lot of factors, including your configuration, your machine, the memory configuration, JVM settings, and so on. You also need to subtract the JVM startup time.

It runs much faster for me. Of course it will still be slower on a small data set than a specialized C program — think about what it is doing “behind the scenes”.

Try a terabyte of data distributed across several thousand files and see what happens.
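For instance, a rough sketch of generating a bulkier test input (assuming the randomtextwriter example is shipped in the same examples jar, as in stock 1.0.x distributions; by default it writes on the order of gigabytes per node, so make sure you have the disk space):

 # generate a large random text data set in HDFS, then benchmark wordcount on it
 hadoop jar /usr/share/hadoop/hadoop*examples*.jar randomtextwriter random-input
 time hadoop jar /usr/share/hadoop/hadoop*examples*.jar wordcount random-input random-output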

+11

As Dave said, Hadoop is optimized for processing large amounts of data, not toy examples. There is a “waking up the elephant” tax that pays for machinery you simply do not need when working with small data sets. Take a look at Map Reduce Jobs Performance Review for more detail on what is happening.

+6

Your input was small, which is why Hadoop took so long. Job setup in Hadoop is a heavyweight process that involves many steps. If the input were large, you would see Hadoop come out ahead of wc.

+2

In addition to the other answers, there is one more factor:
You have 30 files to process, so 30 map tasks have to be launched. The overhead of a single Hadoop MR task is roughly 1 to 3 seconds. If you combine the data into one file (as sketched below), performance will improve considerably, although you will still pay the per-job overhead.
I think a local native program will always beat Hadoop MR on input this small. Hadoop is built for scalability and fault tolerance, and in many cases performance is sacrificed for that.
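A minimal sketch of merging the input first (the merged file and output names below are made up for illustration):

 # pull the 30 small HDFS files down as one local file
 hadoop fs -getmerge someinput merged.txt
 # upload it as a single file and run wordcount against that
 hadoop fs -put merged.txt someinput-merged
 time hadoop jar /usr/share/hadoop/hadoop*examples*.jar wordcount someinput-merged someoutput-merged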

+1

To improve Hadoop performance:

  • Set the number of mappers and reducers.

    [Looking at the output of your program, I think you used several mappers and reducers. Set them according to your needs; using too many mappers or reducers will not improve performance. See the sketch after this list.]

  • Use a larger data set (terabytes, or at least gigabytes).

    [Hadoop has a default block size of 64 MB.]

  • Configure Hadoop on additional machines and try running a multi-node cluster. This will increase throughput.
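As a minimal sketch of the first point (assuming the examples jar path from the question, and that the wordcount example accepts generic -D options, as the stock 1.0.x example does): note that mapred.map.tasks is only a hint, since the actual number of map tasks follows from the input splits.

 time hadoop jar /usr/share/hadoop/hadoop*examples*.jar wordcount \
     -D mapred.reduce.tasks=2 -D mapred.map.tasks=2 \
     someinput someoutput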

Hadoop is the next big thing.

0

Hadoop usually has overhead compared to native applications you run from the terminal. You will certainly get a better time if you increase the number of mappers to 2, which you should do. If the wordcount example you have does not support tuning the number of mappers and reducers, try this one:

https://github.com/marek5050/Sparkie/tree/master/Java

using

 hadoop jar ./target/wordcount.jar -r 1 -m 4 <input> <output> 

The power of Hadoop lies in distributing work across several nodes to process gigabytes or terabytes of data; as a rule, it will not be more efficient than something your computer can already do on its own in a few minutes.

0

Hmm... there is some confusion here, or let me add some of my own.

Suppose you have a problem that can be solved with, say, O(n) complexity. If you apply Hadoop on, let's assume, K machines, the running time should drop by roughly a factor of K. So by that logic, your Hadoop job should have been faster.

So what goes wrong?

Assume you have a standard Hadoop installation with all the default configuration, and that you are running Hadoop in its default single-node mode.

1) You are running the program on a single node, so you should not expect a runtime any better than a standard local run. (The situation would be different if you used a multi-node cluster.)

The question then becomes: since only one machine is used, shouldn't the running time be about the same?

Answer: no. In Hadoop the data is first read by the record reader, which emits key-value pairs; these are passed to the mapper, which processes them and emits key-value pairs in turn (passing them through a combiner if one is configured); the data is then sorted and shuffled, handed to the reduce phase, and finally written to HDFS. As you can see, there is a lot more machinery involved, and that overhead is why you experience worse performance.

If you want to see what Hadoop can really do, run the same task on a K-node cluster over petabytes of data, and run the single-threaded application alongside it. I promise you will be amazed.

0

Although Hadoop is not intended for small files like this, it can still be tuned to some extent. The input is only 180 kB, yet it is broken into 30 pieces, so 30 mappers are launched. The relevant setting is "dfs.block.size" in hdfs-site.xml: with larger input splits, that many mappers are no longer needed. Hadoop should be configured according to the number of nodes and the size of the input. Keeping "dfs.block.size" at its 64 MB default, and arranging the input so that this word count runs in a single mapper, will greatly improve performance.
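As a minimal sketch of the block-size side of this (assuming a merged.txt prepared as in the earlier answer, and that hadoop fs honors the generic -D option, as FsShell in 1.x normally does; dfs.block.size is given in bytes and can also be set globally in hdfs-site.xml):

 # upload one merged file with an explicit 64 MB block size (67108864 bytes)
 hadoop fs -D dfs.block.size=67108864 -put merged.txt someinput-singleblock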

0

Source: https://habr.com/ru/post/1393657/

