R + Hadoop: How to read CSV file from HDFS and execute mapreduce?

In the following example:

    small.ints <- to.dfs(1:1000)
    mapreduce(
      input = small.ints,
      map   = function(k, v) cbind(v, v^2)
    )

The input to the mapreduce function is the object small.ints, which refers to blocks of data in HDFS.
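For reference, a minimal runnable version of that example might look like the following (assuming rmr2 is installed and the Hadoop environment variables it needs, such as HADOOP_CMD and HADOOP_STREAMING, are already set):

    library(rmr2)

    # Write an in-memory vector to HDFS as a native rmr2 object
    small.ints <- to.dfs(1:1000)

    # Map-only job; each map call receives one chunk of keys/values
    result <- mapreduce(
      input = small.ints,
      map   = function(k, v) cbind(v, v^2)
    )

    # Pull the (small) result back into R memory
    out <- from.dfs(result)
    head(out$val)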

Now I have a CSV file already saved in HDFS as

 "hdfs://172.16.1.58:8020/tmp/test_short.csv" 

How do I get an object that refers to it?

And as far as I know (I could be wrong), if I want to use the data from the CSV file as input to mapreduce, I first have to create a table in R that contains all the values in the CSV file. I have a method like:

    data   <- from.dfs("hdfs://172.16.1.58:8020/tmp/test_short.csv",
                       make.input.format(format = "csv", sep = ","))
    mydata <- data$val

It seems to work to get mydata this way and then call object = to.dfs(mydata), but the problem is that the test_short.csv file is huge, on the order of terabytes, and memory cannot hold the output of from.dfs!

Actually, I am wondering: if I use "hdfs://172.16.1.58:8020/tmp/test_short.csv" directly as the mapreduce input and call from.dfs() inside the map function, can I get the data blocks?

Any advice would be appreciated!

2 answers

    mapreduce(
      input        = path,
      input.format = make.input.format(...),
      map          = ...
    )

from.dfs is for small data. In most cases you would not use from.dfs inside the map function; the map function's arguments already contain its part of the input.
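As a concrete sketch of this answer (the HDFS path and separator are taken from the question; the map body that counts rows per chunk is only a placeholder of my own, not part of the original answer):

    library(rmr2)

    csv.format <- make.input.format(format = "csv", sep = ",")

    result <- mapreduce(
      input        = "hdfs://172.16.1.58:8020/tmp/test_short.csv",
      input.format = csv.format,
      map          = function(k, v) {
        # v is a data frame holding one chunk of the CSV, not the whole file
        keyval(NULL, nrow(v))
      }
    )

    # The job writes its output back to HDFS; from.dfs is only needed
    # if the result is small enough to fit in memory
    from.dfs(result)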


You can do something like the following:

    r.file <- hdfs.file(hdfsFilePath, "r")

    from.dfs(
      mapreduce(
        input        = as.matrix(hdfs.read.text.file(r.file)),
        input.format = "csv",
        map          = ...
      )
    )
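This relies on the rhdfs package in addition to rmr2. A hedged sketch of the prerequisite setup (the jar and binary paths below are assumptions that depend on your installation):

    # Both packages need to find the Hadoop installation
    Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")                                        # assumed location
    Sys.setenv(HADOOP_STREAMING = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar")   # assumed location

    library(rhdfs)
    library(rmr2)

    # rhdfs must be initialized before hdfs.file() / hdfs.read.text.file() calls
    hdfs.init()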

Please upvote if you find this helpful.

Note: see the following Stack Overflow post for more details:

How to enter an HDFS file in R mapreduce for processing and get the result in an HDFS file


Source: https://habr.com/ru/post/951177/

