In the following example:
small.ints = to.dfs(1:1000)
mapreduce(input = small.ints, map = function(k, v) cbind(v, v^2))
The input to the mapreduce function is an object called small.ints, which refers to blocks of data in HDFS.
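Spelled out with the pieces I believe are needed (loading rmr2 and reading the result back with from.dfs), my understanding of that example is:

library(rmr2)

# to.dfs() writes the R object into HDFS and returns a pointer to its blocks
small.ints = to.dfs(1:1000)

# mapreduce() takes that pointer as input; the map function sees key/value chunks
out = mapreduce(input = small.ints, map = function(k, v) cbind(v, v^2))

# from.dfs() pulls the (small) result back into the R session
from.dfs(out)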
Now I have a CSV file already saved in HDFS as
"hdfs://172.16.1.58:8020/tmp/test_short.csv"
How can I get an object for it?
As far as I know (and I could be wrong), if I want to feed the data from the CSV file into mapreduce, I must first create a table in R that contains all the values in the CSV file. I have a method like:
data = from.dfs("hdfs://172.16.1.58:8020/tmp/test_short.csv", make.input.format(format = "csv", sep = ","))
mydata = data$val
It seems fine to use this method to get mydata and then do object = to.dfs(mydata), but the problem is that test_short.csv is huge, on the order of terabytes, and memory cannot hold the output of from.dfs()!
Actually, I am wondering: if I use "hdfs://172.16.1.58:8020/tmp/test_short.csv" directly as the mapreduce input, and do the from.dfs() work inside the map function, can I get the data blocks?
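To make the question concrete, this is roughly the shape I have in mind. It is only a sketch, and I am not sure whether the map function already receives parsed rows when an input.format is given, or whether from.dfs() is still needed somewhere:

# Describe how the CSV should be parsed
csv.format = make.input.format(format = "csv", sep = ",")

# Pass the HDFS path directly as the input, together with the format
result = mapreduce(
  input        = "hdfs://172.16.1.58:8020/tmp/test_short.csv",
  input.format = csv.format,
  map          = function(k, v) {
    # Does v arrive here as a data frame holding one block of CSV rows?
    keyval(k, v)
  }
)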
Any advice would be appreciated!