Get a few rows of data from HDFS

I have 2 GB of data in my HDFS.

Is it possible to get that data randomly, similar to this Unix command line:

cat iris2.csv | head -n 50

Native head:

hadoop fs -cat /your/file | head

is efficient here, as cat will close the stream as soon as head finishes reading all the lines.

To get the tail, Hadoop has a special, efficient command:

hadoop fs -tail /your/file

Unfortunately, it returns the last kilobyte of data, not the specified number of rows.
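
If you do need the last N lines rather than the last kilobyte, one workaround (my addition, not part of the original answer) is to pipe the file through the local tail. Note that, unlike the head trick above, this streams the entire file to the client:

hadoop fs -cat /your/file | tail -n 50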


The head and tail commands on Linux show the first 10 and last 10 lines of a file, respectively. But neither of them returns a random sample.

Linux has a shuffle command, shuf. Hadoop has no equivalent, but we can pipe the two together:

$ hadoop fs -cat <file_path_on_hdfs> | shuf -n <N>

So, say you have a file iris2.csv on HDFS and want a random sample of 50 lines from it:

$ hadoop fs -cat /file_path_on_hdfs/iris2.csv | shuf -n 50

Note: the Linux sort command could also be used here, but shuf is faster and samples the data at random.
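
One caveat: shuf holds its entire input in memory, which can hurt with a 2 GB file. As a lighter-weight sketch (my addition, not part of the original answer), a streaming reservoir sample keeps only 50 lines in memory regardless of file size:

hadoop fs -cat /file_path_on_hdfs/iris2.csv | awk -v n=50 '
  BEGIN { srand() }                  # seed the random generator
  NR <= n { sample[NR] = $0; next }  # fill the reservoir with the first n lines
  {
    i = int(rand() * NR) + 1         # pick a slot in 1..NR
    if (i <= n) sample[i] = $0       # replace it with probability n/NR
  }
  END { for (j = 1; j <= (NR < n ? NR : n); j++) print sample[j] }'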


If your data is in Hive, you can do something like this:

SELECT column1, column2 FROM (
    SELECT iris2.column1, iris2.column2, rand() AS r
    FROM iris2
    ORDER BY r
) t
LIMIT 50;

EDIT: A simpler option:

SELECT iris2.column1, iris2.column2
FROM iris2
ORDER BY rand()
LIMIT 50;
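
Hive also has built-in sampling via TABLESAMPLE, which avoids sorting the whole table the way ORDER BY rand() does. A sketch (my addition, assuming the same iris2 table), run through the hive CLI:

hive -e "
SELECT column1, column2
FROM iris2 TABLESAMPLE(BUCKET 1 OUT OF 10 ON rand()) s
LIMIT 50;"

Bucketing on rand() picks roughly a tenth of the rows at random, and the LIMIT then caps the output at 50.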

sudo -u hdfs hdfs dfs -cat "path of csv file" | head -n 50

Here 50 is the number of rows; it can be adjusted as required.

hdfs dfs -cat yourFile | shuf -n <number_of_lines>

will do the trick for you. Note that shuf is not available on macOS by default; you can install GNU coreutils to get it.
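
For example, with Homebrew (assuming you have it installed), GNU coreutils installs its tools with a g prefix, so shuf becomes gshuf:

brew install coreutils
hdfs dfs -cat yourFile | gshuf -n 50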


Source: https://habr.com/ru/post/1529381/

