Get a few rows of data from HDFS

I have 2 GB of data in my HDFS.

Is it possible to get that data randomly, similar to this Unix command line:

cat iris2.csv | head -n 50

Native head:

hadoop fs -cat /your/file | head

is efficient here, as cat will close the stream as soon as head finishes reading all the lines.

To get the tail, Hadoop has a special, efficient command:

hadoop fs -tail /your/file

Unfortunately, it returns the last kilobyte of data, not the specified number of rows.
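
If you do need the last N lines rather than the last kilobyte, one workaround (my addition, not part of the original answer) is to pipe the file through the local tail. Note that, unlike the head trick above, this streams the entire file to the client:

hadoop fs -cat /your/file | tail -n 50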


The head and tail commands on Linux show the first 10 and last 10 lines of a file, respectively. But neither of them returns a random sample.

Linux has a shuffle command, shuf. Hadoop has no equivalent, but we can pipe the two together:

$ hadoop fs -cat <file_path_on_hdfs> | shuf -n <N>

So, say you have a file iris2.csv on HDFS and want a random sample of 50 lines from it:

$ hadoop fs -cat /file_path_on_hdfs/iris2.csv | shuf -n 50

Note: the Linux sort command could also be used here, but shuf is faster and samples the data at random.
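
One caveat: shuf holds its entire input in memory, which can hurt with a 2 GB file. As a lighter-weight sketch (my addition, not part of the original answer), a streaming reservoir sample keeps only 50 lines in memory regardless of file size:

hadoop fs -cat /file_path_on_hdfs/iris2.csv | awk -v n=50 '
  BEGIN { srand() }                  # seed the random generator
  NR <= n { sample[NR] = $0; next }  # fill the reservoir with the first n lines
  {
    i = int(rand() * NR) + 1         # pick a slot in 1..NR
    if (i <= n) sample[i] = $0       # replace it with probability n/NR
  }
  END { for (j = 1; j <= (NR < n ? NR : n); j++) print sample[j] }'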


If your data is in Hive, you can do something like this:

SELECT column1, column2 FROM (
    SELECT iris2.column1, iris2.column2, rand() AS r
    FROM iris2
    ORDER BY r
) t
LIMIT 50;

EDIT: A simpler option:

SELECT iris2.column1, iris2.column2
FROM iris2
ORDER BY rand()
LIMIT 50;
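
Hive also has built-in sampling via TABLESAMPLE, which avoids sorting the whole table the way ORDER BY rand() does. A sketch (my addition, assuming the same iris2 table), run through the hive CLI:

hive -e "
SELECT column1, column2
FROM iris2 TABLESAMPLE(BUCKET 1 OUT OF 10 ON rand()) s
LIMIT 50;"

Bucketing on rand() picks roughly a tenth of the rows at random, and the LIMIT then caps the output at 50.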

sudo -u hdfs hdfs dfs -cat "path of csv file" | head -n 50

Here 50 is the number of rows; it can be adjusted as required.

hdfs dfs -cat yourFile | shuf -n <number_of_lines>

will do the trick for you. Note that shuf is not available on macOS by default; you can install GNU coreutils to get it.
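
For example, with Homebrew (assuming you have it installed), GNU coreutils installs its tools with a g prefix, so shuf becomes gshuf:

brew install coreutils
hdfs dfs -cat yourFile | gshuf -n 50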


Source: https://habr.com/ru/post/1529381/

