I use Hadoop to process data with Python; which file format should I use?

I have a project with a significant number of text pages.

Each text file has some header information that I need to save during processing; however, I do not want the headers to interfere with clustering algorithms.

I am using Python on Hadoop. Is there an additional package that works better?

How should I format these text files and store them in Hadoop for processing?

+3
2 answers

1) Files

Hadoop Streaming, , .

.

HDFS, . " " .

2)

, , , () , ( ). , .

- ( ), . (, + ) , , . , , MapReduce, : pfffrrrr;)

, , Java- . IF , : map() ( ) . , Java-Jobs:

, JAR-mapper (. ). , - , . - :

  • , -: keyx: filex, metadatax
  • HDFS
  • JAR-mapper, () -
    • . org.apache.hadoop.hdfs.DFSClient
  • match filex, keyx mapper
  • map() keyx
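The list above seems to describe keeping the header metadata in a separate index (keyx: filex, metadatax) and matching each input file to its key inside the mapper. The answer does this in a Java mapper; the following is only a rough Python sketch of the same lookup idea. The tab-separated index layout and the function names are hypothetical, and the environment variable names (`mapreduce_map_input_file`, with `map_input_file` as the older spelling) are an assumption about how Hadoop Streaming exposes the current input file to the task.

```python
import os


def load_index(lines):
    """Build {file name: (key, metadata)} from index lines.

    Assumed (hypothetical) layout per line: key <TAB> file <TAB> metadata
    Lines that do not have three fields are skipped.
    """
    index = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) == 3:
            key, filename, metadata = parts
            index[os.path.basename(filename)] = (key, metadata)
    return index


def key_for_input(index, input_path):
    """Return the key registered for the given input path, or None."""
    entry = index.get(os.path.basename(input_path))
    return entry[0] if entry else None


def current_input_file(environ=os.environ):
    """Best-effort name of the file the current map task is reading.

    Assumption: the variable is named mapreduce_map_input_file
    (or map_input_file on older releases); returns "" if neither is set.
    """
    return environ.get("mapreduce_map_input_file") or environ.get("map_input_file", "")
```

In a mapper, the index would be loaded once at start-up, and each body line would then be emitted under `key_for_input(index, current_input_file())`, so the headers never enter the clustering input but can be joined back by key afterwards.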
+4

Hadoop Streaming, ; sys.stdin, , . (, , ).
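As a minimal sketch of what a Streaming mapper reading from sys.stdin might look like for this question: it assumes a hypothetical one-document-per-line layout of `doc_id <TAB> header <TAB> body`, which is not something the question specifies. The header is carried through untouched, and only the body is tokenized for clustering. The function names are illustrative.

```python
import sys


def map_lines(lines):
    """Yield (doc_id, header, tokens) for each well-formed input line.

    Assumed (hypothetical) layout per line: doc_id <TAB> header <TAB> body
    Blank lines and lines without three fields are skipped.
    """
    for line in lines:
        if not line.strip():
            continue
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) != 3:
            continue
        doc_id, header, body = parts
        # Only the body is tokenized; the header is passed along unchanged.
        tokens = body.lower().split()
        yield doc_id, header, tokens


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Write doc_id <TAB> header <TAB> space-joined tokens to stdout."""
    for doc_id, header, tokens in map_lines(stdin):
        stdout.write("%s\t%s\t%s\n" % (doc_id, header, " ".join(tokens)))
```

To use this as the mapper script, call `main()` under the usual `if __name__ == "__main__":` guard. Streaming treats everything up to the first tab as the key, so downstream steps receive the document id as the key, with the header and the tokens in the value.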

, , , , , - , - .

, , , ​​ . , Streaming, , , mapper - , .

Streaming , . . , , , , , .

, , . , . , .

Jython SWIG, , Streaming.

+1

Source: https://habr.com/ru/post/1730134/
