Using Hadoop to process data with Python: which file format should I use?

Question

I have a project with a significant number of text pages.

Each text file has some header information that I need to save during processing; however, I do not want the headers to interfere with clustering algorithms.

I am using Python on Hadoop (or is there another package that works better?).

How should I format these text files and store them in Hadoop for processing?

python hadoop

lw2010 Jan 27 '10 at 2:21

2 answers

Leonidas · Answer 1 · 2010-01-27T02:52:12+0000

1) Files

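If you use Hadoop Streaming, you have to use line-based text files out of the box; by default, everything up to the first tab in a line is treated as the key.

Just look at the Hadoop Streaming documentation.

You can also put your input files into HDFS, which is recommended for big files. See the "Large Files" section of the same documentation.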

2) Metadata preservation

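The problem I see is that your header information (metadata) will be treated like any other data, so you have to filter it out yourself (first step). Passing it along is the harder part, because the data of all input files is simply joined after the map step.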

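You will have to add the metadata somewhere in the data itself (second step) to be able to relate the two later. You could emit (key, data + metadata) for each data line of a file, and so preserve the metadata for each data line. That could be a huge overhead, but hey, we are talking MapReduce, which means: pfffrrrr ;)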
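A minimal sketch of the two steps, filtering the header out and attaching it to every data line. It assumes header lines start with a `#` marker (a hypothetical convention, not something the question states) and that a mapper sees a file's header before that file's data lines, which only holds when a file is not split across mappers. The names `HEADER_PREFIX`, `tag_with_metadata` and `main` are made up for illustration.

```python
import sys

HEADER_PREFIX = "#"  # hypothetical header marker; adapt to the real files


def tag_with_metadata(lines, header_prefix=HEADER_PREFIX):
    """Yield (metadata, data) pairs; header lines never appear as data."""
    metadata = ""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(header_prefix):
            # Step 1: filter the header out, but remember it.
            metadata = line[len(header_prefix):].strip()
        else:
            # Step 2: attach the remembered header to each data line.
            yield metadata, line


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Streaming-style mapper loop: one tab-separated record per data line."""
    for metadata, data in tag_with_metadata(stdin):
        # A literal tab inside a field would break the record layout.
        fields = [f.replace("\t", " ") for f in (metadata, data)]
        stdout.write("\t".join(fields) + "\n")
```

Calling `main()` at the bottom of the script would turn it into a mapper. Because the metadata is written first, Streaming would treat it as the key by default, which groups the lines of one document together; swap the fields if a different key is wanted.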

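Now comes the part where I do not know how much Streaming really differs from a job implemented in Java. IF Streaming invokes one mapper per file, you can spare yourself the following trouble: just take the first input of map() as the metadata and add it (or a placeholder) to all following emits. If not, the following applies to Java jobs: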

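At least with a JAR mapper you can relate the data to its input file. But you would have to extract the metadata first, because the map function may be invoked on a partition of the file that does not contain the metadata. I would propose something like this: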

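- create a metadata file beforehand, containing a placeholder index: keyx: filex, metadatax
- put this metadata index into HDFS
- use a JAR mapper and load the metadata index file during setup()
- see org.apache.hadoop.hdfs.DFSClient
- match filex and set keyx for this mapper
- add the keyx in use to each data line emitted in map()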
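The index from the first step of the list could be built and read back as sketched below. The answer describes a Java mapper; this sketch is Python only because the question uses Python. The line layout `keyx: filex, metadatax` comes from the answer, while the function names are hypothetical and file names are assumed not to contain `", "`. The lookup relies on Hadoop Streaming exporting the job configuration as environment variables, where the input file usually appears as `mapreduce_map_input_file` or, on older versions, `map_input_file`.

```python
import posixpath


def build_index(headers_by_file):
    """Return index lines of the form 'keyx: filex, metadatax'."""
    lines = []
    for n, (filename, metadata) in enumerate(sorted(headers_by_file.items())):
        lines.append("key%d: %s, %s" % (n, filename, metadata))
    return lines


def parse_index(lines):
    """Map each file name to its (key, metadata) pair."""
    index = {}
    for line in lines:
        key, rest = line.rstrip("\n").split(": ", 1)
        filename, metadata = rest.split(", ", 1)
        index[filename] = (key, metadata)
    return index


def lookup_current_file(index, environ):
    """Find the index entry for the file this mapper is reading, if any."""
    path = (environ.get("mapreduce_map_input_file")
            or environ.get("map_input_file", ""))
    return index.get(posixpath.basename(path))
```

A mapper would call `lookup_current_file(index, os.environ)` once at start-up and then prepend the returned key to every line it emits.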

Karl Anderson · Answer 2 · 2010-01-27T19:00:30+0000

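If you are using Hadoop Streaming, your input can be in any line-based format; your mapper and reducer input comes from sys.stdin, which you can read any way you want. You do not need to use the default tab-delimited fields (although, in my experience, one format should be used across all tasks for consistency where possible).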
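As an illustration of reading a non-default format from sys.stdin, here is a minimal mapper sketch. It assumes input lines of the form `doc_id|text`, a hypothetical layout chosen only for the example, and converts them to tab-separated output; `parse_record` and `map_lines` are made-up names.

```python
import sys


def parse_record(line, sep="|"):
    """Split one 'doc_id|text' line into its two parts."""
    doc_id, _, text = line.rstrip("\n").partition(sep)
    return doc_id.strip(), text.strip()


def map_lines(lines):
    """Yield one 'doc_id<TAB>text' output line per non-empty input line."""
    for line in lines:
        if not line.strip():
            continue
        doc_id, text = parse_record(line)
        yield "%s\t%s" % (doc_id, text)


def main():
    # Streaming feeds the input split to the script on standard input.
    for out in map_lines(sys.stdin):
        print(out)
```

The parsing lives in plain functions, so it can be tested locally with a list of strings before the script is submitted as a Streaming job.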

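However, with the default splitter and partitioner, you cannot control how your input and output are partitioned or sorted, so your mappers and reducers must decide whether a particular line is a header line or a data line using only that line; they will not know the original file boundaries.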

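You may be able to specify a partitioner that lets a mapper assume the first input line is the first line of a file, or even move away from a line-based format. This was hard to do the last time I tried it with Streaming, and in my opinion mapper and reducer tasks should be input-agnostic for efficiency and reusability: it is best to think in terms of a stream of input records rather than keeping track of file boundaries.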

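Another option with Streaming is to ship the header information in a separate file that is included with your job. It will be available to your mappers and reducers in their working directories. One idea would be to associate each line with the appropriate header information in an initial task, perhaps by using three fields per line instead of two, rather than associating them by file.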
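A later task could consume such three-field lines as sketched here. The layout `key<TAB>header<TAB>text` and the names `split_record` and `features_only` are assumptions for illustration; the point is that the header travels with each line but is kept out of whatever is fed to the clustering step.

```python
def split_record(line):
    """Split 'key<TAB>header<TAB>text'; the text may itself contain tabs."""
    key, header, text = line.rstrip("\n").split("\t", 2)
    return key, header, text


def features_only(lines):
    """Yield (key, tokens), deliberately ignoring the header field."""
    for line in lines:
        key, _header, text = split_record(line)
        yield key, text.lower().split()


def headers_by_key(lines):
    """Collect the header stored with each key, for use after clustering."""
    result = {}
    for line in lines:
        key, header, _text = split_record(line)
        result[key] = header
    return result
```

Lines with fewer than three fields raise a `ValueError` here; a real job would probably count and skip them instead.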

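In general, try to treat the input as a stream and do not rely on file boundaries, input size, or order. All of these restrictions can be implemented, but at the cost of complexity. If you do have to implement them, do so at the beginning or end of your task chain.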

If you are using Jython or SWIG, you may have other options, but I found those harder to work with than Streaming.
