I use Hadoop to process data using python, which file format should I use?
I have a project with a significant number of text pages.
Each text file has some header information that I need to save during processing; however, I do not want the headers to interfere with clustering algorithms.
I am using python on Hadoop (or is there an additional package that works better?)
How do I format text files and store these text files in Hadoop for processing?
1) Files
Hadoop Streaming, , .
.
HDFS, . " " .
2)
, , , () , ( ). , .
- ( ), . (, + ) , , . , , MapReduce, : pfffrrrr;)
, , Java- . IF , : map() ( ) . , Java-Jobs:
, JAR-mapper (. ). , - , . - :
Hadoop Streaming, ; sys.stdin, , . (, , ).
, , , , , - , - .
, , , . , Streaming, , , mapper - , .
Streaming , . . , , , , , .
, , . , . , .
Jython SWIG, , Streaming.
Source: https://habr.com/ru/post/1730134/More articles:Force output in Esper - esperStrange behavior when converting a string to a UTF-8 character - javaPaypal and international shipping calculation? - paypalis there a way to have a quarter of the annual intervals in the timeline widget simulator? - javascriptRegex - match alpha characters that don't match a subset of alpha characters (C #) - c #Hadoop is looking for words from one file in another file - mapreduceHow to set up a Django development environment without installing? - djangoTags created dynamically using jQuery disappear after submitting my form - jquerywinform to run and read from a file with a user extension - c #Setting Date and Time NULL with Rose :: DB :: Object and MySQL - mysqlAll Articles