HDFS: using the HDFS API to add to SequenceFile

I am trying to create and maintain a SequenceFile on HDFS using the Java API, without running a MapReduce job, as preparation for future MapReduce work. I want to store all the input data for the MapReduce job in a single SequenceFile, but data arrives throughout the day. The problem is that if the SequenceFile already exists, the next call simply overwrites it instead of appending to it.

    // fs and conf are set up for HDFS, not as a LocalFileSystem
    seqWriter = SequenceFile.createWriter(fs, conf, new Path(hdfsPath),
            keyClass, valueClass, SequenceFile.CompressionType.NONE);
    seqWriter.append(new Text(key), new BytesWritable(value));
    seqWriter.close();

Another constraint is that I cannot keep the data in a file of my own format and convert it into a SequenceFile at the end of the day, because a MapReduce job may be started on this data at any time.

I cannot find any other API call for appending to a SequenceFile that preserves its format. I also can't simply concatenate two SequenceFiles because of their internal format.

I also want to avoid running a MapReduce job for this, since the overhead is too high for the small amount of data I am adding to the SequenceFile.

Any thoughts or workarounds? Thanks.

2 answers

Support for appending to existing SequenceFiles was added in Apache Hadoop 2.6.1 and 2.7.2 via the JIRA enhancement HADOOP-7139: https://issues.apache.org/jira/browse/HADOOP-7139

Example usage can be found in the corresponding test case: https://github.com/apache/hadoop/blob/branch-2.7.2/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/TestSequenceFileAppend.java#L63-L140

CDH5 users have the same capability from CDH version 5.7.1 onwards.
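The append support added by HADOOP-7139 is exposed through the Writer option appendIfExists. A minimal sketch, assuming Hadoop 2.7.2+ on the classpath and that the key/value types match the question (the hdfsPath, key, and value variables are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    Configuration conf = new Configuration();
    // appendIfExists(true) reopens an existing SequenceFile for append
    // instead of overwriting it; a missing file is created as usual.
    SequenceFile.Writer seqWriter = SequenceFile.createWriter(conf,
            SequenceFile.Writer.file(new Path(hdfsPath)),
            SequenceFile.Writer.keyClass(Text.class),
            SequenceFile.Writer.valueClass(BytesWritable.class),
            SequenceFile.Writer.compression(SequenceFile.CompressionType.NONE),
            SequenceFile.Writer.appendIfExists(true));
    seqWriter.append(new Text(key), new BytesWritable(value));
    seqWriter.close();

Note that when appending, the key/value classes and compression settings must match those of the existing file, otherwise the writer throws an exception.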


Sorry, the Hadoop FileSystem does not currently support appends, though support is planned for a future release.
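On versions without append support, a common workaround is to rewrite the file: copy the existing records into a temporary SequenceFile, append the new record, then swap the files. A sketch under those assumptions (fs, conf, hdfsPath, key, and value as in the question; for small daily volumes the rewrite cost is modest):

    Path current = new Path(hdfsPath);
    Path tmp = new Path(hdfsPath + ".tmp");
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, tmp,
            Text.class, BytesWritable.class, SequenceFile.CompressionType.NONE);
    // Copy over any existing records first.
    if (fs.exists(current)) {
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, current, conf);
        Text k = new Text();
        BytesWritable v = new BytesWritable();
        while (reader.next(k, v)) {
            writer.append(k, v);
        }
        reader.close();
    }
    // Append the new record, then atomically replace the old file.
    writer.append(new Text(key), new BytesWritable(value));
    writer.close();
    fs.delete(current, false);
    fs.rename(tmp, current);

This keeps the file in valid SequenceFile format at all times, so a MapReduce job can still be launched on it at any point during the day.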


Source: https://habr.com/ru/post/1347281/
