How can I inspect a Hadoop SequenceFile for which I lack complete schema information?

I have a compressed Hadoop SequenceFile from a client that I would like to inspect. I do not currently have full schema information (which I am working on separately).

But for now (and hoping for a general solution), what are my options for inspecting the file?

I found the forqlift tool: http://www.exmachinatech.net/01/forqlift/

I tried running forqlift file on it. It complains that it cannot load the classes for the custom Writable subclasses, so I will need to track down those implementations.

But is there another option in the meantime? I understand that I most likely can't extract the data, but is there a tool that can at least scan the file and report how many key/value pairs it contains and what their types are?

+6
5 answers

Check out the SequenceFileReadDemo class in the sample code for "Hadoop: The Definitive Guide". Sequence files have their key/value types embedded in them. Use SequenceFile.Reader.getKeyClass() and SequenceFile.Reader.getValueClass() to get the type information.
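For example, a minimal sketch (assuming the Hadoop 1.x style Reader constructor used elsewhere in this thread; the class name SeqFileTypes and the command-line argument are just placeholders) that prints the key/value class names straight from the file header, which works even when the custom Writable classes are not on your classpath:

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.SequenceFile;

 public class SeqFileTypes {
     public static void main(String[] args) throws Exception {
         Configuration conf = new Configuration();
         FileSystem fs = FileSystem.get(conf);
         // args[0] is the path of the sequence file you want to inspect
         SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
         try {
             // The class names come out of the header as plain strings,
             // so the classes themselves do not have to be loadable.
             System.out.println("key class:   " + reader.getKeyClassName());
             System.out.println("value class: " + reader.getValueClassName());
             System.out.println("compressed:  " + reader.isCompressed()
                     + ", block-compressed: " + reader.isBlockCompressed());
         } finally {
             reader.close();
         }
     }
 }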

+10

From the shell:

$ hdfs dfs -text /user/hive/warehouse/table_seq/000000_0 

or directly from the Hive shell (which is much faster for small files, since it runs in an already-started JVM):

 hive> dfs -text /user/hive/warehouse/table_seq/000000_0 

Either one works for sequence files.

+12

My first thought would be to use the Java API for sequence files to try to read them. Even if you don't know which Writable implementation the file uses, you can guess and check the error messages (there may be a better way, I don't know).

For instance:

 private void readSeqFile(Path pathToFile) throws IOException {
     Configuration conf = new Configuration();
     FileSystem fs = FileSystem.get(conf);

     SequenceFile.Reader reader = new SequenceFile.Reader(fs, pathToFile, conf);

     Text key = new Text(); // this could be the wrong type
     Text val = new Text(); // also could be wrong

     while (reader.next(key, val)) {
         System.out.println(key + ":" + val);
     }
 }

This program will fail if those are not the correct types, but the exception it throws should tell you which Writable types the key and value actually are.
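If you would rather not guess at all, and the custom Writable classes are not even on your classpath, a raw read skips deserialization entirely. This is a rough sketch (the method name countRecords is mine; it relies on the standard SequenceFile.Reader raw-read API, the same one the last answer below uses) that reports the record count plus the key/value class names taken from the header:

 // also needs org.apache.hadoop.io.DataOutputBuffer and org.apache.hadoop.io.SequenceFile
 private void countRecords(Path pathToFile) throws IOException {
     Configuration conf = new Configuration();
     FileSystem fs = FileSystem.get(conf);
     SequenceFile.Reader reader = new SequenceFile.Reader(fs, pathToFile, conf);

     // Class names are stored as strings in the header, so no class loading is needed
     System.out.println("key class:   " + reader.getKeyClassName());
     System.out.println("value class: " + reader.getValueClassName());

     DataOutputBuffer rawKey = new DataOutputBuffer();
     SequenceFile.ValueBytes rawValue = reader.createValueBytes();

     long count = 0;
     while (reader.nextRaw(rawKey, rawValue) >= 0) { // nextRaw returns -1 at end of file
         count++;
         rawKey.reset(); // nextRaw appends to the buffer, so clear it each time
     }
     reader.close();

     System.out.println(count + " key/value pairs");
 }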

Edit: Actually, if you run less file.seq, you can usually read part of the header and see which Writable types are used (at least for the key and value). For example, in one file I see:

SEQ^F^Yorg.apache.hadoop.io.Text"org.apache.hadoop.io.BytesWritable

+6

I have just been playing with Dumbo. When you run a Dumbo job on a Hadoop cluster, the output is a sequence file. I used the following to dump an entire Dumbo-generated sequence file as plain text:

 $ bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar \
     -input totals/part-00000 \
     -output unseq \
     -inputformat SequenceFileAsTextInputFormat

 $ bin/hadoop fs -cat unseq/part-00000

I got the idea here.

By the way, Dumbo can also output plain text.

+3

I am not a Java or Hadoop programmer, so my way of solving the problem may not be the best, but anyway, here it is.

I spent two days solving the problem of reading SequenceFiles locally (Linux Debian amd64) without having Hadoop installed.

The sample given in the earlier answer,

 while (reader.next(key, val)) {
     System.out.println(key + ":" + val);
 }

works well for Text, but does not work for compressed BytesWritable input.

What did I do? I downloaded this utility for creating (writing) Hadoop SequenceFiles, https://github.com/shsdev/sequencefile-utility/archive/master.zip, got it working, and then modified it to read Hadoop SequenceFiles as input.

Instructions for running this utility from scratch on Debian:

 sudo apt-get install maven2
 sudo mvn install
 sudo apt-get install openjdk-7-jdk

 # edit "sudo vi /usr/bin/mvn": change `which java`
 # to `which /usr/lib/jvm/java-7-openjdk-amd64/bin/java`

 # Also I've added (probably not required) the following to ~/.bashrc:
 # PATH="/home/mine/perl5/bin${PATH+:}${PATH};/usr/lib/jvm/java-7-openjdk-amd64/"; export PATH;
 # export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/
 # export JAVA_VERSION=1.7

 # Then usage:
 sudo mvn install
 ~/hadoop_tools/sequencefile-utility/sequencefile-utility-master$ /usr/lib/jvm/java-7-openjdk-amd64/bin/java -jar ./target/sequencefile-utility-1.0-jar-with-dependencies.jar

 # -- and this doesn't break the default java 1.6 installation that is required for FireFox/etc.

To solve a compatibility problem with SequenceFiles (e.g. "Unable to load native-hadoop library for your platform... using builtin-java classes where applicable"), I used the libraries from the Hadoop master server as-is (a bit of a hack):

 scp root@10.15.150.223:/usr/lib/libhadoop.so.1.0.0 ~/
 sudo cp ~/libhadoop.so.1.0.0 /usr/lib/
 scp root@10.15.150.223:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/amd64/server/libjvm.so ~/
 sudo cp ~/libjvm.so /usr/lib/
 sudo ln -s /usr/lib/libhadoop.so.1.0.0 /usr/lib/libhadoop.so.1
 sudo ln -s /usr/lib/libhadoop.so.1.0.0 /usr/lib/libhadoop.so

One coffee-fueled night I wrote this code to read Hadoop SequenceFile input files (using this command to run it: /usr/lib/jvm/java-7-openjdk-amd64/bin/java -jar ./target/sequencefile-utility-1.3-jar-with-dependencies.jar -d test/ -c NONE):

 import org.apache.hadoop.io.*;
 import org.apache.hadoop.io.SequenceFile;
 import org.apache.hadoop.io.SequenceFile.ValueBytes;

 import java.io.DataOutputStream;
 import java.io.FileNotFoundException;
 import java.io.FileOutputStream;

 Path file = new Path("/home/mine/mycompany/task13/data/2015-08-30");
 reader = new SequenceFile.Reader(fs, file, conf);
 long pos = reader.getPosition();
 logger.info("GO from pos " + pos);

 DataOutputBuffer rawKey = new DataOutputBuffer();
 ValueBytes rawValue = reader.createValueBytes();

 int DEFAULT_BUFFER_SIZE = 1024 * 1024;
 DataOutputBuffer kobuf = new DataOutputBuffer(DEFAULT_BUFFER_SIZE);
 kobuf.reset();

 int rl;
 do {
     rl = reader.nextRaw(kobuf, rawValue);
     logger.info("read len for current record: " + rl + " and in more details ");
     if (rl >= 0) {
         logger.info("read key " + new String(kobuf.getData()) + " (keylen " + kobuf.getLength() + ") and data " + rawValue.getSize());
         FileOutputStream fos = new FileOutputStream("/home/mine/outb");
         DataOutputStream dos = new DataOutputStream(fos);
         rawValue.writeUncompressedBytes(dos);
         kobuf.reset();
     }
 } while (rl > 0);
I just added this piece of code to the src/main/java/eu/scape_project/tb/lsdr/seqfileutility/SequenceFileWriter.java file, right after the line

    writer = SequenceFile.createWriter(fs, conf, path, keyClass, valueClass, CompressionType.get(pc.getCompressionType()));

Thanks to these sources of information:

If you use hadoop-core instead of mahout's hadoop, you will have to download asm-3.1.jar manually: search.maven.org/remotecontent?filepath=org/ow2/util/asm/asm/3.1/asm-3.1.jar (search.maven.org/#search|ga|1|asm-3.1)

List of available mahout repositories: repo1.maven.org/maven2/org/apache/mahout/ . Introduction to Mahout: mahout.apache.org

A good resource for studying the interfaces and sources of the Hadoop Java classes (I used it to write my own code for reading SequenceFiles): http://grepcode.com/file/repo1.maven.org/maven2/com.ning/metrics.action/0.2.7/org/apache/hadoop/io/BytesWritable.java

The sources of the tb-lsdr-seqfilecreator project that I used to build my own SequenceFile reader: www.javased.com/?source_dir=scape/tb-lsdr-seqfilecreator/src/main/java/eu/scape_project/tb/lsdr/seqfileutility/ProcessParameters.java

stackoverflow.com/questions/5096128/sequence-files-in-hadoop - the same example (reading with next(key, value) did not work)

https://github.com/twitter/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/input/RawSequenceFileRecordReader.java - this helped me (I used reader.nextRaw in much the same way it uses nextKeyValue() and the other subroutines)

Also, I changed ./pom.xml to use the native apache.hadoop instead of mahout.hadoop, but this is probably not required, because the errors for reader.next(key, value) are the same for both. Instead, I had to use reader.nextRaw(keyRaw, valueRaw):

 diff ../../sequencefile-utility/sequencefile-utility-master/pom.xml ./pom.xml
 9c9
 < <version>1.0</version>
 ---
 > <version>1.3</version>
 63c63
 < <version>2.0.1</version>
 ---
 > <version>2.4</version>
 85c85
 < <groupId>org.apache.mahout.hadoop</groupId>
 ---
 > <groupId>org.apache.hadoop</groupId>
 87c87
 < <version>0.20.1</version>
 ---
 > <version>1.1.2</version>
 93c93
 < <version>1.1</version>
 ---
 > <version>1.1.3</version>
+3
