How can I inspect a Hadoop SequenceFile for which I lack complete schema information?

I have a compressed Hadoop SequenceFile from a client that I would like to inspect. I do not currently have full schema information (which I am working on separately).

But for now (and hoping for a general solution), what are my options for inspecting the file?

I found the forqlift tool: http://www.exmachinatech.net/01/forqlift/

I tried running forqlift file on it. It complains that it cannot load the classes for the custom Writable subclasses, so I will need to track down those implementations.

But is there another option in the meantime? I understand that I most likely can't extract the data, but is there a tool that can at least scan the file and report how many key/value pairs it contains and what their types are?

+6
5 answers

Check out the SequenceFileReadDemo class in the sample code for "Hadoop: The Definitive Guide". Sequence files have their key/value types embedded in them. Use SequenceFile.Reader.getKeyClass() and SequenceFile.Reader.getValueClass() to get the type information.
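For example, a minimal sketch (assuming the Hadoop 1.x style Reader constructor used elsewhere in this thread; the class name SeqFileTypes and the command-line argument are just placeholders) that prints the key/value class names straight from the file header, which works even when the custom Writable classes are not on your classpath:

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.SequenceFile;

 public class SeqFileTypes {
     public static void main(String[] args) throws Exception {
         Configuration conf = new Configuration();
         FileSystem fs = FileSystem.get(conf);
         // args[0] is the path of the sequence file you want to inspect
         SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
         try {
             // The class names come out of the header as plain strings,
             // so the classes themselves do not have to be loadable.
             System.out.println("key class:   " + reader.getKeyClassName());
             System.out.println("value class: " + reader.getValueClassName());
             System.out.println("compressed:  " + reader.isCompressed()
                     + ", block-compressed: " + reader.isBlockCompressed());
         } finally {
             reader.close();
         }
     }
 }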

+10

From the shell:

$ hdfs dfs -text /user/hive/warehouse/table_seq/000000_0 

or directly from the Hive shell (which is much faster for small files, since it runs in an already-started JVM):

 hive> dfs -text /user/hive/warehouse/table_seq/000000_0 

Either one works for sequence files.

+12

My first thought would be to use the Java API for sequence files to try to read them. Even if you don't know which Writable implementation the file uses, you can guess and check the error messages (there may be a better way, I don't know).

For instance:

 private void readSeqFile(Path pathToFile) throws IOException {
     Configuration conf = new Configuration();
     FileSystem fs = FileSystem.get(conf);

     SequenceFile.Reader reader = new SequenceFile.Reader(fs, pathToFile, conf);

     Text key = new Text(); // this could be the wrong type
     Text val = new Text(); // also could be wrong

     while (reader.next(key, val)) {
         System.out.println(key + ":" + val);
     }
 }

This program will fail if those are not the correct types, but the exception it throws should tell you which Writable types the key and value actually are.
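If you would rather not guess at all, and the custom Writable classes are not even on your classpath, a raw read skips deserialization entirely. This is a rough sketch (the method name countRecords is mine; it relies on the standard SequenceFile.Reader raw-read API, the same one the last answer below uses) that reports the record count plus the key/value class names taken from the header:

 // also needs org.apache.hadoop.io.DataOutputBuffer and org.apache.hadoop.io.SequenceFile
 private void countRecords(Path pathToFile) throws IOException {
     Configuration conf = new Configuration();
     FileSystem fs = FileSystem.get(conf);
     SequenceFile.Reader reader = new SequenceFile.Reader(fs, pathToFile, conf);

     // Class names are stored as strings in the header, so no class loading is needed
     System.out.println("key class:   " + reader.getKeyClassName());
     System.out.println("value class: " + reader.getValueClassName());

     DataOutputBuffer rawKey = new DataOutputBuffer();
     SequenceFile.ValueBytes rawValue = reader.createValueBytes();

     long count = 0;
     while (reader.nextRaw(rawKey, rawValue) >= 0) { // nextRaw returns -1 at end of file
         count++;
         rawKey.reset(); // nextRaw appends to the buffer, so clear it each time
     }
     reader.close();

     System.out.println(count + " key/value pairs");
 }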

Edit: Actually, if you run less file.seq, you can usually read part of the header and see which Writable types are used (at least for the key and value). For example, in one file I see:

SEQ^F^Yorg.apache.hadoop.io.Text"org.apache.hadoop.io.BytesWritable

+6

I have just been playing with Dumbo. When you run a Dumbo job on a Hadoop cluster, the output is a sequence file. I used the following to dump an entire Dumbo-generated sequence file as plain text:

 $ bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar \
     -input totals/part-00000 \
     -output unseq \
     -inputformat SequenceFileAsTextInputFormat

 $ bin/hadoop fs -cat unseq/part-00000

I got the idea here.

By the way, Dumbo can also output plain text.

+3

I am not a Java or Hadoop programmer, so my way of solving the problem may not be the best, but anyway, here it is.

I spent two days solving the problem of reading SequenceFiles locally (Linux Debian amd64) without having Hadoop installed.

The sample given in the earlier answer,

 while (reader.next(key, val)) {
     System.out.println(key + ":" + val);
 }

works well for Text, but does not work for compressed BytesWritable input.

What did I do? I downloaded this utility for creating (writing) Hadoop SequenceFiles, https://github.com/shsdev/sequencefile-utility/archive/master.zip, got it working, and then modified it to read Hadoop SequenceFiles as input.

Instructions for running this utility from scratch on Debian:

 sudo apt-get install maven2
 sudo mvn install
 sudo apt-get install openjdk-7-jdk

 # edit "sudo vi /usr/bin/mvn": change `which java`
 # to `which /usr/lib/jvm/java-7-openjdk-amd64/bin/java`

 # Also I've added (probably not required) the following to ~/.bashrc:
 # PATH="/home/mine/perl5/bin${PATH+:}${PATH};/usr/lib/jvm/java-7-openjdk-amd64/"; export PATH;
 # export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/
 # export JAVA_VERSION=1.7

 # Then usage:
 sudo mvn install
 ~/hadoop_tools/sequencefile-utility/sequencefile-utility-master$ /usr/lib/jvm/java-7-openjdk-amd64/bin/java -jar ./target/sequencefile-utility-1.0-jar-with-dependencies.jar

 # -- and this doesn't break the default java 1.6 installation that is required for FireFox/etc.

To solve a compatibility problem with SequenceFiles (e.g. "Unable to load native-hadoop library for your platform... using builtin-java classes where applicable"), I used the libraries from the Hadoop master server as-is (a bit of a hack):

 scp root@10.15.150.223:/usr/lib/libhadoop.so.1.0.0 ~/
 sudo cp ~/libhadoop.so.1.0.0 /usr/lib/
 scp root@10.15.150.223:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/amd64/server/libjvm.so ~/
 sudo cp ~/libjvm.so /usr/lib/
 sudo ln -s /usr/lib/libhadoop.so.1.0.0 /usr/lib/libhadoop.so.1
 sudo ln -s /usr/lib/libhadoop.so.1.0.0 /usr/lib/libhadoop.so

One coffee-fueled night I wrote this code to read Hadoop SequenceFile input files (using this command to run it: /usr/lib/jvm/java-7-openjdk-amd64/bin/java -jar ./target/sequencefile-utility-1.3-jar-with-dependencies.jar -d test/ -c NONE):

 import org.apache.hadoop.io.*;
 import org.apache.hadoop.io.SequenceFile;
 import org.apache.hadoop.io.SequenceFile.ValueBytes;

 import java.io.DataOutputStream;
 import java.io.FileNotFoundException;
 import java.io.FileOutputStream;

 Path file = new Path("/home/mine/mycompany/task13/data/2015-08-30");
 reader = new SequenceFile.Reader(fs, file, conf);
 long pos = reader.getPosition();
 logger.info("GO from pos " + pos);

 DataOutputBuffer rawKey = new DataOutputBuffer();
 ValueBytes rawValue = reader.createValueBytes();

 int DEFAULT_BUFFER_SIZE = 1024 * 1024;
 DataOutputBuffer kobuf = new DataOutputBuffer(DEFAULT_BUFFER_SIZE);
 kobuf.reset();

 int rl;
 do {
     rl = reader.nextRaw(kobuf, rawValue);
     logger.info("read len for current record: " + rl + " and in more details ");
     if (rl >= 0) {
         logger.info("read key " + new String(kobuf.getData()) + " (keylen " + kobuf.getLength() + ") and data " + rawValue.getSize());
         FileOutputStream fos = new FileOutputStream("/home/mine/outb");
         DataOutputStream dos = new DataOutputStream(fos);
         rawValue.writeUncompressedBytes(dos);
         kobuf.reset();
     }
 } while (rl > 0);
I just added this piece of code to the src/main/java/eu/scape_project/tb/lsdr/seqfileutility/SequenceFileWriter.java file, right after the line

    writer = SequenceFile.createWriter(fs, conf, path, keyClass, valueClass, CompressionType.get(pc.getCompressionType()));

Thanks to these sources of information:

If you use hadoop-core instead of mahout's hadoop, you will have to download asm-3.1.jar manually: search.maven.org/remotecontent?filepath=org/ow2/util/asm/asm/3.1/asm-3.1.jar (search.maven.org/#search|ga|1|asm-3.1)

List of available mahout repositories: repo1.maven.org/maven2/org/apache/mahout/ . Introduction to Mahout: mahout.apache.org

A good resource for studying the interfaces and sources of the Hadoop Java classes (I used it to write my own code for reading SequenceFiles): http://grepcode.com/file/repo1.maven.org/maven2/com.ning/metrics.action/0.2.7/org/apache/hadoop/io/BytesWritable.java

The sources of the tb-lsdr-seqfilecreator project that I used to build my own SequenceFile reader: www.javased.com/?source_dir=scape/tb-lsdr-seqfilecreator/src/main/java/eu/scape_project/tb/lsdr/seqfileutility/ProcessParameters.java

stackoverflow.com/questions/5096128/sequence-files-in-hadoop - the same example (reading with next(key, value) did not work)

https://github.com/twitter/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/input/RawSequenceFileRecordReader.java - this helped me (I used reader.nextRaw in much the same way it uses nextKeyValue() and the other subroutines)

Also, I changed ./pom.xml to use the native apache.hadoop instead of mahout.hadoop, but this is probably not required, because the errors for reader.next(key, value) are the same for both. Instead, I had to use reader.nextRaw(keyRaw, valueRaw):

 diff ../../sequencefile-utility/sequencefile-utility-master/pom.xml ./pom.xml
 9c9
 < <version>1.0</version>
 ---
 > <version>1.3</version>
 63c63
 < <version>2.0.1</version>
 ---
 > <version>2.4</version>
 85c85
 < <groupId>org.apache.mahout.hadoop</groupId>
 ---
 > <groupId>org.apache.hadoop</groupId>
 87c87
 < <version>0.20.1</version>
 ---
 > <version>1.1.2</version>
 93c93
 < <version>1.1</version>
 ---
 > <version>1.1.3</version>
+3
