How does Apache Avro serialize (large) data structures?

I am looking at using Avro on Hadoop, but I'm concerned about serializing large data structures and about how to add methods to the generated (data) classes.

An example (taken from http://blog.voidsearch.com/bigdata/apache-avro-in-practice/ ) shows a Facebook user model:

    {
      "namespace": "test.avro",
      "name": "FacebookUser",
      "type": "record",
      "fields": [
        {"name": "name", "type": "string"},
        ...,
        {"name": "friends", "type": "array", "items": "FacebookUser"}
      ]
    }

Does Avro serialize a FacebookUser's full social graph in this model?

[That is, if I want to serialize a single user, does serialization include all his friends and their friends, etc.?]

If so, I would prefer to store the friends' identifiers instead of references, and look the friends up in my application when necessary. In that case, I would like to be able to add a method that returns the actual friends instead of their identifiers.

How can I wrap / extend the generated Java classes to add methods?

(This includes methods that return, for example, the friend count.)

3 answers

Regarding the second question: How can I wrap / extend the created Java classes to add methods?

You can use AspectJ to introduce new methods to an existing / generated class. AspectJ is only required at compile time. The approach is shown below.

Define the Person record in Avro IDL (person.avdl):

    @namespace("net.tzolov.avro.extend")
    protocol PersonProtocol {
      record Person {
        string firstName;
        string lastName;
      }
    }

Use Maven and the avro-maven-plugin to generate Java sources from the AVDL file:

    <dependency>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro</artifactId>
      <version>1.6.3</version>
    </dependency>
    ......
    <plugin>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro-maven-plugin</artifactId>
      <version>1.6.3</version>
      <executions>
        <execution>
          <id>generate-avro-sources</id>
          <phase>generate-sources</phase>
          <goals>
            <goal>idl-protocol</goal>
          </goals>
          <configuration>
            <sourceDirectory>src/main/resources/avro</sourceDirectory>
            <outputDirectory>${project.build.directory}/generated-sources/java</outputDirectory>
          </configuration>
        </execution>
      </executions>
    </plugin>

The configuration above assumes that the person.avdl file is located in src/main/resources/avro. Sources are generated into target/generated-sources/java.

The generated Person.java has two getters: getFirstName() and getLastName(). If you want to extend it with another method, getCompleteName() = firstName + " " + lastName, you can introduce that method with the following aspect:

    package net.tzolov.avro.extend;

    import net.tzolov.avro.extend.Person;

    public aspect PersonAspect {

        // Inter-type declaration: adds getCompleteName() to the generated Person class
        public String Person.getCompleteName() {
            return this.getFirstName() + " " + this.getLastName();
        }
    }

Use the aspectj-maven-plugin to weave this aspect into the generated code:

    <dependency>
      <groupId>org.aspectj</groupId>
      <artifactId>aspectjrt</artifactId>
      <version>1.6.12</version>
    </dependency>
    <dependency>
      <groupId>org.aspectj</groupId>
      <artifactId>aspectjweaver</artifactId>
      <version>1.6.12</version>
    </dependency>
    ....
    <plugin>
      <groupId>org.codehaus.mojo</groupId>
      <artifactId>aspectj-maven-plugin</artifactId>
      <version>1.2</version>
      <dependencies>
        <dependency>
          <groupId>org.aspectj</groupId>
          <artifactId>aspectjrt</artifactId>
          <version>1.6.12</version>
        </dependency>
        <dependency>
          <groupId>org.aspectj</groupId>
          <artifactId>aspectjtools</artifactId>
          <version>1.6.12</version>
        </dependency>
      </dependencies>
      <executions>
        <execution>
          <goals>
            <goal>compile</goal>
            <goal>test-compile</goal>
          </goals>
        </execution>
      </executions>
      <configuration>
        <source>6</source>
        <target>6</target>
      </configuration>
    </plugin>

And the result:

    @Test
    public void testPersonCompleteName() throws Exception {
        Person person = Person.newBuilder()
                .setFirstName("John")
                .setLastName("Atanasoff")
                .build();
        Assert.assertEquals("John Atanasoff", person.getCompleteName());
    }

First, I will try to answer the first question:
As far as I understand, Avro is not built to store anything that is not hierarchical. It also has no object identifiers. It can store arrays, records, primitive types, or any combination of them. Following a graph of object references is a capability of Java serialization, not of Avro. Therefore, to store a graph, you must introduce your own object identifiers and store them explicitly in fields. You can see the getSchema method here: http://www.java2s.com/Open-Source/Java/Database-DBMS/hadoop-0.20.1/org/apache/avro/reflect/ReflectData.java.htm It is fairly simple, and it is how Avro derives a schema from a Java class.
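To illustrate the point about explicit identifiers, here is a minimal sketch in plain Java with no Avro dependency. The names `User`, `UserRecord`, and `flatten` are hypothetical and invented for this example; `UserRecord` stands in for what a record with an array of friend ids could hold. The sketch walks a cyclic friend graph once and emits one flat record per user, so every record is tree-shaped and can be serialized on its own.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GraphFlattener {

    /** In-memory user holding direct object references (the graph may contain cycles). */
    static final class User {
        final long id;
        final String name;
        final List<User> friends = new ArrayList<>();

        User(long id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    /** Flat record: friends are stored as ids, never as nested users. */
    static final class UserRecord {
        final long id;
        final String name;
        final List<Long> friendIds;

        UserRecord(long id, String name, List<Long> friendIds) {
            this.id = id;
            this.name = name;
            this.friendIds = friendIds;
        }
    }

    /** Visits each user reachable from start exactly once; returns one record per user, keyed by id. */
    static Map<Long, UserRecord> flatten(User start) {
        Map<Long, UserRecord> records = new LinkedHashMap<>();
        Deque<User> pending = new ArrayDeque<>();
        pending.push(start);
        while (!pending.isEmpty()) {
            User user = pending.pop();
            if (records.containsKey(user.id)) {
                continue; // already emitted, which is what breaks the cycle
            }
            List<Long> friendIds = new ArrayList<>();
            for (User friend : user.friends) {
                friendIds.add(friend.id);
                pending.push(friend);
            }
            records.put(user.id, new UserRecord(user.id, user.name, friendIds));
        }
        return records;
    }

    public static void main(String[] args) {
        User alice = new User(1L, "Alice");
        User bob = new User(2L, "Bob");
        alice.friends.add(bob);
        bob.friends.add(alice); // mutual friendship forms a cycle

        Map<Long, UserRecord> records = flatten(alice);
        System.out.println(records.size() + " records; Alice's friend ids: "
                + records.get(1L).friendIds);
    }
}
```

Serializing a single `UserRecord` then writes only that user's own fields plus a list of ids; loading the friends themselves becomes a separate lookup in the application.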
Regarding the second question: I don't think it is a good idea to modify the generated code. I would suggest creating a class with all the methods / data you want to add, and having it hold the Avro-generated "data" class as a member.
That said, I think that technically extending the generated classes should be fine.
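The wrapper approach suggested above might look like the following sketch. `FacebookUserData` is a hand-written stand-in for an Avro-generated class (a real one would come from the Avro compiler and look different), and `FacebookUser` plus the `Map`-based store are hypothetical names chosen for illustration. The added methods live in the wrapper, so the generated class is never edited.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WrapperExample {

    /** Stand-in for a generated data class; assumed to expose only plain getters. */
    static final class FacebookUserData {
        private final long id;
        private final String name;
        private final List<Long> friendIds;

        FacebookUserData(long id, String name, List<Long> friendIds) {
            this.id = id;
            this.name = name;
            this.friendIds = friendIds;
        }

        long getId() { return id; }
        String getName() { return name; }
        List<Long> getFriendIds() { return friendIds; }
    }

    /** Hand-written wrapper: holds the data object and adds behaviour on top of it. */
    static final class FacebookUser {
        private final FacebookUserData data;
        private final Map<Long, FacebookUserData> store; // hypothetical lookup by id

        FacebookUser(FacebookUserData data, Map<Long, FacebookUserData> store) {
            this.data = data;
            this.store = store;
        }

        String getName() { return data.getName(); }

        /** Number of stored friend ids, whether or not they can be resolved. */
        int getFriendCount() { return data.getFriendIds().size(); }

        /** Resolves friend ids to wrapped users; ids missing from the store are skipped. */
        List<FacebookUser> getFriends() {
            List<FacebookUser> friends = new ArrayList<>();
            for (Long friendId : data.getFriendIds()) {
                FacebookUserData friend = store.get(friendId);
                if (friend != null) {
                    friends.add(new FacebookUser(friend, store));
                }
            }
            return friends;
        }
    }

    public static void main(String[] args) {
        Map<Long, FacebookUserData> store = new HashMap<>();
        store.put(1L, new FacebookUserData(1L, "Alice", Arrays.asList(2L, 3L)));
        store.put(2L, new FacebookUserData(2L, "Bob", Arrays.asList(1L)));
        // id 3 is intentionally absent from the store

        FacebookUser alice = new FacebookUser(store.get(1L), store);
        System.out.println(alice.getFriendCount() + " friend ids, "
                + alice.getFriends().size() + " resolved");
    }
}
```

Composition keeps the wrapper independent of how the data class is regenerated; the trade-off is that code receiving deserialized objects has to wrap them before it can call the extra methods.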


Rather than trying to solve these problems with Avro, which may not work (I suspect that extending the generated classes will not work well however you try), you could use plain JSON, unless you have specific requirements for Avro. Many libraries support mapping arbitrary POJOs, and some (like Jackson) also support serialization based on object identity (since 2.0.0).


Source: https://habr.com/ru/post/1403638/

