Simple data stream: Go is super slow compared to Java

As a Java developer, I am looking at Go now because I think it is an interesting language.

To get started with it, I decided to take a simple Java project that I wrote a few months ago and rewrite it in Go to compare performance and (mainly, actually) the readability and complexity of the code.

The sample Java code is as follows:

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        Stream<Container> s = Stream.from(new Iterator<Container>() {
            int i = 0;

            @Override
            public boolean hasNext() {
                return i < 10000000;
            }

            @Override
            public Container next() {
                return new Container(i++);
            }
        });

        s = s.map((Container _source) -> new Container(_source.value * 2));

        int j = 0;
        while (s.hasNext()) {
            s.next();
            j++;
        }

        System.out.println(System.currentTimeMillis() - start);
        System.out.println("j:" + j);
    }

    public static class Container {
        int value;

        public Container(int v) {
            value = v;
        }
    }

where the map function is:

    return new Stream<R>() {
        @Override
        public boolean hasNext() {
            return Stream.this.hasNext();
        }

        @Override
        public R next() {
            return _f.apply(Stream.this.next());
        }
    };

The Stream class is just an extension of java.util.Iterator that adds custom methods to it. Its methods other than map differ from the standard Java Stream API.

In any case, to reproduce this, I wrote the following Go code:

    package main

    import (
        "fmt"
    )

    type Iterator interface {
        HasNext() bool
        Next() interface{}
    }

    type Stream interface {
        HasNext() bool
        Next() interface{}
        Map(transformer func(interface{}) interface{}) Stream
    }

    ///////////////////////////////////////

    type incremetingIterator struct {
        i int
    }

    type SampleEntry struct {
        value int
    }

    func (s *SampleEntry) Value() int {
        return s.value
    }

    func (s *incremetingIterator) HasNext() bool {
        return s.i < 10000000
    }

    func (s *incremetingIterator) Next() interface{} {
        s.i = s.i + 1
        return &SampleEntry{
            value: s.i,
        }
    }

    func CreateIterator() Iterator {
        return &incremetingIterator{
            i: 0,
        }
    }

    ///////////////////////////////////////

    type stream struct {
        source Iterator
    }

    func (s *stream) HasNext() bool {
        return s.source.HasNext()
    }

    func (s *stream) Next() interface{} {
        return s.source.Next()
    }

    func (s *stream) Map(tr func(interface{}) interface{}) Stream {
        return &stream{
            source: &mapIterator{
                source:      s,
                transformer: tr,
            },
        }
    }

    func FromIterator(it Iterator) Stream {
        return &stream{
            source: it,
        }
    }

    ///////////////////////////////////////

    type mapIterator struct {
        source      Iterator
        transformer func(interface{}) interface{}
    }

    func (s *mapIterator) HasNext() bool {
        return s.source.HasNext()
    }

    func (s *mapIterator) Next() interface{} {
        return s.transformer(s.source.Next())
    }

    ///////////////////////////////////////

    func main() {
        it := CreateIterator()

        ss := FromIterator(it)
        ss = ss.Map(func(in interface{}) interface{} {
            return &SampleEntry{
                value: 2 * in.(*SampleEntry).value,
            }
        })

        fmt.Println("Start")
        for ss.HasNext() {
            ss.Next()
        }
        fmt.Println("Over")
    }

Both give the same result, but where Java takes about 20 ms, Go takes 1050 ms (with 10M elements; the test was run several times).

I am very new to Go (started a couple of hours ago), so please be lenient if I did something really bad :-)

Thanks!

+5
3 answers

The other answer changed the original task quite drastically and fell back to a simple loop. I consider that to be different code, so it cannot be used to compare execution times (the same loop could be written in Java as well, and would run faster there too).

So let's try to keep the "streaming manner" of the problem.

Note beforehand:

In Java, the granularity of System.currentTimeMillis() can be around 10 ms (!!), which is the same order of magnitude as the result! This means the measurement error can be huge relative to Java's 20 ms. Use System.nanoTime() instead to measure code execution time. See Measuring Time Difference Using System.currentTimeMillis() for more information.

Also, this is not the right way to measure execution time, since code that runs for the first time may be several times slower. See Code order and performance for details.
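One way to make the first-run effect visible is to time the same work several times in a row and look at every run, not just one. The following is only a minimal sketch of that idea in Go: work is a hypothetical stand-in for the code under test and timeRuns is an illustrative helper, neither comes from the question.

```go
package main

import (
	"fmt"
	"time"
)

// work is a hypothetical stand-in for the code being measured.
func work(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i * 2
	}
	return sum
}

// timeRuns calls f `runs` times and returns how long each call took.
func timeRuns(runs int, f func()) []time.Duration {
	durations := make([]time.Duration, 0, runs)
	for r := 0; r < runs; r++ {
		start := time.Now()
		f()
		durations = append(durations, time.Since(start))
	}
	return durations
}

func main() {
	sink := 0 // the result is used below so the work is not dead code
	for i, d := range timeRuns(5, func() { sink += work(10000000) }) {
		fmt.Printf("run %d: %v\n", i+1, d)
	}
	fmt.Println("checksum:", sink)
}
```

For anything more serious than a quick check, the benchmarking support in Go's standard testing package is the usual tool.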

Genesis

Your original Go proposal runs on my computer in about 1.1 seconds, which roughly matches yours.

Removing an interface{} element type

Go had no generics when this was written (type parameters were only added later, in Go 1.18). Trying to mimic them with interface{} is not the same thing and has a serious performance impact if the value you want to work with is a primitive type (such as int) or a simple struct (such as the Go equivalent of your Java Container type). See: The Laws of Reflection # The representation of an interface. Wrapping an int (or any other concrete type) in an interface requires creating a (type; value) pair that holds the dynamic type and the value being wrapped (creating this pair also involves copying the wrapped value; see the analysis in the answer to How can a slice contain itself?). In addition, to access the value you must use a type assertion, which is a runtime check, so the compiler cannot help with optimizing it (and the check adds to the execution time of the code)!

So let's not use interface{} for our elements, but a concrete type for our case:
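To make the wrapping and the type assertion concrete, here is a small sketch that computes the same sum twice: once over elements stored as interface{} and once over the concrete type. The function names sumBoxed and sumTyped are made up for this illustration, and the sketch only shows where the extra work happens; it does not measure it.

```go
package main

import "fmt"

// Container mirrors the small value type discussed in this answer.
type Container struct{ value int }

// sumBoxed receives its elements as interface{} and needs a runtime
// type assertion to get each Container back out.
func sumBoxed(items []interface{}) int {
	total := 0
	for _, it := range items {
		c, ok := it.(Container) // checked at run time, not at compile time
		if !ok {
			continue // skip anything that is not a Container
		}
		total += c.value
	}
	return total
}

// sumTyped works on the concrete type: no wrapping and no assertion.
func sumTyped(items []Container) int {
	total := 0
	for _, c := range items {
		total += c.value
	}
	return total
}

func main() {
	typed := []Container{{1}, {2}, {3}}
	boxed := make([]interface{}, len(typed))
	for i, c := range typed {
		boxed[i] = c // copies c into an interface value
	}
	fmt.Println(sumBoxed(boxed), sumTyped(typed)) // prints "6 6"
}
```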

 type Container struct { value int } 

We will use this type in the Next method of the iterator and the stream, Next() Container, and in the mapper function:

 type Mapper func(Container) Container 

We can also use embedding, since the method set of Iterator is a subset of the method set of Stream.

Without further ado, here is a complete, executable example:

    package main

    import (
        "fmt"
        "time"
    )

    type Container struct {
        value int
    }

    type Iterator interface {
        HasNext() bool
        Next() Container
    }

    type incIter struct {
        i int
    }

    func (it *incIter) HasNext() bool {
        return it.i < 10000000
    }

    func (it *incIter) Next() Container {
        it.i++
        return Container{value: it.i}
    }

    type Mapper func(Container) Container

    // Stream embeds Iterator, so it has HasNext and Next plus Map.
    type Stream interface {
        Iterator
        Map(Mapper) Stream
    }

    type iterStream struct {
        Iterator
    }

    func NewStreamFromIter(it Iterator) Stream {
        return iterStream{Iterator: it}
    }

    func (is iterStream) Map(f Mapper) Stream {
        return mapperStream{Stream: is, f: f}
    }

    type mapperStream struct {
        Stream
        f Mapper
    }

    func (ms mapperStream) Next() Container {
        return ms.f(ms.Stream.Next())
    }

    func (ms mapperStream) Map(f Mapper) Stream {
        return nil // Not implemented / needed
    }

    func main() {
        s := NewStreamFromIter(&incIter{})
        s = s.Map(func(in Container) Container {
            return Container{value: in.value * 2}
        })

        fmt.Println("Start")
        start := time.Now()
        j := 0
        for s.HasNext() {
            s.Next()
            j++
        }
        fmt.Println(time.Since(start))
        fmt.Println("j:", j)
    }

Execution time: 210 ms. Nice, we have already sped it up 5 times, but we are still far from the performance of the Java Stream.

"Removing" Iterator and Stream Types

Since we cannot use generics, the Iterator and Stream interface types do not really need to be interfaces: we would need new types anyway if we wanted to use them to define iterators and streams of other element types.

So the next step is to remove Stream and Iterator and use their concrete types, the implementations shown above. This does not hurt readability at all; in fact the solution is shorter:

    package main

    import (
        "fmt"
        "time"
    )

    type Container struct {
        value int
    }

    type incIter struct {
        i int
    }

    func (it *incIter) HasNext() bool {
        return it.i < 10000000
    }

    func (it *incIter) Next() Container {
        it.i++
        return Container{value: it.i}
    }

    type Mapper func(Container) Container

    type iterStream struct {
        *incIter
    }

    func NewStreamFromIter(it *incIter) iterStream {
        return iterStream{incIter: it}
    }

    func (is iterStream) Map(f Mapper) mapperStream {
        return mapperStream{iterStream: is, f: f}
    }

    type mapperStream struct {
        iterStream
        f Mapper
    }

    func (ms mapperStream) Next() Container {
        return ms.f(ms.iterStream.Next())
    }

    func main() {
        s0 := NewStreamFromIter(&incIter{})
        s := s0.Map(func(in Container) Container {
            return Container{value: in.value * 2}
        })

        fmt.Println("Start")
        start := time.Now()
        j := 0
        for s.HasNext() {
            s.Next()
            j++
        }
        fmt.Println(time.Since(start))
        fmt.Println("j:", j)
    }

Execution time: 50 ms. We have again sped it up 4 times compared to our previous solution! That is now the same order of magnitude as the Java solution, and we have lost nothing of the "streaming manner". Overall gain compared to the asker's proposal: 22 times faster.

Given that you used System.currentTimeMillis() to measure the Java execution time, this may even be the same as Java's performance. The asker confirmed: it is the same!

Regarding the same performance

We are now talking about roughly the "same" code doing fairly simple, basic tasks in different languages. When the tasks are this basic, there is not much that one language can do better than the other.

Also keep in mind that Java is a mature adult (over 21 years old) and has had an enormous amount of time to evolve and be optimized; in fact, Java's JIT (just-in-time compilation) does a pretty good job for long-running processes such as yours. Go is much younger, still just a kid (it will be 5 years old in 11 days), and will probably see bigger performance improvements in the foreseeable future than Java.

Further improvements

This "streaming" style may not be the "Go way" to approach the problem you are trying to solve. It is merely the "mirror" of your Java solution, written with more idiomatic Go constructs.

Instead, you should take advantage of Go's excellent support for concurrency, namely goroutines (see the go statement), which are much more efficient than Java threads, and other language constructs such as channels (see the answer to What are golang channels used for?) and the select statement.

By properly chunking / partitioning your originally big task into smaller ones, a pool of worker goroutines can be quite powerful for processing large amounts of data. See Is this an idiomatic worker thread pool in Go?

You also stated in your comment that "I do not have 10M elements for processing, but more than 10G that will not fit into memory." If that is the case, think about the I/O time and the latency of the external system you fetch the data from. If that takes significant time, it may outweigh the processing time in the application, and the application's execution time may not matter (at all).

Go is not about squeezing every nanosecond out of execution time, but about giving you a simple, minimalist language and tools with which you can easily (by writing simple code) take control of and use your available resources (for example, goroutines and a multi-core CPU).

(Try comparing the Go language specification and the Java language specification. Personally, I have read Go's spec many times, but could never get to the end of Java's.)
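As a rough illustration of the worker pool idea, the sketch below feeds values through a channel to a fixed number of goroutines, each of which doubles the values it receives, and sums the results. The function name processAll, the doubling step and the worker count are assumptions made for this example (and it assumes workers is at least 1); it is not the code from the linked question.

```go
package main

import (
	"fmt"
	"sync"
)

// processAll distributes inputs over `workers` goroutines, doubles each
// value, and returns the sum of the doubled values.
func processAll(inputs []int, workers int) int {
	jobs := make(chan int)
	results := make(chan int)

	// Start the workers; each reads jobs until the channel is closed.
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for v := range jobs {
				results <- v * 2
			}
		}()
	}

	// Feed the jobs, then signal that no more work is coming.
	go func() {
		for _, v := range inputs {
			jobs <- v
		}
		close(jobs)
	}()

	// Close results once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	total := 0
	for r := range results {
		total += r
	}
	return total
}

func main() {
	fmt.Println(processAll([]int{1, 2, 3, 4, 5}, 3)) // prints 30
}
```

For work as cheap as doubling an integer, the channel traffic will most likely cost more than it saves; a pool like this pays off when each job does a meaningful amount of work.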

+5

This, I think, is an interesting question, as it gets to the heart of the differences between Java and Go and highlights the difficulties of porting code. Here is the same thing in Go minus all the machinery (time ~50 ms here):

    values := make([]int64, 10000000)
    start := time.Now()
    fmt.Println("Start")
    for i := int64(0); i < 10000000; i++ {
        values[i] = 2 * i
    }
    fmt.Println("Over after:", time.Now().Sub(start))

More seriously, here is the same thing with a map over a slice of entries, which is a more idiomatic version of what you have above and could work with any kind of Entry struct. On my machine this actually runs faster (30 ms) than the for loop above (anyone care to explain why?), so it is probably similar to your Java version:

    package main

    import (
        "fmt"
        "time"
    )

    type Entry struct {
        Value int64
    }

    type EntrySlice []*Entry

    func New(l int64) EntrySlice {
        entries := make(EntrySlice, l)
        for i := int64(0); i < l; i++ {
            entries[i] = &Entry{Value: i}
        }
        return entries
    }

    // Map applies fn to every entry in place.
    func (entries EntrySlice) Map(fn func(i int64) int64) {
        for _, e := range entries {
            e.Value = fn(e.Value)
        }
    }

    func main() {
        entries := New(10000000)

        start := time.Now()
        fmt.Println("Start")
        entries.Map(func(v int64) int64 {
            return 2 * v
        })
        fmt.Println("Over after:", time.Now().Sub(start))
    }

Things that make operations more expensive:

  • Passing around interface{}: don't do this
  • Building a separate iterator type: use range or for loops instead
  • Allocations: building new values to store the results; transform in place instead

Regarding interface{}, I would avoid it. This means you have to write a separate map (say) for each type, which is no great hardship. Instead of building an iterator, a range loop is probably more appropriate. Regarding transforming in place: if you allocate new structs for each result, it puts pressure on the garbage collector. Using a Map func like this one is an order of magnitude slower:

 entries.Map(func(e *Entry) *Entry { return &Entry{Value: 2 * e.Value} }) 

To stream, split the data into chunks and do the same as above (keeping a memo of the last object if you depend on previous calculations). If you have independent calculations (which is not the case here), you could also fan out to a bunch of goroutines doing the work and get it done faster if there is a lot of it (this has overhead; in simple examples it won't be faster).

Finally, if you are interested in data processing with Go, I would recommend visiting this new site: http://gopherdata.io/
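The chunking idea can be sketched as follows. This is only an illustration under stated assumptions: an in-memory slice stands in for data that would really arrive from a file or network source, the "memo" carried between chunks is a simple running total, chunkSize is assumed to be greater than zero, and the names processChunk and processInChunks are made up for the example.

```go
package main

import "fmt"

// processChunk doubles every value in place and returns the running
// total, continuing from the carry left by the previous chunk.
func processChunk(chunk []int64, carry int64) int64 {
	for i, v := range chunk {
		chunk[i] = 2 * v
		carry += chunk[i]
	}
	return carry
}

// processInChunks walks data chunkSize elements at a time, passing the
// memo (carry) from one chunk to the next.
func processInChunks(data []int64, chunkSize int) int64 {
	var carry int64
	for start := 0; start < len(data); start += chunkSize {
		end := start + chunkSize
		if end > len(data) {
			end = len(data)
		}
		carry = processChunk(data[start:end], carry)
	}
	return carry
}

func main() {
	data := []int64{1, 2, 3, 4, 5, 6, 7}
	total := processInChunks(data, 3)
	fmt.Println(data, total) // prints "[2 4 6 8 10 12 14] 56"
}
```

Because each chunk is transformed in place, no new values are allocated per element, which keeps the garbage collector out of the hot loop.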

+6

As a complement to the previous comments, I changed both the Java and the Go code to run the test 100 times.

Interestingly, Go takes a constant time between 69 and 72 ms.

Java, meanwhile, is slower on the first runs (71 ms, 19 ms, 12 ms) and then settles between 5 and 7 ms.

From my tests and understanding, this is because the JVM needs some time to load the classes properly and apply its optimizations.

In the end I still see this 10x difference, but I am not giving up; I will try to better understand how Go works in order to make it faster :)

0

Source: https://habr.com/ru/post/1265566/

