How to find identical byte[] objects in two arrays in parallel?

I am trying to mount a collision attack on hashes (I'm taking a cryptography course). I have two arrays of hashes (byte sequences, i.e. byte[]) and want to find the hashes that are present in both arrays. After some research and a lot of thought, I am sure the best solution on a single-core machine would be a HashSet: add all elements of the first array, then check via contains whether the elements of the second array are already present.

However, I want to implement a parallel solution, since I have access to a machine with 8 cores and 12 GB of RAM. The best solution I can think of is a ConcurrentHashSet, which can be created via Collections.newSetFromMap(new ConcurrentHashMap<K,V>()). With this data structure I could add all the elements of the first array in parallel and then, once everything has been added, check via contains for identical hashes, again in parallel.
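To make the idea concrete, here is a minimal sketch of what I have in mind (the class and method names are mine; note the ByteBuffer wrapper, which I need because a plain byte[] only has identity-based equals/hashCode and therefore cannot be used directly as a set element):

    import java.nio.ByteBuffer;
    import java.util.Collections;
    import java.util.List;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.stream.Collectors;

    public class ConcurrentSetSketch {
        // Fill the set from list1 in parallel, then probe it with list2 in parallel.
        static List<byte[]> findCommon(List<byte[]> list1, List<byte[]> list2) {
            // The "ConcurrentHashSet" described above.
            Set<ByteBuffer> seen =
                    Collections.newSetFromMap(new ConcurrentHashMap<ByteBuffer, Boolean>());
            // ByteBuffer.wrap() gives content-based equals/hashCode without copying.
            list1.parallelStream().forEach(h -> seen.add(ByteBuffer.wrap(h)));
            return list2.parallelStream()
                        .filter(h -> seen.contains(ByteBuffer.wrap(h)))
                        .collect(Collectors.toList());
        }
    }

(I used Java 8 parallel streams here only to keep the sketch short; in the real program the same two loops would be distributed with Skandium.)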

So my question is: do you know of an algorithm developed for this exact problem? If not, do you have experience with such a ConcurrentHashSet, regarding pitfalls and effective runtime complexity? Or can you recommend another ready-made data structure that could help me?

PS: In case anyone is interested in the details: I plan to use Skandium to parallelize my program.

+6
2 answers

I think it would be a waste of time to use any form of HashMap. I am guessing you are computing multi-byte hashes of various data; these already are hashes, so there is no need to hash them again.

Although you do not state it, I assume your hashes are byte sequences. Clearly a trie or a dawg would be ideal for storing them.

I would therefore suggest implementing a trie/dawg and using it to store all the hashes in the first array. Then you can use all your processing power in parallel to look up each element of the second array in this trie. No locks are required.

Added

Here is a simple Dawg implementation I threw together. It seems to work.

    public class Dawg {
        // All my children.
        Dawg[] children = new Dawg[256];
        // Am I a leaf.
        boolean isLeaf = false;

        // Add a new word.
        public void add(byte[] word) {
            // Find its location, growing as necessary.
            Dawg loc = find(word, 0, true);
            loc.isLeaf = true;
        }

        // String form.
        public void add(String word) {
            add(word.getBytes());
        }

        // Returns true if word is in the dawg.
        public boolean contains(byte[] word) {
            // Find its location, no growing allowed.
            Dawg d = find(word, 0, false);
            return d != null && d.isLeaf;
        }

        // String form.
        public boolean contains(String word) {
            return contains(word.getBytes());
        }

        // Find the Dawg - growing the tree as necessary if requested.
        private Dawg find(byte[] word, int i, boolean grow) {
            // Mask to 0..255: byte is signed in Java, and hash bytes >= 0x80
            // would otherwise produce a negative array index.
            int b = word[i] & 0xFF;
            Dawg child = children[b];
            if (child == null) {
                // Not present!
                if (grow) {
                    // Grow the tree.
                    child = new Dawg();
                    children[b] = child;
                }
            }
            // Found it?
            if (child != null) {
                // More to find?
                if (i < word.length - 1) {
                    child = child.find(word, i + 1, grow);
                }
            }
            return child;
        }

        public static void main(String[] args) {
            Dawg d = new Dawg();
            d.add("H");
            d.add("Hello");
            d.add("World");
            d.add("Hell");
            System.out.println("Hello is " + (d.contains("Hello") ? "in" : "out"));
            System.out.println("World is " + (d.contains("World") ? "in" : "out"));
            System.out.println("Hell is " + (d.contains("Hell") ? "in" : "out"));
            System.out.println("Hal is " + (d.contains("Hal") ? "in" : "out"));
            System.out.println("Hel is " + (d.contains("Hel") ? "in" : "out"));
            System.out.println("H is " + (d.contains("H") ? "in" : "out"));
        }
    }

Added

This could be a good starting point for a lock-free parallel version. Such things are notoriously difficult to verify, so I cannot guarantee that it works, but to my mind it should.

    import java.util.concurrent.atomic.AtomicReferenceArray;

    public class LFDawg {
        // All my children.
        AtomicReferenceArray<LFDawg> children = new AtomicReferenceArray<LFDawg>(256);
        // Am I a leaf. Volatile so that leaf marks made by one adding thread
        // are visible to all the others.
        volatile boolean isLeaf = false;

        // Add a new word.
        public void add(byte[] word) {
            // Find its location, growing as necessary.
            LFDawg loc = find(word, 0, true);
            loc.isLeaf = true;
        }

        // String form.
        public void add(String word) {
            add(word.getBytes());
        }

        // Returns true if word is in the dawg.
        public boolean contains(byte[] word) {
            // Find its location, no growing allowed.
            LFDawg d = find(word, 0, false);
            return d != null && d.isLeaf;
        }

        // String form.
        public boolean contains(String word) {
            return contains(word.getBytes());
        }

        // Find the LFDawg - growing the tree as necessary if requested.
        private LFDawg find(byte[] word, int i, boolean grow) {
            // Mask to 0..255: byte is signed in Java, and hash bytes >= 0x80
            // would otherwise give a negative index.
            int b = word[i] & 0xFF;
            LFDawg child = children.get(b);
            if (child == null) {
                // Not present!
                if (grow) {
                    // Grow the tree.
                    child = new LFDawg();
                    if (!children.compareAndSet(b, null, child)) {
                        // Someone else got there before me. Use the one they set.
                        child = children.get(b);
                    }
                }
            }
            // Found it?
            if (child != null) {
                // More to find?
                if (i < word.length - 1) {
                    child = child.find(word, i + 1, grow);
                }
            }
            return child;
        }

        public static void main(String[] args) {
            LFDawg d = new LFDawg();
            d.add("H");
            d.add("Hello");
            d.add("World");
            d.add("Hell");
            System.out.println("Hello is " + (d.contains("Hello") ? "in" : "out"));
            System.out.println("World is " + (d.contains("World") ? "in" : "out"));
            System.out.println("Hell is " + (d.contains("Hell") ? "in" : "out"));
            System.out.println("Hal is " + (d.contains("Hal") ? "in" : "out"));
            System.out.println("Hel is " + (d.contains("Hel") ? "in" : "out"));
            System.out.println("H is " + (d.contains("H") ? "in" : "out"));
        }
    }
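
For completeness, here is a sketch of how the search phase could then be driven in parallel; the thread-pool scaffolding is just my suggestion and is not part of the dawg itself:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class DawgSearch {
        // Look up every query hash in a pre-built dawg, one worker per core.
        static List<byte[]> findMatches(final LFDawg dawg, List<byte[]> queries)
                throws InterruptedException, ExecutionException {
            ExecutorService pool = Executors.newFixedThreadPool(8);
            List<Future<byte[]>> results = new ArrayList<Future<byte[]>>();
            for (final byte[] q : queries) {
                results.add(pool.submit(new Callable<byte[]>() {
                    public byte[] call() {
                        // Reads only: no locks needed once building is done.
                        return dawg.contains(q) ? q : null;
                    }
                }));
            }
            List<byte[]> matches = new ArrayList<byte[]>();
            for (Future<byte[]> f : results) {
                byte[] hit = f.get();
                if (hit != null) {
                    matches.add(hit);
                }
            }
            pool.shutdown();
            return matches;
        }
    }

Submitting one task per hash is wasteful in practice; batching the second array into 8 chunks would cut the overhead. But it shows the shape of the thing: the dawg is read-only during this phase, so the workers need no locks at all.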
+5

A simpler approach would be to just split the first array into N equal (or nearly equal) parts (with 8 cores, N = 8 seems reasonable). Then solve the problem in the "normal" way, checking whether any hashes in the 2nd array are present in each of the N smaller sub-arrays. This can be done in parallel.

That said, I had never heard of tries/dawgs before, and I found the discussion above fascinating and informative. (I mostly work with numbers, not words.)

This assumes that the byte[] hashes have some finite, fairly short length, so that you can really split up the original list for parallel processing. Is that the case?

EDIT ADDED

For an example of this idea in action, see GPU Computing Gems, edited by Wen-Mei W. Hwu, Chapter 11, an article by Ligowski, Rudnicki, Liu and Schmidt. They parallelize a massive protein sequence database search by breaking the huge single database into many smaller pieces and then running the ordinary algorithm on each sub-piece. I like this quote: "The described algorithm is embarrassingly parallel." In their case they used CUDA and had to do a lot of memory optimization, but the principle should apply to multi-core machines just as well.

Semi-pseudocode follows. I will use Lists for the incoming byte[] hashes, hope that's OK.

Original, single-threaded method:

    // Note: byte[] inherits identity-based equals/hashCode from Object, so the
    // hashes are wrapped in ByteBuffer, which compares by content.
    void originalProcess(List<byte[]> list1, List<byte[]> list2) {
        Set<ByteBuffer> bigHugeHashOfList1 = new HashSet<ByteBuffer>();
        for (byte[] hash : list1) {
            bigHugeHashOfList1.add(ByteBuffer.wrap(hash));
        }
        for (byte[] hash : list2) {
            if (bigHugeHashOfList1.contains(ByteBuffer.wrap(hash))) {
                // do something
            }
        }
    }

New method. It reuses the exact processing method above. No DAWGs or tries here ...

    void preprocess(List<byte[]> list1, List<byte[]> list2) {
        // Java cannot create generic arrays ('new ArrayList<byte[]>[8]' does
        // not compile), so use a list of lists instead.
        List<List<byte[]>> splitLists = new ArrayList<List<byte[]>>(8);
        for (int i = 0; i < 8; i++) {
            splitLists.add(new ArrayList<byte[]>());
        }
        for (byte[] hash : list1) {
            int idx = hash[0] & 7; // I'm taking the 3 low-order bits, YMMV
            splitLists.get(idx).add(hash);
            // a minor speedup would be to build the HashSet here
            // instead of in originalProcess()
        }
        // now, using your favorite parallel/concurrency technique,
        // do the equivalent of
        for (int i = 0; i < 8; i++) {
            originalProcess(splitLists.get(i), list2);
        }
    }
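
One concrete way to do that last step in parallel (just one option among many; the thread-pool details below are my own suggestion):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    void parallelProcess(List<List<byte[]>> splitLists, final List<byte[]> list2)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Callable<Void>> jobs = new ArrayList<Callable<Void>>();
        for (final List<byte[]> part : splitLists) {
            jobs.add(new Callable<Void>() {
                public Void call() {
                    originalProcess(part, list2); // the method defined above
                    return null;
                }
            });
        }
        pool.invokeAll(jobs); // blocks until all 8 sub-jobs are done
        pool.shutdown();
    }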
0
