Efficient way to find the stream for a string

Suppose there is a stream of text (or Reader in Java) that I would like to check for a specific line. The text stream can be very large, therefore, as soon as the search string is found, I would like to return true, and also try to avoid saving all the input in memory.

Naively, I could try to do something like this (in Java):

public boolean streamContainsString(Reader reader, String searchString) throws IOException { char[] buffer = new char[1024]; int numCharsRead; while((numCharsRead = reader.read(buffer)) > 0) { if ((new String(buffer, 0, numCharsRead)).indexOf(searchString) >= 0) return true; } return false; } 

Of course, this cannot detect this search string if it occurs on the border of a 1k buffer:

Search text: "stackoverflow"
Stream buffer 1: "abc ......... stack"
Stream buffer 2: "overflow ....... xyz"

How can I change this code so that it correctly finds the given search string along the border of the buffer, but without loading the entire stream into memory?

Edit: Please note that when searching for a stream for a string, we try to minimize the number of reads from the stream (in order to avoid latency in the network / disk) and save constant memory regardless of the amount of data in the stream. The actual effectiveness of the algorithm is secondary, but it would obviously be nice to find a solution that uses one of the most efficient of these algorithms.

+48
java string algorithm stream search
May 10 '09 at 21:43
source share
15 answers

I made a few changes to the Knut Morris Pratt algorithm for a partial search. Since the actual comparison position is always less than or equal to the following, there is no need for additional memory. Code with the Makefile is also available on github , and it is written in Haxe to specify multiple programming languages ​​at once, including Java.

I also wrote a related article: Searching for substrings in streams: a small modification of the Knuth-Morris-Pratt algorithm in Haxe . The article mentions Jakarta RegExp , now a retired and vacationer at Apache Attic. The Jakarta Regexp library's β€œ match ” method in the RE class uses CharacterIterator as a parameter.

 class StreamOrientedKnuthMorrisPratt { var m: Int; var i: Int; var ss: var table: Array<Int>; public function new(ss: String) { this.ss = ss; this.buildTable(this.ss); } public function begin() : Void { this.m = 0; this.i = 0; } public function partialSearch(s: String) : Int { var offset = this.m + this.i; while(this.m + this.i - offset < s.length) { if(this.ss.substr(this.i, 1) == s.substr(this.m + this.i - offset,1)) { if(this.i == this.ss.length - 1) { return this.m; } this.i += 1; } else { this.m += this.i - this.table[this.i]; if(this.table[this.i] > -1) this.i = this.table[this.i]; else this.i = 0; } } return -1; } private function buildTable(ss: String) : Void { var pos = 2; var cnd = 0; this.table = new Array<Int>(); if(ss.length > 2) this.table.insert(ss.length, 0); else this.table.insert(2, 0); this.table[0] = -1; this.table[1] = 0; while(pos < ss.length) { if(ss.substr(pos-1,1) == ss.substr(cnd, 1)) { cnd += 1; this.table[pos] = cnd; pos += 1; } else if(cnd > 0) { cnd = this.table[cnd]; } else { this.table[pos] = 0; pos += 1; } } } public static function main() { var KMP = new StreamOrientedKnuthMorrisPratt("aa"); KMP.begin(); trace(KMP.partialSearch("ccaabb")); KMP.begin(); trace(KMP.partialSearch("ccarbb")); trace(KMP.partialSearch("fgaabb")); } } 
+9
Jan 21
source share

There are three good solutions here:

  • If you need something lightweight and reasonably fast, go without a buffer, and instead implement a simple non-deterministic state machine. Your state will be a list of indices in the string you are looking for, and your logic looks something like this (pseudo-code):

     String needle; n = needle.length(); for every input character c do add index 0 to the list for every index i in the list do if c == needle[i] then if i + 1 == n then return true else replace i in the list with i + 1 end else remove i from the list end end end 

    This will find the string if it exists and you will never need a buffer.

  • A little more work, but also faster: do an NFA-to-DFA conversion that predefines which index lists are possible and assign each one to a small integer. (If you read about Wikipedia's string search, this is called the poweret construct.) Then you have one state, and you make a state transition for each incoming character. The NFA you want is just the DFA for a string that is preceded by a state that is non-deterministic either throwing a character or trying to use the current character. You will also need an explicit error condition.

  • If you want something faster, create a buffer at least two times n , and the Boyer-Moore user will compile the state machine from needle . You will have a lot of unnecessary trouble, because Boyer-Moore is not trivial to implement (although you will find the code on the Internet), and because you have to arrange the movement of the line through the buffer. You will need to create or find a circular buffer that can β€œslide” without copying; otherwise, you are likely to return all the performance gains you can get from Boyer Moore.

+14
May 11 '09 at 4:43 a.m.
source share

Knuth-Morris-Pratt's search algorithm is never backed up; this is just the property you want to search for a stream. I used it before for this problem, although there may be simpler ways to use the available Java libraries. (When it came to me, I worked in C in the 90s.)

KMP is essentially a quick way to build string-compatible DFAs, such as the Norman Ramsey # 2 sentence.

+9
May 11 '09 at 5:01 a.m.
source share

This answer applied to the original version of the question, where the key was to read the stream only as needed, to match on a String if that String is present. This solution would not meet the requirement to guarantee a fixed memory usage, but it may be advisable to consider whether you found this question and are not bound by this restriction.

If you are bound by a restriction on the use of read-only memory, Java stores arrays of any type on the heap, and as such zeroing, the reference does not free memory in any way; I think that any solution involving arrays in a loop will consume memory on the heap and require GC.




For a simple implementation, there may be a Java 5 Scanner that can accept an InputStream and use java.util.regex.Pattern to search for input can save your attention on implementation details.

Here is an example of a potential implementation:

 public boolean streamContainsString(Reader reader, String searchString) throws IOException { Scanner streamScanner = new Scanner(reader); if (streamScanner.findWithinHorizon(searchString, 0) != null) { return true; } else { return false; } } 

I am thinking of a regex because it is like a job for Finate State Automaton that starts in its original state, changing the state character by character until it rejects the line (no match) or falls into accept state.

I think this is probably the most efficient matching logic you could use, and how you organize the reading of information can be separated from the matching logic to tune performance.

This is also how regular expressions work.

+5
May 10, '09 at 21:53
source share

Instead of making your buffer an array, use an abstraction that implements a circular buffer . The calculation of your index will be buf[(next+i) % sizeof(buf)] , and you must be careful to fill the buffer in half at a time. But while the search string fits into half the buffer, you will find it.

+4
May 11, '09 at 4:49
source share

I believe that the best solution to this problem is to try to simplify it. Remember: because I am reading from the stream, I want the number of reads from the stream to be minimal (because there may be a problem with network or disk latency), while the memory used is constant (since the stream can be very large in size ) The actual effectiveness of string matching is not number one goal (as has already been studied before death ).

Based on AlbertoPL's suggestion, here is a simple solution that compares the buffer with the character of a character search string. The key is that since only one character is searched at a time, backtracking is not required, and therefore, circular buffers or buffers of a certain size are not needed.

Now, if someone can come up with a similar implementation based on the Knuth-Morris-Pratt search algorithm , then we will have a good effective solution;)

 public boolean streamContainsString(Reader reader, String searchString) throws IOException { char[] buffer = new char[1024]; int numCharsRead; int count = 0; while((numCharsRead = reader.read(buffer)) > 0) { for (int c = 0; c < numCharsRead; c++) { if (buffer[c] == searchString.charAt(count)) count++; else count = 0; if (count == searchString.length()) return true; } } return false; } 
+4
May 11, '09 at 21:28
source share

The introduction of a sliding window. Make your buffer around, move all the elements in the buffer one forward, and enter one new character into the buffer at the end. If the buffer is equal to your search term, it is contained.

Of course, if you want to make it more efficient, you can see how to prevent all the elements from moving around in the buffer around, for example, using a circular buffer and presenting the lines that β€œloop” the same thing that makes the buffer, so you only need to check content matching. This saves the movement of all elements in the buffer.

+1
May 10 '09 at
source share

I think you need to buffer a small amount at the border between the buffers.

For example, if the size of your buffer is 1024, and the length of SearchString is 10, then, as well as to search in each 1024-byte buffer, you also need to look for every 18-byte transition between two buffers (9 bytes from the end, the previous buffer is combined with 9 bytes from the beginning of the next buffer).

+1
May 10, '09 at 21:48
source share

I would say switch to a character according to the character’s decision, in which case you scan the first character in the target text, then when you find that the character is incrementing the counter and looking for the next character. Each time you do not find the next consecutive character, restart the counter. It will work as follows:

 public boolean streamContainsString(Reader reader, String searchString) throws IOException { char[] buffer = new char[1024]; int numCharsRead; int count = 0; while((numCharsRead = reader.read(buffer)) > 0) { if (buffer[numCharsRead -1] == searchString.charAt(count)) count++; else count = 0; if (count == searchString.size()) return true; } return false; } 

The only problem is that you are in the middle of viewing characters ... in this case there should be a way to remember your variable count. I see no easy way to do this, except as a private variable for the entire class. In this case, you should not indicate the score inside this method.

+1
May 10 '09 at 21:53
source share

If you are not attached to using Reader, you can use the Java NIO API to efficiently download a file. For example (untested, but should be close to work):

 public boolean streamContainsString(File input, String searchString) throws IOException { Pattern pattern = Pattern.compile(Pattern.quote(searchString)); FileInputStream fis = new FileInputStream(input); FileChannel fc = fis.getChannel(); int sz = (int) fc.size(); MappedByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, sz); CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder(); CharBuffer cb = decoder.decode(bb); Matcher matcher = pattern.matcher(cb); return matcher.matches(); } 

This is basically a mmap () file to search for and relies on the operating system to do the right thing regarding cache and memory usage. Note, however, that map () is more expensive by simply reading the file into a large buffer for files smaller than 10 kilobytes.

+1
May 10 '09 at 10:30
source share

A very fast stream search is implemented in the RingBuffer class from the Ujorm structure. See Sample:

  Reader reader = RingBuffer.createReader("xxx ${abc} ${def} zzz"); String word1 = RingBuffer.findWord(reader, "${", "}"); assertEquals("abc", word1); String word2 = RingBuffer.findWord(reader, "${", "}"); assertEquals("def", word2); String word3 = RingBuffer.findWord(reader, "${", "}"); assertEquals("", word3); 

An implementation of one class is available at SourceForge : For more information, see Link.

+1
Jan 20 '13 at 18:42
source share

You can increase the search speed for very large strings using the string search algorithm

0
May 10 '09 at 21:55
source share

If you are looking for a constant substring rather than a regular expression, I would recommend Boyer-Moore. There is a lot of source code on the Internet.

Also, use a circular buffer to not think too much about buffer boundaries.

Mike.

0
May 11 '09 at 4:00
source share

I also had a similar problem: skip bytes from InputStream to the specified string (or byte array). This is simple circular buffer based code. It is not very effective, but it works for my needs:

  private static boolean matches(int[] buffer, int offset, byte[] search) { final int len = buffer.length; for (int i = 0; i < len; ++i) { if (search[i] != buffer[(offset + i) % len]) { return false; } } return true; } public static void skipBytes(InputStream stream, byte[] search) throws IOException { final int[] buffer = new int[search.length]; for (int i = 0; i < search.length; ++i) { buffer[i] = stream.read(); } int offset = 0; while (true) { if (matches(buffer, offset, search)) { break; } buffer[offset] = stream.read(); offset = (offset + 1) % buffer.length; } } 
0
Jul 20 '12 at 18:20
source share

Perhaps you can implement a very fast solution using Fast Fourier Transforms, which, if executed correctly, will allow you to perform string matching at times O (nlog (m)), where n is the length of the longer string, be consistent, and m is shorter string length. For example, you can perform an FFT as soon as you receive an input stream of length m, and if it matches, you can return, and if it does not match, you can throw the first character into the stream, wait for a new character to appear through the stream, and then perform FFT again.

0
Mar 14 '14 at 17:48
source share



All Articles