Find the longest common starting substring in a rowset

Question

Find the longest common starting substring in a rowset

It is a difficult task to come up with the most elegant JavaScript, Ruby, or another solution for a relatively trivial problem.

This problem is a more specific case. The longest common substring problem . I need to find only the longest common starting substring in the array. This greatly simplifies the problem.

For example, the longest substring in [interspecies, interstelar, interstate] is "inters". However, I do not need to find "ific" in [specifics, terrific] .

I solved the problem by quickly copying the solution in JavaScript as part of my answer about a shell-like shell ( test page here ). Here is a slightly modified solution:

 function common_substring(data) { var i, ch, memo, idx = 0 do { memo = null for (i=0; i < data.length; i++) { ch = data[i].charAt(idx) if (!ch) break if (!memo) memo = ch else if (ch != memo) break } } while (i == data.length && idx < data.length && ++idx) return (data[0] || '').slice(0, idx) }

This code is available in this Gist along with a similar solution in Ruby. You can clone the gist as a git repository to try:

 $ git clone git://gist.github.com/257891.git substring-challenge

I am not very happy with these decisions. I have a feeling that they can be solved with more elegance and less complexity, so I am posting this problem.

I am going to take as a response a decision that I find the most elegant or concise. Here, for example, the crazy ruby hack I came up with is the definition of the & operator on String:

 # works with Ruby 1.8.7 and above class String def &(other) difference = other.to_str.each_char.with_index.find { |ch, idx| self[idx].nil? or ch != self[idx].chr } difference ? self[0, difference.last] : self end end class Array def common_substring self.inject(nil) { |memo, str| memo.nil? ? str : memo & str }.to_s end end

JavaScript or Ruby solutions are preferable, but you can demonstrate a smart solution in other languages while you explain what is going on. Only code from the standard library.

Update: My Favorite Decisions

I chose Kennebec's JavaScript sorting solution as the “answer” because it hit me both unexpected and ingenious. If we ignore the complexity of the actual sorting (assuming that it is infinitely optimized using a language implementation), the complexity of the solution lies in the simple comparison of two strings.

Other great solutions:

“regex greed” by FM takes a minute or two to understand, but then the grace of it amazes you. Yehuda Katz also made a regex solution , but it's more complicated
commonprefix in Python - Roberto Bonvallé used a function designed to handle file system paths to solve this problem.
Haskell one-liner is short, as if it were compressed, and beautiful
simple single line Ruby

Thank you for participating! As you can see from the comments, I learned a lot (even about Ruby).

+39

javascript python ruby haskell

mislav Dec 16 '09 at 17:22

source share

31 answers

In Python:

 >>> from os.path import commonprefix >>> commonprefix('interspecies interstelar interstate'.split()) 'inters'

+38

Roberto Bonvallet Dec 16 '09 at 18:22

source share

Ruby one-liner:

 l=strings.inject{|l,s| l=l.chop while l!=s[0...l.length];l}

+9

AShelly Dec 16 '09 at 18:00

source share

You just need to go through all the lines until they are different, and then fine-tune to that point.

pseudo code:

 loop for i upfrom 0 while all strings[i] are equal finally return substring[0..i]

Common Lisp:

 (defun longest-common-starting-substring (&rest strings) (loop for i from 0 below (apply #'min (mapcar #'length strings)) while (apply #'char= (mapcar (lambda (string) (aref string i)) strings)) finally (return (subseq (first strings) 0 i))))

+9

Svante Dec 16 '09 at 18:02

source share

My Haskell with one layer:

 import Data.List commonPre :: [String] -> String commonPre = map head . takeWhile (\(x:xs)-> all (==x) xs) . transpose

EDIT: barkmadley gave a good explanation for the code below. I would also add that haskell uses lazy evaluation, so we can be lazy about our use of transpose ; he will only carry our lists as necessary to find the end of a common prefix.

+9

jberryman Dec 16 '09 at 23:03

source share

Another way to do this is to use regex greed.

 words = %w(interspecies interstelar interstate) j = '=' str = ['', *words].join(j) re = "[^#{j}]*" str =~ /\A (?: #{j} ( #{re} ) #{re} ) (?: #{j} \1 #{re} )* \z/x p $1

And one liner, kindly provided by mislav (50 characters):

 p ARGV.join(' ').match(/^(\w*)\w*(?: \1\w*)*$/)[1]

+6

FMc Dec 17 '09 at 14:27

source share

In Python, I would not use anything other than the existing commonprefix function, which I showed in another answer, but I could not help reinvent the wheel :P This is my iterator based approach:

 >>> a = 'interspecies interstelar interstate'.split() >>> >>> from itertools import takewhile, chain, izip as zip, imap as map >>> ''.join(chain(*takewhile(lambda s: len(s) == 1, map(set, zip(*a))))) 'inters'

Edit: An explanation of how this works.

zip generates tuples of elements that take one of each element a at a time:

 In [6]: list(zip(*a)) # here I use list() to expand the iterator Out[6]: [('i', 'i', 'i'), ('n', 'n', 'n'), ('t', 't', 't'), ('e', 'e', 'e'), ('r', 'r', 'r'), ('s', 's', 's'), ('p', 't', 't'), ('e', 'e', 'a'), ('c', 'l', 't'), ('i', 'a', 'e')]

By matching set on these elements, I get a series of unique letters:

 In [7]: list(map(set, _)) # _ means the result of the last statement above Out[7]: [set(['i']), set(['n']), set(['t']), set(['e']), set(['r']), set(['s']), set(['p', 't']), set(['a', 'e']), set(['c', 'l', 't']), set(['a', 'e', 'i'])]

takewhile(predicate, items) accepts elements from this, while the predicate is True; in this particular case, when set has one element, i.e. all words have the same letter in this position:

 In [8]: list(takewhile(lambda s: len(s) == 1, _)) Out[8]: [set(['i']), set(['n']), set(['t']), set(['e']), set(['r']), set(['s'])]

At this moment, we have the iterability of the sets, each of which contains one letter of the prefix we were looking for. To build a string, we chain them into one iterable, from which we get the letters in join to the final string.

The magic of using iterators is that all elements are generated on demand, so when takewhile stops asking for elements, zping stops at that moment and no unnecessary work is done. Each function call in my single-line line has an implicit for and an implicit break .

+4

Roberto Bonvallet Dec 16 '09 at 19:02

source share

This is probably not the most concise solution (it depends if you already have a library for this), but one of the elegant methods is to use trie. I use attempts to perform tabs in my Schema interpreter:

http://github.com/jcoglan/heist/blob/master/lib/trie.rb

For example:

 tree = Trie.new %w[interspecies interstelar interstate].each { |s| tree[s] = true } tree.longest_prefix('') #=> "inters"

I also use them to match channel names with wildcards for the Bayeux protocol; see the following:

http://github.com/jcoglan/faye/blob/master/client/channel.js

http://github.com/jcoglan/faye/blob/master/lib/faye/channel.rb

+3

jcoglan Dec 16 '09 at 17:52

source share

Just for fun, here is the version written in (SWI-) PROLOG:

 common_pre([[C|Cs]|Ss], [C|Res]) :- maplist(head_tail(C), [[C|Cs]|Ss], RemSs), !, common_pre(RemSs, Res). common_pre(_, []). head_tail(H, [H|T], T).

Launch:

 ?- S=["interspecies", "interstelar", "interstate"], common_pre(S, CP), string_to_list(CPString, CP).

gives:

 CP = [105, 110, 116, 101, 114, 115], CPString = "inters".

Explanation:

(SWI-) PROLOG treats strings as lists of character codes (numbers). The entire predicate common_pre/2 makes it a recursive pattern matching to select the first code ( C ) from the head of the first list (line, [C|Cs] ) in the list of all lists (all lines, [[C|Cs]|Ss] ) and adds the corresponding C code to the result if it is common to all (remaining) chapters of all lists (lines), otherwise it ends.

Nice, clean, simple and efficient ... :)

+3

sharky Dec 17 '09 at 6:13

source share

Javascript version based on @Svante algorithm:

 function commonSubstring(words){ var iChar, iWord, refWord = words[0], lRefWord = refWord.length, lWords = words.length; for (iChar = 0; iChar < lRefWord; iChar += 1) { for (iWord = 1; iWord < lWords; iWord += 1) { if (refWord[iChar] !== words[iWord][iChar]) { return refWord.substring(0, iChar); } } } return refWord; }

+3

Mariano Desanze Sep 14 '10 at 15:51

source share

Combining the answers from kennebec , Florian F, and jberryman , you get the following single-line Haskell:

 commonPrefix l = map fst . takeWhile (uncurry (==)) $ zip (minimum l) (maximum l)

With Control.Arrow you can get a dissolute form:

 commonPrefix = map fst . takeWhile (uncurry (==)) . uncurry zip . (minimum &&& maximum)

+3

Bolo Dec 04 2018-11-11T00:

source share

 Python 2.6 (r26:66714, Oct 4 2008, 02:48:43) >>> a = ['interspecies', 'interstelar', 'interstate'] >>> print a[0][:max( [i for i in range(min(map(len, a))) if len(set(map(lambda e: e[i], a))) == 1] ) + 1] inters

i for i in range(min(map(len, a))) , the number of maximum requests cannot be greater than the length of the shortest line; in this example, it will be evaluated as [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
len(set(map(lambda e: e[i], a))) , 1) create an array of i-th character for each line in the list; 2) make a set of it; 3) determine the size of the set
[i for i in range(min(map(len, a))) if len(set(map(lambda e: e[i], a))) == 1] , include only characters for which the size of the set is 1 (all characters in this position were the same ..); here he will evaluate the value [0, 1, 2, 3, 4, 5]
finally take max , add it and get a substring ...

Note: the above does not work for a = ['intersyate', 'intersxate', 'interstate', 'intersrate'] , but this:

  >>> index = len( filter(lambda l: l[0] == l[1], [ x for x in enumerate( [i for i in range(min(map(len, a))) if len(set(map(lambda e: e[i], a))) == 1] )])) >>> a[0][:index] inters

+2

miku Dec 16 '09 at 17:43

source share

It doesn't seem like it's tricky if you're not too concerned about maximum performance:

 def common_substring(data) data.inject { |m, s| s[0,(0..m.length).find { |i| m[i] != s[i] }.to_i] } end

One useful feature of an injection is the ability to pre-seed with the first element of the array that interacts. This avoids the zero-memo check.

 puts common_substring(%w[ interspecies interstelar interstate ]).inspect # => "inters" puts common_substring(%w[ feet feel feeble ]).inspect # => "fee" puts common_substring(%w[ fine firkin fail ]).inspect # => "f" puts common_substring(%w[ alpha bravo charlie ]).inspect # => "" puts common_substring(%w[ fork ]).inspect # => "fork" puts common_substring(%w[ fork forks ]).inspect # => "fork"

Update: if here is a golf game, then 67 characters:

 def f(d)d.inject{|m,s|s[0,(0..m.size).find{|i|m[i]!=s[i]}.to_i]}end

+2

tadman Dec 16 '09 at 17:58

source share

This is very similar to Roberto Bonvalle's decision, with the exception of ruby.

 chars = %w[interspecies interstelar interstate].map {|w| w.split('') } chars[0].zip(*chars[1..-1]).map { |c| c.uniq }.take_while { |c| c.size == 1 }.join

The first line replaces each word with an array of characters. Then I use zip to create this data structure:

[["i", "i", "i"], ["n", "n", "n"], ["t", "t", "t"], ...

map and uniq reduce this to [["i"],["n"],["t"], ...

take_while draws characters from the array until it finds one where the size is not one (which means not all characters are the same). Finally, I join them back together.

+2

Ben Marini Dec 17 '09 at 4:37

source share

the decision is broken (for example, it returns a for strings like ['a', 'ba'] ). The fix is very simple, you literally need to change only 3 characters (from indexOf(tem1) == -1 to indexOf(tem1) != 0 ), and the function will work as expected.

Unfortunately, when I tried to edit the answer to correct a typo, SO told me that "editing should be at least 6 characters long." I could change more of these three characters to improve naming and readability, but this is a bit like.

So below is a fixed and improved (at least from my point of view) version of kennebec solution:

 function commonPrefix(words) { max_word = words.reduce(function(a, b) { return a > b ? a : b }); prefix = words.reduce(function(a, b) { return a > b ? b : a }); // min word while(max_word.indexOf(prefix) != 0) { prefix = prefix.slice(0, -1); } return prefix; }

(on jsFiddle )

Note that it uses the reduce method (JavaScript 1.8) to find the alphanumeric max / min value instead of sorting the array and then extracting the first and last elements of it.

+2

Alexis Jul 23 '13 at 23:33

source share

When reading these answers with all the fantastic functional programming, sorting, and regular expressions and much more, I just thought: what's wrong with C? So, here's a dumb looking little program.

 #include <stdio.h> int main (int argc, char *argv[]) { int i = -1, j, c; if (argc < 2) return 1; while (c = argv[1][++i]) for (j = 2; j < argc; j++) if (argv[j][i] != c) goto out; out: printf("Longest common prefix: %.*s\n", i, argv[1]); }

Compile it, run it with a list of strings as command line arguments, and then let me go to use goto !

+2

skagedal Mar 30 '15 at 12:33

source share

Here is a solution using regular expressions in Ruby:

 def build_regex(string) arr = [] arr << string.dup while string.chop! Regexp.new("^(#{arr.join("|")})") end def substring(first, *strings) strings.inject(first) do |accum, string| build_regex(accum).match(string)[0] end end

+1

Yehuda Katz Dec 16 '09 at 17:47

source share

I would do the following:

Take the first line of the array as the initial start substring.
Take the next line of the array and compare the characters until the end of one of the lines is reached or a mismatch is found. If a mismatch is found, reduce the starting substring to the length where the mismatch is found.
Repeat step 2 until all lines are tested.

There is a JavaScript implementation:

 var array = ["interspecies", "interstelar", "interstate"], prefix = array[0], len = prefix.length; for (i=1; i<array.length; i++) { for (j=0, len=Math.min(len,array[j].length); j<len; j++) { if (prefix[j] != array[i][j]) { len = j; prefix = prefix.substr(0, len); break; } } }

+1

Gumbo Dec 16 '09 at 18:18

source share

Instead of sorting, you can just get min and max rows.

For me, elegance in a computer program is a balance of speed and simplicity. It should not do unnecessary calculations, and it should be simple enough to make its correctness obvious.

I could call the sorting solution smart, but not elegant.

+1

Florian F Feb 21 2018-11-21T00:

source share

It is often more elegant to use a mature open source library rather than collapse it yourself. Then, if this does not fully meet your needs, you can expand it or modify it to improve it, and let the community decide whether it belongs in the library.

diff-lcs is a good Ruby stone for the smallest common substring.

+1

nertzy Jan 23 2018-12-12T00:

source share

My solution in Java:

 public static String compute(Collection<String> strings) { if(strings.isEmpty()) return ""; Set<Character> v = new HashSet<Character>(); int i = 0; try { while(true) { for(String s : strings) v.add(s.charAt(i)); if(v.size() > 1) break; v.clear(); i++; } } catch(StringIndexOutOfBoundsException ex) {} return strings.iterator().next().substring(0, i); }

+1

fferri Jun 06 '13 at 21:25

source share

The Golfed JS solution is just for fun:

 w=["hello", "hell", "helen"]; c=w.reduce(function(p,c){ for(r="",i=0;p[i]==c[i];r+=p[i],i++){} return r; });

+1

Dan Prince Aug 08 '13 at 22:36

source share

Here's an effective solution in ruby. I based the idea of a hi / lo guessing strategy game where you are iteratively null on the longest prefix.

Someone will correct me if I am wrong, but I believe that the complexity is O (n log n), where n is the length of the shortest line, and the number of lines is considered a constant.

 def common(strings) lo = 0 hi = strings.map(&:length).min - 1 return '' if hi < lo guess, last_guess = lo, hi while guess != last_guess last_guess = guess guess = lo + ((hi - lo) / 2.0).ceil if strings.map { |s| s[0..guess] }.uniq.length == 1 lo = guess else hi = guess end end strings.map { |s| s[0...guess] }.uniq.length == 1 ? strings.first[0...guess] : '' end

And some checks in which it works:

 >> common %w{ interspecies interstelar interstate } => "inters" >> common %w{ dog dalmation } => "d" >> common %w{ asdf qwerty } => "" >> common ['', 'asdf'] => ""

+1

Ben Lee Dec 17 '13 at 21:51

source share

An interesting alternative Ruby solution:

 def common_prefix(*strings) chars = strings.map(&:chars) length = chars.first.zip( *chars[1..-1] ).index{ |a| a.uniq.length>1 } strings.first[0,length] end p common_prefix( 'foon', 'foost', 'forlorn' ) #=> "fo" p common_prefix( 'foost', 'foobar', 'foon' ) #=> "foo" p common_prefix( 'a','b' ) #=> ""

This can speed chars = strings.sort_by(&:length).map(&:chars) up if you used chars = strings.sort_by(&:length).map(&:chars) , since the shorter the first line, the shorter the arrays created by zip . However, if you care about speed, you probably shouldn't use this solution. :)

+1

Phrogz Feb 14 '14 at 16:03

source share

This is by no means elegant, but if you want it succinctly:

Ruby, 71 chars

 def f(a)b=a[0];b[0,(0..b.size).find{|n|a.any?{|i|i[0,n]!=b[0,n]}}-1]end

If you want the sweep to look like this:

 def f(words) first_word = words[0]; first_word[0, (0..(first_word.size)).find { |num_chars| words.any? { |word| word[0, num_chars] != first_word[0, num_chars] } } - 1] end

0

Jordan Running Dec 16 '09 at 17:59

source share

Javascript Clone Anyway, a great answer.

Requires Array#reduce , which is only supported in firefox.

 var strings = ["interspecies", "intermediate", "interrogation"] var sub = strings.reduce(function(l,r) { while(l!=r.slice(0,l.length)) { l = l.slice(0, -1); } return l; });

0

Chetan Sastry Dec 16 '09 at 19:09

source share

This is not a golf code, but you asked for it somewhat elegantly, and I tend to think that recursion is fun. Java

 /** Recursively find the common prefix. */ public String findCommonPrefix(String[] strings) { int minLength = findMinLength(strings); if (isFirstCharacterSame(strings)) { return strings[0].charAt(0) + findCommonPrefix(removeFirstCharacter(strings)); } else { return ""; } } /** Get the minimum length of a string in strings[]. */ private int findMinLength(final String[] strings) { int length = strings[0].size(); for (String string : strings) { if (string.size() < length) { length = string.size(); } } return length; } /** Compare the first character of all strings. */ private boolean isFirstCharacterSame(String[] strings) { char c = string[0].charAt(0); for (String string : strings) { if (c != string.charAt(0)) return false; } return true; } /** Remove the first character of each string in the array, and return a new array with the results. */ private String[] removeFirstCharacter(String[] source) { String[] result = new String[source.length]; for (int i=0; i<result.length; i++) { result[i] = source[i].substring(1); } return result; }

0

Dean J Dec 16 '09 at 19:31

source share

Ruby version based on @Svante algorithm. It runs ~ 3 times faster than the first.

  def common_prefix set i=0 rest=set[1..-1] set[0].each_byte{|c| rest.each{|e|return set[0][0...i] if e[i]!=c} i+=1 } set end

0

AShelly Dec 17 '09 at 4:19

source share

My Javascript solution :

IMOP, using sorting is too complicated. My solution compares a letter with a letter, looping an array. The returned string if no letter is specified.

This is my decision:

 var longestCommonPrefix = function(strs){ if(strs.length < 1){ return ''; } var p = 0, i = 0, c = strs[0][0]; while(p < strs[i].length && strs[i][p] === c){ i++; if(i === strs.length){ i = 0; p++; c = strs[0][p]; } } return strs[0].substr(0, p); };

0

Eric Chen Apr 01 '15 at 9:47

source share

Understanding the risk of turning this turn into a code golf match (or is this the intention?), Here is my solution using sed , copied from my answer to another SO question and reduced to 36 characters (30 of which are the actual sed expression). He expects the lines (each on a separate line) to be delivered on standard input or in files passed as extra arguments.

 sed 'N;s/^\(.*\).*\n\1.*$/\1\n\1/;D'

A script with sed in a shebang line weighs 45 characters:

 #!/bin/sed -f N;s/^\(.*\).*\n\1.*$/\1\n\1/;D

Testing a script (named longestprefix ) with lines presented as "document here":

 $ ./longestprefix <<EOF > interspecies > interstelar > interstate > EOF inters $

0

ack Dec 12 '15 at 18:07

source share

kennebec · Accepted Answer · 2009-12-16 19:26

This is a matter of taste, but this is a simple version of javascript: It sorts the array and then looks only at the first and last elements.

// the longest regular starting substring in the array

 function sharedStart(array){ var A= array.concat().sort(), a1= A[0], a2= A[A.length-1], L= a1.length, i= 0; while(i<L && a1.charAt(i)=== a2.charAt(i)) i++; return a1.substring(0, i); }

DEMONSTRATIONS

 sharedStart(['interspecies', 'interstelar', 'interstate']) //=> 'inters' sharedStart(['throne', 'throne']) //=> 'throne' sharedStart(['throne', 'dungeon']) //=> '' sharedStart(['cheese']) //=> 'cheese' sharedStart([]) //=> '' sharedStart(['prefix', 'suffix']) //=> ''

Find the longest common starting substring in a rowset

Update: My Favorite Decisions

Ruby, 71 chars

More articles: