Trim too long words from sentences in C#?

Question

I have C# strings containing sentences. Sometimes these sentences are fine; sometimes they are just random characters typed by a user. What I would like to do is trim the words inside these sentences. For example, given the following string:

var stringWithLongWords = "Here a text with tooooooooooooo long words";

I would like to run this through a filter:

 var trimmed = TrimLongWords(stringWithLongWords, 6);

And get an output where each word can only contain up to 6 characters:

 "Here a text with tooooo long words"

Any ideas how this can be done with good performance? Is there anything in .NET that can handle this automatically?

I am currently using the following code:

    private static string TrimLongWords(string original, int maxCount)
    {
        return string.Join(" ", original.Split(' ')
            .Select(x => x.Substring(0, x.Length > maxCount ? maxCount : x.Length)));
    }

This works in theory, but it gives bad results if the long word ends with a separator other than a space. For instance:

This is sweeeeeeeeeeeeeeeeeet! And something more.

Ends up as follows:

This is sweeeeeeee And something more.

Update:

OK, the comments were so good that I realized this could involve too many "what ifs". Perhaps it would be better to forget about the separators. Instead, if a word gets truncated, it could be shown with three dots. Here are some examples with words truncated to 5 characters:

Apocalypse now! → Apoca... now!

Apocalypse! → Apoca...

!Example! → !Examp...

This is sweeeeeeeeeeeeeeeeeet! And something more. → This is sweee... And somet... more.

+6

c#

Mikael Koskinen Jul 11 '13 at 11:33

9 answers

I would recommend using StringBuilder along with loops:

    public string TrimLongWords(string input, int maxWordLength)
    {
        StringBuilder sb = new StringBuilder(input.Length);
        int currentWordLength = 0;
        bool stopTripleDot = false;
        foreach (char c in input)
        {
            bool isLetter = char.IsLetter(c);
            if (currentWordLength < maxWordLength || !isLetter)
            {
                sb.Append(c);
                stopTripleDot = false;
                if (isLetter)
                    currentWordLength++;
                else
                    currentWordLength = 0;
            }
            else if (!stopTripleDot)
            {
                sb.Append("...");
                stopTripleDot = true;
            }
        }
        return sb.ToString();
    }

It will be faster than Regex or Linq.
Expected results for maxWordLength == 6:

    "UltraLongWord" -> "UltraL..."
    "This-is-not-a-long-word" -> "This-is-not-a-long-word"

And the edge case maxWordLength == 0 will result in:

 "Please don't trim me!!!" -> "... ...'... ... ...!!!" // poor, poor string...

[This answer has been updated to accommodate `"..."` as requested in the question]

(I just realized that replacing trimmed substrings with "..." introduced quite a few bugs, and fixing them made my code a bit cumbersome, sorry)

+4

Nolonar Jul 11 '13 at 11:47

Try the following:

    private static string TrimLongWords(string original, int maxCount)
    {
        return string.Join(" ", original.Split(' ')
            .Select(x =>
            {
                // Strip non-word characters, truncate, then re-append the non-word characters.
                var r = Regex.Replace(x, @"\W", "");
                return r.Substring(0, r.Length > maxCount ? maxCount : r.Length)
                    + Regex.Replace(x, @"\w", "");
            }));
    }

Then TrimLongWords("This is sweeeeeeeeeeeeeeeet! And something more.", 5) becomes "This is sweee! And somet more."

+2

dav_i Jul 11 '13 at 11:41

This is more efficient than the regex or LINQ approach. However, it does not split by words and does not add "...". Whitespace (including line breaks and tabs) should also be shortened, IMHO.

    public static string TrimLongWords(string original, int maxCount)
    {
        if (null == original || original.Length <= maxCount) return original;

        StringBuilder builder = new StringBuilder(original.Length);
        int occurence = 0;

        for (int i = 0; i < original.Length; i++)
        {
            Char current = original[i];
            // Count how many times the current character has repeated in a row.
            if (current == original.ElementAtOrDefault(i-1))
                occurence++;
            else
                occurence = 1;
            if (occurence <= maxCount)
                builder.Append(current);
        }
        return builder.ToString();
    }

+2

Tim Schmelter Jul 11 '13 at 11:47

You can use regex to search for repetitions:

    string test = "This is sweeeeeeeeeeeeeeeet! And sooooooomething more.";
    string result = Regex.Replace(test, @"(\w)\1+", delegate(Match match)
    {
        // Collapse each run of a repeated character to a single character.
        string v = match.ToString();
        return v[0].ToString();
    });

Result:

 This is swet! And something more.

And maybe you could check the resulting words with a spell-checker service: http://wiki.webspellchecker.net/doku.php?id=installationandconfiguration:web_service

+2

cansik Jul 11 '13 at 11:56

Try the following:

    class Program
    {
        static void Main(string[] args)
        {
            var stringWithLongWords = "Here a text with tooooooooooooo long words";
            var trimmed = TrimLongWords(stringWithLongWords, 6);
        }

        private static string TrimLongWords(string stringWithLongWords, int p)
        {
            // Match only words longer than p characters and keep their first p characters,
            // so that words of exactly p characters are left untouched.
            return Regex.Replace(stringWithLongWords, String.Format(@"\w{{{0},}}", p + 1),
                m => m.Value.Substring(0, p) + "...");
        }
    }

+2

Alex Filipovici Jul 11 '13 at 12:01

Using a simple Regex with a zero-width positive lookbehind assertion (LINQPad sample code):

    void Main()
    {
        foreach (var s in new[] {
            "Here a text with tooooooooooooo long words",
            "This is sweeeeeeeeeeeeeeeet! And something more.",
            "Apocalypse now!",
            "Apocalypse!",
            "!Example!" })
            Regex.Replace(s, @"(?<=\w{5,})\S+", "...").Dump();
    }

It matches any run of non-whitespace characters that is preceded by 5 word characters and replaces the match with "...".

Result:

    Here a text with toooo... long words
    This is sweee... And somet... more.
    Apoca... now!
    Apoca...
    !Examp...
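
For reuse outside LINQPad, the same lookbehind pattern can be wrapped in a method that takes the maximum length as a parameter. This is only a sketch: the class name LookbehindTrimmer and the parameter names are made up for illustration and are not part of the answer.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class LookbehindTrimmer
{
    // Hypothetical wrapper around the lookbehind pattern shown above.
    public static string TrimLongWords(string input, int maxLength)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));
        if (maxLength < 0) throw new ArgumentOutOfRangeException(nameof(maxLength));

        string n = maxLength.ToString(CultureInfo.InvariantCulture);

        // (?<=\w{n,}) succeeds only where at least n word characters sit
        // directly before the current position; \S+ then consumes the rest
        // of the non-whitespace run, including any trailing punctuation.
        string pattern = @"(?<=\w{" + n + @",})\S+";
        return Regex.Replace(input, pattern, "...");
    }
}
```

Calling LookbehindTrimmer.TrimLongWords("Apocalypse now!", 5) gives "Apoca... now!". Note that punctuation attached to a truncated word is swallowed, and a word of exactly maxLength characters followed directly by punctuation (such as "words!") is also affected, because \S+ matches the punctuation.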

+2

sloth Jul 11 '13 at 12:07

A more practical approach might be what @Kurt suggested in the comments.

I can’t immediately think of any English words that contain 3 identical letters in a row. Instead of just cutting off a word after 6 characters, you can try this approach: whenever you encounter the same character twice in a row, delete all subsequent successive occurrences of it. Thus, "sweeeeeet" becomes "sweet" and "tooooooo" becomes "too".

This would have the additional side effect of limiting the number of identical punctuation or space characters to 2, in case someone was overzealous with those !!!!!!!!

If you want to allow for ellipses (...), then just make the "maximum consecutive characters" count 3 instead of 2.
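
A minimal sketch of that idea, assuming a plain character-by-character comparison is good enough. The names RunLimiter, LimitRuns and maxRun are hypothetical and do not come from the answer or the comments.

```csharp
using System;
using System.Text;

public static class RunLimiter
{
    // Hypothetical helper: keeps at most maxRun consecutive copies of any
    // character (letters, punctuation and spaces alike) and drops the rest.
    public static string LimitRuns(string input, int maxRun = 2)
    {
        if (maxRun < 1) throw new ArgumentOutOfRangeException(nameof(maxRun));
        if (string.IsNullOrEmpty(input)) return input;

        var sb = new StringBuilder(input.Length);
        int run = 0;
        for (int i = 0; i < input.Length; i++)
        {
            // Extend the current run if this character equals the previous one,
            // otherwise start a new run of length 1.
            run = (i > 0 && input[i] == input[i - 1]) ? run + 1 : 1;
            if (run <= maxRun)
                sb.Append(input[i]);
        }
        return sb.ToString();
    }
}
```

With the default of 2, LimitRuns("sweeeeeet") returns "sweet" and LimitRuns("tooooooo") returns "too". Passing 3 as maxRun leaves an ellipsis such as "wait..." intact.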

+2

BTownTKD Jul 11 '13 at 12:43

The following will limit the number of repeated characters to 6. So, for your input "This is sweeeeeeeeeeeeeeeet! And something more." the output will be:

"This is sweeeeeet! And something more."

    string s = "heloooooooooooooooooooooo worrrllllllllllllld!";
    char[] chr = s.ToCharArray();
    StringBuilder sb = new StringBuilder();
    char currentchar = new char();
    int charCount = 0;
    foreach (char c in chr)
    {
        if (c == currentchar)
        {
            charCount++;
        }
        else
        {
            charCount = 0;
        }
        if (charCount < 6)
        {
            sb.Append(c);
        }
        currentchar = c;
    }
    Console.WriteLine(sb.ToString());
    // Output: heloooooo worrrlllllld!

EDIT: Truncate words longer than 6 characters:

    string s = "This is sweeeeeeeeeeeeeeeet! And something more.";
    string[] words = s.Split(' ');
    StringBuilder sb = new StringBuilder();
    foreach (string word in words)
    {
        char[] chars = word.ToCharArray();
        if (chars.Length > 6)
        {
            for (int i = 0; i < 6; i++)
            {
                sb.Append(chars[i]);
            }
            sb.Append("...").Append(" ");
        }
        else
        {
            sb.Append(word).Append(" ");
        }
    }
    // Remove the trailing space added after the last word.
    sb.Remove(sb.Length - 1, 1);
    Console.WriteLine(sb.ToString());
    // Output: "This is sweeee... And someth... more."

+1

Riv Jul 11 '13 at 12:10

Joey · Accepted Answer · 2013-07-11T11:43:08+0000

EDIT: Since the requirements have changed, I'll stay in the spirit of regular expressions:

 Regex.Replace(original, string.Format(@"(\p{{L}}{{{0}}})\p{{L}}+", maxLength), "$1...");

Output with maxLength = 6:

    Here a text with tooooo... long words
    This is sweeee...! And someth... more.
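
As a self-contained sketch, the one-liner can be wrapped like this. The class name WordTrimmer and the argument checks are assumptions added for illustration; only the pattern itself comes from the answer.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class WordTrimmer
{
    // Hypothetical wrapper around the Regex.Replace call shown above.
    public static string TrimLongWords(string original, int maxLength)
    {
        if (original == null) throw new ArgumentNullException(nameof(original));
        if (maxLength < 0) throw new ArgumentOutOfRangeException(nameof(maxLength));

        string n = maxLength.ToString(CultureInfo.InvariantCulture);

        // (\p{L}{n}) captures the first n letters of a word; \p{L}+ requires at
        // least one more letter, so words of exactly n letters are not touched.
        // Punctuation is not a letter and therefore survives after the dots.
        string pattern = @"(\p{L}{" + n + @"})\p{L}+";
        return Regex.Replace(original, pattern, "$1...");
    }
}
```

WordTrimmer.TrimLongWords("This is sweeeeeeeeeeeeeeeet! And something more.", 6) returns "This is sweeee...! And someth... more.", matching the output above. Building the pattern by concatenation instead of string.Format avoids the doubled braces, at the cost of a slightly longer expression.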

I am keeping the old answer below because I liked the approach, even though it is a bit... messy :-).

I hacked together a small regex replacement to do this. It is in PowerShell for now (for prototyping; I will convert it to C# later):

    'Here a text with tooooooooooooo long words','This is sweeeeeeeeeeeeeeeet! And something more.' |
      % {
        [Regex]::Replace($_, '(\w*?)(\w)\2{2,}(\w*)',
          {
            $m = $args[0]
            if ($m.Value.Length -gt 6) {
              # Shorten only the repeated part so the whole word fits into 6 characters.
              $l = 6 - $m.Groups[1].Length - $m.Groups[3].Length
              $m.Groups[1].Value + $m.Groups[2].Value * $l + $m.Groups[3].Value
            } else {
              # Short enough already: keep the match unchanged.
              $m.Value
            }
          })
      }

Output:

    Here a text with tooooo long words
    This is sweeet! And something more.

What this does is search for character runs ( \w for now, should be changed to something reasonable) that follow the pattern (something)(repeated character more than two times)(something else) . For the replacement it uses a function that checks whether the match exceeds the desired maximum length, then calculates how long the repeated part can be so that the whole still fits into the total length, and then shortens only the repeated part to that length.

It is messy. It fails to truncate words that are otherwise very long (for example, "something" in the second test sentence), and the set of characters that make up words needs to be changed as well. Consider it a starting point if you want to go down this route, not a finished solution.

C# code:

    public static string TrimLongWords(this string original, int maxCount)
    {
        return Regex.Replace(original, @"(\w*?)(\w)\2{2,}(\w*)", delegate(Match m)
        {
            // Groups[0] is the whole match; the three captures start at index 1.
            var first = m.Groups[1].Value;
            var rep = m.Groups[2].Value;
            var last = m.Groups[3].Value;
            if (m.Value.Length > maxCount)
            {
                // Clamp at 0 so a long prefix/suffix cannot produce a negative repeat count.
                var l = Math.Max(0, maxCount - first.Length - last.Length);
                return first + new string(rep[0], l) + last;
            }
            return m.Value;
        });
    }

A nicer choice for the character class would probably be something like \p{L} , depending on your needs.
