Splitting a string into sentences with LINQ while respecting quotes

How can I split a text into sentences at periods, question marks, exclamation marks, and so on? I want to get each sentence one by one, except that a terminator inside quotation marks must not end the sentence.

For example, split this:

Walked. Turned back. But why? And said "Hello world. Damn this string splitting things!" without a shame.

Like this:

Walked. 
Turned back. 
But why? 
And said "Hello world. Damn this string splitting things!" without a shame.

I am using this code:

    private List<String> FindSentencesWhichContainsWord(string text, string word)
    {
        string[] sentences = text.Split(new char[] { '.', '?', '!' }, StringSplitOptions.RemoveEmptyEntries);

        // Define the search terms. This list could also be dynamically populated at runtime.
        string[] wordsToMatch = { word };

        // Find sentences that contain all the terms in the wordsToMatch array.
        // Note that the number of terms to match is not specified at compile time.
        var sentenceQuery = from sentence in sentences
                            let w = sentence.Split(new char[] { '.', '?', '!', ' ', ';', ':', ',' },
                                                    StringSplitOptions.RemoveEmptyEntries)
                            where w.Distinct().Intersect(wordsToMatch).Count() == wordsToMatch.Count()
                            select sentence;

        // Execute the query. Note that you can explicitly type
        // the iteration variable here even though sentenceQuery
        // was implicitly typed.

        List<String> rtn = new List<string>();
        foreach (string str in sentenceQuery)
        {
            rtn.Add(str);
        }
        return rtn;
    }

But this produces the result below, which is not what I want: the quoted text is split as well.

Walked. 
Turned back. 
But why? 
And said "Hello world.
Damn this string splitting things!
" without a shame.
+4
4 answers

I think this problem can be solved in two stages:

  • Use TextFieldParser (from the Microsoft.VisualBasic.FileIO namespace) to correctly identify the quoted words

    string str = "Walked. Turned back. But why? And said \"Hello world. Damn this string splitting things!\" without a shame.";
    string[] words = null;
    using (TextFieldParser parser = new TextFieldParser(new StringReader(str))){
        parser.Delimiters = new string[] { " " };
        parser.HasFieldsEnclosedInQuotes = true;
        words = parser.ReadFields();                
    }    
    
  • Use the previous result to build a new list of strings according to the behavior you need.

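    List<string> newWords = new List<string>();
    string accWord = "";
    foreach (string word in words) {
        if (word.Contains(" ")) // a field containing spaces was a quoted phrase, so restore its quotes
            accWord += (accWord.Length > 0 ? " " : "") + "\"" + word + "\"";
        else {
            accWord += (accWord.Length > 0 ? " " : "") + word;
            // an unquoted word that ends with a terminator closes the sentence
            if (word.EndsWith(".") || word.EndsWith("!") || word.EndsWith("?")) {
                newWords.Add(accWord);
                accWord = "";
            }
        }
    }
    if (accWord.Length > 0) // keep trailing text that has no terminator
        newWords.Add(accWord);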
    

Result newWords:

[2016-01-28 08:29:48.534 UTC] Walked.
[2016-01-28 08:29:48.536 UTC] Turned back.
[2016-01-28 08:29:48.536 UTC] But why?
[2016-01-28 08:29:48.536 UTC] And said "Hello world. Damn this string splitting things!" without a shame.


+2

[The text of this answer did not survive extraction. The remaining fragments refer to Manning and Schutze, to Nubilosoft, to input in MS Word DOC(X) or HTML format, to abbreviations such as "dr.", and to a figure of 99%.]

+1

This is not a bulletproof solution, but it could be implemented like this. I did the sentence and quote detection manually:

void Main()
{
    var text = "Walked. Turned back. But why? And said \"Hello world. Damn this string splitting things!\" without a shame.";
    var result = SplitText(text);
}

private static List<String> SplitText(string text)
{
    var result = new List<string>();

    var sentenceEndings = new HashSet<char> { '.', '?', '!' };

    var startIndex = 0;
    var isQuote = false;
    for (var i = 0; i < text.Length; i++)
    {
        var c = text[i];
        if (c == '"')
        {
            // Entering or leaving a quoted span.
            isQuote = !isQuote;
            continue;
        }

        // A terminator outside quotes ends the current sentence.
        if (!isQuote && sentenceEndings.Contains(c))
        {
            // Trim instead of assuming exactly one space between sentences.
            result.Add(text.Substring(startIndex, i + 1 - startIndex).Trim());
            startIndex = i + 1;
        }
    }

    // Keep any trailing text that has no terminator.
    var rest = text.Substring(startIndex).Trim();
    if (rest.Length > 0)
        result.Add(rest);

    return result;
}
+1

I used TakeWhile, which keeps taking characters until it reaches a separator, unless that separator is inside quotation marks.

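var separators = new[] {'.', '?', '!'};

string str =
    @"Walked. Turned back. But why? And said ""Hello world. Damn this string splitting things!"" without a shame.";

List<string> result = new List<string>();
int index = 0;
bool quotes = false;
while (index < str.Length)
{
    int start = index;

    // Take characters until a separator outside quotes; the predicate also advances index.
    var word = str.Skip(index).TakeWhile(ch =>
    {
        index++;
        if (ch == '"') quotes = !quotes;
        return quotes || !separators.Contains(ch);
    });

    // Materialize once so that index is final.
    string sentence = string.Join("", word);

    // TakeWhile consumed the separator without yielding it, so re-attach it.
    if (sentence.Length < index - start) sentence += str[index - 1];

    result.Add(sentence.Trim());
}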
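
To connect this back to the question, here is a minimal sketch of how a quote-aware split could feed the original word search. It swaps the TakeWhile loop for a regular expression, which is a different technique from the one above, and it assumes that double quotes are always balanced. The names SentenceSearch, SplitSentences and FindSentencesContainingWord are hypothetical and do not come from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SentenceSearch
{
    // Matches whitespace that follows '.', '?' or '!' and is followed by an even
    // number of double quotes, i.e. whitespace that is not inside a quoted span.
    // Assumption: quotes in the input are balanced.
    private static readonly Regex Boundary =
        new Regex(@"(?<=[.?!])\s+(?=(?:[^""]*""[^""]*"")*[^""]*$)");

    // Hypothetical helper: splits text into sentences and keeps the terminators.
    public static List<string> SplitSentences(string text)
    {
        return Boundary.Split(text)
                       .Select(s => s.Trim())
                       .Where(s => s.Length > 0)
                       .ToList();
    }

    // Hypothetical helper: returns the sentences that contain the given word.
    public static List<string> FindSentencesContainingWord(string text, string word)
    {
        var wordSeparators = new[] { '.', '?', '!', ' ', ';', ':', ',', '"' };
        return SplitSentences(text)
            .Where(s => s.Split(wordSeparators, StringSplitOptions.RemoveEmptyEntries)
                         .Contains(word))
            .ToList();
    }
}
```

Calling SentenceSearch.FindSentencesContainingWord(text, "Damn") on the sample text should then return the whole quoted sentence as a single item. Abbreviations such as "dr." would still be treated as sentence ends, so this is only suitable for simple input.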
+1

Source: https://habr.com/ru/post/1626306/

