Removing noise words in SQL Server 2005 full-text indexing

In a fairly typical scenario, I have a "Search" text box in my web application that passes user input directly to a stored procedure, which then uses full-text indexing to search two fields in two tables that are joined on the appropriate keys.

I use the CONTAINS predicate to search the fields. Before passing the search string in, I do the following:

SET @ftQuery = '"' + REPLACE(@query,' ', '*" OR "') + '*"'

This turns the castle into "the*" OR "castle*", for example. It is necessary because I want people to be able to search on a prefix such as cas and get results for castle.
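To make the transformation concrete, here is a minimal C# sketch of the same string manipulation as the T-SQL REPLACE expression. The class and method names are hypothetical and exist only for illustration.

```csharp
using System;

public static class FtQueryDemo
{
    // Mirrors the T-SQL expression:
    //   '"' + REPLACE(@query, ' ', '*" OR "') + '*"'
    // Every space becomes *" OR " and the whole string is wrapped in quotes,
    // so each word ends up as a quoted prefix term.
    public static string BuildPrefixQuery(string query)
    {
        return "\"" + query.Replace(" ", "*\" OR \"") + "*\"";
    }
}
```

Calling FtQueryDemo.BuildPrefixQuery("the castle") returns "the*" OR "castle*", which shows that the noise word is now a prefix term like any other word.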

WHERE CONTAINS(Building.Name, @ftQuery) OR CONTAINS(Road.Name, @ftQuery)

The problem is that now that I append a wildcard to each word, noise words (like the) get a wildcard too and therefore no longer appear to be dropped. This means that a search for the castle also returns items containing words such as theater.

Changing OR to AND was my first thought, but that seems to return no matches at all if the query contains a noise word.

All I'm trying to do is let the user enter a few space-separated words, each being either a whole word or the prefix of a word they are looking for, in any order, and drop noise words such as the from their input. Otherwise, when they search for the castle, they get a large list of items with the result they need somewhere in the middle of the list.


One way to handle this is to strip the noise words in application code before the query is built:

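1) Open the noise word file, C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData\noiseENU.txt.

2) Replace the line breaks with "," so that the words can be pasted into a list initializer: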

public static List<string> _noiseWords = new List<string>{ "about", "1", "after", "2", "all", "also", "3", "an", "4", "and", "5", "another", "6", "any", "7", "are", "8", "as", "9", "at", "0", "be", "$", "because", "been", "before", "being", "between", "both", "but", "by", "came", "can", "come", "could", "did", "do", "does", "each", "else", "for", "from", "get", "got", "has", "had", "he", "have", "her", "here", "him", "himself", "his", "how", "if", "in", "into", "is", "it", "its", "just", "like", "make", "many", "me", "might", "more", "most", "much", "must", "my", "never", "no", "now", "of", "on", "only", "or", "other", "our", "out", "over", "re", "said", "same", "see", "should", "since", "so", "some", "still", "such", "take", "than", "that", "the", "their", "them", "then", "there", "these", "they", "this", "those", "through", "to", "too", "under", "up", "use", "very", "want", "was", "way", "we", "well", "were", "what", "when", "where", "which", "while", "who", "will", "with", "would", "you", "your", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z" };

3) Then loop over the words of the search string and keep only those that are not noise words:

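// Keep only the words that are not in the noise list.
List<string> goodWords = new List<string>();
// RemoveEmptyEntries avoids empty "words" when the input has repeated spaces.
string[] words = searchString.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
foreach (string word in words)
{
   // The noise list is lowercase, so compare in lowercase ("The" is noise too).
   if (!_noiseWords.Contains(word.ToLowerInvariant()))
      goodWords.Add(word);
}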
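The three steps above can be combined with the prefix query format from the question. The following is only a sketch under assumptions: the class and method names are hypothetical, and the noise list is abbreviated to a handful of entries from the list above. A real version would load every entry from noiseENU.txt.

```csharp
using System;
using System.Collections.Generic;

public static class NoiseWordFilter
{
    // Abbreviated, illustrative noise list; the comparer makes lookups case-insensitive.
    private static readonly HashSet<string> NoiseWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "the", "a", "an", "and", "of", "in" };

    // Returns a CONTAINS search condition such as "cas*" OR "road*",
    // or null when the input contains nothing but noise words.
    public static string BuildContainsQuery(string searchString)
    {
        var terms = new List<string>();
        string[] words = searchString.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string word in words)
        {
            // Strip double quotes so user input cannot break out of the quoted term.
            string clean = word.Replace("\"", "");
            if (clean.Length == 0 || NoiseWords.Contains(clean))
                continue;
            terms.Add("\"" + clean + "*\"");
        }
        return terms.Count == 0 ? null : string.Join(" OR ", terms);
    }
}
```

With this sketch, "the castle" becomes "castle*" and the noise word never reaches CONTAINS. The null return lets the caller decide what to do when nothing searchable was entered, instead of sending an empty search condition to the stored procedure.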


Here is a function that strips the noise words with a regular expression. The noiseENU.txt file is copied from \Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData.

    Public Function StripNoiseWords(ByVal s As String) As String
        Dim NoiseWords As String = ReadFile("/Standard/Core/Config/noiseENU.txt").Trim
        Dim NoiseWordsRegex As String = Regex.Replace(NoiseWords, "\s+", "|") ' about|after|all|also etc.
        NoiseWordsRegex = String.Format("\s?\b(?:{0})\b\s?", NoiseWordsRegex)
        Dim Result As String = Regex.Replace(s, NoiseWordsRegex, " ", RegexOptions.IgnoreCase) ' replace each noise word with a space
        Result = Regex.Replace(Result, "\s+", " ") ' eliminate any multiple spaces
        Return Result
    End Function
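One thing to watch with a regex built straight from the noise file: the file contains symbols such as $, which is a regex metacharacter. As a hedged variation, here is a C# sketch that escapes every entry before joining them. The names are hypothetical, and the file contents are passed in as a string so no path is assumed. Unlike the function above, it matches whole whitespace-delimited tokens rather than using \b, so a noise word followed by punctuation is left alone.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class NoiseWordRegex
{
    // noiseFileContents: the whitespace-separated text of the noise file.
    public static string StripNoiseWords(string input, string noiseFileContents)
    {
        // Escape each entry so symbols like "$" are matched literally.
        string alternation = string.Join("|",
            noiseFileContents
                .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
                .Select(Regex.Escape));

        // Lookarounds instead of \b so that non-word entries such as "$"
        // still match, and only as complete space-delimited tokens.
        string pattern = @"(?<!\S)(?:" + alternation + @")(?!\S)";

        // Replace each noise token with a space, then collapse repeated whitespace.
        string result = Regex.Replace(input, pattern, " ", RegexOptions.IgnoreCase);
        return Regex.Replace(result, @"\s+", " ").Trim();
    }
}
```

For example, with a noise list of "the of $", the input "the castle of $ kings" is reduced to "castle kings", while "theater" is left untouched because the match has to cover a whole token.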

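Alternatively, SQL Server can identify the noise words itself through sys.dm_fts_parser (available from SQL Server 2008). For the language id, see: http://msdn.microsoft.com/en-us/library/ms190303.aspx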

Dim queryTextWithoutNoise As String = removeNoiseWords(queryText, ConnectionString, 1033)

Function removeNoiseWords(ByVal inputText As String, ByVal cnStr As String, ByVal languageID As Integer) As String

    Dim r As New System.Text.StringBuilder
    Try
        If inputText.Contains(CChar("""")) Then
            r.Append(inputText)
        Else
            Using cn As New SqlConnection(cnStr)

                Const q As String = "SELECT display_term,special_term FROM sys.dm_fts_parser(@q,@l,0,0)"
                cn.Open()
                Dim cmd As New SqlCommand(q, cn)
                With cmd.Parameters
                    .Add(New SqlParameter("@q", """" & inputText & """"))
                    .Add(New SqlParameter("@l", languageID))
                End With
                Dim dr As SqlDataReader = cmd.ExecuteReader
                While dr.Read
                    If Not (dr.Item("special_term").ToString.Contains("Noise")) Then
                        r.Append(dr.Item("display_term").ToString)
                        r.Append(" ")
                    End If
                End While
            End Using
        End If
    Catch ex As Exception
        ' ...        
    End Try
    Return r.ToString

End Function

In my case the indexed fields are short, i.e. nvarchar(100), and there are around 50,000 rows.

My solution was to remove all noise words from the noise word file and let the indexer rebuild the index with every word included. The index still consists of just a few thousand records.

Then I apply the space replacement to the search string as described in my original post, so that CONTAINS works both for multiple words and for a single word.

It seems to work very well, but I will closely monitor the performance.


Source: https://habr.com/ru/post/1737407/