Find a large cluster of a specific word in a block of text

I have a block of text (of arbitrary length) in which a specific word is highlighted in yellow wherever it appears. I want to show only a 400-word excerpt, and it should be the excerpt that contains the most highlighted words.

Does anyone know a good algorithm for this?

I have the character position of each highlighted word, so should the algorithm find the densest cluster in a set of unevenly spaced integers?

+3
4 answers

I'm not sure how you determine which words are highlighted, but here is a simple O(n) approach that I would try.

( 400), , , , , , . , . , 400 .

, .
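
One O(n) way to find the densest stretch, assuming the highlight positions are already sorted character offsets, is a two-pointer sliding window. This is only a sketch: the name `densest_window` and the parameter `window_chars` (a character count standing in for roughly 400 words) are illustrative and not part of the question. It relies on the fact that some best window always starts exactly at a highlight, because a window can be shifted right to its first highlight without losing any.

```python
def densest_window(positions, window_chars):
    """Return (start_offset, count) for the window of `window_chars` characters
    that starts at a highlight and contains the most highlights.

    `positions` must be sorted character offsets of the highlighted words.
    """
    best_start, best_count = 0, 0
    right = 0
    for left, start in enumerate(positions):
        # Advance `right` past every highlight inside [start, start + window_chars).
        while right < len(positions) and positions[right] < start + window_chars:
            right += 1
        count = right - left
        if count > best_count:
            best_start, best_count = start, count
    return best_start, best_count


positions = [10, 50, 3000, 3100, 3150, 3900, 9000]
print(densest_window(positions, 2000))  # (3000, 4)
```

Both pointers only move forward, so the scan is linear in the number of highlights once they are sorted.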

+6

( 400 ), . , , 400 .

+2

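Another option is a hashtable keyed by the character position (charPos) divided by a bucket size. Note: the division is integer division, so 4200/2000 = 2.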

bucket = charPos // 2000  # integer division: 4200 // 2000 == 2
if bucket in charPositionHashtable:
    charPositionHashtable[bucket] += 1
else:
    charPositionHashtable[bucket] = 1

charPositionHashtable then maps each 2000-character bucket to the number of highlighted words in it. Take the bucket with the max count. This is O(n).
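
The bucket counting above can be sketched as a small self-contained function. The name `densest_bucket` is hypothetical, and the default bucket size of 2000 characters is taken from the example division; `collections.Counter` replaces the explicit key check.

```python
from collections import Counter


def densest_bucket(positions, bucket_size=2000):
    """Count highlights per fixed-size bucket; return (bucket_start, count)."""
    counts = Counter(pos // bucket_size for pos in positions)
    if not counts:
        return 0, 0
    # max() keeps the first bucket seen when several share the top count.
    bucket, count = max(counts.items(), key=lambda item: item[1])
    return bucket * bucket_size, count


print(densest_bucket([10, 50, 3000, 3100, 3150, 3900, 9000]))  # (2000, 4)

# A tight cluster that straddles a bucket boundary is split in two:
print(densest_bucket([1900, 1950, 2010, 2050]))  # (0, 2)
```

The second call shows the trade-off of fixed buckets: four highlights within 150 characters are reported as two buckets of two, so the result is an approximation of the densest region rather than an exact answer.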

+1

. , - , , "" ( ). " ", , . "" "" , .

The method that counts how many highlighted indices fall within the chunk size ahead of a given position could probably be done better, I think.

Pseudocode:

string GetHighestDensityChunk(){

// {chunk size} = 400 * average word length
// {highlighted indices} = sorted character positions of the highlighted words
// {possible start positions} = 0, each highlighted index, and (sample.length - {chunk size})

int bestPositionSoFar = 0
int maxHighlightedCountSoFar = 0


for each position in {possible start positions}
{
    highlightedCount = GetNumberOfHighlightedWithinChunkSize(position)

    if(highlightedCount > maxHighlightedCountSoFar)
    {
        maxHighlightedCountSoFar = highlightedCount
        bestPositionSoFar = position
    }
}

// "round up" to nearest word end
// gives index of next space after end of chunk starting from current best position
// (falls back to the end of the sample when there is no such space)
{chunk end} = sample.indexOf(' ', startingAt = bestPositionSoFar + {chunk size})
if({chunk end} == -1) {chunk end} = sample.length

// substring(start, end) takes an end index, not a length
return sample.substring(bestPositionSoFar, {chunk end})
}


int GetNumberOfHighlightedWithinChunkSize(position)
{
    numberOfHighlightedInRange = 0

    // starts from the first highlighted index at or after position and scans forward,
    // counting highlighted indices that are in range; 0 and the final start position
    // are candidate starts only and are not counted as highlights themselves
    for(int i = {highlighted indices}.firstIndexAtOrAfter(position); i < {highlighted indices}.length; i++){
        if({highlighted indices}[i] < position + {chunk size}){
            numberOfHighlightedInRange++;
        } else {
            break;
        }
    }
    return numberOfHighlightedInRange;
}
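
For comparison, here is a minimal runnable Python sketch of the same idea. It is an illustration under assumptions, not a definitive implementation: the function name `highest_density_chunk`, the sample text, and the chunk size of 10 characters are made up for the example, and the counting step uses binary search (`bisect_left`) instead of the forward scan.

```python
import re
from bisect import bisect_left


def highest_density_chunk(sample, highlights, chunk_size):
    """Return the substring of about `chunk_size` characters with the most highlights.

    `highlights` is a sorted list of character offsets into `sample`.
    """
    # Candidate starts: the beginning, every highlight, and the last full chunk.
    starts = [0, *highlights, max(0, len(sample) - chunk_size)]
    best_start, best_count = 0, 0
    for start in starts:
        # Highlights inside [start, start + chunk_size), found by binary search.
        lo = bisect_left(highlights, start)
        hi = bisect_left(highlights, start + chunk_size)
        if hi - lo > best_count:
            best_start, best_count = start, hi - lo
    # Extend to the next space so the chunk does not end in the middle of a word.
    end = sample.find(" ", best_start + chunk_size)
    if end == -1:
        end = len(sample)
    return sample[best_start:end]


sample = "aa bb cat cat cat dd ee ff cat gg"
highlights = [m.start() for m in re.finditer(r"\bcat\b", sample)]
print(highest_density_chunk(sample, highlights, 10))  # cat cat cat
```

With binary search each candidate start costs O(log n), so the whole search is O(n log n) in the number of highlights.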
+1

Source: https://habr.com/ru/post/1712884/

