What is the easiest way (or library) to get bigrams and trigrams in Java?

I would prefer not to use LingPipe if possible, so I am wondering whether there is a quick and easy way in Java to extract all the bigrams and trigrams from a line of text.

Thanks.

3 answers

The easiest way is always to use an existing library. You can look at the SimMetrics library, or at Lucene's NGramTokenizer. Note that NGramTokenizer produces character n-grams; for word n-grams Lucene provides ShingleFilter. You can also implement the algorithm yourself: first split the text into words (for example with StringTokenizer), then generate the n-grams you need from consecutive words.
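The do-it-yourself route described above can be sketched roughly as follows. This is only a minimal illustration, not code from any of the libraries mentioned: the class name `NGrams` and the helper `ngrams` are made up for this example, and it assumes words are separated by whitespace and that punctuation does not need special handling.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NGrams {

    // Hypothetical helper: returns every run of n consecutive words in text.
    public static List<String> ngrams(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> result = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return result; // no words, so no n-grams
        }
        // Assumption: words are separated by runs of whitespace.
        List<String> words = Arrays.asList(trimmed.split("\\s+"));
        // Slide a window of n words across the list, one word at a time.
        for (int i = 0; i + n <= words.size(); i++) {
            result.add(String.join(" ", words.subList(i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "the quick brown fox";
        System.out.println(ngrams(line, 2)); // [the quick, quick brown, brown fox]
        System.out.println(ngrams(line, 3)); // [the quick brown, quick brown fox]
    }
}
```

Passing n = 2 gives bigrams and n = 3 gives trigrams, so one method covers both cases from the question. A line with fewer than n words simply yields an empty list.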

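import java.util.Iterator;
import java.util.NoSuchElementException;

// Iterates over the word n-grams of a string (n = 2 for bigrams, n = 3 for trigrams).
public class NGramIterator implements Iterator<String> {

    private final String[] words;
    private final int n;
    private int pos = 0;

    public NGramIterator(int n, String str) {
        this.n = n;
        // Split on runs of whitespace so repeated spaces do not yield empty words.
        String trimmed = str.trim();
        words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    @Override
    public boolean hasNext() {
        // An n-gram starting at pos needs words pos .. pos + n - 1.
        return pos < words.length - n + 1;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        StringBuilder sb = new StringBuilder();
        for (int i = pos; i < pos + n; i++) {
            if (i > pos) {
                sb.append(' ');
            }
            sb.append(words[i]);
        }
        pos++;
        return sb.toString();
    }

    @Override
    public void remove() {
        throw new UnsupportedOperationException();
    }
}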

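Split str on " ". With StringTokenizer you get the individual tokens, such as "I", "am", "sample", and so on.

To build bigrams you need two tokens at a time. Inside the while loop, read the first token into s1 and the next one into s2, join s1 and s2 into s3, and add s3 to the arrayList.

s1 = "I"; s2 = "am"; s3 = s1 + " " + s2; // makes s3 = "I am"

After that, copy s2 into s1 and clear s2. On the next pass s1 is no longer empty, so only s2 is read from the tokenizer and paired with the previous token. The loop ends when there are no more tokens to read into s2.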

import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class Test
{
    public static void main(String[] args)
    {
        String str = "I am sample string and will be tokenized on space";
        List<String> bigrams = new ArrayList<String>();
        StringTokenizer itr = new StringTokenizer(str);
        if (itr.countTokens() > 1)
        {
            System.out.println("Token count : " + itr.countTokens());
            String s1 = "";
            String s2 = "";
            String s3 = "";
            while (itr.hasMoreTokens())
            {
                // Only the first pass reads s1; afterwards it already
                // holds the previous token.
                if (s1.isEmpty())
                    s1 = itr.nextToken();
                s2 = itr.nextToken();
                s3 = s1 + " " + s2;   // join the pair into one bigram
                bigrams.add(s3);
                s1 = s2;              // slide the window forward by one token
                s2 = "";
            }
        }
        else
            System.out.println("Fewer than two tokens, so there are no bigrams");
        for (String bigram : bigrams)
        {
            System.out.println(bigram);
        }
    }
}
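The question also asks about trigrams, which the two-variable approach above does not cover directly. One possible generalization, sketched here under the same assumption of whitespace-separated tokens, keeps the last n tokens in a queue instead of in s1 and s2. The class name `SlidingWindowNGrams` and its `ngrams` helper are hypothetical names chosen for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.StringTokenizer;

public class SlidingWindowNGrams {

    // Hypothetical helper: keeps the last n tokens in a deque and emits
    // one n-gram each time the window is full.
    public static List<String> ngrams(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> result = new ArrayList<>();
        Deque<String> window = new ArrayDeque<>();
        StringTokenizer tokens = new StringTokenizer(text); // splits on whitespace
        while (tokens.hasMoreTokens()) {
            window.addLast(tokens.nextToken());
            if (window.size() > n) {
                window.removeFirst(); // drop the oldest token
            }
            if (window.size() == n) {
                // The deque iterates oldest to newest, so word order is kept.
                result.add(String.join(" ", window));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String str = "I am sample string and will be tokenized on space";
        for (String trigram : ngrams(str, 3)) {
            System.out.println(trigram);
        }
    }
}
```

With n = 2 this produces the same bigrams as the loop above, and with n = 3 it produces trigrams, without needing one extra variable per word in the window.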

Source: https://habr.com/ru/post/1766840/

