Unicode character search in Python

Question

I am working on a Python / NLTK based NLP project with non-English unicode text. For this, I need to search for a unicode string inside a sentence.

There is a .txt file saved with some non-English unicode sentences. Using the NLTK PunktSentenceTokenizer, I split the text into sentences and saved them in a Python list.

sentences = PunktSentenceTokenizer().tokenize(text)

Now I can iterate through the list and get each sentence separately.

What I need to do is go through each sentence and determine which words contain specific unicode characters.

Example -

 sentence = 'AASFG BBBSDC FEKGG SDFGF'

Suppose the text above is non-English unicode, and I need to find words ending in GF, and then return the whole word (maybe the index of that word).

 search = 'SDFGF'

Similarly, I need to find words starting with BB and get the whole word.

 search2 = 'BBBSDC'

python unicode nltk

ChamingaD · Aug 04 '13 at 12:39

1 answer

dbr · Accepted Answer · 2013-08-04 12:53

If I understand correctly, you just need to divide the sentence into words, iterate over each of them and check whether it ends or begins with the required characters, for example:

 >>> sentence = 'AASFG BBBSDC FEKGG SDFGF'.split()
 >>> sentence
 ['AASFG', 'BBBSDC', 'FEKGG', 'SDFGF']
 >>> [word for word in sentence if word.endswith("GF")]
 ['SDFGF']

The split() call can probably be replaced with something like nltk.tokenize.word_tokenize(...), applied to the original string.
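As a minimal Python 3 sketch of the same idea on actual non-English text: the helper name words_matching and the Russian sample sentence are assumptions made for illustration, not part of the question. It normalizes both the sentence and the search string to NFC first, since a character typed as "base letter + combining mark" would otherwise not compare equal to its precomposed form.

```python
import unicodedata


def words_matching(sentence, prefix="", suffix=""):
    """Return the words of `sentence` that start with `prefix` and end with `suffix`.

    Hypothetical helper: an empty prefix or suffix matches every word.
    """
    def nfc(s):
        # Bring composed and decomposed forms of the same character together
        return unicodedata.normalize("NFC", s)

    words = nfc(sentence).split()  # whitespace split, as in the answer
    prefix, suffix = nfc(prefix), nfc(suffix)
    return [w for w in words if w.startswith(prefix) and w.endswith(suffix)]


sample = "мама мыла раму вчера"  # assumed sample text
print(words_matching(sample, suffix="ла"))  # ['мыла']
print(words_matching(sample, prefix="ра"))  # ['раму']
```

In Python 3 every str is unicode, so startswith and endswith need no special handling. In Python 2 the same checks work as long as both the sentence and the search string are unicode objects rather than byte strings.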

Update, regarding the comment:

How to get a word before and after it

The enumerate function can be used to give each word a number, for example:

 >>> print(list(enumerate(sentence)))
 [(0, 'AASFG'), (1, 'BBBSDC'), (2, 'FEKGG'), (3, 'SDFGF')]

Then, if you do the same loop but keep the index:

 >>> results = [(idx, word) for (idx, word) in enumerate(sentence) if word.endswith("GG")]
 >>> print(results)
 [(2, 'FEKGG')]

.. you can use the index to get the next or previous element:

 >>> for r in results:
 ...     r_idx = r[0]
 ...     print("Prev", sentence[r_idx-1])
 ...     print("Next", sentence[r_idx+1])
 ...
 Prev BBBSDC
 Next SDFGF

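You will need to handle the case where the match is the first or last word (if r_idx == 0, or if r_idx == len(sentence) - 1). For the first word, sentence[r_idx-1] becomes sentence[-1] and silently returns the last word; for the last word, sentence[r_idx+1] raises an IndexError.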
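One way to handle those edge cases, as a sketch only: the function name matches_with_neighbours and the choice of None for a missing neighbour are assumptions, not something from the answer above.

```python
def matches_with_neighbours(words, suffix):
    """Return (index, previous, word, next) for every word ending in `suffix`.

    Hypothetical helper: a missing neighbour is reported as None.
    """
    results = []
    for idx, word in enumerate(words):
        if word.endswith(suffix):
            # Guard both ends explicitly: words[-1] would wrap around to the
            # last word, and words[len(words)] would raise IndexError
            prev_word = words[idx - 1] if idx > 0 else None
            next_word = words[idx + 1] if idx < len(words) - 1 else None
            results.append((idx, prev_word, word, next_word))
    return results


words = ['AASFG', 'BBBSDC', 'FEKGG', 'SDFGF']
print(matches_with_neighbours(words, "GG"))  # [(2, 'BBBSDC', 'FEKGG', 'SDFGF')]
print(matches_with_neighbours(words, "GF"))  # [(3, 'FEKGG', 'SDFGF', None)]
```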
