Convert Early Modern English to 20th Century Spelling Using NLTK

Question

I have a list of strings that are all Early Modern English words ending in 'th'. They include the older forms of "appoints", "requires", etc.; all are conjugated for the third person singular.

As part of a much larger project (using my computer to transform the Gutenberg etext of Gargantua and Pantagruel into something more similar to 20th-century English, so that it will be easier for me to read), I want to remove the last two or three characters from all these words and replace them with "s", then use a slightly modified function on the words that are still not modernized. Both functions are included below.

My main problem is that I just can't get typing right in Python. I find that part of the language really confusing at the moment.

Here is a function that removes th:

from __future__ import division
import nltk, re, pprint

def ethrema(word):
    if word.endswith('th'):
        return word[:-2] + 's'

Here is a function that removes extraneous e:

def ethremb(word):
    if word.endswith('es'):
        return word[:-2] + 's'

Therefore, the words "abateth" and "accuseth" should pass through ethrema but not through ethremb(ethrema(word)), while the word "abhorreth" should pass through both.

If anyone can think of a more efficient way to do this, I'm all ears.

Here is the result of my very amateurish attempt to use these functions on a tokenized list of words that need modernization:

>>> eth1 = [w.ethrema() for w in text]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'ethrema'

So yes, this is really a typing issue. These are the first functions I have ever written in Python, and I have no idea how to apply them to real objects.

+3

python text nlp nltk

magnetar 28 . '10 17:16

1

Studer · Accepted Answer · 2010-08-28T17:19:54+0000

ethrema() is not a method of str, so it has to be called as a plain function:

eth1 = [ethrema(w) for w in text]
#AND
eth2 = [ethremb(w) for w in text]
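
A minimal sketch of the difference, using the question's ethrema and a made-up sample list (the three tokens are only an assumption for illustration):

```python
def ethrema(word):
    # The function exactly as posted in the question.
    if word.endswith('th'):
        return word[:-2] + 's'

# Hypothetical sample; the real list would be the tokenized etext.
text = ['abateth', 'accuseth', 'and']

# Method-call syntax fails: str objects have no 'ethrema' attribute.
try:
    eth1 = [w.ethrema() for w in text]
except AttributeError as err:
    print(err)

# Function-call syntax works, but words without 'th' come back as None,
# because the function has no return for that case.
eth1 = [ethrema(w) for w in text]
print(eth1)  # ['abates', 'accuses', None]
```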

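EDIT:

ethremb(ethrema(word)) will not work until the functions are changed to return the word unchanged when it does not match, for example: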

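def ethrema(word):
    # Replace a trailing 'th' with 's'; return other words unchanged.
    if word.endswith('th'):
        return word[:-2] + 's'
    else:
        return word

def ethremb(word):
    # Replace a trailing 'es' with 's'; return other words unchanged.
    if word.endswith('es'):
        return word[:-2] + 's'
    else:
        return word

#OR

def ethrema(word):
    # Handle both endings in one function; only the first match applies.
    if word.endswith('th'):
        return word[:-2] + 's'
    elif word.endswith('es'):
        return word[:-2] + 's'
    else:
        return word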
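
To apply the two separate functions to a token list, something like the following might work. The sample tokens, the `second_pass` set and the `modernize` helper are made up for illustration; which words actually need the second step would have to come from the real word list.

```python
def ethrema(word):
    # Replace a trailing 'th' with 's'; return other words unchanged.
    if word.endswith('th'):
        return word[:-2] + 's'
    return word

def ethremb(word):
    # Replace a trailing 'es' with 's'; return other words unchanged.
    if word.endswith('es'):
        return word[:-2] + 's'
    return word

# Hypothetical sample tokens and a hypothetical set of words
# that should also go through ethremb.
tokens = ['abateth', 'accuseth', 'abhorreth', 'and']
second_pass = {'abhorreth'}

def modernize(word):
    # Always apply ethrema; apply ethremb only to the listed words.
    result = ethrema(word)
    if word in second_pass:
        result = ethremb(result)
    return result

modernized = [modernize(w) for w in tokens]
print(modernized)  # ['abates', 'accuses', 'abhorrs', 'and']
```

Running ethremb over every word would also turn 'abates' into 'abats', which is why the second step is restricted to selected words here.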
