I have a problem accessing Project Gutenberg texts. I am using Python 2.7.3. I can import NLTK and work with it in Python, but when I try to download a raw text, the server refuses the request.
The text in question is Crime and Punishment. len(raw) should equal 1176831, but I get 288 instead. Here is the code I used:
>>> from __future__ import division
>>> import nltk, re, pprint
>>> from urllib import urlopen
>>> url = "http://www.gutenberg.org/files/2554/2554.txt"
>>> raw = urlopen(url).read()
>>> type(raw)
<type 'str'>
>>> len(raw)
288
>>> raw
'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>403 Forbidden</title>\n</head><body>\n<h1>Forbidden</h1>\n<p>You don\'t have permission to access /files/2554/2554.txt\non this server.</p>\n<hr>\n<address>Apache Server at www.gutenberg.org Port 80</address>\n</body></html>\n'
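The 288 characters above are the body of the server's 403 Forbidden page, not the book. Python 2's urllib.urlopen returns that page without raising an error, which is why the length looks wrong rather than the call failing. Below is a minimal sketch of how the download could be checked. It assumes, without confirmation from the server, that the request is being rejected because of the default urllib User-Agent; the helper names build_request, looks_like_error_page and fetch_text and the User-Agent string are made up for illustration.

```python
try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen  # Python 2

URL = "http://www.gutenberg.org/files/2554/2554.txt"


def build_request(url, user_agent="Mozilla/5.0 (compatible; nltk-example)"):
    """Build a request with an explicit User-Agent header.

    Assumption: the server rejects the default urllib User-Agent.
    The header value here is an arbitrary illustrative string.
    """
    return Request(url, headers={"User-Agent": user_agent})


def looks_like_error_page(raw):
    """Return True if the body starts like an HTML page instead of plain text."""
    head = raw[:200].lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


def fetch_text(url):
    """Download url and fail loudly if an HTML page comes back.

    With urllib2 / urllib.request, a 403 response raises HTTPError
    instead of silently returning the error page.
    """
    raw = urlopen(build_request(url)).read()
    if looks_like_error_page(raw):
        raise ValueError("server returned an HTML page, not the book text")
    return raw
```

Calling raw = fetch_text(URL) would then either return the text or raise, so a blocked download cannot be mistaken for a short book. If the request is still refused, the cause is something other than the User-Agent, for example the server blocking the client's address.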