Unable to access Gutenberg project text

Question


I have a problem accessing the Project Gutenberg library. I am using Python 2.7.3. I can import NLTK and work in Python, but when I try to download the raw texts, the site does not let me.

The text I am after is Crime and Punishment. len(raw) should equal 1176831, but I get 288 instead. Here is the code I used:

>>> from __future__ import division
>>> import nltk, re, pprint
>>> from urllib import urlopen
>>> url = "http://www.gutenberg.org/files/2554/2554.txt"
>>> raw = urlopen(url).read()
>>> type(raw)
<type 'str'>
>>> len(raw)
288
>>> raw
'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>403 Forbidden</title>\n</head><body>\n<h1>Forbidden</h1>\n<p>You don\'t have permission to access /files/2554/2554.txt\non this server.</p>\n<hr>\n<address>Apache Server at www.gutenberg.org Port 80</address>\n</body></html>\n'

+4

python python-2.7 urllib

user1799092 Nov 05 '12 at 3:15


2 answers

from urllib import urlopen
# Note the "-0" suffix in the file name.
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
raw = urlopen(url).read()
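Whichever URL is tried, it helps to confirm that the download is the book and not an error page. Here is a minimal sketch that uses the 403 page from the question as sample input. The helper name `looks_like_html_error` is made up for illustration, and the heuristic (a plain-text file should not start with an HTML doctype) is an assumption, not something Project Gutenberg documents:

```python
def looks_like_html_error(raw):
    """Heuristic: a plain-text book should not begin with HTML markup."""
    if isinstance(raw, bytes):
        # Python 3's read() returns bytes; latin-1 decodes any byte value.
        raw = raw.decode("latin-1")
    head = raw[:200].lstrip().lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


# The body the server returned in the question (its 403 page).
forbidden_page = (
    '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n'
    '<title>403 Forbidden</title>\n</head><body>\n<h1>Forbidden</h1>\n'
    "<p>You don't have permission to access /files/2554/2554.txt\n"
    'on this server.</p>\n<hr>\n'
    '<address>Apache Server at www.gutenberg.org Port 80</address>\n'
    '</body></html>\n'
)

print(looks_like_html_error(forbidden_page))                    # True
print(looks_like_html_error("CRIME AND PUNISHMENT\n\nPART I"))  # False
```

A check like this only reports that something went wrong; it does not say why the server refused the request.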

-1

Valentin Vrzheshch Aug 30 '17 at 23:56


Ray Toal · Accepted Answer · 2012-11-05T03:18:39+0000

The reason for the HTTP 403 response can be found here. Basically, the site states that it is intended for human (non-automated) users, and that any perceived use of automated tools to access the website will result in a temporary or permanent block of your IP address or subnet.

Your code "should work", but the website detects that you are accessing it from code rather than from a browser. That is all I will say. :)
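To make this kind of failure visible in code, one option is to check the HTTP status before treating the body as the book. The following is a minimal sketch, not a way around the block. `fetch_text` and `FakeResponse` are hypothetical names, and the stand-in response exists only so the example runs without contacting the site. In Python 2.7, `urllib.urlopen` returns the error page without raising (which is what happened in the question), and `getcode()` on the response gives the status. In Python 3, `urllib.request.urlopen` raises `HTTPError` for a 403 instead.

```python
try:
    from urllib import urlopen            # Python 2, as in the question
except ImportError:
    from urllib.request import urlopen    # Python 3


def fetch_text(url, opener=urlopen):
    """Return the response body, or raise IOError if the status is not 200."""
    response = opener(url)
    status = response.getcode()
    if status != 200:
        raise IOError("HTTP %s for %s" % (status, url))
    return response.read()


class FakeResponse(object):
    """Stand-in for a urlopen() result, so the sketch runs offline."""

    def __init__(self, status, body):
        self._status = status
        self._body = body

    def getcode(self):
        return self._status

    def read(self):
        return self._body


try:
    fetch_text("http://www.gutenberg.org/files/2554/2554.txt",
               opener=lambda url: FakeResponse(403, "<h1>Forbidden</h1>"))
except IOError as error:
    print(error)  # HTTP 403 for http://www.gutenberg.org/files/2554/2554.txt
```

With the check in place, the 288-character surprise becomes an explicit error at download time rather than a puzzle later in the NLTK pipeline.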
