Unable to access Project Gutenberg text

I have a problem accessing texts from the Project Gutenberg library. I am using Python 2.7.3. I can import NLTK and work with it in Python, but when I try to download a raw text, the site does not let me.

The text I am trying to fetch is Crime and Punishment. len(raw) should equal 1176831, but I get 288 instead. Here is the code I used:

>>> from __future__ import division
>>> import nltk, re, pprint
>>> from urllib import urlopen
>>> url = "http://www.gutenberg.org/files/2554/2554.txt"
>>> raw = urlopen(url).read()
>>> type(raw)
<type 'str'>
>>> len(raw)
288
>>> raw
'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>403 Forbidden</title>\n</head><body>\n<h1>Forbidden</h1>\n<p>You don\'t have permission to access /files/2554/2554.txt\non this server.</p>\n<hr>\n<address>Apache Server at www.gutenberg.org Port 80</address>\n</body></html>\n'
2 answers

The reason for the HTTP 403 response is stated by Project Gutenberg itself. Basically, the site is intended for human (non-automated) users: "Any perceived use of automated tools to access our website will result in a temporary or permanent block of your IP address or subnet."

Your code "should work", but the website detects that you are accessing it from code rather than from a browser. That is all I will say. :)
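The 288-character result in the question is the body of the 403 error page, which read() returns like any other content. A minimal sketch of checking the status code before trusting the body follows; the helper name fetch_text and its opener parameter are hypothetical, introduced only for illustration, and the sketch does not change how the server treats automated requests.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen  # Python 2, as used in the question


def fetch_text(url, opener=urlopen):
    """Return the response body, or raise IOError if the status is not 200.

    fetch_text and opener are illustrative names, not part of urllib or NLTK.
    """
    response = opener(url)
    status = response.getcode()  # HTTP status code, e.g. 200 or 403
    if status != 200:
        # Refuse to hand back an error page as if it were the book text.
        raise IOError("HTTP %s for %s" % (status, url))
    return response.read()
```

Calling fetch_text(url) with the URL from the question would then raise an error when the server answers 403, instead of returning the short HTML error page. Under Python 3, urlopen itself already raises an HTTPError (a subclass of IOError) for a 403, so the explicit check mainly matters for the Python 2 urllib used here.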

from urllib import urlopen
# Note the "-0" suffix in the file name
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
raw = urlopen(url).read()

Source: https://habr.com/ru/post/1444032/

