Python Goose cannot extract date

I am using Python Goose. You can find it in this link.

I want to retrieve the published date, but when I ran:

    from goose import Goose

    g = Goose()
    entity = g.extract(url="mylink")
    date = entity.publish_date

I got None.

I tried this on many sites, and the result was always None.

Any tips?

2 answers

I just checked the relevant part of the source, crawler.py. The publish_date extraction is currently commented out:

    # TODO
    # article.publish_date = config.publishDateExtractor.extract(doc)

Looking further, if you uncomment the line above, you can plug in your own custom date extractor. However, Goose does not ship with a default date extractor. See the method set_publishdate_extractor at https://github.com/grangier/python-goose/blob/master/goose/configuration.py
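As a rough illustration of what such a custom extractor might look like, here is a minimal sketch. The only thing taken from the Goose source is that the commented-out line calls extract(doc) on the configured object. The class name TimeTagDateExtractor, the choice to read a <time datetime="..."> element, and the assumption that doc behaves like an ElementTree-style node (lxml nodes share this API) are all hypothetical, and the example is not wired into Goose itself.

```python
import xml.etree.ElementTree as ET
from datetime import datetime


class TimeTagDateExtractor(object):
    """Hypothetical extractor: reads the first parseable <time datetime="..."> element."""

    def extract(self, doc):
        # Assumption: doc is an ElementTree-style element exposing iter() and get().
        for node in doc.iter("time"):
            value = node.get("datetime")
            if not value:
                continue
            try:
                # This sketch only parses the leading YYYY-MM-DD part of the value.
                return datetime.strptime(value[:10], "%Y-%m-%d")
            except ValueError:
                continue
        return None


# Stand-in for the parsed document that a crawler would pass in.
page = ET.fromstring(
    "<html><body><time datetime='2014-03-05T10:00:00Z'>5 March</time></body></html>"
)
published = TimeTagDateExtractor().extract(page)
```

An object like this would presumably be handed to the configuration via the setter mentioned above, but whether that setter accepts an arbitrary object with an extract method is something to check against the Goose version you have installed.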


Since 2014, this feature has been implemented in python-goose in extractors/publishdate.py, so article.publish_date now returns a date, but only if one is available in one of the following metadata fields:

    rnews:datePublished
    article:published_time
    OriginalPublicationDate
    datePublished
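To check whether a page exposes any of these fields at all, a small standalone lookup can help. The sketch below uses only the Python 3 standard library and does not call Goose. The names KNOWN_DATE_FIELDS, MetaDateParser, and find_publish_date are made up for this example, and matching the field against the property, name, or itemprop attribute (then reading content or datetime) is an assumption about how pages usually mark these up, not a copy of Goose's logic.

```python
from html.parser import HTMLParser

# The four metadata fields listed above, in the order they are checked here.
KNOWN_DATE_FIELDS = (
    "rnews:datePublished",
    "article:published_time",
    "OriginalPublicationDate",
    "datePublished",
)


class MetaDateParser(HTMLParser):
    """Collects the raw date string for each known field found in the page."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Assumption: the date lives in a content or datetime attribute.
        value = attributes.get("content") or attributes.get("datetime")
        if not value:
            return
        for key in ("property", "name", "itemprop"):
            field = attributes.get(key)
            if field in KNOWN_DATE_FIELDS and field not in self.found:
                self.found[field] = value


def find_publish_date(html):
    """Return the first matching raw date string, or None if no field is present."""
    parser = MetaDateParser()
    parser.feed(html)
    for field in KNOWN_DATE_FIELDS:
        if field in parser.found:
            return parser.found[field]
    return None
```

If find_publish_date returns None for a page, that would be consistent with article.publish_date also coming back as None for the same page.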

Source: https://habr.com/ru/post/1502566/

