List of dates in the text

I have a text document containing 32 articles, and I want to extract the date of each article. I noticed that the date appears on the fifth line of each article. So far, I have split the text into 32 articles using:

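    import re

    sections = []
    current = []
    with open("Aberdeen2005.txt") as f:
        for line in f:
            # A header such as "1 of 32 DOCUMENTS" starts a new article.
            if re.search(r"(?i)\d+ of \d+ DOCUMENTS", line):
                if current:
                    sections.append("".join(current))
                current = [line]
            else:
                current.append(line)
    # Append the final article, which the loop itself never flushes.
    if current:
        sections.append("".join(current))
    print(len(sections))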

I would like to create a list containing one date per article, with only the MONTH and YEAR:

[image]

As you can see, the date follows the format shown in the figure above, but the weekday (for example, Thursday) is sometimes missing.

Any ideas?

Yours faithfully,

Andres

P.S. Here is another example, from the 16th document: [image]

3 answers

Using a regex inside the if branch, you can strip the weekday from the line:

    # Capture "Month D, YYYY" and drop the 6-9 letter weekday that follows it.
    regx = re.compile(r'(\w+\s\d{1,2},\s\d{4})\s\w{6,9}')
    line = regx.sub(r"\1", line)

Example:

https://regex101.com/r/pJ0nZ8/1
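As a quick check of the substitution, here is a minimal Python 3 sketch applied to two sample lines. The sample strings and the helper name strip_weekday are assumptions for illustration, not part of the original script:

```python
import re

# "Month D, YYYY" followed by whitespace and a 6-9 letter word (the weekday).
DATE_WITH_WEEKDAY = re.compile(r"(\w+\s\d{1,2},\s\d{4})\s\w{6,9}")

def strip_weekday(line):
    """Drop a weekday that follows a date; lines without one are unchanged."""
    return DATE_WITH_WEEKDAY.sub(r"\1", line)

print(strip_weekday("December 29, 2005 Thursday"))
print(strip_weekday("February 1, 2015"))
```

Note that the pattern does not check that the trailing word really is a weekday: any 6-9 letter word directly after the date would be removed as well.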

linecache method:

Using the linecache module, you can read line 5 specifically and write it to a file; if the date includes a weekday, the weekday is stripped first. Much more is possible with this approach, but I will leave the details to you.

    import linecache

    w = 'Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'
    l = linecache.getline("Aberdeen2005.txt",5)  # line numbers are 1-based
    m = [d in l for d in w]  # which weekday names occur in the line
    c = '2005','2016' # years (optional)
    if any(y in l for y in c): # check for years (optional)
        if any(x in l for x in w):
            r = [i for i,v in enumerate(m,0) if v]
            l = l.replace(' '+w[r[0]],'')  # remove the first weekday found
        with open("dates.txt", "a") as article_dates:
            article_dates.write(l)
    linecache.clearcache()
+1
source

Or you can find the pattern inside your string using re. For instance:

    >>> import re
    >>> date1 = 'December 29, 2005 Thursday'
    >>> date2 = 'February 1, 2015'
    >>> re.findall("[A-Za-z]+ [0-9]{1,2}, [0-9]{4}", date1)
    ['December 29, 2005']
    >>> re.findall("[A-Za-z]+ [0-9]{1,2}, [0-9]{4}", date2)
    ['February 1, 2015']

If findall returns a non-empty list, the line contains a date, and the match is that date without the weekday.
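Building on this pattern, the following sketch reduces each article to its month and year. It assumes the first date-like string in an article is the article's date; the two sample article texts and the helper name month_and_year are made up for illustration:

```python
import re

# Month name, day, and four-digit year; a trailing weekday is simply not matched.
DATE_RE = re.compile(r"([A-Za-z]+) [0-9]{1,2}, ([0-9]{4})")

def month_and_year(section):
    """Return 'Month YYYY' for the first date found in a section, or None."""
    match = DATE_RE.search(section)
    if match is None:
        return None
    return f"{match.group(1)} {match.group(2)}"

# Hypothetical article texts standing in for the real sections list.
sections = [
    "1 of 32 DOCUMENTS\n\nSome Newspaper\n\nDecember 29, 2005 Thursday\n\nBody text",
    "2 of 32 DOCUMENTS\n\nSome Newspaper\n\nFebruary 1, 2015\n\nBody text",
]
dates = [month_and_year(s) for s in sections]
print(dates)
```

If the article header can contain an earlier date-like string, restricting the search to the fifth line of each section (for example via section.splitlines()) would be the safer choice.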

+1
source

I would try the dateutil.parser library. It is a little awkward to work with, but its job is to take strings that look like dates and convert them into datetime objects. I have found it to be pretty competent.

The documentation is here, and the function you want is parse() (i.e. dateutil.parser.parse()).
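A minimal sketch of that approach, assuming the third-party python-dateutil package is installed and that the date lines look like the sample strings below. The samples and the helper name month_year are assumptions for illustration:

```python
from dateutil import parser  # third-party: pip install python-dateutil

def month_year(date_line):
    """Parse a date line and format it as 'Month YYYY'."""
    parsed = parser.parse(date_line.strip())
    return parsed.strftime("%B %Y")

print(month_year("December 29, 2005 Thursday"))
print(month_year("February 1, 2015"))
```

parse() raises a ValueError (or a subclass of it) when the string cannot be read as a date, so wrapping the call in try/except is sensible if some lines may not be dates.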

0
source

Source: https://habr.com/ru/post/1241307/

