I have a series of HTML files that are parsed into a single text file using Beautiful Soup. The HTML files are formatted so that each one always produces three lines in the text file, so the output looks something like this:
    Hello!
    How are you?
    Well, Bye!

But it could just as easily be:

    83957
    And I ain't coming back!
    hgu39hgd
In other words, the content of the HTML files varies from file to file, but each one always produces three lines.
So I was wondering where I should start if I want to take the text file produced by Beautiful Soup and parse it into a CSV file with columns like this (using the examples above):
    Title     Intro                       Tagline
    Hello!    How are you?                Well, Bye!
    83957     And I ain't coming back!    hgu39hgd
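Python code used to strip the HTML and write the text file:

    import os
    import glob
    import codecs
    import csv
    from bs4 import BeautifulSoup

    path = "c:\\users\\me\\downloads\\"

    # Extract the text from every HTML file in the folder
    for infile in glob.glob(os.path.join(path, "*.html")):
        with codecs.open(infile, "r", "utf-8") as html_file:
            # Name the parser explicitly so bs4 does not have to guess one
            soup = BeautifulSoup(html_file.read(), "html.parser")
        # Append this file's text to the combined output
        with open("extracted.txt", "a") as myfile:
            myfile.write(soup.get_text())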
And I understand that I can use this to customize the columns in my CSV file:
    csv.put_HasColumnNames(True)
    csv.SetColumnName(0,"title")
    csv.SetColumnName(1,"intro")
    csv.SetColumnName(2,"tagline")
Where I'm drawing a blank is how to iterate through the text file (extracted.txt) one line at a time and, as I get to each new line, put it into the correct cell in the CSV file. The first few lines of the file are blank, and there are many blank lines between each grouping of text. So, first I need to open the file and read it:
    # Iterating over the file object reads it one line at a time
    with open("extracted.txt") as f:
        for line in f:
            pass
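Since the file has blank lines at the top and between groups, one possible starting point is to filter those out while iterating. The following is only a sketch, assuming Python 3; the helper name `nonblank_lines` and the sample text are made up for illustration, with an in-memory string standing in for extracted.txt.

```python
import io


def nonblank_lines(lines):
    """Yield each line with surrounding whitespace removed, skipping blank ones."""
    for line in lines:
        stripped = line.strip()
        if stripped:
            yield stripped


# Hypothetical sample standing in for the contents of extracted.txt.
sample = io.StringIO("\n\nHello!\n\n\nHow are you?\n   \nWell, Bye!\n")
print(list(nonblank_lines(sample)))
# prints ['Hello!', 'How are you?', 'Well, Bye!']
```

Because the helper accepts any iterable of strings, an open file object could be passed to it in the same way.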
Also, I don't know how to tell Python to just keep reading the file and adding to the CSV file until it reaches the end. In other words, there is no way to know in advance exactly how many lines in total the HTML files will produce, so I can't just go from csv.SetCell(0,0) to csv.SetCell(999,999).
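One way this could be approached, sketched here under some assumptions: the loop below uses the standard-library `csv.writer` (rather than the `SetColumnName`/`SetCell` style calls above, which are not part of that module) and simply keeps collecting non-blank lines into rows of three until the input runs out, so the number of rows never needs to be known. It assumes Python 3; the function names `rows_of_three` and `write_csv`, and the output file name mentioned in the comment, are hypothetical.

```python
import csv
import io


def rows_of_three(lines):
    """Group non-blank lines into rows of three: title, intro, tagline."""
    row = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # ignore blank lines between groups
        row.append(text)
        if len(row) == 3:
            yield row
            row = []
    if row:
        # Leftover lines mean some file did not produce exactly three lines;
        # emit them as a short row rather than dropping them silently.
        yield row


def write_csv(lines, out):
    """Write a header row, then one CSV row per group of three lines."""
    writer = csv.writer(out)
    writer.writerow(["title", "intro", "tagline"])
    for row in rows_of_three(lines):
        writer.writerow(row)


# Demo with in-memory text standing in for extracted.txt. With real files it
# might look like:
#   with open("extracted.txt") as src, open("output.csv", "w", newline="") as dst:
#       write_csv(src, dst)
sample = io.StringIO(
    "\n\nHello!\n\nHow are you?\n\nWell, Bye!\n\n\n"
    "83957\nAnd I ain't coming back!\n\nhgu39hgd\n"
)
result = io.StringIO()
write_csv(sample, result)
print(result.getvalue())
```

Note that `csv.writer` quotes any field that contains a comma, so a line such as "Well, Bye!" still ends up in a single cell.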