This is my first time using Stack Overflow, so if I did something wrong, let me know.
I am currently trying to write a "scraper", for lack of a better term, that will extract HTML and replace some inline CSS styles with their HTML equivalents. For example, I have this HTML:
<p style="text-align:center"><span style="font-weight:bold;font-style:italic;">Some random text here. What here doesn't matter so much as what needs to happen around it.</span></p>
I want to replace font-weight:bold with <b>, font-style:italic with <i>, and text-align:center with <center>. Afterwards, I will use a regex to strip out any unneeded HTML tags and all attributes. KISS definitely applies here.
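To make the intended mapping concrete, here is a minimal sketch in core Perl of how a style attribute value could be turned into a list of replacement tags. The %TAG_FOR table and the tags_for_style helper are names made up for illustration, and the sketch assumes only exact property:value pairs need to be recognized (no shorthand properties, no !important).

```perl
use strict;
use warnings;

# Hypothetical lookup table: normalized "property:value" => replacement tag.
my %TAG_FOR = (
    'font-weight:bold'  => 'b',
    'font-style:italic' => 'i',
    'text-align:center' => 'center',
);

# Return the replacement tags for one style attribute value, in the order
# the declarations appear. Declarations not in the table are skipped.
sub tags_for_style {
    my ($style) = @_;
    my @tags;
    for my $decl ( split /;/, $style ) {
        $decl =~ s/\s+//g;    # "font-weight : bold" -> "font-weight:bold"
        next unless length $decl;
        my $tag = $TAG_FOR{ lc $decl };
        push @tags, $tag if defined $tag;
    }
    return @tags;
}

# prints "b,i"
print join( ',', tags_for_style('font-weight:bold;font-style:italic;') ), "\n";
```

Keeping the mapping in a single hash means supporting another style later would only need one more entry.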
I read this question: "Convert CSS style attributes to HTML attributes using Perl", and several others about HTML::TreeBuilder and other modules (like HTML::TokeParser), but so far I have come up empty.
I am new to Perl, but not new to coding in general; the logic is the same in any language.
Here is what I have so far:
use warnings;
use strict;
use HTML::TreeBuilder;
my $newcont = ""; #Has to be set to something? I've seen other scripts where it doesn't...this is confusing.
my $html = <<HTML;
<p style="text-align:center"><span style="font-weight:bold;font-style:italic;">Some random text here. What here doesn't matter so much as what needs to happen around it.</span> And sometimes not all the text is styled the same.</p>
HTML
my $tb = HTML::TreeBuilder->new_from_content($html);
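# look_down does not set $!, so report the failure in plain words instead.
my @spans = $tb->look_down(_tag => q{span}) or die qq{look_down found no <span> elements\n};
for my $span (@spans){
    # TODO: turn each span's inline style into <b>/<i> wrappers here
}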
print $tb->as_HTML;
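One way the loop body above might be filled in, as a rough sketch rather than a tested solution: visit every element that has a style attribute, wrap its children in the matching tag, drop the attribute, and then splice the now-empty spans out of the tree. The %tag_for table is a hypothetical name, the sample HTML is shortened, and the sketch assumes that nesting the new tag inside the original element is acceptable (note that <center> inside <p> is not valid in strict HTML).

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical mapping from a normalized "property:value" pair to a tag name.
my %tag_for = (
    'font-weight:bold'  => 'b',
    'font-style:italic' => 'i',
    'text-align:center' => 'center',
);

my $html = '<p style="text-align:center">'
         . '<span style="font-weight:bold;font-style:italic;">Some text.</span>'
         . ' More text.</p>';

my $tb = HTML::TreeBuilder->new_from_content($html);

# look_down returns the full list up front, so editing the tree while
# looping over that list is safe.
for my $el ( $tb->look_down( sub { defined $_[0]->attr('style') } ) ) {
    my $style = $el->attr('style');
    $el->attr( 'style', undef );    # passing undef deletes the attribute

    for my $decl ( split /;/, $style ) {
        $decl =~ s/\s+//g;
        my $tag = $tag_for{ lc $decl } or next;

        # Move the element's current children into a new wrapper element.
        my $wrapper = HTML::Element->new($tag);
        $wrapper->push_content( $el->detach_content );
        $el->push_content($wrapper);
    }
}

# The spans carry no information any more: replace each with its content.
for my $span ( $tb->look_down( _tag => 'span' ) ) {
    $span->replace_with_content;
    $span->delete;
}

# An empty hashref as the third argument asks for every closing tag.
my $out = $tb->as_HTML( undef, undef, {} );
$tb->delete;    # free the tree explicitly
print $out, "\n";
```

Doing the replacement on the parsed tree rather than with regexes on the raw string avoids having to match opening and closing tags by hand.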
I hope someone can help me and show me what I may have done wrong. I am also curious whether there are other ways to do this, or whether it has been done before.
Also, if someone could suggest which tags I should have used on this question, that would be great. The only one I am sure of is perl.