Replace specific inline CSS with HTML counterparts in Perl

This is my first time using Stack Overflow, so if I did something wrong, let me know.

I am currently trying to write a "scraper", for lack of a better term, that will extract HTML and replace some inline CSS styles with their HTML counterparts. For example, I have this HTML:

<p style="text-align:center"><span style="font-weight:bold;font-style:italic;">Some random text here. What here doesn't matter so much as what needs to happen around it.</span></p>

I want to replace font-weight:bold with <b>, font-style:italic with <i>, and text-align:center with <center>. After that, I will use a regex to remove all unwanted HTML tags and any attributes. KISS definitely applies here.

I read this question: Convert CSS style attributes to HTML attributes using Perl, and several others about using HTML::TreeBuilder and other modules (like HTML::TokeParser), but so far I have come up empty.

I am new to Perl, but not new to coding in general. The logic of this does not change.

Here is what I have so far:

#!/usr/bin/perl
use warnings;
use strict;

use HTML::TreeBuilder;

my $newcont = ""; #Has to be set to something? I've seen other scripts where it doesn't...this is confusing.
my $html = <<HTML;
<p style="text-align:center"><span style="font-weight:bold;font-style:italic;">Some random text here. What here doesn't matter so much as what needs to happen around it.</span> And sometimes not all the text is styled the same.</p>
HTML

my $tb = HTML::TreeBuilder->new_from_content($html);
my @spans = $tb->look_down(_tag => q{span}) or die qq{look_down for tag failed: $!\n};

for my $span (@spans){
    #What next?? A print gives HASH, not really workable. Split doesn't seem to work...I've never felt like such a noobie coder before.
}

print $tb->as_HTML;

I hope someone can help me and show me what I may have done wrong. I am also curious what other ways there are to do this, or whether it has been done before.

Also, if someone could suggest the tags I should have used on this question, that would be great. The only one I was sure of is perl.

+3
3 answers

HTML::TreeBuilder handles the HTML side; for the CSS in the style attributes, use CSS::DOM. Something like this:

#!/usr/bin/perl
use warnings;
use strict;

use HTML::TreeBuilder;
use CSS::DOM::Style;

my $html = <<HTML;
<p style="text-align:center"><span style="font-weight:bold;font-style:italic;">Some random text here. What here doesn't matter so much as what needs to ha>
HTML

my $tb = HTML::TreeBuilder->new_from_content($html);


my @replacements = (
    { property => 'font-style', value => 'italic', replacement => 'em' },
    { property => 'font-weight', value => 'bold', replacement => 'strong' },
    { property => 'text-align', value => 'center', replacement => 'center' },
);

# build a sensible list of tag names (or just use sub { 1 })
my @nodes = $tb->look_down(sub { $_[0]->tag =~ /^(p|span)$/ });

for my $el (@nodes) {
    if ($el->attr('style')) {
        my $st = CSS::DOM::Style::parse($el->attr('style'));
        if ($st) {
            foreach my $h (@replacements) {
                if ($st->getPropertyValue($h->{property}) eq $h->{value}) {
                    $st->removeProperty($h->{property});
                    my $new = HTML::Element->new($h->{replacement});
                    foreach my $inner ($el->detach_content) {
                        $new->push_content($inner);
                    }
                    $el->push_content($new);
                }
            }
            $el->attr('style', $st->cssText ? $st->cssText : undef);
        }
    }
}

print $tb->as_HTML(undef, "\t");
+1

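The key is HTML::Element: look_down() returns HTML::Element objects. In Perl, objects are (usually) blessed hash references, which is why printing $span shows HASH.

So, inside your for-loop, you can call

 $span->method()

with any HTML::Element method. all_attr(), as_text() and replace_with() look useful here.

The full list of methods is in the module's documentation on CPAN: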

https://metacpan.org/pod/HTML::Element
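
To make the point about calling methods on $span concrete, here is a minimal sketch (not part of the original answer) that uses attr(), as_text() and replace_with() inside the loop. The sample markup is shortened from the question, and the regex check on the style string is only an assumption made for illustration. It handles just the font-weight:bold case and drops the rest of the style, so treat it as a starting point rather than a full solution.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Sample markup, shortened from the question.
my $html = '<p style="text-align:center">'
         . '<span style="font-weight:bold;font-style:italic;">Some random text here.</span>'
         . ' And sometimes not all the text is styled the same.</p>';

my $tb = HTML::TreeBuilder->new_from_content($html);

for my $span ($tb->look_down(_tag => 'span')) {
    # attr() reads one attribute; all_attr() would return every
    # name/value pair, including internal ones such as _tag.
    my $style = $span->attr('style') // '';
    print "style: $style\n";

    # as_text() returns the text content with the markup stripped.
    print "text:  ", $span->as_text, "\n";

    # Illustration only: a naive regex test on the style string.
    if ($style =~ /font-weight\s*:\s*bold/) {
        my $bold = HTML::Element->new('b');

        # Move the span's children into the new <b> element ...
        $bold->push_content($span->detach_content);

        # ... then swap the span for it in the parent's content list.
        $span->replace_with($bold);
    }
}

my $out = $tb->as_HTML;
print "$out\n";
```

Because replace_with() needs the element to have a parent, this works for the span inside the p, but it would croak if called on the root element of the tree.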

+3

$span is an HTML::Element object, as Ben says in his answer.

+2

Source: https://habr.com/ru/post/1722402/