
regex - Fixing HTML attribute values with double quotes in them

Question:

I have a set of HTML files with illegal syntax in tag attributes: unescaped double quotes inside double-quoted attribute values. For example,

<a name="Conductor, "neutral""></a>

or

<meta name="keywords" content="Conductor, "hot",Conductor, "neutral",Hot wire,Neutral wire,Double insulation,Conductor, "ground",Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," />

or

<b>Table of Contents:</b><ul class="xoxo"><li><a href="1.html" title="Page 1: What are "series" and "parallel" circuits?">What are "series" and "parallel" circuits?</a>

I'm trying to process the files with Perl's XML::Twig module using parsefile_html($file_name). When it reads a file that has this syntax, it gives this error:

x has an invalid attribute name 'y""' at C:/strawberry/perl/site/lib/XML/Twig.pm line 893

What I need is either a way to make the module accept the bad syntax and deal with it, or a regular expression to find and replace double quotes in attributes with single quotes.

Answer:

Given your HTML sample, the code below works:

use Modern::Perl;

my $html = <<end;
<meta name="keywords" content="Conductor, "hot",Conductor, "neutral",Hot wire,Neutral wire,Double insulation,Conductor, "ground",Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," />
<a name="Conductor, "neutral""></a>
end

$html =~ s/(?<=content=")(.*?)(?="\s*\/>)/do{my $capture = $1; $capture =~ s|"||g;$capture}/eg;
$html =~ s/(?<=name=")(.*?)(?="\s*>)/do{my $capture = $1; $capture =~ s|"||g;$capture}/eg;

say $html;

Output:

<meta name="keywords" content="Conductor, hot,Conductor, neutral,Hot wire,Neutral wire,Double insulation,Conductor, ground,Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," />
<a name="Conductor, neutral"></a>

One caveat: Perl does not support arbitrary variable-length look-behind, so the patterns above will fail to match if there is any whitespace before or after the equals signs. However, the pages were most likely generated consistently, in which case this will not be a problem.
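If whitespace around the equals signs does turn up, one possible workaround is \K (available since Perl 5.10), which keeps the text matched so far out of the replacement and has no fixed-width restriction. The following is only a sketch under that assumption: the sample input with extra spaces and the helper name strip_quotes are made up for illustration and are not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical sample input: note the spaces around the equals signs.
my $html = <<'END';
<meta name = "keywords" content = "Conductor, "hot",Conductor, "neutral"," />
<a name ="Conductor, "neutral""></a>
END

# Illustrative helper: return a copy of the value with all double quotes removed.
sub strip_quotes {
    my ($value) = @_;
    $value =~ tr/"//d;
    return $value;
}

# \K drops everything matched before it from the text being replaced,
# so the variable-width prefix (\s*=\s*") is allowed here.
$html =~ s{content\s*=\s*"\K(.*?)(?="\s*/>)}{ strip_quotes($1) }eg;
$html =~ s{name\s*=\s*"\K(.*?)(?="\s*>)}{ strip_quotes($1) }eg;

print $html;
```

The closing look-aheads are the same as in the code above, so this variant shares its assumptions about how each tag ends; only the prefix handling differs.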

Of course, try the substitutions on copies of the files first.

Answer:

The only way I can think of to do this reasonably safely is to use two nested evaluated (/e) substitutions. The program below uses this idea and works with your data.

The outer substitution finds all tags in the string, and replaces them with a tag containing adjusted attribute values.

The inner substitution finds all attribute values in the tag, and replaces each with the same value stripped of its double quotes.

use strict;
use warnings;

my $html = <<'HTML';
<meta name="keywords" content="Conductor, "hot",Conductor, "neutral",Hot wire,Neutral wire,Double insulation,Conductor, "ground",Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," />
<a name="Conductor, "neutral""></a>
<a href="1.html" title="Page 1: What are "series" and "parallel" circuits?">
HTML

$html =~ s{(<[^>]+>)}{

  my $tag = $1;

  $tag =~ s{ \w+= " \K ( [^=<>]+ ) (?= " (?: \s+\w+= | \s*/?> )) }
  {
    (my $attr = $1) =~ tr/"//d;
    $attr;
  }egx;

  $tag;
}eg;

print $html;

Output:

<meta name="keywords" content="Conductor, hot,Conductor, neutral,Hot wire,Neutral wire,Double insulation,Conductor, ground,Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," />
<a name="Conductor, neutral"></a>
<a href="1.html" title="Page 1: What are series and parallel circuits?">
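Both answers delete the embedded quotes, whereas the question also mentions replacing them with single quotes. A minimal sketch of that variant, reusing the nested-substitution pattern above unchanged and only swapping the transliteration to tr/"/'/ (this swap is an assumption about what is wanted, not something either answer shows):

```perl
use strict;
use warnings;

my $html = <<'HTML';
<a name="Conductor, "neutral""></a>
<a href="1.html" title="Page 1: What are "series" and "parallel" circuits?">
HTML

# Outer substitution: visit every tag in the string.
$html =~ s{(<[^>]+>)}{

  my $tag = $1;

  # Inner substitution: visit every attribute value in the tag.
  $tag =~ s{ \w+= " \K ( [^=<>]+ ) (?= " (?: \s+\w+= | \s*/?> )) }
  {
    # Turn embedded double quotes into single quotes instead of deleting them.
    (my $attr = $1) =~ tr/"/'/;
    $attr;
  }egx;

  $tag;
}eg;

print $html;
```

With this input the name attribute becomes "Conductor, 'neutral'" and the title becomes "Page 1: What are 'series' and 'parallel' circuits?". If the literal double-quote characters have to survive, replacing each one with the entity &quot; would be another option.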