regex code advice

sconard
Forum Newbie
Posts: 2
Joined: Fri Feb 27, 2015 10:12 am

regex code advice

Post by sconard »

I could not find the regex forum in the list so I am posting in scripting.

I am trying to match only the second anchor tag (the one with class="internallink") in the HTML below.

Code: Select all

<a title="anythingText" href="http://www.anythingText.com" target="_blank"><img class="image" alt="Image" src="GetFile.aspx?Page=anythingText.anythingText&File=anythingText.jpg"></a><p class="imagedescription">anythingText<br /></div><br /><br />anythingText<br /><br /><a title="anythingText" class="internallink" href="GetFile.aspx?Page=anythingText.anythingText&File=anythingText.docx" target="_blank">anythingText</a>
This is what I have and it matches both anchor tags as one match.

Code: Select all

(<a title="(.)*?" class="internallink" (target="_blank" )?href="GetFile\.aspx\?(Provider=((.)*?)&)?Page=((.)+?)&File=((.)+?)">)((.|\n|\r)+?)(</a>))
How can I make it match only the second tag instead of spanning both?
requinix
Spammer :|
Posts: 6617
Joined: Wed Oct 15, 2008 2:35 am
Location: WA, USA

Re: regex code advice

Post by requinix »

Don't use regular expressions to interpret HTML. Assuming you are using PHP, load the source into a DOMDocument, then run an XPath expression to find all the a.internallink elements. Loop over each one and check that its href matches your criteria, using parse_url() and then parse_str() on the query string.

Code: Select all

<?php

$html = <<<HTML
<html>
<body>
<div><a title="anythingText" href="http://www.anythingText.com" target="_blank"><img class="image" alt="Image" src="GetFile.aspx?Page=anythingText.anythingText&File=anythingText.jpg"></a><p class="imagedescription">anythingText<br /></div><br /><br />anythingText<br /><br /><a title="anythingText" class="internallink" href="GetFile.aspx?Page=anythingText.anythingText&File=anythingText.docx" target="_blank">anythingText</a>
</body>
</html>
HTML;

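// Collect libxml parse warnings internally: the sample markup has a stray
// </div> and unescaped ampersands, which would otherwise emit warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
// Select every <a> whose class attribute is exactly "internallink".
$nodes = $xpath->query("//a[@class='internallink']");

foreach ($nodes as $node) {
        if ($node->hasAttribute("href")) {
                // Split the URL into its components (path, query, ...).
                $href = parse_url($node->getAttribute("href"));
                if (!empty($href["query"])) {
                        // Turn the query string into a key => value array.
                        parse_str($href["query"], $qs);
                } else {
                        $qs = array();
                }
                // Keep only GetFile.aspx links that carry both Page and File.
                if (isset($href["path"]) && $href["path"] == "GetFile.aspx" && isset($qs["Page"], $qs["File"])) {
                        // matches
                        echo "Found Page={$qs["Page"]}, File={$qs["File"]}\n";
                }
        }
}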

Code: Select all

Found Page=anythingText.anythingText, File=anythingText.docx
It's a little more work, sure, but it's much safer than a regex on unknown input and it's a lot easier to understand.

Re: regex code advice

Post by sconard »

The existing code is in C#.

Regex is already used for many other matches, so although I agree with you, I still need to use regex here.

Thank you.

Re: regex code advice

Post by requinix »

Regex will suck.

.*? isn't enough. You have to restrict the characters allowed so that the match can't run past the end of an attribute value, for example with [^"]+ in place of the dot.
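
To see the difference, here is a minimal sketch in Python (your code is C#, but this part of the regex syntax behaves the same way in both). The sample string and the variable names are made up for illustration. The lazy dot is still allowed to cross a closing quote when that is the only way to reach a match, while the negated class is not.

```python
import re

# Hypothetical sample: two anchors, only the second has class="internallink".
html = ('<a title="first" href="http://www.example.com">one</a>'
        '<a title="second" class="internallink" href="GetFile.aspx">two</a>')

# Lazy dot: the match starts at the first <a and stretches across both tags.
loose = re.search(r'<a title=".*?" class="internallink"', html)

# Negated class: the title value cannot contain a quote, so the match
# has to begin at the second <a.
strict = re.search(r'<a title="[^"]*" class="internallink"', html)

print(loose.start())    # 0
print(strict.group(0))  # <a title="second" class="internallink"
```

Lazy only means "try the shortest first"; it does not mean "stop at the first quote".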

And if the attributes can appear in a different order, as in the second <a> (where target follows href instead of preceding it as your pattern expects), then the regex gets far more complex, because every ordering needs its own alternative.
Say there's the title (A), class (B), href (C), and target (D). To match the full tag in any of the 4! = 24 orderings you have to do something like

Code: Select all

<a (A(B(CD|DC)|C(BD|DB)|D(BC|CB))|B(A(CD|DC)|C(AD|DA)|D(AC|CA))|C(A(BD|DB)|B(AD|DA)|D(AB|BA))|D(A(BC|CB)|B(AC|CA)|C(AB|BA)))>
(Remember to expand A, B, C, and D into the regex pattern for each attribute.) And that doesn't even account for optional attributes.
Regardless, capturing is a pain because you have so many different places in the regex where the desired information can show up.
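
To give a sense of scale, the alternation above can be generated rather than typed. This is a rough Python sketch that builds a flat list of alternatives, one per ordering, which is equivalent to the factored form. The four attribute patterns are simplified placeholders assumed for illustration, not the complete patterns you would need in practice.

```python
import re
from itertools import permutations

# Placeholder patterns for A (title), B (class), C (href), D (target).
# These are illustrative assumptions, not complete attribute matchers.
attrs = [
    r'title="[^"]*"',
    r'class="internallink"',
    r'href="[^"]*"',
    r'target="_blank"',
]

# One alternative per ordering, with whitespace between the attributes.
orderings = [r'\s+'.join(p) for p in permutations(attrs)]
pattern = re.compile(r'<a\s+(?:' + '|'.join(orderings) + r')\s*>')

print(len(orderings))  # 24

tag = ('<a title="t" class="internallink" '
       'href="GetFile.aspx?Page=a.b&File=c.docx" target="_blank">')
print(bool(pattern.match(tag)))  # True
```

Even in this simplified form every attribute is mandatory, and the href pattern appears in 24 different places, which is why capturing gets awkward.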

Unfortunately .NET doesn't seem to have a decent built-in HTML parser. There is a compromise you can make:

Code: Select all

<a(?:\s+(?:title='[^']*'|title="[^"]*"|title=[^\s>]+|class=['"]?internallink['"]?|href=('[^']+')|href=("[^"]+")|href=([^\s>]+)|target=['"]?_blank['"]?))+\s*>
which will match pretty much every A tag containing one or more of those attributes. You then go through each match and check whether one of the href capturing groups matched. If not, skip it. Otherwise strip any surrounding quotes, HTML-decode the string, use System.Uri to parse the URL (resolved against a base address, since these hrefs are relative), and use HttpUtility.ParseQueryString to get the individual query string values. Then filter out the ones that don't have both the Page and File keys.
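
The post-processing steps can be sketched as follows. This is Python rather than C#, so html.unescape, urlsplit and parse_qs are only rough stand-ins for HTML-decoding, System.Uri and HttpUtility.ParseQueryString, and the C# calls will differ in detail. The pattern follows the same idea as the one above (any mix of the listed attributes, separated by whitespace). The function name and the sample markup are hypothetical.

```python
import html
import re
from urllib.parse import parse_qs, urlsplit

# Any mix of title/class/href/target attributes, separated by whitespace.
# Groups 1-3 capture the href value (single-quoted, double-quoted, unquoted).
TAG = re.compile(
    r'''<a(?:\s+(?:title='[^']*'|title="[^"]*"|title=[^\s>]+'''
    r'''|class=['"]?internallink['"]?'''
    r'''|href=('[^']+')|href=("[^"]+")|href=([^\s>]+)'''
    r'''|target=['"]?_blank['"]?))+\s*>'''
)

def internal_files(source):
    """Yield (Page, File) for each matched tag whose href has both keys."""
    for m in TAG.finditer(source):
        # Whichever href alternative matched holds the raw value.
        raw = m.group(1) or m.group(2) or m.group(3)
        if raw is None:
            continue  # tag matched but had no href attribute
        # Drop surrounding quotes, then decode entities such as &amp;.
        href = html.unescape(raw.strip('\'"'))
        qs = parse_qs(urlsplit(href).query)
        if 'Page' in qs and 'File' in qs:
            yield qs['Page'][0], qs['File'][0]

# Hypothetical sample: an external link followed by an internal one.
sample = (
    '<a title="t" href="http://www.example.com" target="_blank">x</a>'
    '<a title="t" class="internallink" '
    'href="GetFile.aspx?Page=a.b&amp;File=c.docx" target="_blank">y</a>'
)
print(list(internal_files(sample)))  # [('a.b', 'c.docx')]
```

The regex only locates candidate tags. The decision about whether a link qualifies is made on the parsed query string, so attribute order no longer matters.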