PHP Developers Network

A community of PHP developers offering assistance, advice, discussion, and friendship.
 

All times are UTC - 5 hours




 Post subject: regex code advice
PostPosted: Fri Feb 27, 2015 11:15 am 
Forum Newbie

Joined: Fri Feb 27, 2015 11:12 am
Posts: 2
I could not find the regex forum in the list so I am posting in scripting.

I am trying to get a match that covers only one anchor tag in the markup below.
Code:
<a title="anythingText" href="http://www.anythingText.com" target="_blank"><img class="image" alt="Image" src="GetFile.aspx?Page=anythingText.anythingText&amp;File=anythingText.jpg"></a><p class="imagedescription">anythingText<br /></div><br /><br />anythingText<br /><br /><a title="anythingText" class="internallink" href="GetFile.aspx?Page=anythingText.anythingText&amp;File=anythingText.docx" target="_blank">anythingText</a>
 


This is what I have and it matches both anchor tags as one match.
Code:
(<a title="(.)*?" class="internallink" (target="_blank" )?href="GetFile\.aspx\?(Provider=((.)*?)&amp;)?Page=((.)+?)&amp;File=((.)+?)">)((.|\n|\r)+?)(</a>)


How can I separate them?


 Post subject: Re: regex code advice
PostPosted: Fri Feb 27, 2015 4:52 pm 
Spammer :|

Joined: Wed Oct 15, 2008 2:35 am
Posts: 6551
Location: WA, USA
Don't use regular expressions for interpreting HTML. Assuming you are using PHP, load the source into a DOMDocument, then run an XPath expression to find all the a.internallink elements. Loop over each one and check that its href matches your criteria, using parse_url() on the href and then parse_str() on the query string.
Code:
<?php

$html = <<<HTML
<html>
<body>
<div><a title="anythingText" href="http://www.anythingText.com" target="_blank"><img class="image" alt="Image" src="GetFile.aspx?Page=anythingText.anythingText&amp;File=anythingText.jpg"></a><p class="imagedescription">anythingText<br /></div><br /><br />anythingText<br /><br /><a title="anythingText" class="internallink" href="GetFile.aspx?Page=anythingText.anythingText&amp;File=anythingText.docx" target="_blank">anythingText</a>
</body>
</html>
HTML
;

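// Collect libxml parse warnings instead of printing them; real-world markup is rarely clean.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);

// Exact match on the class attribute; an element with several classes would need contains().
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//a[@class='internallink']");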

foreach ($nodes as $node) {
        if ($node->hasAttribute("href")) {
                $href = parse_url($node->getAttribute("href"));
                if (!empty($href["query"])) {
                        parse_str($href["query"], $qs);
                } else {
                        $qs = array();
                }
                if (isset($href["path"]) && $href["path"] == "GetFile.aspx" && isset($qs["Page"], $qs["File"])) {
                        // matches
                        echo "Found Page={$qs["Page"]}, File={$qs["File"]}\n";
                }
        }
}

Code:
Found Page=anythingText.anythingText, File=anythingText.docx


It's a little more work, sure, but it's much safer than a regex on unknown input and it's a lot easier to understand.


 Post subject: Re: regex code advice
PostPosted: Tue Mar 03, 2015 7:37 am 
Forum Newbie

Joined: Fri Feb 27, 2015 11:12 am
Posts: 2
The existing code is in C#.

Regex is used for many other matches, so although I agree with you, I still need to use regex.

Thank you


 Post subject: Re: regex code advice
PostPosted: Tue Mar 03, 2015 4:00 pm 
Spammer :|

Joined: Wed Oct 15, 2008 2:35 am
Posts: 6551
Location: WA, USA
Regex will suck.

.*? isn't enough. You have to restrict the characters allowed so that the match can't run past the end of an attribute value, for example with [^"]+ instead.
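To see the difference, here is a minimal sketch in Python (used only for illustration; the same pattern syntax works in .NET). The two-anchor `html` string and the simplified patterns are made up for the demo and are not the original poster's full expression.

```python
import re

# Made-up sample: a plain link followed by the internallink we actually want.
html = (
    '<a title="first" href="http://www.example.com" target="_blank">one</a>'
    '<a title="second" class="internallink" href="GetFile.aspx?Page=a.b&amp;File=c.docx">two</a>'
)

# Lazy dot: it prefers short matches but may still cross quotes and tags
# if that is the only way to reach the rest of the pattern.
loose = re.compile(r'<a title=".*?" class="internallink" href="(.*?)">')

# Negated class: each attribute value has to end at its own closing quote.
strict = re.compile(r'<a title="[^"]*" class="internallink" href="([^"]*)">')

loose_match = loose.search(html)
strict_match = strict.search(html)

print(loose_match.start())    # starts at the FIRST <a>, swallowing both tags
print(strict_match.start())   # starts at the second <a> only
print(strict_match.group(1))  # the href value of the internallink
```

The lazy version starts matching at the first anchor because nothing stops the dot from stepping over the closing quote of the first title.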

And if the attributes can appear in a different order, as in the second <a> (the one you're trying to match), then the regex grows factorially, because every ordering has to be spelled out: four attributes mean 24 orderings.
Say there's the title (A), class (B), href (C), and target (D). To match the full tag you have to do something like
Code:
<a (A(B(CD|DC)|C(BD|DB)|D(BC|CB))|B(A(CD|DC)|C(AD|DA)|D(AC|CA))|C(A(BD|DB)|B(AD|DA)|D(AB|BA))|D(A(BC|CB)|B(AC|CA)|C(AB|BA)))>

(remember to expand A, B, C, and D into the regex patterns that match each component, plus the whitespace between them), and that doesn't even account for optional attributes.
Regardless, capturing is a pain because you have so many different places in the regex where the desired information can show up.
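As a rough illustration of how quickly this grows, the sketch below builds the every-ordering pattern mechanically in Python. The four sub-patterns in `ATTRS` and the helper `any_order_pattern` are hypothetical stand-ins for A, B, C, and D, and the result is a flat alternation of all orderings rather than the nested form shown above.

```python
import itertools
import re

# Hypothetical stand-ins for A, B, C, D; real patterns would need more care.
ATTRS = {
    "title": r'title="[^"]*"',
    "class": r'class="internallink"',
    "href": r'href="[^"]*"',
    "target": r'target="_blank"',
}

def any_order_pattern(attrs):
    """Return (pattern, number_of_orderings) for a tag with all attrs in any order."""
    orderings = [r"\s+".join(p) for p in itertools.permutations(attrs.values())]
    return r"<a\s+(?:" + "|".join(orderings) + r")\s*>", len(orderings)

pattern, count = any_order_pattern(ATTRS)

tag = '<a title="t" class="internallink" href="GetFile.aspx?Page=p&amp;File=f.docx" target="_blank">'
print(count)                               # number of orderings for 4 attributes
print(re.fullmatch(pattern, tag) is not None)
print(len(pattern))                        # length of the generated pattern
```

Every attribute is still mandatory here; making any of them optional multiplies the alternatives again.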

Unfortunately .NET doesn't seem to have a decent HTML parser. There is a compromise you can make:
Code:
<a(?:\s+(?:title='[^']*'|title="[^"]*"|title=[^\s>]+|class=['"]?internallink['"]?|href=('[^']+')|href=("[^"]+")|href=([^\s>]+)|target=['"]?_blank['"]?))+\s*>

which will match pretty much every A tag containing one or more of those attributes. You then go through each one and check that it matched one of those href capturing groups: if not then skip it, otherwise HTML-decode the string, use System.Uri to parse the URL, and use HttpUtility.ParseQueryString to get the individual query string values. Then filter out the ones that don't have the Page or File keys.
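Those steps can be sketched as follows. This is Python rather than C#, so `html.unescape`, `urlsplit`, and `parse_qs` stand in for the HTML-decode, System.Uri, and HttpUtility.ParseQueryString steps. The pattern is a variant of the compromise regex that allows whitespace between attributes, and `find_file_links` is a made-up helper name.

```python
import html
import re
from urllib.parse import parse_qs, urlsplit

# Groups 1-3 are the three href forms: single-quoted, double-quoted, unquoted.
TAG = re.compile(
    r"""<a(?:\s+(?:title='[^']*'|title="[^"]*"|title=[^\s>]+"""
    r"""|class=['"]?internallink['"]?"""
    r"""|href=('[^']+')|href=("[^"]+")|href=([^\s>]+)"""
    r"""|target=['"]?_blank['"]?))+\s*>"""
)

def find_file_links(source):
    """Return (Page, File) pairs for matched <a> tags whose href carries both keys."""
    results = []
    for m in TAG.finditer(source):
        raw = m.group(1) or m.group(2) or m.group(3)
        if raw is None:
            continue  # tag matched but had no href
        href = html.unescape(raw.strip("'\""))  # &amp; -> &
        query = parse_qs(urlsplit(href).query)
        if "Page" in query and "File" in query:
            results.append((query["Page"][0], query["File"][0]))
    return results

sample = (
    '<a title="anythingText" href="http://www.anythingText.com" target="_blank">'
    '<img class="image" alt="Image" src="GetFile.aspx?Page=x.y&amp;File=z.jpg"></a>'
    '<a title="anythingText" class="internallink" '
    'href="GetFile.aspx?Page=anythingText.anythingText&amp;File=anythingText.docx" '
    'target="_blank">anythingText</a>'
)
print(find_file_links(sample))
```

Both anchors match the tag pattern, but only the second survives the query-string filter, which is the point of doing the real checks outside the regex.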

