Am I doing something wrong?


nirali35
Forum Newbie
Posts: 6
Joined: Mon May 26, 2008 1:24 pm

Am I doing something wrong?

Post by nirali35 »

I want to parse a link and its title from the code below:

Code: Select all

 
 
<tr class="class" >
                                        <td>
                        <a href="http://www.website.com/?var=val">
                        Link title                      </a>
                                                </td>
 
                                </tr>
 
 
And here is the code I am using:

Code: Select all

 
 
<?php
    $url = 'http://www.website.com/';
    $agent = 'Bot';
    
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);
    curl_close($ch);
    
    $rule = '<tr class="(.*?)">.*?<td>.*?<a href="(.*?)">.*?<\/a>.*?<\/td>.*?<\/tr>';
 
    preg_match_all(
        '/' . $rule . '/',
        $html,
        $pieces,
        PREG_SET_ORDER
    );
    
    foreach ($pieces as $piece) {
        $val1 = $piece[3];
        $val2 = $piece[4];
        
        echo $val1 . ' -> ' . $val2 . '<br>';
    }
?>
 
 
But I am not having any luck so far. Could you please help me find out what I am doing wrong?

Thank you :)
GeertDD
Forum Contributor
Posts: 274
Joined: Sun Oct 22, 2006 1:47 am
Location: Belgium

Re: Am I doing something wrong?

Post by GeertDD »

Two remarks for now:

#1 Your html contains

Code: Select all

<tr class="class" >
which is valid, but your regex won't match it because of the space before the closing bracket. Allow for it in your regex by adding \s* before the >.

#2 By default, a dot matches any character except a newline. Your html contains newlines, though. Add the s modifier to the regex to make the dot match newlines as well.
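
A minimal sketch of both remarks applied to the rule from the question, assuming the sample markup above is the whole input. The third capture group for the link text is an addition for illustration and is not part of the original rule.

```php
<?php
// Sample markup from the question, with indentation shortened.
$html = '
<tr class="class" >
    <td>
        <a href="http://www.website.com/?var=val">
        Link title      </a>
    </td>
</tr>';

// Original rule: no \s* before ">" and no "s" modifier.
$before = '/<tr class="(.*?)">.*?<td>.*?<a href="(.*?)">.*?<\/a>.*?<\/td>.*?<\/tr>/';

// Remark #1: \s* allows whitespace before the closing ">".
// Remark #2: the trailing "s" modifier lets "." match newlines.
// The extra group in front of the closing anchor tag (assumption) captures the title.
$after = '/<tr class="(.*?)"\s*>.*?<td>.*?<a href="(.*?)">(.*?)<\/a>.*?<\/td>.*?<\/tr>/s';

$matchedBefore = preg_match($before, $html);
$matchedAfter  = preg_match($after, $html, $m);

if ($matchedAfter === 1) {
    // trim() strips the whitespace around the link text.
    echo $m[2] . ' -> ' . trim($m[3]) . "\n";
}
```

With this input the original rule finds nothing, while the adjusted rule captures the class, the URL and the link text.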
nirali35
Forum Newbie
Posts: 6
Joined: Mon May 26, 2008 1:24 pm

Re: Am I doing something wrong?

Post by nirali35 »

Thanks buddy :)

I tried this, but it didn't work :(

Code: Select all

$rule = '<tr class="(.*?)"\s*.>\s*.<td>\s*.<a href="(.*?)">.*?<\/a>\s*.<\/td>\s*.<\/tr>';
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Am I doing something wrong?

Post by prometheuzz »

nirali35 wrote:Thanks buddy :)

I tried this, but it didn't work :(
...
That's because you didn't add the "s" (DOTALL) flag after your regex to let the . (dot) also match newline characters, as Geert suggested.

Here's a demo:

Code: Select all

#!/usr/bin/php
<?php
$text = '
<tr class="class" >
  <td>
    <a href="http://www.website.com/?var=val">
      Link title</a>
  </td>
</tr>';
 
$regex = '/.*?<a\s++href="([^"]++)"[^>]*+>\s*+([^<]++).*+/si';
 
if(preg_match($regex, $text, $matches)) {
  print "url   = $matches[1]\ntitle = $matches[2]\n";
} else {
  print "No match for: $text\n";
}
 
/* output
url   = http://www.website.com/?var=val
title = Link title
*/
?>
nirali35
Forum Newbie
Posts: 6
Joined: Mon May 26, 2008 1:24 pm

Re: Am I doing something wrong?

Post by nirali35 »

Thanks... it worked... but it only parsed the first link.
And I have a sequence like this:

Code: Select all

 
 
                    <tr class="class1" >
                                        <td>
                        <a href="http://www.website.com/index.php?option=com_content&task=view&id=1234&Itemid=123">
                        Link Title                      </a>
                                                </td>
                                </tr>
                    <tr class="class2" >
                                        <td>
                        <a href="http://www.website.com/index.php?option=com_content&task=view&id=1235&Itemid=123">
                        Link Title                      </a>
                                                </td>
                                </tr>
                    <tr class="class1" >
                                        <td>
                        <a href="http://www.website.com/index.php?option=com_content&task=view&id=1236&Itemid=123">
                        Link Title                      </a>
                                                </td>
                                </tr>
                    <tr class="class2" >
                                        <td>
                        <a href="http://www.website.com/index.php?option=com_content&task=view&id=1237&Itemid=123">
                        Link Title                      </a>
                                                </td>
                                </tr>
 
 
And here is my code:

Code: Select all

 
 
<?php
    $url = 'http://www.website.com/';
    
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);
    curl_close($ch);
    
    $rule = '/.*?<a\s++href="([^"]++)"[^>]*+>\s*+([^<]++).*+/si';
 
    preg_match_all(
        $rule,
        $html,
        $pieces,
        PREG_SET_ORDER // formats data into an array of posts
    );
    
    foreach ($pieces as $piece) {
        $val1 = $piece[1];
        $val2 = $piece[2];
        
        echo $val1 . ' -> ' . $val2 . '<br>';
    }
?>
 
 
Thanks again :)
GeertDD
Forum Contributor
Posts: 274
Joined: Sun Oct 22, 2006 1:47 am
Location: Belgium

Re: Am I doing something wrong?

Post by GeertDD »

prometheuzz wrote:

Code: Select all

$regex = '/.*?<a\s++href="([^"]++)"[^>]*+>\s*+([^<]++).*+/si';
 
if(preg_match($regex, $text, $matches)) {
  print "url   = $matches[1]\ntitle = $matches[2]\n";
} else {
  print "No match for: $text\n";
}
nirali35 wrote:Thanks... it worked... but it only parsed the first link.
First of all, preg_match() will only match once and then quit. You figured that out yourself already and used preg_match_all() instead. Good.

Secondly, look closely at the regex. It will match the whole string at once because the <a> tag is surrounded by .*? and .*+ which will consume all text before the first link and all text after it. See?

Try this:

Code: Select all

$regex = '/<a\s++href="([^"]++)"[^>]*+>/i';
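
To see the difference, here is a small sketch that counts matches for both patterns with preg_match_all(). The two-row markup is a shortened, hypothetical sample modelled on the rows posted earlier in the thread.

```php
<?php
// Hypothetical two-row sample modelled on the markup in the thread.
$html = '
<tr class="class1" >
  <td>
    <a href="http://www.website.com/index.php?id=1234">
    First title </a>
  </td>
</tr>
<tr class="class2" >
  <td>
    <a href="http://www.website.com/index.php?id=1235">
    Second title </a>
  </td>
</tr>';

// With the leading and trailing wildcards, the first match swallows the whole input.
$wrapped = '/.*?<a\s++href="([^"]++)"[^>]*+>\s*+([^<]++).*+/si';

// Without the wildcards, each anchor tag is a separate match.
$bare = '/<a\s++href="([^"]++)"[^>]*+>/i';

$wrappedCount = preg_match_all($wrapped, $html, $wrappedSets, PREG_SET_ORDER);
$bareCount    = preg_match_all($bare, $html, $bareSets, PREG_SET_ORDER);

echo "wrapped: $wrappedCount match(es)\n";
echo "bare:    $bareCount match(es)\n";
foreach ($bareSets as $set) {
    echo $set[1] . "\n"; // URL of each link
}
```

On this sample the wrapped pattern reports one match and the bare pattern reports two, one per link.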
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Am I doing something wrong?

Post by prometheuzz »

GeertDD wrote:...
Secondly, look closely at the regex. It will match the whole string at once ...
Ha, I keep forgetting that preg_match(...) does not have to match the entire string! Java's String.matches(...) is the cause of this (and my own amnesia, of course)!
; )
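
For anyone else coming from Java, a short sketch of the difference: preg_match() succeeds when the pattern occurs anywhere in the subject, so a whole-string match has to be requested with anchors. The input string here is made up for illustration.

```php
<?php
// Made-up input for illustration.
$text = 'id=1234&Itemid=123';

// Unanchored: succeeds if the pattern occurs anywhere in the subject.
$anywhere = preg_match('/\d+/', $text);

// Anchored with \A and \z: succeeds only if the whole subject matches,
// which is roughly how Java's String.matches() behaves.
$whole       = preg_match('/\A\d+\z/', $text);
$wholeDigits = preg_match('/\A\d+\z/', '1234');

var_dump($anywhere, $whole, $wholeDigits); // int(1), int(0), int(1)
```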
nirali35
Forum Newbie
Posts: 6
Joined: Mon May 26, 2008 1:24 pm

Re: Am I doing something wrong?

Post by nirali35 »

Getting closer my friend. Thank you very much :)

But two things:
1. I don't want all the links on the page.
2. This expression doesn't give us a title!

Thanks :)
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Am I doing something wrong?

Post by prometheuzz »

nirali35 wrote:Getting closer my friend. Thank you very much :)

But two things:
1. I don't want all the links on the page.
2. This expression doesn't give us a title!

Thanks :)
...
What Geert hinted at was to remove the leading .*? and trailing .*+ from my first regex, and use preg_match_all(...) instead of preg_match(...).

Something like this (UNTESTED!):

Code: Select all

$regex = '/<a\s++href="([^"]++)"[^>]*+>\s*+([^<]++)/si';
 
if(preg_match_all($regex, $text, $matches)) {
  print_r($matches);
}
GeertDD
Forum Contributor
Posts: 274
Joined: Sun Oct 22, 2006 1:47 am
Location: Belgium

Re: Am I doing something wrong?

Post by GeertDD »

nirali35 wrote:Getting closer my friend. Thank you very much :)

But two things:
1. I don't want all the links on the page.
2. This expression doesn't give us a title!
1. Could you clarify the conditions for which links to match and which not?

2. Fair enough. I guess I removed a bit too much earlier on. This regex will match the link title/text again, in $matches[2]. It is recommended to trim() these values, though.

Code: Select all

$regex = '/<a\s++href="([^"]++)"[^>]*+>([^<]++)/i';
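
A minimal sketch of how this regex could be used with preg_match_all() and trim(). The markup is a shortened, hypothetical sample, and the $links array layout is just one possible choice.

```php
<?php
// Hypothetical sample modelled on the markup in the thread.
$html = '
<tr class="class1" >
  <td>
    <a href="http://www.website.com/index.php?id=1234">
    Link Title      </a>
  </td>
</tr>';

$regex = '/<a\s++href="([^"]++)"[^>]*+>([^<]++)/i';

$links = array();
if (preg_match_all($regex, $html, $sets, PREG_SET_ORDER)) {
    foreach ($sets as $set) {
        // $set[1] is the URL, $set[2] the raw link text including whitespace.
        $links[] = array('url' => $set[1], 'title' => trim($set[2]));
    }
}

print_r($links);
```

Without trim(), the title would still carry the newline and indentation that surround it in the markup.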
nirali35
Forum Newbie
Posts: 6
Joined: Mon May 26, 2008 1:24 pm

Re: Am I doing something wrong?

Post by nirali35 »

Well, for now I solved the problem using strpos(), but I would still like to have a better way using a regex:

Code: Select all

 
 
<?php
    $url = 'http://www.website.com/index.php?option=com_content&task=category&sectionid=34&id=35&Itemid=345';
    $agent = 'Agent';
    $validations = array('task=view', 'Itemid=123');
    
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);
    curl_close($ch);
    
    $rule = '/<a\s++href="([^"]++)"[^>]*+>\s*+([^<]++)/si';
 
    preg_match_all(
        $rule,
        $html,
        $pieces,
        PREG_SET_ORDER // formats data into an array of posts
    );
    
    foreach ($pieces as $piece) {
        $show = $piece[1];
        $title = $piece[2];
        
        $valid = true;
        foreach ($validations as $validation) {
            // strpos() returns 0 for a match at offset 0, so compare against false explicitly.
            if (strpos($show, $validation) === false) {
                $valid = false;
            }
        }
        
        if($valid)
            echo $show . ' -> ' . $title . '<br>';
    }
?>
 
 
Here is what the target code looks like:

Code: Select all

 
 
                    <tr class="class1" >
                                        <td>
                        <a href="http://www.website.com/index.php?option=com_content&task=view&id=1234&Itemid=123">
                        Link Title                      </a>
                                                </td>
                                </tr>
                    <tr class="class2" >
                                        <td>
                        <a href="http://www.website.com/index.php?option=com_content&task=view&id=1235&Itemid=123">
                        Link Title                      </a>
                                                </td>
                                </tr>
                    <tr class="class1" >
                                        <td>
                        <a href="http://www.website.com/index.php?option=com_content&task=view&id=1236&Itemid=123">
                        Link Title                      </a>
                                                </td>
                                </tr>
                    <tr class="class2" >
                                        <td>
                        <a href="http://www.website.com/index.php?option=com_content&task=view&id=1237&Itemid=123">
                        Link Title                      </a>
                                                </td>
                                </tr>
 
 
GeertDD
Forum Contributor
Posts: 274
Joined: Sun Oct 22, 2006 1:47 am
Location: Belgium

Re: Am I doing something wrong?

Post by GeertDD »

Some things are better handled by other PHP functions. Using strpos() in this case does not seem like bad practice to me. It allows you to use a simpler and faster regex to match all the links.

You could implement it in the regex itself as well, of course. Below is a modified regex that will only match links that contain either "task=view" or "itemid=123". That is what you want, right?

Code: Select all

$regex = '/<a\s++href="([^"]*\b(?:task=view|itemid=123)\b[^"]*)"[^>]*+>([^<]++)/i';
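
A quick sketch of what the alternation accepts. The three links are hypothetical; the second one contains task=view but a different Itemid, so it shows that one matching substring is enough.

```php
<?php
// Hypothetical links: the second one has task=view but a different Itemid.
$html = '
<a href="http://www.website.com/index.php?task=view&id=1&Itemid=123">Both</a>
<a href="http://www.website.com/index.php?task=view&id=2&Itemid=999">Task only</a>
<a href="http://www.website.com/index.php?task=edit&id=3&Itemid=456">Neither</a>';

$regex = '/<a\s++href="([^"]*\b(?:task=view|itemid=123)\b[^"]*)"[^>]*+>([^<]++)/i';

preg_match_all($regex, $html, $sets, PREG_SET_ORDER);

$titles = array();
foreach ($sets as $set) {
    $titles[] = trim($set[2]);
}

print_r($titles); // "Both" and "Task only"; "Neither" is skipped
```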
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Am I doing something wrong?

Post by prometheuzz »

nirali35 wrote:Well, for now I solved the problem using strpos but I would always like to have a better way using RegEx:
...
Okay, you want your urls to contain the following substrings: 'task=view' and 'Itemid=123', right?
If so, you only need to adjust the regex:

Code: Select all

$rule = '/<a\s++href="([^"]++)"[^>]*+>\s*+([^<]++)/si';
You've been given a couple of demos on how to match URLs and titles. You have not asked anything about these regexes, so I assume everything is clear to you. ; )
So, why don't you give this one a try yourself? If you get stuck, you can post back here, ok?
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Am I doing something wrong?

Post by prometheuzz »

GeertDD wrote:...
Below is a modified regex that will only match links that contain either "task=view" or "itemid=123". That is what you want, right?
...
I believe he's only interested in URLs that contain both substrings, but with all of the examples given to him/her, surely s/he is able to adjust it to his/her needs!
; )
GeertDD
Forum Contributor
Posts: 274
Joined: Sun Oct 22, 2006 1:47 am
Location: Belgium

Re: Am I doing something wrong?

Post by GeertDD »

prometheuzz wrote:I believe he's only interested in URLs that contain both substrings, but with all of the examples given to him/her, surely s/he is able to adjust it to his/her needs!
; )
Yeah, you are probably right. One more argument to go for the strpos() check instead of trying to make the regex do all the work. Why? Because you don't know for sure which part comes first within the query string: task or itemid? The regex would have to check both possibilities. Let me just quickly try to cook something up (not tested), just as an example of how ugly it gets. ;)

Code: Select all

$regex = '/<a\s++href="([^"]*\b(?:task=view\b[^"]+\bitemid=123|itemid=123\b[^"]+\btask=view)\b[^"]*+)"[^>]*+>([^<]++)/i';
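
A rough sketch that runs this regex next to a lookahead-based variant on hypothetical links. The lookahead pattern and the linkTitles() helper are not from the thread; they are one possible way to require both substrings without spelling out each ordering.

```php
<?php
// Hypothetical links covering both parameter orders and one partial match.
$html = '
<a href="http://www.website.com/index.php?task=view&id=1&Itemid=123">Task first</a>
<a href="http://www.website.com/index.php?Itemid=123&id=2&task=view">Itemid first</a>
<a href="http://www.website.com/index.php?task=view&id=3&Itemid=999">Task only</a>';

// Regex from the post above: one alternative per ordering.
$alternation = '/<a\s++href="([^"]*\b(?:task=view\b[^"]+\bitemid=123|itemid=123\b[^"]+\btask=view)\b[^"]*+)"[^>]*+>([^<]++)/i';

// Lookahead variant (assumption, not from the thread): each lookahead checks
// one substring independently, so the order no longer matters.
$lookahead = '/<a\s++href="((?=[^"]*\btask=view\b)(?=[^"]*\bitemid=123\b)[^"]++)"[^>]*+>([^<]++)/i';

// Hypothetical helper: returns the trimmed link texts matched by $regex.
function linkTitles($regex, $html)
{
    preg_match_all($regex, $html, $sets, PREG_SET_ORDER);
    $titles = array();
    foreach ($sets as $set) {
        $titles[] = trim($set[2]);
    }
    return $titles;
}

$fromAlternation = linkTitles($alternation, $html);
$fromLookahead   = linkTitles($lookahead, $html);

print_r($fromAlternation);
print_r($fromLookahead);
```

Both patterns accept the first two links and skip the third. Each extra required substring adds one lookahead, whereas the alternation needs one branch per ordering.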