condition?
Posted: Fri Oct 27, 2006 12:13 pm
by visonardo
I need to extract URLs from href attributes, but the problem is that some of them appear without quotes. With quotes, the regex would be something like this:
$regexcode="<[\w]* (href)=(\"|')(.*)\\2[^>]*>";
The big problem is that when the value doesn't start with ' or ", the URL can't contain white space, so it would be something like this:
$regexcode="<[\w]* (href)=([\S]*)[^>]*>";
I need some kind of condition, like:
Code: Select all
<?
if(\\2!='')
(.*)
else
([\S]*)
?>
Do you understand what I want?
It would be something like this:
$regexcode="<[\w]* (href)=(\"|')?((if(\\2!='').else[\S])*)\\2?[^>]>";
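
PCRE, the engine behind PHP's preg_* functions, has a conditional subpattern, (?(1)yes|no), that comes close to the if/else sketched above. What follows is only a minimal sketch, assuming that just " and ' count as quotes; the sample $html string and the variable names are made up for illustration.

```php
<?php
// Sketch: (?(1)yes|no) picks a branch depending on whether capture
// group 1 (the optional opening quote) took part in the match.
//   quote present -> lazy .*? up to the same quote (white space allowed)
//   quote absent  -> [^\s>]+  (stops at white space or ">")
$pattern = '/<\w+\s[^>]*?\bhref\s*=\s*(["\'])?((?(1).*?|[^\s>]+))(?(1)\1)[^>]*>/is';

// Hypothetical input, not taken from a real page.
$html = '<a href="http://example.com/a b">x</a> <a href=http://example.com/c id=y>z</a>';

preg_match_all($pattern, $html, $matches);

// Group 2 holds the URL in both cases.
$urls = $matches[2];
print_r($urls);
```

The second conditional, (?(1)\1), only demands a closing quote when an opening one was captured, so one pattern covers both forms.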

Posted: Fri Oct 27, 2006 12:27 pm
by timvw
Simplified: you want to match two situations: [ a | b ].... (Might want to read
http://www.dotnetcoders.com/web/Learnin ... actor.aspx... )
Posted: Fri Oct 27, 2006 12:30 pm
by Chris Corbyn
Maybe (untested):
Code: Select all
$re = '/<\w+ href=("|\'|\b)(.*)\\1[^>]*>/is';
Posted: Fri Oct 27, 2006 7:53 pm
by printf
just a basic extractor...
Code: Select all
$str = 'some html page';
preg_match_all ( "|href\=\"?'?`?([[:alnum:]:?=&@/#._-]+)\"?'?`?|i", $str, $url );
print_r ( $url[1] );
printf
Posted: Fri Oct 27, 2006 7:55 pm
by visonardo
d11wtq wrote:Maybe (untested):
Code: Select all
$re = '/<\w+ href=("|\'|\b)(.*)\\1[^>]*>/is';
It didn't work.
Another thing: following what timvw said, I have been testing this example:
$re = "/<\w+ href=(((\"|')(.*)\\1)|(\S*))[^>]*>/i";
and this is the code I used to test it:
Code: Select all
function ech($aver)
{
    print_r($aver);
    return $aver[0];
}
$re = "/<\w+ href=(((\"|')(.*)\\1)|(\S*))[^>]*>/i";
$a2 = "<link href=' hola che!' boder>";
$a2 = preg_replace_callback($re, "ech", $a2);
OUTPUT
Code: Select all
<p>Array
(
[0] => <link href=' hola che!' boder>
[1] => '
[2] =>
[3] =>
[4] =>
[5] => '
)
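
One likely reading of that output: in this pattern the quote character is captured by group 3, not group 1, so \\1 points at the outermost group, which is still open at that point. PCRE treats such a backreference as a failure on first use, so the quoted branch never matches and (\S*) just grabs the opening '. A minimal sketch with the backreference changed to \\3 (the quote inside the PHP string is escaped so that it parses):

```php
<?php
// Group numbering, counted by opening parenthesis:
//   1 = whole value, 2 = quoted branch, 3 = quote character,
//   4 = quoted URL,  5 = unquoted URL
$re = "/<\w+ href=(((\"|')(.*)\\3)|(\S*))[^>]*>/i";
$a2 = "<link href=' hola che!' boder>";

preg_match($re, $a2, $m);
print_r($m);
```

With this change the quoted URL should show up in group 4, and an unquoted one in group 5.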
Posted: Fri Oct 27, 2006 8:41 pm
by feyd
Code: Select all
<?php
$test = '<link test=`foo`href=\' hola che!\'more-test=\'ploop\' boder>';
preg_match('#<\s*[a-z:-]+\s+(?:\s*[a-z]+(?:\s*=\s*([\'"`]?).*?\\1)?)*\s*href\s*=\s*([\'"`]?)(.*?)\\2(?:\s*[a-z]+(?:\s*=\s*([\'"`]?).*?\\4)?)*[^>]*>#is', $test, $match);
var_dump($match);
?>
Code: Select all
array(4) {
[0]=>
string(57) "<link test=`foo`href=' hola che!'more-test='ploop' boder>"
[1]=>
string(1) "`"
[2]=>
string(1) "'"
[3]=>
string(10) " hola che!"
}
A slight problem (depending on how you look at it) is that it will only support zero or one leading and following attributes.
Posted: Sat Oct 28, 2006 8:28 am
by visonardo
Thanks feyd, but it really doesn't work when I find an href without quotes, which is to say when there is no white space in the URL. Your regex doesn't handle that, so I used two regexes to do it. But I insist, there must be a way to do it all in one.

Posted: Sat Oct 28, 2006 8:37 am
by feyd
visonardo wrote:Thanks feyd, but it really doesn't work when I find an href without quotes, which is to say when there is no white space in the URL. Your regex doesn't handle that, so I used two regexes to do it. But I insist, there must be a way to do it all in one.

I have almost no idea what you just said.
Posted: Sat Oct 28, 2006 8:51 am
by visonardo
These are the two patterns that should become one:
Code: Select all
$regex1="/<[^>]+(href)\s*=(\S*)[^>]*>/is";
$regex2="/<[^>]+(href)\s*=\s*(\"|'|`)(.*)\\2[^>]*>/is";
Individually they work perfectly, but I would like to do it all in one regex. I tried this:
Code: Select all
$regex="/<[^>]+(href)\s*=(((\"|'|`)(.*)\\4)|(\S*))[^>]*>/is";
but it didn't work.
It must capture the URL from an href like this:
Code: Select all
<a href = "http://forums.devnetwork.net/">
and from one like this:
Code: Select all
<a href=http://devnetwork.net/ something=value>
As you can see, in the last href the URL ends at a white space or at the >.
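
One plausible cause, offered as a guess rather than a diagnosis: the combined pattern has no \s* after the =, so on href = "..." the quoted branch fails at the space and (\S*) happily matches an empty string. The sketch below runs the posted pattern and a tweaked one (the tweak, $tweaked, is an assumption, not something from the thread) against the two sample tags.

```php
<?php
// The two sample tags from the post above.
$samples = array(
    '<a href = "http://forums.devnetwork.net/">',
    '<a href=http://devnetwork.net/ something=value>',
);

// The combined pattern as posted: no \s* after "=" and a greedy (.*).
$posted = '/<[^>]+(href)\s*=((("|\'|`)(.*)\4)|(\S*))[^>]*>/is';

// Assumed tweak: allow white space after "=" and make the quoted part lazy.
$tweaked = '/<[^>]+(href)\s*=\s*((("|\'|`)(.*?)\4)|(\S*))[^>]*>/is';

$results = array();
foreach ($samples as $sample) {
    preg_match($posted, $sample, $before);
    preg_match($tweaked, $sample, $after);
    // Group 2 wraps both branches, so it holds the whole value (quotes included).
    $results[] = array($before[2], $after[2]);
}
print_r($results);
```

In the tweaked pattern the bare URL sits in group 5 for quoted values and in group 6 for unquoted ones.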
Posted: Sat Oct 28, 2006 9:13 am
by feyd
Code: Select all
<?php
$tests = array(
'<link test1=`foo1` test2=\'foo2\' test3="foo3" href=\' hola che!\'more-test=\'ploop\' boder>',
'<a href=http://devnetwork.net/ something=value>',
);
foreach( $tests as $test )
{
preg_match('#<\s*[a-z:-]+\s+.*?\s*href\s*=\s*(?:'.'([\'"`])(.*?)\\1|([^\s]+))[^>]*>#is', $test, $match);
var_dump($match);
}
?>
Code: Select all
array(3) {
[0]=>
string(86) "<link test1=`foo1` test2='foo2' test3="foo3" href=' hola che!'more-test='ploop' boder>"
[1]=>
string(1) "'"
[2]=>
string(10) " hola che!"
}
array(4) {
[0]=>
string(47) "<a href=http://devnetwork.net/ something=value>"
[1]=>
string(0) ""
[2]=>
string(0) ""
[3]=>
string(22) "http://devnetwork.net/"
}
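
In the output above the URL lands in group 2 for quoted values and in group 3 for unquoted ones. If a single group is preferred, PCRE's branch-reset group (?|...) is one option. This is a sketch built on the pattern above, not part of the posted solution; the empty () placeholder and the switch from [^\s]+ to [^\s>]+ are assumptions made for the example.

```php
<?php
// (?|...) restarts group numbering in each alternative, so the URL is
// group 2 whichever branch matched. The empty () keeps the numbering
// of the unquoted branch aligned with the quoted one.
$pattern = '#<\s*[a-z:-]+\s+.*?\s*href\s*=\s*(?|([\'"`])(.*?)\1|()([^\s>]+))[^>]*>#is';

$tests = array(
    '<link test1=`foo1` test2=\'foo2\' test3="foo3" href=\' hola che!\'more-test=\'ploop\' boder>',
    '<a href=http://devnetwork.net/ something=value>',
);

$urls = array();
foreach ($tests as $test) {
    if (preg_match($pattern, $test, $match)) {
        $urls[] = $match[2];
    }
}
var_dump($urls);
```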
Posted: Sat Oct 28, 2006 9:16 am
by visonardo
Thanks again :roll: . But one detail: why did you use [^\s] and not [\S]? :roll: Do you see any difference?
Posted: Sat Oct 28, 2006 9:19 am
by feyd
No particular reason, I just prefer to use the positive forms.
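
For anyone who wants to check this, here is a small sketch comparing the two spellings (plus the bare \S) on a made-up subject string. All three are expected to capture the same text.

```php
<?php
// Hypothetical sample string for the comparison.
$subject = 'href=http://devnetwork.net/ something=value';

preg_match('/href=([^\s]+)/', $subject, $a); // negated class of \s
preg_match('/href=([\S]+)/', $subject, $b);  // \S inside a class
preg_match('/href=(\S+)/', $subject, $c);    // bare \S

$same = ($a[1] === $b[1]) && ($b[1] === $c[1]);
var_dump($same);
```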
Posted: Sat Oct 28, 2006 9:20 am
by Chris Corbyn
visonardo wrote:Thanks again :roll: . But one detail: why did you use [^\s] and not [\S]? :roll: Do you see any difference?
No difference; it's just sometimes what comes into your mind whilst you're building a pattern. It will work either way.
By the way, I'm sure it's not intentional, but using the eye-rolling emoticon (:roll:) often looks like you're trying to be abrasive.

Posted: Sat Oct 28, 2006 9:37 am
by visonardo
d11wtq wrote:visonardo wrote:Thanks again :roll: . But one detail: why did you use [^\s] and not [\S]? :roll: Do you see any difference?
No difference; it's just sometimes what comes into your mind whilst you're building a pattern. It will work either way.
By the way, I'm sure it's not intentional, but using the eye-rolling emoticon (:roll:) often looks like you're trying to be abrasive.

Sorry, but I wanted to use an emoticon of doubt. What you describe was not my intention.
Another thing: I have these two regexes:
Code: Select all
$z1="#<[^>]+\s+.*?\s*href\s*=\s*((['\"`])(.*?)\\2|([^\s]+))[^>]*>#is";
$z2='#<\s*[a-z:-]+\s+.*?\s*href\s*=\s*(?:'.'([\'"`])(.*?)\\1|([^\s]+))[^>]*>#is';
$z2 is feyd's regex and $z1 is my modified version, which I believed was equivalent. As you can see, I just replaced \s*[a-z:-]+ with [^>]+ and, in order to capture the URL in the same group (\\3), removed the ?: from the parentheses. The end result is that mine didn't work and feyd's regex did. Why, if $z1 should work?
Code: Select all
$a2 = " <link href=' hola che!' boder> <a href=http://holaaaa something=value>";
$z1 = "#<[^>]+\s+.*?\s*href\s*=\s*((['\"`])(.*?)\\2|([^\s]+))[^>]*>#is";
$z2 = '#<\s*[a-z:-]+\s+.*?\s*href\s*=\s*(?:'.'([\'"`])(.*?)\\1|([^\s]+))[^>]*>#is';
preg_match_all($z1, $a2, $match1);
print_r($match1);
echo '<p>';
preg_match_all($z2, $a2, $match2);
print_r($match2);
OUTPUT
Code: Select all
Array
(
[0] => Array
(
[0] => <link href=' hola che!' boder> <a href=http://holaaaa something=value>
)
[1] => Array
(
[0] => http://holaaaa
)
[2] => Array
(
[0] =>
)
[3] => Array
(
[0] =>
)
[4] => Array
(
[0] => http://holaaaa
)
)
Array
(
[0] => Array
(
[0] => <link href=' hola che!' boder>
[1] => <a href=http://holaaaa something=value>
)
[1] => Array
(
[0] => '
[1] =>
)
[2] => Array
(
[0] => hola che!
[1] =>
)
[3] => Array
(
[0] =>
[1] => http://holaaaa
)
)
Posted: Sat Oct 28, 2006 9:57 am
by feyd
Adding parentheses around each in the original regex pattern will illustrate the differences.
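
A sketch of what that suggestion might look like in practice: the same two patterns, with an extra capture group around the part that differs. The backreferences are renumbered (\3 and \2) because of the added group; that renumbering is this example's assumption, not something stated in the thread.

```php
<?php
$a2 = " <link href=' hola che!' boder> <a href=http://holaaaa something=value>";

// Extra parentheses around the text between "<" and the white space
// that precedes href, so group 1 shows what that part consumed.
$z1 = '#<([^>]+)\s+.*?\s*href\s*=\s*(([\'"`])(.*?)\3|([^\s]+))[^>]*>#is';
$z2 = '#<(\s*[a-z:-]+)\s+.*?\s*href\s*=\s*(?:([\'"`])(.*?)\2|([^\s]+))[^>]*>#is';

preg_match_all($z1, $a2, $match1);
preg_match_all($z2, $a2, $match2);

print_r($match1[1]); // what [^>]+ consumed
print_r($match2[1]); // what \s*[a-z:-]+ consumed
```

The greedy [^>]+ is tried at its longest first, so it swallows the first href along with the tag name. Because of the s flag, .*? can then run past the > into the next tag, where the second href is found. The [a-z:-]+ version can only consume the tag name, so each tag is matched separately.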