match href|src="" but not href|src=http://xxx
Posted: Fri Jan 30, 2009 3:13 am
by papa
I've been trying and trying and reading Chris Corbyn's tutorial over and over but can't get my regex to work...
Code:
$src_regex = array('#(\s(href|src)=["|\'])(\./|/)?[^http://]?#i');
$src_replace = array('<font color="red">$1'.$url.'/</font>');
$line = '<pre>
img src="
img SRC=\'test.gif\'>
img src="/image.gif">
img src="./image.gif">
img src="http://www.example.org/image/image.gif">
img source="test"
a href="
a href=\'
a href="/image.gif">
a href="./image.gif">
a href="http://www.example.org/image/image.gif">
</pre>
';
echo preg_replace($src_regex, $src_replace, $line);
Quite simply: if the href or src already has a URL, don't change it. I can match the http://, but if I negate it with [] it strips characters out of the URL...
(the ' in the regex is escaped but doesn't show here for some reason)
Re: match href|src="" but not href|src=http://xxx
Posted: Fri Jan 30, 2009 3:32 am
by prometheuzz
Hi, a couple of questions:
- what is your understanding of a URL? Do you make a distinction between absolute and relative URLs?
- w.r.t. my previous question, could you post the desired output of $line after successfully replacing the unwanted parts?
- why are you using arrays to replace certain parts?
- what's this $url variable?
And some observations:
- everything between [ and ] will always match just ONE character, so, this is what your classes do:
Code:
["|'] // match one of the following characters: '"' (double quote), '|' (pipe) or '\'' (single quote)
[^http://] // match any character except 'h', 't', 'p', ':' and '/'
And as you can see, most of the "normal" meta characters lose their special meaning inside them (the '|' just matches the pipe character and is NOT the alternation/OR operator!).
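To make the difference concrete, here is a minimal sketch comparing the character class with a negative lookahead. The sample strings and variable names are made up for illustration and are not from the original post.

```php
<?php
// [^http://] is a set of single characters (h, t, p, : and /), not the word "http://".
$class = '#^[^http://]$#';
$x = preg_match($class, 'x'); // 1: 'x' is not in the set
$t = preg_match($class, 't'); // 0: 't' is in the set

// Inside a class the pipe is a literal character, not alternation.
$pipe = preg_match('#^["|\']$#', '|'); // 1

// A negative lookahead tests for the whole word and consumes nothing.
$lookahead = '#^(?!http://)#';
$relative = preg_match($lookahead, './image.gif');                  // 1
$absolute = preg_match($lookahead, 'http://www.example.org/a.gif'); // 0

var_dump($x, $t, $pipe, $relative, $absolute);
```

Because the lookahead consumes nothing, whatever follows it in the pattern still starts matching at the same position, which is why it doesn't eat characters from the URL the way the negated class does.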
Re: match href|src="" but not href|src=http://xxx
Posted: Fri Jan 30, 2009 3:41 am
by papa
Hi, thank you for the quick reply.
The array is for future purposes. The first step is to replace a leading ./ or / in relative URLs with my $url (user input). So ./img/myimage.gif will be $url/img/myimage.gif. If a URL (http://) is already in the src or href, ignore that and leave it as it is.
So "./bla" will be "http://example.com/bla"
"http://example.org/bla" will be ignored
[^http://] is my big problem. Trying to learn regex but I'm not that bright it seems. I've tried (^http://) to just match that word but don't know how to do it properly.
Your last statement was very helpful
So:
Code:
'#(\s(href|src)=("|\')?)(\./|/)?[^http://]?#i'
thanks
Re: match href|src="" but not href|src=http://xxx
Posted: Fri Jan 30, 2009 4:18 am
by prometheuzz
Okay, I understand. Try something like this:
Code:
echo preg_replace("#(\s(?:href|src)=[\"'])(?!http://)\.?/?([^\"'\r\n]+)#i", "$1$url/$2", $line);
And a small explanation:
Code:
// Regex:
( # start group 1
\s # match a white space character
(?:href|src) # match either 'href' or 'src' and don't store it in a group (the '?:' will do that)
= # match a '='
[\"'] # match a '"' or '\''
) # end group 1
(?!http://) # group one CANNOT be directly followed by 'http://' (Google for 'regex-lookarounds' for more info!)
\.? # optionally match a '.' (zero or one time)
/? # optionally match a '/' (zero or one time)
( # start group 2
[^\"'\r\n]+ # match one or more characters of any type, except '"', '\'' or new line characters
) # end group 2
// Replacement:
$1 # replace what is matched by the regex above by group 1 from that regex
$url # followed by $url
/ # followed by a '/'
$2 # and lastly, add group 2 from the regex above as the replacement
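For anyone who wants to try this out, here is a self-contained sketch built around the same pattern. The function name prefix_relative_urls is hypothetical, and two things are my own assumptions rather than part of the thread: the lookahead also skips https://, and a trailing '/' on $url is trimmed. It uses preg_replace_callback because in a replacement string like "$1$url/$2", a $url that starts with a digit would be read as part of the group number.

```php
<?php
// Hypothetical helper (not from the thread): prefix relative href/src values with $url.
function prefix_relative_urls(string $html, string $url): string
{
    // Same pattern as above, except "https?" instead of "http" (an assumption).
    $pattern = '#(\s(?:href|src)=["\'])(?!https?://)\.?/?([^"\'\r\n]+)#i';

    // The callback builds the replacement by plain concatenation, so nothing
    // in $url is ever interpreted as a group reference.
    return preg_replace_callback(
        $pattern,
        function (array $m) use ($url): string {
            // $m[1] is attribute + opening quote, $m[2] is the rest of the path.
            return $m[1] . rtrim($url, '/') . '/' . $m[2];
        },
        $html
    );
}

$base = 'http://example.com';
echo prefix_relative_urls('<img src="./image.gif">', $base), "\n";
// <img src="http://example.com/image.gif">
echo prefix_relative_urls('<a href="http://www.example.org/image/image.gif">', $base), "\n";
// printed unchanged
```

Note that this still treats anything that doesn't start with http:// or https:// as relative, so values such as mailto: links or "../" paths would need extra handling.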
Re: match href|src="" but not href|src=http://xxx
Posted: Fri Jan 30, 2009 4:22 am
by papa
Awesome thanks!
Very helpful with the detailed explanation!!!
edit.
Re: match href|src="" but not href|src=http://xxx
Posted: Fri Jan 30, 2009 4:26 am
by prometheuzz
papa wrote:Awesome thanks!
Very helpful with the detailed explanation!!!
No problem.