match href|src="" but not href|src=http://xxx

Post by papa »

I've been trying and trying and reading Chris Corbyn's tutorial over and over but can't get my regex to work...

Code: Select all

 
$src_regex = array('#(\s(href|src)=["|\'])(\./|/)?[^http://]?#i');
$src_replace = array('<font color="red">$1'.$url.'/</font>');
 
$line = '<pre>
img src="
img SRC=\'test.gif\'>
img src="/image.gif">
img src="./image.gif">
img src="http://www.example.org/image/image.gif">
img source="test"
a href="
a href=\'
a href="/image.gif">
a href="./image.gif">
a href="http://www.example.org/image/image.gif">
</pre>
';
 
echo preg_replace($src_regex, $src_replace, $line);
 
Quite simply, if the href or src already has a URL, don't change it. I can match the http://, but if I negate it with [] it strips characters out of the URL...

(the ' in the regex is escaped but doesn't show here for some reason)

Post by prometheuzz »

Hi, a couple of questions:
- what is your understanding of a URL? Do you make a distinction between absolute and relative URLs?
- w.r.t. my previous question, could you post the desired output of $line after successfully replacing the unwanted parts?
- why are you using arrays to replace certain parts?
- what's this $url variable?

And some observations:
- everything between [ and ] will always match just ONE character, so, this is what your classes do:

Code: Select all

["|']       // match one of the following characters: '"' (double quote), '|' (pipe) or '\'' (single quote)
[^http://]  // match any character except 'h', 't', 'p', ':' and '/'
And as you can see, the "normal" meta characters lose their special meaning inside a character class (the '|' just matches the pipe character and is NOT the alternation operator!).
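To make the two observations above concrete, here is a minimal sketch. The sample inputs and the placeholder replacement text `X` are made up for illustration; only the character classes come from the thread.

```php
<?php
// Sketch only; the sample inputs are made up for illustration.

// A negated class matches exactly ONE character that is not in the set.
$notInSet = preg_match('#^[^http://]$#', 'x');  // 'x' is not h, t, p, ':' or '/'
$inSet    = preg_match('#^[^http://]$#', 'h');  // 'h' is in the set, so no match

// Inside a class, '|' is a literal pipe character, not alternation.
$pipe = preg_match('#^["|\']$#', '|');

// Why characters vanish from the URL: after the optional "/" is matched,
// the optional class consumes the next character if it is not in the set.
$stripped = preg_replace('#(src=")(\./|/)?[^http://]?#', '$1X/', 'src="/image.gif"');

var_dump($notInSet, $inSet, $pipe);
echo $stripped, "\n";
```

In the last call the class swallows the "i" of "image.gif", which is the "strips out characters" symptom described in the question.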

Post by papa »

Hi, thank you for the quick reply.

The array is for future purposes. The first step is to replace a leading ./ or / in relative URLs with my $url (user input), so ./img/myimage.gif becomes $url/img/myimage.gif. If the src or href already contains an absolute URL (http://), leave it as it is.

So "./bla" will be "http://example.com/bla"
"http://example.org/bla" will be ignored


[^http://] is my big problem. Trying to learn regex but I'm not that bright it seems. I've tried (^http://) to just match that word but don't know how to do it properly.

Your last statement was very helpful :)

So:

Code: Select all

'#(\s(href|src)=("|\')?)(\./|/)?[^http://]?#i'
thanks
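A side note on the (^http://) attempt mentioned above, as a hedged sketch: outside a character class, ^ is a start-of-string anchor rather than a negation, so the group only matches when the subject begins with http://. A negative lookahead, (?!...), is one way to say "not followed by". The test strings below are illustrative assumptions.

```php
<?php
// Sketch only; the subject strings are made up for illustration.

// Outside [...], '^' anchors to the start of the subject; it does not negate.
$atStart  = preg_match('#(^http://)#', 'http://example.org/bla');        // matches at offset 0
$inMiddle = preg_match('#(^http://)#', 'src="http://example.org/bla"');  // no match: not at the start

// A negative lookahead asserts that the following text is NOT 'http://'.
$relative = preg_match('#src="(?!http://)#', 'src="./bla"');                   // lookahead passes
$absolute = preg_match('#src="(?!http://)#', 'src="http://example.org/bla"');  // lookahead fails

var_dump($atStart, $inMiddle, $relative, $absolute);
```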

Post by prometheuzz »

Okay, I understand. Try something like this:

Code: Select all

echo preg_replace("#(\s(?:href|src)=[\"'])(?!http://)\.?/?([^\"'\r\n]+)#i", "$1$url/$2", $line);
And a small explanation:

Code: Select all

// Regex:
(                   # start group 1
  \s                #   match a white space character
  (?:href|src)      #   match either 'href' or 'src' and don't store it in a group (the '?:' will do that)
  =                 #   match a '='
  [\"']             #   match a '"' or '\''
)                   # end group 1
(?!http://)         # group one CANNOT be directly followed by 'http://' (Google for 'regex-lookarounds' for more info!)
\.?                 # optionally match a '.' (zero or one; this is greedy, a reluctant version would be '\.??')
/?                  # optionally match a '/'
(                   # start group 2
  [^\"'\r\n]+       #   match one or more characters of any type, except '"', '\'' or new line characters
)                   # end group 2
 
// Replacement:
$1                  # replace what is matched by the regex above by group 1 from that regex
$url                # followed by $url
/                   # followed by a '/'
$2                  # and lastly, add group 2 from the regex above as the replacement
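For anyone who wants to try the pattern above in isolation, here is a self-contained sketch. The value of $url and the sample markup are assumptions chosen for the demo; the pattern is the one from the post, written in a single-quoted string so the backslashes reach the regex engine unchanged.

```php
<?php
// Hypothetical base URL standing in for the user-supplied $url.
$url = 'http://example.com';

// Sample markup, loosely based on the test lines in the question.
$line = <<<'HTML'
<img src="/image.gif">
<img src="./image.gif">
<img SRC='test.gif'>
<img src="http://www.example.org/image/image.gif">
<a href="./page.html">
HTML;

// Group 1: the attribute name, '=' and the opening quote.
// (?!http://): skip values that already start with http://.
// \.?/? : drop an optional leading "." and "/".
// Group 2: the rest of the value up to the closing quote or line end.
$pattern = '#(\s(?:href|src)=["\'])(?!http://)\.?/?([^"\'\r\n]+)#i';

$result = preg_replace($pattern, '$1' . $url . '/$2', $line);
echo $result, "\n";
```

Note that the lookahead only excludes values starting with http://, so an https:// or protocol-relative value would still be rewritten; extending the lookahead would be needed if those can occur. A $url containing "$" or "\" would also need escaping before being placed in the replacement string.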

Post by papa »

Awesome thanks!

Very helpful with the detailed explanation!!!

edit.
Last edited by papa on Fri Jan 30, 2009 6:12 am, edited 2 times in total.

Post by prometheuzz »

papa wrote: Awesome thanks!

Very helpful with the detailed explanation!!!
No problem.