lookahead with lookbehind syntax?
Posted: Fri Mar 30, 2012 4:13 pm
by kc11
Hi,
I am scanning through some HTML, but my basic regex is being tripped up by .jpg files, which I would like to exclude. I've added a lookahead clause, which has helped, but when I add a lookbehind clause as well, the entire regex fails.
preg_match_all(',..basic regex..(?!.*\.jpg"),', $html, $all_matches); // this lookahead works
preg_match_all(',..basic regex..(?!.*\.jpg")(?<!.*\.jpg),', $html, $all_matches); // this lookahead plus lookbehind does not work
Can anyone see anything wrong? Can you use a lookahead and a lookbehind together?
Thanks in advance,
KC
Re: lookahead with lookbehind syntax?
Posted: Fri Mar 30, 2012 6:23 pm
by ragax
Hi kc,
You can definitely use lookaheads and lookbehinds together (the link is to the lookaround page on my tutorial). You can even use a lookbehind inside a lookahead inside a lookbehind!
Two comments on your regex:
1. For the negative lookahead, I assume you realize that with the dot-star, depending on your HTML, the regex might fail just because of an unrelated .jpg far, far away on the same line, for instance "ok_file.txt lots of stuff unrelated.jpg".
2. PHP's PCRE regex engine does not allow variable-length lookbehinds, which is what you have here with the .* inside the negative lookbehind. The length is variable, as opposed to the literal characters in "Fixed_Length".
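Both points can be checked with a small script. This is only a sketch: the sample lines are made up, and the phone pattern is a simplified stand-in, not kc's real "basic regex".

```php
<?php
// Hypothetical sample lines; the phone pattern is a simplified stand-in.
$pattern = ',\d{3}\D?\d{3}\D?\d{4}(?!.*\.jpg"),';
$plain   = 'call 415-555-1212 today';
$withJpg = 'call 415-555-1212 or see <img src="unrelated.jpg">';

// Point 1: an unrelated .jpg" later on the same line vetoes the phone number.
$plainHits = preg_match($pattern, $plain);
$jpgHits   = preg_match($pattern, $withJpg);

// Point 2: .* inside a lookbehind does not compile, so preg_match() returns
// false (the @ only hides the warning). A fixed-length lookbehind compiles.
$variable = @preg_match(',(?<!.*\.jpg)\d{3},', $plain);
$fixed    = preg_match(',(?<!\.jpg)\d{3},', $plain);

var_dump($plainHits, $jpgHits, $variable, $fixed);
```

A return value of false (as opposed to 0) from preg_match() means the pattern itself was rejected, which matches the "entire regex fails" symptom.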
Can you give me examples of the strings you are trying to exclude with that negative lookbehind? I am not seeing it right now, but if you explain it, I will look for a solution.
Re: lookahead with lookbehind syntax?
Posted: Sat Mar 31, 2012 3:07 pm
by kc11
Thank you Ragax,
First let me say that I am trying to match phone numbers, so my basic regex right now is:
preg_match_all(',\d{3}\D?\D?\D?\d{3}\D?\d{4}(?!.*jpg"),',$html, $all_matches);
Example strings that are giving false matches include:
[text]
http://www.bugbr.com/resize.php?pic=661 ... height=450
<a href="http://www.facebook.com/media/set/?set= ... 569&type=3"
[/text]
Another option I'm thinking of is to use PHP's XPath support to get the text nodes only, and then run the regex on just those nodes, to avoid all these image-related false matches.
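A rough sketch of that text-node idea, assuming the DOM extension is available. The HTML fragment and the $phones variable are made up for illustration; the regex is the one above without the lookahead.

```php
<?php
// Hypothetical fragment: one real phone number, one digit run inside a URL.
$html = '<p>Call 415-555-1212.</p><img src="http://example.com/7812345890.jpg">';

libxml_use_internal_errors(true);   // keep parser warnings on sloppy HTML quiet
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath  = new DOMXPath($doc);
$phones = [];

// Visit text nodes only, skipping script and style content.
foreach ($xpath->query('//text()[not(ancestor::script or ancestor::style)]') as $node) {
    if (preg_match_all(',\d{3}\D?\D?\D?\d{3}\D?\d{4},', $node->nodeValue, $m)) {
        $phones = array_merge($phones, $m[0]);
    }
}

print_r($phones);
```

Because attribute values such as src and href are not text nodes, the digits inside the image URL are never seen by the regex.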
Thank you,
KC
Re: lookahead with lookbehind syntax?
Posted: Sat Mar 31, 2012 3:30 pm
by ragax
Hi KC,
Thanks for sending these false positives.
Two ideas.
1. Could some kind of boundary solve your problem? For instance, require the match to start after whitespace or at the beginning of a line: (^|\s), and to end in a similar fashion: ($|\s). After all, why would a phone number be embedded in text415-555-1212, unless it was part of a URL?
2. Are all these false positives part of a URL? If so, do they all end in a single or double quote, without any space? If so, you could use a negative lookahead such as (?!\S+['"]), which means "not followed by any number of non-space characters and then a single or double quote".
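Here is how the two ideas might look in PHP. This is a sketch on a made-up sample string, using the phone pattern posted earlier in the thread.

```php
<?php
// Hypothetical sample: a phone number in text and a digit run inside a URL.
$html  = 'Call 415-555-1212 now <img src="http://ghj.com/7812345890.jpg">';
$phone = '\d{3}\D?\D?\D?\d{3}\D?\d{4}';

// Idea 1: whitespace or a line edge on both sides; the number is group 1.
preg_match_all('~(?:^|\s)(' . $phone . ')(?:$|\s)~m', $html, $bounded);

// Idea 2: reject a number followed by non-space characters and then a quote.
preg_match_all('~' . $phone . '(?!\S+[\'"])~', $html, $unquoted);

print_r($bounded[1]);
print_r($unquoted[0]);
```

One caveat with idea 1: a number followed directly by punctuation, as in "415-555-1212.", would be rejected by ($|\s), so the trailing boundary may need loosening for free text.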
Wishing you a fun weekend.
Re: lookahead with lookbehind syntax?
Posted: Tue Apr 03, 2012 9:51 am
by kc11
Hi Ragax,
Thanks for your insights. I'm going to work on combining your ideas into my regex. It may turn out to be simpler than I thought.
Best regards,
KC
Re: lookahead with lookbehind syntax?
Posted: Tue Apr 03, 2012 11:49 am
by kc11
Ragax,
With respect to your first point, how would one get the word that surrounds a match? For example, let's say my match is 12345, and it is embedded in http://ghj.com/781234589.jpg. How would I get 'http://ghj.com/781234589.jpg'?
If I could do that, I think it would be pretty easy to test for strings at the beginning and end, like "http" and ".jpg" respectively.
Thanks,
KC
Re: lookahead with lookbehind syntax?
Posted: Tue Apr 03, 2012 3:42 pm
by ragax
Above, you have a string STRING, and you want to match it within a larger string made of any characters except spaces? So we are looking for SPACE or BEGINNING, then STUFF that includes STRING?
For that, you can write:
$regex = '~(?:^|\s)\S+?YOURSTRING\S+~';
Not tested; I am not at my dev computer.
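For what it's worth, a quick check of that pattern on KC's example might look like the sketch below. The $needle, $subject and $word names are made up for illustration, and preg_quote() is added in case the string contains regex metacharacters.

```php
<?php
// KC's example: the match 12345 sits inside a longer URL "word".
$needle  = '12345';
$subject = 'see http://ghj.com/781234589.jpg here';

// Same shape as the pattern above, with the needle escaped for safety.
$regex = '~(?:^|\s)\S+?' . preg_quote($needle, '~') . '\S+~';

// The match can include the leading whitespace, so trim it off.
$word = preg_match($regex, $subject, $m) ? trim($m[0]) : null;

// The surrounding word can then be tested for "http" and ".jpg".
$looksLikeImageUrl = $word !== null && preg_match('~^http.*\.jpg$~', $word) === 1;

var_dump($word, $looksLikeImageUrl);
```

Note that \S+? and \S+ each need at least one character, so a string that begins or ends exactly with the needle would not match; \S* on either side would cover that case.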
I still don't understand why you wouldn't simply match what you do want. It's a phone number, right? Maybe optional parentheses for the area code, then the area code, optional spaces or dashes, etc.? There are many examples of phone number regexes out there. It's hard to answer your question because it sounds like you have special requirements compared to the task of just matching a phone number, and I am not understanding what these are.
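As a rough illustration of "matching what you do want", a pattern along those lines could look like this. It is only a sketch for US-style numbers; the separators allowed and the sample strings are assumptions, not requirements from this thread.

```php
<?php
// Optional parentheses around the area code, optional space, dot or dash
// separators. A sketch only; real-world formats vary more than this.
$phone = '~\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}~';

$samples = ['(415) 555-1212', '415-555-1212', '415.555.1212', '4155551212', '555-1212'];

$results = [];
foreach ($samples as $s) {
    $results[$s] = preg_match($phone, $s) === 1;
}

var_dump($results);
```

The first four samples should match and the seven-digit one should not.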
Re: lookahead with lookbehind syntax?
Posted: Wed Apr 04, 2012 7:35 am
by kc11
Thanks again Ragax,
I think my problem is that I am trying to match phone numbers in free text, where people use many formats. One of the formats is 10 digits with no spaces (XXXXXXXXXX), and this also matches long number strings, as in my provided samples. If the text were more structured, I would probably be having less trouble.
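One possible guard against the long digit runs is a pair of fixed-length lookarounds that refuse a digit on either side of the match, which PCRE accepts because \d is a single character. A sketch, with a made-up URL standing in for the truncated samples above:

```php
<?php
$loose   = '~\d{3}\D?\D?\D?\d{3}\D?\d{4}~';
// Same pattern, but not preceded by a digit and not followed by a digit.
$guarded = '~(?<!\d)\d{3}\D?\D?\D?\d{3}\D?\d{4}(?!\d)~';

$longRun = 'resize.php?pic=66112345678901&height=450'; // hypothetical URL
$phone   = 'call 4155551212 today';

$looseOnRun     = preg_match($loose, $longRun);    // matches inside the digit run
$guardedOnRun   = preg_match($guarded, $longRun);  // digit run is rejected
$guardedOnPhone = preg_match($guarded, $phone);    // plain 10 digits still match

var_dump($looseOnRun, $guardedOnRun, $guardedOnPhone);
```

This only filters runs longer than ten digits. A URL that happens to contain exactly ten digits would still match, so it would need to be combined with one of the boundary ideas from earlier in the thread.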
Regards,
KC