lookahead with lookbehind syntax?

kc11
Forum Commoner
Posts: 73
Joined: Mon Sep 27, 2010 3:26 pm

lookahead with lookbehind syntax?

Post by kc11 »

Hi,

I am scanning through some html, but my basic regex is being tripped up by .jpeg files, which I would like to exclude. I've added a lookahead clause which has helped, but I would like to add an additional lookbehind clause, which is causing the entire regex to fail.

Code: Select all

preg_match_all(',..basic regex..(?!.*\.jpg"),', $html, $all_matches); // THIS LOOKAHEAD WORKS

Code: Select all

preg_match_all(',..basic regex..(?!.*\.jpg")(?<!.*\.jpg),', $html, $all_matches); // THIS LOOKAHEAD AND LOOKBEHIND DOES NOT WORK
Can anyone see anything wrong? Can you use a lookahead and a lookbehind together?

Thanks in advance,

KC
ragax
Forum Commoner
Posts: 85
Joined: Thu Dec 15, 2011 1:40 pm
Location: Nelson, NZ

Re: lookahead with lookbehind syntax?

Post by ragax »

Hi kc,

You can definitely use lookaheads and lookbehinds together (the lookaround page of my tutorial covers this). You can even use a lookbehind inside a lookahead inside a lookbehind! :)

Two comments on your regex:
1. For the negative lookahead, I assume you realize that with the dot-star, depending on your HTML, the regex might fail just because of an unrelated .jpg far away in the code, for instance "ok_file.txt lots of stuff unrelated.jpg".
2. PHP's PCRE regex engine does not allow variable-length lookbehinds, which is what you have here with the .* inside the negative lookbehind. Its length is variable, as opposed to a lookbehind made of literal characters such as "Fixed_Length", whose length is known in advance.
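To make point 2 concrete, here is a minimal sketch. The sample strings and the "pic=" prefix are made up for illustration; they are not kc's real data. As far as I know, newer PCRE2 releases relax the rule for lookbehinds with a bounded length, but an unbounded .* is still rejected.

```php
<?php
// Variable-length lookbehind: .* has no fixed width, so the pattern does not compile.
// preg_match() returns false (plus a warning, silenced here with @) when compilation fails.
$variable = @preg_match('~\d+(?<!.*\.jpg)~', 'photo.jpg 123');
var_dump($variable);

// Fixed-length lookbehind combined with a lookahead: both compile and work together.
// Matches a whole number that is not directly preceded by "pic="
// and not directly followed by ".jpg".
$fixed = preg_match_all('~(?<!pic=)\b\d+\b(?!\.jpg)~', 'pic=661 and 42 and 99.jpg', $m);
print_r($m[0]); // only "42" survives both assertions
```

So the two kinds of lookaround combine fine; it is only the .* inside the lookbehind that breaks the pattern.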

Can you give me examples of the strings you are trying to exclude with that negative lookbehind? I am not seeing it right now, but if you explain it, I will look for a solution.
kc11
Forum Commoner
Posts: 73
Joined: Mon Sep 27, 2010 3:26 pm

Re: lookahead with lookbehind syntax?

Post by kc11 »

Thank you Ragax,

First let me say that I am trying to match phone numbers, so my basic regex right now is:

Code: Select all

preg_match_all(',\d{3}\D?\D?\D?\d{3}\D?\d{4}(?!.*jpg"),',$html, $all_matches);
Example strings that give false matches include:

[text]
http://www.bugbr.com/resize.php?pic=661 ... height=450
<a href="http://www.facebook.com/media/set/?set= ... 569&type=3"
[/text]

Another option I'm thinking of is to use PHP's XPath support to get the text nodes only, and then run the regex on just those nodes, to avoid all these image-related false matches.
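A rough sketch of that XPath idea, assuming PHP's DOM extension is available. The sample markup and the example.com URL are invented for illustration:

```php
<?php
// Hypothetical sample: one number in visible text, one inside an image URL.
$html = '<p>Call 415-555-1212 today</p><img src="http://example.com/7812345890.jpg">';

$doc = new DOMDocument();
// Real-world HTML is rarely valid; @ silences the parser warnings.
@$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$phones = [];
// text() nodes hold element text only; attribute values such as src or href are skipped.
foreach ($xpath->query('//text()') as $node) {
    if (preg_match_all('~\d{3}\D?\D?\D?\d{3}\D?\d{4}~', $node->nodeValue, $m)) {
        $phones = array_merge($phones, $m[0]);
    }
}
print_r($phones); // the number in the <p>, but not the one in the image URL
```

Note that //text() also returns the contents of script and style elements, so those may need to be filtered out separately.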

Thank you,

KC
ragax
Forum Commoner
Posts: 85
Joined: Thu Dec 15, 2011 1:40 pm
Location: Nelson, NZ

Re: lookahead with lookbehind syntax?

Post by ragax »

Hi KC,

Thanks for sending these false positives.
Two ideas.

1. Could some kind of boundary solve your problem? For instance, making sure the string starts after a space or at the beginning of a line: (^|\s) and ends in a similar fashion: ($|\s), because why would a phone number be embedded in text415-555-1212? Unless it was part of a URL.

2. Are all these false positives part of a URL? If so, are these all strings ending in a single or double quote, without any space? If so, you could use a negative lookahead such as (?!\S+['"]) which means "not followed by any number of characters that are not spaces, then a single or double quote".
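Both ideas can be sketched as follows. The sample text is invented, and for idea 1 the sketch uses the lookaround forms (?<!\S) and (?!\S) in place of (^|\s) and ($|\s), an assumption made so that the surrounding whitespace does not become part of the match:

```php
<?php
// Hypothetical sample: a real phone number plus a long digit run inside a quoted URL.
$text = 'Call 415-555-1212 or see <a href="http://example.com/pic=6612345678&height=450">';
$phone = '\d{3}\D?\D?\D?\d{3}\D?\d{4}';

// Idea 1: the number must sit between whitespace (or the string edges).
preg_match_all('~(?<!\S)' . $phone . '(?!\S)~', $text, $bounded);

// Idea 2: the number must not be followed by non-space characters and then a quote.
preg_match_all('~' . $phone . '(?!\S+[\'"])~', $text, $noQuote);

print_r($bounded[0]);
print_r($noQuote[0]);
```

On this sample both patterns keep 415-555-1212 and drop the digits inside the href. Real pages may behave differently, for example when a phone number is followed directly by punctuation.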

Wishing you a fun weekend.
kc11
Forum Commoner
Posts: 73
Joined: Mon Sep 27, 2010 3:26 pm

Re: lookahead with lookbehind syntax?

Post by kc11 »

Hi Ragax,

Thanks for your insights. I'm going to work on combining your ideas into my regex. It may turn out to be simpler than I thought.

Best regards,

KC
kc11
Forum Commoner
Posts: 73
Joined: Mon Sep 27, 2010 3:26 pm

Re: lookahead with lookbehind syntax?

Post by kc11 »

Ragax,

With respect to your first point, how would one get the word that surrounds a match? I.e., let's say my match is 12345, and it is embedded in

http://ghj.com/781234589.jpg. How would I get 'http://ghj.com/781234589.jpg'?

If I could do that, I think it would be pretty easy to test for strings at the beginning and end, like "http" and ".jpg" respectively.

Thanks,

KC
ragax
Forum Commoner
Posts: 85
Joined: Thu Dec 15, 2011 1:40 pm
Location: Nelson, NZ

Re: lookahead with lookbehind syntax?

Post by ragax »

Above, you have a string STRING, and you want to match it within a larger string made of any characters except spaces? So we are looking for SPACE or BEGINNING, then STUFF that includes STRING?
For that, you can write:

Code: Select all

$regex="~(?:^|\s)\S+?YOURSTRING\S+~";
Not tested, I am not at my dev computer.
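For reference, a slightly different variant of the same idea, as a sketch only. It swaps (?:^|\s) for the lookbehind (?<!\S) so the leading space stays out of the match, uses \S* so the string may also sit at the very start or end of the word, and escapes the search string with preg_quote(). The variable names and the sample sentence are made up:

```php
<?php
$needle = '12345'; // the inner match whose surrounding "word" we want
$text = 'see http://ghj.com/781234589.jpg for details';

// (?<!\S) asserts the start of a whitespace-delimited token without consuming the space.
$regex = '~(?<!\S)\S*' . preg_quote($needle, '~') . '\S*~';

$token = null;
$isImageUrl = false;
if (preg_match($regex, $text, $m)) {
    $token = $m[0];
    // Test the two ends of the token, as suggested in the question above.
    $isImageUrl = strpos($token, 'http') === 0 && substr($token, -4) === '.jpg';
}
var_dump($token, $isImageUrl);
```

Here $token comes out as the full URL, and $isImageUrl flags it as something to discard.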

I still don't understand why you wouldn't simply match what you do want: it's a phone number, right? Maybe optional parentheses for the area code, then the area code, optional spaces or dashes, etc.? There are many examples of phone number regexes out there. It's hard to answer your question because it sounds like you have special requirements compared to the task of just matching a phone number, and I am not understanding what these are.
kc11
Forum Commoner
Posts: 73
Joined: Mon Sep 27, 2010 3:26 pm

Re: lookahead with lookbehind syntax?

Post by kc11 »

Thanks again Ragax,

I think my problem is that I am trying to match phone numbers in free text, and so people seem to use many formats. One of the formats is 10 digits with no spaces (XXXXXXXXXX), and this seems to also match long number strings, as in my provided samples. If the text was more structured, I would probably be having less trouble.
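One way to stop the plain 10-digit format from firing inside longer digit runs is to put a digit boundary on each side: a fixed-length negative lookbehind and a negative lookahead used together. This is only a sketch with invented sample text, and it will not help when a URL happens to contain exactly 10 digits:

```php
<?php
// Hypothetical sample: a bare 10-digit phone number and a 14-digit id inside a URL.
$text = 'Call 4155551212 now. Image: http://example.com/resize.php?pic=66112345678901';

// (?<!\d) and (?!\d) are both fixed-length, so PCRE accepts them side by side.
// Together they require exactly 10 digits, not a 10-digit slice of a longer number.
preg_match_all('~(?<!\d)\d{10}(?!\d)~', $text, $m);

print_r($m[0]);
```

On this sample only 4155551212 is returned; the 14-digit run is skipped because every 10-digit slice of it has a digit on at least one side.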

Regards,

KC