PHP Developers Network

A community of PHP developers offering assistance, advice, discussion, and friendship.
 

All times are UTC - 5 hours




Posted: Fri Mar 30, 2012 4:13 pm
Forum Commoner

Joined: Mon Sep 27, 2010 3:26 pm
Posts: 73
Hi,

I am scanning through some HTML, but my basic regex is being tripped up by .jpg file names, which I would like to exclude. I've added a negative lookahead, which has helped, but I would also like to add a negative lookbehind, and that causes the entire regex to fail.

preg_match_all(',..basic regex..(?!.*\.jpg"),', $html, $all_matches); // THIS LOOKAHEAD WORKS


preg_match_all(',..basic regex..(?!.*\.jpg")(?<!.*\.jpg),', $html, $all_matches); // THIS LOOKAHEAD PLUS LOOKBEHIND DOES NOT WORK


Can anyone see anything wrong? Can you use a lookahead and a lookbehind together?

Thanks in advance,

KC


Posted: Fri Mar 30, 2012 6:23 pm
Forum Commoner

Joined: Thu Dec 15, 2011 2:40 pm
Posts: 85
Location: Nelson, NZ
Hi kc,

You can definitely use lookaheads and lookbehinds together (the link is to the lookaround page on my tut). You can even use a lookbehind inside a lookahead inside a lookbehind! :)

Two comments on your regex:
1. For the negative lookahead, I assume you realize that with the dot-star, depending on your HTML, the regex might fail just because of an unrelated .jpg far away on the same line (by default the dot does not match newlines), for instance "ok_file.txt lots of stuff unrelated.jpg".
2. PHP's PCRE regex engine does not allow variable-length lookbehinds, which is what you have here with the .* inside the negative lookbehind. A lookbehind must have a fixed length, as the literal characters in "Fixed_Length" do.
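
To make the second point concrete, here is a minimal sketch. The sample string and the variable names are made up for illustration and are not from the thread. It shows the `.*` lookbehind being rejected at compile time, and a fixed-length lookbehind working together with a lookahead.

```php
<?php
// Hypothetical sample text, invented for this sketch.
$text = 'pic=661 img 7812.jpg call 5551212';

// A lookbehind containing .* has no fixed length, so PCRE refuses to
// compile the pattern. PHP raises a warning (silenced with @ here)
// and preg_match_all() returns false instead of a match count.
$bad = @preg_match_all('~\d+(?<!.*\.jpg)~', $text, $unused);
var_dump($bad); // bool(false)

// Fixed-length lookarounds can be combined freely: match whole digit
// runs that are not preceded by "=" and not followed by ".jpg".
preg_match_all('~(?<!=)\b\d+\b(?!\.jpg)~', $text, $m);
print_r($m[0]); // only "5551212" survives
```

When a pattern fails to compile, preg_last_error_msg() (PHP 8) or the emitted warning tells you why.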

Can you give me examples of the strings you are trying to exclude with that negative lookbehind? I am not seeing it right now, but if you explain it, I will look for a solution.


Posted: Sat Mar 31, 2012 3:07 pm
Forum Commoner

Joined: Mon Sep 27, 2010 3:26 pm
Posts: 73
Thank you Ragax,

First, let me say that I am trying to match phone numbers, so my basic regex right now is:

preg_match_all(',\d{3}\D?\D?\D?\d{3}\D?\d{4}(?!.*jpg"),',$html, $all_matches);


Example strings that give false matches include:

http://www.bugbr.com/resize.php?pic=661 ... height=450
<a href="http://www.facebook.com/media/set/?set=a.1........06570.386347.263042886569&amp;type=3"
 


Another option I'm considering is to use PHP's XPath support to get the text nodes only, and then run the regex on just those nodes, to avoid all these image-related false matches.
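
A rough sketch of that text-node idea, assuming the DOM extension is available. The HTML sample, the example.com URL, and the variable names are invented for illustration; the phone pattern is the basic one above without the lookahead.

```php
<?php
// Hypothetical HTML sample; the URL and numbers are made up.
$html = '<html><body><p>Call 415-555-1212 today.</p>'
      . '<img src="http://example.com/4155551212.jpg" alt=""></body></html>';

$doc = new DOMDocument();
// Real-world HTML is rarely valid, so keep parser warnings internal.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath  = new DOMXPath($doc);
$phones = [];

// Visit text nodes only, skipping script and style contents. Attribute
// values such as src="..." are not text nodes, so they are never scanned.
$query = '//text()[not(ancestor::script) and not(ancestor::style)]';
foreach ($xpath->query($query) as $node) {
    if (preg_match_all('~\d{3}\D?\D?\D?\d{3}\D?\d{4}~', $node->nodeValue, $m)) {
        $phones = array_merge($phones, $m[0]);
    }
}

print_r($phones); // the number in the paragraph, not the one in the img URL
```

URLs that appear as visible link text would still be scanned, so this reduces the false matches rather than eliminating them.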

Thank you,

KC


Posted: Sat Mar 31, 2012 3:30 pm
Forum Commoner

Joined: Thu Dec 15, 2011 2:40 pm
Posts: 85
Location: Nelson, NZ
Hi KC,

Thanks for sending these false positives.
Two ideas.

1. Could some kind of boundary solve your problem? For instance, you could make sure the match starts after a space or at the beginning of a line: (^|\s), and ends in a similar fashion: ($|\s). After all, why would a phone number be embedded in other text, as in text415-555-1212, unless it was part of a URL?

2. Are all these false positives part of a URL? If so, do they all end in a single or double quote, without any space in between? If so, you could use a negative lookahead such as (?!\S+['"]), which means "not followed by one or more non-space characters and then a single or double quote".
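
Both ideas can be sketched roughly as follows. The sample string and variable names are hypothetical, and idea 1 is written with lookarounds rather than (^|\s) and ($|\s) so that the surrounding spaces stay out of the match; treat it as a starting point, not a tested solution for your HTML.

```php
<?php
// Hypothetical sample: one phone number in free text and one digit run
// inside a quoted URL.
$html = 'Call 415-555-1212 now '
      . '<a href="http://example.com/?id=4155551212&amp;type=3">x</a>';

$phone = '\d{3}\D?\D?\D?\d{3}\D?\d{4}'; // basic pattern from the thread

// Idea 1: require whitespace, or the start/end of the string, on both
// sides. (?<!\S) and (?!\S) are fixed-length, so PCRE accepts them.
preg_match_all('~(?<!\S)' . $phone . '(?!\S)~', $html, $byBoundary);

// Idea 2: reject candidates that are followed by non-space characters
// and then a single or double quote (typical inside an attribute value).
preg_match_all('~' . $phone . '(?!\S+[\'"])~', $html, $byLookahead);

print_r($byBoundary[0]);  // only the free-text number
print_r($byLookahead[0]); // only the free-text number
```

Note that \S+ needs at least one character before the quote, so a number sitting directly before a closing quote would still match; \S* would cover that case as well.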

Wishing you a fun weekend.


Posted: Tue Apr 03, 2012 9:51 am
Forum Commoner

Joined: Mon Sep 27, 2010 3:26 pm
Posts: 73
Hi Ragax,

Thanks for your insights. I'm going to work on combining your ideas into my regex. It may turn out to be simpler than I thought.

Best regards,

KC


Posted: Tue Apr 03, 2012 11:49 am
Forum Commoner

Joined: Mon Sep 27, 2010 3:26 pm
Posts: 73
Ragax,

With respect to your first point, how would one get the word that surrounds a match? For example, let's say my match is 12345, and it is embedded in

http://ghj.com/781234589.jpg. How would I get 'http://ghj.com/781234589.jpg'?

If I could do that, I think it would be pretty easy to test for strings at the beginning and end, like "http" and ".jpg" respectively.

Thanks,

KC


Posted: Tue Apr 03, 2012 3:42 pm
Forum Commoner

Joined: Thu Dec 15, 2011 2:40 pm
Posts: 85
Location: Nelson, NZ
Above, you have a string STRING, and you want to match it within a larger string made of any characters except spaces? So we are looking for SPACE or BEGINNING, then STUFF that includes STRING?
For that, you can write:
$regex = "~(?:^|\s)\S+?YOURSTRING\S+~";

Not tested; I am not at my dev computer.
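
For illustration, here is one way that could look in full. This is a variation on the regex above, not the same pattern: it uses \S* on both sides so the string may also sit at the edge of a token, and it leaves the leading space out of the match. The sample text and variable names are assumptions built around the URL from the question.

```php
<?php
// Sample text built around the URL from the question; the rest is made up.
$text   = 'see http://ghj.com/781234589.jpg for details, ref 12345 here';
$needle = '12345';

// \S* on both sides expands the match to the whole whitespace-delimited
// token. preg_quote() escapes any regex metacharacters in the needle.
$regex = '~\S*' . preg_quote($needle, '~') . '\S*~';
preg_match_all($regex, $text, $m);
$tokens = $m[0];

// With the full token in hand, testing its beginning and end is easy:
// drop tokens that start with http(s):// and end in .jpg or .jpeg.
$kept = array_values(array_filter($tokens, function ($token) {
    return !preg_match('~^https?://.*\.jpe?g$~i', $token);
}));

print_r($tokens); // the full URL and the standalone 12345
print_r($kept);   // only the standalone 12345
```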

I still don't understand why you wouldn't simply match what you do want: it's a phone number, right? Maybe optional parentheses for the area code, then the area code, optional spaces or dashes, etc.? There are many examples of phone number regexes out there. It's hard to answer your question because it sounds like you have special requirements beyond just matching a phone number, and I don't understand what these are.


Posted: Wed Apr 04, 2012 7:35 am
Forum Commoner

Joined: Mon Sep 27, 2010 3:26 pm
Posts: 73
Thanks again Ragax,

I think my problem is that I am trying to match phone numbers in free text, and people seem to use many formats. One of the formats is 10 digits with no spaces (XXXXXXXXXX), and this also matches parts of long number strings, as in the samples I provided. If the text were more structured, I would probably be having less trouble.
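
One possible way to keep the bare 10-digit format from matching inside longer digit runs is to require that no digit comes directly before or after it. This is only a sketch with made-up sample data, and a run of exactly 10 digits inside a URL would still match.

```php
<?php
// Hypothetical sample: a bare 10-digit number plus two longer digit runs
// of the kind that show up in image and album URLs.
$samples = 'call 4155551212 or see set=a.263042886569 and pic=66112345678901';

// (?<!\d) and (?!\d) are fixed-length lookarounds: the 10 digits must not
// be part of a longer run of digits on either side.
preg_match_all('~(?<!\d)\d{10}(?!\d)~', $samples, $m);

print_r($m[0]); // only 4155551212
```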

Regards,

KC

