Check if string is contained by two other strings?

Posted: Mon Nov 16, 2009 4:35 pm
by iankent
Hi everyone - new here so please bear with me. Wish I'd found this forum years ago though :P

My question is a bit of a strange one, and I'm guessing it'll be some kind of regex, but regex isn't my strong point and there may be another easier way of doing this that I'm missing!

I'm using a regex that somebody else wrote for me to detect a particular pattern in a string, and need to know whether the matched string is contained within another pair of strings.

For argument's sake, let's say I'm looking for an e-mail address amongst (properly formatted) HTML code. Once I've found the match, I'd like to know whether it's contained within a <p> and </p> tag. I.e., I need to treat these two cases differently:
some@address.com
<p>some@address.com</p>

That would be easy enough, but the problem is that the HTML tags can be nested in any fashion, and the match may be found anywhere. So, I need to identify the following cases:
<p>My address is some@address.com</p>
<p><b>You can e-mail me at</b> <u>some@address.com</u></p>
but not:
<p>hello</p><u>some@address.com</u><p>another hello</p>

Hope this is making sense so far :P I was thinking I could check for the <p> and </p> tags and compare their positions with the position of the regex match, but that would also match the last example, as it is (textually) between a <p> and a </p> tag, even though it clearly isn't inside a paragraph because of the closing </p> before it and the second opening <p> after it.

Any suggestions on the best way of doing this?

Thanks in advance :)
Ian

edit:
After writing all that it occurred to me that converting it to an XML document, finding the match and checking parent nodes for the tags might work? But that wouldn't work if the document wasn't HTML/XML, and would fail if the start/end tags were different (e.g., matching ABC between ! and #).
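
For what it's worth, here is a rough sketch of that "check the parent nodes" idea, assuming Python and its standard html.parser module. EMAIL_RE is only a placeholder (not the regex I was actually given), and EmailContextParser and emails_with_context are names made up for the example. It keeps a stack of open tags and, for each match, records whether a <p> is currently open:

```python
import re
from html.parser import HTMLParser

# Placeholder pattern for illustration only; swap in the real regex.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


class EmailContextParser(HTMLParser):
    """Records, for each e-mail found in text, whether a <p> is open."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tag names
        self.results = []    # (email, inside_p) pairs

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        for match in EMAIL_RE.finditer(data):
            self.results.append((match.group(), "p" in self.open_tags))


def emails_with_context(html):
    parser = EmailContextParser()
    parser.feed(html)
    parser.close()
    return parser.results
```

As expected, this only helps when the input really is HTML; it does nothing for the ! and # kind of delimiters.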

Re: Check if string is contained by two other strings?

Posted: Mon Nov 16, 2009 5:17 pm
by requinix
The "compare their positions" approach can work, but you have to look for both a <p> and a </p>: on both sides of the email, the nearest <p> must come after the nearest </p>.

(Actually, if the HTML is valid then all you have to do is look before the email.)
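
A minimal sketch of that position check, assuming Python and plain string searches. The function name is_enclosed and its parameters are made up for illustration, the delimiters are matched literally (so a <p> with attributes would not be found), and nested pairs of the same delimiter are not handled:

```python
def is_enclosed(text, start, end, open_tag="<p>", close_tag="</p>"):
    """True if text[start:end] sits inside an open_tag ... close_tag pair."""
    before, after = text[:start], text[end:]

    # Left side: the nearest opener must come after the nearest closer.
    last_open = before.rfind(open_tag)
    last_close = before.rfind(close_tag)
    opened = last_open != -1 and last_open > last_close

    # Right side: the nearest closer must come before the nearest opener.
    next_close = after.find(close_tag)
    next_open = after.find(open_tag)
    closed = next_close != -1 and (next_open == -1 or next_close < next_open)

    return opened and closed
```

Because the delimiters are just parameters, the same check should also cover non-HTML cases such as ABC between ! and #, e.g. is_enclosed(text, start, end, "!", "#").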

Re: Check if string is contained by two other strings?

Posted: Mon Nov 16, 2009 5:25 pm
by iankent
tasairis wrote: but you look for both a <p> and a </p>: the nearest <p> must come after the nearest </p> on both sides of email.
Thanks for the quick reply - that sort of makes sense but I'm going to have to sleep on that one and try coding it tomorrow lol, getting a bit confused just thinking about it :) I think I know what you mean!

Re: Check if string is contained by two other strings?

Posted: Mon Nov 16, 2009 5:45 pm
by requinix
iankent wrote: Thanks for the quick reply - that sort of makes sense but I'm going to have to sleep on that one and try coding it tomorrow lol, getting a bit confused just thinking about it :) I think I know what you mean!
Three possibilities:
1. There aren't any <p>s before the email (didn't find a <p>)

Text text text email@example.com...
2. There is a <p> before the email but it ended (found a <p> but also found a </p> after it)

<p>Text text</p> text email@example.com...
3. There is a <p> before the email and it didn't end (found a <p> and no </p>)

<p>Text text test email@example.com...
The important part is that you find the nearest <p>.
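
The three possibilities could be sketched like this, assuming Python, valid HTML, and a literal <p> with no attributes. The function name classify and its return labels are invented for the example; match_start is the offset where the regex match begins:

```python
def classify(text, match_start):
    """Classify the text before a match into one of the three cases."""
    before = text[:match_start]

    # Find the nearest <p> before the match.
    last_open = before.rfind("<p>")
    if last_open == -1:
        return "none"    # case 1: no <p> before the match

    # Look for a </p> between that <p> and the match.
    if before.find("</p>", last_open) != -1:
        return "closed"  # case 2: the nearest <p> already ended

    return "open"        # case 3: the nearest <p> has not ended
```

Only the "open" result means the match is inside a paragraph.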

Re: Check if string is contained by two other strings?

Posted: Mon Nov 16, 2009 5:50 pm
by iankent
Brilliant, thank you, you've saved me having to think too much - always a good thing :)

Really appreciate your help!