(Solved) Detect sentences and print code within

Posted: Thu Sep 25, 2008 5:23 pm
by LuckyShot
Hi everyone,

I have been searching for a way to detect sentences in XHTML code, skipping the tags themselves, and to wrap each sentence in some extra markup.
I guess RegEx is the best bet. I have seen some "tag detecting" regular expressions among the many examples and websites out there, and I can write some simple (really simple) expressions myself.
But I can't work out a solution, nor find any resource or clue on the Internet for my coding enigma, so I thought it would be worth asking the community about this specific issue.
Here is an example (I hope this makes it clearer :P):

Original code:

<h1>Header</h1>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</p>
<p>Aliquam <a href="orci">orci mauris</a>, faucibus a, ornare in, consequat at, ante. Duis lobortis nisi in velit.</p>
Printed code:

<h1><span title="begins with H">Header</span></h1>
<p><span title="begins with L">Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</span></p>
<p><span title="begins with A">Aliquam <a href="orci">orci mauris</a>, faucibus a, ornare in, consequat at, ante. Duis lobortis nisi in velit.</span></p>
So the main goal is to split the code into alternating tags-text-tags-text... pieces, then maybe store those pieces in arrays; after that it would be pretty easy to add some code in between.
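
The splitting idea described above could be sketched with preg_split() roughly as follows. This is only a minimal illustration, not a full XHTML parser: the variable names ($html, $parts) and the shortened sample string are made up for the example, and it assumes that no attribute value contains a ">" character.

```php
<?php
// Hypothetical sample input, shortened from the example above.
$html = '<p>Aliquam <a href="orci">orci mauris</a>, faucibus a.</p>';

// Split on tags. PREG_SPLIT_DELIM_CAPTURE keeps the tags themselves in the
// result; PREG_SPLIT_NO_EMPTY drops the empty strings between adjacent tags.
$parts = preg_split(
    '#(<[^>]+>)#',
    $html,
    -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);

// Label every piece as either a tag or a run of text.
foreach ($parts as $part) {
    $kind = ($part[0] === '<') ? 'tag ' : 'text';
    echo $kind, ': ', $part, "\n";
}
```

The resulting array alternates between tags and text, so extra markup could be added around the text pieces before joining everything back together with implode().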

Does that make sense? Should I give more examples/explanation?
Please feel free to ask for more details.

Thanks and regards,

LuckyShot

Re: Detect sentences and print code within

Posted: Fri Sep 26, 2008 7:11 am
by prometheuzz
Try this:

$text = '<h1>Header</h1>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</p>
<p>Aliquam <a href="orci">orci mauris</a>, faucibus a, ornare in, 
consequat at, ante. Duis lobortis nisi in velit.</p>';
 
echo preg_replace(   // preg_replace() returns the new string, so print it
  '#(<([^\s>]+)>)((.)(?:(?!</\2>).)*)(</\2>)#',
  '$1<span title="begins with $4">$3</span>$5',
  $text
);
HTH

Re: Detect sentences and print code within

Posted: Fri Sep 26, 2008 9:02 am
by LuckyShot
Thank you Prometheuzz!

That's a great piece of RegEx dude.
It works perfectly except for the 3rd line of the code (there is an <a> tag there).
But no worries, this is useful enough and will work for my little project.

I am trying to "disassemble" the expression but I have a few doubts:

Start with "<":

(<

then one or more characters/numbers or spaces:

([^\s>]+)

then a ">":

>)

then anything except a line break:

((.)

then maybe...

(?:

not "</" plus the tag we got at the beginning plus ">":

(?!</\2>)

but one or more of any characters (except line breaks):

.)*)

then a "</", the tag we got at the beginning and ">":

(</\2>)

Is that somehow right? I have read some RegEx newbie tutorials but still find it difficult to understand some expressions.

Thank you!

Re: Detect sentences and print code within

Posted: Fri Sep 26, 2008 9:31 am
by prometheuzz
LuckyShot wrote: Thank you Prometheuzz!
...
You're welcome.

LuckyShot wrote: That's a great piece of RegEx dude.
It works perfectly except for the 3rd line of the code (there is an <a> tag there).
...
I don't quite understand. When I test what I posted, it produces exactly the output you gave in your original post.

LuckyShot wrote: Is that somehow right? I have read some RegEx newbie tutorials but still find it difficult to understand some expressions.

Thank you!
Close.

Here's a (short) explanation:

PATTERN
 
(                         # open group 1
  <                       #   match '<'
  (                       #   open group 2
    [^\s>]+               #     one or more characters of any type except white space characters and '>' 
  )                       #   close group 2
  >                       #   match '>'
)                         # close group 1
(                         # open group 3
  (                       #   open group 4
    .                     #     any character
  )                       #   close group 4
  (?:                     #   open a non-capturing group
    (?!                   #     when looking ahead ...
      </\2>               #       ... if there's not a string '</' followed by what is matched in group 2, followed by '>' ...
    )                     #     stop looking ahead
    .                     #     ... then match any character ...
  )                       #   close the non-capturing group
  *                       #   ... zero or more times
)                         # close group 3
(                         # open group 5
  </                      #   match '</'
  \2                      #     match what is stored in group 2
  >                       #   match '>'
)                         # close group 5
 
 
REPLACEMENT
 
$1                        # what is stored in group 1
<span title="begins with  # '<span title="begins with ' (including the trailing space)
$4                        # what is stored in group 4
">                        # '">'
$3                        # what is stored in group 3
</span>                   # '</span>'
$5                        # what is stored in group 5
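
For anyone who wants to keep this kind of annotation inside the code itself, here is a rough sketch of the same pattern written with the x (extended) modifier, which makes PCRE ignore whitespace and #-comments inside the pattern. The variable names $pattern and $replacement are just for illustration, and the delimiter is changed from # to ~ because # now starts a comment.

```php
<?php
// Same pattern as in the thread, written with the x (extended) modifier so
// that whitespace and #-comments inside the pattern are ignored.
$pattern = '~
    (<([^\s>]+)>)        # group 1: opening tag, group 2: its name
    (                    # group 3: everything between the tags
      (.)                # group 4: first character of the content
      (?:(?!</\2>).)*    # any character that does not start the closing tag
    )
    (</\2>)              # group 5: matching closing tag
~x';

$replacement = '$1<span title="begins with $4">$3</span>$5';

echo preg_replace($pattern, $replacement, '<h1>Header</h1>'), "\n";
// prints: <h1><span title="begins with H">Header</span></h1>
```

Note that comments inside such a pattern must not contain the delimiter character or, in a single-quoted PHP string, an unescaped apostrophe.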

Re: Detect sentences and print code within

Posted: Fri Sep 26, 2008 10:18 am
by LuckyShot
Woah, thanks for the detailed step-by-step!

The code wasn't working for me because of the line break between the 3rd and 4th lines of $text.
prometheuzz wrote:

$text = '<h1>Header</h1>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</p>
<p>Aliquam <a href="orci">orci mauris</a>, faucibus a, ornare in, 
consequat at, ante. Duis lobortis nisi in velit.</p>';
...
I just removed it and it works perfectly now.
I will now have a more detailed look at your explanation.

Thanks very much for your time on this Prometheuzz!
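
A side note on the line-break issue: in PCRE the dot does not match a newline by default, which is why the paragraph with the line break was skipped. Instead of removing the line break from the input, one option is to add the s (PCRE_DOTALL) modifier to the pattern. A small sketch, where the variable names $without and $with are only for illustration:

```php
<?php
// The third paragraph from the thread, including the line break.
$text = '<p>Aliquam <a href="orci">orci mauris</a>, faucibus a, ornare in, 
consequat at, ante. Duis lobortis nisi in velit.</p>';

$replacement = '$1<span title="begins with $4">$3</span>$5';

// Without s: "." stops at the line break, so the paragraph is left unchanged.
$without = preg_replace('#(<([^\s>]+)>)((.)(?:(?!</\2>).)*)(</\2>)#', $replacement, $text);

// With s (PCRE_DOTALL): "." also matches the line break.
$with = preg_replace('#(<([^\s>]+)>)((.)(?:(?!</\2>).)*)(</\2>)#s', $replacement, $text);

var_dump($without === $text);  // bool(true): nothing was replaced
echo $with, "\n";              // the whole paragraph is wrapped in the span
```

With the s modifier the line break stays in the output, inside the new span.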

Re: Detect sentences and print code within

Posted: Fri Sep 26, 2008 11:17 am
by prometheuzz
No problem, LuckyShot. Don't hesitate to post a follow-up question: there's a tricky part in the regex I posted which I don't mind explaining a bit more if needed.

Good luck!