preg_split on spaces, but not within tags

Posted: Fri Jun 30, 2006 11:45 pm
by Robert K S
I'm trying to preg_split an HTML text string $text_thats_too_long on the spaces in order to shorten it to a predefined $word_limit, then recombine the split array using implode(). The goal is to truncate the text, but only on word boundaries. 'Til now I had been using something like this:

Code:

$truncated_text = implode(' ', array_slice(preg_split('/\s+/', $text_thats_too_long), 0, $word_limit)).'...';
The problem with this is that if $text_thats_too_long contains a hyperlink anchor tag (or any HTML tag with a space in it), the preg_split will operate on the space inside the tag, which is undesirable as it leaves poorly formed HTML: unclosed tags may result.
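
To make the failure concrete, here is a small made-up example. The sample string and the limit of 2 are assumptions chosen only for illustration:

```php
<?php
// Hypothetical sample input: the anchor tag contains a space.
$text_thats_too_long = 'Read <a href="page.html">the full story</a> on our site today';
$word_limit = 2;

// Splitting on every run of whitespace also splits inside the <a ...> tag.
$words = preg_split('/\s+/', $text_thats_too_long);
$truncated_text = implode(' ', array_slice($words, 0, $word_limit)) . '...';

echo $truncated_text, "\n"; // prints: Read <a...
```

The second "word" is just `<a`, so the output ends in the middle of the tag.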

Is there some other regular expression I can use to ensure I won't be preg_splitting on spaces within tags, or is this impossible to accomplish with a regular expression?

There seems to be some effort along the lines of what I'm trying to do posted here, but I don't think it has quite the same goal in mind.
http://www.codingforums.com/archive/ind ... 66748.html

Posted: Sat Jul 01, 2006 6:25 am
by bokehman
Even if you don't cut through an HTML tag, you will still end up with a string containing an imbalance of tags. I would suggest removing the tags first with strip_tags().
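
A rough sketch of that approach, assuming it is acceptable to lose all markup in the truncated text (the sample string and limit are made up):

```php
<?php
// Hypothetical sample input and word limit.
$text_thats_too_long = '<p>Read <a href="page.html">the full story</a> on our site today</p>';
$word_limit = 4;

// Remove every tag first, so only plain words are left to split.
$plain_text = strip_tags($text_thats_too_long);

// PREG_SPLIT_NO_EMPTY drops empty pieces caused by leading/trailing whitespace.
$words = preg_split('/\s+/', $plain_text, -1, PREG_SPLIT_NO_EMPTY);
$truncated_text = implode(' ', array_slice($words, 0, $word_limit)) . '...';

echo $truncated_text, "\n"; // prints: Read the full story...
```

The result can never contain a broken or unclosed tag, but the link itself is gone.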

Posted: Sat Jul 01, 2006 6:59 am
by Robert K S
Obviously, this is not the intent. Tags need to be preserved. I think I probably can't do what I hoped for with preg_split() alone, but I would still appreciate it if someone can tell me whether or not it is possible to detect a space not within an HTML tag using a single regular expression.

Posted: Sat Jul 01, 2006 7:02 am
by Benjamin
It is, although I'm not an expert at RegEx so I can't tell you how. This may be a starting point...

http://www.php.net/manual/en/function.p ... ch-all.php

Posted: Sat Jul 01, 2006 7:39 am
by bokehman
Avoiding the tag is not the problem. Notice in the following example you would be leaving a lot of tags open.
<div>word word <span class="name"> word word cut here word word</span> word</div>

Posted: Sat Jul 01, 2006 7:47 am
by Benjamin
bokehman wrote:Avoiding the tag is not the problem. Notice in the following example you would be leaving a lot of tags open.
<div>word word <span class="name"> word word cut here word word</span> word</div>
He is referring to spaces not within an html tag.

Posted: Sat Jul 01, 2006 8:00 am
by bokehman
astions wrote:
bokehman wrote:Avoiding the tag is not the problem. Notice in the following example you would be leaving a lot of tags open.
<div>word word <span class="name"> word word cut here word word</span> word</div>
He is referring to spaces not within an html tag.
I realise that but the purpose of the exercise is to truncate the string and doing so will cause tags to be left open.
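
A quick way to see this with the example string above. The tag counting here is deliberately naive and only a sketch: it assumes simple, well-formed tags and no void elements such as <br>:

```php
<?php
$html = '<div>word word <span class="name"> word word cut here word word</span> word</div>';

// Cut on the space just before "cut here", which lies outside any tag.
$truncated = substr($html, 0, strpos($html, ' cut here'));

// Naive counts of opening and closing tags in the truncated string.
$opened = preg_match_all('/<[a-z][^>]*>/i', $truncated);
$closed = preg_match_all('/<\/[a-z][^>]*>/i', $truncated);

echo $truncated, "\n";
echo 'unclosed tags: ', $opened - $closed, "\n"; // prints: unclosed tags: 2
```

No tag was cut in half, yet both the div and the span are left open.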

Posted: Sat Jul 01, 2006 2:34 pm
by sweatje
I think this might be what you are looking for, expressed as a SimpleTest case:

Code:

function testPregSplitSpacesNotInTags() {
	$str = 'This is a<div style="foo">test string<span class="bar">sdfsd
		</span boo bar> dsifjsd.</div>sdfsdf sdfds';
	$target = array(
		 'This'
		,'is'
		,'a'
		,'<div style="foo">test'
		,'string'
		,'<span class="bar">sdfsd'
		,'</span boo bar>'
		,'dsifjsd.'
		,'</div>sdfsdf'
 		,'sdfds');
	$arr = preg_split('/(?=<[^>]+>)|\s+(?![^<>]+>)/m', $str);
	$this->assertEqual($target,array_merge(array_diff($arr,array(''))));
	// alternatively, if there is no need to break at the tag boundaries
	$target2 = array(
		 'This'
		,'is'
		,'a<div style="foo">test'
		,'string<span class="bar">sdfsd'
		,'</span boo bar>'
		,'dsifjsd.</div>sdfsdf'
		,'sdfds');
	$arr2 = preg_split('/\s+(?![^<>]+>)/m', $str);
	$this->assertEqual($target2,array_merge(array_diff($arr2,array(''))));
}
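
Dropped back into the truncation one-liner from the first post, the second pattern might be used roughly like this. The sample string and limit are made up, and the /m modifier is left off because the pattern has no ^ or $ anchors:

```php
<?php
// Hypothetical sample input and word limit.
$text_thats_too_long = 'Read <a href="page.html">the full story</a> on our site today';
$word_limit = 2;

// Split on whitespace unless a ">" follows before any other "<" or ">",
// which is taken to mean the whitespace sits inside a tag.
$pieces = preg_split('/\s+(?![^<>]+>)/', $text_thats_too_long, -1, PREG_SPLIT_NO_EMPTY);
$truncated_text = implode(' ', array_slice($pieces, 0, $word_limit)) . '...';

echo $truncated_text, "\n"; // prints: Read <a href="page.html">the...
```

The anchor tag now survives intact, although it is still left unclosed after truncation.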

Posted: Mon Jul 10, 2006 1:01 pm
by Robert K S
Thank you, Jason!

The latter example was exactly what I was looking for.

And I had never heard of SimpleTest before. I didn't find its web page very clear in describing its function. If you don't mind my asking, how did you use it to discover the solution?

Posted: Mon Jul 10, 2006 1:11 pm
by sweatje
Unit testing is writing code to test your code. I wrote a unit test to run a regex against your test data, and stated what I wanted to see as a result. Once this "harness" is in place, you can run the unit test over and over until you get the regex correct. Unit testing applies to all areas of your code, not just regular expressions. Going through the tutorials on either the simpletest.org or http://www.lastcraft.com sites is probably your best bet for getting a basic understanding of how to write unit tests.
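
To give a feel for the idea without installing anything, here is a bare-bones harness in plain PHP. It does not use SimpleTest; splits_as_expected() is a hypothetical helper written just for this sketch:

```php
<?php
// Hypothetical helper, not part of SimpleTest: true when $pattern splits
// $input into exactly the $expected pieces.
function splits_as_expected($pattern, $input, $expected)
{
    $pieces = preg_split($pattern, $input, -1, PREG_SPLIT_NO_EMPTY);
    return $pieces === $expected;
}

// State the test data and the desired result once...
$input    = 'a <b class="x">bold</b> word';
$expected = array('a', '<b class="x">bold</b>', 'word');

// ...then try candidate patterns against it until one passes.
$candidates = array(
    'naive'     => '/\s+/',
    'tag-aware' => '/\s+(?![^<>]+>)/',
);

foreach ($candidates as $name => $pattern) {
    $ok = splits_as_expected($pattern, $input, $expected);
    echo $name, ': ', ($ok ? 'pass' : 'fail'), "\n";
}
// prints:
// naive: fail
// tag-aware: pass
```

A real test framework adds reporting and grouping on top, but the loop of "state the expectation, run, adjust the regex, run again" is the same.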

Working on the assumption that testing your code is good, the next step up is to write your tests before you write your code. This practice is called Test Driven Development or Test First Development. Googling for each of these phrases should give you lots of insight into this development methodology.

Last, there is a testing forum here on this site, which is a somewhat unique resource in the PHP world.

Regards,
Jason

Posted: Mon Jul 10, 2006 1:25 pm
by Robert K S
So, was SimpleTest able to automate the derivation of the regexes you suggested based on the known input and desired output, or was it still a manual trial-and-error process for the coder?

Posted: Mon Jul 10, 2006 1:32 pm
by sweatje
Robert K S wrote:So, was SimpleTest able to automate the derivation of the regexes you suggested based on the known input and desired output, or was it still a manual trial-and-error process for the coder?
I would prefer the wording "iterative process" ;)

Posted: Fri Jul 14, 2006 1:54 am
by Robert K S
I could have called it a heuristic process, but we're all friends here. :)

Posted: Fri Jul 14, 2006 2:55 am
by bokehman
sweatje wrote:

Code:

<[^>]+>
The trouble with that is it will find things that are not html tags.

Posted: Fri Jul 14, 2006 10:10 pm
by feyd
bokehman wrote:
sweatje wrote:

Code:

<[^>]+>
The trouble with that is it will find things that are not html tags.
Only in malformed text and, in general, malformed tags too.
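
A small check of both points, using made-up sample strings:

```php
<?php
// Hypothetical sample strings for illustration.
$samples = array(
    'well-formed tag'      => 'Click <a href="x.html">here</a>',
    'escaped comparison'   => 'if a &lt; b and c &gt; d',
    'unescaped comparison' => 'if a < b and c > d',
);

$found = array();
foreach ($samples as $label => $text) {
    // Collect everything the pattern treats as a "tag".
    preg_match_all('/<[^>]+>/', $text, $matches);
    $found[$label] = $matches[0];
    echo $label, ': ', implode(' | ', $matches[0]), "\n";
}
// prints:
// well-formed tag: <a href="x.html"> | </a>
// escaped comparison: 
// unescaped comparison: < b and c >
```

The pattern does match something that is not a tag, but only when the text contains an unescaped "<", which is not valid HTML text to begin with.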