preg_split on spaces, but not within tags

Robert K S
Forum Newbie
Posts: 11
Joined: Thu Dec 18, 2003 7:06 pm
Location: Cleveland, Ohio

Post by Robert K S »

I'm trying to preg_split an HTML text string $text_thats_too_long on the spaces in order to shorten it to a predefined $word_limit, then recombine the split array using implode(). The goal is to truncate the text, but only on word boundaries. 'Til now I had been using something like this:

Code: Select all

$truncated_text = implode(' ', array_slice(preg_split('/\s+/', $text_thats_too_long), 0, $word_limit)).'...';
The problem with this is that if $text_thats_too_long contains a hyperlink anchor tag (or any HTML tag with a space in it), the preg_split will operate on the space inside the tag, which is undesirable as it leaves poorly formed HTML: unclosed tags may result.
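
To make the failure concrete, here is a minimal sketch of the one-liner above run against a made-up input (the sample text, the URL and the word limit of 3 are assumptions chosen only for illustration):

```php
<?php
// Hypothetical sample input; the link and the limit are invented for this sketch.
$text_thats_too_long = 'Read the <a href="http://example.com/">full story</a> here today';
$word_limit = 3;

// Naive split: every run of whitespace is a boundary, including the one inside <a ...>.
$words = preg_split('/\s+/', $text_thats_too_long);
$truncated_text = implode(' ', array_slice($words, 0, $word_limit)) . '...';

echo $truncated_text, "\n"; // Read the <a...
```

The third "word" is the fragment `<a`, so the output ends in the middle of a tag.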

Is there some other regular expression I can use to ensure I won't be preg_splitting on spaces within tags, or is this impossible to accomplish with a regular expression?

There seems to be some effort along the lines of what I'm trying to do posted here, but I don't think it has quite the same goal in mind.
http://www.codingforums.com/archive/ind ... 66748.html
bokehman
Forum Regular
Posts: 509
Joined: Wed May 11, 2005 2:33 am
Location: Alicante (Spain)

Post by bokehman »

Even if you don't cut through an HTML tag, you will still end up with a string containing an imbalance of tags. I would suggest removing the tags first with strip_tags().
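
A rough sketch of that suggestion, assuming losing the markup is acceptable (the sample input and limit are invented for illustration):

```php
<?php
// Hypothetical sample input for illustration only.
$text_thats_too_long = 'Read the <a href="http://example.com/">full story</a> here today';
$word_limit = 3;

// strip_tags() removes the markup, so every remaining space is a safe split point.
$plain = strip_tags($text_thats_too_long);

// PREG_SPLIT_NO_EMPTY drops empty pieces caused by leading or trailing whitespace.
$words = preg_split('/\s+/', $plain, -1, PREG_SPLIT_NO_EMPTY);
$truncated_text = implode(' ', array_slice($words, 0, $word_limit)) . '...';

echo $truncated_text, "\n"; // Read the full...
```

The result is always balanced because no tags remain, but the link itself is gone.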

Post by Robert K S »

Obviously, this is not the intent. Tags need to be preserved. I think I probably can't do what I hoped for with preg_split() alone, but I would still appreciate it if someone can tell me whether or not it is possible to detect a space not within an HTML tag using a single regular expression.
Benjamin
Site Administrator
Posts: 6935
Joined: Sun May 19, 2002 10:24 pm

Post by Benjamin »

It is, although I'm not an expert at RegEx so I can't tell you how. This may be a starting point...

http://www.php.net/manual/en/function.p ... ch-all.php

Post by bokehman »

Avoiding the tag is not the problem. Notice in the following example you would be leaving a lot of tags open.
<div>word word <span class="name"> word word cut here word word</span> word</div>

Post by Benjamin »

bokehman wrote: Avoiding the tag is not the problem. Notice in the following example you would be leaving a lot of tags open.
<div>word word <span class="name"> word word cut here word word</span> word</div>
He is referring to spaces not within an html tag.

Post by bokehman »

astions wrote:
bokehman wrote:Avoiding the tag is not the problem. Notice in the following example you would be leaving a lot of tags open.
<div>word word <span class="name"> word word cut here word word</span> word</div>
He is referring to spaces not within an html tag.
I realise that, but the purpose of the exercise is to truncate the string, and doing so will cause tags to be left open.
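
The open-tag concern can be illustrated with a short sketch. The helper `unclosed_tags()` is hypothetical (it is not from this thread or from PHP itself) and assumes properly nested markup with no void elements such as `<br>`; it only shows which elements remain open after the cut and one way they might be closed again:

```php
<?php
// Hypothetical helper: list the elements still open at the end of an HTML fragment.
// Assumes properly nested markup and no void or self-closing elements.
function unclosed_tags($html)
{
    $open = array();
    // Capture an optional leading slash and the tag name of every tag.
    preg_match_all('/<(\/?)([a-z][a-z0-9]*)[^>]*>/i', $html, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        if ($m[1] === '/') {
            array_pop($open);            // closing tag: the innermost element ends
        } else {
            $open[] = strtolower($m[2]); // opening tag: remember it
        }
    }
    return $open;
}

// The example above, truncated at the "cut here" point.
$truncated = '<div>word word <span class="name"> word word';
$open = unclosed_tags($truncated);
print_r($open); // div, span

// Append the missing closing tags, innermost first.
foreach (array_reverse($open) as $tag) {
    $truncated .= '</' . $tag . '>';
}
echo $truncated, "\n"; // <div>word word <span class="name"> word word</span></div>
```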
sweatje
Forum Contributor
Posts: 277
Joined: Wed Jun 29, 2005 10:04 pm
Location: Iowa, USA

Post by sweatje »

I think this might be what you are looking for, expressed as a SimpleTest case:

Code: Select all

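// Method of a SimpleTest test case class ($this->assertEqual comes from the framework).
function testPregSplitSpacesNotInTags() {
	$str = 'This is a<div style="foo">test string<span class="bar">sdfsd
		</span boo bar> dsifjsd.</div>sdfsdf sdfds';
	$target = array(
		 'This'
		,'is'
		,'a'
		,'<div style="foo">test'
		,'string'
		,'<span class="bar">sdfsd'
		,'</span boo bar>'
		,'dsifjsd.'
		,'</div>sdfsdf'
		,'sdfds');
	// Split just before every tag (zero-width lookahead), or on whitespace
	// that is not followed by tag-interior text ending in ">".
	// The /m modifier has no effect here because the pattern has no ^ or $.
	$arr = preg_split('/(?=<[^>]+>)|\s+(?![^<>]+>)/m', $str);
	// array_diff() drops the empty strings; array_merge() reindexes from 0.
	$this->assertEqual($target,array_merge(array_diff($arr,array(''))));
	// alternatively, if there is no need to break at the tag boundaries
	$target2 = array(
		 'This'
		,'is'
		,'a<div style="foo">test'
		,'string<span class="bar">sdfsd'
		,'</span boo bar>'
		,'dsifjsd.</div>sdfsdf'
		,'sdfds');
	// Split only on whitespace that lies outside a tag.
	$arr2 = preg_split('/\s+(?![^<>]+>)/m', $str);
	$this->assertEqual($target2,array_merge(array_diff($arr2,array(''))));
}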
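
For reference, a sketch of how the second pattern might slot into the truncation one-liner from the first post. The sample input and limit are invented for illustration, and `PREG_SPLIT_NO_EMPTY` is used here in place of the `array_diff()`/`array_merge()` cleanup:

```php
<?php
// Hypothetical sample input for illustration only.
$text_thats_too_long = 'Read the <a href="http://example.com/">full story</a> here today';
$word_limit = 3;

// Split on whitespace unless only non-bracket characters separate it from a ">",
// which would mean the whitespace sits inside a tag.
$words = preg_split('/\s+(?![^<>]+>)/', $text_thats_too_long, -1, PREG_SPLIT_NO_EMPTY);
$truncated_text = implode(' ', array_slice($words, 0, $word_limit)) . '...';

echo $truncated_text, "\n"; // Read the <a href="http://example.com/">full...
```

The tag itself survives intact, although the anchor is still left unclosed, which is the separate issue raised earlier in the thread.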

Post by Robert K S »

Thank you, Jason!

The latter example was exactly what I was looking for.

And I had never heard of SimpleTest before. I didn't find its web page very clear in describing its function. If you don't mind my asking, how did you use it to discover the solution?

Post by sweatje »

Unit testing is writing code to test your code. I wrote a unit test to run a regex against your test data, and stated what I wanted to see as a result. Once this "harness" is in place, you can run the unit test over and over until you get the regex correct. Unit testing applies to all areas of your code, not just regular expressions. Going through the tutorials on either the simpletest.org or http://www.lastcraft.com sites is probably your best bet for getting a basic understanding of how to write unit tests.
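
The harness idea does not depend on any particular framework. As a minimal sketch in plain PHP (not SimpleTest; the input string and the list of candidate patterns are made up for illustration), the loop below states the expected result once and reports which candidate regex produces it:

```php
<?php
// Plain-PHP stand-in for a test harness; no framework required.
// Input and expected output are hypothetical test data.
$input = 'a <b class="x">bold move</b> now';
$expected = array('a', '<b class="x">bold', 'move</b>', 'now');

$candidates = array(
    '/\s+/',             // naive: also splits inside the tag
    '/\s+(?![^<>]+>)/',  // skips whitespace that sits inside a tag
);

$results = array();
foreach ($candidates as $pattern) {
    $actual = preg_split($pattern, $input, -1, PREG_SPLIT_NO_EMPTY);
    $results[$pattern] = ($actual === $expected);
    echo ($results[$pattern] ? 'PASS' : 'FAIL'), '  ', $pattern, "\n";
}
```

Each time a pattern is adjusted, rerunning the script shows immediately whether it now meets the stated expectation.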

Working on the assumption that testing your code is good, turning this practice up a notch means writing your tests before you write your code. This practice is called Test Driven Development or Test First Development. Googling for each of these phrases should give you lots of insight into this development methodology.

Last, there is a testing forum here on this site, which is a somewhat unique resource in the PHP world.

Regards,
Jason

Post by Robert K S »

So, was SimpleTest able to automate the derivation of the regexes you suggested based on the known input and desired output, or was it still a manual trial-and-error process for the coder?

Post by sweatje »

Robert K S wrote: So, was SimpleTest able to automate the derivation of the regexes you suggested based on the known input and desired output, or was it still a manual trial-and-error process for the coder?
I would prefer the wording "iterative process" ;)

Post by Robert K S »

I could have called it a heuristic process, but we're all friends here. :)

Post by bokehman »

sweatje wrote:

Code: Select all

<[^>]+>
The trouble with that pattern is that it will also match things that are not HTML tags.
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

bokehman wrote:
sweatje wrote:

Code: Select all

<[^>]+>
The trouble with that pattern is that it will also match things that are not HTML tags.
Only in malformed text and, in general, malformed tags too.
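
Both points can be seen in a small sketch (the sample sentences are invented for illustration): an unescaped `<` and `>` in ordinary text produce a false match, while the same text with entities produces none.

```php
<?php
$pattern = '/<[^>]+>/';

// Unescaped comparison operators in text: the pattern sees a "tag".
preg_match_all($pattern, 'if a < b and c > d then stop', $raw);
print_r($raw[0]);     // one false match: "< b and c >"

// The same text with the operators written as entities.
preg_match_all($pattern, 'if a &lt; b and c &gt; d then stop', $escaped);
print_r($escaped[0]); // no matches
```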