Making keyword rich, human readable URLs

Posted: Wed Sep 20, 2006 1:48 am
by s.dot
This short class turns names/titles into safe, keyword-rich URLs for use with Apache's mod_rewrite. It could be useful for blogs or forums.

Usage Examples

- Before - showforum.php?forumid=21
- After - forums/21/world-news-and-current-events/index.php

- Before - showthread.php?threadid=22107
- After - forums/view-topic/22107/yet-another-school-shooting.php

Sample Apache mod_rewrite Rule

Code: Select all

RewriteEngine On
RewriteRule ^forums/([0-9]+)/.+/index.html$ /forum.php?forumid=$1
I have adjusted this code and taken into considerations all posts, so this is the freshest code.

The Code

Code: Select all

<?php

/*
* This short class will turn user entered titles into URLs
* that are keyword rich and human readable.  For use with
* Apache's mod_rewrite.
*
* Author - scottayy@gmail.com
*/

class safeurl
{
	//decode html entities in string?
	//param boolean $decode
	var $decode = true;

	//charset to use if $decode is set to true
	//param string $decode_charset
	var $decode_charset = 'ISO-8859-1';

	//turns string into all lowercase letters
	//param boolean $lowercase
	var $lowercase = true;

	//strip out html tags from string?
	//param boolean $strip
	var $strip = true;

	//maximum length of resulting title
	//param int $maxlength
	var $maxlength = 50;
	
	//if maxlength is reached, chop at nearest whole word? or hard chop?
	//param boolean $whole_word
	var $whole_word = true;

	//what title to use if no alphanumeric characters can be found
	//param string $blank
	var $blank = 'no-title';

	//the worker function
	//param string $text
	function make_safe_url($text)
	{
		//prepare the string according to our options
		if($this->decode)
		{
			$text = html_entity_decode($text,ENT_QUOTES,$this->decode_charset);
		}

		if($this->lowercase)
		{
			$text = strtolower($text);
		}

		if($this->strip)
		{
			$text = strip_tags($text);
		}

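		//filter: keep only letters, digits, ampersands, underscores, whitespace and dashes
		//(the dash goes last in the character class so it is not read as a range)
		$text = preg_replace("/[^&a-z0-9_\s-]/i",'',$text);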
		$text = str_replace(array('&',' '),array(' and ','-'),trim($text));
		$text = preg_replace("/-{2,}/",'-',$text);

		//chop?
		if(strlen($text) > $this->maxlength)
		{
			$text = substr($text,0,$this->maxlength);
			
			if($this->whole_word)
			{
				//drop the trailing fragment, but only if another word is left to keep
				$words = explode('-',$text);
				if(count($words) > 1)
				{
					array_pop($words);
					$text = implode('-',$words);
				}
			}
		}

		//return =]
		if($text == '')
		{
			return $this->blank;
		}

		return $text;
	}

}

?>
Test 1

Code: Select all

$safeurl = new safeurl(); 

$tests = array( 
        'i\'m a test string!! do u like me. or not......., billy bob!!@#', 
        '<b>some HTML</b> in <i>here</i>!!~', 
        'i!@#*#@ l#*(*(#**$*o**(*^v^*(e d//////e\\\\\\\\v,,,,,,,,,,n%$#@!~e*(+=t',
        'A lOng String wiTh a buNchess of words thats! should be -chopped- at the last whole word'
); 

foreach($tests AS $test) 
{ 
        echo $safeurl->make_safe_url($test).'<br />'; 
}
Output 1

Code: Select all

im-a-test-string-do-u-like-me-or-not-billy-bob
some-html-in-here
i-love-devnet
a-long-string-with-a-bunchess-of-words-thats
We'll change a few properties of the object in the next test.

Test 2

Code: Select all

$safeurl = new safeurl(); 
$safeurl->lowercase = false;
$safeurl->whole_word = false;

$tests = array( 
        'i\'m a test string!! do u like me. or not......., billy bob!!@#', 
        '<b>some HTML</b> in <i>here</i>!!~', 
        'i!@#*#@ l#*(*(#**$*o**(*^v^*(e d//////e\\\\\\\\v,,,,,,,,,,n%$#@!~e*(+=t',
        'A lOng String wiTh a buNchess of words thats! should be -chopped- at the last whole word'
); 

foreach($tests AS $test) 
{ 
        echo $safeurl->make_safe_url($test).'<br />'; 
}
Output 2

Code: Select all

im-a-test-string-do-u-like-me-or-not-billy-bob
some-HTML-in-here
i-love-devnet
A-lOng-String-wiTh-a-buNchess-of-words-thats-shoul
Real World Project Usage

Code: Select all

echo '<a href="blog/12/'.$safeurl->make_safe_url($blog_title).'">'.$blog_title.'</a>';

Posted: Wed Sep 20, 2006 2:36 am
by nickvd

Code: Select all

$showPasses = true;
$reporter = ($showPasses)?'HTMLPassReporter':'HTMLReporter';

function cleanUrl($url) {
   // $url = preg_replace('/&/',' and ',$url);//switch &'s
   // $url = preg_replace('/\s{1,}/','-',$url);//switch spaces
   // $url = preg_replace('/[^a-z0-9-]/','',strtolower($url)); //remove the rest
   // return $url;
   
   // This works...
   // BUT... would this be more or less efficient? It's not all that hard to decipher,
   // although you have to read the logic in reverse order...
   return 
      preg_replace('/[^a-z0-9-]/','',
      preg_replace('/\s{1,}/','-',
      preg_replace('/&/',' and ',
      strtolower($url)
   )));
}

function make_safe_url($text,$decode=true,$lowercase=true)
{
        if($decode) $text = html_entity_decode($text,ENT_QUOTES);
        if($lowercase) $text = strtolower($text);
        $text = str_replace(array(' ',',','&'),array('-','','and'),$text);
        $text = preg_replace("/[^a-z0-9_-]/i",'',$text);       
        return $text;
} 

class cleanUrlTest extends UnitTestCase {
   function testUrlCleaner(){
      $test_case = cleanUrl('~~~HOW MUCH DO U LIKE\'S DA NFLZ & OR T3H P0K3R!?!?!?@@');
      $this->assertIdentical($test_case,'how-much-do-u-likes-da-nflz-and-or-t3h-p0k3r');
      
      $test_case = cleanUrl('~~~D#@)))E)()#@!(@@v#(  &  *#*N?!?!e?@t@');
      $this->assertIdentical($test_case,'dev-and-net');
      
      $test_case = cleanUrl('This is my url&it is great!');
      $this->assertIdentical($test_case,'this-is-my-url-and-it-is-great');
   }
}
class make_safe_urlTest extends UnitTestCase {
   function testMake_Safe_UrlCleaner(){
      $test_case = make_safe_url('~~~HOW MUCH DO U LIKE\'S DA NFLZ & OR T3H P0K3R!?!?!?@@');
      $this->assertIdentical($test_case,'how-much-do-u-likes-da-nflz-and-or-t3h-p0k3r');
      
      $test_case = make_safe_url('~~~D#@)))E)()#@!(@@v#(  &  *#*N?!?!e?@t@');
      $this->assertIdentical($test_case,'dev-and-net');
      
      $test_case = make_safe_url('This is my url&it is great!');
      $this->assertIdentical($test_case,'this-is-my-url-and-it-is-great');
   }
}

   $test = new cleanUrlTest();
   $test->run(new $reporter);
   
   $test = new make_safe_urlTest();
   $test->run(new $reporter);

Posted: Wed Sep 20, 2006 2:43 am
by nickvd
Produces:

Code: Select all

cleanUrlTest
Pass: testUrlCleaner->Identical expectation [String: how-much-do-u-likes-da-nflz-and-or-t3h-p0k3r]
Pass: testUrlCleaner->Identical expectation [String: dev-and-net]
Pass: testUrlCleaner->Identical expectation [String: this-is-my-url-and-it-is-great]
1/1 test cases complete: 3 passes, 0 fails and 0 exceptions.

make_safe_urlTest
Pass: testMake_Safe_UrlCleaner->Identical expectation [String: how-much-do-u-likes-da-nflz-and-or-t3h-p0k3r]
Fail: testMake_Safe_UrlCleaner -> Identical expectation [String: dev--and--net] fails with [String: dev-and-net] at character 4 with [dev--and--net] and [dev-and-net]
Fail: testMake_Safe_UrlCleaner -> Identical expectation [String: this-is-my-urlandit-is-great] fails with [String: this-is-my-url-and-it-is-great] at character 14 with [this-is-my-urlandit-is-great] and [this-is-my-url-and-it-is-great]
1/1 test cases complete: 1 passes, 2 fails and 0 exceptions.

Posted: Wed Sep 20, 2006 4:56 am
by s.dot
It fails on double dashes (--). But is that really a "fail"? I mean, the URL still works. It would be an easy fix. Just curious.

Posted: Wed Sep 20, 2006 7:07 am
by s.dot
I added a couple lines to the function and it now returns this, with your test strings.

Code: Select all

how-much-do-u-likes-da-nflz-and-or-t3h-p0k3r
dev-and-net
this-is-my-url-and-it-is-great
It could be done more gracefully with a good regex. But I'm not that good with them, and this works for my purpose. =]
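For what it's worth, the "good regex" version might look something like the sketch below. This is not the code from the first post: the function name collapse_to_dashes is made up for illustration, and it assumes plain ASCII input with no HTML tags or entities to deal with. Unlike the class, it also throws underscores away.

```php
<?php
// Hypothetical helper, not part of the safeurl class.
// Assumes plain ASCII text (no HTML tags or entities).
function collapse_to_dashes($text)
{
    // Spell out ampersands first so "a&b" becomes "a and b".
    $text = str_replace('&', ' and ', strtolower($text));

    // Drop punctuation: anything that is not a letter, digit, whitespace or dash.
    $text = preg_replace('/[^a-z0-9\s-]+/', '', $text);

    // Turn every run of whitespace and/or dashes into a single dash.
    $text = preg_replace('/[\s-]+/', '-', $text);

    // Remove any dash left at either end.
    return trim($text, '-');
}

$tests = array(
    '~~~HOW MUCH DO U LIKE\'S DA NFLZ & OR T3H P0K3R!?!?!?@@',
    '~~~D#@)))E)()#@!(@@v#(  &  *#*N?!?!e?@t@',
    'This is my url&it is great!',
);

foreach ($tests as $test) {
    echo collapse_to_dashes($test), "\n";
}
```

Because runs of spaces and dashes are collapsed in one step, the double-dash case never comes up.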

Posted: Wed Sep 20, 2006 7:51 am
by nickvd
Heh, yeah, it could. And no, the double dashes aren't really a deal breaker, more of a style/picky thing for me, I guess ;) I've just been on a unit testing kick lately, and this problem presented an easy way to implement them :)

Posted: Wed Sep 20, 2006 7:58 am
by s.dot
Good deal. And good testing. Many thanks! ;)

Posted: Wed Sep 20, 2006 8:07 am
by Jenk
What's wrong with urlencode() and urldecode()?

Posted: Wed Sep 20, 2006 8:19 am
by s.dot
Jenk wrote: What's wrong with urlencode() and urldecode()?
Pasting a blog url like /user/1113/blog/how *to do* this~~~ with weird marks!@@.html would be ugly urlencoded.

/user/1113/blog/how-to-do-this-with-weird-marks.html is prettier =] and keyword rich.
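To make the difference concrete, here is a small comparison. The slugify helper is a simplified stand-in written for this example (it is not the safeurl class), so treat it as a sketch only.

```php
<?php
$title = 'how *to do* this~~~ with weird marks!@@';

// urlencode() keeps every character, so punctuation turns into %XX escapes
// and spaces turn into plus signs.
$encoded = urlencode($title);

// Hypothetical stand-in for the class in the first post: throw the
// punctuation away instead of escaping it.
function slugify($text)
{
    $text = preg_replace('/[^a-z0-9\s-]/', '', strtolower($text));
    return trim(preg_replace('/[\s-]+/', '-', $text), '-');
}

echo $encoded, "\n";
echo slugify($title), "\n";   // how-to-do-this-with-weird-marks
```

The trade-off is that the cleaned-up title is lossy: you can't get the original title back from it the way urldecode() reverses urlencode(), which is why the URLs here also carry the numeric ID.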

Posted: Wed Sep 20, 2006 7:22 pm
by Ambush Commander
Hmm, that looks pretty cool. How does it deal with dupes? The decoding needs a charset passed to it, otherwise it's pretty useless. I'm also wondering how it would handle HTML tags passed to it.

Posted: Wed Sep 20, 2006 11:04 pm
by nickvd
Ambush Commander wrote: I'm also wondering how it would handle HTML tags passed to it.
I added another test...

Code: Select all

$test_case = make_safe_url('this text is not <b>bold</b>&<span id="sadflksdjk"><em>this text is emphasized</em></span>');
      $this->assertIdentical($test_case,'this-text-is-not-bboldb-and-span-idsadflksdjkemthis-text-is-emphasizedemspan');
      /*Pass: testMake_Safe_UrlCleaner->Identical expectation [String: this-text-is-not-bboldb-and-span-idsadflksdjkemthis-text-is-emphasizedemspan]*/

Posted: Thu Sep 21, 2006 5:50 am
by Ambush Commander
I would presume <b>bold</b> should turn into bold not bboldb...

Posted: Thu Sep 21, 2006 8:23 am
by s.dot
I updated the function based on your remarks, AC. The test string above now turns into:

Code: Select all

this-text-is-not-bold-and-this-text-is-emphasized
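The change boils down to calling strip_tags() before the character filter runs. Here is a cut-down sketch of that ordering; the function name title_from_html is made up for this example, it skips the entity decoding and length chopping, and the real code is the class in the first post.

```php
<?php
// Hypothetical, cut-down version of the pipeline; not the full class.
function title_from_html($text, $strip = true)
{
    $text = strtolower($text);

    if ($strip) {
        // Remove the tags while the angle brackets are still there to find.
        $text = strip_tags($text);
    }

    // Same filter steps as the class: keep safe characters, spell out
    // ampersands, turn spaces into dashes, collapse repeated dashes.
    $text = preg_replace('/[^&a-z0-9_\s-]/', '', $text);
    $text = str_replace(array('&', ' '), array(' and ', '-'), trim($text));
    return preg_replace('/-{2,}/', '-', $text);
}

$html = 'this text is not <b>bold</b>&<span id="sadflksdjk"><em>this text is emphasized</em></span>';

echo title_from_html($html), "\n";        // this-text-is-not-bold-and-this-text-is-emphasized
echo title_from_html($html, false), "\n"; // tag names leak into the result
```

With stripping turned off, the filter only deletes the angle brackets, so the tag names end up glued to the words, which is where "bboldb" came from.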

Posted: Thu Sep 21, 2006 9:19 pm
by Ambush Commander
Actually, I'm going to take back what I just said.

It appears to me that there would be two uses for this function: one for parsing a title that was passed via a separate text input field, and the other for parsing a full HTML document for the content between h1 tags. The first case, arguably the more common one, is plaintext input. The second case is HTML input.

Perhaps we should simplify the API parameters.

Also, the function isn't very i18n friendly. There's no easy way to get around this.
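One partial workaround, sketched below, would be to map accented Latin letters to plain ASCII before the filter runs. The map and the function name ascii_fold are invented for this example, the table is nowhere near complete (lowercase only, a handful of letters), and it assumes the input is UTF-8. Titles in scripts like Cyrillic or CJK would still come out empty.

```php
<?php
// Hypothetical helper: a tiny, incomplete transliteration table.
// Assumes $text is UTF-8 encoded and that this file is saved as UTF-8.
function ascii_fold($text)
{
    $map = array(
        'à' => 'a', 'á' => 'a', 'â' => 'a', 'ä' => 'ae',
        'ç' => 'c',
        'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
        'î' => 'i', 'ï' => 'i',
        'ñ' => 'n',
        'ô' => 'o', 'ö' => 'oe',
        'ù' => 'u', 'û' => 'u', 'ü' => 'ue',
        'ß' => 'ss',
    );

    // strtr() with an array replaces whole byte sequences, so multi-byte
    // UTF-8 characters are handled without the mbstring extension.
    return strtr($text, $map);
}

echo ascii_fold('crème brûlée'), "\n";   // creme brulee
echo ascii_fold('straße'), "\n";         // strasse
```

The folded string could then be passed on to make_safe_url() as usual.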

Posted: Tue Oct 03, 2006 6:59 pm
by s.dot
Seems I have run across a problem. If a user makes a title that is all weird characters like "!!!!!!!!!!!", this function returns nothing. What's an appropriate title to use in that case? no-title.html?

PS: This should be a snippet!