Making keyword rich, human readable URLs

s.dot
Tranquility In Moderation
Posts: 5001
Joined: Sun Feb 06, 2005 7:18 pm
Location: Indiana

Post by s.dot »

This short class will turn names/titles into safe, keyword-rich URLs for use with Apache's mod_rewrite. It could be useful for blogs or forums.

Usage Examples

- Before - showforum.php?forumid=21
- After - forums/21/world-news-and-current-events/index.php

- Before - showthread.php?threadid=22107
- After - forums/view-topic/22107/yet-another-school-shooting.php

Sample Apache mod_rewrite Rule

Code: Select all

RewriteEngine On
RewriteRule ^forums/([0-9]+)/.+/index\.php$ /showforum.php?forumid=$1 [L]
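The thread URL from the usage examples would need a rule of its own. A minimal sketch, assuming the rules live in an .htaccess file in the document root and that showthread.php is the script from the "Before" example. Only the numeric ID is captured, so the keyword part of the URL is purely decorative:

```apache
RewriteEngine On
# forums/view-topic/22107/yet-another-school-shooting.php -> showthread.php?threadid=22107
RewriteRule ^forums/view-topic/([0-9]+)/[^/]+\.php$ /showthread.php?threadid=$1 [L]
```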
I have adjusted this code to take all of the posts below into consideration, so this is the freshest version.

The Code

Code: Select all

<?php

/*
* This short class will turn user entered titles into URLs
* that are keyword rich and human readable.  For use with
* Apache's mod_rewrite.
*
* Author - scottayy@gmail.com
*/

class safeurl
{
	//decode html entities in string?
	//param boolean $decode
	var $decode = true;

	//charset to use if $decode is set to true
	//param string $decode_charset
	var $decode_charset = 'ISO-8859-1';

	//turns string into all lowercase letters
	//param boolean $lowercase
	var $lowercase = true;

	//strip out html tags from string?
	//param boolean $strip
	var $strip = true;

	//maximum length of resulting title
	//param int $maxlength
	var $maxlength = 50;
	
	//if maxlength is reached, chop at nearest whole word? or hard chop?
	//param boolean $whole_word
	var $whole_word = true;

	//what title to use if no alphanumeric characters can be found
	//param string $blank
	var $blank = 'no-title';

	//the worker function
	//param string $text
	function make_safe_url($text)
	{
		//prepare the string according to our options
		if($this->decode)
		{
			$text = html_entity_decode($text,ENT_QUOTES,$this->decode_charset);
		}

		if($this->lowercase)
		{
			$text = strtolower($text);
		}

		if($this->strip)
		{
			$text = strip_tags($text);
		}

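		//filter: keep letters, digits, underscores, dashes, whitespace and "&"
		//(the dash goes last in the class so it is always a literal; newer
		//PCRE versions can reject "_-\s" as an invalid range)
		$text = preg_replace('/[^&a-z0-9_\s-]/i','',$text);
		$text = trim(str_replace('&',' and ',$text));
		//collapse every run of whitespace and dashes into a single dash
		$text = preg_replace('/[\s-]+/','-',$text);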

		//chop?
		if(strlen($text) > $this->maxlength)
		{
			$text = substr($text,0,$this->maxlength);
			
			if($this->whole_word)
			{
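				//drop the last, possibly cut-off, word (array_diff() here would
				//also remove every earlier word equal to it)
				$text = explode('-',$text);
				if(count($text) > 1)
				{
					array_pop($text);
				}
				$text = implode('-',$text);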
			}
		}

		//return =]
		if($text == '')
		{
			return $this->blank;
		}

		return $text;
	}

}

?>
Test 1

Code: Select all

$safeurl = new safeurl(); 

$tests = array( 
        'i\'m a test string!! do u like me. or not......., billy bob!!@#', 
        '<b>some HTML</b> in <i>here</i>!!~', 
        'i!@#*#@ l#*(*(#**$*o**(*^v^*(e d//////e\\\\\\\\v,,,,,,,,,,n%$#@!~e*(+=t',
        'A lOng String wiTh a buNchess of words thats! should be -chopped- at the last whole word'
); 

foreach($tests AS $test) 
{ 
        echo $safeurl->make_safe_url($test).'<br />'; 
}
Output 1

Code: Select all

im-a-test-string-do-u-like-me-or-not-billy-bob
some-html-in-here
i-love-devnet
a-long-string-with-a-bunchess-of-words-thats
We'll change a few properties of the object in this test.

Test 2

Code: Select all

$safeurl = new safeurl(); 
$safeurl->lowercase = false;
$safeurl->whole_word = false;

$tests = array( 
        'i\'m a test string!! do u like me. or not......., billy bob!!@#', 
        '<b>some HTML</b> in <i>here</i>!!~', 
        'i!@#*#@ l#*(*(#**$*o**(*^v^*(e d//////e\\\\\\\\v,,,,,,,,,,n%$#@!~e*(+=t',
        'A lOng String wiTh a buNchess of words thats! should be -chopped- at the last whole word'
); 

foreach($tests AS $test) 
{ 
        echo $safeurl->make_safe_url($test).'<br />'; 
}
Output 2

Code: Select all

im-a-test-string-do-u-like-me-or-not-billy-bob
some-HTML-in-here
i-love-devnet
A-lOng-String-wiTh-a-buNchess-of-words-thats-shoul
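The whole-word chop seen in the two outputs can also be sketched with strrpos() instead of explode()/implode(). This is only an illustration: chop_slug_sketch() is a made-up name, not part of the class, and it assumes the text has already been filtered into a dash-separated string.

```php
<?php
// Hypothetical helper: cut $slug to at most $max characters,
// backing up to the previous dash when the cut lands inside a word.
function chop_slug_sketch($slug, $max = 50, $whole_word = true)
{
    if (strlen($slug) <= $max) {
        return $slug;
    }

    $cut = substr($slug, 0, $max);

    // Only back up if the character right after the cut is not a dash,
    // i.e. the cut really did split a word.
    if ($whole_word && $slug[$max] !== '-') {
        $last_dash = strrpos($cut, '-');
        if ($last_dash !== false) {
            $cut = substr($cut, 0, $last_dash);
        }
    }

    // Never leave a dangling dash at the end.
    return rtrim($cut, '-');
}

$slug = 'a-long-string-with-a-bunchess-of-words-thats-should-be-chopped-at-the-last-whole-word';

echo chop_slug_sketch($slug), "\n";            // whole-word chop
echo chop_slug_sketch($slug, 50, false), "\n"; // hard chop
```

One difference from the class: this sketch keeps the last word when the cut happens to fall exactly on a word boundary, whereas the class always drops the final fragment.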
Real World Project Usage

Code: Select all

echo '<a href="blog/12/'.$safeurl->make_safe_url($blog_title).'">'.$blog_title.'</a>';
Last edited by s.dot on Thu Oct 05, 2006 7:50 pm, edited 25 times in total.
Set Search Time - A google chrome extension. When you search only results from the past year (or set time period) are displayed. Helps tremendously when using new technologies to avoid outdated results.
nickvd
DevNet Resident
Posts: 1027
Joined: Thu Mar 10, 2005 5:27 pm
Location: Southern Ontario

Post by nickvd »

Code: Select all

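// UnitTestCase and HTMLReporter come from SimpleTest, which must be included first;
// HTMLPassReporter is not defined in this post.
$showPasses = true;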
$reporter = ($showPasses)?'HTMLPassReporter':'HTMLReporter';

function cleanUrl($url) {
   // $url = preg_replace('/&/',' and ',$url);//switch &'s
   // $url = preg_replace('/\s{1,}/','-',$url);//switch spaces
   // $url = preg_replace('/[^a-z0-9-]/','',strtolower($url)); //remove the rest
   // return $url;
   
   // This works...
   // BUT... would this be more or less efficient? It's not all that hard to decipher,
   // although you have to read the logic in reverse order...
   return 
      preg_replace('/[^a-z0-9-]/','',
      preg_replace('/\s{1,}/','-',
      preg_replace('/&/',' and ',
      strtolower($url)
   )));
}

function make_safe_url($text,$decode=true,$lowercase=true)
{
        if($decode) $text = html_entity_decode($text,ENT_QUOTES);
        if($lowercase) $text = strtolower($text);
        $text = str_replace(array(' ',',','&'),array('-','','and'),$text);
        $text = preg_replace("/[^a-z0-9_-]/i",'',$text);       
        return $text;
} 

class cleanUrlTest extends UnitTestCase {
   function testUrlCleaner(){
      $test_case = cleanUrl('~~~HOW MUCH DO U LIKE\'S DA NFLZ & OR T3H P0K3R!?!?!?@@');
      $this->assertIdentical($test_case,'how-much-do-u-likes-da-nflz-and-or-t3h-p0k3r');
      
      $test_case = cleanUrl('~~~D#@)))E)()#@!(@@v#(  &  *#*N?!?!e?@t@');
      $this->assertIdentical($test_case,'dev-and-net');
      
      $test_case = cleanUrl('This is my url&it is great!');
      $this->assertIdentical($test_case,'this-is-my-url-and-it-is-great');
   }
}
class make_safe_urlTest extends UnitTestCase {
   function testMake_Safe_UrlCleaner(){
      $test_case = make_safe_url('~~~HOW MUCH DO U LIKE\'S DA NFLZ & OR T3H P0K3R!?!?!?@@');
      $this->assertIdentical($test_case,'how-much-do-u-likes-da-nflz-and-or-t3h-p0k3r');
      
      $test_case = make_safe_url('~~~D#@)))E)()#@!(@@v#(  &  *#*N?!?!e?@t@');
      $this->assertIdentical($test_case,'dev-and-net');
      
      $test_case = make_safe_url('This is my url&it is great!');
      $this->assertIdentical($test_case,'this-is-my-url-and-it-is-great');
   }
}

   $test = new cleanUrlTest();
   $test->run(new $reporter);
   
   $test = new make_safe_urlTest();
   $test->run(new $reporter);
Last edited by nickvd on Wed Sep 20, 2006 2:47 am, edited 2 times in total.
nickvd
DevNet Resident
Posts: 1027
Joined: Thu Mar 10, 2005 5:27 pm
Location: Southern Ontario

Post by nickvd »

Produces:

Code: Select all

cleanUrlTest
Pass: testUrlCleaner->Identical expectation [String: how-much-do-u-likes-da-nflz-and-or-t3h-p0k3r]
Pass: testUrlCleaner->Identical expectation [String: dev-and-net]
Pass: testUrlCleaner->Identical expectation [String: this-is-my-url-and-it-is-great]
1/1 test cases complete: 3 passes, 0 fails and 0 exceptions.

make_safe_urlTest
Pass: testMake_Safe_UrlCleaner->Identical expectation [String: how-much-do-u-likes-da-nflz-and-or-t3h-p0k3r]
Fail: testMake_Safe_UrlCleaner -> Identical expectation [String: dev--and--net] fails with [String: dev-and-net] at character 4 with [dev--and--net] and [dev-and-net]
Fail: testMake_Safe_UrlCleaner -> Identical expectation [String: this-is-my-urlandit-is-great] fails with [String: this-is-my-url-and-it-is-great] at character 14 with [this-is-my-urlandit-is-great] and [this-is-my-url-and-it-is-great]
1/1 test cases complete: 1 passes, 2 fails and 0 exceptions.
s.dot
Tranquility In Moderation
Posts: 5001
Joined: Sun Feb 06, 2005 7:18 pm
Location: Indiana

Post by s.dot »

It fails on the double dashes (--). But is that really a "fail"? I mean, the URL still works. It would be an easy fix. Just curious.
s.dot
Tranquility In Moderation
Posts: 5001
Joined: Sun Feb 06, 2005 7:18 pm
Location: Indiana

Post by s.dot »

I added a couple of lines to the function, and with your test strings it now returns this:

Code: Select all

how-much-do-u-likes-da-nflz-and-or-t3h-p0k3r
dev-and-net
this-is-my-url-and-it-is-great
It could be done more gracefully with a good regex. But I'm not that good with them, and this works for my purpose. =]
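For what it's worth, a more regex-driven version might look something like the sketch below. slug_sketch() is a made-up name rather than the class's method, and it assumes ISO-8859-1 input, as the class defaults do. Punctuation is deleted outright so that letters separated by symbols rejoin, and every run of whitespace and dashes then collapses into a single dash.

```php
<?php
// Hypothetical helper, not the author's function.
function slug_sketch($text)
{
    $text = strtolower(html_entity_decode($text, ENT_QUOTES, 'ISO-8859-1'));

    // Give "&" its own word so "url&it" becomes "url and it".
    $text = str_replace('&', ' and ', $text);

    // Delete everything except letters, digits, whitespace and dashes.
    $text = preg_replace('/[^a-z0-9\s-]/', '', $text);

    // Collapse runs of whitespace/dashes into one dash and trim the ends.
    return trim(preg_replace('/[\s-]+/', '-', $text), '-');
}

echo slug_sketch('~~~D#@)))E)()#@!(@@v#(  &  *#*N?!?!e?@t@'), "\n";
echo slug_sketch('This is my url&it is great!'), "\n";
```

Because the dash-collapsing step runs last, the double dashes never reach the result.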
nickvd
DevNet Resident
Posts: 1027
Joined: Thu Mar 10, 2005 5:27 pm
Location: Southern Ontario

Post by nickvd »

Heh, yeah, it could (and no, the double dashes aren't really a deal breaker, more of a style/picky thing for me, I guess ;). I've just been on a unit testing kick lately, and this problem presented an easy way to try it out :)
s.dot
Tranquility In Moderation
Posts: 5001
Joined: Sun Feb 06, 2005 7:18 pm
Location: Indiana

Post by s.dot »

Good deal. And good testing. Many thanks! ;)
Jenk
DevNet Master
Posts: 3587
Joined: Mon Sep 19, 2005 6:24 am
Location: London

Post by Jenk »

What's wrong with urlencode() and urldecode()?
s.dot
Tranquility In Moderation
Posts: 5001
Joined: Sun Feb 06, 2005 7:18 pm
Location: Indiana

Post by s.dot »

Jenk wrote: What's wrong with urlencode() and urldecode()?
Pasting a blog URL like /user/1113/blog/how *to do* this~~~ with weird marks!@@.html would be ugly once urlencoded.

/user/1113/blog/how-to-do-this-with-weird-marks.html is prettier =] and keyword rich.
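To make the comparison concrete, here is a small sketch that runs the same title through urlencode() and through a stripped-down stand-in for make_safe_url() (the one-line filter below is an assumption for illustration, not the class itself):

```php
<?php
$title = 'how *to do* this~~~ with weird marks!@@';

// urlencode() keeps every character, so the punctuation survives as %XX escapes.
$encoded = urlencode($title);

// Simplified stand-in for the class: delete punctuation, then dash the spaces.
$slug = preg_replace('/\s+/', '-', trim(preg_replace('/[^a-z0-9\s-]/i', '', $title)));

echo $encoded, "\n"; // how+%2Ato+do%2A+this%7E%7E%7E+with+weird+marks%21%40%40
echo $slug, "\n";    // how-to-do-this-with-weird-marks
```

The trade-off is that urlencode() is reversible with urldecode(), while the slug throws information away, which is why the URLs in this thread also carry a numeric ID.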
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

Hmm, that looks pretty cool. How does it deal with dupes? The decoding needs a charset passed to it; otherwise it's pretty useless. I'm also wondering how it would handle HTML tags passed to it.
nickvd
DevNet Resident
Posts: 1027
Joined: Thu Mar 10, 2005 5:27 pm
Location: Southern Ontario

Post by nickvd »

Ambush Commander wrote: I'm also wondering how it would handle HTML tags passed to it.
I added another test...

Code: Select all

$test_case = make_safe_url('this text is not <b>bold</b>&<span id="sadflksdjk"><em>this text is emphasized</em></span>');
      $this->assertIdentical($test_case,'this-text-is-not-bboldb-and-span-idsadflksdjkemthis-text-is-emphasizedemspan');
      /*Pass: testMake_Safe_UrlCleaner->Identical expectation [String: this-text-is-not-bboldb-and-span-idsadflksdjkemthis-text-is-emphasizedemspan]*/
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

I would presume <b>bold</b> should turn into bold not bboldb...
s.dot
Tranquility In Moderation
Posts: 5001
Joined: Sun Feb 06, 2005 7:18 pm
Location: Indiana

Post by s.dot »

I updated the function based on your remarks, AC. Using the test string above, it now returns:

Code: Select all

this-text-is-not-bold-and-this-text-is-emphasized
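The difference comes down to running strip_tags() before the character filter. A minimal sketch of both behaviours, where filter_sketch() is a hypothetical stand-in for the filtering steps rather than the actual method:

```php
<?php
$html = 'this text is not <b>bold</b>&<span id="sadflksdjk"><em>this text is emphasized</em></span>';

// Hypothetical stand-in for the filtering part of make_safe_url().
function filter_sketch($text)
{
    $text = str_replace('&', ' and ', strtolower($text));
    $text = preg_replace('/[^a-z0-9\s-]/', '', $text);
    return trim(preg_replace('/[\s-]+/', '-', $text), '-');
}

// Without strip_tags() only the < > = " / characters are removed,
// so tag and attribute names leak into the slug.
echo filter_sketch($html), "\n";

// With strip_tags() the markup is gone before filtering starts.
echo filter_sketch(strip_tags($html)), "\n";
```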
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

Actually, I'm going to take back what I just said.

It appears to me that there would be two uses for this function: one for parsing a title that was passed via a separate text input field, and the other for parsing a full HTML document for the content between h1 tags. The first case, arguably the more common one, is plaintext input. The second case is HTML input.

Perhaps we should simplify the API parameters.

Also, the function isn't very i18n friendly. There's no really easy way to get around this.
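One partial workaround is to fold accented letters to ASCII before the filter runs, so they are not simply deleted. The sketch below is only that: fold_accents_sketch() and its tiny map are made up for illustration, it covers lowercase letters only, and it assumes UTF-8 input (the class defaults to ISO-8859-1, so the charset option would have to change too). It does nothing for non-Latin scripts.

```php
<?php
// Hypothetical helper with a deliberately tiny map; real coverage needs far more entries.
// Assumes both $text and this source file are UTF-8.
function fold_accents_sketch($text)
{
    $map = array(
        'à' => 'a',  'á' => 'a', 'â' => 'a',  'ç' => 'c',
        'è' => 'e',  'é' => 'e', 'ê' => 'e',  'ñ' => 'n',
        'ö' => 'oe', 'û' => 'u', 'ü' => 'ue', 'ß' => 'ss',
    );

    // strtr() with an array replaces whole byte sequences, so multi-byte keys are safe.
    return strtr($text, $map);
}

echo fold_accents_sketch('crème brûlée'), "\n"; // creme brulee
echo fold_accents_sketch('größe'), "\n";        // groesse
```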
s.dot
Tranquility In Moderation
Posts: 5001
Joined: Sun Feb 06, 2005 7:18 pm
Location: Indiana

Post by s.dot »

Seems I have run across a problem. If a user makes a title that is all weird characters, like "!!!!!!!!!!!", this function returns nothing. What's an appropriate title to use in that case? no-title.html?

PS: This should be a snippet!