Making Keyword Rich, Human Readable URLs

s.dot
Tranquility In Moderation
Posts: 5001
Joined: Sun Feb 06, 2005 7:18 pm
Location: Indiana

Making Keyword Rich, Human Readable URLs

Post by s.dot »

This short class turns names/titles into safe, keyword-rich URLs for use with Apache's mod_rewrite. It could be useful for blogs or forums.

Usage Examples

- Before - showforum.php?forumid=21
- After - forums/21/world-news-and-current-events/index.php

- Before - showthread.php?threadid=22107
- After - forums/view-topic/22107/yet-another-school-shooting.php

Sample Apache mod_rewrite Rule

Code: Select all

 
RewriteEngine On
RewriteRule ^forums/([0-9]+)/.+/index\.php$ /showforum.php?forumid=$1 [L]
 
I have adjusted this code to take all of the posts in this thread into consideration, so this is the most current version.

The Code

Code: Select all

 
<?php
 
/*
* This short class will turn user entered titles into URLs
* that are keyword rich and human readable.  For use with
* Apache's mod rewrite.
*
* Author - scottayy@gmail.com
*/
 
class safeurl
{
    //decode html entities in string?
    //param boolean $decode
    var $decode = true;
 
    //charset to use if $decode is set to true
    //param string $decode_charset
    var $decode_charset = 'ISO-8859-1';
 
    //turns string into all lowercase letters
    //param boolean $lowercase
    var $lowercase = true;
 
    //strip out html tags from string?
    //param boolean $strip
    var $strip = true;
 
    //maximum length of resulting title
    //param int $maxlength
    var $maxlength = 50;
    
    //if maxlength is reached, chop at nearest whole word? or hard chop?
    //param boolean $whole_word
    var $whole_word = true;
 
    //what title to use if no alphanumeric characters can be found
    //param string $blank
    var $blank = 'no-title';
 
    //the worker function
    //param string $text
    function make_safe_url($text)
    {
        //prepare the string according to our options
        if($this->decode)
        {
            $text = html_entity_decode($text,ENT_QUOTES,$this->decode_charset);
        }
 
        if($this->lowercase)
        {
            $text = strtolower($text);
        }
 
        if($this->strip)
        {
            $text = strip_tags($text);
        }
 
        //filter
        //the hyphen sits last in the class so it is a literal, not part of a range
        $text = preg_replace("/[^&a-z0-9_\s'-]/i",'',$text);
        $text = str_replace(array('&',' ','\''),array(' and ','-',''),$text);
        $text = trim(preg_replace("/-{2,}/",'-',$text), "-");
 
        //chop?
        if(strlen($text) > $this->maxlength)
        {
            $text = substr($text,0,$this->maxlength);
            
            if($this->whole_word)
            {
                //drop only the trailing (possibly partial) word
                $text = explode('-',$text);
                array_pop($text);
                $text = implode('-',$text);
            }
        }
 
        //return =]
        if($text == '')
        {
            return $this->blank;
        }
 
        return $text;
    }
 
}
 
?>
 
 
Test 1
 

Code: Select all

 
$safeurl = new safeurl(); 
 
$tests = array( 
        'i\'m a test string!! do u like me. or not......., billy bob!!@#', 
        '<b>some HTML</b> in <i>here</i>!!~', 
        'i!@#*#@ l#*(*(#**$*o**(*^v^*(e d//////e\\\\\\\\v,,,,,,,,,,n%$#@!~e*(+=t',
        'A lOng String wiTh a buNchess of words thats! should be -chopped- at the last whole word'
); 
 
foreach($tests AS $test) 
{ 
        echo $safeurl->make_safe_url($test).'<br />'; 
}
 
Output 1

Code: Select all

im-a-test-string-do-u-like-me-or-not-billy-bob
some-html-in-here
i-love-devnet
a-long-string-with-a-bunchess-of-words-thats
 
We'll change a few properties of the object in the test.
 
Test 2

Code: Select all

$safeurl = new safeurl(); 
$safeurl->lowercase = false;
$safeurl->whole_word = false;
 
$tests = array( 
        'i\'m a test string!! do u like me. or not......., billy bob!!@#', 
        '<b>some HTML</b> in <i>here</i>!!~', 
        'i!@#*#@ l#*(*(#**$*o**(*^v^*(e d//////e\\\\\\\\v,,,,,,,,,,n%$#@!~e*(+=t',
        'A lOng String wiTh a buNchess of words thats! should be -chopped- at the last whole word'
); 
 
foreach($tests AS $test) 
{ 
        echo $safeurl->make_safe_url($test).'<br />'; 
}
 
Output 2

Code: Select all

im-a-test-string-do-u-like-me-or-not-billy-bob
some-HTML-in-here
i-love-devnet
A-lOng-String-wiTh-a-buNchess-of-words-thats-shoul
 
Real World Project Usage

Code: Select all

 
echo '<a href="blog/12/'.$safeurl->make_safe_url($blog_title).'">'.$blog_title.'</a>';
 
 
Set Search Time - A google chrome extension. When you search only results from the past year (or set time period) are displayed. Helps tremendously when using new technologies to avoid outdated results.
s.dot
Tranquility In Moderation
Posts: 5001
Joined: Sun Feb 06, 2005 7:18 pm
Location: Indiana

Post by s.dot »

I updated the above code because there were two problems that were bugging me.

If the string "what are you doing today/tomorrow" were passed to it, it'd come out as 'what-are-you-doing-todaytomorrow'. So I made all non-alphanumeric characters (except ') be replaced with a -. So now that string would come out correctly.

Also, if the string "-my day today-" were passed to it, it'd come out as '-my-day-today-' which is ugly with the leading and trailing -'s. So I trimmed those.
JayBird
Admin
Posts: 4524
Joined: Wed Aug 13, 2003 7:02 am
Location: York, UK

Re: Making Keyword Rich, Human Readable URLs

Post by JayBird »

"Design/Project Engineer" still comes out as "designproject-engineer" for me 8O
s.dot
Tranquility In Moderation
Posts: 5001
Joined: Sun Feb 06, 2005 7:18 pm
Location: Indiana

Re: Making Keyword Rich, Human Readable URLs

Post by s.dot »

That is really weird. Why isn't / being caught by this regexp?

Code: Select all

$text = preg_replace('#[^&a-z0-9\s\']#', '', $text);
I can't figure it out, because if you do "Design / Project Engineer" it works. Or even "Design/ Project Engineer".
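
One possible explanation, offered as a sketch rather than a confirmed diagnosis: the pattern does match the slash, but the replacement is an empty string, so the slash is deleted and the two neighbouring words fuse. With a space on either side the words stay apart only because the space itself survives the filter. The helper `slugify_words()` below is hypothetical (it is not part of the class above); it swaps each run of disallowed characters for a hyphen instead of deleting it.

```php
<?php
// Hypothetical helper for illustration; not part of the safeurl class.
function slugify_words($text)
{
    $text = strtolower($text);
    // Drop apostrophes first so "i'm" becomes "im" rather than "i-m".
    $text = str_replace("'", '', $text);
    // Replace each run of anything that is not a letter or digit with one hyphen.
    $text = preg_replace('/[^a-z0-9]+/', '-', $text);
    // Remove leading and trailing hyphens.
    return trim($text, '-');
}

// The filter quoted above deletes the slash, fusing the two words.
echo preg_replace('#[^&a-z0-9\s\']#', '', 'design/project engineer'), "\n";
// designproject engineer

// Replacing instead of deleting keeps the word boundary.
echo slugify_words('Design/Project Engineer'), "\n"; // design-project-engineer
echo slugify_words('-my day today-'), "\n";          // my-day-today
```

The trade-off is that symbols inside a word now split it: the "i love devnet" test string from the first post would come out as i-l-o-v-e-d-e-v-n-e-t with this approach.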
JayBird
Admin
Posts: 4524
Joined: Wed Aug 13, 2003 7:02 am
Location: York, UK

Re: Making Keyword Rich, Human Readable URLs

Post by JayBird »

As a temporary measure, I have just done this:

Code: Select all

 
 //filter
$text = str_replace("/","-",$text);
$text = preg_replace("/[^&a-z0-9_\s'-]/i",'',$text);
$text = str_replace(array('&',' ','\''),array(' and ','-',''),$text);
$text = trim(preg_replace("/-{2,}/",'-',$text), "-");
 
thisismyurl
Forum Newbie
Posts: 15
Joined: Wed Dec 03, 2008 8:00 am

Re: Making Keyword Rich, Human Readable URLs

Post by thisismyurl »

I absolutely love temporary measures. To be honest, I can't see what's wrong with the regex either, but I just wanted to say thanks for posting the code. I'm a coder, but on the usability junkie / web marketing side of things (yes, you can boo now), so it's always refreshing to see code samples like this for non-marketing-orientated web developers.

This week I was in a meeting with a client I consult for. They had a great web site built for them by a really strong technical developer, who took the time to do some search engine optimization work on the site (at their request) so that the listings appeared as sub-pages automatically. But he simply refused to understand how or why making the URLs keyword rich would matter.

As a result, they have twenty thousand pages such as:

Code: Select all

domain.com/?p=1
domain.com/?p=2
...
domain.com/?p=20000
Instead of:

Code: Select all

domain.com/dating/new-york/albany/
This simple piece of code would have been invaluable to the client, and would have saved the developer a lot of frustrated emails from non-technical, marketing-orientated clients.

Thanks again.
timemachine3030
Forum Newbie
Posts: 2
Joined: Sat Apr 17, 2010 11:21 am

Re: Making Keyword Rich, Human Readable URLs

Post by timemachine3030 »

Hi there, using this code in one of my projects. Thank you for sharing! Here are the updates that I have made. I converted it to my local coding standard, sorry that I made diffing a hassle.

Added Features:
* Added a translation table for non-ascii characters.
* Fixed a bug where low values for maxlength obliterated the string.
* $this->separator (defaults to hyphen) is used to separate words. I need to have underscores in one project I'm on and hyphens in another.

Code: Select all

<?php

/**
 * This short class will turn user entered titles into URLs
 * that are keyword rich and human readable.  For use with
 * Apache's mod rewrite.
 *
 * @author scottayy@gmail.com
 * @author $Author: $
 *
 */
class SafeUrl {
    /**
     * decode html entities in string?
     * @var boolean
     */
    var $decode = true;
    /**
     * charset to use if $decode is set to true
     * @var string
     */
    var $decode_charset = 'UTF-8';
    /**
     * turns string into all lowercase letters
     * @var boolean
     */
    var $lowercase = true;
    /**
     * strip out html tags from string?
     * @var boolean
     */
    var $strip = true;
    /**
     * maximum length of resulting title
     * @var int
     */
    var $maxlength = 50;
    /**
     * if maxlength is reached, chop at nearest whole word? or hard chop?
     * @var boolean
     */
    var $whole_word = true;
    /**
     * what title to use if no alphanumeric characters can be found
     * @var string
     */
    var $blank = 'no-title';
    /**
     * Allow a different character to be used as the separator.
     * @var string
     */
    var $separator = '-';
    /**
     * A table of UTF-8 characters and what to make them.
     * @link http://www.php.net/manual/en/function.strtr.php#90925
     * @var array
     */
    var $translation_table = array(
        'Š'=>'S', 'š'=>'s', 'Đ'=>'Dj','Ð'=>'Dj','đ'=>'dj', 'Ž'=>'Z', 'ž'=>'z', 'Č'=>'C', 'č'=>'c', 'Ć'=>'C', 'ć'=>'c',
        'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
        'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O',
        'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U', 'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss',
        'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c', 'è'=>'e', 'é'=>'e',
        'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o',
        'ô'=>'o', 'õ'=>'o', 'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ü'=>'u', 'ý'=>'y', 'þ'=>'b',
        'ÿ'=>'y', 'Ŕ'=>'R', 'ŕ'=>'r',
        /**
         * Special characters:
         */
        "'"    => '',       // Single quote
        '&'    => ' and ',  // Ampersand
        "\r\n" => ' ',      // Newline
        "\n"   => ' '       // Newline

    );

    /**
     * Class constructor
     *
     * @param array $options
     */
    function __construct( $options = array() ) {
        if (is_array($options)) {
            foreach($options as $property => $value) {
                $this->$property = $value;
            }
        }
    }

    /**
     * the worker function
     *
     * @param string $text
     * @return string
     */
    function makeUrl($text) {
        //Shortcut
        $s = $this->separator;
        //prepare the string according to our options
        if ($this->decode) {
            $text = html_entity_decode($text, ENT_QUOTES, $this->decode_charset);
            $text = strtr($text, $this->translation_table);
        }

        if ($this->lowercase) {
            $text = strtolower($text);
        }
        if ($this->strip) {
            $text = strip_tags($text);
        }

        //filter
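        // The hyphen sits last in the class so it is read as a literal.
        $text = preg_replace("/[^&a-z0-9_\s'-]/i", '', $text);
        $text = str_replace(' ', $s, $text);
        // Quote the separator so characters such as '.' or '+' stay literal.
        $text = trim(preg_replace('/(?:' . preg_quote($s, '/') . '){2,}/', $s, $text), $s);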

        //chop?
        if (strlen($text) > $this->maxlength) {
            $text = substr($text, 0, $this->maxlength);

            if ($this->whole_word) {
                /**
                 * If maxlength is small and leaves us with only part of one
                 * word ignore the "whole_word" filtering.
                 */
                $words = explode($s, $text);
                // Drop only the trailing (possibly partial) word.
                array_pop($words);
                $temp  = implode($s, $words);
                if ($temp != '') {
                    $text = $temp;
                }
            }
        }
        //return =]
        if ($text == '') {
            return $this->blank;
        }

        return $text;
    }
}
Test File

This is a PHPUnit Unit Test.

Code: Select all

<?php
require_once 'PHPUnit/Framework.php';

require_once dirname(__FILE__) . '/../../lib/SafeUrl.class.php';

/**
 * Test class for SafeUrl.
 * Generated by PHPUnit on 2010-04-20 at 12:57:43.
 */
class SafeUrlTest extends PHPUnit_Framework_TestCase {

    /**
     * @var SafeUrl
     */
    protected $object;

    /**
     * Sets up the fixture, for example, opens a network connection.
     * This method is called before a test is executed.
     */
    protected function setUp() {
        $this->object = new SafeUrl;
    }

    /**
     * Tears down the fixture, for example, closes a network connection.
     * This method is called after a test is executed.
     */
    protected function tearDown() {

    }

    public function testMakeUrl() {
        
            $this->assertEquals( $this->object->makeUrl(
                'i\'m a test string!! do u like me. or not......., billy bob!!@#'),
                'im-a-test-string-do-u-like-me-or-not-billy-bob');

            $this->assertEquals( $this->object->makeUrl(
                '<b>some HTML</b> in <i>here</i>!!~'),
                'some-html-in-here');

            $this->assertEquals( $this->object->makeUrl(
                'i!@#*#@ l#*(*(#**$*o**(*^v^*(e d//////e\\\\\\\\v,,,,,,,,,,n%$#@!~e*(+=t'),
                'i-love-devnet');

            $this->assertEquals( $this->object->makeUrl(
                'A lOng String wiTh a buNchess of words thats! should be -chopped- at the last whole word'),
                'a-long-string-with-a-bunchess-of-words-thats');

            $this->object->lowercase = false;
            $this->assertEquals( $this->object->makeUrl(
                'Eyjafjallajökull Glacier'),
                'Eyjafjallajokull-Glacier');

            $this->object->maxlength = 100;
            $this->assertEquals( $this->object->makeUrl(
                'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûýýþÿŔŕ'),
                'AAAAAAACEEEEIIIIDjNOOOOOOUUUUYBSsaaaaaaaceeeeiiiionoooooouuuyybyRr');

            $this->object->maxlength = 20;
            $this->assertEquals( $this->object->makeUrl(
                    $this->big_mess),
                    'safeurl-new-safeurl');

            /**
             * Regression test:
             *
             * If max length was so small that we were left with only one
             * word, then whole_word would leave us with an empty string.
             */
            $this->object->maxlength = 5;
            $this->object->whole_word = true;
            $this->assertEquals( $this->object->makeUrl(
                'supercalafragalisticexpialadoshus'),
                'super');
            

            /**
             * Acceptable Bug:
             *
             * It would be nice if we put a space between block level elements,
             * but it is kind of too much to ask for.
             */
            $this->object->maxlength = 200;
            $html = <<<HTML
                <div>
                    <h1>Title</h1>
                    <h2>Subtitle!</h2>Read the <a href="ReleaseNotes.html">Release Notes</a> for this Revision.<br/>
                </div>
HTML;
            $this->assertEquals( $this->object->makeUrl(
                    $html),
                    'Title-SubtitleRead-the-Release-Notes-for-this-Revision');
            /**                    ^
             * Look: --------------|
             *
             * Should be:
             *     'Title-Subtitle-Read-the-Release-Notes-for-this-Revision'
             */
    }
    
    var $big_mess = '
            </span></li><li style=\"\" class=\"li2\"><span style=\"color:
            #ff0000;\">\$safeurl = new safeurl(); </span></li><li style=\"\"
            class=\"li1\"><span style=\"color: #ff0000;\">\$safeurl->lowercase
            = false;</span></li><li style=\"\" class=\"li2\"><span
            style=\"color: #ff0000;\">\$safeurl->whole_word = false;</span></li>
            <li style=\"\" class=\"li1\">&nbsp;</li><li style=\"\"
            class=\"li2\"><span style=\"color: #ff0000;\">\$tests = array(
            </span></li><li style=\"\" class=\"li1\"><span style=\"color:
            #ff0000;\"> &nbsp; &nbsp; &nbsp; &nbsp;\'</span>i\span
            style=\"color: #ff0000;\">\'m a test string!! do u like me. or
            not......., billy bob!!@#\'</span>, </li><li style=\"\"
            class=\"li2\">&nbsp; &nbsp; &nbsp; &nbsp; <span
            style=\"color: #ff0000;\">\'<b>some HTML</b> in <i>here</i>!!~\'
            </span>, </li><li style=\"\" class=\"li1\">&nbsp; &nbsp; &nbsp;
            &nbsp; <span style=\"color: #ff0000;\">\'i!@#*#@ l#*(*(#**$*o**(*^v
            ^*(e d//////e<span style=\"color: #000099; font-weight: bold;\">\\
            </span><span style=\"color: #000099; font-weight: bold;\">\\</span>
            <span style=\"color: #000099; font-weight: bold;\">\\</span><span
            style=\"color: #000099; font-weight: bold;\">\\</span>v,,,,,,,,,,n%
            $#@!~e*(+=t\'</span>,</li>';

}

Mordred
DevNet Resident
Posts: 1579
Joined: Sun Sep 03, 2006 5:19 am
Location: Sofia, Bulgaria

Re: Making Keyword Rich, Human Readable URLs

Post by Mordred »

1. The allowed chars should be configurable, and timemachine3030's idea of using a translation table has merit, especially for non-English texts.
2. The & in the permitted-by-default characters might cause problems. I suggest two helper methods that transform the result into a URL (for use in header(), for example) using rawurlencode, and into an HTML link using htmlspecialchars.
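
A minimal sketch of what those two helpers might look like. The function names `slug_to_url()` and `slug_to_link()` and the `/blog/12` path are made up for illustration; only `rawurlencode()` and `htmlspecialchars()` come from the suggestion itself.

```php
<?php
// Hypothetical helpers; names and paths are illustrative only.

// Build a URL path that is safe to send in a Location header.
function slug_to_url($base, $slug)
{
    // rawurlencode() percent-encodes reserved characters such as & and spaces.
    return rtrim($base, '/') . '/' . rawurlencode($slug);
}

// Build an HTML anchor; both the href and the visible text are escaped.
function slug_to_link($base, $slug, $title)
{
    $href = htmlspecialchars(slug_to_url($base, $slug), ENT_QUOTES, 'UTF-8');
    $text = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
    return '<a href="' . $href . '">' . $text . '</a>';
}

echo slug_to_url('/blog/12', 'tom & jerry'), "\n";
// /blog/12/tom%20%26%20jerry

echo slug_to_link('/blog/12', 'tom-and-jerry', 'Tom & Jerry <3'), "\n";
// <a href="/blog/12/tom-and-jerry">Tom &amp; Jerry &lt;3</a>
```

Keeping the two contexts in separate functions means a slug that still contains & is encoded correctly for whichever place it ends up, a header or a page.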
timemachine3030
Forum Newbie
Posts: 2
Joined: Sat Apr 17, 2010 11:21 am

Re: Making Keyword Rich, Human Readable URLs

Post by timemachine3030 »

Mordred wrote: 1. The allowed chars should be configurable, and timemachine3030's idea of using a translation table has merit, especially for non-English texts.
2. The & in the permitted-by-default characters might cause problems. I suggest two helper methods that transform the result into a URL (for use in header(), for example) using rawurlencode, and into an HTML link using htmlspecialchars.
Great ideas. I posted the code on GitHub some time ago but forgot to update this post. Patches welcome: https://github.com/timemachine3030/safe-url