Making human readable, keyword rich strings for PHP 5

s.dot
Tranquility In Moderation
Posts: 5001
Joined: Sun Feb 06, 2005 7:18 pm
Location: Indiana

Post by s.dot »

Okay, so this is the new and improved PHP 5 version of the class I made for PHP 4.

I mainly use this class to make human-readable, keyword-rich URLs (which the search engines love). It has other uses too, though, so I've dropped the class name safeurl and changed it to safe_string.

I've also added a property, _delimiter, so you don't have to use hyphens (-) as the delimiter. You could use underscores, periods, or any other character.

This class will turn user generated strings or strings pulled from databases into strings stripped of everything but alphanumeric characters, separated by a delimiter.
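
As a rough illustration of the idea before the full class, here is a condensed sketch. It is not the class itself: the function name sketch_safe_string() is made up for this example, and it skips entity decoding, length limits and the other options.

```php
<?php
// Hypothetical condensed helper, not the posted class: strip tags, lowercase,
// keep only letters, digits and whitespace, then join the words with $delimiter.
function sketch_safe_string($string, $delimiter = '-')
{
    $out = strtolower(strip_tags($string));
    $out = preg_replace('/[^a-z0-9\s]/', '', $out);
    // split on runs of whitespace, ignoring empty pieces at either end
    $words = preg_split('/\s+/', $out, -1, PREG_SPLIT_NO_EMPTY);
    return implode($delimiter, $words);
}

echo sketch_safe_string('<b>Some HTML</b> in here!!'), "\n";      // some-html-in-here
echo sketch_safe_string('<b>Some HTML</b> in here!!', '_'), "\n"; // some_html_in_here
echo sketch_safe_string('<b>Some HTML</b> in here!!', '.'), "\n"; // some.html.in.here
```

The delimiter is only applied when the words are joined, so swapping it never affects which characters survive the filter.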

Class safe_string Code

Code: Select all

<?php

/*
** This PHP5 class will turn strings that may be user generated,
** or pulled from a database, containing HTML or other special 
** characters into strings that are human readable, keyword rich, and safe
** for passing as URLs.  Very useful in combination with Apache's mod_rewrite module.
**
** --------------------------------------------------------------------------------
** This is updated from the PHP4 safeurl() class.
** Changes: All object properties are set to private.
** Delimiter can be chosen, instead of being forced to be a hyphen (-).
** The main function name make_safe_url() has been changed to just make_safe(), because 
** there are many other valid uses for this class other than just using the strings as
** URLs.
** Class has been broken up into many methods each performing a specific task, rather
** than being thrown into one method.
** --------------------------------------------------------------------------------
**
** Author - << smp_info _at_ yahoo _dot_ com >>
** Date - Monday, August 6th, 2007
*/

class safe_string
{
	/*
	** Set this to false if your string has already been cleaned of entities
	** @access private
	** @bool $_decode
	*/
	private $_decode = true;
	
	/*
	** If $_decode is set to true, this will be the character encoding set that 
	** will be used to decode strings.  Defaults to PHP's default of ISO-8859-1
	** @access private
	** @str $_decode_charset
	*/
	private $_decode_charset = 'ISO-8859-1';
	
	/*
	** Decides whether or not to leave the string how it is, or to lowercase all letters.
	** Defaults to true, lowercasing of all letters.
	** @access private
	** @bool $_lowercase
	*/
	private $_lowercase = true;
	
	/*
	** If your string has HTML in it, this will strip it out.  true = strip html,
	** false = don't strip html
	** @access private
	** @bool $_strip
	*/
	private $_strip = true;
	
	/*
	** Sets the maximum length of characters in the returned string.
	** @access private
	** @int $_maxlength
	*/
	private $_maxlength = 50;
	
	/*
	** Decides whether or not to chop the result string at the last whole word separated by 
	** $this->_delimiter.
	** @access private
	** @bool $_whole_word
	*/
	private $_whole_word = true;
	
	/*
	** Used as a delimiter between words.  Can be any character.
	** @access private
	** @str $_delimiter
	*/
	private $_delimiter = '-';
	
	/*
	** Default string to use if no alphanumeric characters can be found in the string
	** @access private
	** @str $_blank
	*/
	private $_blank = 'no-title';
	
	/*
	** Container for our output string
	** @access private
	** @str $_output
	*/
	private $_output;
	
	/*
	** Method to decode the given string of entities
	** @access private
	*/
	private function _decode_string()
	{
		if($this->_decode)
		{
			$this->_output = html_entity_decode($this->_output, ENT_QUOTES, $this->_decode_charset);
		}
	}
	
	/*
	** Method to lowercase the string
	** @access private
	*/
	private function _lowercase_string()
	{
		if($this->_lowercase)
		{
			$this->_output = strtolower($this->_output);
		}
	}
	
	/*
	** Method to strip the string of html tags
	** @access private
	*/
	private function _strip_string()
	{
		if($this->_strip)
		{
			$this->_output = strip_tags($this->_output);
		}
	}
	
	/*
	** Method to filter the string of invalid characters, replace &, whitespace, and apostrophes
	** and to replace multiple occurrences of $this->_delimiter.
	** @access private
	*/
	private function _filter_string()
	{
		//filter out invalid characters (the hyphen sits last in the class so it is always literal)
		$this->_output = preg_replace("/[^&a-z0-9_\s'-]/i", '', $this->_output);
		
		//turn & into ' and ', drop apostrophes, and collapse tabs, newlines and repeated spaces into one space
		$this->_output = str_replace(array('&', '\''), array(' and ', ''), $this->_output);
		$this->_output = preg_replace('/\s+/', ' ', $this->_output);
		
		//replace spaces with $this->_delimiter
		$this->_output = str_replace(' ', $this->_delimiter, $this->_output);
		
		//replace multiple occurrences of $this->_delimiter, then trim it from both ends
		$this->_output = trim(preg_replace('/(?:' . preg_quote($this->_delimiter, '/') . '){2,}/', $this->_delimiter, $this->_output), $this->_delimiter);
	}
	
	/*
	** Method to chop the string to $this->_maxlength characters
	** @access private
	*/
	private function _chop_string()
	{
		if(strlen($this->_output) > $this->_maxlength)
		{
			$this->_output = substr($this->_output, 0, $this->_maxlength);
			$this->_whole_word_string();
		}
	}
	
	/*
	** Method to chop the string at the last whole word separated by $this->_delimiter
	** @access private
	*/
	private function _whole_word_string()
	{
		if($this->_whole_word)
		{
			$words = explode($this->_delimiter, $this->_output);
			
			//drop only the last (possibly cut-off) word; a single long word keeps its hard cut
			if(count($words) > 1)
			{
				array_pop($words);
				$this->_output = implode($this->_delimiter, $words);
			}
		}
	}
	
	/*
	** Method that simply runs through the list of methods to prepare $this->_output
	** @access private
	** @param str $string
	** @return str $this->_output
	*/
	private function _run($string)
	{
		$this->_output = $string;
		$this->_decode_string();
		$this->_lowercase_string();
		$this->_strip_string();
		$this->_filter_string();
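		$this->_chop_string();
		
		//fall back to $this->_blank if nothing usable is left
		if((string) $this->_output === '')
		{
			$this->_output = $this->_blank;
		}
		
		return $this->_output;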
	}
	
	/*
	** Method to call the _run() method, and return $this->_output string
	** @access public
	** @param str $string
	** @return string $this->_output
	*/
	public function make_safe($string)
	{
		return $this->_run($string);
	}
	
	/*
	** Method to allow changing of private properties
	** @access public
	** @param str $property
	** @param mixed $value
	*/
	public function __set($property, $value)
	{
		$this->$property = $value;
	}
}
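
For anyone who only needs the chopping step, the two chop methods can be sketched as one standalone function. This is a hedged sketch rather than the class's own code: the name chop_at_word() and the use of strrpos() are assumptions, and it expects a string that has already been filtered and trimmed.

```php
<?php
// Hypothetical standalone helper: cut $string to at most $maxlength characters,
// backing up to the last delimiter so no word is left half-finished.
function chop_at_word($string, $maxlength = 50, $delimiter = '-')
{
    if (strlen($string) <= $maxlength) {
        return $string;
    }
    $cut = substr($string, 0, $maxlength);
    // If the next character is the delimiter, the cut fell on a word boundary.
    if (substr($string, $maxlength, strlen($delimiter)) === $delimiter) {
        return rtrim($cut, $delimiter);
    }
    $pos = strrpos($cut, $delimiter);
    // No delimiter at all: one very long word, so keep the hard cut.
    return $pos === false ? $cut : substr($cut, 0, $pos);
}

echo chop_at_word('a-long-string-with-a-bunchess-of-words-thats-should-be-chopped'), "\n";
// a-long-string-with-a-bunchess-of-words-thats
```

Checking the character just past the cut keeps a complete final word when the limit happens to land exactly on a word boundary.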
I also re-ran the tests from the other topic to make sure this class produces exactly the same results.

Test One

Code: Select all

$safe_string = new safe_string(); 

$tests = array( 
        'i\'m a test string!! do u like me. or not......., billy bob!!@#', 
        '<b>some HTML</b> in <i>here</i>!!~', 
        'i!@#*#@ l#*(*(#**$*o**(*^v^*(e d//////e\\\\\\\\v,,,,,,,,,,n%$#@!~e*(+=t', 
        'A lOng String wiTh a buNchess of words thats! should be -chopped- at the last whole word' 
); 

foreach($tests AS $test) 
{
	echo $safe_string->make_safe($test) . '<br />';
}
Test One Result

Code: Select all

im-a-test-string-do-u-like-me-or-not-billy-bob
some-html-in-here
i-love-devnet
a-long-string-with-a-bunchess-of-words-thats
Test Two

Code: Select all

$safe_string = new safe_string();

//we'll change a few object properties
$safe_string->_lowercase = false;
$safe_string->_whole_word = false;

$tests = array( 
        'i\'m a test string!! do u like me. or not......., billy bob!!@#', 
        '<b>some HTML</b> in <i>here</i>!!~', 
        'i!@#*#@ l#*(*(#**$*o**(*^v^*(e d//////e\\\\\\\\v,,,,,,,,,,n%$#@!~e*(+=t', 
        'A lOng String wiTh a buNchess of words thats! should be -chopped- at the last whole word' 
); 

foreach($tests AS $test) 
{
	echo $safe_string->make_safe($test) . '<br />';
}
Test Two Result

Code: Select all

im-a-test-string-do-u-like-me-or-not-billy-bob
some-HTML-in-here
i-love-devnet
A-lOng-String-wiTh-a-buNchess-of-words-thats-shoul
Real world project usage

Code: Select all

$safe_string = new safe_string();

echo '<a href="blog/jimbob/12/' . $safe_string->make_safe($dba['blog_title']) . '.html">' . $dba['blog_title'] . '</a>';
This real-world example also needs a mod_rewrite rule to map the URL back to the script. Together, I find they help my search engine rankings.
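
As a rough illustration of the receiving side: once the request reaches a script, only the numeric ID is needed for the lookup, and the title part is there for readers and search engines. The sketch below assumes the URL layout from the link above (blog/user/id/title.html). The function name and array keys are hypothetical, and the rewrite rule itself is not shown in this post.

```php
<?php
// Hypothetical parser for paths shaped like blog/jimbob/12/my-first-post.html
function parse_blog_path($path)
{
    $pattern = '#^blog/([a-z0-9_-]+)/(\d+)/([a-z0-9_-]+)\.html$#i';
    if (!preg_match($pattern, ltrim($path, '/'), $m)) {
        return null; // not a blog URL
    }
    // the ID drives the database lookup; the slug is informational only
    return array('user' => $m[1], 'id' => (int) $m[2], 'slug' => $m[3]);
}

var_dump(parse_blog_path('/blog/jimbob/12/my-first-post.html'));
```

Because the lookup uses the ID, a blog title can change later without breaking old links.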
Set Search Time - A google chrome extension. When you search only results from the past year (or set time period) are displayed. Helps tremendously when using new technologies to avoid outdated results.
superdezign
DevNet Master
Posts: 4135
Joined: Sat Jan 20, 2007 11:06 pm

Post by superdezign »

Code: Select all

urlencode(strtolower(preg_replace('&-+&', '-', preg_replace('&[^A-Za-z0-9]&', '-', strip_tags($data)))));
I miss anything? ;)
Hehe, probably.
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

superdezign wrote:

Code: Select all

urlencode(strtolower(preg_replace('&-+&', '-', preg_replace('&[^A-Za-z0-9]&', '-', strip_tags($data)))));
I miss anything? ;)
Hehe, probably.
Could be done in a single expression instead of two. ;)

Post by s.dot »

Thanks for the encouragement. :x

Post by s.dot »

superdezign wrote:

Code: Select all

urlencode(strtolower(preg_replace('&-+&', '-', preg_replace('&[^A-Za-z0-9]&', '-', strip_tags($data)))));
I miss anything? ;)
Hehe, probably.
Actually, you missed quite a lot. This code you posted does not allow for ANY options, no maximum length, no chopping at whole words, does not take into account entities (which if aren't decoded, will turn & into -amp- (weird)), and most importantly, the use of urlencode() would not make it human readable friendly.
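
To illustrate the entity point, here is a small comparison. The helper name hyphenate() is made up for this example; the regex steps are the ones from the quoted one-liner, with urlencode() left out because the result only contains letters, digits and hyphens anyway.

```php
<?php
// Hypothetical helper wrapping the quoted one-liner's regex steps.
function hyphenate($data)
{
    return strtolower(preg_replace('&-+&', '-', preg_replace('&[^A-Za-z0-9]&', '-', strip_tags($data))));
}

$title = 'Salt &amp; Pepper';

// Without decoding, the letters of the entity survive the filter.
echo hyphenate($title), "\n";                                          // salt-amp-pepper

// Decoding first turns &amp; back into &, which is then filtered out.
echo hyphenate(html_entity_decode($title, ENT_QUOTES, 'UTF-8')), "\n"; // salt-pepper
```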

Post by superdezign »

scottayy wrote:
superdezign wrote:

Code: Select all

urlencode(strtolower(preg_replace('&-+&', '-', preg_replace('&[^A-Za-z0-9]&', '-', strip_tags($data)))));
I miss anything? ;)
Hehe, probably.
Actually, you missed quite a lot. This code you posted does not allow for ANY options, no maximum length, no chopping at whole words, does not take into account entities (which if aren't decoded, will turn & into -amp- (weird)), and most importantly, the use of urlencode() would not make it human readable friendly.
Hehe, oh well then.

I usually create a whole class of static functions for doing that sort of thing. It formats my URLs, parses my custom tags, cleans user submitted data, filters HTML, cleans suspicious 'src' and 'href' attributes, etc.

Edit: It's very site-specific, though.
Last edited by superdezign on Mon Aug 06, 2007 9:01 pm, edited 1 time in total.

Post by superdezign »

feyd wrote:
superdezign wrote:

Code: Select all

urlencode(strtolower(preg_replace('&-+&', '-', preg_replace('&[^A-Za-z0-9]&', '-', strip_tags($data)))));
I miss anything? ;)
Hehe, probably.
Could be done in a single expression instead of two. ;)
:!:

Code: Select all

&[^A-Za-z0-9]+&
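
Dropped into the original one-liner, the + quantifier makes the second preg_replace() unnecessary, since each run of unwanted characters becomes a single hyphen. A sketch with a hypothetical wrapper name; trimming the leading and trailing hyphens is an addition, not something from the posts above.

```php
<?php
// Hypothetical wrapper around the single-expression version.
function hyphenate_once($data)
{
    // one pass: every run of non-alphanumeric characters becomes one hyphen
    $slug = preg_replace('&[^A-Za-z0-9]+&', '-', strip_tags($data));
    return strtolower(trim($slug, '-'));
}

echo hyphenate_once('<b>some HTML</b> in <i>here</i>!!~'), "\n"; // some-html-in-here
```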