Stripping words from a string

Posted: Sat Oct 15, 2005 6:11 pm
by BigChase
What is the best way to strip a pre-defined list of words from an arbitrary string? What I want to do is remove all simple connecting words like "a", "the", "what", "who", "where", etc. (my English teacher probably has a better term to describe these words than "connecting words") from a string.

I imagine it's something like the following, though perhaps there is a common tool, function, code fragment, etc. that people use to accomplish this?

$words_to_exclude = array(
"the",
"a ",
"what",
"who",
// etc...
);

php_grep_function ($string, $words_to_exclude); (not sure what the php grep or string replace function is or how it works)

[As a follow-on question, does anyone know of a tool that can discern and extract likely subject words from a sentence? In other words, a tool that can guess which words in a sentence are its key words?]

Thanks

Posted: Sat Oct 15, 2005 6:18 pm
by feyd
preg_replace()

function pregProtect($a) {
  return preg_quote($a,'#');
}
$words_to_exclude = '#\b('.implode('|',array_map('pregProtect',$words_to_exclude)).')\b#';

$text = preg_replace($words_to_exclude,'',$text);
that's untested..

Brilliant!

Posted: Sun Oct 16, 2005 1:00 am
by BigChase
Appears to work very well! Thanks feyd. Very elegant solution.

What do the "#"s and "\b"s do?

Posted: Sun Oct 16, 2005 2:01 am
by feyd
PCRE (preg_*) functions require pattern start and end markers (delimiters), so I often use #, as it's rarely used in patterns I build. \b is the word boundary metacharacter.
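
To illustrate the delimiters and \b described above, here is a minimal sketch. The sample string and variable names are made up for the example and are not from the original posts.

```php
<?php
// Hypothetical sample input, chosen so that "the" also appears inside longer words.
$sample = 'the other theme';

// '#' marks the start and end of the pattern. \b on each side means
// "the" only matches when it stands alone as a whole word.
$with_boundary = preg_replace('#\bthe\b#', '', $sample);

// Without \b, every occurrence of "the" is removed, even inside other words.
$without_boundary = preg_replace('#the#', '', $sample);

// preg_quote() with '#' as its second argument also escapes the delimiter,
// so a word containing '#' cannot end the pattern early.
$quoted = preg_quote('a#b', '#');

echo $with_boundary, "\n";    // " other theme"
echo $without_boundary, "\n"; // " or me"
echo $quoted, "\n";           // a\#b
```

Any other delimiter, such as / or ~, would work the same way as long as it is passed to preg_quote() too.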

Posted: Sun Oct 16, 2005 7:34 pm
by BigChase
Is there a way to make the preg_replace in the solution above case-insensitive, so that the list of excluded words does not have to contain two versions of each word, capped and uncapped?

Thanks.

Posted: Sun Oct 16, 2005 8:10 pm
by feyd
use the i pattern modifier
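
A rough sketch of where the modifier goes: it is appended after the closing delimiter of the pattern. The word list and input text here are hypothetical, and the whitespace cleanup at the end is an extra step that was not part of the solution posted earlier.

```php
<?php
// Hypothetical word list and input, for illustration only.
$words_to_exclude = array('the', 'a', 'what', 'who');
$text = 'The cat saw what a dog did';

// Escape each word so it is treated literally inside the pattern.
$quoted = array();
foreach ($words_to_exclude as $word) {
    $quoted[] = preg_quote($word, '#');
}

// The trailing "i" after the closing '#' makes the match case-insensitive.
$pattern = '#\b(' . implode('|', $quoted) . ')\b#i';

$stripped = preg_replace($pattern, '', $text);

// Extra step (an assumption, not from the thread): collapse the gaps
// left behind by the removed words and trim the ends.
$stripped = trim(preg_replace('#\s+#', ' ', $stripped));

echo $stripped, "\n"; // cat saw dog did
```

With this, "The" and "the" are both removed without listing each capitalization separately.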

Posted: Mon Oct 17, 2005 3:16 am
by BigChase
The following seems to work just as well and is much simpler. No need to use regular expressions.


$excluded_words = array(
"the",
"is",
"a",
// etc...
);

$string = str_replace($excluded_words, "", $string);

Posted: Mon Oct 17, 2005 7:22 am
by feyd
you realize that's case sensitive and will replace those words if they fall inside another word?
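
The difference can be seen in a small comparison. This is only a sketch with a made-up input string; str_ireplace() is shown as the case-insensitive counterpart of str_replace(), and it still has the inside-a-word problem.

```php
<?php
$excluded_words = array('the', 'is', 'a');
// Hypothetical input: "the" and "is" also occur inside "other" and "history".
$input = 'The other history';

// Case-sensitive and blind to word boundaries: the capitalized "The"
// survives, but "other" and "history" are damaged.
$plain = str_replace($excluded_words, '', $input);

// Case-insensitive, but still removes matches inside other words.
$insensitive = str_ireplace($excluded_words, '', $input);

// Word boundaries plus the i modifier remove only the whole word.
$regex = preg_replace('#\b(the|is|a)\b#i', '', $input);

echo $plain, "\n";       // "The or htory"
echo $insensitive, "\n"; // " or htory"
echo $regex, "\n";       // " other history"
```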