Stripping words from a string

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

BigChase
Forum Newbie
Posts: 8
Joined: Sat Oct 15, 2005 4:48 pm

Stripping words from a string

Post by BigChase »

What is the best way to strip a pre-defined list of words from an arbitrary string? What I want to do is remove all simple connecting words like "a", "the", "what", "who", "where", etc. (my English teacher probably has a better term to describe these words than "connecting words") from a string.

I imagine it's something like the following, though perhaps there is a common tool, function, code fragment, etc. that people use to accomplish this?

$words_to_exclude = array(
"the",
"a ",
"what",
"who",
// etc.
);

php_grep_function($string, $words_to_exclude); (placeholder name; I'm not sure what the PHP grep or string replace function is or how it works)

[As a follow-on question, does anyone know of a tool that can discern and extract likely subject words from a sentence? In other words, a tool that can guess which words in a sentence are its key words?]

Thanks
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

preg_replace()


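// Escape regex metacharacters in a word, including the # delimiter
function pregProtect($a) {
  return preg_quote($a,'#');
}
// Build one alternation pattern of the form #\b(word1|word2|...)\b#
$words_to_exclude = '#\b('.implode('|',array_map('pregProtect',$words_to_exclude)).')\b#';

// Remove every whole-word match from the text
$text = preg_replace($words_to_exclude,'',$text);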
That's untested.
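
For anyone wanting to try the approach above end to end, here is a minimal sketch. The word list and sample sentence are made up for illustration, and the final whitespace cleanup is an extra step that is not part of feyd's post: removing words leaves their surrounding spaces behind, so you may want to collapse them.

```php
<?php
// Hypothetical sample data, not from the original post.
$words_to_exclude = array('the', 'a', 'what', 'who', 'where');
$text = 'what is the name of a band';

// Escape each word so regex metacharacters (and the # delimiter) stay literal.
function pregProtect($a) {
  return preg_quote($a, '#');
}
$pattern = '#\b(' . implode('|', array_map('pregProtect', $words_to_exclude)) . ')\b#';

// Remove the listed words as whole words only.
$stripped = preg_replace($pattern, '', $text);

// Assumed extra step: collapse leftover runs of whitespace and trim the ends.
$stripped = trim(preg_replace('#\s+#', ' ', $stripped));

echo $stripped, "\n"; // is name of band
```

Note that "name" and "band" keep their "a" because the pattern only matches whole words.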

Brilliant!

Post by BigChase »

Appears to work very well! Thanks feyd. Very elegant solution.

What do the "#"s and "\b"s do?

Post by feyd »

PCRE (preg_*) functions require pattern start and end markers (delimiters), so I often use #, as it's rarely used in patterns I build. \b is the word boundary metacharacter.
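
A small sketch of both points, using a made-up sample string: the delimiter is a free choice (# and / below wrap the same regex), and \b is what stops the pattern from matching inside longer words.

```php
<?php
// Hypothetical sample input, chosen so "the" also appears inside other words.
$text = 'the other theme';

// Same regex, two different delimiters: the results are identical.
$with_hash  = preg_replace('#\bthe\b#', '', $text);
$with_slash = preg_replace('/\bthe\b/', '', $text);

// Without \b, "the" is also removed from inside "other" and "theme".
$no_boundary = preg_replace('#the#', '', $text);

// Brackets make the leftover leading space visible.
echo '[', $with_hash, "]\n";   // [ other theme]
echo '[', $with_slash, "]\n";  // [ other theme]
echo '[', $no_boundary, "]\n"; // [ or me]
```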

Post by BigChase »

Is there a way to make the preg_replace in the solution above case-insensitive, so that the list of excluded words does not have to contain two versions of each word, capped and uncapped?

Thanks.

Post by feyd »

Use the i pattern modifier.
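
As a sketch of where the modifier goes: pattern modifiers are written after the closing delimiter, so the pattern from the earlier post ends in `#i` instead of `#`. The word list and sample sentence here are hypothetical.

```php
<?php
// Hypothetical sample data: lowercase word list, mixed-case text.
$words_to_exclude = array('the', 'what');
$text = 'What is The answer';

function pregProtect($a) {
  return preg_quote($a, '#');
}
$alternation = implode('|', array_map('pregProtect', $words_to_exclude));

// Identical patterns except for the trailing i modifier.
$case_sensitive   = '#\b(' . $alternation . ')\b#';
$case_insensitive = '#\b(' . $alternation . ')\b#i';

$unchanged = preg_replace($case_sensitive, '', $text);
$stripped  = preg_replace($case_insensitive, '', $text);

echo '[', $unchanged, "]\n"; // [What is The answer]
echo '[', $stripped, "]\n";  // [ is  answer]
```

With the modifier in place, the list only needs one spelling of each word.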

Post by BigChase »

The following seems to work just as well and is much simpler. No need to use regular expressions.


$excluded_words = array(
"the",
"is",
"a",
// etc...
);

$string = str_replace($excluded_words, "", $string);

Post by feyd »

You realize that's case-sensitive and will replace those words if they fall inside another word?
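
A quick sketch of both problems, on a made-up sentence. It also tries str_ireplace(), which is not mentioned in the thread: it handles the case issue but still matches inside other words, so the \b regex remains the safer choice here.

```php
<?php
// Hypothetical sample data.
$excluded_words = array('the', 'is');
$string = 'The theme is this';

// Case-sensitive and substring-based: misses "The", damages "theme" and "this".
$plain = str_replace($excluded_words, '', $string);

// Case-insensitive, but still substring-based.
$insensitive = str_ireplace($excluded_words, '', $string);

// Whole words only, case-insensitive.
$regex = preg_replace('#\b(the|is)\b#i', '', $string);

echo '[', $plain, "]\n";       // [The me  th]
echo '[', $insensitive, "]\n"; // [ me  th]
echo '[', $regex, "]\n";       // [ theme  this]
```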