searching text in html page

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

cisrudlow
Forum Newbie
Posts: 2
Joined: Sun May 08, 2005 7:28 pm

Post by cisrudlow »

I want to search for a phrase in an HTML file and highlight it (like Google does when I click on "cached"). Does anybody know of a class that can help me? Or maybe somebody knows how I could do this myself?
Thx
php_wiz_kid
Forum Contributor
Posts: 181
Joined: Tue Jun 24, 2003 7:33 pm

Post by php_wiz_kid »

You could use str_replace().

Just open a file, put its contents into a variable and use str_replace() to replace a string with the desired string.

Code: Select all

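$file_open = fopen($file, 'r');
$file_read = fread($file_open, filesize($file)); // Contents of $file
fclose($file_open);

// str_replace() returns the modified string; $file_read itself is not changed
$result = str_replace($to_replace, $replace_with, $file_read);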
Give that a whirl. $file_read doesn't have to come from a file. The subject can be any string, or an array of strings.
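To illustrate the array case, here is a minimal sketch. The sample strings and variable names are made up for the example. With an array subject, str_replace() returns an array with the replacement applied to each element, and the search and replacement arguments can be arrays too.

```php
<?php
// Hypothetical sample data, not taken from any real file
$lines = array('red apple', 'green apple');

// Array subject: every element is processed and an array is returned
$pears = str_replace('apple', 'pear', $lines);
print_r($pears);

// Array search and replace: the pairs are matched up by position
$swapped = str_replace(array('red', 'green'), array('dark', 'light'), 'red and green');
echo $swapped, "\n";
```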
John Cartwright
Site Admin
Posts: 11470
Joined: Tue Dec 23, 2003 2:10 am
Location: Toronto

Post by John Cartwright »

Or you can shorten that to:

Code: Select all

$result = str_replace($to_replace, $replace_with, file_get_contents($filename));
I would recommend using file_get_contents() instead of fopen() followed by fread().
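One thing worth checking in the one-liner is the return value: file_get_contents() returns false when the file cannot be read. A rough sketch of that check follows. The temp file and the sample strings are assumptions so the example runs on its own.

```php
<?php
// Hypothetical temp file so the sketch is self-contained
$filename = tempnam(sys_get_temp_dir(), 'demo');
file_put_contents($filename, 'Hello world');

$contents = file_get_contents($filename);
if ($contents === false) {
    // Reading failed, so there is nothing to search in
    die('Could not read ' . $filename);
}

$result = str_replace('world', 'forum', $contents);
echo $result, "\n";

unlink($filename); // Clean up the temp file
```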
cisrudlow
Forum Newbie
Posts: 2
Joined: Sun May 08, 2005 7:28 pm

Post by cisrudlow »

OK, but if I want to find "body" or "table" etc., it'll match the HTML tags, too.
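A small sketch of the problem, using a made-up HTML snippet: a plain str_replace() cannot tell a tag name from visible text, so the markup itself gets rewritten.

```php
<?php
// Hypothetical HTML snippet; the word "table" appears in tags and in text
$html = '<table><tr><td>A table of data</td></tr></table>';

// Wrap every occurrence of "table" in <b> tags
$marked = str_replace('table', '<b>table</b>', $html);

// The opening and closing <table> tags are now broken as well
echo $marked, "\n";
```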
php_wiz_kid
Forum Contributor
Posts: 181
Joined: Tue Jun 24, 2003 7:33 pm

Post by php_wiz_kid »

Yes, it should. You could do this:

Code: Select all

$to_replace = "<table>";
$replace_with = "<blahblah>";
$result = str_replace($to_replace, $replace_with, file_get_contents($filename));
It would turn this:

Code: Select all

...
<body>
<table>
  <tr>
    <td>BLAH</td>
  </tr>
</table>
</body>
...
to:

Code: Select all

...
<body>
<blahblah>
  <tr>
    <td>BLAH</td>
  </tr>
</table>
</body>
...
In fact, I made a template object that uses this (str_replace) to find placeholders like {U_THING} inside an HTML/text/template file and replace them with XHTML.
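The template object itself isn't shown here, so the following is only a guess at the general idea, not the actual class. The placeholder names and values are invented for the example.

```php
<?php
// Hypothetical template text with {U_...} placeholders
$template = '<h1>{U_TITLE}</h1><p>Hello, {U_NAME}!</p>';

// Hypothetical placeholder => value map
$values = array(
    '{U_TITLE}' => 'Welcome',
    '{U_NAME}'  => 'cisrudlow',
);

// Replace every placeholder in one call
$page = str_replace(array_keys($values), array_values($values), $template);
echo $page, "\n";
```

Because the braces never appear in ordinary HTML tags, this kind of replacement doesn't run into the tag-matching problem mentioned above.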
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Post by Chris Corbyn »

Moving to regex...
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Post by Chris Corbyn »

Here's a nice little trick ;-)

Code: Select all

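function getBlock($source, $tag) {
    // Escape the tag name so it is treated literally inside the pattern
    $tag = preg_quote($tag, '#');
    // \b stops 'b' from also matching <body>; the s flag lets . span newlines
    $re = '#<\s*'.$tag.'\b[^>]*>(.*?)<\s*/\s*'.$tag.'\s*>#is';
    if (preg_match($re, $source, $matches)) {
        return $matches[1]; // The bit you need: everything between the tags
    }
    return false;
}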

/*** EXAMPLE ****/
$google_source = file_get_contents('http://www.google.com/');

$googles_body = getBlock($google_source, 'body');

echo '<pre>';
echo htmlspecialchars($googles_body);
echo '</pre>';
Change preg_match() to preg_match_all() if you're looking for numerous items (e.g. <b>text</b> tags)...
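A possible sketch of that preg_match_all() variant is below. The function name getAllBlocks() and the sample string are made up, and the preg_quote() call and \b are additions of this sketch.

```php
<?php
// Hypothetical variant of getBlock() that returns every match, not just the first
function getAllBlocks($source, $tag) {
    // Escape the tag name so it is treated literally inside the pattern
    $tag = preg_quote($tag, '#');
    // \b keeps 'b' from matching <body>; i ignores case; s lets . span newlines
    $re = '#<\s*' . $tag . '\b[^>]*>(.*?)<\s*/\s*' . $tag . '\s*>#is';
    // $matches[1] holds the first capture group of each match
    if (preg_match_all($re, $source, $matches)) {
        return $matches[1];
    }
    return array();
}

$bold = getAllBlocks('<b>one</b> <body> <B class="x">two</B>', 'b');
print_r($bold);
```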

Hope that helps ;-)
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Post by Chris Corbyn »

Misunderstood question... apologies.

To avoid replacing HTML tags like you suggested try this...

It will have to be a regex for this anyway.

Untested (can you let me know how this goes, please? I'm curious about this one but too busy to test).

Code: Select all

function highlightWords ($source, $word, $color) {
    // Escape the word so characters such as "." or "#" are matched literally
    $word = preg_quote($word, '#');
    // [^<>]* (not [^>]*) so only a ">" that closes the current tag blocks a match
    $re = '#(?<!<)(\s*'.$word.')(?![^<>]*>)#is';
    $replace = '<span style="background-color:'.$color.'">$1</span>';
    $highlighted = preg_replace($re, $replace, $source);
    return $highlighted;
}

/*** EXAMPLE ***/
$regex_info_source = file_get_contents('http://www.regular-expressions.info/');
$re_highlighted = highlightWords($regex_info_source, 'regex', '#FFEE00');
echo $re_highlighted;
At first sight the regular expression looks quite scary (and it also matches the whitespace preceding the word, but nobody sees that).

The (?<!...) is a negative lookbehind (in other words, the word must NOT directly follow "<"). Equally, the (?!...) is a negative lookahead (in other words, the word must not be followed by ">" before any "<" appears, which would mean it sits inside a tag). The \s* and [^<>]* just allow other permitted characters to be in the source code without causing a problem.
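If the lookarounds feel too fragile, another way to leave the tags alone is to split the source into tag and text pieces and only touch the text. This is a different technique from the regex above, offered as a rough sketch. The function name highlightOutsideTags() and the sample HTML are made up, and it assumes no stray "<" or ">" inside attribute values or scripts.

```php
<?php
// Hypothetical helper: highlight $word only in the text between tags
function highlightOutsideTags($html, $word, $color) {
    // Split on tags; the capture group keeps the tags in the result array
    $parts = preg_split('#(<[^>]*>)#', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $re = '#' . preg_quote($word, '#') . '#i';
    $open = '<span style="background-color:' . $color . '">';
    foreach ($parts as $i => $part) {
        // With one capture group, even indexes are text and odd indexes are tags
        if ($i % 2 === 0) {
            $parts[$i] = preg_replace($re, $open . '$0</span>', $part);
        }
    }
    return implode('', $parts);
}

$html = '<table class="table"><tr><td>A table of data</td></tr></table>';
$out = highlightOutsideTags($html, 'table', '#FFEE00');
echo $out, "\n";
```

Only the "table" inside the cell text gets wrapped in a span. The tag names and the class attribute stay as they were.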

Good luck ;-)