Posted: Thu Oct 28, 2004 5:54 pm
by d3ad1ysp0rk
Ok, well I just finished the code when you posted, so I'm posting it anyway.
This searches for a single word; I'll try the multi-word one now.
Is it ever going to be three words?
Code: Select all
<?php
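// Fetch every glossary term into a flat array.
function get_word_array(){
    mysql_connect("localhost","username","password");
    mysql_select_db("mydatabase");
    $result = mysql_query("SELECT * FROM glossary");
    $myarray = array();
    // mysql_fetch_array() returns one row per call, so loop over all rows
    while($row = mysql_fetch_array($result)){
        $myarray[] = $row[0]; // assumes the term is the first column of the table
    }
    return $myarray;
}
$mystring = "body of site";
$page_array = explode(" ",$mystring);
$word_array = get_word_array();
// Wrap every word that appears in the glossary in a link
for($i=0,$size=count($page_array);$i<$size;$i++){
    if(in_array($page_array[$i],$word_array)){
        $page_array[$i] = "<a href=\"glossary.php?w=".$page_array[$i]."\">".$page_array[$i]."</a>";
    }
}
$page = implode(" ",$page_array);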
?>
Posted: Thu Oct 28, 2004 5:57 pm
by rehfeld
now, will you be parsing plain text, or html?
please say text lol
Posted: Thu Oct 28, 2004 5:58 pm
by mongol
It could be any number of words to make up a term. That's the kicker!
This must be a tricky one...
I'm sorry to be a pain, but when it's sorted, this will be a useful function for everyone's toolbox.
Posted: Thu Oct 28, 2004 5:59 pm
by mongol
Just plain text. Do I sound that psycho?? lol
Posted: Thu Oct 28, 2004 6:29 pm
by rehfeld
thank god! lol
there are still some pitfalls i see
let's say you're matching the phrase "too difficult"
1- "this mongol person is being too difficult and we dont want to help"
2- "this mongol person is being too difficult, and we dont want to help"
3- "this mongol person is being too difficulty"
4- "this mongol person is being too difficult. we dont want to help"
5- "this mongol person is being too difficult.we dont want to help"
the first one is a piece of cake to match, because we have a space on both sides of the search phrase
the second is a bit harder, because we now have to allow other characters to be ignored, in this case the comma
because of the third example, we can only allow specific characters to be ignored (otherwise "too difficulty" would match as well)
and you can see the other problems with 4 and 5. Decisions need to be made whether to allow those to match or not.
a perfect match cannot really be achieved, because there will probably always be syntax errors in the text you're searching (i'm assuming)
but i doubt you're after absolute perfection.
would it be acceptable if it did not match phrases which ended in a comma, or period, or quotes?
do you have control of the text to search, so that you could make sure
there are no periods or commas etc. where they would interfere with proper matching? (probably not, i assume)
i wish we had some google source code hehehe
Posted: Thu Oct 28, 2004 6:38 pm
by mongol
True, there are many problems, but I'm amazed this problem has not been tackled before. If it requires some google search code then it must be worth solving.
I will not have control of the text that gets parsed through this function.
There must be a way to do this, and while perfection is the ultimate goal, a degree below would be fantastic!!!
Posted: Thu Oct 28, 2004 6:46 pm
by mongol
How about this:
1. we replace all "." with " . " (now the only word that doesn't have a space in front of it is the one at the very beginning of the string, and that's an easy fix)
2. we replace all "," with " ," and any other punctuation with a space in front.
3. now we can match/replace using the space at either end method, right?
4. we now put all the punctuation back to how it was.
would that bring us closer?
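A rough sketch of the four steps above, for one phrase at a time. The helper name link_phrase_padded, the punctuation list, and the glossary.php?w= link format (borrowed from the single-word code earlier in the thread) are assumptions for illustration. The match is case-sensitive, and the start and end of the text are padded too so a phrase there still has a space on both sides:

```php
<?php
// Hypothetical helper: link one phrase using the pad / match / restore idea.
function link_phrase_padded($text, $phrase) {
    // Assumed list of "problem" punctuation; extend as needed.
    $punct = array('.', ',', '!', '?', ';', ':');

    // Steps 1-2: put a space on both sides of each punctuation character,
    // and pad the start and end of the text as well.
    $padded = ' ' . $text . ' ';
    foreach ($punct as $p) {
        $padded = str_replace($p, " $p ", $padded);
    }

    // Step 3: the phrase now always has a space on both sides when it is a whole-word match.
    $link = '<a href="glossary.php?w=' . urlencode($phrase) . '">' . $phrase . '</a>';
    $padded = str_replace(" $phrase ", " $link ", $padded);

    // Step 4: remove exactly the spaces that were added around the punctuation.
    foreach ($punct as $p) {
        $padded = str_replace(" $p ", $p, $padded);
    }

    // Drop the padding added at the start and end.
    return substr($padded, 1, -1);
}

echo link_phrase_padded('this mongol person is being too difficult, and we dont want to help', 'too difficult');
```

With this sketch "too difficult," and "too difficult.we" get linked, while "too difficulty" is left alone.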
Posted: Fri Oct 29, 2004 3:34 am
by rehfeld
i've got something that almost works, i think you'll like the solution i came up with
it works by checking the character directly before and after the phrase. if both characters are ones we've specified as ignorable, it counts as a match. the characters to be ignored are in an array, so you can easily add/remove ignore chars if needed.
right now it only checks 1 char before and 1 char after, but i think that will be sufficient in the majority of cases.
lemme get it working a bit better and i'll post it
only thing is it needs php5 ... but that can be changed later
you know, i never thought of just adding whitespace to all the possible problem punctuation. that could work. and it's as easy as
Code: Select all
$document = str_replace('.', ' . ', $document); // do this for each problem char
// do matching
$document = str_replace(' . ', '.', $document); // return it to normal
nice idea
oh and mongol, the whole "too difficult dont wanna help" thing, it was just a joke, i hope it didn't sound bad. i have odd humor lol
you know, i know i have seen vbulletin boards with hacks that do EXACTLY this, but i don't recall if they matched phrases, i think it was just single words (which is easy as hell to do). But maybe they have improved it, i've seen some excellent coders working on vbulletin hacks.
if you want to look for it,
http://vbulletin.org
i think you can browse without a license, you just can't view the whole code. If you find something, tell me, i have a license so i can grab it.
i saw it on a board that was running 2.x, and most everyone runs 3.x now, but i'd imagine they've ported it to work on 3.x by now too, cause it was over a year ago when i saw it.
Posted: Fri Oct 29, 2004 4:28 am
by Wayne
how about ... (not tested)
Code: Select all
$find = "\b" . $pattern . "\b";
$var = eregi_replace($find, "<a href=\"http://www.url.com\">\\0</a>", $sample_text);
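The word-boundary idea can also be tried with the PCRE functions. \b is a PCRE feature, so this sketch swaps in preg_replace() rather than eregi_replace(). The helper name link_phrase_regex and the glossary.php?w= link format are assumptions for illustration, and it is only checked against the five "too difficult" sentences from earlier in the thread:

```php
<?php
// Hypothetical helper: link every whole-word occurrence of $phrase, ignoring case.
function link_phrase_regex($text, $phrase) {
    // preg_quote() escapes any regex metacharacters inside the phrase itself.
    $pattern = '/\b' . preg_quote($phrase, '/') . '\b/i';
    // $0 is the matched text, so the capitalization in the document is kept.
    $replacement = '<a href="glossary.php?w=' . urlencode($phrase) . '">$0</a>';
    return preg_replace($pattern, $replacement, $text);
}

$samples = array(
    'this mongol person is being too difficult and we dont want to help',
    'this mongol person is being too difficult, and we dont want to help',
    'this mongol person is being too difficulty',
    'this mongol person is being too difficult. we dont want to help',
    'this mongol person is being too difficult.we dont want to help',
);
foreach ($samples as $sample) {
    echo link_phrase_regex($sample, 'too difficult'), "\n";
}
```

Sentences 1, 2, 4 and 5 get a link. Sentence 3 does not, because there is no word boundary between "difficult" and the "y" that follows it.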
Posted: Fri Oct 29, 2004 5:30 am
by rehfeld
ok, try and break this so we can get the bugs out of it.
works pretty well, and i think it's somewhat expandable too
and yeah, i did all this cause i'm not good at regex lol
Code: Select all
<?php
$ignore = array(
' ',
'.',
',',
'(',
')',
'*',
'"',
"'",
"\r",
"\n",
"\t"
);
$phrases = array(
'Arbys Roast beef',
'Roast beef',
'i',
'ice cream',
'cookies'
);
$document = '
today i was hungry so i went to arbys
and ate roast beef. arbys roAst Beef is good.
I also like cookies, and ice cream.
but i really really like "roast beef"
mmmmm......arbys roast beef.....
so thats what i like(roast beef, that is...)
';
foreach ($phrases as $phrase) {
    $doc_pointer = 0;
    while (false !== ($start_pos = stripos($document, $phrase, $doc_pointer))) {
        $phrase_len = strlen($phrase);
        $end_pos = $start_pos + $phrase_len;
        $char_before_phrase = $document[$start_pos - 1]; // character directly before phrase begins
        $char_after_phrase = $document[$end_pos]; // character directly after phrase ends
        $char2_after_phrase = $document[$end_pos + 1]; // 2nd character after phrase ends
        if (in_array($char_before_phrase, $ignore) // it's a match if char_before and char_after are allowed to be ignored
            && in_array($char_after_phrase, $ignore)
            && $char2_after_phrase !== '>') { // second char after cannot be > (this prevents re-matching inside links)
            $replacement = "<a href=\"glossary.php?phrase=$phrase\">$phrase</a>";
            $document_before_phrase = substr($document, 0, $start_pos); // get document up to start of phrase
            $document_after_phrase = substr($document, $end_pos); // get rest of doc after the phrase
            $document = $document_before_phrase . $replacement . $document_after_phrase; // we rebuild the entire doc every time we make a change
            $doc_pointer = $start_pos + strlen($replacement); // advance the offset so we don't keep modifying the same part of the document
        } else {
            $doc_pointer = $end_pos; // even if no match, we still need to advance the pointer, otherwise we get caught in an endless while loop
        }
    }
}
echo $document;
/*
something i noticed right away is that if you have 2 phrases with the same word in them,
whichever phrase is checked first will get the match. That's not what we want.
Solution is to make sure you sort the array of phrases so that the longest phrases are checked first.
for example, if you had these 2 phrases, in this order
'roast beef'
'arbys roast beef'
if you check for 'roast beef' first, it will match it, and then since the
document will be modified with html replacing the phrase, it will not be possible
to match 'arbys roast beef' anymore, cause we now have arbys <a href="glossary.php?phrase=roast beef">roast beef</a> ..
but if you check for the longer phrases first, it will match properly, matching every
'arbys roast beef' where possible, and then if there's any lone 'roast beef' left, it will then match that as well
current recommendation is to hand edit your array of phrases
maybe someone can write a function to sort the array descending based on strlen()
that should work, in most cases.
the capitalization you enter in the phrases array is what will be output when matched,
so if you want caps, make sure to use them in the phrases array
*/
?>
here's the html output, view it in a browser
Code: Select all
today <a href="glossary.php?phrase=i">i</a> was hungry so <a href="glossary.php?phrase=i">i</a> went to arbys
and ate <a href="glossary.php?phrase=Roast beef">Roast beef</a>. <a href="glossary.php?phrase=Arbys Roast beef">Arbys Roast beef</a> is good.
<a href="glossary.php?phrase=i">i</a> also like <a href="glossary.php?phrase=cookies">cookies</a>, and <a href="glossary.php?phrase=ice cream">ice cream</a>.
but <a href="glossary.php?phrase=i">i</a> really really like "<a href="glossary.php?phrase=Roast beef">Roast beef</a>"
mmmmm......<a href="glossary.php?phrase=Arbys Roast beef">Arbys Roast beef</a>.....
so thats what <a href="glossary.php?phrase=i">i</a> like(<a href="glossary.php?phrase=Roast beef">Roast beef</a>, that is...)
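For the sorting mentioned in the code comment above, a minimal sketch using usort() with a length comparison. The function name by_length_desc is made up for illustration, and the phrases are the sample ones from the code, deliberately put in a bad order first:

```php
<?php
// Hypothetical comparison function: longer strings sort first.
function by_length_desc($a, $b) {
    return strlen($b) - strlen($a);
}

$phrases = array(
    'Roast beef',
    'i',
    'Arbys Roast beef',
    'ice cream',
    'cookies'
);

// Sort so that 'Arbys Roast beef' is checked before 'Roast beef'.
usort($phrases, 'by_length_desc');

print_r($phrases);
```

After the sort the order is 'Arbys Roast beef', 'Roast beef', 'ice cream', 'cookies', 'i', so the longer phrase gets first shot at the text.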
Posted: Fri Oct 29, 2004 7:59 am
by mongol
I'll give it a go. Thank you very, very much.
N.B. I didn't think you were going to give up, I just thought it was too tricky. Let me see if you have got a universal solution to this problem.
Thanks again
Posted: Fri Oct 29, 2004 8:26 am
by mongol
This looks good, but my computer does not have php5 and neither does the server I'm using. I can't change what php my server has installed, so I'm kinda stuck.
In fact this is a major pain, because now you seem to have the solution and I can't have it!!!
ARRRRGH!
It's like looking through the bakery window at a big cream cake, but the shop is closed!!!
Posted: Fri Oct 29, 2004 12:10 pm
by rehfeld
replace
Code: Select all
while (false !== ($start_pos = stripos($document, $phrase, $doc_pointer)))
with
Code: Select all
while (false !== ($start_pos = strpos(strtolower($document), strtolower($phrase), $doc_pointer)))
i think that should work in the later php 4.x's
and yeah i wish more hosts would go to php5
i think your idea earlier about adding whitespace still might be better though
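If you'd rather not touch the while line itself, the same swap can be tucked into a small wrapper. This is only a sketch: stripos_compat is a made-up name, and it falls back to the strtolower() trick above when stripos() is not available:

```php
<?php
// Hypothetical wrapper: case-insensitive strpos() that also works without stripos().
function stripos_compat($haystack, $needle, $offset = 0) {
    if (function_exists('stripos')) {
        return stripos($haystack, $needle, $offset);
    }
    // Fallback: lowercase both sides, then do a normal case-sensitive search.
    return strpos(strtolower($haystack), strtolower($needle), $offset);
}

$pos = stripos_compat('arbys roAst Beef is good', 'Roast beef');
var_dump($pos);
```

The loop condition would then read while (false !== ($start_pos = stripos_compat($document, $phrase, $doc_pointer))).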
Posted: Fri Oct 29, 2004 1:12 pm
by mongol
FANTASTIC!!!
REHFELD IS THE BOMB!!!!
I wish I had my own "X-Prize" to give away, I think you would have won it.
The only thing we need to do is add a space at the front and back of the text being parsed, just in case the glossary term is at the beginning or end of the passage.
Remember to always serve the function the glossary terms with the longest terms first and Bob's your Uncle!!!
Thanks to everyone who helped and especially to Rehfeld who won the X-Prize.
Mongol
Posted: Fri Oct 29, 2004 7:26 pm
by rehfeld
i think just adding an empty entry
'',
to the ignore array should allow it to properly match words at the beginning and end
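One way to make the empty-entry trick explicit, instead of relying on whatever an out-of-range string offset happens to return (negative offsets in particular may not behave the same on every PHP version), is a tiny lookup helper. The name char_at is made up for this sketch, and the ignore list is shortened:

```php
<?php
// Hypothetical helper: the character at $pos, or '' when $pos falls outside the string.
function char_at($str, $pos) {
    if ($pos < 0 || $pos >= strlen($str)) {
        return '';
    }
    return $str[$pos];
}

$ignore    = array('', ' ', '.', ','); // '' stands for "start or end of the text"
$document  = 'roast beef';
$phrase    = 'roast beef';
$start_pos = 0;
$end_pos   = $start_pos + strlen($phrase);

$char_before_phrase = char_at($document, $start_pos - 1); // '' because we are at the start
$char_after_phrase  = char_at($document, $end_pos);       // '' because we are at the end

// Strict comparison so '' only matches the '' entry.
$is_match = in_array($char_before_phrase, $ignore, true)
         && in_array($char_after_phrase, $ignore, true);
var_dump($is_match);
```

In the full script the three $document[...] lookups would go through char_at() instead.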
it just occurred to me that this will not match a phrase like this
Code: Select all
phrase = 'roast beef'
text = 'roast
beef'
or
text = 'roast    beef'
neither of those will match properly
but that could be fixed by collapsing the unneeded whitespace
i'll put something together for that soon.
if you find anything else let me know, i'll see what i can do about it
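A minimal sketch of the whitespace fix, assuming it is fine to run it over the whole text before matching. The helper name collapse_whitespace is made up for illustration:

```php
<?php
// Hypothetical helper: turn every run of spaces, tabs and newlines into a single space.
function collapse_whitespace($text) {
    return preg_replace('/\s+/', ' ', $text);
}

echo collapse_whitespace("roast\nbeef"), "\n";
echo collapse_whitespace('roast    beef'), "\n";
```

Both lines come out as 'roast beef'. The catch is that this also throws away the line breaks in the document, so it only suits text where the original layout doesn't matter.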