Stop words with fulltext search
Moderator: General Moderators
Hi guys,
Yesterday I asked about a fulltext search problem. Fortunately, some people helped me solve it, but now I face another problem. (I mentioned it in my previous post; I thought I would be able to find a solution myself, but that was not the case.) I am talking about stop words. I was told yesterday not to use them, but I am developing a news portal in which users are allowed to submit their own posts. After reading the list of all stop words provided by MySQL, I got very frustrated.
The problem is that if a user submits the title "One Minister", it contains "One", which is a stop word, so when the search engine looks for matches the results will not be very accurate. This led me to the idea of creating a second table that holds word equivalents and is used to replace, for example, "One" with a substitute. (There are some options: a scrambled spelling, a numeric representation, or a cryptic value.)
This is what I mean:
Real table:
id|news|user_id
1|One Minister|3
Other table:
real|unreal
One|EnO or 15,14,5 or cryptic value
I am unsure whether this is the best solution, which is why I am writing here. Why MySQL introduced these words in the first place is another thing I find strange.
Last edited by user___ on Thu Feb 01, 2007 1:45 pm, edited 2 times in total.
Because they are so common that they would turn up too many results. Stopwords are actually quite helpful. If you searched for "the book of apples" in a book database, the search would turn up a lot of records that aren't relevant, because "the" and "of" are such common words. You may also wish to add "book" to your list of stop words, since it is also likely to appear in a lot of rows.
Read this...
http://dev.mysql.com/doc/refman/5.0/en/ ... uning.html
Also, read all about full text on MySQL's site:
http://dev.mysql.com/doc/refman/5.0/en/ ... earch.html
You'd be surprised how much you learn by reading
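To make the point about stopwords concrete, here is a minimal Python sketch of what a stopword-aware index keeps from a piece of text. It is not MySQL's code: the `STOPWORDS` set is a small illustrative subset chosen for this thread's examples, and `indexable_terms` is a hypothetical helper. The minimum word length of 4 is the default value of MySQL's `ft_min_word_len`.

```python
import re

# Illustrative subset only (an assumption for this sketch);
# MySQL's built-in stopword list is much longer.
STOPWORDS = {"the", "of", "one", "tell", "you", "about", "have"}

# Default value of MySQL's ft_min_word_len; shorter words are not indexed.
MIN_WORD_LEN = 4


def indexable_terms(text, stopwords=STOPWORDS, min_len=MIN_WORD_LEN):
    """Return the lowercase words that survive stopword and length filtering."""
    words = re.findall(r"[A-Za-z0-9_']+", text.lower())
    return [w for w in words if len(w) >= min_len and w not in stopwords]


print(indexable_terms("the book of apples"))     # ['book', 'apples']
print(indexable_terms("One Minister"))           # ['minister']
print(indexable_terms("tell you about apples"))  # ['apples']
```

Under these assumptions a title such as "One Minister" is still findable through "minister"; only the stop word itself contributes nothing to the match.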
Reply
I do see your point, but let's imagine a record in a table that reads "I have a lot to tell you about apples", and a user types "tell you about apples": there is no result. That is what my question was about, and I would like to know how this problem can be solved.
Do you have direct access to the server and its files? If so, just modify the stopword file (after making a backup, of course) to remove the words you don't want to be treated as stopwords.
As ~Ninja mentioned though, be mindful of the number of results you could get if you mess with the default stopwords.
Real programmers don't comment their code. If it was hard to write, it should be hard to understand.
Reply
Unfortunately, I do not have access to the server files (if I had, I would not have bothered you). I know about the fifty percent rule and I try to stay within it. I still wonder whether this problem can be solved with a custom script.
If you can't edit server variables, then maybe you can first run a query that changes the appropriate variable at run time. I know you can set replication variables and the like in a query; maybe you could also do that for the variables relevant to stop words.
Another option would be to run your query IN BOOLEAN MODE, then wrap the search string in double quotes. However, that will search for the entire string as it was typed, rather than for each component word.
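The boolean-mode phrase search suggested above could be built from Python roughly as follows. This is a sketch under assumptions: the table name `news_items` is hypothetical (the thread only shows the columns `id`, `news` and `user_id`), `build_phrase_search` is an illustrative helper, and `%s` is the placeholder style used by common Python MySQL drivers. The table and column names must come from trusted code, never from user input, because they are interpolated into the SQL text.

```python
def build_phrase_search(phrase, table="news_items", column="news"):
    """Return (sql, params) for an exact-phrase search IN BOOLEAN MODE.

    `table` and `column` are hypothetical names; pass only trusted values.
    """
    # Drop embedded double quotes so the user cannot break out of the
    # phrase, then collapse runs of whitespace.
    cleaned = " ".join(phrase.replace('"', " ").split())
    sql = (
        f"SELECT id, {column} FROM {table} "
        f"WHERE MATCH({column}) AGAINST (%s IN BOOLEAN MODE)"
    )
    # Wrapping the whole string in double quotes makes it a phrase search.
    return sql, ('"' + cleaned + '"',)


sql, params = build_phrase_search("tell you about apples")
print(sql)
print(params)  # ('"tell you about apples"',)
# With a DB-API cursor this would be run as: cursor.execute(sql, params)
```

The MySQL manual also states that boolean-mode searches do not apply the 50% threshold, which may matter when a word appears in most of the rows.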
Reply
I did it. I made it work. I tried it on another server and removed some of the records (because of the fifty percent rule; by the way, is there a way to run a search for a word that occurs in more than fifty percent of the rows? If there is, can anyone tell me?). Otherwise everything works fine.
Thank you guys.