Stop words with fulltext search

user___
Forum Contributor
Posts: 297
Joined: Tue Dec 05, 2006 3:05 pm

Post by user___ »

Hi guys,
Yesterday I asked about a fulltext search problem. Fortunately, some people helped me solve it, but now I face another problem. (It was mentioned in my previous post, and although I thought I would be able to find a solution, that was not the case.) I am talking about stop words. I was told yesterday not to use them, but I am developing a news portal in which users are allowed to submit their own posts. After reading the list of stop words provided by MySQL, I got very frustrated.

The problem is that if a user submits a title such as "One Minister", it contains "One", which is a stop word, so when the search engine looks for matches the results will not be very accurate. This led me to the idea of creating a second table that maps each such word to an equivalent and is used to replace, for example, "One" with a substitute. (There are some options: a scrambled spelling, a numeric representation, or a cryptic value.)

This is what I mean:

Real table:
id|news|user_id
1|One Minister|3

Other table
real|unreal
One|EnO or 15,14,5 or cryptic value

I am unsure whether this is the best solution, which is why I am writing here. Why MySQL introduced these words at all is another thing that puzzles me.
Last edited by user___ on Thu Feb 01, 2007 1:45 pm, edited 2 times in total.
Luke
The Ninja Space Mod
Posts: 6424
Joined: Fri Aug 05, 2005 1:53 pm
Location: Paradise, CA

Post by Luke »

MySQL introduced them because they are so common that they would turn up too many results. Stopwords are actually quite helpful. If you did a search for "the book of apples" in a book database, the search would turn up a lot of records that aren't relevant because "the" and "of" are such common words. You may also wish to add "book" to your list of stop words since it is also likely to be in a lot of rows.
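To make the effect concrete, here is a rough Python sketch of stop-word filtering. It is only an illustration, not MySQL's actual implementation: the `STOP_WORDS` set is a tiny hypothetical sample (the real built-in list is much longer) and `significant_terms` is a made-up helper name.

```python
# Hypothetical, abbreviated stop list for illustration only;
# MySQL's built-in list is much longer.
STOP_WORDS = {"the", "of", "a", "one"}


def significant_terms(query, stop_words=STOP_WORDS):
    """Return the lowercased query words that are not stop words."""
    return [word for word in query.lower().split() if word not in stop_words]


print(significant_terms("the book of apples"))
print(significant_terms("One Minister"))
```

With this sample list, only "book" and "apples" survive from the first query, and only "minister" from the second, which is roughly why a title like "One Minister" is effectively searched by "Minister" alone.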

Read this...
http://dev.mysql.com/doc/refman/5.0/en/ ... uning.html

Also, read all about full text on MySQL's site:
http://dev.mysql.com/doc/refman/5.0/en/ ... earch.html

You'd be surprised how much you learn by reading :)
user___
Forum Contributor
Posts: 297
Joined: Tue Dec 05, 2006 3:05 pm

Reply

Post by user___ »

I do see your point, but let's imagine a record in a table that reads "I have a lot to tell you about apples", and a user types "tell you about apples": there is no result. That is what my question was about, and how this problem can be solved.
pickle
Briney Mod
Posts: 6445
Joined: Mon Jan 19, 2004 6:11 pm
Location: 53.01N x 112.48W

Post by pickle »

Do you have direct access to the server and its files? If so, just modify (after making a backup, of course) the stopword file to remove the words you don't want to be considered stopwords.

As ~Ninja mentioned though, be mindful of the number of results you could get if you mess with the default stopwords.
Real programmers don't comment their code. If it was hard to write, it should be hard to understand.
Luke
The Ninja Space Mod
Posts: 6424
Joined: Fri Aug 05, 2005 1:53 pm
Location: Paradise, CA

Post by Luke »

I just did that and it found it just fine. Same exact search terms. Remember that if more than 50% of rows match, nothing is returned... so make sure you've got enough (diverse) test data in your database for the search to work properly.
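A small Python sketch of that 50% rule, under the assumption that natural-language mode ignores any word present in at least half of the rows. The function name `too_common_terms` and the sample rows are made up for illustration, and this only mimics the idea rather than MySQL's real index logic.

```python
def too_common_terms(rows, threshold=0.5):
    """Return the words that occur in at least `threshold` of the rows."""
    counts = {}
    for row in rows:
        # Count each word once per row, however often it repeats in that row.
        for word in set(row.lower().split()):
            counts[word] = counts.get(word, 0) + 1
    return {word for word, n in counts.items() if n / len(rows) >= threshold}


rows = [
    "apples are sweet",
    "green apples",
    "red pears",
]
print(too_common_terms(rows))  # "apples" is in 2 of the 3 rows
```

In a tiny test table like this one, a search for "apples" would be treated as too common to match, which is why a handful of similar test rows can make a working query look broken.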
user___
Forum Contributor
Posts: 297
Joined: Tue Dec 05, 2006 3:05 pm

Reply

Post by user___ »

Unfortunately, I do not have access to the server files (if I had, I would not have bothered you). I know about the percentage rule, and I try to keep matches under fifty percent of the rows. I still wonder whether this problem can be solved with a custom script.
pickle
Briney Mod
Posts: 6445
Joined: Mon Jan 19, 2004 6:11 pm
Location: 53.01N x 112.48W

Post by pickle »

If you can't edit server variables, then maybe you can first run a query that changes the appropriate variable at run time. I know you can set replication variables and the like in a query; maybe you could also do that for the variables relevant to stop words.

Another option would be to run your query IN BOOLEAN MODE, then wrap the search string in double quotes. However, that will search for the entire string as it was typed, rather than each component word.
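For the boolean-mode suggestion, a minimal sketch of building such a query from Python might look like the following. The table name `news_items` is hypothetical (only the `news` column appears earlier in the thread), the helper name is made up, and the `%s` placeholder assumes a driver that uses that parameter style, such as PyMySQL or MySQL Connector/Python.

```python
def phrase_search_query(phrase, table="news_items", column="news"):
    """Build a parameterized boolean-mode phrase search.

    `table` and `column` are hypothetical names. They are interpolated
    directly into the SQL, so never pass user input for them.
    """
    # Replace embedded double quotes so the phrase cannot close itself early.
    cleaned = phrase.replace('"', " ").strip()
    sql = (
        f"SELECT * FROM {table} "
        f"WHERE MATCH({column}) AGAINST (%s IN BOOLEAN MODE)"
    )
    # The double quotes are part of the search string, not SQL quoting.
    return sql, (f'"{cleaned}"',)


sql, params = phrase_search_query("tell you about apples")
print(sql)
print(params)
```

The pair would then be passed to something like `cursor.execute(sql, params)`, so the user's text travels as a bound parameter rather than being pasted into the SQL.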
user___
Forum Contributor
Posts: 297
Joined: Tue Dec 05, 2006 3:05 pm

Reply

Post by user___ »

Thank you guys, but my work day is over and I have to go. I will try what you suggested tomorrow, pickle, and tell you how it went.
Luke
The Ninja Space Mod
Posts: 6424
Joined: Fri Aug 05, 2005 1:53 pm
Location: Paradise, CA

Post by Luke »

I'm not sure what the problem is. Like I said, I tried that search and came up with results no problem.
user___
Forum Contributor
Posts: 297
Joined: Tue Dec 05, 2006 3:05 pm

Reply

Post by user___ »

I did it. I made it work. I tried it on another server and removed some of the records (because of the fifty percent rule; by the way, is there a way to do a search in which more than fifty percent of the rows match? If there is, can anyone tell me?). Otherwise everything works fine.

Thank you guys.
pickle
Briney Mod
Posts: 6445
Joined: Mon Jan 19, 2004 6:11 pm
Location: 53.01N x 112.48W

Post by pickle »

Boolean mode doesn't use the 50% restriction.
user___
Forum Contributor
Posts: 297
Joined: Tue Dec 05, 2006 3:05 pm

Reply

Post by user___ »

Thank you pickle.