Page 1 of 2

Implementing keyword comparison scheme (reverse search)

Posted: Fri Jan 02, 2009 5:40 pm
by Eran
Consider the following:

I have a constantly growing database of keywords. I need to parse incoming text inputs (articles, feeds etc) and find which keywords from the database are present in the text. I'm operating under the assumption that after a while the database of keywords will be much larger than the text.

Since the database is constantly growing (users add more and more keywords to watch for), I figure the best option will be to break the text input into words and compare those against the database. My main dilemma is implementing this comparison scheme (PHP and MySQL will be used for this project).

The most naive implementation would be to create a simple SELECT query against the keywords table, with a giant IN clause listing all the found keywords.

Code: Select all

SELECT user_id,keyword FROM keywords WHERE keyword IN ('keyword1','keyword2',...,'keywordN');
Another approach would be to create a hash-table in memory (using something like memcache) and to check against it in the same manner.

Does anyone has any experience with this kind of searching and has any suggestions on how to better implement this? I haven't tried yet any of those approaches, I'm just gathering ideas at this point.

Re: Implementing keyword comparison scheme (reverse search)

Posted: Fri Jan 02, 2009 5:49 pm
by VladSun
Have you tried MySQL FULL TEXT search capabilities? Maybe you don't need your own "keyword-matching" algorithm?

Re: Implementing keyword comparison scheme (reverse search)

Posted: Fri Jan 02, 2009 6:03 pm
by Eran
No, this isn't full-text and there is no algorithm. I'm not searching for words in a text, I'm breaking a small text into words and performing exact matches against a very large table. I need ideas for performing this matching in the most performant way possible.

Re: Implementing keyword comparison scheme (reverse search)

Posted: Fri Jan 02, 2009 6:47 pm
by VladSun
Maybe you can give us some details about it - your DB schema, user requirements, user input format, etc.

Re: Implementing keyword comparison scheme (reverse search)

Posted: Fri Jan 02, 2009 6:59 pm
by Eran
I'm building a service that receives text inputs from different sources (RSS, web services, etc), and needs to send alerts to users whose keywords appear in the text. At the very basic we have:

users
- id
- email
- name
- other details

users_to_keywords
- user_id
- keyword

Bear in mind that the users_to_keywords can grow to very large proportions (assuming the service is a success...). The size of the articles varies, but I assume that most of the articles will be 1000-5000 characters (or several hundred words). For this reason I assumed it would be most efficient to tokenize the document instead of performing a text search on it with all the keywords in the database (which could be 100's of thousands of keywords or more). So I will be breaking documents into words, removing repetitions and grammar words (such as 'the', 'a' etc.).

In addition, I thought of creating a table in memory that holds just the keywords with no repetitions, where the keyword is a unique index:

keywords
- keyword (PK)

And query against it with a giant IN clause (several hundred to several thousand terms).

Alternatively, having this reduced keyword table as a memcached array might perform better. Those are the options I considered thus far. Am I approaching this wrong? does anybody has better suggestions on how to implement such a system?

Re: Implementing keyword comparison scheme (reverse search)

Posted: Fri Jan 02, 2009 7:21 pm
by VladSun
Maybe you need this schema:
user - id, name, email, etc.
keyword - id, text
user_keyword - FK_user_id, FK_keword_id

Then create a temporary table for all the keywords you have parsed - temp_keywords, and create an index on it.

Then the query will be something like:
[sql]SELECT     *FROM    keywordINNER JOIN     temp_keyword ON temp_keyword.text = keyword.textINNER JOIN    user_keyword ON user_keyword.FK_keyword_id=keyword.idINNER JOIN    user ON user_keyword.FK_user_id=user.idGROUP BY    user.id [/sql]

I think, the slowest operation is GROUP BY. This way you minimize the number of string matching (i.e. long key for indexes)

Re: Implementing keyword comparison scheme (reverse search)

Posted: Fri Jan 02, 2009 7:25 pm
by Eran
So you are saying creating a temporary table from all the keywords I obtained from the document? how performant is doing that and then querying as opposed to running a query directly with a giant IN clause?

I don't have much experience with creating temporary tables on the fly, so I'm clueless to the performance of this approach.

Re: Implementing keyword comparison scheme (reverse search)

Posted: Fri Jan 02, 2009 7:39 pm
by VladSun
http://dev.mysql.com/doc/refman/5.1/en/ ... unction_in
If all values are constants, they are evaluated according to the type of expr and sorted. The search for the item then is done using a binary search. This means IN is very quick if the IN value list consists entirely of constants.
Binary search is fast, but I think hashing is faster. You should try both solutions and decide which is better.

Re: Implementing keyword comparison scheme (reverse search)

Posted: Fri Jan 02, 2009 7:41 pm
by Eran
I will. Thanks :)

Re: Implementing keyword comparison scheme (reverse search)

Posted: Fri Jan 02, 2009 9:23 pm
by Christopher
Maybe an in-memory hash letter-by-letter of all the words that only gives if they exist or not. If you got a hit you could then go out to the database for any info on the terms. Essentially putting a fast hash front-end check on your system.

Re: Implementing keyword comparison scheme (reverse search)

Posted: Fri Jan 02, 2009 10:02 pm
by Eran
How would you go about implementing such a hash? would it be an array that has only single characters as indexes and each index is another array of character indexes?

Also, this requires me to analyze the keywords I generated from the article in greater detail (breaking them down into a hash of characters). Do you have an idea on how go about that?

Re: Implementing keyword comparison scheme (reverse search)

Posted: Sat Jan 03, 2009 12:42 am
by Christopher

Re: Implementing keyword comparison scheme (reverse search)

Posted: Sat Jan 03, 2009 1:02 am
by Eran
Basically implement an index engine? wouldn't it be better to use a database?

I was thinking either a database approach as I discussed with vlad, or a memcache hash array. Any other/better ideas?

Re: Implementing keyword comparison scheme (reverse search)

Posted: Thu Jan 08, 2009 4:45 pm
by Benjamin
Honestly I would go with your first thought and use the IN clause with a list of words from the article. I have done this in the past and never had a problem with it. Should it start to slow down a little bit, you could look at tuning the MySQL server rather than changing the implementation.

Re: Implementing keyword comparison scheme (reverse search)

Posted: Thu Jan 08, 2009 4:49 pm
by Eran
Yep, I went with that approach for now. I abstracted the process, so should the need arise later I will make the necessary adjustments.