Implementing keyword comparison scheme (reverse search)

Not for 'how-to' coding questions but PHP theory instead, this forum is here for those of us who wish to learn about design aspects of programming with PHP.

Moderator: General Moderators

User avatar
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Implementing keyword comparison scheme (reverse search)

Post by Eran »

Consider the following:

I have a constantly growing database of keywords. I need to parse incoming text inputs (articles, feeds etc) and find which keywords from the database are present in the text. I'm operating under the assumption that after a while the database of keywords will be much larger than the text.

Since the database is constantly growing (users add more and more keywords to watch for), I figure the best option will be to break the text input into words and compare those against the database. My main dilemma is implementing this comparison scheme (PHP and MySQL will be used for this project).

The most naive implementation would be to create a simple SELECT query against the keywords table, with a giant IN clause listing all the found keywords.

Code: Select all

SELECT user_id,keyword FROM keywords WHERE keyword IN ('keyword1','keyword2',...,'keywordN');
Another approach would be to create a hash-table in memory (using something like memcache) and to check against it in the same manner.

Does anyone has any experience with this kind of searching and has any suggestions on how to better implement this? I haven't tried yet any of those approaches, I'm just gathering ideas at this point.
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Implementing keyword comparison scheme (reverse search)

Post by VladSun »

Have you tried MySQL FULL TEXT search capabilities? Maybe you don't need your own "keyword-matching" algorithm?
There are 10 types of people in this world, those who understand binary and those who don't
User avatar
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: Implementing keyword comparison scheme (reverse search)

Post by Eran »

No, this isn't full-text and there is no algorithm. I'm not searching for words in a text, I'm breaking a small text into words and performing exact matches against a very large table. I need ideas for performing this matching in the most performant way possible.
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Implementing keyword comparison scheme (reverse search)

Post by VladSun »

Maybe you can give us some details about it - your DB schema, user requirements, user input format, etc.
There are 10 types of people in this world, those who understand binary and those who don't
User avatar
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: Implementing keyword comparison scheme (reverse search)

Post by Eran »

I'm building a service that receives text inputs from different sources (RSS, web services, etc), and needs to send alerts to users whose keywords appear in the text. At the very basic we have:

users
- id
- email
- name
- other details

users_to_keywords
- user_id
- keyword

Bear in mind that the users_to_keywords can grow to very large proportions (assuming the service is a success...). The size of the articles varies, but I assume that most of the articles will be 1000-5000 characters (or several hundred words). For this reason I assumed it would be most efficient to tokenize the document instead of performing a text search on it with all the keywords in the database (which could be 100's of thousands of keywords or more). So I will be breaking documents into words, removing repetitions and grammar words (such as 'the', 'a' etc.).

In addition, I thought of creating a table in memory that holds just the keywords with no repetitions, where the keyword is a unique index:

keywords
- keyword (PK)

And query against it with a giant IN clause (several hundred to several thousand terms).

Alternatively, having this reduced keyword table as a memcached array might perform better. Those are the options I considered thus far. Am I approaching this wrong? does anybody has better suggestions on how to implement such a system?
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Implementing keyword comparison scheme (reverse search)

Post by VladSun »

Maybe you need this schema:
user - id, name, email, etc.
keyword - id, text
user_keyword - FK_user_id, FK_keword_id

Then create a temporary table for all the keywords you have parsed - temp_keywords, and create an index on it.

Then the query will be something like:
[sql]SELECT     *FROM    keywordINNER JOIN     temp_keyword ON temp_keyword.text = keyword.textINNER JOIN    user_keyword ON user_keyword.FK_keyword_id=keyword.idINNER JOIN    user ON user_keyword.FK_user_id=user.idGROUP BY    user.id [/sql]

I think, the slowest operation is GROUP BY. This way you minimize the number of string matching (i.e. long key for indexes)
There are 10 types of people in this world, those who understand binary and those who don't
User avatar
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: Implementing keyword comparison scheme (reverse search)

Post by Eran »

So you are saying creating a temporary table from all the keywords I obtained from the document? how performant is doing that and then querying as opposed to running a query directly with a giant IN clause?

I don't have much experience with creating temporary tables on the fly, so I'm clueless to the performance of this approach.
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Implementing keyword comparison scheme (reverse search)

Post by VladSun »

http://dev.mysql.com/doc/refman/5.1/en/ ... unction_in
If all values are constants, they are evaluated according to the type of expr and sorted. The search for the item then is done using a binary search. This means IN is very quick if the IN value list consists entirely of constants.
Binary search is fast, but I think hashing is faster. You should try both solutions and decide which is better.
There are 10 types of people in this world, those who understand binary and those who don't
User avatar
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: Implementing keyword comparison scheme (reverse search)

Post by Eran »

I will. Thanks :)
User avatar
Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Re: Implementing keyword comparison scheme (reverse search)

Post by Christopher »

Maybe an in-memory hash letter-by-letter of all the words that only gives if they exist or not. If you got a hit you could then go out to the database for any info on the terms. Essentially putting a fast hash front-end check on your system.
(#10850)
User avatar
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: Implementing keyword comparison scheme (reverse search)

Post by Eran »

How would you go about implementing such a hash? would it be an array that has only single characters as indexes and each index is another array of character indexes?

Also, this requires me to analyze the keywords I generated from the article in greater detail (breaking them down into a hash of characters). Do you have an idea on how go about that?
User avatar
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: Implementing keyword comparison scheme (reverse search)

Post by Eran »

Basically implement an index engine? wouldn't it be better to use a database?

I was thinking either a database approach as I discussed with vlad, or a memcache hash array. Any other/better ideas?
User avatar
Benjamin
Site Administrator
Posts: 6935
Joined: Sun May 19, 2002 10:24 pm

Re: Implementing keyword comparison scheme (reverse search)

Post by Benjamin »

Honestly I would go with your first thought and use the IN clause with a list of words from the article. I have done this in the past and never had a problem with it. Should it start to slow down a little bit, you could look at tuning the MySQL server rather than changing the implementation.
User avatar
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: Implementing keyword comparison scheme (reverse search)

Post by Eran »

Yep, I went with that approach for now. I abstracted the process, so should the need arise later I will make the necessary adjustments.
Post Reply