Writing something to parse natural language requests.
Posted: Tue Jul 07, 2009 1:27 pm
by onion2k
I'm writing a little script that returns data based on a specific request format. At the moment it receives a string of...
... and it returns a fact about <thing>.
I'd rather like to make it a bit 'looser'. For example, I want to accept strings like "<username> I love <thing>", "<username> I really like <thing>", "<username> I like <thing> loads!". I can't really do a strpos() for every possible "<thing>" because there are currently over 110,000 of them. I suppose I could sit here, think up every format I can, and then write a regexp for each of them, but that's pretty nasty too.
Has anyone here written some sort of a parser for this sort of thing? How does one go about it?
EDIT: <thing> can be several words long. I think that makes a difference.
Re: Writing something to parse natural language requests.
Posted: Tue Jul 07, 2009 2:43 pm
by alex.barylski
All I can really tell you is that this is extremely complex stuff.
I assume you have Googled:
http://www.google.ca/search?hl=en&rlz=1 ... sing&meta=
"Natural Language Processing" is the key phrase here.
It's not easy, and I don't know of any existing libraries that can help. Unlike programming languages such as PHP, natural languages have a much more flexible, virtually limitless grammar, which makes processing them very difficult.
English, unlike PHP, evolved over time, whereas PHP's grammar was planned and understood from the get-go.
I suppose one could compile a database of all the caveats, etc. of English, but I doubt that is realistic. So I believe the field of NLP is basically about taking educated/calculated guesses; if the field were mastered, Google would be answering your questions perfectly.
My suggestion would be to consider something simpler, like maybe soundex, or to download an English thesaurus to perform lookups on similar words. You can usually exclude any words shorter than 3 characters.
Split the sentence up into words, drop those shorter than 3 characters, then iterate over the array and soundex each word or compare it to some dictionary source until you find something interesting.
Anything much beyond this is very theoretical and long-winded, not to mention difficult to comprehend.
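The split / filter / soundex idea above could look roughly like the sketch below. It is written in Python only to keep it self-contained (PHP has soundex() built in; here a simplified American Soundex is hand-rolled, and it may differ from PHP's in edge cases). The names `soundex`, `interesting_words` and the sample vocabulary are made up for illustration.

```python
import re

# Simplified American Soundex: first letter plus up to three digits.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so repeated codes count again
    return (result + "000")[:4]


def interesting_words(sentence, vocabulary):
    """Return (word, known_word) pairs whose Soundex codes match."""
    index = {}
    for known in vocabulary:
        index.setdefault(soundex(known), []).append(known)
    hits = []
    for word in re.findall(r"[A-Za-z']+", sentence):
        if len(word) < 3:  # drop very short words, as suggested above
            continue
        for known in index.get(soundex(word), []):
            hits.append((word, known))
    return hits


# Hypothetical vocabulary; a misspelt request still finds its targets.
print(interesting_words("Bob, I realy like Del computors",
                        ["computer", "really", "Dell"]))
```

Soundex is deliberately fuzzy, so unrelated words with the same code will also match; treat the hits as candidates to check, not as answers.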
Re: Writing something to parse natural language requests.
Posted: Tue Jul 07, 2009 2:58 pm
by Weirdan
Does that mean you'd like to parse even such requests as ' "I like it" said <username>, speaking about the <thing>'? Are those <things> nouns? Do you know all the verbs you'd like to detect?
Re: Writing something to parse natural language requests.
Posted: Tue Jul 07, 2009 7:31 pm
by omniuni
Hm. Brainstorming...
Given a list of possible <thing>s, and <negative>s, I'd start by creating a set of functions to test for language patterns. For example, if I had an array of $likePhrases that included options such as "*i*!<negative>*like*<thing>*", "*<thing>*is*i*!<negative>like*", etc., I should be able to recognize from the sentence structure whether a person likes something, and return a fact, or if it has the negative, I could say "I'm sorry you don't like <thing>." As an interesting side note, where it becomes difficult is with things like "I don't hate", which would, in this syntax, be represented as "*i*<negative>*<negative>*<thing>*". Also, I'd have to check first whether I even want to parse it against the $likePhrases filter! Ok, so it gets difficult anyway. Good Luck!!!
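One rough way to approximate the pattern idea above, sketched in Python. It does not implement the wildcard syntax itself; instead it assumes a simpler rule that is my own guess at the intent: find a known <thing>, look for a "like" word, and count <negative>s so that two negatives ("don't hate") cancel out. The word lists and the name `classify` are invented for illustration.

```python
import re

# Hypothetical word lists; a real script would need far more entries.
LIKE_WORDS = ["like", "love", "adore"]
NEGATIVES = ["don't", "do not", "never", "hate"]


def count_phrases(text, phrases):
    """Count whole-word occurrences of any phrase in text."""
    return sum(len(re.findall(r"\b" + re.escape(p) + r"\b", text))
               for p in phrases)


def classify(sentence, things):
    """Return (thing, 'like' or 'dislike'), or None if nothing is recognised."""
    text = sentence.lower()
    # Try longer titles first so "Dell computers" wins over "computer".
    thing = next((t for t in sorted(things, key=len, reverse=True)
                  if re.search(r"\b" + re.escape(t.lower()) + r"\b", text)),
                 None)
    if thing is None:
        return None
    likes = count_phrases(text, LIKE_WORDS)
    negatives = count_phrases(text, NEGATIVES)
    if likes == 0 and negatives == 0:
        return None  # no opinion expressed that this sketch understands
    # An even number of negatives cancels out ("don't hate" reads as "like").
    return (thing, "like" if negatives % 2 == 0 else "dislike")


things = ["computer", "Dell computers"]
print(classify("Bob, I really like Dell computers", things))
print(classify("Bob, I don't like Dell computers", things))
print(classify("Bob, I don't hate Dell computers", things))
```

Counting negatives anywhere in the sentence is crude: it ignores word order, so a negative that belongs to a different clause will still flip the result.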
Re: Writing something to parse natural language requests.
Posted: Tue Jul 07, 2009 8:02 pm
by Christopher
Better to Google this:
http://www.google.ca/search?hl=en&rlz=1 ... arch&meta=
You really don't want to do this in PHP. I am sure you can find some software that takes a string and returns a bunch of useful data that you can use to do something interesting. I notice that OpenNLP has a toolkit and there are others.
Re: Writing something to parse natural language requests.
Posted: Wed Jul 08, 2009 3:29 am
by onion2k
In the end I tried a way of testing all the <thing>s. It's faster than I thought it'd be... it only takes 0.25s, and I think I can get that down a lot. I can do:
[sql]SELECT * FROM `tm_things` WHERE 'Bob, I really like Dell computers. They are so dreamy!' LIKE CONCAT('% ', `title`, '%') AND LENGTH(`title`) > 4[/sql]
That returns "Dell computers". Unfortunately it also returns "computer" and "dream" though.
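The extra matches come from the pattern only requiring a space before `title` and allowing anything after it, so "computer" matches inside "computers" and "dream" inside "dreamy". One possible alternative, sketched here in Python and not what the query above does, is to load the titles into a hash lookup and test whole-word n-grams of the sentence, longest first. The name `find_things`, the `max_words` limit and the sample titles are assumptions for illustration.

```python
import re


def find_things(sentence, titles, max_words=5):
    """Return the titles that appear in sentence as whole-word phrases."""
    lookup = {t.lower(): t for t in titles}  # O(1) membership per candidate
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    found = []
    i = 0
    while i < len(words):
        # Try the longest phrase starting at word i first.
        for n in range(min(max_words, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in lookup:
                found.append(lookup[candidate])
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # nothing starts here; move on one word
    return found


titles = ["Dell computers", "computer", "dream"]
print(find_things("Bob, I really like Dell computers. They are so dreamy!",
                  titles))
```

Because only whole words are compared, "computer" and "dream" no longer match, and the cost depends on the sentence length instead of the number of titles. Punctuation is discarded, so a phrase could in principle match across a sentence boundary.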