Human text parsing

Posted: Thu Nov 27, 2008 7:52 pm
by Luke
I am looking into trying to recognize certain things from human-entered text. Basically I am attempting to pull appointments, meetings, etc. from the text by looking for dates and times being mentioned in the text and assessing the surrounding sentence. I am curious if anybody has ever had to deal with parsing of human-entered text before and if there are any libraries or tips I should be aware of. Thanks!

Re: Human text parsing

Posted: Fri Nov 28, 2008 4:12 am
by mintedjo
The Ninja Space Goat wrote:...tips I should be aware of...
It's hard. :-(

Re: Human text parsing

Posted: Fri Nov 28, 2008 11:36 am
by Luke
LOL thanks... well, as I figure it out, I'll keep y'all updated.

Re: Human text parsing

Posted: Sat Nov 29, 2008 5:51 am
by Kieran Huggins
While I'd agree that it's a monster task, I think it would be a hell of a fun problem! Go for it :-)

You're on the right track: first you have to find the sentences that seem to contain the kinds of information you want to parse. Next, you'll have to score each sentence based on how likely it is to contain each of the datatypes you're supporting. Finally, you'll need to run each sentence through the appropriate parser.
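
To make the first two steps concrete, here is a rough Python sketch (not taken from any library). The keyword lists, the function names split_sentences / score / candidates, and the threshold are all made up for illustration; a real version would need far richer keyword sets and a smarter sentence splitter. The third step (handing each candidate to a parser) is left out.

```python
import re

# Hypothetical keyword lists per datatype; purely illustrative.
KEYWORDS = {
    "date": ["monday", "tuesday", "tomorrow", "next week", "december"],
    "time": ["am", "pm", "noon", "o'clock"],
}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def score(sentence, datatype):
    """Count how many of the datatype's keywords appear as whole words."""
    lowered = sentence.lower()
    return sum(
        1
        for kw in KEYWORDS[datatype]
        if re.search(r"\b" + re.escape(kw) + r"\b", lowered)
    )

def candidates(text, threshold=1):
    """Return (sentence, datatype, score) for every pairing at or above threshold."""
    results = []
    for sentence in split_sentences(text):
        for datatype in KEYWORDS:
            s = score(sentence, datatype)
            if s >= threshold:
                results.append((sentence, datatype, s))
    return results
```

With the input "Lunch was great. Let's meet next week Monday at 3 pm. See you!", only the middle sentence comes back: once as a "date" candidate (score 2) and once as a "time" candidate (score 1).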

Each parser will have its own set of keywords that alter the data either implicitly or explicitly. Let's take a date parser as an example:

A date can be represented in many ways: 12/08/1979 (US standard), 08/12/1979 (CA standard), Dec[ember] 8 1979 (written), in 9 days (relative written), next Monday (what does next mean?), next week Monday / Monday next week (GAH) ...etc...

It gets even MORE complicated when you factor time into this... and even MORE complicated when you factor in timezones! And that's just dates ;-)
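
Here is a minimal Python sketch of a date parser covering a few of those forms. It rests on several assumptions of mine: numeric dates are read month-first (US style), "next Monday" means the first Monday strictly after today, and month names are matched in an English locale. The function name parse_date and the format list are hypothetical, and it ignores times and timezones entirely.

```python
import re
from datetime import date, datetime, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

# Assumption: numeric dates are month-first (US style), so 12/08/1979 is Dec 8.
# Month-name formats depend on the locale; an English locale is assumed.
ABSOLUTE_FORMATS = ["%m/%d/%Y", "%B %d %Y", "%b %d %Y"]

def parse_date(text, today):
    """Return a date for a small set of written forms, or None if unrecognised."""
    text = text.strip().lower()

    # Relative written form: "in 9 days"
    m = re.fullmatch(r"in (\d+) days?", text)
    if m:
        return today + timedelta(days=int(m.group(1)))

    # "next monday": assumed to mean the first such weekday strictly after today
    m = re.fullmatch(r"next (\w+)", text)
    if m and m.group(1) in WEEKDAYS:
        delta = (WEEKDAYS.index(m.group(1)) - today.weekday()) % 7
        return today + timedelta(days=delta or 7)

    # Absolute forms: try each known format in turn
    for fmt in ABSOLUTE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None
```

Passing today in explicitly keeps the relative forms testable. For example, with today = date(2008, 11, 27), a Thursday, "in 9 days" gives date(2008, 12, 6) and "next Monday" gives date(2008, 12, 1). Forms like "next week Monday" simply return None here.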

If you can, use hinting to direct user input. Example text is the obvious solution.

If you want to check out some code to get you thinking, take a look at "Chronic", a natural date-time parser in Ruby: http://chronic.rubyforge.org/ (it's rad), or maybe track down the PHP source for strtotime() if you can, though it might be less readable.

I have a feeling that something ambiguous like a meeting would be MUCH harder to find / parse. Good luck!!!

Re: Human text parsing

Posted: Sat Nov 29, 2008 5:53 am
by mmj

Re: Human text parsing

Posted: Sat Nov 29, 2008 6:41 am
by Kieran Huggins
AWESOME! :rofl:

(whew - pulled ahead of Sami's post count with this one... my work is done!)

Re: Human text parsing

Posted: Sat Nov 29, 2008 9:56 am
by jayshields
Why not look into Semantic Web technologies like RDF, RDFS, OWL, etc.? They're designed to do stuff like this.