Human text parsing

Luke
The Ninja Space Mod
Posts: 6424
Joined: Fri Aug 05, 2005 1:53 pm
Location: Paradise, CA

Human text parsing

Post by Luke »

I am looking into trying to recognize certain things from human-entered text. Basically I am attempting to pull appointments, meetings, etc. from the text by looking for dates and times being mentioned in the text and assessing the surrounding sentence. I am curious if anybody has ever had to deal with parsing of human-entered text before and if there are any libraries or tips I should be aware of. Thanks!
mintedjo
Forum Contributor
Posts: 153
Joined: Wed Nov 19, 2008 6:23 am

Re: Human text parsing

Post by mintedjo »

The Ninja Space Goat wrote:...tips I should be aware of...
It's hard. :-(
Luke
The Ninja Space Mod
Posts: 6424
Joined: Fri Aug 05, 2005 1:53 pm
Location: Paradise, CA

Re: Human text parsing

Post by Luke »

LOL thanks... well, as I figure it out, I'll keep y'all updated.
Kieran Huggins
DevNet Master
Posts: 3635
Joined: Wed Dec 06, 2006 4:14 pm
Location: Toronto, Canada

Re: Human text parsing

Post by Kieran Huggins »

While I'd agree that it's a monster task, I think it would be a hell of a fun problem! Go for it :-)

You're on the right track. First, find the sentences that seem to contain the kind of information you want to parse. Next, score each sentence by how likely it is to contain each of the data types you support. Finally, run each sentence through the appropriate parser.
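
To make the first two steps concrete, here is a rough sketch of splitting text into sentences and scoring each one. It is written in Python purely for illustration (the same regexes port to PHP's preg_* functions), and everything in it is an assumption: the cue lists, the function names, and the rule that a candidate needs an event word plus a date or time are all made up for the example.

```python
import re

# Hypothetical cue lists; a real system would need far richer ones.
# Note that "may" is both a month and a common verb, so expect false positives.
DATE_CUES = re.compile(
    r"\b(?:mon|tues|wednes|thurs|fri|satur|sun)day\b"
    r"|\b(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?"
    r"|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)\b"
    r"|\b\d{1,2}/\d{1,2}(?:/\d{2,4})?\b"
    r"|\b(?:today|tomorrow|tonight|next week)\b",
    re.IGNORECASE,
)
TIME_CUES = re.compile(
    r"\b\d{1,2}(?::\d{2})?\s*(?:am|pm)\b|\bnoon\b|\bmidnight\b",
    re.IGNORECASE,
)
EVENT_CUES = re.compile(
    r"\b(?:meet(?:ing)?|appointment|call|lunch|dinner|interview)\b",
    re.IGNORECASE,
)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score_sentence(sentence):
    """Count how many cues of each kind appear in the sentence."""
    return {
        "date": len(DATE_CUES.findall(sentence)),
        "time": len(TIME_CUES.findall(sentence)),
        "event": len(EVENT_CUES.findall(sentence)),
    }


def candidate_sentences(text):
    """Return (sentence, scores) pairs that have an event cue plus a date or time cue."""
    results = []
    for sentence in split_sentences(text):
        scores = score_sentence(sentence)
        if scores["event"] and (scores["date"] or scores["time"]):
            results.append((sentence, scores))
    return results


if __name__ == "__main__":
    sample = "Thanks for the notes. Let's have a meeting next Monday at 3pm! I may be late."
    for sentence, scores in candidate_sentences(sample):
        print(sentence, scores)
```

Only the sentences that survive this filter would be handed to the per-type parsers, which is where the real work starts.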

Each parser will have its own set of keywords that alter the data either implicitly or explicitly. Let's take a date parser as an example:

A date can be represented in many ways: 12/08/1979 (US standard), 08/12/1979 (CA standard), Dec[ember] 8 1979 (written), in 9 days (relative written), next Monday (what does next mean?), next week Monday / Monday next week (GAH) ...etc...
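
One way to cope with that ambiguity is to have the parser return every plausible reading instead of guessing. The sketch below does that for the examples above. It is Python for illustration only, the function name and labels are hypothetical, it assumes an English locale for month names, and the two readings of "next Monday" are just one possible interpretation of the phrase.

```python
import re
from datetime import date, datetime, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def parse_date_candidates(text, today):
    """Return a list of (date, interpretation) pairs for a single date phrase."""
    text = text.strip().lower()
    candidates = []

    # Numeric dates: try both orderings and keep every distinct valid reading.
    for fmt, label in (("%m/%d/%Y", "month/day/year"), ("%d/%m/%Y", "day/month/year")):
        try:
            parsed = datetime.strptime(text, fmt).date()
        except ValueError:
            continue
        if all(parsed != existing for existing, _ in candidates):
            candidates.append((parsed, label))

    # Written dates such as "Dec 8 1979" or "December 8 1979".
    for fmt in ("%b %d %Y", "%B %d %Y"):
        try:
            candidates.append((datetime.strptime(text, fmt).date(), "written"))
            break
        except ValueError:
            pass

    # Relative dates such as "in 9 days".
    match = re.fullmatch(r"in (\d+) days?", text)
    if match:
        candidates.append((today + timedelta(days=int(match.group(1))), "relative"))

    # "next <weekday>" is ambiguous: it may mean the first such weekday after
    # today, or that weekday in the following calendar week (Monday-based).
    match = re.fullmatch(r"next (\w+)", text)
    if match and match.group(1) in WEEKDAYS:
        target = WEEKDAYS.index(match.group(1))
        days_ahead = (target - today.weekday() - 1) % 7 + 1  # always 1..7
        nearest = today + timedelta(days=days_ahead)
        candidates.append((nearest, "first such weekday after today"))
        start_of_next_week = today + timedelta(days=7 - today.weekday())
        in_next_week = start_of_next_week + timedelta(days=target)
        if in_next_week != nearest:
            candidates.append((in_next_week, "that weekday in the following week"))

    return candidates


if __name__ == "__main__":
    reference_day = date(2008, 12, 10)  # arbitrary "today" for the demo
    for phrase in ("12/08/1979", "Dec 8 1979", "in 9 days", "next Friday"):
        print(phrase, parse_date_candidates(phrase, reference_day))
```

When more than one candidate comes back, the caller can use context (the user's locale, the rest of the sentence) or simply ask the user which one was meant.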

It gets even MORE complicated when you factor time into this... and even MORE complicated when you factor in timezones! And that's just dates ;-)

If you can, use hinting to steer users toward input you can parse. Showing example text next to the input field is the obvious way to do that.

If you want to check out some code to get you thinking, take a look at "Chronic", a natural date-time parser in Ruby: http://chronic.rubyforge.org/ (it's rad). You could also track down the PHP source for strtotime() if you can, though it might be less readable.

I have a feeling that something ambiguous like a meeting would be MUCH harder to find / parse. Good luck!!!
mmj
Forum Contributor
Posts: 118
Joined: Fri Oct 31, 2008 4:00 pm

Re: Human text parsing

Post by mmj »

Kieran Huggins
DevNet Master
Posts: 3635
Joined: Wed Dec 06, 2006 4:14 pm
Location: Toronto, Canada

Re: Human text parsing

Post by Kieran Huggins »

AWESOME! :rofl:

(whew - pulled ahead of Sami's post count with this one... my work is done!)
jayshields
DevNet Resident
Posts: 1912
Joined: Mon Aug 22, 2005 12:11 pm
Location: Leeds/Manchester, England

Re: Human text parsing

Post by jayshields »

Why not look into Semantic Web technologies like RDF, RDFS, OWL, etc.? They're designed to do stuff like this.