Human text parsing
I am looking into trying to recognize certain things from human-entered text. Basically I am attempting to pull appointments, meetings, etc. from the text by looking for dates and times being mentioned in the text and assessing the surrounding sentence. I am curious if anybody has ever had to deal with parsing of human-entered text before and if there are any libraries or tips I should be aware of. Thanks!
Re: Human text parsing
The Ninja Space Goat wrote: ...tips I should be aware of...
It's hard.
Re: Human text parsing
LOL thanks... well, as I figure it out, I'll keep y'all updated.
- Kieran Huggins
- DevNet Master
- Posts: 3635
- Joined: Wed Dec 06, 2006 4:14 pm
- Location: Toronto, Canada
Re: Human text parsing
While I'd agree that it's a monster task, I think it would be a hell of a fun problem! Go for it!
You're on the right track: first you have to find the sentences that seem to contain the kinds of information you want to parse. Next, you'll have to score each sentence on how likely it is to contain each of the datatypes you're supporting. Finally, you'll need to run each sentence through the appropriate parser.
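The first two steps above (find the sentences, then score them) might look roughly like this. It is a minimal Python sketch, not a finished approach: the `PATTERNS` table, its weights, and the function names `split_sentences`, `score_sentence`, and `find_candidates` are all invented for illustration, and the sentence splitter is deliberately naive.

```python
import re

# Hypothetical keyword patterns per data type; the weights are illustrative only.
PATTERNS = {
    "date": [
        (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), 3),
        (re.compile(r"\b(?:next|this)\s+(?:mon|tues|wednes|thurs|fri|satur|sun)day\b", re.I), 2),
        (re.compile(r"\b(?:today|tomorrow)\b", re.I), 2),
    ],
    "time": [
        (re.compile(r"\b\d{1,2}(?::\d{2})?\s*(?:am|pm)\b", re.I), 3),
        (re.compile(r"\b(?:noon|midnight)\b", re.I), 2),
    ],
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score_sentence(sentence):
    """Return {datatype: score} for every data type with at least one pattern hit."""
    scores = {}
    for datatype, patterns in PATTERNS.items():
        total = sum(weight for pattern, weight in patterns if pattern.search(sentence))
        if total:
            scores[datatype] = total
    return scores


def find_candidates(text):
    """Keep only the sentences that score for at least one data type."""
    results = []
    for sentence in split_sentences(text):
        scores = score_sentence(sentence)
        if scores:
            results.append((sentence, scores))
    return results


if __name__ == "__main__":
    for sentence, scores in find_candidates(
        "Hi Bob. Lunch next Monday at 12:30 pm? I liked the book."
    ):
        print(sentence, scores)
```

Each surviving sentence comes back with its per-type scores, so the third step can hand it to whichever parser matches the highest-scoring type.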
Each parser will have its own set of keywords that alter the data either implicitly or explicitly. Let's take a date parser as an example:
A date can be represented in many ways: 12/08/1979 (US standard), 08/12/1979 (CA standard), Dec[ember] 8 1979 (written), in 9 days (relative written), next Monday (what does next mean?), next week Monday / Monday next week (GAH) ...etc...
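To make the ambiguity concrete, here is a small Python sketch using only the standard `datetime` module. The helper names (`parse_numeric`, `in_days`, `next_weekday`) and the `skip_current_week` flag are assumptions made up for this example; the two readings of "next" shown are just two plausible interpretations, not a claim about what any particular library does.

```python
from datetime import date, datetime, timedelta


def parse_numeric(text, day_first):
    """Parse NN/NN/YYYY, reading the first field as the day or the month."""
    fmt = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    return datetime.strptime(text, fmt).date()


def in_days(n, today):
    """Resolve a relative phrase such as 'in 9 days'."""
    return today + timedelta(days=n)


def next_weekday(weekday, today, skip_current_week=False):
    """Resolve 'next <weekday>', where weekday is 0=Monday .. 6=Sunday.

    Default reading: the first such day strictly after today.
    skip_current_week=True: that day in the following Monday-based week.
    """
    if skip_current_week:
        start_of_next_week = today + timedelta(days=7 - today.weekday())
        return start_of_next_week + timedelta(days=weekday)
    days_ahead = (weekday - today.weekday() - 1) % 7 + 1
    return today + timedelta(days=days_ahead)


if __name__ == "__main__":
    today = date(1979, 12, 5)  # a Wednesday
    print(parse_numeric("12/08/1979", day_first=False))       # month first
    print(parse_numeric("12/08/1979", day_first=True))        # day first
    print(in_days(9, today))
    print(next_weekday(4, today))                             # nearest Friday
    print(next_weekday(4, today, skip_current_week=True))     # Friday of next week
```

The same string yields two different dates depending on `day_first`, and "next Friday" lands a week apart depending on which reading of "next" is chosen, so the parser needs a locale or a stated convention to decide.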
It gets even MORE complicated when you factor time into this... and even MORE complicated when you factor in timezones! And that's just dates.
If you can, use hinting to direct user input. Example text is the obvious solution.
If you want to check out some code to get you thinking take a look at "Chronic", a natural date-time parser in Ruby: http://chronic.rubyforge.org/ (it's rad) or maybe track down the PHP source for strtotime() if you can, though it might be less readable.
I have a feeling that something ambiguous like a meeting would be MUCH harder to find / parse. Good luck!!!
- Kieran Huggins
- DevNet Master
- Posts: 3635
- Joined: Wed Dec 06, 2006 4:14 pm
- Location: Toronto, Canada
Re: Human text parsing
mmj wrote: http://google.com/search?q=php+ocr
AWESOME!
(whew - pulled ahead of Sami's post count with this one... my work is done!)
- jayshields
- DevNet Resident
- Posts: 1912
- Joined: Mon Aug 22, 2005 12:11 pm
- Location: Leeds/Manchester, England
Re: Human text parsing
Why not look into Semantic Web technologies like RDF, RDFS, OWL, etc.? They're designed to do stuff like this.