corpus/collocate feasibility with php
Posted: Fri Jun 04, 2010 5:54 am
I'm considering taking a large amount of plain text and using PHP to break it down into groups of words.
The text will be anywhere from 1 million to 30 million words -- I haven't decided yet, and much will depend on the processing feasibility.
I'm wondering whether this is even feasible or whether it would be "process overload." I'm running Linux on a standard Acer dual-processor laptop with 1 GB of memory and a 30 GB hard drive. (As a rule, I can run several intensive programs -- word editors, movie compilers, web browsers -- all at the same time without much trouble.)
My approach is like this: I need to group words by two, three, four, five, and possibly six. I know it would be too intensive to do all of that at the same time, so I'll do each group size on a separate run and save to separate lists (a list grouped by two, a list grouped by three, etc.).
The program, when grouping by threes for example, will take the first three words in the text, save the group to a list (or array or whatever), move over only one word, group the next three words and save them, move over one word again and group the next three ... and so on until EOF.
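The sliding-window grouping described above could be sketched roughly like this -- a minimal illustration, assuming the text has already been split into an array of words (the function name `groupWords` is mine, not from the post):

```php
<?php
// Sketch of the sliding window: take $n words, advance by one word,
// take the next $n, and repeat until the end of the text.
function groupWords(array $words, int $n): array
{
    $groups = [];
    $count  = count($words);
    for ($i = 0; $i + $n <= $count; $i++) {
        // Join each window of $n consecutive words into one group string.
        $groups[] = implode(' ', array_slice($words, $i, $n));
    }
    return $groups;
}

// Example with a short text; a real run would read the corpus from disk.
$text  = "the quick brown fox jumps over the lazy dog";
$words = preg_split('/\s+/', trim($text));
print_r(groupWords($words, 3)); // 7 groups: "the quick brown" ... "the lazy dog"
```

A text of W words yields W - n + 1 groups per run, so the output lists will be nearly as long as the corpus itself times the average group length.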
Later, I'll do a frequency run to eliminate groups that don't occur at least 3 times in the generated list, but I think PHP can handle that.
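The frequency pass could be as simple as the following sketch, using PHP's built-in array_count_values() to tally the groups and array_filter() to drop the rare ones (names and the example data are illustrative):

```php
<?php
// Count how often each group occurs, then keep only those groups
// that meet the minimum frequency threshold (3 by default).
function frequentGroups(array $groups, int $minCount = 3): array
{
    $counts = array_count_values($groups);          // group => occurrences
    return array_filter($counts, fn($c) => $c >= $minCount);
}

$groups = ['a b', 'c d', 'a b', 'a b', 'c d'];
print_r(frequentGroups($groups)); // only 'a b' (seen 3 times) survives
```

One caveat worth noting: array_count_values() needs the whole list in memory at once, which may be the real constraint with 1 GB of RAM and a 30-million-word corpus.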
I've never worked with such huge files before, so before I invest my time, I am soliciting opinions about the feasibility of it.
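Since the file sizes are the open question, here is one hedged sketch of how the grouping run could read the corpus line by line with fgets() instead of loading it all at once, carrying the last n - 1 words of each line over so that groups spanning a line break are still formed (the file handling here is illustrative only):

```php
<?php
// Build a small sample corpus file for the demonstration; a real run
// would open the actual corpus instead.
$path = tempnam(sys_get_temp_dir(), 'corpus');
file_put_contents($path, "the quick brown\nfox jumps over\nthe lazy dog\n");

$n      = 3;      // group size for this run
$carry  = [];     // tail of the previous line, so no group is missed
$groups = [];

$fh = fopen($path, 'r');
while (($line = fgets($fh)) !== false) {
    $words = preg_split('/\s+/', trim($line), -1, PREG_SPLIT_NO_EMPTY);
    $words = array_merge($carry, $words);
    for ($i = 0; $i + $n <= count($words); $i++) {
        $groups[] = implode(' ', array_slice($words, $i, $n));
    }
    // Keep the last ($n - 1) words for the next iteration.
    $carry = array_slice($words, -($n - 1));
}
fclose($fh);
unlink($path);

print_r($groups); // same 7 groups as if the text were one long line
```

This keeps only one line plus a few carried words in memory at a time, so the corpus size mostly affects disk I/O and the size of the output list rather than RAM during the grouping pass.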