corpus/collocate feasibility with PHP

clarkepeters
Forum Newbie
Posts: 6
Joined: Mon May 24, 2010 10:50 pm

Post by clarkepeters »

I'm considering taking a large amount of plain text and using PHP to break it down into groups of words.
The text will be anywhere from 1 million to 30 million words. I haven't decided yet, and much will depend on how feasible the processing is.

I'm wondering whether this is even feasible or whether it would be "process overload." I'm running Linux on a standard Acer dual-processor laptop with 1 GB of memory and a 30 GB hard drive. (As a rule, I can run several intensive programs, such as word editors, movie compilers, and web browsers, all at the same time without much trouble.)

My approach is this: I need to group words by two, three, four, five, and possibly six. I know it would be too intensive to do all of that at the same time, so I'll do each group size on a separate run and save the results to separate lists (a list of groups of two, a list of groups of three, etc.).

The program, for example when grouping by threes, will locate the first three words in the text and save them to a list (or array or whatever), move over only one word, group the next three words and save them to the list, move over one more word, and so on until EOF.
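A minimal sketch of that sliding window, in case it helps to see it written out. It assumes a PHP version with generators and scalar type hints (PHP 7 or later), treats any run of whitespace as a word boundary, and lets groups run across line breaks. The function name `ngrams` is just an illustrative choice, not an existing PHP function.

```php
<?php
// Hypothetical helper: yields every run of $n consecutive words from an open
// stream, moving the window forward by one word each time.
function ngrams($handle, int $n): Generator
{
    $window = [];
    // Read one line at a time so the whole corpus never sits in memory.
    while (($line = fgets($handle)) !== false) {
        // Assumption: words are separated by whitespace only.
        $words = preg_split('/\s+/', $line, -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            $window[] = $word;
            if (count($window) > $n) {
                array_shift($window); // drop the oldest word
            }
            if (count($window) === $n) {
                yield implode(' ', $window);
            }
        }
    }
}

// Small demonstration on an in-memory stream instead of a real corpus file.
$h = fopen('php://memory', 'w+');
fwrite($h, "the quick brown fox\njumps over");
rewind($h);
$groups = iterator_to_array(ngrams($h, 3), false);
fclose($h);
```

For a real run the handle would come from `fopen()` on the corpus file, and each yielded group could be written straight to an output file rather than collected in an array.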

Later, I'll do a frequency run to eliminate groups that don't occur at least 3 times in the generated list, but I think PHP can handle that.
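The frequency run could look roughly like the sketch below. It assumes PHP 7.1 or later (for the `iterable` type) and that all distinct groups fit in memory at once, which may not hold for the largest corpus on a 1 GB machine. The name `frequentGroups` and the sample data are made up for illustration.

```php
<?php
// Hypothetical helper: counts how often each group occurs and keeps only
// those seen at least $minCount times, most frequent first.
function frequentGroups(iterable $groups, int $minCount = 3): array
{
    $counts = [];
    foreach ($groups as $group) {
        $counts[$group] = ($counts[$group] ?? 0) + 1;
    }
    $frequent = array_filter($counts, function ($count) use ($minCount) {
        return $count >= $minCount;
    });
    arsort($frequent); // sort by count, descending, keeping the group as key
    return $frequent;
}

// Made-up sample list of two-word groups.
$sample = ['a b', 'b c', 'a b', 'c d', 'a b', 'b c', 'b c', 'b c'];
$result = frequentGroups($sample);
```

If the counts array grows too large, one alternative is to write the groups to a file and count them outside PHP, for example with a sorted file or a database table.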

I've never worked with such huge files before, so before I invest my time, I am soliciting opinions about the feasibility of it.
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Re: corpus/collocate feasibility with PHP

Post by Weirdan »

Assuming words average 10 letters (I know that's not true for English, but it's good enough to estimate an upper bound), the 30-million-word file would be about 300 MB. Parsing it shouldn't be a problem if done right.
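The estimate is simple arithmetic, sketched here for both ends of the range. It assumes one byte per letter (plain ASCII text), ignores the spaces between words, and uses decimal megabytes; the function name `estimateMegabytes` is only for illustration.

```php
<?php
// Hypothetical helper: rough corpus size in megabytes (1 MB = 1,000,000 bytes).
function estimateMegabytes(int $words, int $bytesPerWord): float
{
    return $words * $bytesPerWord / 1000000;
}

$smallest = estimateMegabytes(1000000, 10);  // 1 million words
$largest  = estimateMegabytes(30000000, 10); // 30 million words
printf("Between %.0f MB and %.0f MB\n", $smallest, $largest);
```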
clarkepeters
Forum Newbie
Posts: 6
Joined: Mon May 24, 2010 10:50 pm

Re: corpus/collocate feasibility with PHP

Post by clarkepeters »

Thanks, Weirdan.
That gives me enough confidence to start experimenting.
Thanks a lot.