splitting text files into groups
Posted: Sat Jun 05, 2010 2:23 pm
I'm working on a multi-million-word corpus (single space between each word) and am considering different approaches to splitting a text file into groups of a set number of words, for example groups of three (either inserting a delimiter at each split point or splitting it into an array and saving it in columnar, spreadsheet-like fashion; I haven't decided yet).
Assuming I have the processing power, I could simply explode() the text at each space into an array of single words and then read the words three at a time and put them into another array with three words per element and then save it into whatever format I choose.
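To make the explode() idea concrete, here is a minimal sketch. It assumes the whole corpus fits in memory and that words are separated by exactly one space. The function name group_words() and the sample sentence are made up for illustration, and the arrow function needs PHP 7.4 or later. array_chunk() does the "three at a time" step, so no manual loop is needed.

```php
<?php
// Hypothetical helper: split single-space-delimited text into groups of $size words.
function group_words(string $text, int $size = 3): array
{
    $text = trim($text);
    if ($text === '') {
        return [];                              // nothing to group
    }
    $words  = explode(' ', $text);              // one array element per word
    $groups = array_chunk($words, $size);       // sub-arrays of up to $size words
    // Join each sub-array back into a single "word word word" string.
    return array_map(fn(array $g): string => implode(' ', $g), $groups);
}

$sample = 'the quick brown fox jumps over the lazy dog again';
echo implode("\n", group_words($sample)), "\n";
// the quick brown
// fox jumps over
// the lazy dog
// again
```

The last group simply holds whatever words are left over. Each group could instead be written out as one row with fputcsv() if the spreadsheet format wins.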
Some of you, however, may be more nifty with text manipulation. It would be great if you could count words somehow and use a special chunk or split function. The problem is that we can't know the character length of every word, so while we can count characters, we have no way of knowing where every third word ends. Also, I thought of doing a str_replace or preg_replace to insert a special delimiter character in place of every third space, but I'm not sure you can do that with a regex. Although, now that I think of it, it seems there might be a special word delimiter meta-character in regex.
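For what it's worth, replacing every third space does look doable with preg_replace. The sketch below is one possible pattern, not necessarily the best one: it matches three runs of non-space characters (\S+) separated by single spaces, captures them, and swaps the space that follows for a delimiter. The function name mark_every_nth_space() and the '|' delimiter are assumptions for illustration, and the delimiter is assumed not to contain '$' or a backslash.

```php
<?php
// Hypothetical helper: replace every $n-th space with $delim.
function mark_every_nth_space(string $text, int $n = 3, string $delim = '|'): string
{
    // ($n - 1) words each followed by a space, then one more word, all captured
    // as group 1; the trailing space outside the group is what gets replaced.
    $pattern = '/((?:\S+ ){' . ($n - 1) . '}\S+) /';
    // '${1}' keeps the captured words; the braces avoid ambiguity if $delim
    // starts with a digit.
    $result = preg_replace($pattern, '${1}' . $delim, $text);
    if ($result === null) {
        throw new RuntimeException('preg_replace failed');
    }
    return $result;
}

echo mark_every_nth_space('the quick brown fox jumps over the lazy dog again'), "\n";
// the quick brown|fox jumps over|the lazy dog|again
```

Because \S can never match a space, the pattern has only one way to match each group, so it should not backtrack badly even on long input. This still loads the whole text into one string, though.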
Or maybe the speed and processing power would be the same regardless of any of the different approaches one might take?
Any thoughts on this?