I'm working on a multi-million-word corpus (a single space between each word) and am considering different approaches to splitting a text file into groups of a set number of words, for example groups of three (either marking it as delimited at that point, or splitting it into an array and saving it in columnar, spreadsheet fashion; I haven't decided yet).
Assuming I have the processing power, I could simply explode() the text at each space into an array of single words, then read the words three at a time, put them into another array with three words per element, and save that in whatever format I choose.
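For what it's worth, PHP already has a "special chunk function" for exactly this: array_chunk(). A minimal sketch of the explode() approach, assuming the whole file fits in memory ('corpus.txt' and 'groups.txt' are placeholder file names):

```php
<?php
// Sketch: explode the corpus on spaces, regroup three words per element,
// and write one tab-delimited row per group (e.g. for a spreadsheet import).
$text   = file_get_contents('corpus.txt');
$words  = explode(' ', trim($text));
$groups = array_chunk($words, 3);   // [['w1','w2','w3'], ['w4','w5','w6'], ...]

$fp = fopen('groups.txt', 'w');
foreach ($groups as $group) {
    fwrite($fp, implode("\t", $group) . "\n");
}
fclose($fp);
```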
Some of you, however, may be more nifty with text manipulation. It would be great if you could count words somehow and use a special chunk or split function. The problem is that we can't know the character length of every word, so while we can count characters, we have no way of knowing where every third word ends. I also thought of doing a str_replace() or preg_replace() to insert a special delimiter character in place of every third space, but I'm not sure you can do that with a regex. Although, now that I think of it, there does seem to be a word-boundary meta-character in regex.
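You can in fact do this with a regex, and without the word-boundary meta-character: \b exists, but it matches a zero-width position at every word edge, so it doesn't help count to three. Instead, a pattern can consume three non-space runs plus the space after them and put back a delimiter. A small sketch:

```php
<?php
// Sketch: mark every third space with a newline in one pass.
// \S+ never matches inside whitespace, so words are never split.
$text  = 'the quick brown fox jumps over the lazy dog';
$split = preg_replace('/(\S+ \S+ \S+) /', "$1\n", $text);
// $split is "the quick brown\nfox jumps over\nthe lazy dog"
```

Matches don't overlap, so the engine resumes after each replaced group, which is what makes "every third space" work; a trailing group of fewer than three words is simply left alone.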
Or maybe the speed and processing power would be the same regardless of any of the different approaches one might take?
Any thoughts on this?
splitting text files into groups
-
clarkepeters
- Forum Newbie
- Posts: 6
- Joined: Mon May 24, 2010 10:50 pm
- Jonah Bron
- DevNet Master
- Posts: 2764
- Joined: Thu Mar 15, 2007 6:28 pm
- Location: Redding, California
Re: splitting text files into groups
I really don't think you should build a multi-million-element array. Do it something like this instead:
fread() 500 characters
convert to desired file format
write to output file
fread() 500 more characters
...
-
clarkepeters
- Forum Newbie
- Posts: 6
- Joined: Mon May 24, 2010 10:50 pm
Re: splitting text files into groups
I was rethinking this last night and realized I could break the file up before processing and save a portion of it at a time, so I could work with, say, 50,000 to 100,000 words instead of multi-million. But what is key here is that because it's a corpus I can't split words in the middle, and I have to be able to group them by, or distinguish between, groups of three words.
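Both requirements (never splitting a word, and grouping by three) can be combined with chunked reading by holding two things over between chunks: a possibly cut-off word at the chunk edge, and any leftover words that don't yet fill a group of three. A minimal sketch, assuming placeholder file names 'corpus.txt' and 'groups.txt' and an arbitrary 8 KB chunk size:

```php
<?php
// Sketch: stream the corpus and write tab-delimited three-word rows
// without ever splitting a word or loading the whole file.
$in      = fopen('corpus.txt', 'r');
$out     = fopen('groups.txt', 'w');
$carry   = [];   // leftover words (fewer than three) between chunks
$partial = '';   // possibly cut-off word at a chunk edge

while (!feof($in)) {
    $chunk   = $partial . fread($in, 8192);
    $partial = '';
    if (!feof($in)) {
        // hold back everything after the last space; it may be a cut word
        $pos = strrpos($chunk, ' ');
        if ($pos !== false) {
            $partial = substr($chunk, $pos + 1);
            $chunk   = substr($chunk, 0, $pos);
        } else {
            $partial = $chunk;
            $chunk   = '';
        }
    }
    $words = array_merge($carry, preg_split('/\s+/', $chunk, -1, PREG_SPLIT_NO_EMPTY));
    // emit complete groups of three; keep any remainder for the next chunk
    while (count($words) >= 3) {
        fwrite($out, implode("\t", array_splice($words, 0, 3)) . "\n");
    }
    $carry = $words;
}
if ($carry) {    // flush a final short group, if any
    fwrite($out, implode("\t", $carry) . "\n");
}
fclose($in);
fclose($out);
```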
- Jonah Bron
- DevNet Master
- Posts: 2764
- Joined: Thu Mar 15, 2007 6:28 pm
- Location: Redding, California
Re: splitting text files into groups
Why does it have to be broken up into groups of three? Here's some code for what I laid out above; it backs up to the last space in each chunk, so it doesn't break words.
[syntax]
$file = fopen('file.txt', 'r');
while (!feof($file)) {
    $text = fread($file, 500);
    // walk back to the last space so a word is never cut in half
    $i = 0;
    while ($i < strlen($text) && $text[strlen($text) - ($i + 1)] != ' ') {
        $i++;
    }
    if ($i > 0 && $i < strlen($text)) {
        $text = substr($text, 0, strlen($text) - $i);
        fseek($file, -$i, SEEK_CUR); // re-read the partial word next pass
    }
    $text = str_replace(' ', "\n", $text);
    // do something with the newline-delimited word list
}
fclose($file);
[/syntax]