Page 1 of 2
Another algorithm challenge :)
Posted: Sat Mar 01, 2008 6:14 pm
by Chris Corbyn
Ok, so I'm stealing ideas here. I have my own way of doing this but I'm hoping someone can post a more optimized method
Given two file handles; one which is readable and another which is writable, write one file to the other performing string-replacements on-the-fly. Imagine these files could be 10MB or so, and thus reading the *entire* file contents and doing a str_replace() is not acceptable. Use the least amount of memory possible.
Pseudo Example:
Code: Select all
function copyStreamTransformed($reader, $writer, $replacements = array()) {
while (!feof($reader) && false !== $bytes = fread($reader, 8192)) {
//here's where you'd replace $bytes using $replacements, then...
fwrite($writer, $bytes);
}
}
//Use case
$fpOut = fopen('text-file.txt', 'r');
$fpIn = fopen('replaced-file.txt', 'w');
copyStreamTransformed($fpOut, $fpIn, array('foo' => 'bar', "\r\n" => "\n"));
The above example a text file with windows line endings would be copied to a text file containing UNIX line endings and with all occurences of "foo" replaced with "bar".
Bear in mind that when you fread() from the reader, you may end up with half of one of your replacements tagged onto the end of the string, and the other half would come out at the start of the next read (e.g. read 1 = "some fo", read 2 = "o and more foo").
Looking forward to stealing your ideas

Re: Another algorithm challenge :)
Posted: Sat Mar 01, 2008 8:20 pm
by hawkenterprises
What are the full rules to the challenge? I mean something like this might be easily approached with shell commands and grep. That's what PHP is basically doing anyways might as well just beat it to it.
Re: Another algorithm challenge :)
Posted: Sat Mar 01, 2008 8:23 pm
by Chris Corbyn
Pure PHP... no shell commands

Re: Another algorithm challenge :)
Posted: Sat Mar 01, 2008 9:26 pm
by Sekka
No str_replace()? Hrm... preg_replace() acceptable? Or is that even worse?

Re: Another algorithm challenge :)
Posted: Sat Mar 01, 2008 9:50 pm
by Chris Corbyn
Sekka wrote:No str_replace()? Hrm... preg_replace() acceptable? Or is that even worse?

I didn't say no str_replace()

What I did say was none of doing this:
Code: Select all
$full_file = fread($reader); //read the lot! NO thanks!
fwrite($writer, str_replace( ... , $full_file)); //This will run out of memory on large files
You need to make use of streaming the file contents from one to the other.... it's harder than it sounds

Re: Another algorithm challenge :)
Posted: Sat Mar 01, 2008 10:10 pm
by Sekka
Then surely the example you posted is the best as it does it one line at a time? Or do I just lack the relevant PHP knowledge in this area?
I'm also gonna use the 'I'm in the UK and it's currently 4am' card as I truly can't think at this hour.
Re: Another algorithm challenge :)
Posted: Sat Mar 01, 2008 11:51 pm
by Chris Corbyn
Ok, here's a problem
Original file contains:
Replacements: foo => zip, bar => button
So we'd want:
Code: Select all
I'm a zip but I wish I was a button
Now when streaming 8192 bytes a time (the max fread() chunk size) that would be fine. But let's say for example there are 8185 characters before that occurence of "I'm a foo...". That's means the 8192th character is actually slap bang in the middle of the word "foo":
Code: Select all
$bytes = fread($fp, 8192);
echo $bytes;
//<8185 chars here>I'm a f
That means a str_replace() for foo => zip will not work since there's only the "f" and no "oo". Likewise, the next read will start with:
Same issue, the fact that the search string is now truncated has caused a serious problem.
That's why I called it an algorithm. It needs to take into account:
a) The maximum length of a search string
b) The possibility that the start or end of a chunk read from the file may be part of a search string
Re: Another algorithm challenge :)
Posted: Sat Mar 01, 2008 11:54 pm
by Chris Corbyn
Reading single lines won't work neither

Imagine you want to replace:
Code: Select all
$replacements = array("foo\r\nbar" => "zip\r\nbutton");
Re: Another algorithm challenge :)
Posted: Sun Mar 02, 2008 1:04 am
by hawkenterprises
Well no guarantee this would be faster but a way to solve a truncation of a streaming issue is to add a buffer thus a streaming buffered file transfer. Some thing like
Code: Select all
while(!feof($fp)){
$filebuffer = fread($fp,8192);
$filebufferold .= $filebuffer;
// do your search/replace on $filebufferold (8192*2) bytes
$filebufferold = $filebuffer; //resets the var to just the original 8192
}
This would solve any overlap search issues I would believe. The code wasn't test but the logic is solid
Re: Another algorithm challenge :)
Posted: Sun Mar 02, 2008 9:06 am
by dml
It sounds like you need a tokeniser: a wrapper for a byte stream that turns it into a token stream.
The byte stream going into the component goes:
..., b, l, a, h, f, o, o, b, l, a, h, ...
And the token stream coming out of the component goes (say):
..., b, l, a, h, foo, b, l, a, h, ...
The component has an input buffer of whatever the optimal block size is, and it needs an output buffer which is going to be no bigger than the biggest token in its configuration. There are ways of compiling the optimal state machine given the set of tokens, but the other way of doing it is to examine the output buffer on every iteration... if it starts with one of the tokens in the list, emit the token, otherwise emit the first byte.
Re: Another algorithm challenge :)
Posted: Sun Mar 02, 2008 5:33 pm
by alex.barylski
May I assume that placeholders are pre and post fixed with a unique delimiter? Something like:
%%FNAME%%
So your currently loading 8192 bytes into a buffer and replacing placeholders on the fly. The interesting problem is how to stitch placeholders togather when they fall between two chunk boundries???
Question: If you are using 'fixed' chunk sizes of 8192 -- what happens if the placeholder replacement text is larger than the placeholder itself -- replacing the text on the fly would introduce the following problem, wouldn't it?
Assume a chunk of 32 bytes:
Placeholder replaced version:
Code: Select all
This is a string 'which has been replaced with sub-contents'
The placeholder replacement requires more than the given 32 bytes and the chunk would overflow -- wouldn't it???
I would either load it all into a buffer (memory hog -- perhaps) or alternatively seek through the file handle byte by byte, basically emulating str_replace (slow like turtle -- easy on memory).
Cheers

Re: Another algorithm challenge :)
Posted: Sun Mar 02, 2008 5:48 pm
by s.dot
substr_replace() ?

Re: Another algorithm challenge :)
Posted: Sun Mar 02, 2008 5:52 pm
by Chris Corbyn
Hockey wrote:May I assume that placeholders are pre and post fixed with a unique delimiter? Something like:
%%FNAME%%
No... this needs to the equivalent a str_replace() except more of a stream_replace(). It would be used to convert CRLF line endings to just LF for example

Re: Another algorithm challenge :)
Posted: Sun Mar 02, 2008 5:58 pm
by alex.barylski
Ok, so placeholders are not what you are replacing, you are performing Search & Replace on live data.
I'm still confused about the issue of fixed chunk sizes...does this mean each word you replace is essentially the same number of characters as the original?
Re: Another algorithm challenge :)
Posted: Sun Mar 02, 2008 7:15 pm
by Benjamin
I'll whip it out later tonight.