Another algorithm challenge :)

PHP programming forum. Ask questions or help people concerning PHP code. Don't understand a function? Need help implementing a class? Don't understand a class? Here is where to ask. Remember to do your homework!

Moderator: General Moderators

User avatar
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Another algorithm challenge :)

Post by Chris Corbyn »

Ok, so I'm stealing ideas here. I have my own way of doing this but I'm hoping someone can post a more optimized method ;)

Given two file handles; one which is readable and another which is writable, write one file to the other performing string-replacements on-the-fly. Imagine these files could be 10MB or so, and thus reading the *entire* file contents and doing a str_replace() is not acceptable. Use the least amount of memory possible.

Pseudo Example:

Code: Select all

function copyStreamTransformed($reader, $writer, $replacements = array()) {
  while (!feof($reader) && false !== $bytes = fread($reader, 8192)) {
    //here's where you'd replace $bytes using $replacements, then...
    fwrite($writer, $bytes);
  }
}
 
//Use case
$fpOut = fopen('text-file.txt', 'r');
$fpIn = fopen('replaced-file.txt', 'w');
 
copyStreamTransformed($fpOut, $fpIn, array('foo' => 'bar', "\r\n" => "\n"));
The above example a text file with windows line endings would be copied to a text file containing UNIX line endings and with all occurences of "foo" replaced with "bar".

Bear in mind that when you fread() from the reader, you may end up with half of one of your replacements tagged onto the end of the string, and the other half would come out at the start of the next read (e.g. read 1 = "some fo", read 2 = "o and more foo").

Looking forward to stealing your ideas :P
User avatar
hawkenterprises
Forum Commoner
Posts: 54
Joined: Thu Feb 28, 2008 9:56 pm
Location: gresham,oregon
Contact:

Re: Another algorithm challenge :)

Post by hawkenterprises »

What are the full rules to the challenge? I mean something like this might be easily approached with shell commands and grep. That's what PHP is basically doing anyways might as well just beat it to it.
User avatar
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Re: Another algorithm challenge :)

Post by Chris Corbyn »

Pure PHP... no shell commands ;)
User avatar
Sekka
Forum Commoner
Posts: 91
Joined: Mon Feb 18, 2008 10:25 am
Location: Huddersfield, West Yorkshire, UK

Re: Another algorithm challenge :)

Post by Sekka »

No str_replace()? Hrm... preg_replace() acceptable? Or is that even worse? :D
User avatar
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Re: Another algorithm challenge :)

Post by Chris Corbyn »

Sekka wrote:No str_replace()? Hrm... preg_replace() acceptable? Or is that even worse? :D
I didn't say no str_replace() ;) What I did say was none of doing this:

Code: Select all

$full_file = fread($reader); //read the lot! NO thanks!
fwrite($writer, str_replace( ... , $full_file)); //This will run out of memory on large files
You need to make use of streaming the file contents from one to the other.... it's harder than it sounds :)
User avatar
Sekka
Forum Commoner
Posts: 91
Joined: Mon Feb 18, 2008 10:25 am
Location: Huddersfield, West Yorkshire, UK

Re: Another algorithm challenge :)

Post by Sekka »

Then surely the example you posted is the best as it does it one line at a time? Or do I just lack the relevant PHP knowledge in this area? :D

I'm also gonna use the 'I'm in the UK and it's currently 4am' card as I truly can't think at this hour.
User avatar
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Re: Another algorithm challenge :)

Post by Chris Corbyn »

Ok, here's a problem :)

Original file contains:

Code: Select all

I'm a foo but I wish I was a bar
Replacements: foo => zip, bar => button

So we'd want:

Code: Select all

I'm a zip but I wish I was a button
Now when streaming 8192 bytes a time (the max fread() chunk size) that would be fine. But let's say for example there are 8185 characters before that occurence of "I'm a foo...". That's means the 8192th character is actually slap bang in the middle of the word "foo":

Code: Select all

$bytes = fread($fp, 8192);
echo $bytes;
//<8185 chars here>I'm a f
 
That means a str_replace() for foo => zip will not work since there's only the "f" and no "oo". Likewise, the next read will start with:

Code: Select all

oo but I wish I was a bar
Same issue, the fact that the search string is now truncated has caused a serious problem.

That's why I called it an algorithm. It needs to take into account:

a) The maximum length of a search string
b) The possibility that the start or end of a chunk read from the file may be part of a search string
User avatar
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Re: Another algorithm challenge :)

Post by Chris Corbyn »

Reading single lines won't work neither :P Imagine you want to replace:

Code: Select all

$replacements = array("foo\r\nbar" => "zip\r\nbutton");
User avatar
hawkenterprises
Forum Commoner
Posts: 54
Joined: Thu Feb 28, 2008 9:56 pm
Location: gresham,oregon
Contact:

Re: Another algorithm challenge :)

Post by hawkenterprises »

Well no guarantee this would be faster but a way to solve a truncation of a streaming issue is to add a buffer thus a streaming buffered file transfer. Some thing like

Code: Select all

 
while(!feof($fp)){
$filebuffer = fread($fp,8192);
$filebufferold .= $filebuffer;
    // do your search/replace on $filebufferold   (8192*2) bytes
$filebufferold = $filebuffer;  //resets the var to just the original 8192
}
 
This would solve any overlap search issues I would believe. The code wasn't test but the logic is solid
dml
Forum Contributor
Posts: 133
Joined: Sat Jan 26, 2008 2:20 pm

Re: Another algorithm challenge :)

Post by dml »

It sounds like you need a tokeniser: a wrapper for a byte stream that turns it into a token stream.

The byte stream going into the component goes:

..., b, l, a, h, f, o, o, b, l, a, h, ...

And the token stream coming out of the component goes (say):

..., b, l, a, h, foo, b, l, a, h, ...

The component has an input buffer of whatever the optimal block size is, and it needs an output buffer which is going to be no bigger than the biggest token in its configuration. There are ways of compiling the optimal state machine given the set of tokens, but the other way of doing it is to examine the output buffer on every iteration... if it starts with one of the tokens in the list, emit the token, otherwise emit the first byte.
alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Re: Another algorithm challenge :)

Post by alex.barylski »

May I assume that placeholders are pre and post fixed with a unique delimiter? Something like:

%%FNAME%%

So your currently loading 8192 bytes into a buffer and replacing placeholders on the fly. The interesting problem is how to stitch placeholders togather when they fall between two chunk boundries???

Question: If you are using 'fixed' chunk sizes of 8192 -- what happens if the placeholder replacement text is larger than the placeholder itself -- replacing the text on the fly would introduce the following problem, wouldn't it?

Assume a chunk of 32 bytes:

Code: Select all

This is a string %%PLACEHOLDER%%
Placeholder replaced version:

Code: Select all

This is a string 'which has been replaced with sub-contents'
The placeholder replacement requires more than the given 32 bytes and the chunk would overflow -- wouldn't it???

I would either load it all into a buffer (memory hog -- perhaps) or alternatively seek through the file handle byte by byte, basically emulating str_replace (slow like turtle -- easy on memory).

Cheers :)
User avatar
s.dot
Tranquility In Moderation
Posts: 5001
Joined: Sun Feb 06, 2005 7:18 pm
Location: Indiana

Re: Another algorithm challenge :)

Post by s.dot »

substr_replace() ? :-D
Set Search Time - A google chrome extension. When you search only results from the past year (or set time period) are displayed. Helps tremendously when using new technologies to avoid outdated results.
User avatar
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Re: Another algorithm challenge :)

Post by Chris Corbyn »

Hockey wrote:May I assume that placeholders are pre and post fixed with a unique delimiter? Something like:

%%FNAME%%
No... this needs to the equivalent a str_replace() except more of a stream_replace(). It would be used to convert CRLF line endings to just LF for example :)
alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Re: Another algorithm challenge :)

Post by alex.barylski »

Ok, so placeholders are not what you are replacing, you are performing Search & Replace on live data.

I'm still confused about the issue of fixed chunk sizes...does this mean each word you replace is essentially the same number of characters as the original?
User avatar
Benjamin
Site Administrator
Posts: 6935
Joined: Sun May 19, 2002 10:24 pm

Re: Another algorithm challenge :)

Post by Benjamin »

I'll whip it out later tonight.
Post Reply