fseek efficiency
Posted: Thu Nov 17, 2005 5:26 pm
by josh
Let's say I need to read an integer (4 bytes) at every 100-byte mark in a file. What would be the speediest method? Would reading them in order be any faster than reading them in random order? Should I read them backwards? Does it depend on the OS? Does it even make a difference?
Code: Select all
for ($x = 0; $x <= 10000; $x += 100) {
    fseek($h, $x);
    $data = fread($h, 4);
}
Code: Select all
for ($x = 10000; $x >= 0; $x -= 100) {
    fseek($h, $x);
    $data = fread($h, 4);
}
Just using that as an example, it's not going to be every 100 bytes in the application.
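A micro-benchmark along these lines would answer it on a given machine; the file name and dummy contents below are placeholders, and the timings will be whatever your OS and disk cache produce:

```php
<?php
// Hypothetical micro-benchmark: forward vs. reverse seek order.
// "data.bin" and its contents are dummies created just for the test.
$file = 'data.bin';
file_put_contents($file, str_repeat('x', 10100));

$h = fopen($file, 'rb');
$fwd = $rev = [];

$start = microtime(true);
for ($x = 0; $x <= 10000; $x += 100) {
    fseek($h, $x);
    $fwd[] = fread($h, 4);
}
$forward = microtime(true) - $start;

$start = microtime(true);
for ($x = 10000; $x >= 0; $x -= 100) {
    fseek($h, $x);
    $rev[] = fread($h, 4);
}
$reverse = microtime(true) - $start;

fclose($h);
printf("forward: %.6fs  reverse: %.6fs\n", $forward, $reverse);
```

With data this small the whole file sits in the OS page cache after the first pass, so expect the numbers to be close; the point is the harness, not these particular timings.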
Posted: Thu Nov 17, 2005 6:24 pm
by yum-jelly
For me it really would depend on what I am doing. If reading from a big file, I would use random access with fseek() in a while() loop. If I have a really long string and know I want 2 bytes after every 100 bytes, I would use unpack(); no loops that way. But that would depend on whether I need the data in an array or need to do other processing on the 2 bytes, in which case a loop would be better!
If you're only reading 100 bytes and then grabbing 2 or 4 of them, I might also just fread() 102 or 104 bytes and take substr() of the last 2 or 4. Testing will tell which is best. Testing is important because even core functions that you think would be best are sometimes slower than other core functions that you might not think to use for the task at hand!
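A sketch of that read-a-whole-record idea, assuming 100-byte records that each end in a 4-byte little-endian integer (the file name and layout are made up for illustration):

```php
<?php
// Read fixed-size records and pull the integer out with substr()/unpack()
// instead of a second fseek() per record.
// Assumed layout: 96 filler bytes + a 4-byte little-endian int per record.
$file = 'records.bin';
file_put_contents($file, str_repeat("\0", 96) . pack('V', 1234)
                       . str_repeat("\0", 96) . pack('V', 5678));

$h = fopen($file, 'rb');
$ints = [];
while (($record = fread($h, 100)) !== false && strlen($record) === 100) {
    // unpack('V', ...) decodes an unsigned 32-bit little-endian integer
    $ints[] = unpack('V', substr($record, -4))[1];
}
fclose($h);
print_r($ints); // contains 1234 and 5678
```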
yj
Posted: Thu Nov 17, 2005 8:56 pm
by alex.barylski
I am willing to bet that, for performance, you want to read the entire file into a buffer and index the buffer instead of making individual fseek() calls.
Cheers

Posted: Thu Nov 17, 2005 9:10 pm
by josh
Well, the file is very large, and reading it into memory and then extracting substrings would eat up server memory, not to mention I'd have to increase PHP's memory limit. If the file were read into memory by several concurrent processes, the server would run out of RAM and start hitting the pagefile, so for performance reasons fseek() would be the way to go, especially since I'll only need on average 0.1% of the file on any given request. I was more or less asking whether the order in which I read the data matters at all. After some thought I have decided to use MySQL as the storage engine so I can just pull out the rows I need and let MySQL worry about the most efficient way of finding the right data, but I'd still like to know the answer to my question, because I foresee needing to use the filesystem in this manner on future projects.
Posted: Thu Nov 17, 2005 9:11 pm
by Ambush Commander
When it comes to this sort of stuff, make sure it works, then profile your application. If speed is that big of a problem, there's usually other stuff that takes up more execution time and would have a greater beneficial effect when optimized.
Don't sacrifice readability for performance up-front. Only once you have a solution, and it is inefficient, do you refactor.
Posted: Thu Nov 17, 2005 9:25 pm
by josh
Well, I already have a solution. The thing is, I'm working with an excerpt of the entire data, so once we load in the full dataset I know there will be a slowdown if I read the whole thing in.
Posted: Thu Nov 17, 2005 9:32 pm
by Ambush Commander
Would it be possible to try it (the entire testing set) in a testing environment?
Anyway, I have no clue. Benchmark it. Maybe it makes no difference. Maybe it is a machine thing. Only you know.
Posted: Thu Nov 17, 2005 10:32 pm
by josh
I won't have access to the entire dataset for a few weeks, so I can't benchmark it now (well, technically I could, but it'd be pointless) because the sample data I have is so small that any differences from the order I fseek() things in would be immeasurable. I just tested the whole method with the data stored in MySQL and it runs almost 400% slower, so it looks like I will be using the filesystem after all and just keeping an index in MySQL. But you're absolutely right, I could set up benchmarks using dummy data, and I probably will, as I have to test a whole bunch of things besides the order I fseek() in. In case you were wondering, I'm loading huge amounts of vector data, and the file that reads the data and converts it to an array of usable points is the bottleneck in my system, so the faster that file runs, the faster my total execution time will be. (That file can take up to 5 seconds to run depending on the range of data I'm pulling.)
Thanks for all the suggestions though
Posted: Fri Nov 18, 2005 7:37 am
by BDKR
jshpro2 wrote:
...and the file that reads the data, and converts it to an array of usable points is the bottleneck in my system, ...
Is there some reason you can't cache this data once it's been generated? Does it change too often to make that a reasonable plan of action?
What are the systems like that will be running this stuff? I ask for two reasons.
1) Depending on the version of MySQL you are using, you should have a query cache that can have a dramatic effect in some cases. This sounds like one of them.
2) The Turck MMCache accelerator has an API that lets you store data in memory: data that persists between script runs, and that can be updated or deleted from an administrative utility. The idea is that when your script runs, it checks via the Turck API whether there is a cached version of this information in memory. If it's there, the bottleneck is avoided.
Of course, you could also serialize that generated array and store it to a file. It would work much the same as number 2) above.
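A minimal sketch of that file-cache idea, with a hypothetical cache path and a stand-in for the expensive rebuild step:

```php
<?php
// Serialize the generated array once; later runs unserialize it from disk
// instead of rebuilding it. The rebuild step here is a placeholder.
function get_points(string $cacheFile): array {
    if (is_readable($cacheFile)) {
        return unserialize(file_get_contents($cacheFile));
    }
    $points = [[1.0, 2.0], [3.5, 4.25]]; // stand-in for the expensive rebuild
    file_put_contents($cacheFile, serialize($points));
    return $points;
}

$cache = sys_get_temp_dir() . '/points.cache';
@unlink($cache);               // start cold for the demo
$first  = get_points($cache);  // builds and writes the cache
$second = get_points($cache);  // served from the cache file
var_dump($first === $second);  // bool(true)
```

In a real setup you would also want an invalidation rule, e.g. comparing filemtime() of the cache against the source data.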
Anyway, I can guarantee that both of these methods are much faster in most circumstances. There may be some where they're not, but I can't fathom what they might be.
Cheers,
BDKR
Posted: Fri Nov 18, 2005 9:04 am
by josh
I'm not sure what you mean by serializing it to the filesystem. Right now it is in a binary file, and MySQL is being used as an index: if I need record #400, for example, I look up record 400's byte position in MySQL, then fopen() my file and fseek() to whatever byte position it is at. I may have to do this 700 times per page request, so is there any way to optimize the filesystem part of this? Like splitting up the file so each record has its own file?
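One cheap optimization of the filesystem part, for what it's worth: keep a single open handle and sort the offsets before seeking, so the reads happen in file order and play nicely with OS read-ahead. A sketch, with made-up offsets standing in for the MySQL index:

```php
<?php
// One fopen(), many fseek()s: seek to each record's byte offset, reading
// in file order rather than request order. Offsets are illustrative.
$file = 'vectors.bin';
file_put_contents($file, pack('V*', 10, 20, 30, 40, 50)); // 5 x 4-byte ints

$offsets = [16, 0, 8]; // byte positions, as an index lookup might return them
sort($offsets);        // sequential seeks instead of jumping around

$h = fopen($file, 'rb');
$values = [];
foreach ($offsets as $pos) {
    fseek($h, $pos);
    $values[$pos] = unpack('V', fread($h, 4))[1];
}
fclose($h);
print_r($values); // offset 0 => 10, 8 => 30, 16 => 50
```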
I don't think the cache system you were talking about is practical though, for the same reason you probably wouldn't store a bunch of images in a BLOB field in a MySQL table; you would use the filesystem and use MySQL to index it.
Posted: Sat Nov 19, 2005 1:42 am
by BDKR
jshpro2 wrote:... If I need record #400 for example, I look up record 400's byte position in mysql, .... I may have to do this 700 times per page request,...
This right here is a perfect example of overhead that can be cached.
Let me ask a question: would it be faster to seek and find this information 700 times out of some data structure stored in memory, or to do 700 queries to the database? If you said getting it out of memory, you're correct. It is certainly faster than suffering the overhead of the initial connection to the DB (you are connecting only once, right?) and the network I/O for each request. Just grab all that crap out of the DB and put it into an array or something. One option would be to serialize it between script runs. Another option would be to store it in memory. The Turck API provides a fantastic mechanism for doing this. There are also the shared memory functions.
serialize()
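The one-query-then-array idea might look like this; the $rows array below stands in for a single SELECT's result set:

```php
<?php
// Pull the whole index out of the database in one query, build an
// in-memory map, and do the hundreds of lookups against that array.
$rows = [
    ['id' => 398, 'offset' => 39800],
    ['id' => 399, 'offset' => 39900],
    ['id' => 400, 'offset' => 40000],
];

// One pass builds the lookup table...
$index = [];
foreach ($rows as $row) {
    $index[$row['id']] = $row['offset'];
}

// ...then every record lookup is an O(1) array access, not a query.
foreach ([400, 398] as $id) {
    echo "record $id is at byte {$index[$id]}\n";
}
```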
Enjoy!
Cheers,
BDKR
Posted: Sat Nov 19, 2005 12:17 pm
by josh
By 700 times per request, I mean 700 different pieces of data. I decided to throw all my data into a file and have the MySQL table hold the index (it tells me which data can be found where). This brought my average execution time down to .3 seconds, from over 2 seconds. Commenting out the unserialize() call brings my total execution time down to .4 seconds. I will be storing my data in a binary format instead of a serialized array stored as ASCII, for two reasons:
speed: serialize() wasn't built for large amounts of data in short amounts of time
space: why serialize a 1D array when you can store it as binary in 20% of the required disk space?
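A rough illustration of the space difference, using pack('V*') for 32-bit integers (the sample data is arbitrary):

```php
<?php
// Compare a flat array of 32-bit ints packed as raw binary against the
// same array serialized as ASCII text.
$data = range(100000, 101000); // 1001 integers

$binary = pack('V*', ...$data); // 4 bytes per value
$ascii  = serialize($data);     // keys, types and digits all as text

printf("pack:      %d bytes\n", strlen($binary));
printf("serialize: %d bytes\n", strlen($ascii));

// Round-trip check: unpack('V*') restores the values (its keys start at 1).
$restored = array_values(unpack('V*', $binary));
var_dump($restored === $data); // bool(true)
```

Note that 'V' fixes the byte order (little-endian), so the file stays readable even if it is later moved to a machine with different native endianness.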
Posted: Sat Nov 19, 2005 7:01 pm
by BDKR
shpro2 wrote:
By 700 times per request, I mean 700 different pieces of data,
Which doesn't change anything. Whether it's the same info 700 times or 700 separate pieces of info, there are still upwards of 700 requests to the database per page request (which is brain-dead), based on the information you've provided above.
shpro2 wrote:
I decided to throw all my data into a file and have the mysql table hold the index (tells me which data can be found where). This brought my average execution time down .3 seconds, down from over 2 seconds. Commenting out the unserialize() function brings my total execution time down to .4 seconds. I will be storing my data in a binary format, instead of a serialized array stored as ascii for 2 reasons:
In my last post, my suggestion was that you serialize the index that is being held in memory instead of making all of those calls to MySQL; I wasn't talking about the file that you are using fseek() on. I still believe this is a much better option from an algorithmic point of view.
shpro2 wrote:
speed = serialize wasn't built for large amounts of data in short amounts of time
That may be; however, the time taken to rebuild a serialized array of indexes into your binary file would be much,
MUCH faster than making upwards of 700 separate requests to a database.
shpro2 wrote:
space = why use serialized on a 1D array, when you can store it as binary with 20% the required disk space
You've misunderstood me. One more time: my contention was that building a data structure of some sort that represents your index stored in MySQL, and a) unserializing it from a file or b) storing it in shared memory, would be a lot faster than doing upwards of 700 queries to the database.
Just some FYI for ya: this exact idea has been used in this manner or very similar ways in other applications, especially those that require complex data structures based on information stored in a database. Many modern ERP systems using OODBMSs will often store trees of objects in memory to cut down on the huge overhead of rebuilding those data structures using SQL calls alone. It gets even worse (and as a result, becomes even more important) when someone is trying to model their data in an OO fashion using a relational storage medium (ever heard the term "Impedance Mismatch"?). This second scenario is more prevalent than you may believe, and it catches many a programmer off guard when any kind of moderately large data set is thrown at their glorious creations. You have to realize on some level that a file requiring upwards of 700 queries per page request can't and won't scale as the load goes up.
Anyway, my tone was probably a bit edgy so my apologies.
However, I do suggest you begin to look into an algorithm that employs caching at some level.
Cheers,
BDKR
Posted: Sat Nov 19, 2005 10:43 pm
by josh
I think you actually misunderstood me. It's one call to the DB with upwards of 700 ROWS returned, thus 700 pieces of data to pull off the filesystem. I don't see where a cache would help dramatically; the MySQL query is nearly instant, and using binary rather than serialized ASCII files has brought the execution time down, so now the only bottleneck is disk speed.
Posted: Sat Nov 19, 2005 11:04 pm
by BDKR
jshpro2 wrote:I think you actually misunderstood me, It's one call to the DB, upwards of 700 ROWS returned, ...
Well that changes everything then! LOL