
Processing a CSV file in chunks

Posted: Wed Apr 08, 2009 11:14 am
by Luke
I'm actually not using PHP for this, I'm using ColdFusion (because that's what they use where I work :( ), but I figured if you guys could help me get this figured out in PHP, I could easily translate it to ColdFusion code. So, here's my problem.

I have to import data from a really large CSV file (15,000+ lines), and everything I've tried so far has resulted in the server running out of memory and/or timing out. I need to figure out how to split the file into manageable chunks and process it chunk by chunk, but I'm not sure I know how to do that.

I mean, I know I could just chop the file into several files and process each one individually, but is that any better than just doing them all at once? What makes it better? Believe it or not, I've really never had to do something like this before.

Re: Processing a CSV file in chunks

Posted: Wed Apr 08, 2009 12:03 pm
by Christopher
Why don't you read the file either line by line or in chunks? How are you reading the file now? And is it running out of memory or timing out? Those are different problems with different solutions.

Re: Processing a CSV file in chunks

Posted: Wed Apr 08, 2009 12:12 pm
by McInfo
You could build a Web interface that lets you read and write the data a chunk at a time. This would add some manual labor to the insertion process, but you wouldn't get timeouts. The interface would keep track of which line of your source file you were on and increment a position counter by a fixed chunk size after each request. Your script would seek to a new position in your source file based on chunk size times request count, then read a chunk of data and do whatever you need with it.
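In PHP terms, the idea is roughly this (the file name, chunk size, and the row-handling callback are all stand-ins):

```php
<?php
// Processes one fixed-size chunk of $file per call and returns the next
// chunk index, or -1 when the file is exhausted. In the web interface,
// $chunk would come from the query string of each request.
function process_chunk($file, $chunk, $chunkSize, $handler)
{
    $fh = fopen($file, 'r');
    // Skip the lines already handled by earlier requests
    for ($i = 0; $i < $chunk * $chunkSize; $i++) {
        if (fgets($fh) === false) {
            break;
        }
    }
    // Handle this request's share of lines
    $count = 0;
    while ($count < $chunkSize && ($row = fgetcsv($fh)) !== false) {
        $handler($row); // stand-in for the real insert query
        $count++;
    }
    $done = feof($fh);
    fclose($fh);
    return $done ? -1 : $chunk + 1;
}
?>
```

Each request would call process_chunk() with the next index and then link (or redirect) back to itself until it gets -1.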

I built something like this to upload phrases and words to a database from a very large text file. If my description wasn't sufficient, I can upload some sample code.

Edit: This post was recovered from search engine cache.

Re: Processing a CSV file in chunks

Posted: Wed Apr 08, 2009 12:47 pm
by Luke
Sorry, I guess I didn't describe my problem as well as I could have. The main issue is not time-outs; I've extended the max execution time and solved that problem. The problem now is that I'm running out of memory. When you say I could read it in chunks, how does that work? Right now my code looks like this (my code is ColdFusion, but this is what it would look like if it were PHP). Oh, and I should mention I didn't write this. It was written by a programmer who no longer works here, and I'm fixing it.

Here is what the code ACTUALLY is

Code: Select all

<cfif StructKeyExists(FORM, "submit")>
    <cfsetting requesttimeout="1800" /> <!--- 30 minutes --->
    <cfparam name = "FORM.skipFirst" default = "0" />
    <cffile action = "upload"
            fileField = "csv"
            destination = "#GetTempDirectory()#"
            nameConflict = "overwrite"
            accept = "text/plain, text/csv, text/comma-separated-values, application/vnd.ms-excel" />
 
    <cfset fileName = cffile.clientFilename />
    <cfset fileExt = cffile.clientFileExt />
    <cffile
        action = "read"
        file = "#GetTempDirectory()#\#fileName#.#fileExt#"
        variable = "csv"
        charset = "utf-8" />
    <cfset lineNum = "0" />
    <cfloop list="#csv#" index = "line" delimiters = "#chr(10)##chr(13)#">
        <cfset lineNum = IncrementValue(lineNum) />
        <cfif lineNum EQ 1 AND FORM.skipFirst EQ 1>
            <!--- We're skipping the first line --->
        <cfelse>
            <cfset vals = ListToArray(line, ",") />
            <!--- this simply performs an insert query --->
            <cfset Model.updateAttribute(vals) />
        </cfif>
    </cfloop>
    
    <cffile action = "delete" file = "#GetTempDirectory()#\#fileName#.#fileExt#" />
    <cfset flash("notice", "Your update was successful.") />
    <cflocation url = "?page=#URL.page#&action=browse" />
</cfif>
 
And this is its php equivalent

Code: Select all

if (array_key_exists("submit", $_POST)) {
    set_time_limit(1800); // 30 minutes
    $skipfirst = (array_key_exists("skipFirst", $_POST)) ? $_POST['skipFirst'] : false; 
    $uploadfile = getTempDirectory() . basename($_FILES['uploadcsv']['name']);
    if (move_uploaded_file($_FILES['uploadcsv']['tmp_name'], $uploadfile)) {
        $csv = new Csv_Reader($uploadfile);
        if ($skipfirst) $csv->next();
        while($row = $csv->getRow()) {
            $model->updateAttribute($row);
        }
        unlink($uploadfile);
        flash("All data has been imported successfully", "notice");
    } else {
        flash("Your upload failed", "error");
    }
    redirect("/", array('page' => $_GET['page'], 'action' => 'browse'));
}
I don't understand how reading a file in chunks would be any better than what I'm doing now. I mean, it would still have to read the same amount of data, right? :?

Re: Processing a CSV file in chunks

Posted: Wed Apr 08, 2009 12:57 pm
by Christopher
Luke wrote:I don't understand how reading a file in chunks would be any better than what I'm doing now. I mean, it would still have to read the same amount of data, right? :?
IANACFP, but I think the problem is that <cffile> reads the whole file into a variable -- like file_get_contents(). So your PHP code is not equivalent. I don't think ColdFusion has a way to read files C-style with open/read/close, so you may have to execute some external program to split the file into chunks.
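The PHP way to keep memory flat is to stream the file instead of slurping it; something like this sketch (data.csv and the row handling are just placeholders):

```php
<?php
// Reads the CSV one parsed row at a time; memory use stays constant
// no matter how big the file is, unlike file_get_contents().
$fh = fopen('data.csv', 'r');
if ($fh) {
    while (($row = fgetcsv($fh)) !== false) {
        // process one row here, e.g. run the insert query
    }
    fclose($fh);
}
?>
```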

Re: Processing a CSV file in chunks

Posted: Wed Apr 08, 2009 1:02 pm
by McInfo
A quick search turned up http://livedocs.adobe.com/coldfusion/8/ ... -g_12.html

Edit: This post was recovered from search engine cache.

Re: Processing a CSV file in chunks

Posted: Wed Apr 08, 2009 1:02 pm
by n00b Saibot
I'd like to think the original guy was trying not to burden the server too much... like by not loading all the data into memory at once...?

Are you actually hitting a problem there, or are you just looking for an explanation?

Re: Processing a CSV file in chunks

Posted: Wed Apr 08, 2009 1:04 pm
by Luke
Well the thing is... it doesn't seem to have a problem reading the whole file into memory. In fact, it inserted about 8,000 rows before it ran out of memory. :?

EDIT: McInfo, that looks like it might do the trick... but like I said it doesn't seem to be the reading of the whole file that is the problem. I think the main problem is that coldfusion sucks.

Re: Processing a CSV file in chunks

Posted: Thu Apr 09, 2009 1:05 pm
by McInfo
I mentioned FileOpen() because, in PHP, you need to open the file with fopen() (rather than file_get_contents()) to be able to use fseek() or fgets(). I was hoping there was a FileSeek() function in ColdFusion.

You might be able to loop through the file using FileReadLine() to move the file pointer ahead to a specific line, then process only a few lines. The following PHP code demonstrates this strategy.

read_chunk.php

Code: Select all

<?php
$fh = fopen('example_long_file.txt', 'r');
 
// Line number to start at (first is 0)
$start = 5;
 
// Number of lines to get
$get = 5;
 
if ($fh)
{
    // Current line number (increments to 0 before first line is read)
    $line = -1;

    // Loops through every line before line $start + $get
    while ($line < $start + $get - 1 && !feof($fh))
    {
        // Increments the line counter
        $line++;

        // Gets the current line and moves the pointer ahead
        $line_contents = fgets($fh);

        if ($line < $start)
        {
            // Debugging only: Shows the lines that are skipped
            echo "<div style='color:#CCC'>$line: $line_contents</div>\n";

            // Does not process any line before the start line
            continue;
        }

        // Process the "gotten" lines here

        // Debugging only: Shows the lines that are processed
        echo "<div style='font-weight:bold'>$line: $line_contents</div>\n";
    }

    fclose($fh);
}
?>
example_long_file.txt
[text]THE BROOKLYN BRIDGE
PEAKS AND VALLEYS
CONSTRUCTION FOREMAN
UNSPOILED OCEAN VIEWS
NEW YEAR'S & UNITED NATIONS RESOLUTION
MAKE IT SNAPPY
WOOL CREW NECK SWEATER
PROVIDENCE RHODE ISLAND
HOMEBODY
SLEDS AND TOBOGGANS
POPLAR TREES
JOHANNESBURG SOUTH AFRICA
HOT ARTICHOKE SPINACH DIP
HAVING THE ENTIRE BEACH TO YOURSELF
THE BIGGEST LOSER
NEEDLE-NOSE PLIERS
GIVE ME A HINT
SUGAR COOKIES
INNKEEPER
TRADITIONAL SPANISH OMELETTE
A FIGMENT OF YOUR IMAGINATION
IS THERE A DOCTOR IN THE HOUSE DRESSING
STORAGE UNIT
THE DRYER'S LINT SCREEN
FASHION POLICE
BABY'S BREATH
METEOROLOGIST
STATE CAPITAL
WIDE-BRIMMED STRAW HAT
PATRICIA ARQUETTE STARS IN MEDIUM[/text]
Output of read_chunk.php
0: THE BROOKLYN BRIDGE
1: PEAKS AND VALLEYS
2: CONSTRUCTION FOREMAN
3: UNSPOILED OCEAN VIEWS
4: NEW YEAR'S & UNITED NATIONS RESOLUTION

5: MAKE IT SNAPPY
6: WOOL CREW NECK SWEATER
7: PROVIDENCE RHODE ISLAND
8: HOMEBODY
9: SLEDS AND TOBOGGANS
Edit: This post was recovered from search engine cache.

Re: Processing a CSV file in chunks

Posted: Fri Apr 10, 2009 7:37 am
by Luke
I am on ColdFusion 7, and the File* family of functions was added in version 8. So instead I used Java's io classes to read the file one line at a time. I still run out of memory. Should I maybe chop it into manageable chunks and process them one at a time? :(

This is what it looks like now that I've switched to Java io and process one line at a time.

Code: Select all

<cfif StructKeyExists(FORM, "submit")>
    <cfparam name = "FORM.skipFirst" default = "0" />
    <cffile action = "upload"
            fileField = "csv"
            destination = "#GetTempDirectory()#"
            nameConflict = "overwrite"
            accept = "text/plain, text/csv, text/comma-separated-values, application/octet-stream, application/vnd.ms-excel" />
 
    <cfset fileName = cffile.clientFilename />
    <cfset fileExt = cffile.clientFileExt />
    <!--- read the file line by line to avoid running out of memory --->
    <cfset filePath = "#GetTempDirectory()##fileName#.#fileExt#" />
    <cfset fileReader = CreateObject("java", "java.io.FileReader").Init(JavaCast("string", filePath)) />
    <cfset bufferedReader = CreateObject("java", "java.io.BufferedReader").Init(fileReader) />
    <cfset lineReader = CreateObject("java", "java.io.LineNumberReader" ).Init(bufferedReader) />
    <cfset request.lineData = lineReader.readLine() />
    
    <cfloop condition="StructKeyExists(request, 'lineData')">
        <cfset lineNum = lineReader.GetLineNumber() />
        <cfif form.skipFirst EQ 1 AND lineNum EQ 1>
            <!--- this is the only way I could get this to work... --->
            <cfset request.lineData = lineReader.readLine() />
        </cfif>
        <cfset vals = ListToArray(request.lineData) />
        <cfset Model.updateAttribute(URL.id, vals) />
        <cfset request.lineData = lineReader.readLine() />
    </cfloop>
            
    <cfset flash("notice", "Your update was successful.") />
    <cflocation url = "?page=#URL.page#&action=browse&id=#URL.id#" />
</cfif>
(Oh, and I moved this to misc. since it really isn't a PHP topic.)

Re: Processing a CSV file in chunks

Posted: Fri Apr 10, 2009 7:44 am
by Benjamin
Read a chunk into a buffer, say 100 lines at a time or so. Process the buffer, clear the processed part (if you can't process 100% of it), then read another chunk into the buffer, appending it to what is left after you clear it.
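A minimal PHP sketch of that idea (the $flush callback here stands in for whatever the real per-batch work is, e.g. one multi-row INSERT):

```php
<?php
// Reads $file in batches of $batchSize rows; only one batch's worth
// of data is ever held in memory at a time.
function import_in_batches($file, $batchSize, $flush)
{
    $fh = fopen($file, 'r');
    $buffer = array();
    while (($row = fgetcsv($fh)) !== false) {
        $buffer[] = $row;
        if (count($buffer) >= $batchSize) {
            $flush($buffer);   // process the full buffer...
            $buffer = array(); // ...then clear the processed part
        }
    }
    if (count($buffer) > 0) {
        $flush($buffer); // flush the final partial batch
    }
    fclose($fh);
}
?>
```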

I had forgotten what coldfusion looked like. Any chance of them switching to PHP?

Re: Processing a CSV file in chunks

Posted: Fri Apr 10, 2009 7:47 am
by Luke
No chance in hell. :(

Re: Processing a CSV file in chunks

Posted: Fri Apr 10, 2009 7:49 am
by Luke
McInfo wrote:You could build an Web interface that allows you read and write the data a chunk at a time. This would add some manual labor to the insertion process, but you wouldn't get timeouts. The interface would keep track of what line of your source file you were on and increment a position counter by a fixed chunk size after each request. You script would seek to a new position in your source file based on chunk size times request count. Then it would read a chunk of data and do whatever you need with it.

I built something like this to upload phrases and words to a database from a very large text file. If my description wasn't sufficient, I can upload some sample code.
I guess I'll have to do this...

Re: Processing a CSV file in chunks

Posted: Fri Apr 10, 2009 11:39 am
by Luke
Now I'm watching my memory in Windows Task Manager. It does fine until it gets to the 5,000th line of the file, then it quickly climbs up to 500,000K and craps out. I added the following to force garbage collection every 25 lines, and it seems to have done me no good.

Code: Select all

<cfif lineNum % 25 EQ 0>
    <cfset runtime = CreateObject("java", "java.lang.Runtime").getRuntime() />
    <cfset runtime.gc() />
</cfif>

Re: Processing a CSV file in chunks

Posted: Fri Apr 10, 2009 11:43 am
by Benjamin
Hmm, could you install and use PHP just for that task? Possibly even just invoke a PHP script from a system call? I don't even know if CF can do that.