Processing a CSV file in chunks

XML, Perl, Python, and other languages can be discussed here, even if it isn't PHP (We might forgive you).

Moderator: General Moderators

User avatar
Luke
The Ninja Space Mod
Posts: 6424
Joined: Fri Aug 05, 2005 1:53 pm
Location: Paradise, CA

Processing a CSV file in chunks

Post by Luke »

I'm actually not using PHP for this, I'm using ColdFusion (because that's what they use where I work :( ), but I figured if you guys could help me figure this out in PHP, I could easily translate it to ColdFusion code. So, here's my problem.

I have to import data from a really large CSV file (15,000+ lines) and everything I've done so far has resulted in the server running out of memory and/or timing out. I need to figure out how to split the file into manageable chunks and process chunk by chunk, but I'm not sure I know how to do that.

I mean, I know I could just chop the file into several files and process each file individually, but is that any better than doing them all at once? What makes it better? Believe it or not, I've really never had to do something like this before.
User avatar
Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Re: Processing a CSV file in chunks

Post by Christopher »

Why don't you read the file either line by line or in chunks? How are you reading the file now? And is it running out of memory or timing out? Those are different problems with different solutions.
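In PHP terms, "line by line" would be a loop over fgetcsv(), which keeps only one row in memory at a time. A minimal sketch, where `import.csv` and `insert_row()` are hypothetical stand-ins for the real file and the real insert query:

```php
<?php
// Hypothetical stand-in for the actual insert query.
function insert_row(array $row) {
    // ... run the insert here ...
}

// Open the file as a stream instead of slurping it whole.
$fh = fopen('import.csv', 'r');
if ($fh === false) {
    die('Could not open import.csv');
}

// Only the current row is ever resident in memory.
while (($row = fgetcsv($fh)) !== false) {
    insert_row($row);
}

fclose($fh);
```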
User avatar
McInfo
DevNet Resident
Posts: 1532
Joined: Wed Apr 01, 2009 1:31 pm

Re: Processing a CSV file in chunks

Post by McInfo »

You could build a Web interface that lets you read and write the data a chunk at a time. This would add some manual labor to the insertion process, but you wouldn't get timeouts. The interface would keep track of what line of your source file you were on and increment a position counter by a fixed chunk size after each request. Your script would seek to a new position in the source file based on chunk size times request count, then read a chunk of data and do whatever you need with it.

I built something like this to upload phrases and words to a database from a very large text file. If my description wasn't sufficient, I can upload some sample code.
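In PHP, that per-request scheme might look like the sketch below: SplFileObject::seek() jumps straight to a line number, so each request only reads its own chunk. The function name and the idea that the chunk index arrives as a request parameter are assumptions:

```php
<?php
// Hypothetical helper: return one chunk of lines from a large file.
// The caller would pass the chunk index from a request parameter and
// bump it on each request (chunk index * chunk size = first line).
function read_chunk($path, $chunkIndex, $chunkSize) {
    $file = new SplFileObject($path, 'r');
    $file->seek($chunkIndex * $chunkSize); // position at the chunk's first line

    $lines = array();
    while (count($lines) < $chunkSize && $file->valid()) {
        $lines[] = rtrim($file->current(), "\r\n");
        $file->next();
    }
    return $lines;
}
```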

Edit: This post was recovered from search engine cache.
Last edited by McInfo on Mon Jun 14, 2010 12:45 pm, edited 1 time in total.
User avatar
Luke
The Ninja Space Mod
Posts: 6424
Joined: Fri Aug 05, 2005 1:53 pm
Location: Paradise, CA

Re: Processing a CSV file in chunks

Post by Luke »

Sorry, I guess I didn't describe my problem as well as I could have. The main issue is not timeouts; I have extended the max execution time and solved that problem. The problem now is that I'm running out of memory. When you say I could read it in chunks, how does that work? Right now my code looks like this (my code is ColdFusion, but this is what it would look like if it were PHP). Oh, and I should mention I didn't write this. It was written by a programmer who no longer works here, and I'm fixing it.

Here is what the code ACTUALLY is

Code: Select all

<cfif StructKeyExists(FORM, "submit")>
    <cfsetting requesttimeout="1800" /> <!--- 30 minutes --->
    <cfparam name = "FORM.skipFirst" default = "0" />
    <cffile action = "upload"
            fileField = "csv"
            destination = "#GetTempDirectory()#"
            nameConflict = "overwrite"
            accept = "text/plain, text/csv, text/comma-separated-values, application/vnd.ms-excel" />
 
    <cfset fileName = cffile.clientFilename />
    <cfset fileExt = cffile.clientFileExt />
    <cffile
        action = "read"
        file = "#GetTempDirectory()#\#fileName#.#fileExt#"
        variable = "csv"
        charset = "utf-8" />
    <cfset lineNum = "0" />
    <cfloop list="#csv#" index = "line" delimiters = "#chr(10)##chr(13)#">
        <cfset lineNum = IncrementValue(lineNum) />
        <cfif lineNum EQ 1 AND FORM.skipFirst EQ 1>
            <!--- We're skipping the first line --->
        <cfelse>
            <cfset vals = ListToArray(line, ",") />
            <!--- this simply performs an insert query --->
            <cfset Model.updateAttribute(vals) />
        </cfif>
    </cfloop>
    
    <cffile action = "delete" file = "#GetTempDirectory()#\#fileName#.#fileExt#" />
    <cfset flash("notice", "Your update was successful.") />
    <cflocation url = "?page=#URL.page#&action=browse" />
</cfif>
 
And this is its php equivalent

Code: Select all

if (array_key_exists("submit", $_POST)) {
    set_time_limit(1800); // 30 minutes
    $skipfirst = (array_key_exists("skipFirst", $_POST)) ? $_POST['skipFirst'] : false; 
    $uploadfile = getTempDirectory() . basename($_FILES['uploadcsv']['name']);
    if (move_uploaded_file($_FILES['uploadcsv']['tmp_name'], $uploadfile)) {
        $csv = new Csv_Reader($uploadfile);
        if ($skipfirst) $csv->next();
        while($row = $csv->getRow()) {
            $model->updateAttribute($row);
        }
        unlink($uploadfile);
        flash("All data has been imported successfully", "notice");
    } else {
        flash("Your upload failed", "error");
    }
    redirect("/", array('page' => $_GET['page'], 'action' => 'browse'));
}
I don't understand how reading a file in chunks would be any better than what I'm doing now. I mean, it would still have to read the same amount of data, right? :?
User avatar
Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Re: Processing a CSV file in chunks

Post by Christopher »

Luke wrote:I don't understand how reading a file in chunks would be any better than what I'm doing now. I mean, it would still have to read the same amount of data, right? :?
IANACFP, but I think the problem is that <cffile> reads the whole file into a variable -- like file_get_contents(). So your PHP code is not equivalent. I don't think there is a way to read files C-style with open/read/close. You will probably have to execute some external program to split the file into chunks.
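The distinction can be seen in miniature: reading in fixed-size pieces with fread() moves the same total number of bytes, but peak memory is bounded by the piece size rather than the file size. A sketch, with `big.csv` as a hypothetical file name:

```php
<?php
// Same total bytes either way, but only ~8 KB is resident per iteration,
// versus file_get_contents(), which holds the entire file at once.
$fh = fopen('big.csv', 'r');
$total = 0;
while (!feof($fh)) {
    $chunk = fread($fh, 8192); // one 8 KB piece at a time
    $total += strlen($chunk);
}
fclose($fh);
echo "Read $total bytes\n";
```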
User avatar
McInfo
DevNet Resident
Posts: 1532
Joined: Wed Apr 01, 2009 1:31 pm

Re: Processing a CSV file in chunks

Post by McInfo »

A quick search turned up http://livedocs.adobe.com/coldfusion/8/ ... -g_12.html

Edit: This post was recovered from search engine cache.
Last edited by McInfo on Mon Jun 14, 2010 12:45 pm, edited 1 time in total.
User avatar
n00b Saibot
DevNet Resident
Posts: 1452
Joined: Fri Dec 24, 2004 2:59 am
Location: Lucknow, UP, India
Contact:

Re: Processing a CSV file in chunks

Post by n00b Saibot »

I would like to think the original guy was trying not to burden the server too much... like by loading all the data into memory at once...?

Do you face any real problem there, or are you just looking for explanations?
User avatar
Luke
The Ninja Space Mod
Posts: 6424
Joined: Fri Aug 05, 2005 1:53 pm
Location: Paradise, CA

Re: Processing a CSV file in chunks

Post by Luke »

Well the thing is... it doesn't seem to have a problem reading the whole file into memory. In fact, it inserted about 8,000 rows before it ran out of memory. :?

EDIT: McInfo, that looks like it might do the trick... but like I said it doesn't seem to be the reading of the whole file that is the problem. I think the main problem is that coldfusion sucks.
User avatar
McInfo
DevNet Resident
Posts: 1532
Joined: Wed Apr 01, 2009 1:31 pm

Re: Processing a CSV file in chunks

Post by McInfo »

I mentioned FileOpen() because, in PHP, you need to open the file with fopen() (rather than file_get_contents()) to be able to use fseek() or fgets(). I was hoping there was a FileSeek() function in ColdFusion.

You might be able to loop through the file using FileReadLine() to move the file pointer ahead to a specific line, then process only a few lines. The following PHP code demonstrates this strategy.

read_chunk.php

Code: Select all

<?php
$fh = fopen('example_long_file.txt', 'r');

// Line number to start at (first is 0)
$start = 5;

// Number of lines to get
$get = 5;

if ($fh)
{
    // Current line number (increments to 0 before first line is read)
    $line = -1;

    // Loops through every line before line $start + $get
    while ($line < $start + $get - 1 && !feof($fh))
    {
        // Increments the line counter
        $line++;

        // Gets the current line and moves the pointer ahead
        $line_contents = fgets($fh);

        if ($line < $start)
        {
            // Debugging only: shows the lines that are skipped
            echo "<div style='color:#CCC'>$line: $line_contents</div>\n";

            // Does not process any line before the start line
            continue;
        }

        // Process the "gotten" lines here

        // Debugging only: shows the lines that are processed
        echo "<div style='font-weight:bold'>$line: $line_contents</div>\n";
    }

    fclose($fh);
}
?>
example_long_file.txt
THE BROOKLYN BRIDGE
PEAKS AND VALLEYS
CONSTRUCTION FOREMAN
UNSPOILED OCEAN VIEWS
NEW YEAR'S & UNITED NATIONS RESOLUTION
MAKE IT SNAPPY
WOOL CREW NECK SWEATER
PROVIDENCE RHODE ISLAND
HOMEBODY
SLEDS AND TOBOGGANS
POPLAR TREES
JOHANNESBURG SOUTH AFRICA
HOT ARTICHOKE SPINACH DIP
HAVING THE ENTIRE BEACH TO YOURSELF
THE BIGGEST LOSER
NEEDLE-NOSE PLIERS
GIVE ME A HINT
SUGAR COOKIES
INNKEEPER
TRADITIONAL SPANISH OMELETTE
A FIGMENT OF YOUR IMAGINATION
IS THERE A DOCTOR IN THE HOUSE DRESSING
STORAGE UNIT
THE DRYER'S LINT SCREEN
FASHION POLICE
BABY'S BREATH
METEOROLOGIST
STATE CAPITAL
WIDE-BRIMMED STRAW HAT
PATRICIA ARQUETTE STARS IN MEDIUM
Output of read_chunk.php
0: THE BROOKLYN BRIDGE
1: PEAKS AND VALLEYS
2: CONSTRUCTION FOREMAN
3: UNSPOILED OCEAN VIEWS
4: NEW YEAR'S & UNITED NATIONS RESOLUTION

5: MAKE IT SNAPPY
6: WOOL CREW NECK SWEATER
7: PROVIDENCE RHODE ISLAND
8: HOMEBODY
9: SLEDS AND TOBOGGANS
Edit: This post was recovered from search engine cache.
Last edited by McInfo on Mon Jun 14, 2010 12:49 pm, edited 1 time in total.
User avatar
Luke
The Ninja Space Mod
Posts: 6424
Joined: Fri Aug 05, 2005 1:53 pm
Location: Paradise, CA

Re: Processing a CSV file in chunks

Post by Luke »

I am on ColdFusion 7, and the File* family of functions was added in version 8. So instead I used Java's io package to read the file one line at a time. I still run out of memory. Should I maybe chop it into manageable chunks and process them one at a time? :(

This is what it looks like now that I've switched to Java io and process one line at a time.

Code: Select all

<cfif StructKeyExists(FORM, "submit")>
    <cfparam name = "FORM.skipFirst" default = "0" />
    <cffile action = "upload"
            fileField = "csv"
            destination = "#GetTempDirectory()#"
            nameConflict = "overwrite"
            accept = "text/plain, text/csv, text/comma-separated-values, application/octet-stream, application/vnd.ms-excel" />
 
    <cfset fileName = cffile.clientFilename />
    <cfset fileExt = cffile.clientFileExt />
        <!--- read the file line by line to avoid running out of memory --->
    <cfset filePath = "#GetTempDirectory()##fileName#.#fileExt#" />
    <cfset fileReader = CreateObject("java", "java.io.FileReader").Init(JavaCast("string", filePath)) />
    <cfset bufferedReader = CreateObject("java", "java.io.BufferedReader").Init(fileReader) />
    <cfset lineReader = CreateObject("java", "java.io.LineNumberReader" ).Init(bufferedReader) />
    <cfset request.lineData = lineReader.readLine() />
    
    <cfloop condition="StructKeyExists(request, 'lineData')">
        <cfset lineNum = lineReader.GetLineNumber() />
        <cfif form.skipFirst EQ 1 AND lineNum EQ 1>
            <!--- this is the only way I could get this to work... --->
            <cfset request.lineData = lineReader.readLine() />
        </cfif>
        <cfset vals = ListToArray(request.lineData) />
        <cfset Model.updateAttribute(URL.id, vals) />
        <cfset request.lineData = lineReader.readLine() />
    </cfloop>
            
    <cfset flash("notice", "Your update was successful.") />
    <cflocation url = "?page=#URL.page#&action=browse&id=#URL.id#" />
</cfif>
(Oh, and I moved this to misc. since it really isn't a PHP topic.)
User avatar
Benjamin
Site Administrator
Posts: 6935
Joined: Sun May 19, 2002 10:24 pm

Re: Processing a CSV file in chunks

Post by Benjamin »

Read a chunk into a buffer, say 100 lines at a time or so. Process the buffer, clear the processed part (if you can't process 100% of it), then read another chunk into the buffer, appending it to what is left after you clear it.
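That buffer-with-leftover strategy might be sketched in PHP like this: each fread() chunk is split on newlines, complete lines are processed, and the trailing partial line is carried into the next chunk. `import.csv` and `process_line()` are hypothetical stand-ins:

```php
<?php
// Hypothetical stand-in for whatever happens to each line.
function process_line($line) {
    // ... insert query, etc. ...
}

$fh = fopen('import.csv', 'r');
$carry = ''; // unprocessed tail of the previous chunk

while (!feof($fh)) {
    // Prepend the leftover partial line to the new chunk.
    $buffer = $carry . fread($fh, 8192);
    $lines = explode("\n", $buffer);

    // The last piece may be an incomplete line; keep it for the next pass.
    $carry = array_pop($lines);

    foreach ($lines as $line) {
        process_line($line);
    }
}

// A final line without a trailing newline ends up in $carry.
if ($carry !== '') {
    process_line($carry);
}
fclose($fh);
```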

I had forgotten what coldfusion looked like. Any chance of them switching to PHP?
User avatar
Luke
The Ninja Space Mod
Posts: 6424
Joined: Fri Aug 05, 2005 1:53 pm
Location: Paradise, CA

Re: Processing a CSV file in chunks

Post by Luke »

No chance in hell. :(
User avatar
Luke
The Ninja Space Mod
Posts: 6424
Joined: Fri Aug 05, 2005 1:53 pm
Location: Paradise, CA

Re: Processing a CSV file in chunks

Post by Luke »

McInfo wrote:You could build an Web interface that allows you read and write the data a chunk at a time. This would add some manual labor to the insertion process, but you wouldn't get timeouts. The interface would keep track of what line of your source file you were on and increment a position counter by a fixed chunk size after each request. You script would seek to a new position in your source file based on chunk size times request count. Then it would read a chunk of data and do whatever you need with it.

I built something like this to upload phrases and words to a database from a very large text file. If my description wasn't sufficient, I can upload some sample code.
I guess I'll have to do this...
User avatar
Luke
The Ninja Space Mod
Posts: 6424
Joined: Fri Aug 05, 2005 1:53 pm
Location: Paradise, CA

Re: Processing a CSV file in chunks

Post by Luke »

Now I'm watching my memory in Windows Task Manager. It does fine until it gets to the 5,000th line in the file, and then it quickly climbs to 500,000 K and craps out. I added the following to see if I could force garbage collection every 25 lines, but it seems to have done me no good.

Code: Select all

<cfif lineNum % 25 EQ 0>
    <cfset runtime = CreateObject("java", "java.lang.Runtime").getRuntime() />
    <cfset runtime.gc() />
</cfif>
User avatar
Benjamin
Site Administrator
Posts: 6935
Joined: Sun May 19, 2002 10:24 pm

Re: Processing a CSV file in chunks

Post by Benjamin »

Hmm, could you install and use PHP just for that task? Possibly even just invoking a PHP script from a system call? I don't even know if CF can do that.
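For what it's worth, ColdFusion can shell out with <cfexecute>, so something like the following sketch could hand the heavy lifting to a PHP CLI script. The paths and the import.php script are assumptions:

```cfm
<!--- Sketch only: the php.exe path and import.php script are hypothetical. --->
<cfexecute name="C:\php\php.exe"
           arguments="C:\scripts\import.php #filePath#"
           timeout="1800"
           variable="importOutput" />
<cfoutput>#importOutput#</cfoutput>
```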