Page 1 of 1

INSERT Query via CSV taking ages

Posted: Thu Apr 10, 2014 5:39 am
by simonmlewis
I am trying to upload a 350,000 line CSV file.
Each row is checking that it doesn't exist already (to avoid duplicates), and then remove the " at the end of the run.
I thought it would take maybe a minute or two. It's taking a lot longer. Is my script badly done, causing it to be slow?

Code: Select all

if ($update == "addcsv")
{
if ($_FILES[csv][size] > 0) 
{
    //get the csv file
    $file = $_FILES[csv][tmp_name];
    $handle = fopen($file,"r");
    //loop through the csv file and insert into database
    do {
        if ($data[0]) {
        $newdata = mysql_real_escape_string($data[0]);
        
        $result = mysql_query ("SELECT email FROM subscribed_upload WHERE email = '$newdata'");
        $num_result = mysql_num_rows($result);
        if ($num_result == 0)
        {
            mysql_query("INSERT INTO subscribed_upload (email) VALUES
                (
                     '$newdata'
                )
            ")or die(mysql_error());
        }
            $count = $count + 1;
        }
    } while ($data = fgetcsv($handle,1000,",","'"));
}
mysql_query ("UPDATE subscribed_upload SET email = REPLACE(email,'\"','')");

}

Re: INSERT Query via CSV taking ages

Posted: Thu Apr 10, 2014 6:55 am
by simonmlewis
It's definitely wrong, as I stripped it back to this:

Code: Select all

if ($update == "addcsv")
{
if ($_FILES[csv][size] > 0) 
{
    //get the csv file
    $file = $_FILES[csv][tmp_name];
    $handle = fopen($file,"r");
    //loop through the csv file and insert into database
    do {
        if ($data[0]) {
        $newdata = mysql_real_escape_string($data[0]);
        
 
            mysql_query("INSERT INTO subscribed_upload (email) VALUES
                (
                     '$newdata'
                )
            ")or die(mysql_error());

            $count = $count + 1;
        }
    } while ($data = fgetcsv($handle,1000,",","'"));
}
mysql_query ("UPDATE subscribed_upload SET email = REPLACE(email,'\"','')");

}
And it ran in under a minute. The previous version took 2 hours and hadn't finished.

I need to do the INSERT query, but to check the email "newdata" being passed through, doesn't already exist.

Re: INSERT Query via CSV taking ages

Posted: Thu Apr 10, 2014 7:11 am
by simonmlewis
I don't think I can use DUPLICATE KEY because the field, "email" is not a primary key.

Re: INSERT Query via CSV taking ages

Posted: Thu Apr 10, 2014 8:08 am
by Celauran
Is the email field indexed as unique? Probably makes sense as that's a constraint you're clearly looking to enforce. That would also allow you to use INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE.

Re: INSERT Query via CSV taking ages

Posted: Thu Apr 10, 2014 8:40 am
by simonmlewis
Yeah - it wasn't, but is now. I recreate the table with only "email", and set it as a primary key. The script now works, and quickly.
Cool.

Re: INSERT Query via CSV taking ages

Posted: Wed Apr 23, 2014 6:44 am
by simonmlewis
Hi there

I still have this issue, but we cannot have the email field as primary, as we use the id for something else.

So is there another way to check for duplicates and update or ignore?

Re: INSERT Query via CSV taking ages

Posted: Wed Apr 23, 2014 6:47 am
by Celauran
It doesn't need to be primary. Unique should suffice.

Re: INSERT Query via CSV taking ages

Posted: Wed Apr 23, 2014 6:50 am
by simonmlewis
Ok two questions then:
how do I find all duplicoates and delete so only one remains (to tidy it before changing to Unique)
how do I set the email field to be unique?

Re: INSERT Query via CSV taking ages

Posted: Wed Apr 23, 2014 7:00 am
by Celauran
Finding the duplicates should be pretty straightforward. Determining which one(s) to delete may be a little more difficult.

Code: Select all

SELECT id, email
FROM subscribed_upload AS subs
INNER JOIN (SELECT email FROM subscribed_upload GROUP BY email HAVING COUNT(id) > 1) AS dupes ON subs.email = dupes.email
CREATE INDEX

Re: INSERT Query via CSV taking ages

Posted: Wed Apr 23, 2014 7:19 am
by simonmlewis
#1052 - Column 'email' in field list is ambiguous

Re: INSERT Query via CSV taking ages

Posted: Wed Apr 23, 2014 8:17 am
by simonmlewis
Got it.
Used something slightly different, and then set the field as Unique and then my script worked.