Batching an import [solved]

Not for 'how-to' coding questions but for PHP theory: this forum is here for those of us who wish to learn about the design aspects of programming with PHP.

Moderator: General Moderators

josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Batching an import [solved]

Post by josh »

Right now I'm writing a plugin for another system. I can't really access cron or anything fancy.

My users are uploading huge files, these are CSV files with rows that correspond roughly to rows in a table (although the actual database is more normalized than the import format, the import format is intentionally de-normalized). Right now it goes row by row. First it issues a query to see if it already knows about the row. If it does not, it inserts a corresponding row.

Now the issue is that I want it to go 100 rows at a time and batch the INSERT statements. The problem I currently see with doing that is that this is make/model/year data. So let's step through it in pseudo-code.

Import Format:

Code: Select all

Honda,Civic,2000
There are these tables: Make, Model, Year, Vehicle.

The code looks at this row and issues a SELECT (or looks in memory) to see if it has loaded, or has in the database, a make called "Honda". Let's pretend we do and it was make #1. Next it looks at "Civic", and let's assume it has never been encountered before. The system issues an INSERT for Civic. Whatever ID number it gets back then has to be inserted into a master vehicles table. So if Honda = 1, Civic = 8, '2000' = 2 (years are normalized for technical reasons), it next has to:

Code: Select all

INSERT INTO `vehicle` (make,model,year) values(1,8,2);
Since it's a hierarchy, and so many consistency checks are needed, I don't really see how it can be batched. I could use ON DUPLICATE KEY and just issue a grotesque number of inserts (which would probably be faster than trying to be all efficient with the SELECT statements?). However, I'm admittedly having a hard time wrapping my head around it since it's a hierarchy. Does anyone have any ideas, or has anyone done anything like this in the past they are willing to share? I guess the problem is, I have to know that Civic belongs to Honda (make #1) before I start replacing rows. If another manufacturer comes out with a like-named model, it should not overwrite theirs even if the models have the same name, because they have different parent makes. The current algorithm meets the spec but is not the most efficient.
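One way the per-level batching could look, as a sketch only (the table names, UNIQUE keys, and no-op update are assumptions, not the actual schema). Batching one level of the hierarchy at a time, with a composite UNIQUE key on (make_id, name), also keeps a like-named model under a different make from colliding:

```sql
-- Assumed schema: make(id, name UNIQUE),
-- model(id, make_id, name, UNIQUE KEY (make_id, name)), ...

-- Level 1: insert every make seen in this 100-row chunk at once.
INSERT INTO make (name)
VALUES ('Honda'), ('Toyota'), ('Ford')
ON DUPLICATE KEY UPDATE name = name;   -- no-op when the make already exists

-- Read the IDs back in one query and cache them in PHP for the chunk.
SELECT id, name FROM make WHERE name IN ('Honda', 'Toyota', 'Ford');

-- Level 2: models, now that every make_id is known. The composite
-- UNIQUE key means 'Civic' under make 1 never clashes with a
-- same-named model under make 2.
INSERT INTO model (make_id, name)
VALUES (1, 'Civic'), (1, 'Accord'), (2, 'Corolla')
ON DUPLICATE KEY UPDATE name = name;
```

Years and the final vehicle rows would be batched the same way, one level after another, so each chunk costs a fixed handful of queries instead of several per row.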
Last edited by josh on Fri Sep 24, 2010 11:15 pm, edited 1 time in total.
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Batching an import

Post by VladSun »

Your approach seems fine to me. Basically I do the same.

// more or less off-topic

I'm working on a very similar problem. What I'm trying to do with my import "engine" is to reduce the number of spelling mistakes and badly formatted data. My CSV files are huge, and every manufacturer has their own "style" of formatting them. I often find a CSV line containing data for more than one vehicle. Worst of all, these imports have to be updated every year (or twice a year), and the old data may have been edited in some weird way.

So, I have an ImportManager object which registers (and assigns column positions to) my Importer objects, like MakeImporter, ModelImporter, etc.
These objects know how to parse the CSV line and how to extract and reformat their specific data. E.g. my StartYearImporter object knows it should look at column #4 for its input data, which may look like any of these:
[text]
"2004"
"04"
"04-08"
[/text]
The last line contains both the start year and the end year.
Every import object may decide that a single CSV line contains data for multiple records and request a multiplication of that line.
E.g. if the TypeImporter object receives this line:
[text]Honda Civic RS/SX/TS ...[/text] it will multiply this line into three lines:
[text]Honda Civic RS ...
Honda Civic SX ...
Honda Civic TS ...[/text]
It must be smart enough though ;)
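A minimal sketch of that "line multiplication" step (the function name, column index, and `/` delimiter are my own illustration, not the actual Importer API):

```php
<?php
// Split a combined type field like "RS/SX/TS" into one CSV line
// per type, copying every other column unchanged.
function multiplyLine(array $line, int $typeColumn): array
{
    $lines = [];
    foreach (explode('/', $line[$typeColumn]) as $type) {
        $copy = $line;
        $copy[$typeColumn] = trim($type);
        $lines[] = $copy;
    }
    return $lines;
}

// ['Honda', 'Civic', 'RS/SX/TS'] becomes three lines, one per type.
$rows = multiplyLine(['Honda', 'Civic', 'RS/SX/TS'], 2);
```

The "smart enough" part is deciding when a slash really separates types and when it is part of a single name, which is where the per-column Importer objects earn their keep.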

Also, some of the import objects know that their data may be in more than one CSV column.

I start by creating a temporary table containing the unparsed CSV lines plus fields like "hasError", "errorMessage", and "removed". I have duplicate import tables defined - make_import, model_import, etc. In order to keep track of previously parsed data, I keep the CSV data assigned to the corresponding vehicle.

Then I process it, starting with the "make" column (updating only the temporary table and the import tables):
- "multiple values" are detected and lines are multiplied (and marked as such);
- spelling mistakes are checked for using the Levenshtein distance (a DB UDF) and those rows are marked;
- the user must correct them or mark them as correct and update the table;
- for newly found makes an ID is generated; the rest are assigned the existing IDs.

The hierarchy is processed in a similar way (i.e. model, type, variant etc.)

Finally, I add the parsed data from these import tables to the "production" tables.
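The staging step above might be sketched like this (table and column names are my own, and LEVENSHTEIN() is assumed to be installed as a user-defined function - it is not built into MySQL):

```sql
-- Staging table holding the raw CSV lines plus bookkeeping flags.
CREATE TABLE csv_import (
    id INT AUTO_INCREMENT PRIMARY KEY,
    raw_line TEXT,
    make VARCHAR(64),
    hasError TINYINT NOT NULL DEFAULT 0,
    errorMessage VARCHAR(255),
    removed TINYINT NOT NULL DEFAULT 0
);

-- Flag makes that are close to, but not equal to, a known make,
-- so an operator can confirm or correct them before import.
UPDATE csv_import ci
JOIN make m ON LEVENSHTEIN(ci.make, m.name) BETWEEN 1 AND 2
SET ci.hasError = 1,
    ci.errorMessage = CONCAT('Possible misspelling of ', m.name);
```

Rows that survive with hasError = 0 can then be promoted into the make_import/model_import tables and finally into production.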

I must admit that these imports are a nasty thing to do. I have nearly 200 records per 10,000 CSV lines which must be handled by an operator because my import objects failed to parse them. I've even been thinking about NN solutions!?!
There are 10 types of people in this world, those who understand binary and those who don't
Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Re: Batching an import

Post by Christopher »

In cases like this, I tend to load the CSV into an import table that matches its schema using LOAD DATA INFILE -- because it is so fast. Then I run queries on that data joined with the existing tables to create the new data.
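A sketch of that approach for the make/model/year data from the first post (the file path, staging table, and column names are assumptions):

```sql
-- Load the raw CSV into a staging table whose columns mirror the file.
LOAD DATA LOCAL INFILE '/tmp/vehicles.csv'
INTO TABLE vehicle_import
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(make_name, model_name, year);

-- Then create only the rows that do not exist yet, set-based:
-- the LEFT JOIN finds staged makes with no match in the real table.
INSERT INTO make (name)
SELECT DISTINCT vi.make_name
FROM vehicle_import vi
LEFT JOIN make m ON m.name = vi.make_name
WHERE m.id IS NULL;
```

Models, years, and vehicles follow the same pattern, each joining the staging table against the level above it, so no PHP loop ever touches individual rows.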
josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Re: Batching an import

Post by josh »

Christopher wrote: Then I do queries on that data joined with existing tables to create the new data.
Interesting, but does this provide an advantage over reading the CSV directly, in terms of speed? I'm going to try structuring my code so it uses the INSERT ... ON DUPLICATE KEY syntax and see if that's faster than issuing a SELECT prior to each INSERT. Another idea I found on the interwebs was to try the INSERTs unconditionally and then look at the affected-rows count to see whether the row already existed (or catch the PDO exception).
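That affected-rows idea could look something like this (a sketch; the table name and `$pdo` connection are assumed, and the LAST_INSERT_ID(id) trick is a documented MySQL idiom for fetching the ID of an already-existing row):

```php
<?php
// Insert unconditionally and inspect the result instead of
// SELECTing first. With ON DUPLICATE KEY UPDATE, MySQL reports
// 1 affected row for a fresh insert and 2 for an updated duplicate
// (or 0 when nothing changed, depending on connection flags).
$stmt = $pdo->prepare(
    'INSERT INTO make (name) VALUES (:name)
     ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id)'
);
$stmt->execute([':name' => 'Honda']);

// lastInsertId() now yields the ID for new AND existing rows alike,
// which is exactly what the hierarchy needs before inserting models.
$makeId = (int) $pdo->lastInsertId();
$wasNew = ($stmt->rowCount() === 1);
```

This removes one round-trip per row and still hands back the parent ID the next level of the hierarchy depends on.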

This is a big problem with good practices (data normalization): if we do things "right", our systems run slow as hell. If my users all had access to cron and the expertise to set it up, it wouldn't be an issue. I'd just load the data straight into a temp table and process it in a background process, but alas, that doesn't seem to be the situation.
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Batching an import

Post by VladSun »

How about using an IFRAME together with ignore_user_abort()/set_time_limit(0)? Something like putting the process in the background.
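For the record, the page served inside such a frame would start roughly like this (a sketch; ignore_user_abort() and set_time_limit() are real PHP functions, the rest is illustrative):

```php
<?php
// Detach the import from the user's patience: keep running even if
// the browser navigates away or the connection drops.
ignore_user_abort(true);
// Lift PHP's execution time limit for this request.
set_time_limit(0);

// Flush something so the surrounding page renders immediately.
echo 'Import started...';
flush();

// ... long-running import loop goes here ...
```

The main page then only has to poll or display progress while the framed request grinds through the file.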
josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Re: Batching an import

Post by josh »

That's what Magento does. I think they re-invented the wheel, though, because they basically built a crappy version of the command line that still crashes. Right now my users can use a form file-upload field, and I'm adding a command-line version they can set up. That helps them stay patient because they'll be able to see it's importing at least a vehicle every second. But ideally I'd like 100 vehicles a second, you know? If the server is doing 1,000 queries a second on benchmarks, it shouldn't take 1+ second per vehicle or per product. Both Magento's own product import and my vehicle import (which runs separately) are slow. Magento uses the iframe trick and I do not, and both are slow & hated by the users. :-(
alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Re: Batching an import

Post by alex.barylski »

I do as arborint does, using SQL for the entire data conversion:

First I import the CSV data (or other tables, when converting Drupal, WordPress, etc. to my own CMS) into temp tables and perform the entire conversion in SQL. It can result in some interesting SQL, but it's nonetheless much easier than writing a script, IMO.

In my current project I have a series of import scripts that normalize the data, remove duplicates, map records, etc. For example, the original tables used a human-readable date format (coming from Access), so I run a quick SQL script to convert all dates into timestamps (which I prefer).

This is what one such import command looks like:

Code: Select all

INSERT INTO rpi_procedures
(
  SELECT 
    mh.`ID` AS `id_primary`,
    mh.`ID` AS `id_previous`,
    (SELECT `id_primary` FROM system_manufacturers WHERE `name` = mh.`Manufacturer` LIMIT 1) AS `id_manufacturer`,
    0 AS `id_task`,
    1 AS `id_region`,
    1 AS `id_status`,
  
    (
      SELECT 
        id_user 
      FROM 
        system_profiles 
      WHERE 
        UPPER(CONCAT_WS('.', SUBSTRING(first_name, 1, 1), last_name)) = UPPER(REPLACE(mh.`Originated by`, ' ', ''))
      LIMIT 1
    ) AS `id_originated`,
  
    (
      SELECT 
        id_user 
      FROM 
        system_profiles 
      WHERE 
        UPPER(CONCAT_WS('.', SUBSTRING(first_name, 1, 1), last_name)) = UPPER(REPLACE(mh.`Approved by`, ' ', ''))
      LIMIT 1
    ) AS `id_approved`,
  
    (
      SELECT 
        id_user 
      FROM 
        system_profiles 
      WHERE 
        UPPER(CONCAT_WS('.', SUBSTRING(first_name, 1, 1), last_name)) = UPPER(REPLACE(mh.`Revised by`, ' ', ''))
      LIMIT 1
    ) AS `id_revised`,
  
    UNIX_TIMESTAMP(mh.`Origin Date`) AS `date_originated`,
    UNIX_TIMESTAMP(mh.`Approved date`) AS `date_approved`,
    UNIX_TIMESTAMP(mh.`Revision date`) AS `date_revised`,
  
    mh.`Quote Step` AS `quote_step`,
    mh.`Revision` AS `revision`,
    mh.`Repair Type` AS `repair`,    
    mh.`Rpi Number` AS `number`,
    
    CONCAT_WS("\r\n\r\n", mh.`Model`, mh.`Manual`, mh.`Manual Revision`, mh.`Reference`) AS `notes`
  FROM 
    `tRpi Master head` AS mh
  ORDER BY id_manufacturer, number ASC
);
The old tables used hardcoded names for manufacturers (I use manufacturer IDs and store manufacturer names and details in another table, as shown), so before I do this I must normalize the manufacturer names and remove duplicates, etc. This is done with a series of UPDATEs to change misspellings, capitalization differences, etc. into a single manufacturer:

Old Data:

Code: Select all

UPDATE Procedures SET Manufacturer = 'Rolls Royce' WHERE Manufacturer = 'Rols Roice';
UPDATE Procedures SET Manufacturer = 'Rolls Royce' WHERE Manufacturer = 'Rolls Roys';
I have additional SQL scripts to check that everything is properly normalized/standardized before importing with the above script, so everything lines up 1:1.

The idea is pretty simple and powerful, and it saved me countless hours of writing it in PHP. You basically take one or more old tables and reconstruct the new table using SELECT and sub-SELECTs, and finally wrap that in an INSERT, using DISTINCT to remove duplicates as required.

Not sure if this is what you were looking for, but when I read arborint's post I figured it must be something like this.

Cheers,
Alex
josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Re: Batching an import

Post by josh »

Maybe I missed something, but where/how does it rely on DISTINCT? I'm going to have to wrap my head around this and see if it could work.
alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Re: Batching an import

Post by alex.barylski »

Maybe I missed something but where/how does it rely on DISTINCT
That particular query didn't need DISTINCT, but you could use DISTINCT to remove any identical duplicates before insertion. Alternatively, I have used GROUP BY on occasions where I need to remove duplicates, but only on certain fields.
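The two dedup styles could be sketched like this against the earlier Procedures example (the make/model target tables are assumptions):

```sql
-- DISTINCT collapses rows that are identical in EVERY selected
-- column before they are inserted.
INSERT INTO make (name)
SELECT DISTINCT Manufacturer
FROM Procedures;

-- GROUP BY dedupes on chosen columns only: one row per
-- (make, model) pair survives, whatever the other columns held.
INSERT INTO model (make_id, name)
SELECT m.id, p.Model
FROM Procedures p
JOIN make m ON m.name = p.Manufacturer
GROUP BY m.id, p.Model;
```

The GROUP BY form is the one that matters for the hierarchy, since it keys the dedup on the parent ID plus the name rather than on the whole row.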

There are times, if your data schema is very complex, when plain SQL will not do the trick. I'm not sure you can do this in MySQL (though I believe you can), but in Access what I did was implement custom functions/modules and use them as field evaluators, passing in the required information and returning the field value you need, using an imperative language. I only had to resort to this once, however, when converting about 12 heavy tables into about 50 normalized tables. So unless you have really weird requirements, you should be good.

Cheers,
Alex
josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Re: Batching an import

Post by josh »

Can you post an example of how you use DISTINCT with this technique?
Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Re: Batching an import

Post by Christopher »

josh wrote: Interesting, but does this provide an advantage over reading the CSV directly, in terms of speed?
It is faster because the LOAD DATA is quick and everything else is done in the database, so there is no PHP looping through records. Everything besides the load is queries working on the de-normalized data.
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Batching an import

Post by VladSun »

You may wish to use a stored procedure. I know you "love" them :P
josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Re: Batching an import

Post by josh »

I don't have anything against them; I just have certain design heuristics, like "stored procedures should not be a first choice". In this circumstance it's possible they would provide an elegant solution. I like the thing PCSpectra posted the most, though; I'm going to try that out.

Is there any way to unit test a stored procedure?
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Batching an import

Post by VladSun »

If you use an SP, it will save you and your angry users the time needed to send/parse/compile your SQL queries. Together with the LOAD DATA INFILE approach, it may be a big performance improvement.
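Such a procedure might wrap the post-LOAD processing like this (a sketch only; the procedure and table names are assumptions, and the DELIMITER lines are for the mysql command-line client):

```sql
-- Wrap the set-based import steps so the whole batch is parsed once
-- and runs server-side with a single CALL.
DELIMITER //
CREATE PROCEDURE import_vehicles()
BEGIN
    -- Create only the makes that do not exist yet.
    INSERT INTO make (name)
    SELECT DISTINCT vi.make_name
    FROM vehicle_import vi
    LEFT JOIN make m ON m.name = vi.make_name
    WHERE m.id IS NULL;

    -- ... models, years, and vehicle rows processed the same way ...
END //
DELIMITER ;

CALL import_vehicles();
```

The PHP side then shrinks to a LOAD DATA INFILE plus one CALL, instead of thousands of individual statements crossing the wire.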

Unit testing - wouldn't it be as simple as writing a unit test for a PHP function?
josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Re: Batching an import

Post by josh »

Hmm, well, I'm all for making the users happy, but I want to consider all the design factors. How can a stored procedure be "deployed" and "upgraded" as part of the system's install and upgrade? Is there a DDL of sorts for installing and destroying stored procedures? I figured testing them through a layer closer to the SP itself would be beneficial. Testing through PHP could work, I suppose, but is that the best way to go about it?
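As an aside on the deployment question: MySQL does manage stored procedures with ordinary DDL (CREATE PROCEDURE / DROP PROCEDURE), so an install or upgrade script can recreate them idempotently. A sketch, with the procedure name carried over from earlier in the thread as an assumption:

```sql
-- Upgrade scripts can simply drop and recreate the procedure;
-- DROP ... IF EXISTS makes the script safe to re-run.
DROP PROCEDURE IF EXISTS import_vehicles;

DELIMITER //
CREATE PROCEDURE import_vehicles()
BEGIN
    -- new version of the import logic goes here
END //
DELIMITER ;
```

Versioning these statements alongside the application's schema migrations keeps the SP in step with the tables it touches.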
Post Reply