Batching an import [solved]
Posted: Sun Sep 19, 2010 4:53 pm
Right now I'm writing a plugin for another system. I can't really access cron or anything fancy.
My users are uploading huge files. These are CSV files with rows that correspond roughly to rows in a table (the actual database is more normalized than the import format; the import format is intentionally de-normalized). Right now it goes row by row: first it issues a query to see if it already knows about the row, and if it does not, it inserts a corresponding row.
Now the issue is I want it to go 100 rows at a time and batch the INSERT statements. The problem I currently see with doing so is that it is make/model/year data. So let's step through it in pseudo code.
Import format:
Code: Select all
Honda,Civic,2000
There are these tables: Make, Model, Year, Vehicle
The code looks at this row and issues a SELECT (or looks in memory) to see if it has loaded, or has in the database, a make called "Honda". Let's pretend we do and it was make #1. Next it looks at "Civic" and let's assume it has never been encountered before. The system issues an INSERT for Civic. Whatever ID number it gets back then has to be inserted into a master vehicles table. So if Honda=1, Civic=8, '2000'=2 (years are normalized for technical reasons), it next has to:
Code: Select all
INSERT INTO `vehicle` (make,model,year) VALUES (1,8,2);
Since it's a hierarchy, and so many consistency checks are needed, I don't really see how it can be batched. I could use ON DUPLICATE KEY and just issue a grotesque number of inserts (which would probably be faster than trying to be all efficient with the SELECT statements?). However, I'm admittedly having a hard time wrapping my head around it since it's a hierarchy. Does anyone have any ideas, or has anyone done anything like this in the past that they're willing to share? I guess the problem is: I have to know that Civic belongs to Honda (make #1) before I start replacing rows. If another manufacturer comes out with a like-named model, it should not overwrite theirs even though the names match, because they have different parent makes. The current algorithm meets the spec but is not the most efficient.
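One way to batch it despite the hierarchy is to resolve each level for the whole chunk at once: batch-insert all new makes, read their IDs back, then do the same for models and years, and only then batch the vehicle rows. A minimal sketch in Python with SQLite (the table and column names are assumed, not the poster's actual schema, and SQLite's INSERT OR IGNORE stands in for MySQL's INSERT IGNORE / ON DUPLICATE KEY):

```python
import sqlite3

# Illustrative schema only -- the real database presumably differs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE make  (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE model (id INTEGER PRIMARY KEY, make_id INTEGER, name TEXT,
                        UNIQUE (make_id, name));  -- same name may repeat across makes
    CREATE TABLE year  (id INTEGER PRIMARY KEY, value INTEGER UNIQUE);
    CREATE TABLE vehicle (make INTEGER, model INTEGER, year INTEGER,
                          UNIQUE (make, model, year));
""")

def import_chunk(rows):
    """Import a chunk of CSV rows (make, model, year) in level-by-level batches."""
    # Level 1: makes. One batched INSERT, duplicates ignored, then read IDs back.
    conn.executemany("INSERT OR IGNORE INTO make (name) VALUES (?)",
                     [(r[0],) for r in rows])
    make_id = {name: mid for mid, name in conn.execute("SELECT id, name FROM make")}

    # Level 2: models, keyed by (make_id, name) so "Civic" under Honda
    # never collides with a same-named model under another make.
    conn.executemany("INSERT OR IGNORE INTO model (make_id, name) VALUES (?, ?)",
                     [(make_id[r[0]], r[1]) for r in rows])
    model_id = {(mk, name): mid
                for mid, mk, name in conn.execute("SELECT id, make_id, name FROM model")}

    # Level 3: years.
    conn.executemany("INSERT OR IGNORE INTO year (value) VALUES (?)",
                     [(r[2],) for r in rows])
    year_id = {val: yid for yid, val in conn.execute("SELECT id, value FROM year")}

    # Final level: the master vehicle table, again one batched INSERT.
    conn.executemany("INSERT OR IGNORE INTO vehicle (make, model, year) VALUES (?, ?, ?)",
                     [(make_id[r[0]], model_id[(make_id[r[0]], r[1])], year_id[r[2]])
                      for r in rows])
    conn.commit()

import_chunk([("Honda", "Civic", 2000), ("Honda", "Accord", 2000),
              ("Mazda", "Civic", 2000)])  # same model name, different parent make
```

The UNIQUE (make_id, name) key on model is what enforces the "Civic belongs to Honda" constraint: a like-named model under another make becomes a distinct row instead of overwriting, and re-running the same chunk is a no-op.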
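The "grotesque amount of inserts" idea also doesn't have to mean one statement per row: once the make/model/year IDs for a chunk are resolved, the per-vehicle INSERTs can collapse into a single multi-row statement, with the duplicate-key handling skipping rows that already exist. A sketch, again using SQLite's INSERT OR IGNORE in place of MySQL's ON DUPLICATE KEY, with an assumed vehicle schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE vehicle (make INTEGER, model INTEGER, year INTEGER,
                UNIQUE (make, model, year))""")

# Hypothetical already-resolved (make_id, model_id, year_id) tuples for one chunk.
resolved = [(1, 8, 2), (1, 9, 2), (1, 8, 2)]  # last tuple duplicates the first

# Build one multi-row INSERT: duplicates are skipped by the unique key
# instead of a SELECT-then-INSERT round trip per row.
placeholders = ", ".join(["(?, ?, ?)"] * len(resolved))
sql = f"INSERT OR IGNORE INTO vehicle (make, model, year) VALUES {placeholders}"
conn.execute(sql, [v for row in resolved for v in row])
```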