Your approach seems fine to me. Basically I do the same.
// more or less off-topic
I'm working on a very similar problem. What I'm trying to do with my import "engine" is to reduce the number of spelling mistakes and badly formatted data. My CSV files are huge and every manufacturer has their own "style" of formatting them. It's common to find a CSV line containing data for more than one vehicle. The worst part is that these imports have to be updated every year (or twice a year), and the old data may have been edited in some weird way.
So, I have an ImportManager object which registers (and assigns column positions to) my Importer objects - like MakeImporter, ModelImporter etc.
These objects know how to parse the CSV line and how to extract and reformat their specific data. E.g. my StartYearImporter object knows it should look at column #4 for its input data, which may or may not look like some of these:
[text]"2004"
"04"
"04-08"
[/text]The last line contains both the start year and end year data.
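To make that concrete, here's a minimal sketch in Python (the post doesn't say which language is used) of what such a year-field parser could look like; the two-digit-year expansion rule is my assumption, real data would need a proper century cutoff:

```python
import re

def parse_year_range(field):
    """Parse a year field like '2004', '04' or '04-08' into a
    (start_year, end_year) tuple; end_year is None for single years."""
    def expand(y):
        y = int(y)
        # Assumption: two-digit years mean 20xx; adjust for your data.
        return y if y >= 100 else 2000 + y

    # Accept an optional surrounding pair of quotes, as in the samples.
    m = re.fullmatch(r'"?(\d{2,4})(?:-(\d{2,4}))?"?', field.strip())
    if m is None:
        raise ValueError("unparseable year field: %r" % field)
    start = expand(m.group(1))
    end = expand(m.group(2)) if m.group(2) else None
    return start, end
```

So `parse_year_range('04-08')` returns both the start year and the end year, which the StartYearImporter could then hand off to whichever object owns the end-year column.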
Every import object may decide that a single CSV line contains data for multiple records and request a multiplication of this line.
E.g. if the TypeImporter object receives this line:
[text]Honda Civic RS/SX/TS ...[/text] it will multiply this line into three lines:
[text]Honda Civic RS ...
Honda Civic SX ...
Honda Civic TS ...[/text]
It must be smart enough, though, to tell a real multi-value field from one where the slash is just part of the name.
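The multiplication step itself can be sketched like this (Python again; the column index and the slash-separator convention are assumptions for illustration):

```python
def multiply_line(fields, type_col=2):
    """Split a slash-separated type field (e.g. 'RS/SX/TS') into one
    output line per type, copying all the other columns unchanged."""
    variants = fields[type_col].split("/")
    if len(variants) < 2:
        return [fields]          # nothing to multiply
    return [fields[:type_col] + [v] + fields[type_col + 1:]
            for v in variants]
```

Calling `multiply_line(["Honda", "Civic", "RS/SX/TS", "..."])` yields the three rows from the example above. The "smart enough" part would live in extra checks before the split, e.g. a whitelist of known type names.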
Also, some of the import objects know that their data may be in more than one CSV column.
I start by creating a temporary table containing the unparsed CSV lines plus fields like "hasError", "errorMessage" and "removed". I also have duplicate import tables defined - make_import, model_import, etc. In order to keep track of previously parsed data, I keep the CSV data assigned to the corresponding vehicle.
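A minimal sketch of that staging layout, using SQLite from Python for illustration (the field names hasError, errorMessage and removed come from the description above; everything else, including the link column between the tables, is my assumption):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE csv_import (
        id           INTEGER PRIMARY KEY,
        raw_line     TEXT,                 -- unparsed CSV line
        hasError     INTEGER DEFAULT 0,
        errorMessage TEXT,
        removed      INTEGER DEFAULT 0
    );
    CREATE TABLE make_import (
        id       INTEGER PRIMARY KEY,
        csv_id   INTEGER REFERENCES csv_import(id),  -- back-link to source line
        name     TEXT,
        make_id  INTEGER                   -- resolved production ID, once known
    );
""")
conn.execute("INSERT INTO csv_import (raw_line) VALUES (?)",
             ('Honda;Civic;RS/SX/TS;"04-08"',))
conn.commit()
```

Keeping the raw line next to the error flags means an operator can always see exactly what the parser choked on.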
Then I process it, starting with the "make" column (updating only the temporary table and the import tables):
- "multiple values" are detected and lines are multiplied (and marked as such);
- spelling mistakes are checked using the Levenshtein distance (a DB UDF) and these rows are marked;
- the user must correct them or mark them as correct and update the table;
- for newly found makes an ID is generated; the rest must be assigned the existing IDs.
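For readers unfamiliar with the spell-check step: here is a plain-Python stand-in for the DB UDF mentioned above, plus a hypothetical helper that flags a candidate make against a list of known ones (the distance threshold of 2 is an arbitrary choice for illustration):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def closest_known_make(candidate, known_makes, max_dist=2):
    """Return the closest known make if it is within max_dist edits,
    else None (i.e. the row would be marked for operator review)."""
    best = min(known_makes,
               key=lambda k: levenshtein(candidate.lower(), k.lower()))
    if levenshtein(candidate.lower(), best.lower()) <= max_dist:
        return best
    return None
```

A typo like "Hnoda" would be matched back to "Honda" (distance 2), while a genuinely new make would come back as None and get a freshly generated ID.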
The hierarchy is processed in a similar way (i.e. model, type, variant etc.)
Finally, I add the parsed data from these import tables to the "production" tables.
I must admit that these imports are a nasty thing to do. For every 10'000 CSV lines there are nearly 200 records my import objects fail to parse, which then must be taken care of by an operator. I've even been thinking about neural-network solutions!?!