Page 1 of 2

Document Based System has Unacceptable Load Time

Posted: Fri Feb 27, 2009 10:27 am
by Skara
The way things work (too long to explain), a document-based architecture is required as opposed to storing data in a database.
MySQL is available, but I have to create the data in xml documents (and on different systems). Each document is around 1-20kb, usually containing base64_encoded jpegs.

Currently, I have ~200 documents. I expect in another year there will be ~1000 documents. After a year or so these documents are archived off the system, but that's still a large amount of data.

My problem is this:
I can't simply list the documents. I need to list some data for them as well--4 to 8 variables or so from each document need to be retrieved and listed along with a link to view the document.
In addition, I have a list of groups of documents. Each of these listings contain stats for the documents. In effect, I have to open every document to view this list.
** architecture is:
-- group 1
---- ~100-400 documents
-- group 2
---- ~100-400 documents
..etc.

NOW... It... works.. but not quickly. Load time on the group list is ~3-4 seconds. That's for 200 documents, remember.

I think my solution is this:
Keeping duplicate data in mysql tables. But, I'm not sure how far to take it, or how to go about it. This is a LOT of data that gets modified fairly regularly and I'm worried about inconsistencies developing between the documents and the mysql tables.

Please give me some idea of how to go about this. Thanks!!!

Re: Document Based System has Unacceptable Load Time

Posted: Fri Feb 27, 2009 10:35 am
by allspiritseve
Skara wrote:I think my solution is this:
Keeping duplicate data in mysql tables. But, I'm not sure how far to take it, or how to go about it. This is a LOT of data that gets modified fairly regularly and I'm worried about inconsistencies developing between the documents and the mysql tables
That sounds like your best bet, keeping meta info in the db. If you're worried about inconsistencies, it sounds like you're able to parse variables from the documents. What about having a script that you can run manually, that checks between the info in the database and the info in the document that it links to, and alerts you of any inconsistencies so you can remedy them. Sure, it'd run slow, but it's only for your eyes so it doesn't matter. You could even have it go through 100 documents at a time or whatever, to split it up.

Re: Document Based System has Unacceptable Load Time

Posted: Fri Feb 27, 2009 10:47 am
by inghamn
The 4-8 variables that you parse out of the documents could be stored in the database, along with a unique identifier for the document. Then, whenever you add or modify a document, you parse the variables out of the document, and update the database.

You would only ever read from the database the rest of the time. But you would be able to use the database whenever you listing information about the documents.

Re: Document Based System has Unacceptable Load Time

Posted: Fri Feb 27, 2009 3:13 pm
by pickle
Need to know a little more about your setup before I can recommend anything.

Are the files updated automatically (a cron job pulls them from the various servers, parses them & stores some data), or manually (you retrieve the files & import them into your system one at a time)?

Re: Document Based System has Unacceptable Load Time

Posted: Fri Feb 27, 2009 3:44 pm
by Eran
How about keeping all the data in the database and generating documents from it on demand? is that an option?

Re: Document Based System has Unacceptable Load Time

Posted: Sat Feb 28, 2009 10:11 pm
by josh
<1,000 documents is a piece of cake, throw an XML database or xpath querying mechanism at it. Now if we were talking 10,000 I'd agree you'd need a database. Heck for 1,000 documents you can probably get away with just flat files & flat file caches, it really depends on your application. If theres significant business logic you're going to have to split your model, I like pytrin's idea of using the database as your primary storage at that point and using the XML just for APIs

Re: Document Based System has Unacceptable Load Time

Posted: Mon Mar 02, 2009 10:54 am
by Skara
No, MySQL is not available on the systems that create the data.

Files are manually uploaded from the mobile units to the home server at the end of a day to a week of work. I would love to automate this part, but I (ugh) have to make mac minis talk to a windows box... it isn't exactly the most practical solution...

"throw an XML database or xpath querying mechanism at it"
Please explain this. This sounds like a possible solution.
Although the files MAY get to 10000 in a year or two...

Thanks for the help.

Re: Document Based System has Unacceptable Load Time

Posted: Mon Mar 02, 2009 11:15 am
by pickle
Mac Minis should be able to talk to Windows boxes through SMB or FTP.

This is what I'd do:
  1. Once you've uploaded all the files, run an import script
  2. The first time (ever) the script is run, it will extract the necessary data from each file & store it in a database (or an XML-file if you prefer). It also generates an MD5 hash of the file & stores that alongside the data for the file.
  3. On subsequent runs, the script checks the hash for the file. If it matches what's in the database, it doesn't need to open & extract the data. If the hash is different, it extracts the new data & generates a new hash.

Re: Document Based System has Unacceptable Load Time

Posted: Mon Mar 02, 2009 11:43 am
by Skara
ooh, I kinda like that solution.

And yeah, I can get the two systems to talk, it's just not very easily automated.

Re: Document Based System has Unacceptable Load Time

Posted: Mon Mar 02, 2009 11:48 am
by pickle
Automator might be able to help there. Or heck - even a bash script. With upwards of 10,000 files, I think it'd be worth investing some time on an automated solution.

Heck, if you set up the web server on each of the Mac Mini's properly, you could just pull the files from each of them through the web. You could pull through FTP as well. I think it'd be easier to pull the files from the Mac Minis, rather than push the files to your central server.

Re: Document Based System has Unacceptable Load Time

Posted: Mon Mar 02, 2009 11:55 am
by Skara
Pulling the files may end up being the final solution. I've explored Automator and discovered it doesn't like the permissions differences... You can drag and drop, but not automate it. Weird.
But anyway, the number of files isn't an issue as all of them are dragged at once. Also, quite large image files (think ~5-25gb of data) also need to be copied over, and web-basing that isn't very feasible.

Re: Document Based System has Unacceptable Load Time

Posted: Mon Mar 02, 2009 11:56 am
by josh
Just because the clients dont have mysql why cant the server? And as far as XML databases you'd have to search on your own i dont know much

Re: Document Based System has Unacceptable Load Time

Posted: Mon Mar 02, 2009 12:06 pm
by Skara
Well yes, the home server does have mysql, so it's certainly an option. I just can't CREATE the data as mysql data. It has to be imported.
And also, using files makes for easy importing and archiving. We can also store a client's order file along with their images, so I really do need to keep that portability. The mysql would only be used for status display on lists. The point is to make the file-based system work, not develop an entirely new system (not that major changes aren't needed).

Re: Document Based System has Unacceptable Load Time

Posted: Mon Mar 02, 2009 1:17 pm
by Skara
Ok, here's what I think would be the best solution for what I need. (Thanks all!)

I want to get some kind of approval on this so that I'm really not missing anything. I really don't want to screw this up and realize that it's still slow once I have ~10000 entries here...

file/data tree
I have:
location folders -> individual files -> data within files

mysql
I already have:
location table with very basic information about location

I think I need...
table `files`
columns: (autoinc int)id, (int)time, (int)status, (varchar)md5
// btw, the filenames are all [timestamp].order, so the time and the filename are stored all in one.

Now maybe this is a little esoteric, but would it be quicker to store multiple status variables as one number (such as octal permissions) or would it be quicker to store those as separate entries.
Such as...
status is at stage 4 and has flags a and c, therefore... (?)
1014 would be the status???
_OR_
simply store the different columns as
status, a, b, c

I should note that the a, b and c flags would only be called on file-list views, not on the location list view. However, the main status would have to be tallied for the location list view.

Re: Document Based System has Unacceptable Load Time

Posted: Mon Mar 02, 2009 2:32 pm
by pickle
An octal string might allow for more expansion in the future. If this is an established process though, I think it would be cleaner to have a column for each flag.