Validate XML file, without loading into memory?

Posted: Mon Jun 27, 2011 8:45 am
by Noodleyman
Afternoon chaps :)

Firstly, I will explain what I am trying to achieve, and put things in the correct context.

I have an online service that picks up .xml files from another person's web server and stores them locally. Once a file is saved locally, I validate each item in it and write the data into a MySQL table. Bad records go to a bad table, good records to a valid table.

I already have the above working as designed. However, I wish to secure the process a little more. Before attempting to load the XML file into the DB, I want to check that the file really is an XML file. After all, somebody could have placed, say, a .exe file with a .xml extension on the server, and my script would download it. So, is there a way to verify that a file with a .xml extension is actually an XML file?

Now, to make things a little more complex, I need to avoid loading the entire file into memory, because a single XML file can be up to 128MB in size.

I've spent some time looking into this, but have yet to come across a solution.

Any suggestions?

Cheers :)

Re: Validate XML file, without loading into memory?

Posted: Mon Jun 27, 2011 12:14 pm
by twinedev
Try the following and see if it works on your server (i.e., whether it allows you to use URLs with fopen):

Code:

$fp = fopen('http://rss.slashdot.org/Slashdot/slashdot', 'r');
if ($fp === false) {
	die('Could not open the URL');
}
$data = fread($fp, 250); // Grab up to the first 250 bytes
fclose($fp);
if (preg_match('/^<\?xml[^>]* version=(\'|")[0-9.]+\1[^>]*\?>/', $data)) {
	echo "Looks to be good... Do what ya gotta do...";
}
else {
	echo "The following didn't match: ", htmlspecialchars($data);
}
Note that this only makes sure the file opens with a proper XML declaration. (I don't use XML a lot, so I just wrote a generic regex for the <?xml ?> tag with version="x.x", allowing other attributes either before or after the version one.)

Be sure to keep something in the ELSE branch, so that if the XML formatting changes, you have a log of it and can see whether you need to modify the regular expression. Should work pretty well.

If you cannot use fopen this way on your server, the other option would probably be to use cURL. While I've never had the need to do it, there is probably a setting for it that will limit how much data it retrieves as well.
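For what it's worth, here is a rough sketch of how the cURL route might look. It is untested against the poster's server, and the function name fetch_head is made up for illustration. It asks for a byte range with CURLOPT_RANGE and, since a server is free to ignore that, also aborts the transfer from the write callback once enough bytes have arrived.

```php
<?php
// Hypothetical helper: fetch at most $limit bytes from $url using cURL.
function fetch_head($url, $limit = 250)
{
    $buffer = '';
    $ch = curl_init($url);
    // Ask for the first bytes only; the server may ignore the range.
    curl_setopt($ch, CURLOPT_RANGE, '0-' . ($limit - 1));
    // Collect the data ourselves so we can stop early.
    curl_setopt($ch, CURLOPT_WRITEFUNCTION, function ($ch, $chunk) use (&$buffer, $limit) {
        $buffer .= $chunk;
        if (strlen($buffer) >= $limit) {
            return 0; // not equal to strlen($chunk), so cURL aborts the transfer
        }
        return strlen($chunk);
    });
    curl_exec($ch); // returns false when we abort on purpose; that is expected
    return substr($buffer, 0, $limit);
}
```

The result can then be run through the same preg_match as above, e.g. $data = fetch_head('http://rss.slashdot.org/Slashdot/slashdot');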

-Greg

Re: Validate XML file, without loading into memory?

Posted: Thu Jun 30, 2011 4:36 am
by Weirdan
Noodleyman wrote: "Once the file is saved locally, I write the data into a MySQL table after validating each item in the file. Bad records go to a bad table, good records to a valid table."

So, are we talking about validating in the XML sense of the word, or about some semantic validation of the data contained in the file after the file has already been established to be valid?

Re: Validate XML file, without loading into memory?

Posted: Thu Jun 30, 2011 4:46 am
by Noodleyman
I've broken this up into stages.

The first stage is as above: read the first line of the file and check that the text matches what is expected. If it doesn't, or the file can't be opened because it isn't a text file, it gets rejected.
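A minimal sketch of that first stage on the locally saved file, assuming it is enough to look at the first 250 bytes. The function name looks_like_xml is made up, and skipping a UTF-8 byte order mark is my own addition. Bear in mind that the XML declaration is optional in XML 1.0, so this check rejects otherwise well-formed files that leave it out.

```php
<?php
// Hypothetical helper: true if the file starts with an XML declaration.
function looks_like_xml($path, $bytes = 250)
{
    $fp = @fopen($path, 'rb');
    if ($fp === false) {
        return false; // unreadable file: reject
    }
    $head = fread($fp, $bytes); // only the first bytes are ever read
    fclose($fp);
    if ($head === false) {
        return false;
    }
    // Skip a UTF-8 byte order mark if one is present.
    if (strncmp($head, "\xEF\xBB\xBF", 3) === 0) {
        $head = substr($head, 3);
    }
    return preg_match('/^<\?xml[^>]*\sversion=([\'"])[0-9.]+\1[^>]*\?>/', $head) === 1;
}
```

Because only the head of the file is read, memory use stays the same whether the file is 1KB or 128MB.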

I then go on to validate against a DTD file using the DOMDocument::validate function.
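Roughly, that stage could look like the sketch below. This is not the actual code from my app; the function name is_valid_against_dtd is made up, and it assumes the DTD is the one named in the file's own DOCTYPE, since that is what DOMDocument::validate checks against.

```php
<?php
// Hypothetical helper: true if the file validates against the DTD in its DOCTYPE.
function is_valid_against_dtd($path)
{
    // Collect libxml errors quietly instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // load() parses the file; validate() then checks it against the DTD.
    $ok = $doc->load($path) && $doc->validate();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok;
}
```

A file that fails here is rejected before any of its records are parsed.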

If it passes that, then I go on to parse the data and do further checks specific to my needs, looking for data that is invalid for my app.

It appears to all be working OK now :)

I was concerned that using DOMDocument::validate would cause excessive memory use on large files, however after doing some more testing, it appears to not chew much memory at all.
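If memory ever does become a problem, one alternative (not what I am using, just a sketch) would be PHP's XMLReader, which walks through the file as a stream instead of building the whole document tree. The function name is_well_formed is made up, and this version only checks well-formedness, not the DTD.

```php
<?php
// Hypothetical helper: stream through the file and report whether it is well-formed.
function is_well_formed($path)
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();
    $reader = new XMLReader();
    $ok = $reader->open($path);
    if ($ok) {
        while ($reader->read()) {
            // Nothing to do: advancing the cursor is what drives the parser.
        }
        // read() also returns false at the normal end of the document,
        // so the collected libxml errors decide the outcome.
        $ok = count(libxml_get_errors()) === 0;
        $reader->close();
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok;
}
```

Only the node under the cursor is held at any time, so memory use should not grow with the size of the file.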