PHP XML processing weirdness

Posted: Sat Oct 23, 2004 4:19 pm
by hokiecsgrad
Heya,

PHP 4.3.9 question. I am working on processing a very large XML file (35MB - it's a list of every screen of every movie theatre in North America) and, for obvious reasons, I can't just load the entire doc into a string and parse the sucker. But the PHP docs clearly state that I can read in an XML doc a few bytes at a time and use the event-based XML parser to parse the document.

For example:

Code:

$xml_parser = xml_parser_create("UTF-8");
xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, true);
xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, "characterData");
if (!($fp = fopen($file, "r"))) {
	die("could not open XML input");
}
while ($data = fread($fp, 4096)) {
	if (!xml_parse($xml_parser, $data, feof($fp))) {
		die( sprintf( "XML error: %s at line %d", xml_error_string(xml_get_error_code($xml_parser)), xml_get_current_line_number($xml_parser) ) );
	}
}
fclose($fp);
xml_parser_free($xml_parser);

When I parse the document this way, though, I notice weird things happening with the data: text between tags seems to get truncated. Yet the docs for the "xml_parse" function clearly state that feeding it the document a chunk at a time should work just fine.

Does anyone have any explanation for what's happening and/or a better way to parse very large XML docs? The solution needs to assume less than 2MB of working RAM and a very standard PHP install with no extra packages (such as PEAR) installed.

Thanks for the insight.

follow-up

Posted: Sat Oct 23, 2004 4:37 pm
by hokiecsgrad
Greetings,

I just ran a test on an iTunes XML file and had similar results. Using the code listed above, I see the following output while dumping Artist, Album and Song info from my iTunes XML file. Notice the "Alice in Chains" entries near the bottom. That's an example of some of the "truncated" data I made reference to in my first post. If I load the entire XML file into memory and do one parse call, I don't see those anomalies.
The News Hard At Play Don't Look Back(10)
The News Hard At Play Time Ain't Money(11)
Creedence Clearwater Revival Chronicle, Vol. 1 Susie Q(1)
Creedence Clearwater Revival Chronicle, Vol. 1 I Put A Spell On You(2)
Creedence Clearwater Revival Chronicle, Vol. 1 Proud Mary(3)
Creedence Clearwater Revival Chronicle, Vol. 1 Bad Moon Rising(4)
Creedence Clearwater Revival Chronicle, Vol. 1 Lodi(5)
Creedence Clearwater Revival Chronicle, Vol. 1 Green River(6)
Creedence Clearwater Revival Chronicle, Vol. 1 Commotion(7)
Creedence Clearwater Revival Chronicle, Vol. 1 Down On The Corner(8)
Creedence Clearwater Revival Chronicle, Vol. 1 Fortunate Son(9)
Creedence Clearwater Revival Chronicle, Vol. 1 Travelin' Band(10)
Creedence Clearwater Revival Chronicle, Vol. 1 Who'll Stop The Rain(11)
Creedence Clearwater Revival Chronicle, Vol. 1 Up Around The Bend(12)
3 Doors Down Away from the Sun When I'm Gone(1)
Alice in Chains Dirt Them Bones(1)
Alice in Chains Dirt Rooster(5)
ins Dirt Would?(13)
Alice in Chains Music Bank (Box Set) No Excuses(17)
Audioslave Audioslave Cochise(1)
Audioslave Audioslave Like a Stone(5)
Audioslave Audioslave I Am the Highway(8)
Jack Johnson Brushfire Fairytales Inaudible Melodies(1)
Jack Johnson Brushfire Fairytales Middle Man(2)

Posted: Sat Oct 23, 2004 5:00 pm
by Weirdan
I recall there was a comment on the xml_parse manual page about parsing a 300+ MB file... take a look: [php_man]xml_parse[/php_man]. Look for the user comments there.

[Solved] PHP XML processing weirdness

Posted: Tue Oct 26, 2004 9:08 am
by hokiecsgrad
Thanks for the heads up. I can't believe I missed that. I crawled all over that page looking for a solution. Very simple and obvious solution, I might add.
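
The manual comment itself isn't quoted in this thread, so what follows is a guess at the fix rather than a copy of it. One well-known cause of this symptom is that the character data handler can be called more than once for a single text node, for instance when an fread() chunk ends in the middle of the text. A handler that treats each call as the whole value then ends up with fragments like "ins". A minimal sketch, assuming that is what was going on: append in characterData and only use the text once endElement fires. The $buffer and $values globals, the sample XML string and the 7-byte chunk size are all made up for illustration.

```php
<?php
// Hypothetical handlers and globals; only the xml_* calls are PHP's own API.
$buffer = '';      // text collected for the element currently open
$values = array(); // finished text values, in document order

function startElement($parser, $name, $attrs) {
    global $buffer;
    $buffer = '';               // new element: start with an empty buffer
}

function characterData($parser, $data) {
    global $buffer;
    $buffer .= $data;           // may fire several times per text node, so append
}

function endElement($parser, $name) {
    global $buffer, $values;
    if (trim($buffer) !== '') {
        $values[] = $buffer;    // the text is only complete at the closing tag
    }
    $buffer = '';
}

// Tiny stand-in for the real 35MB file.
$xml = '<LIST><ARTIST>Alice in Chains</ARTIST><ALBUM>Dirt</ALBUM></LIST>';

$xml_parser = xml_parser_create("UTF-8");
xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, "characterData");

// Feed the parser 7 bytes at a time to stand in for fread($fp, 4096),
// so that chunk boundaries land in the middle of the text.
$length = strlen($xml);
for ($offset = 0; $offset < $length; $offset += 7) {
    $chunk = substr($xml, $offset, 7);
    $is_final = ($offset + 7 >= $length);
    if (!xml_parse($xml_parser, $chunk, $is_final)) {
        die(sprintf("XML error: %s at line %d",
            xml_error_string(xml_get_error_code($xml_parser)),
            xml_get_current_line_number($xml_parser)));
    }
}

print_r($values);
```

With a real file, the for loop would go back to being the fread($fp, 4096) loop from the first post; only the handlers change. Memory use stays small because the buffer never holds more than the text of one element.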