PHP XML processing weirdness


hokiecsgrad
Forum Newbie
Posts: 17
Joined: Fri Oct 22, 2004 2:55 pm

PHP XML processing weirdness

Post by hokiecsgrad »

Heya,

PHP 4.3.9 question. I am working on processing a very large XML file (35MB; it's a list of every screen of every movie theatre in North America) and, for obvious reasons, I can't just load the entire doc into a string and parse the sucker. But the PHP docs clearly state that I can read in an XML doc a few bytes at a time and use the event-based XML parser to parse the document.

For example:


$xml_parser = xml_parser_create("UTF-8");
xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, true);
xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, "characterData");
if (!($fp = fopen($file, "r"))) {
	die("could not open XML input");
}
while ($data = fread($fp, 4096)) {
	if (!xml_parse($xml_parser, $data, feof($fp))) {
		die( sprintf( "XML error: %s at line %d", xml_error_string(xml_get_error_code($xml_parser)), xml_get_current_line_number($xml_parser) ) );
	}
}
fclose($fp);
xml_parser_free($xml_parser);

However, when I parse the document this way, I notice weird things happening with the data: it seems as if data between tags sometimes gets truncated. Yet the docs for the "xml_parse" function clearly state that this is something that should work just fine.

Does anyone have any explanation for what's happening and/or a better way to parse very large XML docs? The solution needs to assume less than 2MB of working RAM and a very standard PHP install with no extra packages (such as PEAR) installed.

Thanks for the insight.

follow-up

Post by hokiecsgrad »

Greetings,

I just ran a test on an iTunes XML file and had similar results. Using the code listed above, I see the following output while dumping Artist, Album and Song info from my iTunes XML file. Notice the "Alice in Chains" entries near the bottom, where one line comes out as just "ins Dirt Would?(13)". That's an example of the "truncated" data I referred to in my first post. If I load the entire XML file into memory and do one parse call, I don't see those anomalies.

The News Hard At Play Don't Look Back(10)
The News Hard At Play Time Ain't Money(11)
Creedence Clearwater Revival Chronicle, Vol. 1 Susie Q(1)
Creedence Clearwater Revival Chronicle, Vol. 1 I Put A Spell On You(2)
Creedence Clearwater Revival Chronicle, Vol. 1 Proud Mary(3)
Creedence Clearwater Revival Chronicle, Vol. 1 Bad Moon Rising(4)
Creedence Clearwater Revival Chronicle, Vol. 1 Lodi(5)
Creedence Clearwater Revival Chronicle, Vol. 1 Green River(6)
Creedence Clearwater Revival Chronicle, Vol. 1 Commotion(7)
Creedence Clearwater Revival Chronicle, Vol. 1 Down On The Corner(8)
Creedence Clearwater Revival Chronicle, Vol. 1 Fortunate Son(9)
Creedence Clearwater Revival Chronicle, Vol. 1 Travelin' Band(10)
Creedence Clearwater Revival Chronicle, Vol. 1 Who'll Stop The Rain(11)
Creedence Clearwater Revival Chronicle, Vol. 1 Up Around The Bend(12)
3 Doors Down Away from the Sun When I'm Gone(1)
Alice in Chains Dirt Them Bones(1)
Alice in Chains Dirt Rooster(5)
ins Dirt Would?(13)
Alice in Chains Music Bank (Box Set) No Excuses(17)
Audioslave Audioslave Cochise(1)
Audioslave Audioslave Like a Stone(5)
Audioslave Audioslave I Am the Highway(8)
Jack Johnson Brushfire Fairytales Inaudible Melodies(1)
Jack Johnson Brushfire Fairytales Middle Man(2)
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Post by Weirdan »

I recall there was a comment on the xml_parse manual page about parsing a 300+ MB file. Take a look at the xml_parse page in the PHP manual and look through the users' comments there.

[Solved] PHP XML processing weirdness

Post by hokiecsgrad »

Thanks for the heads up. I can't believe I missed that. I crawled all over that page looking for a solution. Very simple and obvious solution, I might add.
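
For anyone who lands here later: the thread never spells out which comment did the trick, so the following is an assumption rather than the confirmed fix. A well-known cause of exactly this symptom is that the character data handler may be called more than once for a single text node, for instance when the text straddles two fread() chunks. A handler that overwrites its value on each call then keeps only the last fragment. A minimal sketch of the usual workaround is to append in characterData and only use the text in endElement. The names $textBuffer, $results and parseInChunks, as well as the sample XML, are made up for illustration, and the string is fed in small pieces to stand in for the fread() loop above.

```php
<?php
// Hypothetical globals: text collected for the current element, and the output list.
$textBuffer = '';
$results = array();

function startElement($parser, $name, $attrs) {
    global $textBuffer;
    $textBuffer = '';                 // start collecting fresh text for this element
}

function characterData($parser, $data) {
    global $textBuffer;
    $textBuffer .= $data;             // append: this may fire several times per text node
}

function endElement($parser, $name) {
    global $textBuffer, $results;
    $text = trim($textBuffer);
    if ($text !== '') {
        $results[] = $name . '=' . $text;   // the text is complete only at the closing tag
    }
    $textBuffer = '';
}

// Feeds $xml to the parser $chunkSize bytes at a time, mimicking the fread() loop.
function parseInChunks($xml, $chunkSize) {
    global $results;
    $results = array();
    $parser = xml_parser_create("UTF-8");
    xml_set_element_handler($parser, "startElement", "endElement");
    xml_set_character_data_handler($parser, "characterData");
    $length = strlen($xml);
    for ($offset = 0; $offset < $length; $offset += $chunkSize) {
        $chunk = substr($xml, $offset, $chunkSize);
        $isFinal = ($offset + $chunkSize >= $length);
        if (!xml_parse($parser, $chunk, $isFinal)) {
            die(sprintf("XML error: %s at line %d",
                xml_error_string(xml_get_error_code($parser)),
                xml_get_current_line_number($parser)));
        }
    }
    // On old PHP versions such as 4.x, also call xml_parser_free($parser) here.
    return $results;
}

$xml = '<LIB><ARTIST>Alice in Chains</ARTIST><ALBUM>Dirt</ALBUM></LIB>';
print_r(parseInChunks($xml, 8));
```

With the buffer in place, the chunk size should no longer affect what ends up in $results. If the handlers assigned instead of appended, a name split across two chunks would come out as a tail fragment, much like the "ins Dirt" line in the dump above.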