PHP reading/parsing HTML

Posted: Tue Aug 25, 2009 10:44 am
by kepardue
I've been learning XPath well enough to read and edit XML configuration files, but am now tasked with doing the same with an HTML file. I'm trying to do this with XPath as well, but have noticed something peculiar: instead of returning the nodes underneath the matched element, it seems to return the text content with the HTML tags removed. The HTML is structured as a series of repeating test questions and answer choices:

Code: Select all

 
<div class="iDevice_inner">
<div class="question">
<div id="taquestion0b1" class="block" style="display:block">True or False. Cribbing blocks can be used under outriggers to level a bucket truck.</div><br />
<table><tr>
<td><input type="radio" name="key0b1" value="0" /></td>
<td><div id="taoptionAnswer0q0b1" class="block" style="display:block">True</div></td>
</tr><tr>
<td><input type="radio" name="key0b1" value="1" /></td>
<td><div id="taoptionAnswer1q0b1" class="block" style="display:block">False</div></td></tr>
</table>
</div><br />
...
...
</div>
 
I need to extract the questions and answers and assign them to a multidimensional array so that I can create an app that will allow the user to edit them. Unfortunately, I'm limited to this structure since I'm working with existing files. In part-real, part-pseudo code, this is what I have:

Code: Select all

 
// Load the course file as XML (assumed to be XHTML, given the namespace below).
$course_dom = new DOMDocument();
$course_dom->load($course_file);
$xpath = new DOMXPath($course_dom);
$xpath->registerNamespace("m", "http://www.w3.org/1999/xhtml");
$query = $xpath->query("/m:html/m:body/m:div/m:div[@id='main']/m:div/m:form/m:div/m:div/m:div");
 
// nodeValue returns only the concatenated text of each matched node.
for ($i = 0; $i < $query->length; $i++) {
    echo "<br />VALUE: " . $query->item($i)->nodeValue;
}
 
I'm sure there has to be a way to reference the children of my $query->item($i), but I'm not sure of the syntax. Unfortunately, since there are so many different ways for PHP to deal with XML, I'm not sure how to go about it.
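
For reference, here is a minimal sketch of one way to reach the children of a matched node. It is not necessarily the approach suggested in the thread. It assumes a cut-down, namespace-free copy of one question block (the $html string is a hypothetical stand-in for the real file) and uses the optional second argument of DOMXPath::query() as a context node, so relative paths are evaluated against each question. DOMDocument::saveXML($node) returns the node's markup rather than its text.

```php
<?php
// Hypothetical, cut-down stand-in for one question block from the course file.
$html = <<<HTML
<div class="question">
<div id="taquestion0b1" class="block">True or False. Cribbing blocks can be used under outriggers to level a bucket truck.</div>
<table><tr>
<td><input type="radio" name="key0b1" value="0" /></td>
<td><div id="taoptionAnswer0q0b1" class="block">True</div></td>
</tr><tr>
<td><input type="radio" name="key0b1" value="1" /></td>
<td><div id="taoptionAnswer1q0b1" class="block">False</div></td>
</tr></table>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadXML($html); // the fragment is well-formed and has no namespace
$xpath = new DOMXPath($dom);

$questions = array();
foreach ($xpath->query("//div[@class='question']") as $question) {
    // The second argument is the context node: relative paths start from it.
    $text = $xpath->query("div[starts-with(@id, 'taquestion')]", $question)
                  ->item(0)->nodeValue;

    $answers = array();
    foreach ($xpath->query(".//div[starts-with(@id, 'taoptionAnswer')]", $question) as $answer) {
        $answers[] = trim($answer->nodeValue);
    }

    $questions[] = array('question' => trim($text), 'answers' => $answers);
}

// Markup of the last matched node, tags included (nodeValue would give text only).
$markup = $dom->saveXML($question);

print_r($questions);
```

With the real file, the same relative queries would need the "m:" prefix registered above, since its elements are in the XHTML namespace.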

Re: PHP reading/parsing HTML

Posted: Wed Aug 26, 2009 9:44 am
by kepardue
I knew there had to be a simple solution to the problem. Thanks very much for that pointer; I've got it parsing what I need smoothly now, with the exception of one thing. In the HTML file, there's a <script> tag that contains several JavaScript variables inside a <!-- //<![CDATA[ //]]> --> block. I can't seem to get this to return as a string so I can use PHP to parse out the variables that I need from it. Any advice on that?

Thanks!

Re: PHP reading/parsing HTML

Posted: Wed Aug 26, 2009 10:17 am
by kepardue
Good sir, I owe you a drink. I've been struggling with this for a week... who knew that the solution could be so simple. That works perfectly, and getting the data as a string is exactly what I need to work with.

Thanks so much!

Re: PHP reading/parsing HTML

Posted: Wed Aug 26, 2009 10:33 am
by kepardue
For some reason, I can't seem to find documentation on XPath's /comment() function. The script in the CDATA is rather lengthy, and it's only returning what appears to be the last 20,000 characters of it. Unfortunately, what I need is at the beginning of the script. Is there some sort of substring method to pull the first 5,000 characters? I apologize for asking what must seem to be dumb questions. There just doesn't seem to be a lot of documentation on this floating around out there.
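
As a rough sketch of what is being asked here: in XPath, comment() is a node test that selects comment nodes, and once the node's text is in a PHP string, substr() can take the first 5,000 characters. The $html string below is a hypothetical stand-in for the real page, parsed as XML so that the script body shows up as a comment node.

```php
<?php
// Hypothetical stand-in: a script whose body is wrapped in an HTML comment.
$html = <<<HTML
<html><head><script type="text/javascript">
<!-- //<![CDATA[
var key0 = 1;
var key1 = 0;
function getAnswer() { for (var i = 0; i < 2; i++) { } }
//]]> -->
</script></head><body></body></html>
HTML;

$dom = new DOMDocument();
$dom->loadXML($html);
$xpath = new DOMXPath($dom);

// comment() matches comment nodes; here, the one directly inside <script>.
$comment = $xpath->query("//script/comment()")->item(0);
$script  = $comment->nodeValue;

// First 5,000 characters (the whole string if it is shorter than that).
$head = substr($script, 0, 5000);

// Length of the full string, as a check on whether anything is really missing.
echo strlen($script), "\n";
echo $head;
```

Comparing strlen($script) with the expected size of the script is a quick way to tell whether the string itself is short or whether only its display is.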

Re: PHP reading/parsing HTML

Posted: Wed Aug 26, 2009 11:55 am
by kepardue
Nope, it's all in the same block. Here's the code at the point where it begins returning data.
Specifically, in the code below, the var key* = * lines are what I need to get.

Code: Select all

 
... 
                var key18 = 1;
                var key19 = 1;
                function getAnswer()
                {
                doLMSSetValue("cmi.interactions.0.id","key0b1");
                doLMSSetValue("cmi.interactions.0.type","choice");
                doLMSSetValue("cmi.interactions.0.correct_responses.0.pattern",
                          "0");
...
 
The result returned begins with:

Code: Select all

" //< 2; i++) { if (document.getElementById("quizForm1").key0b1[i].checked) { question0...."

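Once the script text is available as a string, pulling out the var key* = * assignments could look roughly like the following. This is a sketch only: it assumes the values are always plain integers, as in the excerpt above, and $script is a hypothetical stand-in for the real script text.

```php
<?php
// Hypothetical stand-in for the script text extracted from the page.
$script = <<<JS
                var key18 = 1;
                var key19 = 1;
                function getAnswer()
                {
                doLMSSetValue("cmi.interactions.0.id","key0b1");
                }
JS;

$keys = array();
// Match "var keyN = N;" and capture the variable name and its integer value.
// Assumption: values are unsigned integers, as in the excerpt.
if (preg_match_all('/\bvar\s+(key\d+)\s*=\s*(\d+)\s*;/', $script, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $keys[$m[1]] = (int) $m[2];
    }
}

print_r($keys);
```

The pattern requires the var keyword, so the "key0b1" string passed to doLMSSetValue() is not picked up.
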
Re: PHP reading/parsing HTML

Posted: Wed Aug 26, 2009 2:05 pm
by kepardue
It seems to have been an issue with the JavaScript: something about the "<" symbols caused all of the preceding code to be replaced with a "//". Odd that it didn't even show the proper text in the source.

Wrapping the variable in htmlentities() worked just fine. Now I think I'm back in familiar territory. Thanks so much for the help and advice!
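
A small sketch of that fix, under the assumption that the problem was raw "<" characters in the echoed output being treated as markup (the thread does not confirm the exact cause). The $script value is a hypothetical fragment modelled on the output quoted earlier.

```php
<?php
// Hypothetical fragment of the extracted script, containing a raw "<".
$script = 'for (var i = 0; i < 2; i++) { if (document.getElementById("quizForm1").key0b1[i].checked) { } }';

// Convert "<", ">", "&" and quotes to entities so the text is shown literally.
$escaped = htmlentities($script, ENT_QUOTES, 'UTF-8');

echo '<pre>' . $escaped . '</pre>';
```

Escaping is only needed for display; any parsing of the variables can still be done on the unescaped string.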