Parsing external documents

Posted: Thu Jun 20, 2002 9:48 am
by toppac
Is there any way in PHP to parse an external document for a certain value? For example, say I wanted to parse yahoo.com's main page for the word News. Can PHP do that?

Parsing documents

Posted: Thu Jun 20, 2002 10:03 am
by BDKR
Yes!

Posted: Thu Jun 20, 2002 10:28 am
by toppac
lol ok thanks. I looked it up in the php dev cookbook, saw how to do it. But I am having trouble understanding the preg_match_all() function. How do I send it input to get all the values following a given word?

i.e.

I search an HTML page for the word "Points" and I want it to give me the number of points, which follows the word after a colon or something. Any help?

preg stuff...

Posted: Thu Jun 20, 2002 4:59 pm
by BDKR
I'm sorry, I was just feeling like being a little stupid and :twisted: I worked in a prison from '89 to '97 (dating myself) and with those fools, it was clown or be clowned. You got your jokes in as soon as the opportunity arose!

I'm at work right now, but when I get home and get back online this evening, I will also check some code from about a year ago where I did some of that stuff. Hopefully it will help you out some. If I forget (like by tomorrow), just kinda send me a message or something to prod my memory.

Later on,
BDKR (TRC)

reg ex

Posted: Fri Jun 21, 2002 12:35 am
by BDKR
This (regular expressions and stuff) is something I don't do much. I always try to look for another way of doing it before I do use it. Most of the time, I don't have to. It's one of the ugliest looking things I've ever seen in programming, but anyone that can read and understand line after line of that stuff gets credit from me.

Anyways, I've had need of preg_replace() and ereg() and to be honest with you, I'm not sure of the difference between the two. I'm not sure I care.

Anyways, what I was doing in one of those instances was looking for all data between two points in a document. Here is a code snippet.

Code:

if (strstr($story, "<!--- start title --->"))
{
    $search = ereg("<!--- start title --->(.*)<!--- end title --->", $story, $info);
}
Now the var $info is actually an array (php.net/ereg) and you should fiddle with the array just a tad to get the info out of it that you want.

The "(.*)" bidness is saying grab everything between the "start title" and "end title" tags. That, if I'm not mistaken, is what's being stored as an element in the $info var. It's obviously a title for a story.
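For anyone reading this on a newer PHP: ereg() was deprecated in PHP 5.3 and removed in PHP 7.0, so here is a rough sketch of the same idea with preg_match() swapped in. The $story contents are made up for illustration, and the non-greedy "(.*?)" is my own choice so the match stops at the first end marker.

```php
<?php
// Hypothetical sample input; the markers mirror the ones in the snippet above.
$story = "<!--- start title --->PHP 4.2 released<!--- end title --->\n<p>Body text</p>";

$title = null;

// preg_match() fills $info much like ereg() does:
// $info[0] is the whole match, $info[1] is the first parenthesised group.
// The "s" modifier lets "." match newlines; ".*?" stops at the first end marker.
if (preg_match('/<!--- start title --->(.*?)<!--- end title --->/s', $story, $info)) {
    $title = trim($info[1]);
}

echo $title, "\n"; // prints: PHP 4.2 released
```

So "fiddling with the array" mostly comes down to reading element 1 and trimming it.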

Now I'm assuming (hoping, actually) that there is a place in the document that's very similar to what we have above. How do you, and in turn your script, know when to start parsing for the information? Is there a place similar to the above?

Code:

<!--- start points --->
point information here
<!--- end points --->
If it's something like the above; something where you know where the information is going to begin and end, then you can use ereg as I did above and use "(.*)" to grab all the data between those two points. Did you create this document that is going to be parsed? Was the document created in such a way to make it easy to be parsed?
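When the begin and end markers are known ahead of time, you may not need a regular expression at all. Below is a minimal sketch using plain string functions. extract_between() is a hypothetical helper name (not a PHP built-in), and the sample document and marker text are assumptions for illustration only.

```php
<?php
// extract_between() is a hypothetical helper, not a built-in PHP function.
// It returns the trimmed text between the first $start marker and the next
// $end marker, or null if either marker is missing.
function extract_between($document, $start, $end)
{
    $from = strpos($document, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start); // move past the start marker itself

    $to = strpos($document, $end, $from);
    if ($to === false) {
        return null;
    }

    return trim(substr($document, $from, $to - $from));
}

// Sample document, made up for illustration.
$document = "<html>\n<!--- start points --->\n42\n<!--- end points --->\n</html>";
$points = extract_between($document, "<!--- start points --->", "<!--- end points --->");

echo $points, "\n"; // prints: 42
```

This only finds the first pair of markers; a page with several marked sections would need a loop or a regular expression.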

This is the kind of thing that xml is great for. But the document needs to be an xml document. If it is by chance that, then it's even easier. Let me know.

mas reg ex

Posted: Fri Jun 21, 2002 12:43 am
by BDKR
In reading your post some more, I paid more attention to the explanation you gave. Would the data be in a form kind of like....
Points "Number of Points" :
? I'm not sure if I understood that correctly. If that's the case, maybe something like....

Code:

$points = ereg("Points(.*):", $document_buffer, $points_info);
? What it's grabbing is all information after "Points" and before the ":". You may need to use the trim function on the take.
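Since the question was about preg_match_all(), here is a hedged sketch of how it might look if the page uses a "Points: 42" layout, as the original question suggests. The sample buffer and that layout are assumptions. The pattern captures only digits, because a greedy "(.*)" can run on to a later colon in the buffer.

```php
<?php
// Made-up page fragment; the "Points: <number>" layout is an assumption.
$document_buffer = "<td>Points: 42</td><td>Points: 17</td><td>Rank: 3</td>";

// \s* skips optional whitespace around the colon, (\d+) captures the digits.
// preg_match_all() returns the number of matches and fills $points_info;
// $points_info[1] holds the captured number from every match, in order.
$count = preg_match_all('/Points\s*:\s*(\d+)/', $document_buffer, $points_info);

// Convert the captured strings to integers.
$points = array_map('intval', $points_info[1]);

print_r($points);
```

If the number sits between the word and the colon instead, a pattern along the lines of '/Points\s*(\d+)\s*:/' would be the equivalent guess.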

Let me know how it goes.

Later on,
BDKR (TRC)