Parsing: with PHP Simple HTML DOM Parser - how to do it

PHP programming forum. Ask questions or help people concerning PHP code. Don't understand a function? Need help implementing a class? Don't understand a class? Here is where to ask. Remember to do your homework!

lin
Forum Commoner
Posts: 49
Joined: Tue Dec 07, 2010 1:53 pm

Parsing: with PHP Simple HTML DOM Parser - how to do it

Post by lin »

hello dear community, good evening

I am currently working on an approach to parse some sites that contain data on foundations in Switzerland, with details such as goals, contact e-mail addresses and the like.

See http://www.foundationfinder.ch/ which has a dataset of 790 foundations. All the data is free to use, with no copyright restrictions.

I have tried it with PHP Simple HTML DOM Parser, but I have found it difficult to collect all the data needed to get this up and running.

Who wants to jump in and help create this scraper/parser? I would love to hear from you.

Please help me get up to speed with this approach.

regards
lin
getmizanur
Forum Commoner
Posts: 71
Joined: Sun Sep 06, 2009 12:28 pm

Re: Parsing: with PHP Simple HTML DOM Parser - how to do it

Post by getmizanur »

Show what you have done so far, and then I can give you some help if you are stuck.
lin

Re: Parsing: with PHP Simple HTML DOM Parser - how to do it

Post by lin »

hello and good day dear getmizanur, many thanks for replying!
getmizanur wrote:show what you have done so far and then i can give you some help if you are stuck
Well, I am just musing over the best way to do the job. I guess I am in for a nice learning curve. ;) This task will give me some nice PHP lessons. So here is a sample page:

[image: screenshot of a sample result page]

... and as I thought I could find all 790 result pages within a certain Id range between Id=0 and Id=100000, I thought I could do it with a loop:

http://www.foundationfinder.ch/ShowDeta ... &Type=Html
http://www.foundationfinder.ch/ShowDeta ... &Type=Html
http://www.foundationfinder.ch/ShowDeta ... &Type=Html
http://www.foundationfinder.ch/ShowDeta ... &Type=Html
http://www.foundationfinder.ch/ShowDeta ... &Type=Html

By the way, the Perl approach: I thought I could go the Perl way, but I am not very sure. I was trying to use LWP::UserAgent on the same URLs [see below] with different query arguments, and I wonder whether LWP::UserAgent provides a way to loop through the query arguments. I am not sure it has a method for that. Well, I have sometimes heard that it is easier to use Mechanize. But is it really easier?

But if I go the PHP way I could do it with cURL, couldn't I?

Here is my approach: I tried to figure it out and dug deeper into the man pages and howtos. We can have a loop constructing the URLs and use cURL repeatedly.
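A cURL loop along those lines could be sketched like this. It is only a sketch: the URL pattern is taken from the ShowDetails.php?Id=...&Type=Html links in this thread, and the pause length is an arbitrary politeness choice, not a requirement of the site:

```php
<?php
// Build the detail-page URL for a given Id (pattern taken from the links above).
function build_detail_url($id) {
    return "http://www.foundationfinder.ch/ShowDetails.php?Id=$id&Type=Html";
}

// Fetch one page with cURL; return the body on HTTP 200, null otherwise.
function fetch_page($id) {
    $ch = curl_init(build_detail_url($id));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return body as a string
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return ($status === 200) ? $body : null;
}

// Walk the candidate Id range, skipping Ids that return no page.
function crawl_all() {
    for ($id = 0; $id <= 100000; $id++) {
        $html = fetch_page($id);
        if ($html === null) {
            continue; // no foundation under this Id (e.g. a 404)
        }
        // ... hand $html to Simple HTML DOM here ...
        usleep(500000); // be polite: half a second between requests
    }
}
```

With 100,000 candidate Ids for only 790 foundations, most requests will miss, so the 200-check is what keeps the loop from processing error pages.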

As noted above, we have the same set of result pages to loop over.

Alternatively, we could add a request_prepare handler that computes and adds the query arguments before we send out the request.

Question: do you think this fits the needs?

To restate the goal: I want to parse the data and afterwards store it in a local MySQL database.
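Storing the parsed records could be done with PDO and a prepared statement. This is a sketch only: the table name `foundation`, its columns, and the connection credentials are all hypothetical placeholders, not something the site dictates:

```php
<?php
// Persist one parsed foundation record via a PDO prepared statement.
// Table and column names are hypothetical - adjust them to your schema.
function insert_foundation(PDO $pdo, array $row) {
    $stmt = $pdo->prepare(
        'INSERT INTO foundation (extern_id, name, goal, contact_email)
         VALUES (:extern_id, :name, :goal, :email)'
    );
    $stmt->execute(array(
        ':extern_id' => $row['extern_id'],
        ':name'      => $row['name'],
        ':goal'      => $row['goal'],
        ':email'     => $row['email'],
    ));
}

// For MySQL you would connect like this (credentials are placeholders):
// $pdo = new PDO('mysql:host=localhost;dbname=foundations', 'user', 'secret');
```

Using a prepared statement also protects you if any scraped text contains quotes or other characters that would break a hand-built SQL string.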

Should I define an extern_uid?

and go like this:

Code: Select all

for my $i (0..10000) {
  # note: extra key/value pairs passed to get() become HTTP headers,
  # so build the query string into the URL itself
  my $res = $ua->get("http://www.foundationfinder.ch/ShowDetails.php?Id=$i&Type=Html");
  next unless $res->is_success;   # skip 404s and other errors
  # process $res->decoded_content here
}
getmizanur, I need your help since I got stuck here. Can I do the job like this?

I would love to hear from you!

regards
lin
lin

Re: Parsing: with PHP Simple HTML DOM Parser - how to do it

Post by lin »

I am trying to find a way to use file_get_contents for a download of a set of pages:

... and as I thought I could find all 790 result pages within a certain Id range between Id=0 and Id=100000, I thought I could do it with a loop:

http://www.foundationfinder.ch/ShowDeta ... &Type=Html
http://www.foundationfinder.ch/ShowDeta ... &Type=Html
http://www.foundationfinder.ch/ShowDeta ... &Type=Html
http://www.foundationfinder.ch/ShowDeta ... &Type=Html
http://www.foundationfinder.ch/ShowDeta ... &Type=Html

How do I mechanize a loop from 0 to 10000 and throw out 404 responses? Once we reach a valid page we could use a parser (BeautifulSoup in Python, or Simple HTML DOM here in PHP) to get the content (in our case the image file address), but we could also just loop through the images directly with simple web requests.

Well, how should I proceed? Like this:


<?php
// build a stream context with custom request headers
$opts = array(
  'http' => array(
    'method' => "GET",
    'header' => "Accept-language: en\r\n" .
                "Cookie: foo=bar\r\n"
  )
);
$context = stream_context_create($opts); // create the context from the options

// fetch the page using that context
$file = file_get_contents('http://www.example.com/', false, $context);
?>
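Putting the fetch and the 404 handling together with the parsing step could look like the sketch below. Two assumptions: the library file is named simple_html_dom.php (the usual name in the Simple HTML DOM download), and the 'td' selector is only an example — inspect the real page to find the right elements. str_get_html() is the library's entry point for parsing an HTML string:

```php
<?php
// Check the HTTP status line that file_get_contents leaves in
// $http_response_header (e.g. "HTTP/1.1 200 OK").
function status_ok($status_line) {
    return (bool)preg_match('/ 200( |$)/', $status_line);
}

// Fetch a URL; return the body on success, null on failure or non-200 status.
function fetch_or_null($url) {
    $body = @file_get_contents($url);   // '@' silences the warning on a 404
    if ($body === false) {
        return null;                    // request failed (404, DNS error, ...)
    }
    if (isset($http_response_header[0]) && !status_ok($http_response_header[0])) {
        return null;                    // got a body, but not a 200 OK
    }
    return $body;
}

// Parse a fetched page with Simple HTML DOM and collect table-cell text.
function extract_cells($html) {
    if (!function_exists('str_get_html')) {
        include_once 'simple_html_dom.php'; // the library's single include file
    }
    $dom = str_get_html($html);
    $cells = array();
    foreach ($dom->find('td') as $td) { // 'td' is just an example selector
        $cells[] = trim($td->plaintext);
    }
    return $cells;
}
```

From there, looping fetch_or_null() over the Id range and feeding non-null bodies to extract_cells() gives you the raw field text to store.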


a typical page is http://www.foundationfinder.ch/ShowDeta ... &Type=Html
and the related image is at http://www.foundationfinder.ch/ShowDeta ... Type=Image

After downloading the images we will need to OCR them to extract any useful info, so at some stage we need to look at OCR libraries.


I think Google open-sourced one (Tesseract), and since it is Google there is a good chance it has a good API.
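Assuming that library is Tesseract, one low-effort route from PHP is to shell out to its command-line tool, which writes its result to <outputbase>.txt. A sketch, assuming the tesseract binary is installed and on the PATH:

```php
<?php
// Build the shell command: `tesseract <image> <outputbase>` writes <outputbase>.txt.
function build_ocr_command($image_path, $out_base) {
    return 'tesseract ' . escapeshellarg($image_path) . ' ' . escapeshellarg($out_base);
}

// Run OCR on one downloaded image and return the recognised text (or null).
function ocr_image($image_path) {
    $base = tempnam(sys_get_temp_dir(), 'ocr');
    exec(build_ocr_command($image_path, $base) . ' 2>/dev/null', $out, $status);
    if ($status !== 0) {
        return null; // tesseract missing or it failed on this image
    }
    $text = file_get_contents($base . '.txt');
    @unlink($base . '.txt'); // clean up the temp files
    @unlink($base);
    return $text;
}
```

escapeshellarg() keeps odd file names from breaking (or injecting into) the shell command.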

What do you think?