hello dear community, good evening
i am currently working on an approach to parse some sites that contain data on foundations in Switzerland,
with details like goals, contact e-mail and the like.
See http://www.foundationfinder.ch/ which has a dataset of 790 foundations. All the data are free to use, with no copyright limitations on them.
I have tried it with PHP Simple HTML DOM Parser, but I have found it difficult to get all the necessary data out of the pages.
Who wants to jump in and help create this scraper/parser? I would love to hear from you.
Please help me get up to speed with this approach.
regards
lin
Parsing: with PHP Simple HTML DOM Parser - how to do it
- getmizanur
Re: Parsing: with PHP Simple HTML DOM Parser - how to do it
Show what you have done so far, and then I can give you some help if you are stuck.
Re: Parsing: with PHP Simple HTML DOM Parser - how to do it
hello and good day dear getmizanur, many thanks for replying!
This task will give me some nice PHP lessons. So here is a sample page:

... and as I think all 790 result pages can be found within a certain range of Ids between Id=0 and Id=100000, I thought I could go over them with a loop:
http://www.foundationfinder.ch/ShowDeta ... &Type=Html
http://www.foundationfinder.ch/ShowDeta ... &Type=Html
http://www.foundationfinder.ch/ShowDeta ... &Type=Html
http://www.foundationfinder.ch/ShowDeta ... &Type=Html
http://www.foundationfinder.ch/ShowDeta ... &Type=Html
By the way, the Perl approach: I thought I could go the Perl way, but I am not very sure. I was trying to use LWP::UserAgent on the same URLs [see below] with different query arguments, and I am wondering if LWP::UserAgent provides a way to loop through the query arguments? I am not sure that it has a method for that. Well, I have sometimes heard that it is easier to use WWW::Mechanize. But is it really easier!?
But if I go the PHP way, I could do it with cURL, couldn't I!?
Here is my approach: I tried to figure it out, and I dug deeper into the man pages and howtos. We can have a loop constructing the URLs and call cURL repeatedly.
As noted above, here we have some result pages:
http://www.foundationfinder.ch/ShowDeta ... &Type=Html
http://www.foundationfinder.ch/ShowDeta ... &Type=Html
http://www.foundationfinder.ch/ShowDeta ... &Type=Html
http://www.foundationfinder.ch/ShowDeta ... &Type=Html
http://www.foundationfinder.ch/ShowDeta ... &Type=Html
http://www.foundationfinder.ch/ShowDeta ... &Type=Html
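A cURL loop along those lines could be sketched like this. Note this is only a sketch under assumptions: the `?Id=...&Type=Html` URL pattern is inferred from the links quoted above, and `build_url`/`fetch_page` are hypothetical helper names, not part of any library.

```php
<?php
// Build the detail-page URL for a given Id.
// The ?Id=...&Type=Html pattern is assumed from the links quoted above.
function build_url($id) {
    return 'http://www.foundationfinder.ch/ShowDetails.php?Id=' . $id . '&Type=Html';
}

// Fetch one page; return the body on HTTP 200, or null for missing Ids.
function fetch_page($id) {
    $ch = curl_init(build_url($id));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return ($status === 200 && $body !== false) ? $body : null;
}

// Typical driver loop (commented out so the sketch has no side effects):
// for ($id = 0; $id <= 100000; $id++) {
//     $html = fetch_page($id);
//     if ($html === null) continue;   // skip Ids that do not exist
//     // ... hand $html to the HTML parser here ...
// }
```

Keeping URL construction in its own small function makes it easy to change if the real parameter names turn out to be different.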
Alternatively, we can add a request_prepare handler that computes and adds the query arguments before we send out the request.
Question: do you think that this fits the needs?
Again, the aim: I want to parse the data and afterwards store it in a local MySQL database.
Should I define an extern_uid!?
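For the MySQL side, one way to use an extern_uid is sketched below with PDO. The table and column names (`foundations`, `extern_uid`, `name`, `goal`, `email`) are hypothetical placeholders, not anything the site defines; the idea is simply to store the site's own Id alongside the parsed fields so rows can be matched up later.

```php
<?php
// Sketch: store one parsed foundation via PDO.
// Table/column names are hypothetical -- adjust to your own schema.
// extern_uid holds the site's page Id, so a re-run of the scraper can
// find the existing row instead of creating duplicates.
function save_foundation(PDO $db, $externUid, $name, $goal, $email) {
    $stmt = $db->prepare(
        'INSERT INTO foundations (extern_uid, name, goal, email)
         VALUES (:uid, :name, :goal, :email)'
    );
    $stmt->execute(array(
        ':uid'   => $externUid,
        ':name'  => $name,
        ':goal'  => $goal,
        ':email' => $email,
    ));
}
```

For MySQL the handle would be created with something like `new PDO('mysql:host=localhost;dbname=foundations_db', $user, $pass)`; the function itself works with any PDO driver.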
and go like this:
Code: Select all
for my $i (0..10000) {
    # LWP::UserAgent's get() takes a full URL, so build it per iteration
    my $response = $ua->get("http://www.foundationfinder.ch/ShowDetails.php?Id=$i&Type=Html");
    next unless $response->is_success;   # skip missing Ids
    # process reply
}
getmizanur, I need your help since I am stuck here. Can I do the job like this!?
I would love to hear from you!
regards
lin
Re: Parsing: with PHP Simple HTML DOM Parser - how to do it
I am trying to find a way to use file_get_contents for a download of a set of pages:
... and as I think all 790 result pages can be found within a certain range of Ids between Id=0 and Id=100000, I thought I could go over them with a loop:
http://www.foundationfinder.ch/ShowDeta ... &Type=Html
http://www.foundationfinder.ch/ShowDeta ... &Type=Html
http://www.foundationfinder.ch/ShowDeta ... &Type=Html
http://www.foundationfinder.ch/ShowDeta ... &Type=Html
http://www.foundationfinder.ch/ShowDeta ... &Type=Html
How do I mechanize a loop from 0 to 10000 and throw out the 404 responses?
Once we reach a page, we could then use a parser such as BeautifulSoup (in the Python world) to get the content, in our case the image file address,
but we could also just loop through the images directly with simple web requests.
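Staying in PHP, the built-in DOM extension can do what BeautifulSoup does in Python, with no third-party parser. A minimal sketch, assuming the image address sits in an ordinary `<img src>` attribute (the real page's markup has to be inspected first):

```php
<?php
// Sketch: pull the first image address out of a fetched page with PHP's
// built-in DOM extension. The assumed markup is an ordinary <img src>.
function extract_image_src($html) {
    $doc = new DOMDocument();
    // suppress warnings about imperfect real-world HTML
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//img/@src');
    return ($nodes->length > 0) ? $nodes->item(0)->value : null;
}
```

The same XPath approach works for other fields (goal, e-mail, etc.) once the page structure is known.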
Well, how to proceed? Like this:
<?php
// create a stream context with our request options
$opts = array(
    'http' => array(
        'method' => "GET",
        'header' => "Accept-language: en\r\n" .
                    "Cookie: foo=bar\r\n"
    )
);
$context = stream_context_create($opts);

// fetch the page through that context
$file = file_get_contents('http://www.example.com/', false, $context);
?>
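To throw out the 404s with this approach: after each file_get_contents call, PHP fills the local variable `$http_response_header`, whose first entry is the status line. With the `ignore_errors` context option the body is returned even on error statuses, so the status line can be checked explicitly. A sketch (the helper names `is_not_found` and `download_page` are my own, not built-ins):

```php
<?php
// Check a status line such as "HTTP/1.1 404 Not Found" for a 404.
function is_not_found($statusLine) {
    return (bool) preg_match('/\s404\s/', $statusLine);
}

// Sketch: download one page with file_get_contents, returning null on
// a network error or a 404, so a driver loop can simply skip those Ids.
function download_page($url) {
    $context = stream_context_create(array(
        'http' => array(
            'method'        => 'GET',
            'ignore_errors' => true, // still return the body on 4xx/5xx
        ),
    ));
    $body = @file_get_contents($url, false, $context);
    // $http_response_header is populated in this scope by the call above
    if ($body === false || is_not_found($http_response_header[0])) {
        return null;
    }
    return $body;
}

// Typical driver loop (commented out so the sketch has no side effects):
// for ($id = 0; $id <= 10000; $id++) {
//     $html = download_page("http://www.foundationfinder.ch/ShowDetails.php?Id=$id&Type=Html");
//     if ($html === null) continue;   // 404 or network error: skip
//     // ... parse $html here ...
// }
```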
a typical page is http://www.foundationfinder.ch/ShowDeta ... &Type=Html
and the related image is at http://www.foundationfinder.ch/ShowDeta ... Type=Image
After downloading the images we will need to OCR them to extract any useful info,
so at some stage we need to look at OCR libraries.
I think Google maintains an open-source one (probably Tesseract), and since it is Google there is a good chance it has a good API.
What do you think?