hello dear community
I want to parse a site called the Foundation Finder. My Perl knowledge is pretty small!
I have tried various tutorials (examples of Mechanize that I found on CPAN), but not all of them work - some of them are broken!
Now I want to try a real-world task - and I want to do it with PHP.
The Foundation-Finder task has several steps. Especially interesting for me as a PHP/Perl beginner is this site in Switzerland: http://www.edi.admin.ch/esv/00475/00698 ... sp?Id=3221
which has a dataset of 2700 foundations. All the data are free to use - with no limitations or copyright on them.
I mused about a starting point: could I use a Perl module from CPAN and do the job with Perl? I guess that Mechanize or LWP could do a great job. Or HTML::Parser. Well - I am just musing which is
the best way to do the job. I guess that I am in front of a nice learning curve. This task will give me some nice PHP or Perl lessons.
Or could we do this with a cURL task instead!? I guess so! So here I am!
So here is a sample page for the real-world task, a governmental site in Switzerland with more than 2'700 foundations:
http://www.edi.admin.ch/esv/00475/00698 ... sp?Id=3221
Can I do this with cURL?
I'd love to get a hint.
PHP-beginner - how to run Curl
Re: PHP-beginner - how to run Curl
Have you looked at PHP's cURL manual yet?
Re: PHP-beginner - how to run Curl
hello dear Celauran
many many thanks for the reply!
What we have so far: the harvesting task should be no problem if I take WWW::Mechanize - particularly for doing the form-based search and selecting the individual entries. Hmm - I guess the algorithm would basically be two nested loops: the outer loop runs the form-based search, the inner loop processes the search results.
The outer loop would use the select() and submit_form() functions
on the second search form on the page. Can we use DOM processing here?
And how can we get the selection values?
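To make the selection-values question concrete: the forms WWW::Mechanize works with are HTML::Form objects, and a select input can list its option values via possible_values(). A minimal offline sketch - the form markup and the field name 'canton' here are made up for illustration, the real Foundation Finder form will use other names:

```perl
use strict;
use warnings;
use HTML::Form;

# A made-up search form - the real form will have different
# field names and option values.
my $html = <<'HTML';
<form action="/search" method="get">
  <select name="canton">
    <option value="BE">Bern</option>
    <option value="ZH">Zuerich</option>
  </select>
</form>
HTML

my ($form)  = HTML::Form->parse($html, 'http://www.example.ch/');
my $select  = $form->find_input('canton');

# These are the values the outer loop would iterate over.
print join(', ', $select->possible_values), "\n";
```

The same call works on a live page via $mech->forms() and find_input().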
The inner loop through the results would use follow_link() to get to the actual entries.
This forwards the Mechanize browser to the entry page. Basically, the URL regex looks for links that have the webgrab_path ... Id pattern, which is unique for each database entry. The $result_nbr variable tells Mechanize which one of the results it should follow next.
If we have several result pages, we would use the same trick to traverse through them.
For the semantic extraction of the entry information, we could parse the content of the actual entries with XML::LibXML's HTML parser (which works fine on this page), because it gives you some powerful DOM selection methods (using XPath).
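A minimal offline sketch of that extraction step, assuming the entry data sits in label/value table cells - the markup and the labels 'Name' and 'Ort' below are invented, so the XPath would need adjusting to the real pages:

```perl
use strict;
use warnings;
use XML::LibXML;

# A tiny stand-in for one entry page - the real pages will have
# different labels and markup.
my $html = <<'HTML';
<html><body><table>
  <tr><td>Name</td><td>Stiftung Example</td></tr>
  <tr><td>Ort</td><td>Bern</td></tr>
</table></body></html>
HTML

# recover => 1 lets the parser cope with real-world tag soup.
my $dom = XML::LibXML->load_html(string => $html, recover => 1);

# For each label cell, grab the value cell right after it.
for my $label ('Name', 'Ort') {
    my ($value) = $dom->findnodes(
        qq{//td[normalize-space(.)='$label']/following-sibling::td[1]}
    );
    printf "%s: %s\n", $label, $value->textContent if $value;
}
```

On the live site you would feed $mech->content into load_html instead of the literal string.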
Well, the actual looping through the pages should be doable in a few lines of Perl - 20 lines max, likely less.
But wait: the processing of the entry pages will then be the most complex part of the script.
Approaches: In principle we could do the same algorithm with a single while loop if we use the back() function smartly.
Can you give me a hint for the beginning - the processing of the entry pages - doing this with WWW::Mechanize?
Looking forward to hearing from you
regards
Celauran wrote: Have you looked at PHP's cURL manual yet?
I am ironing out a Perl approach - how do you like this idea? Btw - I can also have a look at the cURL page! Seems to be great! But can you help me with this part here - below?
Code:

$mech->follow_link(
    url_regex => qr{webgrab_path=http://evs2000.*?Id=\d+$},
    n         => $result_nbr,
);
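As a quick offline sanity check, the pattern can be tried against a link of the expected shape before wiring it into the loop - the sample URL below is invented for illustration, only the webgrab_path=...Id=... shape matters:

```perl
use strict;
use warnings;

# Same pattern as in the follow_link() call above
# (brace delimiters so the // in http:// needs no escaping, and \d+ for the numeric Id).
my $link_pattern = qr{webgrab_path=http://evs2000.*?Id=\d+$};

# A made-up link of the shape we expect in the result list.
my $sample = 'detail.aspx?webgrab_path=http://evs2000.example/entry?Id=3221';

print $sample =~ $link_pattern ? "matches\n" : "no match\n";
```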