many many thanks for running this board. I love this site. It has helped me so often! You are great fellows. What i do today is workin on a little php-parser!
I need to get all the data out of this site.See the target: http://www.aktive-buergerschaft.de/buer ... ungsfinder
I am trying to scrape the datas from a webpage, but I get need to get all the data in this link.
I want to store the data in a Mysql-db for the sake of a better retrieval!
see an example:
I need to get all the data out of this site.
see the target: http://www.aktive-buergerschaft.de/buer ... ungsfinder
I am trying to scrape the datas from a webpage, but I get need to get all the data in this link.
I need to have the data that are "behind" the link - is there any way to do thissee an example:
Bürgerstiftung Lebensraum Aachen
rechtsfähige Stiftung des bürgerlichen Rechts
Ansprechpartner: Hubert Schramm
Alexanderstr. 69/ 71
52062 Aachen
Telefon: 0241 - 4500130
Telefax: 0241 - 4500131
Email: info@buergerstiftung-aachen.de
www.buergerstiftung-aachen.de
>> Weitere Details zu dieser Stiftung
Bürgerstiftung Achim
rechtsfähige Stiftung des bürgerlichen Rechts
Ansprechpartner: Helga Kühn
Rotkehlchenstr. 72
28832 Achim
Telefon: 04202-84981
Telefax: 04202-955210
Email: info@buergerstiftung-achim.de
www.buergerstiftung-achim.de
>> Weitere Details zu dieser Stiftung
with a easy and understandable parser - one that can be understood and written by a newbie!?
well i could do this with XPahts - in PHP or Perl - (with mechanize)
i started with an php-approach: But -if i run the code (see below) i get this results
Code: Select all
martin@suse-linux:~> cd perl
martin@suse-linux:~/perl> cd foundations
martin@suse-linux:~/perl/foundations> php arbie_finder_de.php
PHP Parse error: syntax error, unexpected '*' in /home/martin/perl/foundations/arbie_finder_de.php on line 3
martin@suse-linux:~/perl/foundations> php arbie_finder_de.php
PHP Parse error: syntax error, unexpected T_FOREACH in /home/martin/perl/foundations/arbie_finder_de.php on line 17
martin@suse-linux:~/perl/foundations> ^C
martin@suse-linux:~/perl/foundations>
Code: Select all
<?php
// Create DOM from URL or file
$html = file_get_html('www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder');
// split it via body, so you only get to the contents inside body tag
$split = split('<body>', $html);
// it is usually in the top of the array but just check to be sure
$body = $split[1];
// split again with, say,<p class="divider">A</p>
$split = split('<p class="divider">A</p>', $body);
// now this should contain just the data table you want to process
$data = $split[1]
// Find all links from original html
foreach($html->find('a') as $element) {
$link = $element->href;
// check if this link is in our data table
if(substr_count($data, $link) > 0) {
// link is in our data table, follow the link
$html = file_get_html($link);
// do what you have to do
}
}
?>
the standard practice for scrapping the pages would be:
1. read the page into a string (file_get_html or whatever is being used now)
2. split the string, This depends on the page structure. First split it via <body>, so one element of the array will contain the body, and so on until we get our target. Well I'm guessing the final split would be by
Code: Select all
<p class="divider">A</p>
3. If we wish to follow the link, just repeat the same process, but using the link.
4. Alternatively, we can search around for a PHP snippet that gets all links in a page. This is better if we have done 1 and 2 already, and we now have only the string inside the <body> tag. Much simpler that way.
Well - my question is: what can this errors cause - i have no glue...would be great if you have an idea look forward