
starting-point: parser that runs cURL and DOM (with XPath)

Posted: Sat May 21, 2011 3:37 pm
by lin
Hello dear folks, good evening dear community.

I need a starting point! There is a German database that collects the data on all German foundations...

see: http://www.suche.stiftungen.org/index.p ... baseID=129

Here we find all foundations in Germany: 8074 different foundations. You get the full result set if you enter % as a wildcard in the search field.

How to do this with PHP: I think we have to do it with cURL or with file_get_contents(); those seem like the best methods for the job. What do you think, personally? I am curious to hear your ideas, so please let me know!
BTW, the XPath and DOM technique can probably be used too. I guess so!?
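To sketch the parse step: DOMDocument plus DOMXPath can pull the fields out of a fetched page. The HTML fragment and the XPath query below are invented for illustration; the real result page on suche.stiftungen.org uses its own markup, so the selectors would have to be adapted:

```php
<?php
// Minimal sketch: extract foundation names from an HTML fragment with DOM and XPath.
// NOTE: this markup is made up for the example -- the real page will differ.
$html = <<<HTML
<table>
  <tr class="result"><td>Allers'sche Tagelöhnerstiftung</td><td>27632 Dorum</td></tr>
  <tr class="result"><td>Beispielstiftung</td><td>10115 Berlin</td></tr>
</table>
HTML;

$doc = new DOMDocument();
@$doc->loadHTML($html);            // @ silences warnings about sloppy real-world HTML
$xpath = new DOMXPath($doc);

$names = [];
foreach ($xpath->query('//tr[@class="result"]/td[1]') as $cell) {
    $names[] = trim($cell->textContent);
}
print_r($names);
```

With cURL or file_get_contents the fetch step is the same idea either way: get the page body into a string, then hand it to loadHTML().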


On a side note: if you do that, you run into a limit. 350 results is the maximum the site will display; more is not possible. So the question is: how can we build a spider that walks the site and queries step by step, so that we get all 8074 results?
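One way around the 350-result cap is to walk the result list in fixed-size chunks. This is only a sketch under assumptions: the parameter names "query" and "start" are made up for illustration, and the real form's field names would have to be read out of the page source first:

```php
<?php
// Sketch of a step-by-step spider for a capped result list.
// The URL parameters below are hypothetical, not the site's real ones.

// Pure helper: the offsets needed to cover $total hits in $chunk-sized pages.
function offsets(int $total, int $chunk): array {
    return $total > 0 ? range(0, $total - 1, $chunk) : [];
}

// Fetch one page body with cURL; returns '' on failure.
function fetchPage(string $url): string {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_USERAGENT      => 'foundation-harvester/0.1',
    ]);
    $body = curl_exec($ch);
    curl_close($ch);
    return $body === false ? '' : $body;
}

// 8074 hits in chunks of 100 means 81 requests:
echo count(offsets(8074, 100)), "\n";   // prints 81

// The crawl itself (commented out so the sketch runs offline):
// foreach (offsets(8074, 100) as $start) {
//     $html = fetchPage('http://www.suche.stiftungen.org/index.php?query=%25&start=' . $start);
//     // ... hand $html to the DOM/XPath extractor ...
//     sleep(1); // throttle so the server is not hammered
// }
```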


The second question is: We get the following dataset:

Name: Allers'sche Tagelöhnerstiftung Landesstube des alten Landes Wursten
Street: Westerbüttel 13
Postal-code and town: 27632 Dorum
additional info: Fördernd (grant-making): Ja
additional info: Operativ (operating): Ja
webpage: http://www.sglandwursten.de

main area of work: Aufgabengebiete: Mildtätigkeit Kinder-/Jugendhilfe
regional base: Regionale Einschränkungen (regional restrictions): PLZ 27632, 27637, 27638, 27607, Mitgliedsgemeinden im Bereich der Samtgemeinde Land Wursten, Nordholz, Imsum, verschiedene Gemeinden im Bereich der Samtgemeinde Land Wursten, Gemeinde Nadholz
Target-group: Zielgruppen: Feste Destinatäre: Bewohner DRK-Alten- und Pflegeheim. Kinder, Jugendliche, Landarbeiter

All the datasets are similar! They all seem to look exactly like this...
The question is: can this be stored directly into a MySQL DB!?
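Storing one row per foundation should work fine with PDO and a prepared statement; TEXT columns take even very long descriptions. The sketch below uses an in-memory SQLite database only so the example runs standalone; for MySQL you would swap the DSN for something like "mysql:host=localhost;dbname=stiftungen;charset=utf8mb4". The table and column names are my own invention:

```php
<?php
// Sketch: one row per foundation, TEXT columns so long descriptions fit.
// SQLite in-memory here for a self-contained example; swap the DSN for MySQL.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$db->exec('CREATE TABLE foundations (
    id INTEGER PRIMARY KEY,
    name TEXT, street TEXT, postal_code TEXT, town TEXT,
    webpage TEXT, areas_of_work TEXT, regional_base TEXT, target_group TEXT
)');

$stmt = $db->prepare('INSERT INTO foundations
    (name, street, postal_code, town, webpage, areas_of_work, regional_base, target_group)
    VALUES (?, ?, ?, ?, ?, ?, ?, ?)');

// One of the sample records from the thread:
$stmt->execute([
    "Allers'sche Tagelöhnerstiftung Landesstube des alten Landes Wursten",
    'Westerbüttel 13', '27632', 'Dorum',
    'http://www.sglandwursten.de',
    'Mildtätigkeit, Kinder-/Jugendhilfe',
    '27632, 27637, 27638, 27607, Samtgemeinde Land Wursten',
    'Bewohner DRK-Alten- und Pflegeheim, Kinder, Jugendliche, Landarbeiter',
]);

echo $db->query('SELECT COUNT(*) FROM foundations')->fetchColumn(); // prints 1
```

Prepared statements also sidestep the quoting problems that names like "Allers'sche" would otherwise cause.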

Note: some descriptions are very, very long. I guess an Excel sheet could be overloaded by this!?

What do you think - is this doable!?

Love to hear from you - best regards

Re: starting-point: parser that runs cURL and DOM (with XPath)

Posted: Sat May 21, 2011 4:39 pm
by emelianenko
Have you ruled out using open-source web crawlers such as DataparkSearch or wget? Using wget I once nearly downloaded the whole Amazon site before I decided to stop it; once I had the download, I wrote a Perl script over it and that was it.

As for file_get_contents, what are you going to do there? All you can do is point it at a file and read it into a variable; and for XPath you need to convert the page to XML first. Sorry if I sound a bit ignorant, but am I understanding correctly that what you actually want is to grab all the contents of their database? Do you have access to the DB?

Updated:

Reading another posting, I found that if you use file_get_contents:

Code:

$input = @file_get_contents($url) or die("Could not access file: $url");
You then use:

Code:

int preg_match_all ( string $pattern , string $subject , array &$matches [, int $flags = PREG_PATTERN_ORDER [, int $offset = 0 ]] )
and you would have to write a regular expression; but even so, it would be no match for using DataparkSearch, as I indicated above.
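For completeness, here is roughly what the preg_match_all route looks like. The pattern matches an invented "<b>Label:</b> value" layout; the real page would need its own expression, and for nested markup DOM/XPath is usually the more robust choice:

```php
<?php
// Sketch: pulling label/value pairs out of fetched HTML with preg_match_all.
// NOTE: the markup and pattern here are hypothetical examples.
$html = '<b>Name:</b> Beispielstiftung<br><b>Ort:</b> 10115 Berlin<br>';

preg_match_all('/<b>([^<]+):<\/b>\s*([^<]+)/', $html, $matches, PREG_SET_ORDER);

$fields = [];
foreach ($matches as $m) {
    $fields[trim($m[1])] = trim($m[2]);   // e.g. "Name" => "Beispielstiftung"
}
print_r($fields);
```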
With kind regards

Re: starting-point: parser that runs cURL and DOM (with XPath)

Posted: Sun May 22, 2011 2:16 am
by lin
Good day dear emelianenko [good morning!]

Many thanks for the answer. (I can answer in more detail in a longer message, but at the moment I am a bit short of time.)

Well, I would love to do this parser/harvester job in PHP (with cURL), or, on the other hand, I could do it with Perl's WWW::Mechanize...

Mechanize is a very powerful module, but the technique goes a bit over my head.

I'll answer later. Have a great day!

Have a nice Sunday!
regards lin