
get just a part of one web page

Posted: Thu Sep 08, 2005 12:06 am
by webreake
Hello

So I visited a site with a list of available jobs. There was too much info on the page (4 MB of plain text) because the webmaster shows everything on a single page, which is a problem for the users: the page takes too long to finish loading.

That's why I want to download only a portion of that big page, or get filtered content with PHP.

This is the url:

http://clasificados.mexplaza.com.mx/cgi ... mpleos.cgi

Any ideas ?

Posted: Thu Sep 08, 2005 12:45 am
by feyd

Code:

[feyd@home]> php -r "preg_match_all('#<form.*?</form>.*?</blockquote>#s', file_get_contents('http://clasificados.mexplaza.com.mx/cgi-bin/clasificados/listarempleos.cgi'), \$matches); echo count(\$matches[0]);"
3607
there are 3607 job entries, as of 5 minutes ago. :)

Posted: Sat Sep 10, 2005 11:27 am
by webreake
thanks feyd
Locally it works great, but when I tried to run this script on my web host I got an error, because downloading the file takes too long (more than 30 seconds). I also think this is known as bandwidth stealing;
I guess that's why I got that error on my host.

I checked my bandwidth usage, and the script is still downloading the entire site.
:(
So my question is:
Is it possible to get filtered content from a page without downloading the entire HTML (for example, download only the red text, or only the links, without everything else)?

Posted: Sat Sep 10, 2005 12:06 pm
by raghavan20
I don't know whether this function would be of any use:

ignore_user_abort()

http://uk2.php.net/manual/en/function.i ... -abort.php

Posted: Sat Sep 10, 2005 12:14 pm
by feyd
You could use set_time_limit() as well, or alternatively...
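A minimal sketch of how those two calls could be combined at the top of the script (the 120-second limit is an arbitrary choice for illustration):

```php
<?php
// Let the script keep running even if the visitor closes the page
// mid-download.
ignore_user_abort(true);

// Raise the limit from the default 30 seconds; hosts running PHP in
// safe mode may ignore this call. 120 is an arbitrary choice.
set_time_limit(120);

$html = file_get_contents('http://clasificados.mexplaza.com.mx/cgi-bin/clasificados/listarempleos.cgi');
preg_match_all('#<form.*?</form>.*?</blockquote>#s', $html, $matches);
echo count($matches[0]), " job entries\n";
```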

Posted: Sun Sep 11, 2005 1:51 pm
by webreake
hi again

I tried both functions:
ignore_user_abort()
set_time_limit()

but now my problem is how to get this call to return the array once the script finishes:

Code:

preg_match_all('#<form.*?</form>.*?</blockquote>#s',file_get_contents('http://clasificados.mexplaza.com.mx/cgi-bin/clasificados/listarempleos.cgi'),$matches);
Or maybe I will look for a function to capture a set number of bytes and print them, which brings me to another problem.
this is cool
programming is like a really big puzzle :o
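Reading only a fixed number of bytes from the remote page is possible with fopen() over PHP's HTTP stream wrapper. A sketch, assuming a 64 KB cutoff (an arbitrary choice; entries straddling the boundary are lost, and the server still generates the whole page on its side):

```php
<?php
// Open the remote page as a stream and stop after the first 64 KB
// instead of pulling down all 4 MB.
$url   = 'http://clasificados.mexplaza.com.mx/cgi-bin/clasificados/listarempleos.cgi';
$limit = 65536; // arbitrary cutoff for this sketch

$fp = fopen($url, 'r');
$chunk = '';
while (!feof($fp) && strlen($chunk) < $limit) {
    $chunk .= fread($fp, 8192);
}
fclose($fp);

// Filter the partial content as before; any entry cut off at the
// boundary is simply lost.
preg_match_all('#<form.*?</form>.*?</blockquote>#s', $chunk, $matches);
echo count($matches[0]), " entries in the first 64 KB\n";
```

Note that plain HTTP has no way to ask the server for "only the red text" or "only the links"; the filtering has to happen after at least part of the HTML has arrived.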

Posted: Sun Sep 11, 2005 2:00 pm
by feyd
webreake wrote:programming is like a really big puzzle :o
yes, it is like a really big puzzle. :)

$matches in that code will contain all the found parts.
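For example, with a made-up two-entry page standing in for the real download, $matches[0] can be looped over directly:

```php
<?php
// Made-up sample HTML, standing in for the real 4 MB page.
$html = '<form>Job A</form>details</blockquote>'
      . '<form>Job B</form>details</blockquote>';

preg_match_all('#<form.*?</form>.*?</blockquote>#s', $html, $matches);

// $matches[0] holds every full match, in page order.
foreach ($matches[0] as $i => $entry) {
    echo "Entry $i: $entry\n";
}
echo count($matches[0]), " entries\n"; // prints "2 entries"
```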

Posted: Sun Sep 11, 2005 8:05 pm
by timvw
I would cache the actual page: every 5 minutes or so you retrieve a new version (you might want to use a conditional GET for that), and then make your own site work from that cache. This will improve performance, at the cost of an average 2.5-minute delay in seeing changes.
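A minimal sketch of such a file cache (the cache path and the 300-second lifetime are arbitrary choices, and this uses a plain re-fetch rather than a conditional GET):

```php
<?php
// Where the cached copy lives and how long it stays fresh; both are
// arbitrary choices for this sketch.
$url   = 'http://clasificados.mexplaza.com.mx/cgi-bin/clasificados/listarempleos.cgi';
$cache = '/tmp/empleos.html';
$ttl   = 300; // 5 minutes

// Re-download only when the cached copy is missing or stale.
if (!file_exists($cache) || time() - filemtime($cache) > $ttl) {
    file_put_contents($cache, file_get_contents($url));
}

// Every visitor is served from the local copy, so nobody waits for
// the 4 MB download.
$html = file_get_contents($cache);
preg_match_all('#<form.*?</form>.*?</blockquote>#s', $html, $matches);
echo count($matches[0]), " job entries\n";
```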

Posted: Wed Sep 14, 2005 4:18 pm
by webreake
hi
this is my happy end :)
finally finished my script
What it does is what timvw suggested:
timvw wrote: I would cache the actual page.. So, every 5 minutes or so you retrieve a new version
combined with preg_match_all() to filter the cached page.
I also added a function to delete repeated job entries.
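The duplicate-removal step could be as simple as array_unique() over the matched entries; a sketch with made-up data (the real script may well compare on something else, such as job titles):

```php
<?php
// Made-up entries standing in for $matches[0], with one duplicate.
$entries = array(
    '<form>Job A</form></blockquote>',
    '<form>Job B</form></blockquote>',
    '<form>Job A</form></blockquote>',
);

// array_unique() keeps the first occurrence of each value;
// array_values() reindexes the result from 0.
$unique = array_values(array_unique($entries));

echo count($unique), " unique entries\n"; // prints "2 unique entries"
```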

thanks guys