get just a part from one web page
Posted: Thu Sep 08, 2005 12:06 am
by webreake
Hello
So I visited a site with a list of available jobs. There was too much info on this page (4 MB of plain text), and the webmaster shows everything on just one page. This is a problem for the users because the page takes too long to finish loading.
That's why I want to download only a portion of that big page, or try to get filtered content with PHP.
This is the url:
http://clasificados.mexplaza.com.mx/cgi ... mpleos.cgi
Any ideas ?
Posted: Thu Sep 08, 2005 12:45 am
by feyd
Code: Select all
[feyd@home]>php -r "preg_match_all('#<form.*?</form>.*?</blockquote>#s', file_get_contents('http://clasificados.mexplaza.com.mx/cgi-bin/clasificados/listarempleos.cgi'), \$matches); echo count(\$matches[0]);"
3607
there are 3607 job entries, as of 5 minutes ago.

Posted: Sat Sep 10, 2005 11:27 am
by webreake
thanks feyd
Locally it works great, but when I tried to run this script on my web host I got an error because it takes too much time to download the file (more than 30 seconds). I also think this is known as stealing bandwidth; I guess that's why I got that error on my host. I checked my bandwidth usage and it was still downloading the entire site.
So my question is:
Would it be possible to get filtered content from a page without downloading the entire HTML (for example, download only the red text or only the links, without everything else)?
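HTTP can't filter HTML on the server side, but you can at least stop reading after a byte limit instead of pulling the whole 4 MB. A rough sketch (the byte limit and the link regex are illustrative choices, not from the thread):

```php
<?php
// Read at most $limit bytes of a page, then stop.
function fetch_partial($url, $limit = 65536) {
    $fp = fopen($url, 'r');
    if ($fp === false) {
        return false;
    }
    $data = '';
    while (!feof($fp) && strlen($data) < $limit) {
        $data .= fread($fp, 8192); // read in 8 KB chunks, stop at the limit
    }
    fclose($fp);
    return $data;
}

// Then filter the partial HTML locally, e.g. keep only the links:
// preg_match_all('#<a\s[^>]*href=["\']([^"\']+)["\']#i', fetch_partial($url), $links);
```

The trade-off is that a fixed byte limit may cut off mid-entry, so anything past the limit is simply not seen.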
Posted: Sat Sep 10, 2005 12:06 pm
by raghavan20
I don't know whether this function would be useful in any way:
ignore_user_abort()
http://uk2.php.net/manual/en/function.i ... -abort.php
Posted: Sat Sep 10, 2005 12:14 pm
by feyd
could use set_time_limit() as well, or alternatively...
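Put together, the two suggestions look roughly like this (a sketch; whether set_time_limit(0) actually takes effect depends on the host's configuration, and some shared hosts disable it):

```php
<?php
// Lift the 30-second execution limit, then run feyd's filter.
function fetch_job_entries($url) {
    ignore_user_abort(true); // keep running even if the visitor disconnects
    set_time_limit(0);       // 0 = no execution time limit, where allowed
    $html = file_get_contents($url);
    preg_match_all('#<form.*?</form>.*?</blockquote>#s', $html, $matches);
    return $matches[0];      // one array entry per job block
}
```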
Posted: Sun Sep 11, 2005 1:51 pm
by webreake
hi again
I tried both functions:
ignore_user_abort()
set_time_limit()
but now my problem is how to force the script to return the array when it finishes:
Code: Select all
preg_match_all('#<form.*?</form>.*?</blockquote>#s',file_get_contents('http://clasificados.mexplaza.com.mx/cgi-bin/clasificados/listarempleos.cgi'),$matches);
or maybe I will look for a function to capture a number of bytes and print them, which brings me to another problem.
this is cool
programming is like a really big puzzle

Posted: Sun Sep 11, 2005 2:00 pm
by feyd
webreake wrote:programming is like a really big puzzle

yes, it is like a really big puzzle.
$matches in that code will contain all the found parts.
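To make that concrete, a tiny self-contained example of the array preg_match_all() fills in:

```php
<?php
// $matches is populated by reference; no explicit "return" is needed.
preg_match_all('#<b>(.*?)</b>#', '<b>one</b> <b>two</b>', $matches);
// $matches[0] holds the full matches:   array('<b>one</b>', '<b>two</b>')
// $matches[1] holds capture group 1:    array('one', 'two')
```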
Posted: Sun Sep 11, 2005 8:05 pm
by timvw
I would cache the actual page: every 5 minutes or so, retrieve a new version (you might want to use a conditional GET for that), and then make your own site work from that cache. This will improve performance, at the cost of an average 2.5-minute delay in seeing changes.
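A minimal sketch of that caching idea (the cache file name and TTL are made up for the example; the conditional GET refinement is left out here):

```php
<?php
// Serve a local copy of the page, refetching at most once per $ttl seconds.
function cached_page($url, $cacheFile = 'jobs.cache.html', $ttl = 300) {
    if (file_exists($cacheFile) && time() - filemtime($cacheFile) < $ttl) {
        return file_get_contents($cacheFile); // cache is still fresh
    }
    $html = file_get_contents($url);          // refetch the remote page
    if ($html !== false) {
        file_put_contents($cacheFile, $html); // update the cache
    }
    return file_get_contents($cacheFile);     // fall back to the old copy
}
```

Your own pages then read from the cache file, so only one slow fetch happens per interval no matter how many visitors you have.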
Posted: Wed Sep 14, 2005 4:18 pm
by webreake
hi
this is my happy ending: I finally finished my script.
What it does is what timvw suggested:
timvw wrote:I would cache the actual page.. So, every 5 minutes or so you retrieve a new version
combined with the function preg_match_all to filter the cached page.
I also added a function to delete repeated job entries.
thanks guys
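webreake's script itself isn't posted, but the de-duplication step he describes can be sketched like this (the cache file name is illustrative):

```php
<?php
// Filter the cached page with feyd's regex, then drop repeated job entries.
function unique_jobs($cacheFile) {
    $html = file_get_contents($cacheFile);
    preg_match_all('#<form.*?</form>.*?</blockquote>#s', $html, $matches);
    // array_unique() removes exact duplicates; array_values() reindexes
    return array_values(array_unique($matches[0]));
}
```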