Matching links having pdf extension

Posted: Wed Oct 19, 2005 5:36 am
by dibyendrah
Dear All,
I have written functions that read a remote URL and search the fetched HTML for link patterns. However, I want to match only the links that have a .pdf extension.

The code I have written so far is as follows:

Code:

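// string|false open_page(string $page_url)
// Reads a remote page and returns its HTML, or false if the URL cannot be opened.
function open_page($page_url){

	$handle = fopen($page_url, "rb");
	if ($handle === false) {
		// Without this check, feof() below would loop forever on a failed open.
		return false;
	}
	$HTML = '';

	while (!feof($handle)) {
	  $HTML .= fread($handle, 8192);
	}
	fclose($handle);

	return $HTML;
}



// array scan_links(string $HTML)
// Returns the preg_match_all() result: $match[0] holds the full matches, $match[1] the href values.
function scan_links($HTML){

	preg_match_all("/<a[^>]+href=\"([^\"]+)/i", $HTML, $match);
	return $match;
}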
Any help will be appreciated!!

Thanks.

Love,
Dibyendra

Posted: Wed Oct 19, 2005 6:15 pm
by Ambush Commander
Are you sure you know what Regular Expressions are?

Code:

preg_match_all("/<a[^>]+href=\"([^\"]+\.pdf\")/i", $HTML, $match);
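
A minimal sketch of how the pattern above behaves on sample input. The `$html` string is made up for illustration. Note that, as written, the capture group includes the closing quote of the href value, so this sketch trims it off afterwards.

```php
<?php
// Hypothetical sample markup, made up for illustration.
$html = '<a href="report.pdf">Report</a> '
      . '<a href="index.html">Home</a> '
      . '<A HREF="docs/Guide.PDF">Guide</A>';

// Same pattern as in the post: only href values ending in .pdf match,
// and the i modifier makes the match case-insensitive.
preg_match_all("/<a[^>]+href=\"([^\"]+\.pdf\")/i", $html, $match);

// Group 1 still carries the closing quote, so strip it from each link.
$links = array_map(function ($link) {
    return rtrim($link, '"');
}, $match[1]);

print_r($links);
```

Moving the `\"` outside the parentheses, as in `([^\"]+\.pdf)\"`, would likely make the trimming step unnecessary.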

thanks

Posted: Thu Oct 20, 2005 12:37 am
by dibyendrah
Thanks, Ambush Commander,
Thanks for the help and the comment! I'm not very good at regular expressions yet and am still learning regex (PCRE and POSIX). Anyway, thanks for the help.

love,
Dibyendra

Please help me to solve the problem

Posted: Tue Oct 25, 2005 1:29 am
by dibyendrah
Ambush Commander wrote: Are you sure you know what Regular Expressions are?

Code:

preg_match_all("/<a[^>]+href=\"([^\"]+\.pdf\")/i", $HTML, $match);
Hello all,
The links I need to scan follow this pattern:

Code:

preg_match_all("/<td><a[^>]+href=\"([^\"]+\.pdf)\" target=\"(.*)\">(.*)<\/a></td><td>(.*)</td>/i", $HTML, $match);
But the above code gave an error.
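
One likely cause, assuming the quotes were escaped in the real code: the pattern uses `/` as its delimiter, but the `/` in `</td>` is not escaped, so PCRE treats it as the end of the pattern and tries to read what follows as modifiers. One way around this, sketched below, is to pick a delimiter that rarely appears in HTML, such as `#`. The sample row uses a made-up URL, and the target group is narrowed from `(.*)` to `([^"]*)` for the sketch.

```php
<?php
// Sample row modeled on the post, with a made-up URL.
$html = '<td><a href="http://example.com/a.pdf" target="_blank">A</a></td>'
      . '<td>7 - 13 October 2005 [2.2 MB]</td>';

// With '#' as the delimiter, the slashes in closing tags need no escaping.
$pattern = '#<td><a[^>]+href="([^"]+\.pdf)" target="([^"]*)">(.*?)</a></td><td>(.*?)</td>#i';

// PREG_SET_ORDER groups the results per match rather than per capture group.
preg_match_all($pattern, $html, $match, PREG_SET_ORDER);

foreach ($match as $m) {
    // $m[1] = link, $m[3] = link name, $m[4] = second cell
    echo $m[1], ' | ', $m[3], ' | ', $m[4], "\n";
}
```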

Code:

<tr> 
    <td><a href="http://orion.lib.virginia.edu/thdl/texts/reprints/nepali_times/Nepali_Times_268.pdf" target="_blank"># 
      268</a></td>
    <td width="42%">7 - 13 October  2005 [2.2 MB]</td>
    <td class="nodata">&nbsp;</td>
    <td class="nodata">&nbsp;</td>
  </tr>
I need to get the link inside href="" and the link name between <a> and </a> from the first cell, <td><a href="http://orion.lib.virginia.edu/thdl/text ... es_268.pdf" target="_blank">#
268</a></td>. I also need to extract the PDF date and [size] from the second cell, <td width="42%">7 - 13 October 2005 [2.2 MB]</td>.
The width attribute in the second cell is optional and may be absent.

Please help me to solve this problem.

With best regards,
Dibyendra

URL to scan pdf

Posted: Tue Oct 25, 2005 2:27 am
by dibyendrah
The URL that I'm trying to scan for PDF links is:
http://www.digitalhimalaya.com/collecti ... /index.php

I want to extract the link and link name from the first cell, and the link info from the second cell, of each row.

Code:

$pattern = "/<a[^>]+href=\"([^\"]+\.pdf)\" target=\"_blank\">(.*)/i";
This pattern finds the links, but the problem is that the opening <a href=""> and the closing </a> are not always on the same line. Is it possible to match both the link inside href and the link name inside <a href="link">link name</a> even if the closing </a> tag is on a different line?

Thank you for all the help around this forum!!

Regards,
Dibyendra

Posted: Tue Oct 25, 2005 8:42 am
by feyd
use the s pattern modifier. You'll need to make some adjustments to your pattern as well, as right now, it's greedy as all hell. .* should be .*?

Posted: Tue Oct 25, 2005 12:13 pm
by Chris Corbyn
feyd wrote: use the s pattern modifier. You'll need to make some adjustments to your pattern as well, as right now, it's greedy as all hell. .* should be .*?
... and ([^"]+\.pdf) should be ([^"]+?\.pdf) ;)
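
To illustrate the advice above, here is a small sketch comparing greedy and lazy matching under the `s` modifier, which lets `.` match newlines as well. The two-link sample string is invented for the example.

```php
<?php
// Two links; the text of the first one is split across two lines.
$html = "<a href=\"one.pdf\" target=\"_blank\">#\n  268</a> "
      . "<a href=\"two.pdf\" target=\"_blank\"># 267</a>";

// Greedy: with the s modifier, (.*) runs on to the LAST </a>,
// so both links are swallowed into a single match.
preg_match_all('/<a[^>]+href="([^"]+\.pdf)" target="_blank">(.*)<\/a>/is', $html, $greedy);

// Lazy: (.*?) stops at the FIRST </a>, so each link matches on its own.
preg_match_all('/<a[^>]+href="([^"]+\.pdf)" target="_blank">(.*?)<\/a>/is', $html, $lazy);

echo count($greedy[0]), " greedy match, ", count($lazy[0]), " lazy matches\n";
print_r($lazy[1]); // href values
print_r($lazy[2]); // link names, including the embedded newline
```

Without the `s` modifier, neither pattern would match the first link here, because `.` would stop at the newline inside the link text.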

how to ignore the target property in link tag

Posted: Thu Oct 27, 2005 5:47 am
by dibyendrah
Dear all,
Thank you all for the help! However, I found that some pages don't have the target attribute. How can I write a pattern that ignores this attribute when it is not there?

I have written the following pattern, but it doesn't match.

Code:

preg_match_all("/<a[^>]+href=\"([^\"]+\.pdf)\" (?!target=\"?([^\s\">]*))>/i", $HTML, $match);
What might be wrong?

Love,
Dibyendra

Posted: Thu Oct 27, 2005 5:50 am
by Chris Corbyn

Code:

$pattern = '/<a\s+[^>]*href="([^"]+\.pdf)"[^>]*>/is';
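
A brief sketch of why this works: the `[^>]*` after the closing quote absorbs whatever attributes remain in the tag, so `target` may be present or absent. The sample links are invented, and the extra `(.*?)<\/a>` group for the link name is an addition for illustration, not part of the pattern above.

```php
<?php
// Invented sample: one link with target, one without, one that is not a PDF.
$html = '<a href="a.pdf" target="_blank">With target</a>' . "\n"
      . '<a class="x" href="b.pdf">Without target</a>' . "\n"
      . '<a href="c.html">Not a PDF</a>';

// The pattern from the post, extended with a lazy group for the link name.
$pattern = '/<a\s+[^>]*href="([^"]+\.pdf)"[^>]*>(.*?)<\/a>/is';

preg_match_all($pattern, $html, $match, PREG_SET_ORDER);

foreach ($match as $m) {
    // $m[1] = href value, $m[2] = link name
    echo $m[1], ' => ', $m[2], "\n";
}
```

The leading `[^>]*` also allows other attributes, such as `class`, to appear before `href`.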

Thank you d11wtq

Posted: Thu Oct 27, 2005 6:54 am
by dibyendrah
Dear d11wtq,
Thank you very much for your help.

Dibyendra