Matching links having pdf extension

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

dibyendrah
Forum Contributor
Posts: 491
Joined: Wed Oct 19, 2005 5:14 am
Location: Nepal

Matching links having pdf extension

Post by dibyendrah »

Dear All,
I have written functions that read a remote URL and search the returned HTML for links. However, I want to match only the links that have a .pdf extension.

The code I have written is as follows:

Code: Select all

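// string open_page(string $page_url): fetch the page and return its HTML
function open_page($page_url){

	$handle = fopen($page_url, "rb");
	if ($handle === false) {
		return '';	// the URL could not be opened
	}
	$HTML = '';

	// read the response in 8 KB chunks until the end of the stream
	while (!feof($handle)) {
	  $HTML .= fread($handle, 8192);
	}
	fclose($handle);

	return $HTML;
}



// array scan_links(string $HTML): capture the href value of every <a> tag
function scan_links($HTML){

	preg_match_all("/<a[^>]+href=\"([^\"]+)/i", $HTML, $match);
	return $match;
}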
Any help will be appreciated!!

Thanks.

Love,
Dibyendra
Last edited by dibyendrah on Tue Oct 25, 2005 3:46 am, edited 2 times in total.
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

Are you sure you know what Regular Expressions are?

Code: Select all

preg_match_all("/<a[^>]+href=\"([^\"]+\.pdf)\"/i", $HTML, $match);
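
For reference, here is a minimal sketch of how the result of such a call can be read. The helper name scan_pdf_links and the sample HTML are made up for illustration. The pattern used here keeps the closing quote outside the capture group, so the captured value is the bare URL.

```php
<?php
// Hypothetical helper: return only the href values that end in .pdf.
function scan_pdf_links($HTML) {
    // Group 1 captures the URL; the closing quote stays outside the group.
    preg_match_all('/<a[^>]+href="([^"]+\.pdf)"/i', $HTML, $match);
    // $match[0] holds the full matches, $match[1] the captured URLs.
    return $match[1];
}

// Made-up sample input for illustration.
$sample = '<a href="a.pdf">A</a> <a href="b.html">B</a> <a class="x" href="docs/C.PDF">C</a>';
$links = scan_pdf_links($sample);
print_r($links);
```

Because of the i modifier, an upper-case .PDF extension is matched as well, while the .html link is skipped.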

thanks

Post by dibyendrah »

Dear Ambush Commander,
Thanks for the help and the comment! I'm not very good at regular expressions yet and am still learning regex (PCRE and POSIX). Anyway, thanks for the help.

love,
Dibyendra

Please help me to solve the problem

Post by dibyendrah »

Ambush Commander wrote: Are you sure you know what Regular Expressions are?

Code: Select all

preg_match_all("/<a[^>]+href=\"([^\"]+\.pdf)\"/i", $HTML, $match);
Hello all,
I am trying to scan the links with this pattern:

Code: Select all

preg_match_all("/<td><a[^>]+href="([^"]+\.pdf)" target="(.*)">(.*)<\/a></td><td>(.*)</td>/i", $HTML, $match);
but the above code gives an error. The HTML I need to scan looks like this:

Code: Select all

<tr> 
    <td><a href="http://orion.lib.virginia.edu/thdl/texts/reprints/nepali_times/Nepali_Times_268.pdf" target="_blank"># 
      268</a></td>
    <td width="42%">7 - 13 October  2005 [2.2 MB]</td>
    <td class="nodata">&nbsp;</td>
    <td class="nodata">&nbsp;</td>
  </tr>
I need to get the link inside href="" and the link name between <a> and </a> from the first cell, <td><a href="http://orion.lib.virginia.edu/thdl/text ... es_268.pdf" target="_blank">#
268</a></td>. I also need to extract the PDF date and [size] from the second cell, <td width="42%">7 - 13 October 2005 [2.2 MB]</td>.
The width attribute in the second cell is optional and may be absent.

Please help me to solve this problem.

With best regards,
Dibyendra

URL to scan pdf

Post by dibyendrah »

The URL I'm trying to scan for PDF links is:
http://www.digitalhimalaya.com/collecti ... /index.php

I want to extract the link and the link name from the first cell of each row, and the link info from the second cell.

Code: Select all

$pattern = "/<a[^>]+href=\"([^\"]+\.pdf)\" target=\"_blank\">(.*)/i";
This pattern finds the links, but the problem is that the opening <a href=""> and the closing </a> are not on the same line. Can we match both the link inside href and the link name inside <a href="link">link name</a> even if the closing </a> tag is on a different line?

Thank you for all the help around this forum!!

Regards,
Dibyendra
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

use the s pattern modifier. You'll need to make some adjustments to your pattern as well, as right now, it's greedy as all hell. .* should be .*?
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Post by Chris Corbyn »

feyd wrote: use the s pattern modifier. You'll need to make some adjustments to your pattern as well, as right now, it's greedy as all hell. .* should be .*?
... and ([^"]+\.pdf) should be ([^"]+?\.pdf) ;)
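
Putting this advice together, here is a rough sketch of how the row from the question might be parsed. It is only one possible approach: the ~ delimiter, the variable names and the whitespace clean-up are assumptions added for illustration, not something posted in this thread. With ~ as the delimiter, the / in closing tags does not need escaping; the s modifier lets . match newlines, and the lazy .*? stops at the first closing tag.

```php
<?php
// Sample row copied from the question above.
$HTML = <<<'HTML'
<tr>
    <td><a href="http://orion.lib.virginia.edu/thdl/texts/reprints/nepali_times/Nepali_Times_268.pdf" target="_blank">#
      268</a></td>
    <td width="42%">7 - 13 October  2005 [2.2 MB]</td>
    <td class="nodata">&nbsp;</td>
</tr>
HTML;

// i: case-insensitive; s: . also matches newlines.
// (.*?) is lazy, so each group ends at the first closing tag it meets.
// <td[^>]*> tolerates the optional width attribute on the second cell.
$pattern = '~<td>\s*<a\s[^>]*href="([^"]+\.pdf)"[^>]*>(.*?)</a>\s*</td>\s*<td[^>]*>(.*?)</td>~is';

// PREG_SET_ORDER gives one array per matched row.
preg_match_all($pattern, $HTML, $rows, PREG_SET_ORDER);

$results = array();
foreach ($rows as $row) {
    $results[] = array(
        'url'  => $row[1],
        // Collapse the line break and indentation inside the captured text.
        'name' => trim(preg_replace('/\s+/', ' ', $row[2])),
        'info' => trim(preg_replace('/\s+/', ' ', $row[3])),
    );
}
print_r($results);
```

Each entry in $results then holds the PDF URL, the link text and the date/size text for one table row.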

How to ignore the target attribute in the link tag

Post by dibyendrah »

Dear all,
Thank you all for the help! However, I found that some pages don't have the target attribute. How can I write a pattern that ignores this attribute when it is absent?

I wrote the following pattern, but it doesn't match.

Code: Select all

preg_match_all("/<a[^>]+href=\"([^\"]+\.pdf)\" (?!target=\"?([^\s\">]*))>/i", $HTML, $match);
What might be wrong?

Love,
Dibyendra
Last edited by dibyendrah on Thu Oct 27, 2005 6:17 am, edited 1 time in total.

Post by Chris Corbyn »

Code: Select all

$pattern = '/<a\s+[^>]*href="([^"]+\.pdf)"[^>]*>/is';
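
To illustrate why this works where the lookahead attempt did not: (?!...) only asserts that target does not follow and consumes nothing, whereas [^>]* after the closing quote simply skips whatever attributes remain, or none at all. A small sketch, with made-up sample links:

```php
<?php
// Pattern from the post above: [^>]* after the closing quote skips any
// remaining attributes (target, title, ...) or nothing at all.
$pattern = '/<a\s+[^>]*href="([^"]+\.pdf)"[^>]*>/is';

// Made-up sample links for illustration.
$HTML = '<a href="one.pdf" target="_blank">One</a>'
      . '<a href="two.pdf">Two</a>'
      . '<a class="doc" href="three.pdf" title="x">Three</a>'
      . '<a href="four.html" target="_blank">Four</a>';

preg_match_all($pattern, $HTML, $match);
print_r($match[1]);
```

The links with and without target are both captured, and the .html link is left out.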

Thank you d11wtq

Post by dibyendrah »

Dear d11wtq,
Thank you very much for your help.

Dibyendra