regex to find pdfs in html code

Posted: Tue Apr 25, 2006 2:52 am
by dibyendrah
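This function takes HTML code as input and returns the matched PDF links as an array, or false if the page contains none. The array is the preg_match_all() result: index 0 holds the full <a> tags and index 1 holds the PDF URLs.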

Code:

  // array|false scan_links(string $HTML)
  // Returns the preg_match_all() result ($match[0] = full <a> tags,
  // $match[1] = captured PDF URLs), or false if no PDF links are found.
  function scan_links($HTML)
  {
      // Match <a ... href="something.pdf" ...>, case-insensitively.
      // Only double-quoted href values are matched.
      preg_match_all('/<a\s+[^>]*href="([^"]+\.pdf)"[^>]*>/is', $HTML, $match);

      // Drop empty groups so that a page without matches leaves $match empty.
      foreach ($match as $k => $v) {
          if (empty($v)) {
              unset($match[$k]);
          }
      }

      if (count($match) != 0) {
          return $match;
      }

      // Nothing found: warn the user in the browser and signal failure.
      $alert = 'Page contains no valid pdf paths!';
      print "<SCRIPT> alert('$alert');</SCRIPT>";
      return false;
  }
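
If only the URLs are needed, the same pattern can be wrapped in a smaller helper. This is a rough sketch, not part of the function above: the name pdf_urls_from_html and the sample HTML are made up for illustration, and it assumes href values are double-quoted, as the pattern requires.

```php
<?php
// Hypothetical helper (not from the original post): returns only the
// captured PDF URLs instead of the full preg_match_all() result.
function pdf_urls_from_html($html)
{
    // Same pattern as scan_links(): double-quoted href values ending in .pdf.
    $found = preg_match_all('/<a\s+[^>]*href="([^"]+\.pdf)"[^>]*>/is', $html, $match);

    // preg_match_all() returns the number of matches (0 if none, false on error).
    if (!$found) {
        return array();
    }

    // $match[0] holds the full <a ...> tags, $match[1] the captured URLs.
    return $match[1];
}

// Sample input, made up for illustration.
$html = '<p><a href="docs/report.pdf" target="_blank">Report</a> '
      . '<a class="x" href="/files/Manual.PDF">Manual</a> '
      . '<a href="index.html">Home</a></p>';

$urls = pdf_urls_from_html($html);
print_r($urls);
```

For the sample input this yields the two PDF URLs and skips index.html. Because of the /i flag, the .PDF extension matches as well. An empty array is returned when nothing matches, so the caller can decide how to report it instead of the function printing a script tag itself.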

Cheers,
Dibyendra