
Web Scraping Help Needed. Maybe Regular Expressions?

Posted: Fri Oct 13, 2006 2:01 pm
by bradly
Hi everyone. I'm having a bit of trouble with a new script I am trying to write. Here is what is going on...

I am using cURL to grab the source of an HTML page. I'm trying to write a function that takes the display text of a link and returns the URL of that link.

For example, somewhere in the source file there is a link <a href="file.html">Click Here</a>
In this case I would want to be able to send the function "Click Here" and have it return "file.html".

I have been trying to use regular expressions all morning, but I usually do database e-commerce work and can't figure this out.

Thanks guys for any help you can offer!!

-Bradly

Posted: Fri Oct 13, 2006 2:05 pm
by feyd
Regular expressions are the general way to go about doing this. Would you care to post what you've tried and the results, so we have a baseline to help you from?

Posted: Fri Oct 13, 2006 2:56 pm
by bradly
feyd wrote: Regular expressions are the general way to go about doing this. Would you care to post what you've tried and the results, so we have a baseline to help you from?
LOL. I really wish I got close enough to have anything of use to post. This is my first time working with regular expressions and boy is it a process to learn!

I am really way out of my league here, but I can't see it being too difficult for a regex pro. I am guessing it would go something like this:

1.) Search for a line containing a link with the title. Something like:

Code:

$line = preg_grep('">' . $link_title . "</a>", $html_source);
I know this isn't rock solid as there could be some <b> or something in the link title, but in this case there isn't a chance of that.

2.) Then I would have a regex that looks for the string in between <a href=" and ">$link_title


Does this make sense? Is this the proper way of doing something of this nature? Thanks for any advice/help you can offer!

-Bradly


P.S. I just noticed that there is a Regular Expressions forum here. Can someone with the proper credentials move this topic? I don't want to cross-post.

Posted: Fri Oct 13, 2006 2:59 pm
by John Cartwright
bradly wrote: Can someone with the proper credentials move this topic? I don't want to cross-post.
Moved to Regex.

Posted: Fri Oct 13, 2006 9:20 pm
by n00b Saibot
bradly wrote:

Code:

$line = preg_grep('">' . $link_title . "</a>", $html_source);
You should use preg_match and match against the full <a href=""> tag. You will get better results.
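
For anyone landing here later, this is a rough sketch of how that suggestion might look. It is only one way to do it and rests on a few assumptions: the function name find_link_url is made up for illustration, the href value is wrapped in double quotes, and the link text contains no nested tags such as <b> (which bradly said is the case here). preg_quote() is used so that any regex metacharacters in the link title are matched literally.

```php
<?php
// Hypothetical helper (the name is an assumption, not from the thread):
// returns the href of the first link whose visible text equals $linkTitle,
// or null when no such link is found.
function find_link_url($html, $linkTitle)
{
    // Escape regex metacharacters in the title; '/' is the pattern delimiter.
    $title = preg_quote($linkTitle, '/');

    // Match the full tag: <a ... href="(url)" ...>title</a>, case-insensitive.
    // Assumes a double-quoted href and plain text between the tags.
    $pattern = '/<a\b[^>]*\bhref\s*=\s*"([^"]*)"[^>]*>\s*' . $title . '\s*<\/a>/i';

    if (preg_match($pattern, $html, $matches) === 1) {
        // $matches[1] holds the captured href value.
        return $matches[1];
    }

    return null;
}

// Example usage with the link from the first post.
$html_source = '<p><a href="other.html">Home</a> <a href="file.html">Click Here</a></p>';
echo find_link_url($html_source, 'Click Here'), "\n";
```

Note that preg_grep works on an array and returns the entries that match, so it would need the page split into lines first. preg_match on the whole source string with a capture group gets the URL in one step. For messier markup, a real HTML parser is likely to be more robust than a regex.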