Web Scraping Help Needed. Maybe Regular Expressions?

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

Moderator: General Moderators

bradly
Forum Newbie
Posts: 2
Joined: Fri Oct 13, 2006 1:50 pm

Web Scraping Help Needed. Maybe Regular Expressions?

Post by bradly »

Hi everyone. I'm having a bit of trouble with a new script I am trying to write. Here is what is going on...

I am using cURL to grab the source of an HTML page. I'm trying to write a function that is given the display text of a link and returns the URL of that link.

For example, somewhere in the source file there is a link <a href="file.html">Click Here</a>
In this case I would want to be able to send the function "Click Here" and have it return "file.html".

I have been trying to use regular expressions all morning, but I usually do database e-commerce stuff and can't figure this stuff out.

Thanks guys for any help you can offer!!

-Bradly
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

Regular expressions are the general way to go about doing this. Would you care to post what you've tried and the results so we have a baseline to help you from?
bradly
Forum Newbie
Posts: 2
Joined: Fri Oct 13, 2006 1:50 pm

Post by bradly »

feyd wrote:Regular expressions are the general way to go about doing this. Would you care to post what you've tried and the results so we have a baseline to help you from?
LOL. I really wish I got close enough to have anything of use to post. This is my first time working with regular expressions and boy is it a process to learn!

I am really way out of my league here, but I can't see it being too difficult for a regex pro. I am guessing it would go something like this:

1.) search for a line containing a link with the title. Something like:

Code:

$line = preg_grep('">' . $link_title . "</a>", $html_source);
I know this isn't rock solid as there could be some <b> or something in the link title, but in this case there isn't a chance of that.

2.) Then I would have a regex that would look for the string in between <a href=" and ">$link_title


Does this make sense? Is this the proper way of doing something of this nature? Thanks for any advice/help you can offer!

-Bradly


p.s. I just noticed that there is a Regular Expressions forum here. Can someone with the proper credentials move this topic? I don't want to cross-post.
John Cartwright
Site Admin
Posts: 11470
Joined: Tue Dec 23, 2003 2:10 am
Location: Toronto

Post by John Cartwright »

bradly wrote:Can someone with the proper credentials move this topic? I don't want to cross-post.
Moved to Regex.
n00b Saibot
DevNet Resident
Posts: 1452
Joined: Fri Dec 24, 2004 2:59 am
Location: Lucknow, UP, India

Post by n00b Saibot »

bradly wrote:

Code:

$line = preg_grep('">' . $link_title . "</a>", $html_source);
You should use preg_match and match against the full <a href=""> tag. You will get better results.
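
A minimal sketch of that suggestion, assuming the links use double-quoted href attributes and the link text contains no nested tags (as bradly said is the case here). The function name find_link_url and the sample HTML are made up for illustration. Note that preg_grep filters an array of strings and, like every preg_* function, needs pattern delimiters, so the line quoted above would not work as written on a single source string.

```php
<?php
// Hypothetical helper (not from the thread): returns the href of the first
// link whose display text matches $linkTitle, or null if none is found.
function find_link_url(string $html, string $linkTitle): ?string
{
    // preg_quote() escapes regex metacharacters in the title; the second
    // argument also escapes the '~' used as the pattern delimiter below.
    $title = preg_quote($linkTitle, '~');

    // Match the full <a ... href="..."> tag followed by the link text.
    // [^>]* and [^"]* keep the match inside a single tag and attribute.
    // The 'i' flag makes the whole pattern, title included, case-insensitive.
    $pattern = '~<a\b[^>]*\bhref="([^"]*)"[^>]*>\s*' . $title . '\s*</a>~i';

    if (preg_match($pattern, $html, $matches) === 1) {
        return $matches[1]; // first capture group: the href value
    }
    return null;
}

// Sample input, invented for this example.
$html = '<p><a href="other.html">Home</a> <a href="file.html">Click Here</a></p>';
echo find_link_url($html, 'Click Here'), "\n"; // prints "file.html"
```

If the markup is less predictable (single quotes, nested tags, attributes split across lines), a regex like this gets fragile quickly, so treat it as a starting point rather than a general HTML parser.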