Question for the gurus!

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

bluesaga
Forum Newbie
Posts: 3
Joined: Mon Jan 28, 2008 2:09 pm

Question for the gurus!

Post by bluesaga »

Hey guys, I have a pretty major regex headache at the moment. After toying with my regex for hours and hours, I've come up with something that is ALMOST there (I think):

Code: Select all

<a\s*?.+?>(?!(?:(?!<\/a>).)*')(.*?)</a>
As a breakdown: it matches '<a', then as few characters as it can up to a > that has no ' anywhere between it and the closing </a>. This way you can match:

Code: Select all

<a href='>>>>>>'>blah</a>
without having to worry about the >'s contained within the two quotes. However, after I figured I had finished, I came across a problem with input strings like:

Code: Select all

<a href='blah'>'blah'</a>
Because it finds a ' after the >, it doesn't match anything. I understand why that happens, but I don't know how to factor in a lookbehind that looks for the initial ' before the >.

If anyone could help me with this, I'll love you forever!
Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Re: Question for the gurus!

Post by Christopher »

Maybe you should check for:

'<a' whitespace 'href="' characters in URL '"' whatever '>'
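
A minimal sketch of one way to read this suggestion, written in Python rather than PHP so it can be run standalone. It generalizes the idea to any quoted attribute value (single or double quotes), so that a > inside quotes cannot end the tag. The names ANCHOR_RE and anchor_texts are hypothetical, and this is not a full HTML parser: malformed markup and nested anchors are not handled.

```python
import re

# Opening <a ...> tag where each piece of the tag is either a double-quoted
# value, a single-quoted value, or one character that is not a quote or '>'.
# Group 1 lazily captures everything up to the next </a>.
ANCHOR_RE = re.compile(
    r"""<a\b(?:"[^"]*"|'[^']*'|[^'">])*>(.*?)</a>""",
    re.IGNORECASE | re.DOTALL,
)


def anchor_texts(html):
    """Return the inner text of every <a>...</a> found in html."""
    return ANCHOR_RE.findall(html)


print(anchor_texts("<a href='>>>>>>'>blah</a>"))   # ['blah']
print(anchor_texts("<a href='blah'>'blah'</a>"))   # ["'blah'"]
```

Because the quoted alternatives are tried before the single-character fallback, a > is only treated as the end of the tag when it sits outside any quoted value.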
(#10850)
bluesaga
Forum Newbie
Posts: 3
Joined: Mon Jan 28, 2008 2:09 pm

Re: Question for the gurus!

Post by bluesaga »

Well, of course, if only it were THAT easy!

Code: Select all

<a href='url here' onclick='javascript:xyz("abc>")'>'xyz'</a>
It SHOULD match 'xyz', just as modern browsers would. However, because the > is followed by 'xyz' and my pattern rejects any > with a following ', it has issues. And please note, with "abc>" in the input, it will strangely match:

Code: Select all

")'>'xyz'</a>
Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Re: Question for the gurus!

Post by Christopher »

I forgot to ask, what part of the string are you looking for? It isn't clear from the post.
(#10850)
Ollie Saunders
DevNet Master
Posts: 3179
Joined: Tue May 24, 2005 6:01 pm
Location: UK

Re: Question for the gurus!

Post by Ollie Saunders »

bluesaga wrote: If anyone could help me with this, I'll love you forever!
Ooh I'm not sure I should answer this :D

I think it would probably be a whole bunch easier to parse the HTML into a DOM, get all the A tags and then get their nodeValues. Otherwise you're probably going to want to tackle this in two steps: write two preg_replace()s to remove all the attributes (one for single quoted and one for double quoted) and then run:

Code: Select all

// $html is assumed to hold the markup after the attributes have been stripped
preg_match('~<a>(.*?)<\/a>~ms', $html, $matches);
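
For comparison, here is a rough sketch of the parser route, in Python rather than PHP so it runs standalone. Note that it uses the standard library's event-based html.parser instead of a real DOM, which is a swapped-in technique, and the names AnchorTextParser and anchor_texts_parsed are hypothetical. It collects the text inside each A element and leaves the quote handling to the parser.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collects the text content of every <a> element fed to it."""

    def __init__(self):
        super().__init__()
        self.texts = []      # finished anchor texts, in document order
        self._buf = None     # list of text chunks while inside an <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buf is not None:
            self.texts.append("".join(self._buf))
            self._buf = None


def anchor_texts_parsed(html):
    """Return the text of each <a>...</a> in html, using the parser above."""
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts
```

Text inside nested tags (for example a b element within the link) is included in the collected string, which is roughly what a nodeValue-style lookup on the text nodes would give.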
bluesaga
Forum Newbie
Posts: 3
Joined: Mon Jan 28, 2008 2:09 pm

Re: Question for the gurus!

Post by bluesaga »

Hmm, I don't think that's going to be efficient enough for me. I'm doing major web crawling with this application, and at that scale it's going to get hellishly slow. I'm doing things like language detection, external link extraction (what this is for) and a few other pretty complicated bits and bobs.

Surely there is a way to do what I need?
Ollie Saunders
DevNet Master
Posts: 3179
Joined: Tue May 24, 2005 6:01 pm
Location: UK

Re: Question for the gurus!

Post by Ollie Saunders »

bluesaga wrote: Hmm, I don't think that's going to be efficient enough for me
Which option are you referring to? And how do you know until you profile it?

The two most common reasons for poor performance are network communication and data persistence. So it's probably more likely that downloading the page will be slower. But it's all speculation until you profile it.
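
As a rough illustration of what profiling could look like, here is a small timing sketch in Python (not the PHP the crawler presumably uses). Everything in it is an assumption for the sake of example: the pattern LINK_RE, the synthetic SAMPLE_PAGE, and the run count. Real numbers would have to come from real pages and the real extraction code.

```python
import re
import timeit

# Hypothetical quote-aware link pattern; stands in for whatever is being measured.
LINK_RE = re.compile(
    r"""<a\b(?:"[^"]*"|'[^']*'|[^'">])*>(.*?)</a>""",
    re.IGNORECASE | re.DOTALL,
)

# Synthetic page: 1000 copies of one awkward link, one per line.
SAMPLE_PAGE = (
    "<a href='url here' onclick='javascript:xyz(\"abc>\")'>'xyz'</a>\n" * 1000
)


def extract_links():
    """Run the extraction once over the sample page."""
    return LINK_RE.findall(SAMPLE_PAGE)


RUNS = 20
total_seconds = timeit.timeit(extract_links, number=RUNS)
per_page_seconds = total_seconds / RUNS

print(f"{len(extract_links())} links, {per_page_seconds * 1000:.3f} ms per page")
```

Comparing that per-page figure with the time it takes to download a page gives a first idea of where the time actually goes.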
bluesaga wrote: I'm doing major web crawling with this application
How major? If you're doing properly major crawling, you should probably be using C.