Question for the gurus!
Posted: Mon Jan 28, 2008 2:11 pm
by bluesaga
Hey guys, I have a pretty major regex headache at the moment. After toying with my regex for hours and hours I've come up with something ALMOST there (I think)...
Code:
<a\s*?.+?>(?!(?:(?!<\/a>).)*')(.*?)</a>
As a breakdown, it matches '<a', then as few characters as possible up to a > that is NOT followed by a ' anywhere before the closing </a>. This way you can match:
without having to worry about the >'s contained within the two quotes. However, after I figured I had finished, I came across a problem with input strings like:
Because it finds a ' after the >, it doesn't match anything. I understand why that happens, but I don't know how to factor in a lookbehind that looks for the opening ' before the >.
If anyone could help me with this, i'll love you forever!
Re: Question for the gurus!
Posted: Mon Jan 28, 2008 3:10 pm
by Christopher
Maybe you should check for:
'<a' whitespace 'href="' characters in URL '"' whatever '>'
Re: Question for the gurus!
Posted: Mon Jan 28, 2008 6:14 pm
by bluesaga
Well of course if it was THAT easy!
Code:
<a href='url here' onclick='javascript:xyz("abc>")'>'xyz'</a>
It SHOULD match 'xyz', like modern browsers do. However, because of the >'xyz and my pattern rejecting any > that has a ' after it, it has issues. And please note, with "abc>" in the input, it will strangely match:
Re: Question for the gurus!
Posted: Mon Jan 28, 2008 6:39 pm
by Christopher
I forgot to ask, what part of the string are you looking for? It isn't clear from the post.
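Assuming the target is the link text between the opening and closing tags (the 'xyz' in the sample above), one common way to sidestep the lookahead problem is to let the pattern consume each quoted attribute value as a single unit, so a > inside quotes can never end the tag. The sketch below uses Python's re module rather than PHP's preg functions; the names ANCHOR_RE and link_texts are made up for illustration, and the pattern assumes attribute quotes are balanced and anchors are not nested.

```python
import re

# An attribute region is a run of: ordinary characters, or whole quoted strings.
# Because quotes are excluded from the first alternative, a '>' inside a quoted
# value is swallowed by one of the quoted alternatives and cannot close the tag.
ANCHOR_RE = re.compile(
    r"""<a\b                # tag name, so <abbr> and friends are skipped
        (?:[^>"']           # any character that is neither a quote nor '>'
          |"[^"]*"          # a complete double-quoted value
          |'[^']*'          # a complete single-quoted value
        )*
        >                   # the real end of the opening tag
        (.*?)               # link text, as short as possible
        </a>""",
    re.IGNORECASE | re.DOTALL | re.VERBOSE,
)


def link_texts(html):
    """Return the text between each <a ...> and its closing </a>."""
    return [m.group(1) for m in ANCHOR_RE.finditer(html)]


sample = """<a href='url here' onclick='javascript:xyz("abc>")'>'xyz'</a>"""
print(link_texts(sample))
```

No lookahead or lookbehind is needed here, since the quote handling happens inside the opening tag itself. It will still be fooled by malformed markup such as an unclosed quote.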
Re: Question for the gurus!
Posted: Tue Jan 29, 2008 8:24 am
by Ollie Saunders
If anyone could help me with this, i'll love you forever!
Ooh I'm not sure I should answer this
I think it would probably be a whole bunch easier to parse the HTML into a DOM, get all the A tags and then get their nodeValues. Otherwise you're probably going to want to tackle this in two steps: write two preg_replace()s to remove all the attributes (one for single quoted and one for double quoted) and then
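A rough sketch of the parser route, for comparison. This is not a full DOM: it uses Python's standard-library event-based HTMLParser and simply collects the text seen between each opening and closing a tag. The class name LinkTextParser and the function extract_links are hypothetical, and nested anchors are not handled.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collects (href, text) pairs for every <a> element fed to it."""

    def __init__(self):
        super().__init__()
        self.links = []       # finished (href, text) pairs
        self._href = None     # href of the anchor currently open
        self._chunks = None   # text pieces inside the open anchor, else None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        # Only keep text while we are inside an anchor.
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._chunks is not None:
            self.links.append((self._href, "".join(self._chunks)))
            self._chunks = None


def extract_links(html):
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return parser.links


sample = """<a href='url here' onclick='javascript:xyz("abc>")'>'xyz'</a>"""
print(extract_links(sample))
```

The parser deals with the quoted > on its own, and the href comes along for free, which may matter if the end goal is link extraction.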
Re: Question for the gurus!
Posted: Tue Jan 29, 2008 12:13 pm
by bluesaga
Hmm, I don't think that's going to be efficient enough for me. I'm doing major web crawling with this application, and at that scale it's going to get hellishly slow. I'm doing things like language detection, external link extraction (what this is for) and a few other pretty complicated bits and bobs.
Surely there is a way to do what i need?

Re: Question for the gurus!
Posted: Tue Jan 29, 2008 12:26 pm
by Ollie Saunders
Hmm, I don't think that's going to be efficient enough for me
Which option are you referring to? And how do you know until you profile it?
The two most common reasons for poor performance are network communication and data persistence. So it's probably more likely that downloading the page will be slower. But it's all speculation until you profile it.
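As a minimal illustration of measuring rather than guessing, the sketch below times one extraction function over a synthetic page with Python's timeit module. Everything in it is an assumption made for the example: the page content is invented, extract is just one candidate implementation, and seconds_per_page is a hypothetical helper. The number only means something when compared against the time spent downloading and storing the same page.

```python
import re
import timeit

# One candidate extractor: a quote-aware anchor pattern (illustrative only).
ANCHOR_RE = re.compile(r"""<a\b(?:[^>"']|"[^"]*"|'[^']*')*>(.*?)</a>""", re.I | re.S)


def extract(html):
    """Return the link text of every anchor in the page."""
    return ANCHOR_RE.findall(html)


def seconds_per_page(func, page, repeats=200):
    """Average wall-clock seconds for one call of func(page)."""
    total = timeit.timeit(lambda: func(page), number=repeats)
    return total / repeats


# Invented stand-in for a downloaded page: 500 paragraphs with one link each.
page = "<p>filler text</p><a href='x' onclick='f(\"a>b\")'>link</a>" * 500

print(f"{seconds_per_page(extract, page):.6f} s per page")
```

Swapping another function in for extract gives a like-for-like comparison on the same input.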
I'm doing major web crawling with this application
How major? If you're doing properly major crawling, you should probably be using C.