
Can't seem to get the right regex to match...

Posted: Thu Aug 13, 2009 12:15 pm
by ct_lee
I am trying to get a regex to match my string, which is:
&nbsp;&middot;&nbsp; <a href=\"http://www.link.ext/some-page/\" title=\"Title goes here\">
Sometimes there is content between the <a and the href part, so I have used the regex:
.* <a .* href=\"
My test program, written in Java, is as follows:

Code:

public class test {
 
    public static void main( String[] arguments) {
        String test = "                                 &nbsp;&middot;&nbsp;                                                                    <a href=\"http://www.link.ext/some-page/\" title=\"Title goes here\">";
        if ( test.matches(".* <a .* href=\"") ) {
            System.out.println("matches");
        }
    }
 
}
Can anyone point out where I am going wrong, or provide a solution for matching that example link?

Thanks.

Re: Can't seem to get the right regex to match...

Posted: Thu Aug 13, 2009 12:23 pm
by prometheuzz
It doesn't match for two reasons:
1 - String.matches() returns true only if the entire String is matched by the regex. Since your regex stops after href=\", it can't match the entire String. Try adding another DOT-STAR at the end of your regex;
2 - there are two spaces in this part of your regex: <a .* href (one before and one after the DOT-STAR), while there is only one space between <a and href in your text.
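
To make both points concrete, here is a minimal sketch. The class name WholeMatchDemo and its helper methods are made up for illustration, and the test string is a shortened version of the one in the question. It compares String.matches(), which must consume the whole input, with Matcher.find(), which only needs a match somewhere in the input.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helpers are illustrative only.
class WholeMatchDemo {
    // Shortened version of the test string from the question.
    static final String INPUT =
        "&nbsp;&middot;&nbsp; <a href=\"http://www.link.ext/some-page/\" title=\"Title goes here\">";

    // String.matches() succeeds only if the regex consumes the whole input.
    static boolean wholeMatch(String regex) {
        return INPUT.matches(regex);
    }

    // Matcher.find() succeeds if the regex matches anywhere in the input.
    static boolean partialMatch(String regex) {
        return Pattern.compile(regex).matcher(INPUT).find();
    }

    public static void main(String[] args) {
        // Original regex: stops after href=" and needs a space on both sides of .*
        System.out.println(wholeMatch(".* <a .* href=\""));   // false
        // One space removed and a trailing .* added
        System.out.println(wholeMatch(".* <a .*href=\".*"));  // true
        // With find(), no leading or trailing .* is needed
        System.out.println(partialMatch("<a .*href=\""));     // true
    }
}
```

If you only care whether the link occurs somewhere in the text, find() is probably the more natural choice, since it avoids padding the regex with DOT-STAR on both ends.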

Another thing: matching text with DOT-STAR should be avoided where you can, because it will match almost anything, including text beyond the end of the tag. Be more specific where possible. So you shouldn't do:

Code:

"<a .*href"
but rather:

Code:

"<a\s[^>]*href"
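
As a rough illustration of the difference, the sketch below runs both patterns against a made-up snippet in which the anchor tag has no href at all but a later tag does. The class name TagBoundaryDemo, the helper names, and the sample HTML are assumptions for the example, not part of the original problem.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample input are illustrative only.
class TagBoundaryDemo {
    // DOT-STAR version: .* may run past the end of the <a> tag.
    static final Pattern LOOSE = Pattern.compile("<a .*href");
    // More specific version: [^>]* cannot cross the closing '>' of the tag.
    static final Pattern SPECIFIC = Pattern.compile("<a\\s[^>]*href");

    static boolean looseFind(String html) {
        return LOOSE.matcher(html).find();
    }

    static boolean specificFind(String html) {
        return SPECIFIC.matcher(html).find();
    }

    public static void main(String[] args) {
        // The <a> tag here has no href; the href belongs to the <link> tag.
        String noHrefInAnchor = "<a name=\"top\"></a> <link href=\"style.css\">";
        System.out.println(looseFind(noHrefInAnchor));    // true (false positive)
        System.out.println(specificFind(noHrefInAnchor)); // false

        // An anchor with an extra attribute before href still matches.
        String withId = "<a id=\"1234abc\" href=\"http://www.link.ext/some-page/\">";
        System.out.println(specificFind(withId));         // true
    }
}
```

Note the doubled backslash in "<a\\s[^>]*href": inside a Java string literal, the regex \s has to be written as \\s.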

Re: Can't seem to get the right regex to match...

Posted: Thu Aug 13, 2009 12:42 pm
by ct_lee
1. I had tried something like that in earlier examples but still didn't have any success.
2. That's because some of the links I am going through start the tag with <a id="1234abc" href="...
3. I had read that using .* is greedy with memory, but while I'm getting used to regex I wanted to use something simple. Thanks for the tip.

I tried to use:

Code:

"<a\s[^>]*href"
When I tried to compile my program, I got an error saying I had an illegal escape character, which pointed to the \s part of the regex string.

Any ideas?

edit:
In Java I think I have to use \\s instead of \s? I used the code below and it didn't match.

Code:

public class test {
 
    public static void main( String[] arguments) {
        String test = "                                 &nbsp;&middot;&nbsp;                                                                    <a href=\"http://www.link.ext/some-page/\" title=\"Title goes here\">";
        if ( test.matches("<a \\s[^>]*href") ) {
            System.out.println("matches");
        }
    }
 
}

Re: Can't seem to get the right regex to match...

Posted: Thu Aug 13, 2009 12:46 pm
by prometheuzz
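Inside a String literal, you need to add an extra backslash, so it's not \s but \\s.
Also note my remarks from point 1: matches() has to match the entire String, so the regex needs a DOT-STAR at the start and at the end.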
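
A small sketch of the escaping rule, using a made-up class name (EscapeDemo) and helper: the Java compiler turns "\\s" into the two characters \ and s, and the regex engine then reads those as the whitespace class.

```java
// Hypothetical demo class; the name and helper are illustrative only.
class EscapeDemo {
    // In Java source, "\\s" is a two-character String: a backslash followed by 's'.
    // Writing "\s" instead is what triggers the "illegal escape character" error.
    static final String WHITESPACE = "\\s";

    // True if the text starts with "<a" followed by any whitespace character.
    static boolean startsAnchorTag(String text) {
        return text.matches("<a" + WHITESPACE + ".*");
    }

    public static void main(String[] args) {
        System.out.println(WHITESPACE.length());                // 2
        System.out.println(startsAnchorTag("<a href=\"x\">"));   // true
        System.out.println(startsAnchorTag("<a\thref=\"x\">"));  // true (tab is whitespace too)
        System.out.println(startsAnchorTag("<abbr>"));           // false
    }
}
```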

Re: Can't seem to get the right regex to match...

Posted: Thu Aug 13, 2009 12:46 pm
by prometheuzz

Code:

test.matches(".*<a\\s[^>]*href=\".*")
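
If the goal is eventually to pull the URL out rather than just test for it, one possible variation is sketched below. It is an assumption about where this is heading, not something asked for above: the capturing group, the class name HrefExtractDemo, and the firstHref helper are all additions for illustration, and it uses find() so no surrounding DOT-STAR is needed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name, helper, and capturing group are illustrative additions.
class HrefExtractDemo {
    // Same core pattern as above, plus a group capturing the text between the href quotes.
    private static final Pattern HREF =
        Pattern.compile("<a\\s[^>]*href=\"([^\"]*)\"");

    // Returns the first href value found in the text, or null if there is none.
    static String firstHref(String html) {
        Matcher m = HREF.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String plain = "&nbsp;&middot;&nbsp; <a href=\"http://www.link.ext/some-page/\" title=\"Title goes here\">";
        String withId = "<a id=\"1234abc\" href=\"http://www.link.ext/some-page/\">";

        // The matches() call above returns true for the plain example.
        System.out.println(plain.matches(".*<a\\s[^>]*href=\".*")); // true
        System.out.println(firstHref(plain));   // http://www.link.ext/some-page/
        System.out.println(firstHref(withId));  // http://www.link.ext/some-page/
        System.out.println(firstHref("<a name=\"top\">")); // null
    }
}
```

This is still only a regex sketch; for arbitrary real-world HTML a proper parser is likely to be more reliable.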

Re: Can't seem to get the right regex to match...

Posted: Thu Aug 13, 2009 1:12 pm
by ct_lee
I had just figured out that I had missed that before I read your post. Thank you very much for your help.

Re: Can't seem to get the right regex to match...

Posted: Thu Aug 13, 2009 1:13 pm
by prometheuzz
No problem.