Strip leading HTML

Posted: Sat Aug 11, 2007 2:19 pm
by dustrg
I realize that regex can't handle nested tags. But I needed to strip the HTML, and all but one space if any spaces were present, from only the front of a string (say, the sort retrieved by innerHTML). So there is a complete run of HTML tags followed by the substring where I want to start, which I'll call the text.

I tried the pattern - /<*[^>]*>/g,'' - but it removed all tags, even after the text began. I removed character entities the same way, with - /&*[^;]*;/g,'' - and it also removed them everywhere.

I ultimately had to resort to a function that I use elsewhere to find the start of the text. But I wondered whether a regex could still be used. Nesting only matters when you need to find the end of the text; for that, some function would probably have to parse the hierarchy as a simple stack. To find the start of the text, any HTML nesting doesn't matter. How could I get the pattern - /<*[^>]*>|&*[^;]*;/g,'' - to stop at the first letter or digit it encounters outside of a tag or entity?

Posted: Sat Aug 11, 2007 7:22 pm
by superdezign
If you want us to give you a regex example, you should give us an example of what you are given and what you want to get out of it.

You may also want to use strip_tags().

Posted: Sun Aug 12, 2007 11:44 am
by dustrg
superdezign wrote: If you want us to give you a regex example, you should give us an example of what you are given and what you want to get out of it.
I needed to strip the HTML, and all but one space if any spaces were present, from only the front of a string (say, the sort retrieved by innerHTML). So there is a complete run of HTML tags followed by the substring where I want to start, which I'll call the text.

But if you want some specific example:

" <span><br> &nbsp;<br>He honed his players into <i>Hall of Famers</i>, <i>MVPs</i>, Pro&nbsp;Bowlers, household names and winners.<br></span>"

Into:

" He honed his players into <i>Hall of Famers</i>, <i>MVPs</i>, Pro&nbsp;Bowlers, household names and winners.<br></span>"
superdezign wrote: use strip_tags().
As far as I know, this regex strips every tag, every comment and every character entity:

/<*[^>]*>|&*[^;]*;/g,''

It would be followed by / {2,}/g,' ' to collapse consecutive spaces. I want it to stop at the 'H' in "He" above, or at whatever non-whitespace character is there that isn't a "<" or "&".

Posted: Sun Aug 12, 2007 12:13 pm
by superdezign
... That's.... odd.

@(<[^>]+>|&[^;]+;|[^a-z])*@i
It takes HTML tags, character entities, and anything that isn't a letter and strips them away. As for keeping the space... You're on your own.
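
For anyone comparing this with the earlier `<*[^>]*>` attempt, here is a rough JavaScript sketch of the difference. The sample string is made up for illustration and is not from the thread. Because `<*` allows zero "<" characters, the looser form can swallow ordinary text that sits in front of a ">", whereas `<[^>]+>` only matches something that actually starts with "<".

```javascript
// Hypothetical sample input, invented for this illustration.
const sample = "intro <b>bold</b> &amp; more";

// `<*` permits zero "<" characters, so `[^>]*>` consumes any text
// that precedes the next ">" as well as the tag itself.
const loose = sample.replace(/<*[^>]*>/g, "");

// Requiring a literal "<" restricts each match to the tag.
const strict = sample.replace(/<[^>]+>/g, "");

console.log(JSON.stringify(loose));  // " &amp; more"
console.log(JSON.stringify(strict)); // "intro bold &amp; more"
```

Neither form is anchored, so with the g flag both still act on the whole string rather than only the front.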

Posted: Sun Aug 12, 2007 2:02 pm
by dustrg
superdezign wrote:... That's.... odd.

@(<[^>]+>|&[^;]+;|[^a-z])*@i
It takes HTML tags, character entities, and anything that isn't a letter and strips them away. As for keeping the space... You're on your own.
Thanks. This worked:

/^(<[^>]+>|&[^;]+;|[^a-z])*/gi,''
It would be a two-step operation in any case. One could just use the old method I had:

/<[^>]+>|&[^;]+;/g,''
followed by

/ {2,}/g,' '
which would leave a single space at the front if any were found before the first letter or digit.

Then you'd need a third operation, a conditional, to tack on the leading space if it was found. But with that caveat, you basically could use regex in this case.

So, as a single expression:

(strX.replace(/<[^>]+>|&[^;]+;/g,'').replace(/ {2,}/g,' ').substring(0,1)==' ' ? ' ' : '') + strX.replace(/^(<[^>]+>|&[^;]+;|[^a-z])*/gi,'')
I don't know whether that's more of a maintenance headache than APL.
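
For readability, the same steps could be wrapped in a small function. This is only a sketch: the name stripLeadingHtml is made up, the regexes are the ones above, and I've dropped the g flag on the anchored pattern on the assumption that ^ can only match once anyway.

```javascript
// Hypothetical helper name; the patterns are the ones from this thread.
function stripLeadingHtml(strX) {
  // Remove every tag and entity, then collapse runs of spaces, purely
  // to find out whether the visible text begins with a space.
  const flattened = strX.replace(/<[^>]+>|&[^;]+;/g, "").replace(/ {2,}/g, " ");
  const lead = flattened.charAt(0) === " " ? " " : "";

  // Strip tags, entities and non-letters from the front only (anchored with ^).
  const body = strX.replace(/^(<[^>]+>|&[^;]+;|[^a-z])*/i, "");
  return lead + body;
}

// The example string from earlier in the thread.
const input = " <span><br> &nbsp;<br>He honed his players into <i>Hall of Famers</i>, <i>MVPs</i>, Pro&nbsp;Bowlers, household names and winners.<br></span>";
console.log(JSON.stringify(stripLeadingHtml(input)));
// " He honed his players into <i>Hall of Famers</i>, <i>MVPs</i>, Pro&nbsp;Bowlers, household names and winners.<br></span>"
```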

Anyway, thanks again. Problem solved. (I'd just add that the more general case would be \S instead of a-z inside the negated class, i.e. [^\S], which is equivalent to \s.)
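
A quick sketch of what that changes, using a made-up string that begins with a digit rather than a letter (the input is an assumption for illustration only):

```javascript
// Hypothetical input whose visible text starts with a digit.
const s = "<p> &nbsp;42 wins</p>";

// [^a-z] also consumes digits and punctuation at the front.
const lettersOnly = s.replace(/^(<[^>]+>|&[^;]+;|[^a-z])*/i, "");

// [^\S] (the same as \s) stops at the first non-whitespace character
// that does not start a tag or an entity.
const anyVisible = s.replace(/^(<[^>]+>|&[^;]+;|[^\S])*/, "");

console.log(lettersOnly); // wins</p>
console.log(anyVisible);  // 42 wins</p>
```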