Asterisk bolding

superdezign
DevNet Master
Posts: 4135
Joined: Sat Jan 20, 2007 11:06 pm

Asterisk bolding

Post by superdezign »

I want to format text by wrapping it in marker characters. I'd like asterisks to indicate bold text, forward slashes to indicate italicized text, and underscores to indicate underlined text. The conditions for formatting are:
  • the opening formatting character is the first character in a line or comes right after a space
  • none of the characters between the formatting characters are spaces
  • the closing formatting character is the last character in a line or comes right before a space
Examples when to format:

*this*is*bolded* = this is bolded
/italics/ = italics
_and_underlined_ = and underlined

Examples when not to format:

http://website.tld/ = http://website.tld/
some_file_name.ext = some_file_name.ext
this*is*not*bold = this*is*not*bold


I can't seem to figure out how to catch the start and end of a line. I assumed that I could use this regex:

Code: Select all

~([\s\b])([\*\/\_])((?:[^\s]+?\2)+)([\s\b])~
For the benefit of those visitors who do not know regex, step-by-step:
([\s\b]) is meant to match a whitespace character or a word break and save it as \1
([\*\/\_]) matches an asterisk, a forward slash, or an underscore and saves it as \2
((?:[^\s]+?\2)+) matches one or more runs of non-whitespace characters, each followed by the content of \2, and saves the result as \3
([\s\b]) is meant to match a whitespace character or a word break and save it as \4

If there is a space both before and after the formatted content, this works. However, that is often not the case: a paragraph may start with a formatted word, or a sentence may end with a formatted word followed by a period. It also seems that using \b doesn't help at all. Does anyone know how to accomplish this?
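
One likely reason \b appears not to help: in PCRE, \b inside a character class stands for the backspace character, not a word boundary, so [\s\b] always has to consume a real character and can never match at the very start or end of the text. The check below is a small sketch of that behaviour. The alternation (?:^|\s) is only an assumption about what was intended ("start of text or whitespace"), not the fix adopted later in the thread.

```php
<?php
// Inside a character class, \b is the backspace character (0x08), not a word boundary.
$inClass = preg_match('~[\b]~', "\x08");

// Nothing precedes the asterisk, so [\s\b] has no character to consume.
$atStart = preg_match('~[\s\b]\*bold\*~', '*bold*');

// With a leading space, the class matches the space.
$afterSpace = preg_match('~[\s\b]\*bold\*~', ' *bold*');

// An alternation can express "start of text or whitespace" instead.
$alternation = preg_match('~(?:^|\s)\*bold\*~', '*bold*');

var_dump($inClass, $atStart, $afterSpace, $alternation);
```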
superdezign
DevNet Master
Posts: 4135
Joined: Sat Jan 20, 2007 11:06 pm

Re: Asterisk bolding

Post by superdezign »

Never mind. After approaching it from a different direction, I decided that the conditions could be better defined as "not preceded by a letter" and "not followed by a letter" (strictly, a word character: \w also covers digits and the underscore). I used lookarounds to accomplish this.

Code: Select all

~(?<!\w)([\*\/\_])((?:[^\s]+?\1)+)(?!\w)~
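
A minimal sketch of how this pattern might be applied, assuming the output is HTML and that the markers inside a run become spaces, as in the examples in the first post. The thread does not show the replacement code, so the function name formatMarkers and the tag map are hypothetical.

```php
<?php
// Hypothetical helper: the function name and the marker-to-tag map are assumptions.
function formatMarkers(string $text): string
{
    $tags = ['*' => 'b', '/' => 'i', '_' => 'u'];
    $pattern = '~(?<!\w)([\*\/\_])((?:[^\s]+?\1)+)(?!\w)~';

    return preg_replace_callback($pattern, function (array $m) use ($tags): string {
        $marker = $m[1];
        // $m[2] ends with the closing marker: drop it, then turn inner markers into spaces.
        $inner = str_replace($marker, ' ', substr($m[2], 0, -1));
        $tag = $tags[$marker];
        return "<$tag>$inner</$tag>";
    }, $text);
}

echo formatMarkers('*this*is*bolded* and this*is*not*bold'), "\n";
echo formatMarkers('some_file_name.ext (/italicized/text/ offhand comment)'), "\n";
```

It is worth running the "do not format" examples through something like this as well. By my reading, the second slash in http://website.tld/ is preceded by a colon rather than a word character, so the lookbehind alone may not protect URLs.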
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Asterisk bolding

Post by prometheuzz »

superdezign wrote:I can't seem to figure out how to catch the start and end of a line.
If you enable "multiline" mode, ^ and $ match not only the start and end of the entire string, but also the start and end of each line. You can enable this by adding an 'm' at the end of your regex (after the closing delimiter), or by putting the flag "(?m)" at the start of your regex:

Code: Select all

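$regex = '~^([*/_])(?:\S+?\1)+$~m';
// or
$regex = '~(?m)^([*/_])(?:\S+?\1)+$~';
As you can see, there is no need to escape '*', '/' or '_' inside a character class, as long as none of them is the pattern delimiter (hence '~' as the delimiter here; with '/' as the delimiter, the slash would have to be written as '\/'). And "[^\s]" is equivalent to "\S".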
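
A short check of the difference the m modifier makes, using '~' as the delimiter so the slash in the character class needs no escaping. The sample text is made up for illustration.

```php
<?php
$text = "*first*line*\nplain text\n_second_marked_line_";

// Without the m modifier, ^ and $ anchor only at the start and end of the whole string.
$withoutM = preg_match_all('~^([*/_])(?:\S+?\1)+$~', $text, $a);

// With it, they also anchor at every line break.
$withM = preg_match_all('~^([*/_])(?:\S+?\1)+$~m', $text, $b);

var_dump($withoutM, $withM, $b[0]);
```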
superdezign
DevNet Master
Posts: 4135
Joined: Sat Jan 20, 2007 11:06 pm

Re: Asterisk bolding

Post by superdezign »

By catching the start and end of a line, I meant to use that as a possible condition rather than a required condition. I wanted to define the cases as starting with whitespace or the start of a line (or the start of the content) and as ending with whitespace or the end of a line (or the end of the content). However, I also wanted formatting to be able to start after characters such as a parenthesis or an asterisk. Defining the boundary as a non-word character ended up being much more precise after I noticed these exceptions. Example:

Plaintext:
*Bold* text, yada yada (/italicized/text/ offhand comment) foo _underline_me_ **bold*with*asterisks*around*it** etc.

Parsed:
Bold text, yada yada (italicized text offhand comment) foo underline me *bold with asterisks around it* etc.

As for escaping the characters, I had done so because my regex was generated from an array that I defined. So, to remain safe, I have the code escape all of the characters, just in case I were to use a bracket or something. I just posted the regex as PHP printed it to be sure I had the correct regex.

And as for the negation by "\S", I had completely forgotten about that. Thank you! Simplifying regex is always good. :D
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Asterisk bolding

Post by prometheuzz »

superdezign wrote:...
As for escaping the characters, I had done so because my regex was generated from an array that I defined. So, to remain safe, I have the code escape all of the characters, just in case I were to use a bracket or something.
...
Note that preg_quote can handle escaping metacharacters:

Code: Select all

$text = '1 + 1 = _ ? _';
echo preg_quote($text);
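
For a pattern generated from an array, as described earlier in the thread, one possible approach is to pass each marker through preg_quote() with the delimiter as the second argument, so that the delimiter is escaped too if it ever appears in the list. The $markers array below is a hypothetical stand-in for the one mentioned above, and the surrounding pattern reuses the lookaround version posted earlier.

```php
<?php
// Hypothetical marker list standing in for the array described in the thread.
$markers = ['*', '/', '_'];
$delimiter = '~';

// preg_quote() escapes regex metacharacters, plus the delimiter when one is given.
$quoted = array_map(fn (string $m): string => preg_quote($m, $delimiter), $markers);

// Build the character class, then the full pattern around it.
$class = '[' . implode('', $quoted) . ']';
$pattern = $delimiter . '(?<!\w)(' . $class . ')((?:\S+?\1)+)(?!\w)' . $delimiter;

echo $pattern, "\n";
var_dump(preg_match($pattern, '*bold*'));
```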
superdezign
DevNet Master
Posts: 4135
Joined: Sat Jan 20, 2007 11:06 pm

Re: Asterisk bolding

Post by superdezign »

preg_quote() doesn't recognize where in the regex the characters are. In brackets, the asterisk has no meaning, but outside of the brackets it does. Regardless, it *does* make the code cleaner, so thanks for the advice. :D
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Asterisk bolding

Post by prometheuzz »

superdezign wrote:preg_quote() doesn't recognize where in the regex the characters are. In brackets, the asterisk has no meaning, but outside of the brackets it does. Regardless, it *does* make the code cleaner, so thanks for the advice. :D
You're most welcome!