Adding HTML paragraph tags to stand-alone text paragraphs?

carlos123
Forum Newbie
Posts: 5
Joined: Sun Apr 12, 2009 10:26 pm

Adding HTML paragraph tags to stand-alone text paragraphs?

Post by carlos123 »

Hi everyone,

The regular expression problem I am having is driving me absolutely nuts! I have spent hours on this to no avail. If anyone has any input it would be much appreciated.

Here's some text...

Code: Select all

 
 
About Me.
 
My profesional goal is to develop a client base that will allow me to make a living entirely
 
I am well on the way to realizing that goal and already work from home but am looking
 
The kind of web development services that I offer are not just related to what one
 
To see a more detailed listing and explanation of the kinds of things that I can do
 
Carlos
 
 
 
I want to insert the HTML <p></p> tags around each paragraph.  No problem using the following regular expression function (passing the text above to it in the $text parameter)...
 

Code: Select all

 
function addParagraphs($text)
{
   // Add paragraph elements
   $lf = chr(10);
   return preg_replace('/
      \n
     (.*)
     \n
     /Ux' , $lf.'<p>'.$lf.'$1'.$lf.'</p>'.$lf, $text);
}
 
 
The output from running the text through the above function is...
 

Code: Select all

 
 
<p>
About Me.
</p>
 
<p>
My profesional goal is to develop a client base that will allow me to make a living entirely
</p>
 
<p>
I am well on the way to realizing that goal and already work from home but am looking
</p>
 
<p>
The kind of web development services that I offer are not just related to what one
</p>
 
<p>
To see a more detailed listing and explanation of the kinds of things that I can do
</p>
 
<p>
Carlos
</p>
 
 
 
But I am trying to get the paragraph function to ignore paragraphs that are already hand-coded with any kind of HTML tag around them.  So for example if the first paragraph in my text file was...
 

Code: Select all

 
 
<p class="someclass">
About Me
</p>
 
 
 
I would want the function to put the <p></p> tags around every other paragraph but the first one.  I want the function to leave any paragraphs that already have HTML tags around them alone.  
 
Unfortunately having a first paragraph like the one above in the text breaks the function and messes things up.  
 
What happens is that the function matches the newline after "About Me" and then puts the paragraph tags around the following "</p>", resulting in...
 

Code: Select all

 
 
<p class="someclass">
About Me
<p>
</p>
</p>
 
 
The rest of the paragraphs get messed up as well.

Anybody got any suggestions?

Thanks.

Carlos
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: Adding HTML paragraph tags to stand-alone text paragraphs?

Post by ridgerunner »

Comments:
  • Using the PCRE 'U' mode flag is not recommended - best practice is to explicitly specify the lazy quantifier in the regex itself.
  • As written, your regex paragraphifies empty space (i.e. multiple newlines).
  • Your style of adding a linefeed after each opening tag and before each closing tag makes this task very difficult (if not impossible) to achieve using a regex alone.
  • The regex uses the * quantifier, which can match zero chars between newlines. Use + instead so that it matches at least one char.
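To illustrate the first and last points, here is a minimal sketch. The sample text and variable names are made up for illustration, and the patterns are simplified one-line versions of the original regex, not a fix for the tagged-paragraph problem.

```php
<?php
// Hypothetical sample: two paragraphs separated by two blank lines.
$text = "\nFirst.\n\n\nSecond.\n";

// The 'U' flag inverts greediness for the whole pattern...
$withFlag = preg_replace('/\n(.*)\n/U', "\n<p>$1</p>\n", $text);

// ...while an explicit lazy quantifier says the same thing in the regex itself.
$explicit = preg_replace('/\n(.*?)\n/', "\n<p>$1</p>\n", $text);

// With *, two adjacent newlines match with an empty capture: "<p></p>".
$star = $explicit;

// With +, at least one character is required, so blank lines are skipped.
$plus = preg_replace('/\n(.+?)\n/', "\n<p>$1</p>\n", $text);

var_dump($withFlag === $explicit);              // bool(true)
var_dump(strpos($star, '<p></p>') !== false);   // bool(true)
var_dump(strpos($plus, '<p></p>') !== false);   // bool(false)
```

Both forms produce the same result here; the explicit quantifier is simply easier to read because the laziness is visible at the spot where it applies.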
I don't think this problem can be solved with a regex alone. You first need to split up the text into "tagged paragraphs" and everything else, then apply your regex to the "everything else" sections. Here is a script which does exactly that:

Code: Select all

<?php
function addParagraphsNew($text)
{
// local variables
$returntext = '';       // modified string to return back to caller
$sections   = array();  // array of text sections returned by preg_split()
$pattern1   = '%        # match: <tag attrib="xyz">contents</tag>
^                       # tag must start on the beginning of a line
(                       # capture whole thing in group 1
  <                     # opening tag starts with left angle bracket
  (\w++)                # capture tag name into group 2
  [^>]*+                # allow any attributes in opening tag
  >                     # opening tag ends with right angle bracket
  .*?                   # lazily grab everything up to closing tag
  </\2>                 # closing tag for one we just opened
)                       # end capture group 1
$                       # tag must end on the end of a line
%smx';                  // s-dot matches newline, m-multiline, x-free-spacing
 
$pattern2   = '%        # match: \n--untagged paragraph--\n
(?:                     # non-capture group for first alternation. Match either...
  \s*\n\s*+             # a newline and all surrounding whitespace (and discard)
|                       # or...
  ^                     # the beginning of the string
)                       # end of first alternation group
(.+?)                   # capture all text between newlines (or string ends)
(?:\s+$)?               # clear out any whitespace at end of string
(?=                     # end of paragraph is position followed by either...
  \s*\n\s*              # a newline with optional surrounding whitespace
|                       # or...
  $                     # the end of the string
)                       # end of second alternation group
%x';                    // x-free-spacing
 
// first split text into tagged portions and untagged portions
// Note that the array returned by preg_split with PREG_SPLIT_DELIM_CAPTURE flag will get one
// extra member for each set of capturing parentheses. In this case, we have two sets; 1 - to
// capture the whole HTML tagged section, and 2 - to capture the tag name (which is needed to
// match the closing tag).
$sections = preg_split($pattern1, $text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE );
 
// now put it back together, processing only the untagged sections
for ($i = 0; $i < count($sections); $i++) {
    if (preg_match($pattern1, $sections[$i]))
    { // this is a tagged paragraph, don't modify it, just add it (and increment array ptr)
        $returntext .= "\n" . $sections[$i] . "\n";
        $i++; // need to skip over the extra array element for capture group 2
    } else
    { // this is an untagged section. Add paragraph tags around bare paragraphs
        $returntext .= preg_replace($pattern2, "\n<p>$1</p>\n", $sections[$i]);
    }
}
$returntext = preg_replace('/^\s+/', '', $returntext); // clean leading whitespace
$returntext = preg_replace('/\s+$/', '', $returntext); // clean trailing whitespace
return $returntext;
}
 
// Read html file to be processed into $data variable
$data = file_get_contents('test.txt');
echo addParagraphsNew($data);
?>
Script notes: I modified your regex which matches text between newlines. This new one matches multiple consecutive newlines and whitespace preceding the "paragraph" text. It will match paragraphs at the beginning and end of the text (with no linefeed delimiter). In the replace operation, the paragraphs are separated by two newlines and have no newlines after the <p> and before the </p> (my style preference - change as required). Also, this script will ignore inline tags within a paragraph.
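The extra array element that the script skips with $i++ can be seen in a small standalone sketch. The pattern and sample text below are simplified stand-ins chosen for illustration (assumptions, not the script's own $pattern1):

```php
<?php
// Simplified, hypothetical pattern: group 1 = whole tagged section,
// group 2 = tag name (needed for the \2 backreference in the closing tag).
$pattern = '%(<(\w+)[^>]*>.*?</\2>)%s';
$text    = "before\n<p class=\"x\">kept</p>\nafter";

$parts = preg_split($pattern, $text, -1,
                    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

// Each delimiter contributes one element per capture group, so the
// tag name "p" shows up right after the tagged section itself.
print_r($parts);
// $parts[0] = "before\n"
// $parts[1] = '<p class="x">kept</p>'
// $parts[2] = 'p'
// $parts[3] = "\nafter"
```

This is why the loop has to step over one additional element each time it recognizes a tagged section; otherwise the bare tag name would be treated as an untagged paragraph.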

Hope this helps.

Edit 2009-08-26 15:32 MDT: Removed unnecessary 'i' regex flag. Minor typo corrections.
Last edited by ridgerunner on Wed Aug 26, 2009 4:36 pm, edited 2 times in total.
carlos123
Forum Newbie
Posts: 5
Joined: Sun Apr 12, 2009 10:26 pm

Re: Adding HTML paragraph tags to stand-alone text paragraphs?

Post by carlos123 »

It's a disappointment to know that there is no simple way to achieve what I want.

The more processing I have to do to achieve the end result the more time it will take for pages to load.

What I am trying to achieve is to let my web development clients change the text on each web site page I develop for them by simply editing text files with bare minimal HTML in place (such as only UL and paragraphs with CSS classes). The paragraph tags will be put in automatically for them when my PHP scripts import their revised text.

No CMS confusion. No fuss. Some HTML learning is required, but that is better than having to learn the likes of Wordpress or Drupal to create and modify simple web sites. And what you learn of HTML is relevant to all future web sites, whereas time spent learning Wordpress, for example, is only relevant to Wordpress.

Anyway I will play with your code later. It does look impressive. Thanks for sharing it with me.

If it works it will be what I need.

I'm still wondering if there is not some simpler way, but since I can't figure out what that simpler way might be, I may have to do the extra processing you indicate to achieve what I want for my clients. Hopefully it won't add significant and noticeable load time to web pages.
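One way to check the load-time worry rather than guess is to time the call directly. This is only a rough sketch: process_text() is a hypothetical stand-in for whichever paragraph function ends up being used, and the sample text is generated just for the measurement.

```php
<?php
// Hypothetical stand-in for the real paragraph function being measured.
function process_text($text)
{
    // Wrap every non-empty line in <p>...</p> (placeholder logic only).
    return preg_replace('/^(.+)$/m', '<p>$1</p>', $text);
}

// Generated sample: 1000 short paragraphs separated by blank lines.
$text = str_repeat("A line of sample text.\n\n", 1000);

$start     = microtime(true);
$out       = process_text($text);
$elapsedMs = (microtime(true) - $start) * 1000;

printf("Processed %d bytes in %.3f ms\n", strlen($text), $elapsedMs);
```

Swapping the real function in for process_text() and feeding it an actual page's text file would show whether the extra processing is noticeable at all.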

Thanks again.

Carlos
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: Adding HTML paragraph tags to stand-alone text paragraphs?

Post by ridgerunner »

Note that I've edited the script quite a bit since originally posting it. Be sure to grab the latest version (the one posted there now).

Cheers!
carlos123
Forum Newbie
Posts: 5
Joined: Sun Apr 12, 2009 10:26 pm

Re: Adding HTML paragraph tags to stand-alone text paragraphs?

Post by carlos123 »

Thanks for the heads up ridgerunner. Much appreciated. I will download it and play with it later when I have more time. Got to make calls today to hopefully get a few new clients :).

Carlos
carlos123
Forum Newbie
Posts: 5
Joined: Sun Apr 12, 2009 10:26 pm

Re: Adding HTML paragraph tags to stand-alone text paragraphs?

Post by carlos123 »

I finally had a chance to download and play with your code ridgerunner.

It works beautifully (aside from some minor and probably easily fixed extra linefeeds inserted here and there). Amazing!

How long did it take you to figure this code out ridgerunner?

I will need at least a good hour, or even longer, of solid study of your code to fully understand it. I mean, I understand the gist of it and most of what you used in the regular expressions, but I don't think I could ever have come up with it in its totality on my own.

My hat off to you! Thanks very much!

Carlos
Ti Creative
Forum Newbie
Posts: 2
Joined: Thu Mar 18, 2010 3:32 pm

Re: Adding HTML paragraph tags to stand-alone text paragraphs?

Post by Ti Creative »

I love this! It was exactly what I needed when I needed it.

I made a couple of small mods so I could pass classes to the p tags

Code: Select all

function add_paragraph_tags($text, $class=NULL)
Then, towards the end around line 174:

Code: Select all

        { // this is an untagged section. Add paragraph tags around bare paragraphs
            // add class designation if set
            $p_class = $class ? ' class="'.$class.'"' : ''; // empty by default so $p_class is always defined
            $returntext .= preg_replace($pattern2, "\n<p".$p_class.">$1</p>\n", $sections[$i]);
        }
I'm going to look at adding a style option as well. Thanks for the code!!
Ti Creative
Forum Newbie
Posts: 2
Joined: Thu Mar 18, 2010 3:32 pm

Re: Adding HTML paragraph tags to stand-alone text paragraphs?

Post by Ti Creative »

That was easy enough.

Code: Select all

function add_paragraph_tags($text, $class=NULL, $style=NULL)

Code: Select all

        { // this is an untagged section. Add paragraph tags around bare paragraphs
            // add class and style designations if set (empty by default so both are always defined)
            $p_class = $class ? ' class="'.$class.'"' : '';
            $p_style = $style ? ' style="'.$style.'"' : '';
            $returntext .= preg_replace($pattern2, "\n<p".$p_class.$p_style.">$1</p>\n", $sections[$i]);
        }
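If the class or style values can come from anywhere other than your own code, it may be worth escaping them before they go into the tag. A minimal sketch, assuming a hypothetical helper named build_p_attributes() (not part of the script above):

```php
<?php
// Hypothetical helper: builds the attribute string for the <p> tag.
// Returns '' when nothing is set, so it can always be concatenated safely.
function build_p_attributes($class = null, $style = null)
{
    $attrs = '';
    if ($class !== null && $class !== '') {
        $attrs .= ' class="' . htmlspecialchars($class, ENT_QUOTES, 'UTF-8') . '"';
    }
    if ($style !== null && $style !== '') {
        $attrs .= ' style="' . htmlspecialchars($style, ENT_QUOTES, 'UTF-8') . '"';
    }
    return $attrs;
}

// Example: the result would replace $p_class.$p_style in the replacement string.
echo '<p' . build_p_attributes('intro', 'color: red') . '>Hello</p>' . PHP_EOL;
```

Keeping the attribute building in one place also means that adding another attribute later is a matter of one more parameter.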