Chapter 7 Exercises

Exercise 4: String Replacement with Regular Expressions — Part A (Boldface and Italics, Paragraphs and Line Breaks, Hyperlinks)

You can easily detect the presence of tags in a given text string using preg_match with the regular expression syntax you have just learned. However, what you need to do is pinpoint those tags and replace them with appropriate HTML tags. To achieve this, let’s look at preg_replace, another regular expression function offered by PHP.

preg_replace, like preg_match, accepts a regular expression and a string of text and attempts to match the regular expression in the string. In addition, preg_replace takes a second string of text and replaces every match of the regular expression with that string.

The syntax for preg_replace is as follows:
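As a minimal sketch of the call shape, the sample string and pattern below are made-up stand-ins for oldString and regExp; only the argument order (pattern, replacement, subject) is the point:

```php
<?php
// General shape: $newString = preg_replace(regExp, replaceWith, oldString);
$oldString = 'PHP is fun. php is popular.';   // hypothetical sample text
$newString = preg_replace('/php/i', 'Perl', $oldString);

echo $newString; // Perl is fun. Perl is popular.
```

Every match is replaced, not just the first one.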

Here, regExp is the regular expression and replaceWith is the string that will replace matches to regExp in oldString. The function returns the new string with all the replacements made. In the above, this newly-generated string is stored in $newString.

You are now ready to build your custom markup language.

Boldface and Italic Text

Let's start by implementing tags that create boldface and italic text. Suppose you want [B] to begin bold text and [/B] to end bold text. Obviously, you must replace [B] with <strong> and [/B] with </strong>. Achieving this is a simple application of preg_replace:
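One way this could look, assuming the joke text is held in a variable named $text (a name chosen here for illustration):

```php
<?php
// $text is a hypothetical variable holding the joke text.
$text = 'A [B]bold[/B] word and a [b]second[/b] one.';

$text = preg_replace('/\[b]/i', '<strong>', $text);     // opening tag
$text = preg_replace('/\[\/b]/i', '</strong>', $text);  // closing tag

echo $text; // A <strong>bold</strong> word and a <strong>second</strong> one.
```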

Notice that, because an opening square bracket ([) normally indicates the start of a set of acceptable characters in a regular expression, we place a backslash before it in order to remove its special meaning.

Similarly, we must escape the forward slash in the [/b] tag with a backslash, to prevent it from being mistaken for the delimiter that marks the end of the regular expression.

Without a matching [, the ] loses its special meaning, so it is unnecessary to escape it, although you could place a backslash in front of it as well if you wanted to be thorough.

Also notice that, since we’re using the i modifier on each of the two regular expressions to make them case insensitive, both [B] and [b] (as well as [/B] and [/b]) will work as tags in our custom markup language.

Italic text can be achieved in the same way:
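A sketch along the same lines, assuming [I] and [/I] as the tag names and <em> as the HTML element to produce:

```php
<?php
// $text is a hypothetical variable holding the joke text.
$text = 'An [I]italic[/I] word.';

$text = preg_replace('/\[i]/i', '<em>', $text);     // opening tag
$text = preg_replace('/\[\/i]/i', '</em>', $text);  // closing tag

echo $text; // An <em>italic</em> word.
```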


Now, go ahead and create the code to implement bold and italic text in your jokes.php script (Script 7-5a).

Script 7-5a: jokes.php (excerpt)
Important: This particular code will not be used in your complete jokes.php script. You will keep refining the bold and italic code until it reaches its final form under Nested Tags (Part B, Script 7-5f).


Paragraphs and Line Breaks

While you could create tags for paragraphs just as you did for boldface and italicized text above, there is a simpler approach. Because end users will type the content into a form field that allows them to format text using the Enter key, we will take a single new line (\n) to indicate a line break (<br />) and a double new line to indicate a new paragraph (</p><p>).

You can represent a new line character in a regular expression as \n. Other whitespace characters you can write this way include a carriage return (\r) and a tab space (\t).

Exactly which characters are inserted into text when the user hits Enter depends on the operating system in use. In general, Windows computers represent a line break as a carriage return/new line pair (\r\n), whereas older Mac computers represent it as a single carriage return character (\r). Only recent Macs and Linux computers use a single new line character (\n) to indicate a new line.

To deal with these different line break styles, any of which may be submitted by the browser, we must do some conversion:
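A possible form of this conversion is sketched below; the sample string is invented to show all three line break styles at once:

```php
<?php
// Hypothetical input mixing Windows, old Mac, and Unix line breaks.
$text = "Windows\r\nOld Mac\rUnix\n";

$text = preg_replace('/\r\n/', "\n", $text);  // Windows pairs first
$text = preg_replace('/\r/', "\n", $text);    // then any lone carriage returns

var_dump($text === "Windows\nOld Mac\nUnix\n"); // bool(true)
```

The order matters: converting lone \r characters first would turn each Windows \r\n pair into two new lines.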

With our line breaks all converted to new line characters, we can convert them to paragraph breaks (when they occur in pairs) and line breaks (when they occur alone):
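As a sketch, assuming $text already contains only \n line breaks:

```php
<?php
// Hypothetical joke text with one line break and one paragraph break.
$text = "First paragraph,\nsecond line.\n\nSecond paragraph.";

// Paragraphs: handle the pairs first, and wrap the whole text in <p>...</p>.
$text = '<p>' . preg_replace('/\n\n/', '</p><p>', $text) . '</p>';
// Line breaks: whatever single new lines remain.
$text = preg_replace('/\n/', '<br />', $text);

echo $text;
// <p>First paragraph,<br />second line.</p><p>Second paragraph.</p>
```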


Now, go ahead and create the code in your jokes.php script (Script 7-5b) that uses preg_replace to normalize the line breaks from different operating systems and to convert them into paragraphs and line breaks.

Script 7-5b: jokes.php (excerpt)
Important: This particular code will not be used in your complete jokes.php script. We will be using a different function (str_replace), which does not rely on regular expressions, in an upcoming exercise (Script 7-5c).


Note: The <p> and </p> tags surrounding the joke text are a deliberate addition. Because any of your jokes may contain paragraph breaks, you need to ensure the joke text is output within the context of a paragraph to begin with.

This code does the trick. The line breaks in the text will now become the natural line and paragraph breaks expected by the end user, who has not had to learn any custom tags to create this simple formatting.


We’ve reviewed differences in the handling of line breaks between operating systems. But did you realize that the type of line breaks used can also vary between software programs on the same computer? If you have ever opened a text file in Notepad only to find all the line breaks missing, then you have experienced this frustration firsthand! Advanced text editors used by programmers usually let you specify the type of line breaks to use when saving a text file.


It turns out that there is a simpler way to achieve the same result as above for different operating systems and paragraph and line breaks. There is no need to use regular expressions at all! PHP’s str_replace function works a lot like preg_replace, except that it only searches for literal strings instead of regular expression patterns.

The syntax for str_replace is as follows:
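A minimal sketch of the call shape, with a made-up sample string; the argument order (search string, replacement, subject) mirrors preg_replace:

```php
<?php
// General shape: $newString = str_replace(searchFor, replaceWith, oldString);
$oldString = 'one fish, two fish';   // hypothetical sample text
$newString = str_replace('fish', 'cat', $oldString);

echo $newString; // one cat, two cat
```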

We can, therefore, rewrite our line-breaking code as follows:
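One possible rewrite is sketched here. It assumes the joke text is in $text and relies on str_replace accepting an array of search strings, which it works through in order:

```php
<?php
// Hypothetical input mixing Windows and old Mac line breaks.
$text = "Line one\r\nLine two\r\rNew paragraph";

// Convert Windows (\r\n) first, then old Mac (\r), to Unix (\n).
$text = str_replace(["\r\n", "\r"], "\n", $text);
// Paragraphs
$text = '<p>' . str_replace("\n\n", '</p><p>', $text) . '</p>';
// Line breaks
$text = str_replace("\n", '<br />', $text);

echo $text; // <p>Line one<br />Line two</p><p>New paragraph</p>
```

Note that the search strings are now double-quoted: str_replace looks for the actual carriage return and new line characters, so PHP itself has to produce them.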

str_replace is much more efficient than preg_replace, because there is no need for it to interpret your search string for regular expression codes. Whenever str_replace (or str_ireplace, if you need a case-insensitive search) can do the job, you should use it instead of preg_replace.


Now, go ahead and create new code using str_replace in your jokes.php script (Script 7-5c) to normalize the line breaks from different operating systems and to convert them into paragraphs and line breaks.

Script 7-5c: jokes.php (excerpt)
Note: This excerpt forms part of the complete jokes.php script.


    While this code looks more complicated than the original version with preg_replace, str_replace is much more efficient, because it does not need to interpret your search string for regular expression codes. Whenever str_replace (or str_ireplace, if you need a case-insensitive search) can do the job, you should use it instead of preg_replace.

    Important: You may be tempted to go back and rewrite the code for processing [B] and [I] tags with str_replace, but feel no need to take it further. The completed script calls for use of preg_replace, and this will be sufficient.

    For more information about the intricacies of str_replace, refer to the PHP manual.

     


    Regular Expressions in Double Quoted Strings

    All of the regular expressions we have seen so far in this chapter have been expressed as single-quoted PHP strings. The automatic variable substitution provided by double-quoted PHP strings is sometimes more convenient, but double-quoted strings can cause headaches when used with regular expressions.

    Double-quoted PHP strings and regular expressions share a number of special character escape codes. "\n" is a PHP string containing a new line character.

    Likewise, /\n/ is a regular expression that will match any string containing a new line character. We can represent this regular expression as a single-quoted PHP string ('/\n/'), and all is well, because the code \n has no special meaning in a single-quoted PHP string.

    If we were to use a double-quoted string to represent this regular expression, we’d have to write "/\\n/", with a double backslash. The double backslash tells PHP to include an actual backslash in the string, rather than combining it with the n that follows it to represent a new line character. This string will therefore generate the desired regular expression, /\n/.

    Because of the added complexity it introduces, it is best to avoid using double-quoted strings when writing regular expressions. Note, however, that we have used double quotes for the replacement strings ("\n") passed as the second parameter to preg_replace in the paragraph and line break code above. In this case, we actually do want to create a string containing a new line character, so a double-quoted string does the job perfectly.
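The difference between the two quoting styles can be checked directly. In this short sketch the variable names are chosen purely for illustration:

```php
<?php
$single = '/\n/';    // backslash and n stay as two separate characters
$double = "/\\n/";   // \\ collapses to one backslash: the same four characters
$newline = "\n";     // a single character: an actual new line

var_dump($single === $double);  // bool(true)
var_dump(strlen($newline));     // int(1)

// Pattern in single quotes, replacement in double quotes.
$converted = preg_replace('/\r\n/', "\n", "one\r\ntwo");
var_dump($converted === "one\ntwo"); // bool(true)
```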


    Hyperlinks

    While supporting the inclusion of hyperlinks in the text of jokes may seem unnecessary, this feature makes plenty of sense in other applications. Hyperlinks are a little more complicated than the simple conversion of a fixed code fragment into an HTML tag. We need to be able to output a URL as well as the text that should appear as the link.

    Another feature of preg_replace comes into play here. If you surround a portion of the regular expression with parentheses, you can capture the corresponding portion of the matched text and use it in the replacement string. For this, you will use the code $n, where n is 1 for the first parenthesized portion of the regular expression, 2 for the second, and so on, up to 99 for the 99th.

    Consider this example:
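A small sketch that fits the description that follows, assuming the input word is banana and the two captured portions are swapped in the replacement:

```php
<?php
$text = 'banana';
// (.*) captures "ba" as $1; (nana) captures "nana" as $2.
$text = preg_replace('/(.*)(nana)/', '$2$1', $text);

echo $text; // nanaba
```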

    In the above, $1 is replaced with ba in the replacement string, which corresponds to (.*) (zero or more non-new line characters) in the regular expression. $2 is replaced by nana, which corresponds to (nana) in the regular expression.

    You can use the same principle to create your hyperlinks.

    Let's begin with a simple form of link, where the text of the link is the same as the URL. We want to support this syntax:

    The corresponding HTML code that we want to output is as follows:

    First, we need a regular expression that will match links of this form. The regular expression is as follows:
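The sketch below shows one pattern consistent with the breakdown in this section. Two things are assumptions: the input syntax [URL]http://www.example.com/[/URL], and the exact character list, which is taken here from the unreserved and reserved URL characters of RFC 3986. The pattern is stored in a hypothetical $pattern variable and checked with preg_match:

```php
<?php
// Single-quoted PHP string, so the ' inside the character list is written \'.
$pattern = '/\[url][-a-z0-9._~:\/?#@!$&\'()*+,;=%]+\[\/url]/i';

// Assumed input syntax: the URL sits between [URL] and [/URL].
var_dump(preg_match($pattern, '[URL]http://www.example.com/[/URL]')); // int(1)
var_dump(preg_match($pattern, '[URL]http://www.example.com/'));       // int(0)
```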

    This is a rather complicated regular expression. You can see how regular expressions have gained a reputation for being indecipherable! Let’s break this down:

    As with all of our regular expressions, we choose to mark its beginning with a forward slash (/).

    Next, \[url] matches the opening [URL] tag. Since square brackets have a special meaning in regular expressions, we must escape the opening square bracket with a backslash to have it interpreted literally.

    The square-bracketed portion that follows will match any URL. It will also match some strings that are invalid URLs, but it is close enough for our purposes. If you are especially intrigued by regular expressions, you may want to check out RFC 3986, the official standard for URLs.

    The square brackets contain a list of characters that may appear in a URL, which is followed by a + to indicate that one or more of these acceptable characters must be present.

    Within a square-bracketed list of characters, many of the characters that normally have a special meaning within regular expressions lose that meaning. ., ?, +, *, (, and ) are all listed here without the need to be escaped by backslashes. The only character that does need to be escaped in this list is the forward slash (/), which must be written as \/ to prevent it being mistaken for the end-of-regular-expression delimiter.

    Note also that to include the hyphen (-) in the list of characters, you have to list it first. Otherwise, it would have been taken to indicate a range of characters (as in a-z and 0-9).

    After the URL, \[\/url] matches the closing [/URL] tag. Both the opening square bracket and the slash must be escaped with backslashes.

    We mark the end of the regular expression with a forward slash, followed by the case-insensitivity flag, i (/i).

    Now, to output the link, we will need to capture the URL and output it both as the href attribute of the <a> tag, and as the text of the link. To capture the URL, we surround the corresponding portion of our regular expression with parentheses:

    We can, therefore, convert the link with the following PHP code:
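As a sketch, assuming the [URL]...[/URL] input syntax and a character list drawn from RFC 3986 (both assumptions, as is the $text variable name):

```php
<?php
// Hypothetical joke text containing a simple link.
$text = 'Visit [URL]http://www.example.com/[/URL] today.';

$text = preg_replace(
    '/\[url]([-a-z0-9._~:\/?#@!$&\'()*+,;=%]+)\[\/url]/i',  // (...) captures the URL
    '<a href="$1">$1</a>',                                   // $1 is used twice
    $text
);

echo $text;
// Visit <a href="http://www.example.com/">http://www.example.com/</a> today.
```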

    As you can see, $1 is used twice in the replacement string to substitute the captured URL in both places.

    Note that because we are expressing our regular expression as a single-quoted PHP string, we have to escape the single quote that appears in the list of acceptable characters with a backslash.

    You also need to support hyperlinks whose link text differs from the URL. Such a link will look like this:

    Here is the regular expression for this form of link:

    Quite a mess, isn't it?

    Squint at it for a little while, and see if you can figure out how it works. Grab your pen and break it into parts if you need to. If you have a highlighter pen handy, you might use it to highlight the two pairs of parentheses (()) used to capture portions of the matched string—the link URL ($1) and the link text ($2).

    This expression describes the link text as one or more characters, none of which is an opening square bracket ([^[]+).

    Here is how to use this regular expression to perform the desired substitution:
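A sketch of the substitution, assuming the link is written as [URL=address]link text[/URL]. That syntax is an assumption made here for illustration, as are the character list and the $text variable:

```php
<?php
// Hypothetical joke text; the [URL=...] form is an assumed syntax.
$text = 'Check out [URL=http://www.php.net/]PHP[/URL].';

$text = preg_replace(
    '/\[url=([-a-z0-9._~:\/?#@!$&\'()*+,;=%]+)]([^[]+)\[\/url]/i',
    '<a href="$1">$2</a>',   // $1 is the URL, $2 is the link text
    $text
);

echo $text; // Check out <a href="http://www.php.net/">PHP</a>.
```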


    Now, go ahead and create the code to handle both types of hyperlinks in your jokes.php script (Script 7-5d).

    Script 7-5d: jokes.php (excerpt)
    Important: This particular code will not be used in your complete jokes.php script. You will keep refining the hyperlink code until it reaches its final form under Nested Tags (Part B, Script 7-5g).


     

    Exercise 4: String Replacement with Regular Expressions — Part B (Boldface and Italics, Hyperlinks, cont'd)

    Matching Tags

    A nice side-effect of the regular expressions we developed to read hyperlinks is that they will only find matched pairs of [URL] and [/URL] tags. A [URL] tag missing its [/URL], or vice versa, will go undetected and will appear unchanged in the finished document, allowing the person updating the site to spot the error and fix it.

    In contrast, the PHP code we developed for boldface and italic text in the section above entitled Boldface and Italic Text will convert unmatched [B] and [I] tags into unmatched HTML tags! This can lead to ugly situations in which, for example, the entire text of a joke starting from an unmatched tag will be displayed in bold—possibly even spilling into subsequent content on the page.

    We can rewrite the code for bold and italic text in the same style we used for hyperlinks. This solves the problem by processing only matched pairs of tags:
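One way to write this, reusing the "no opening square bracket" idea from the link text pattern; the sample text deliberately includes an unmatched [I] tag:

```php
<?php
// Hypothetical joke text: one matched pair and one unmatched tag.
$text = 'A [B]matched[/B] pair and an [I]unmatched tag.';

$text = preg_replace('/\[b]([^[]+)\[\/b]/i', '<strong>$1</strong>', $text);
$text = preg_replace('/\[i]([^[]+)\[\/i]/i', '<em>$1</em>', $text);

echo $text; // A <strong>matched</strong> pair and an [I]unmatched tag.
```

The unmatched [I] is left exactly as typed, so the mistake stays visible instead of turning into a stray <em>.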


    Now, go ahead and rewrite your jokes.php code for bold and italic text in the same style as you used for hyperlinks (Script 7-5e).

    Script 7-5e: jokes.php (excerpt)
    Important: This particular code will not be used in your complete jokes.php script. You will keep refining the bold and italic code until it reaches its final form under Nested Tags (Part B, Script 7-5f).


    We still have some more work to do, however—not only on your code for the handling of bold and italics but also for hyperlinks.

    Nested Tags

    One weakness of these regular expressions is that they represent the content between the tags as a series of characters that lack an opening square bracket ([^\[]+). As a result, nested tags (tags within tags) will not work correctly with this code.

    Ideally, we would like to be able to tell the regular expression to capture characters following the opening tag until it reaches a matching closing tag. Unfortunately, the regular expression symbols + (one or more) and * (zero or more) are what we call greedy, which means they will match as many characters as they can. Consider this example:

    Now, if we leave unrestricted the range of characters that can appear between the opening and closing tags, we might come up with a regular expression like this one:

    Nice and simple, right? Unfortunately, because the + is greedy, the regular expression will match only one pair of tags in the above example—and not the pair you might expect! Here is the result:
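The behavior can be reproduced with a short sketch. The sample sentence is an assumed example with two separate pairs of bold tags:

```php
<?php
// Assumed example text with two separate pairs of [B] tags.
$text = 'This text contains [B]two[/B] bold [B]words[/B]!';

// Greedy: (.+) grabs as much as it can between a [B] and a [/B].
$result = preg_replace('/\[b](.+)\[\/b]/i', '<strong>$1</strong>', $text);

echo $result;
// This text contains <strong>two[/B] bold [B]words</strong>!
```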

    As you can see, the greedy + plowed right through the first closing tag and the second opening tag to find the second closing tag in its attempt to match as many characters as possible. What we need in order to support nested tags are non-greedy versions of + and *.

    Thankfully, regular expressions do provide non-greedy variants of these control characters! The non-greedy version of + is +?, and the non-greedy version of * is *?. With these, you can produce improved versions of your code for processing [B] and [I] tags:
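A sketch of what the improved code might look like, with a sample text that includes both separate pairs and a nested pair:

```php
<?php
// Hypothetical text: two separate bold pairs and one bold pair nested in italics.
$text = 'This text contains [B]two[/B] bold [B]words[/B], one [I][B]nested[/B][/I].';

// Non-greedy: (.+?) stops at the first closing tag it can reach.
$text = preg_replace('/\[b](.+?)\[\/b]/i', '<strong>$1</strong>', $text);
$text = preg_replace('/\[i](.+?)\[\/i]/i', '<em>$1</em>', $text);

echo $text;
// This text contains <strong>two</strong> bold <strong>words</strong>, one <em><strong>nested</strong></em>.
```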

    We can give the same treatment to the hyperlink processing code:
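For hyperlinks, only the link text needs the non-greedy treatment, since the URL is already restricted to a fixed list of characters. The sketch below assumes the [URL]...[/URL] and [URL=address]text[/URL] forms and an RFC 3986 character list:

```php
<?php
// Hypothetical text: a link whose text contains another tag.
$text = 'See [URL=http://www.example.com/]the [B]example[/B] site[/URL].';

// Simple form: the URL doubles as the link text (no change needed here).
$text = preg_replace(
    '/\[url]([-a-z0-9._~:\/?#@!$&\'()*+,;=%]+)\[\/url]/i',
    '<a href="$1">$1</a>',
    $text
);
// Form with separate link text: (.+?) replaces ([^[]+).
$text = preg_replace(
    '/\[url=([-a-z0-9._~:\/?#@!$&\'()*+,;=%]+)](.+?)\[\/url]/i',
    '<a href="$1">$2</a>',
    $text
);

echo $text;
// See <a href="http://www.example.com/">the [B]example[/B] site</a>.
```

The [B] tags inside the link text survive untouched, ready to be converted by the bold code.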


    Now, go ahead and refine the code to implement the non-greedy +? for boldface and italics (Script 7-5f) and for hyperlinks (Script 7-5g) in your jokes.php script.

    Script 7-5f: jokes.php (excerpt)
    Note: This excerpt forms part of the complete jokes.php script.

    Script 7-5g: jokes.php (excerpt)
    Note: This excerpt forms part of the complete jokes.php script.


    Unknown Tags

    You have now created the code within the jokes.php script to format the joke text.

    As a final step, you need to know how to strip out unknown tags, which could otherwise open up security vulnerabilities. This time you will be creating code for the jokelist.php script.

    Here is the basic syntax along with explanations:
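The following is only a rough sketch under a stated assumption: that "unknown tags" means HTML-style tags in angle brackets typed in by a user. The pattern, the $test strings, and the $clean variable are all illustrative and may differ from what your course script uses:

```php
<?php
// Hypothetical test strings; uncomment one at a time to try it.
$test = 'A <script>alert("x")</script> joke';
// $test = 'A <b>bold</b> joke';
// $test = 'A <a href="http://www.example.com/">linked</a> joke';

// Assumption: a tag is "<", then any characters that are not ">", then ">".
$clean = preg_replace('/<[^>]*>/', '', $test);

echo $clean; // A alert("x") joke
```

A pattern like this removes the tags themselves but not the text between them, so it should be treated as one layer of protection rather than a complete defense.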

    Note: The commented $test variables are different "test" strings that could be run through the regex to show that it will strip the tags off the string, regardless of the format. You could uncomment any of them to run the test.

    As a part of your jokelist.php script, go ahead now and create the code for unknown tags (Script 7-5h).

    Script 7-5h: jokelist.php (excerpt)
    Note: This excerpt forms part of the complete jokelist.php script.

