Friday, June 19, 2015

PHP: Exploring Regular Expression Syntax

Using Anchors to Match at Specified Positions

Often we're interested in the position of a pattern within a target string, as much as the pattern itself. For example, say we wanted to make sure that a string started with one or more digits followed by a colon. We might try this:

echo preg_match( "/\d+\:/", "12: The Sting" );  // Displays "1"

However, this expression would also match a string where the digits and colon are somewhere in the middle:

echo preg_match( "/\d+\:/", "Die Hard 2: Die Harder" );  // Displays "1"

How can we make sure that the string only matches if the digits and colon are at the start? The answer is that we can use an anchor (also known as an assertion), as follows:

 echo preg_match( "/^\d+\:/", "12: The Sting" );           // Displays "1"
 echo preg_match( "/^\d+\:/", "Die Hard 2: Die Harder" );  // Displays "0"

The caret (^) symbol specifies that the rest of the pattern will only match at the start of the string. Similarly, we can use the dollar ($) symbol to anchor a pattern to the end of the string:

 echo preg_match( "/\[(G|PG|PG-13|R|NC-17)\]$/", "The Sting [PG]" ); // Displays "1"
 echo preg_match( "/\[(G|PG|PG-13|R|NC-17)\]$/", "[PG] Amadeus" );   // Displays "0"

By combining the two anchors, we can ensure that a string contains only the desired pattern, nothing more:

 echo preg_match( "/^Hello, \w+$/", "Hello, world" );  // Displays "1"
 echo preg_match( "/^Hello, \w+$/", "Hello, world!" ); // Displays "0"

The second match fails because the target string contains a non-word character (!) between the searched-for pattern and the end of the string.

We can use other anchors for more control over our matching. Here's a full list of the anchors we can use within a regular expression:

Anchor Meaning
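^ Matches at the start of the string (and, in multi-line mode, at the start of each line)
$ Matches at the end of the string or just before a final newline (and, in multi-line mode, at the end of each line)
\b Matches at a word boundary (between a \w character and a \W character)
\B Matches anywhere except at a word boundary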
\A Matches at the start of the string
\z Matches at the end of the string
\Z Matches at the end of the string or just before a newline at the end of the string
\G Matches at the starting offset character position, as passed to the preg_match() function

Note: An anchor doesn't itself match any characters; it merely ensures that the pattern appears at a specified point in the target string.
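Of the anchors listed above, \G is the only one that depends on an argument to preg_match(): the optional fifth parameter, which is a byte offset into the target string. The following is a minimal sketch, reusing the "Die Hard 2" string from earlier; the variable names are only illustrative. It also shows that ^ keeps referring to the real start of the string even when an offset is supplied.

```php
<?php
$subject = "Die Hard 2: Die Harder";

// Byte offset 9 is the position of the "2" in $subject.
$unanchored = preg_match('/\d+:/',   $subject, $m1, 0, 0); // scans forward and finds "2:"
$atOffset   = preg_match('/\G\d+:/', $subject, $m2, 0, 9); // must match exactly at offset 9
$notAtStart = preg_match('/\G\d+:/', $subject, $m3, 0, 0); // offset 0 is "D", so no match
$caret      = preg_match('/^\d+:/',  $subject, $m4, 0, 9); // ^ still means the start of the string

echo $unanchored, "\n"; // Displays "1"
echo $atOffset, "\n";   // Displays "1"
echo $notAtStart, "\n"; // Displays "0"
echo $caret, "\n";      // Displays "0"
```

In other words, \G pins the match to wherever the search begins, while an unanchored pattern is free to skip ahead from that point.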

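\A and \z are similar to ^ and $. The difference is that ^ and $ will also match at the beginning and end of a line, respectively, when matching a multi-line string in multi-line mode (the m pattern modifier), whereas \A and \z only ever match at the beginning and end of the target string. Note also that, outside multi-line mode, $ behaves like \Z rather than \z: it will match just before a newline at the very end of the string.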
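To make the difference concrete, here is a small sketch that counts matches with preg_match_all() on a three-line string. The string and variable names are made up for illustration; the m after the closing delimiter switches on multi-line mode.

```php
<?php
// Hypothetical three-line target; each line starts with digits and a colon.
$text = "12: The Sting\n7: Amadeus\n3: Die Hard";

$default   = preg_match_all('/^\d+:/',   $text, $a); // ^ matches only at the start of the string
$multiline = preg_match_all('/^\d+:/m',  $text, $b); // with m, ^ also matches after each newline
$absolute  = preg_match_all('/\A\d+:/m', $text, $c); // \A ignores multi-line mode

echo $default, "\n";   // Displays "1"
echo $multiline, "\n"; // Displays "3"
echo $absolute, "\n";  // Displays "1"
```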

\Z is useful when reading lines from a file that may or may not have a newline character at the end.
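As a rough illustration, suppose two lines have been read from a file, one still carrying its trailing newline and one without (the strings below are invented for the example). \Z accepts both, \z rejects the one with the newline, and a plain $ sides with \Z unless the D modifier is added:

```php
<?php
$withNewline    = "The End\n"; // a line that kept its newline
$withoutNewline = "The End";   // e.g. the last line of a file

$capitalZ = preg_match('/End\Z/', $withNewline);    // \Z allows one trailing newline
$lowerZ   = preg_match('/End\z/', $withNewline);    // \z insists on the very end
$dollar   = preg_match('/End$/',  $withNewline);    // $ behaves like \Z by default
$dollarD  = preg_match('/End$/D', $withNewline);    // with D, $ behaves like \z
$noNl     = preg_match('/End\Z/', $withoutNewline); // no newline is fine too

echo $capitalZ, "\n"; // Displays "1"
echo $lowerZ, "\n";   // Displays "0"
echo $dollar, "\n";   // Displays "1"
echo $dollarD, "\n";  // Displays "0"
echo $noNl, "\n";     // Displays "1"
```

When validating input, this is worth keeping in mind: a pattern ending in $ will accept a string with one extra newline at the end, whereas \z will not.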

\b and \B are handy when searching text for complete words:

echo preg_match( "/over/", "My hovercraft is full of eels" );       // Displays "1"
echo preg_match( "/\bover\b/", "My hovercraft is full of eels" );   // Displays "0"
echo preg_match( "/\bover\b/", "One flew over the cuckoo's nest" ); // Displays "1"

When using \b, the beginning or end of the string is also considered a word boundary:

echo preg_match( "/\bover\b/", "over and under" );  // Displays "1"
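\B works the other way round: it matches only where there is no word boundary, which makes it useful for finding text buried inside a longer word. A short sketch, with "rollover" added as an extra made-up test string:

```php
<?php
// "over" with word characters on both sides (inside "hovercraft")
$inside = preg_match('/\Bover\B/', "My hovercraft is full of eels");

// "over" as a complete word has a boundary on each side, so \B fails
$whole = preg_match('/\Bover\B/', "One flew over the cuckoo's nest");

// \B and \b can be mixed: "over" at the end of a longer word
$suffix = preg_match('/\Bover\b/', "Press to rollover now");

echo $inside, "\n"; // Displays "1"
echo $whole, "\n";  // Displays "0"
echo $suffix, "\n"; // Displays "1"
```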

By using the \b anchor, along with alternatives within a subexpression, it's possible to enhance the earlier "date detection" example further, so that it matches only two- or four-digit years (and not three-digit years):

echo preg_match( "/\b(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)" .
  "\/\d{1,2}\/(\d{2}|\d{4})\b/", "jul/15/2006" );  // Displays "1"

echo preg_match( "/\b(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)" .
  "\/\d{1,2}\/(\d{2}|\d{4})\b/", "jul/15/206" );  // Displays "0"

The last part of the expression reads, "Match either two digits or four digits, followed by a word boundary (or the end of the string)":

(\d{2}|\d{4})\b
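Because \b doesn't consume any characters, the same expression can also pick a date out of a longer sentence. The sketch below uses invented sample sentences and passes a third argument to preg_match() to capture the matched text; the month and year arrive in the first and second subpatterns.

```php
<?php
$pattern = '/\b(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)'
         . '\/\d{1,2}\/(\d{2}|\d{4})\b/';

// The date sits in the middle of the string, with spaces on both sides.
$found = preg_match($pattern, "Released jul/15/2006 worldwide", $m);

echo $found, "\n"; // Displays "1"
echo $m[0], "\n";  // Displays "jul/15/2006"
echo $m[1], "\n";  // Displays "jul"
echo $m[2], "\n";  // Displays "2006"

// A five-digit "year" leaves a digit after both \d{2} and \d{4}, so \b fails.
$tooLong = preg_match($pattern, "Released jul/15/20067 worldwide");

echo $tooLong, "\n"; // Displays "0"
```

Note that the two-digit alternative is tried first; it is the trailing \b that forces the engine to back up and try the four-digit alternative instead.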
