Advanced Regular Expression Tips and Techniques

Tutorial Details
  • Technology: Regular Expressions
  • Difficulty: Advanced

Twice a month, we revisit some of our readers’ favorite posts from throughout the history of Nettuts+.

Regular Expressions are the Swiss Army knife for searching through information for certain patterns. They have a wide arsenal of tools, some of which often go undiscovered or underutilized. Today I will show you some advanced tips for working with regular expressions.


Adding Comments

Sometimes, regular expressions can become complex and unreadable. A regular expression you write today may seem too obscure to you tomorrow even though it was your own work. Much like programming in general, it is a good idea to add comments to improve the readability of regular expressions.

For example, here is something we might use to check for US phone numbers.

preg_match("/^(1[-\s.])?(\()?\d{3}(?(2)\))[-\s.]?\d{3}[-\s.]?\d{4}$/",$number)

It can become much more readable with comments and some extra spacing.

preg_match("/^

			(1[-\s.])?	# optional '1-', '1.' or '1 '
			( \( )?		# optional opening parenthesis
			\d{3}		# the area code
			(?(2) \) )	# if there was opening parenthesis, close it
			[-\s.]?		# followed by '-' or '.' or space
			\d{3}		# first 3 digits
			[-\s.]?		# followed by '-' or '.' or space
			\d{4}		# last 4 digits

			$/x",$number);

Let’s put it within a code segment.

$numbers = array(
"123 555 6789",
"1-(123)-555-6789",
"(123-555-6789",
"(123).555.6789",
"123 55 6789");

foreach ($numbers as $number) {
	echo "$number is ";

	if (preg_match("/^

			(1[-\s.])?	# optional '1-', '1.' or '1 '
			( \( )?		# optional opening parenthesis
			\d{3}		# the area code
			(?(2) \) )	# if there was opening parenthesis, close it
			[-\s.]?		# followed by '-' or '.' or space
			\d{3}		# first 3 digits
			[-\s.]?		# followed by '-' or '.' or space
			\d{4}		# last 4 digits

			$/x",$number)) {

		echo "valid\n";
	} else {
		echo "invalid\n";
	}
}

/* prints

123 555 6789 is valid
1-(123)-555-6789 is valid
(123-555-6789 is invalid
(123).555.6789 is valid
123 55 6789 is invalid

*/

The trick is to use the ‘x’ modifier at the end of the regular expression. It causes whitespace in the pattern to be ignored unless it is escaped or appears inside a character class, so use \s (or an escaped space) wherever you need to match actual whitespace. This makes it easy to add comments, which start with ‘#’ and run to the end of the line.
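One consequence of the ‘x’ modifier is worth a quick sketch: a literal space or ‘#’ in the pattern has to be escaped or placed in a character class, otherwise it is skipped or starts a comment. The pattern and the sample string below are made up for illustration.

```php
<?php
// Hypothetical pattern: find a reference such as "issue #42" in extended mode.
$pattern = '/
    issue   # the literal word
    [ ]     # a literal space (a bare space would be ignored)
    \#      # a literal hash (a bare hash would start a comment)
    (\d+)   # the issue number
/x';

$found = preg_match($pattern, 'Fixed in issue #42', $m);

echo $found ? "issue number: {$m[1]}\n" : "no match\n";
// prints: issue number: 42
```

Also keep the pattern delimiter (here the forward slash) out of the comments, since PHP looks for the closing delimiter before PCRE ever sees the comment.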


Using Callbacks

In PHP preg_replace_callback() can be used to add callback functionality to regular expression replacements.

Sometimes you need to do multiple replacements. If you call preg_replace() or str_replace() for each pattern, the string will be parsed over and over again.

Let’s look at this example, where we have an e-mail template.

$template = "Hello [first_name] [last_name],

Thank you for purchasing [product_name] from [store_name].

The total cost of your purchase was [product_price] plus [ship_price] for shipping.

You can expect your product to arrive in [ship_days_min] to [ship_days_max] business days.

Sincerely,
[store_manager_name]";

// assume $data array has all the replacement data
// such as $data['first_name'] $data['product_price'] etc...

$template = str_replace("[first_name]",$data['first_name'],$template);
$template = str_replace("[last_name]",$data['last_name'],$template);
$template = str_replace("[store_name]",$data['store_name'],$template);
$template = str_replace("[product_name]",$data['product_name'],$template);
$template = str_replace("[product_price]",$data['product_price'],$template);
$template = str_replace("[ship_price]",$data['ship_price'],$template);
$template = str_replace("[ship_days_min]",$data['ship_days_min'],$template);
$template = str_replace("[ship_days_max]",$data['ship_days_max'],$template);
$template = str_replace("[store_manager_name]",$data['store_manager_name'],$template);

// this could be done in a loop too,
// but I wanted to emphasize how many replacements were made

Notice that each replacement has something in common. They are always strings enclosed within square brackets. We can catch them all with a single regular expression, and handle the replacements in a callback function.

So here is the better way of doing this with callbacks:

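// ...

// $data is not visible inside a named function, so use a closure
// that imports it; it runs once for every bracketed placeholder
$my_callback = function ($matches) use ($data) {
	// $matches[1] now contains the string between the brackets

	if (isset($data[$matches[1]])) {
		// return the replacement string
		return $data[$matches[1]];
	} else {
		// unknown placeholder: leave it unchanged
		return $matches[0];
	}
};

// the ungreedy .*? stops at the first closing bracket, so two
// placeholders on the same line are matched separately
$template = preg_replace_callback('/\[(.*?)\]/',$my_callback,$template);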

Now the string in $template is only parsed by the regular expression once.
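For reference, here is a small self-contained sketch of the callback approach. The sample data and the shortened template are invented for this example, and a negated character class is used instead of a dot so that a match can never run past a closing bracket.

```php
<?php
// Sample data; the keys and values are invented for this sketch.
$data = array(
    'first_name' => 'Jane',
    'store_name' => 'Example Store',
);

$template = "Hello [first_name], thank you for shopping at [store_name]. Ref: [order_id]";

// [^\]]+ cannot cross a closing bracket, so each placeholder is
// matched on its own even when several share a line.
$result = preg_replace_callback(
    '/\[([^\]]+)\]/',
    function ($matches) use ($data) {
        $key = $matches[1];
        // Unknown placeholders are left untouched.
        return isset($data[$key]) ? $data[$key] : $matches[0];
    },
    $template
);

echo $result, "\n";
// prints: Hello Jane, thank you for shopping at Example Store. Ref: [order_id]
```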


Greedy vs. Ungreedy

Before I start explaining this concept, I would like to show an example first. Let’s say we are looking for anchor tags in some HTML text:

$html = 'Hello <a href="/world">World!</a>';

if (preg_match_all('/<a.*>.*<\/a>/',$html,$matches)) {

	print_r($matches);

}

The result will be as expected:

/* output:
Array
(
    [0] => Array
        (
            [0] => <a href="/world">World!</a>
        )

)
*/

Let’s change the input and add a second anchor tag:

$html = '<a href="/hello">Hello</a>
<a href="/world">World!</a>';

if (preg_match_all('/<a.*>.*<\/a>/',$html,$matches)) {

	print_r($matches);

}

/* output:
Array
(
    [0] => Array
        (
            [0] => <a href="/hello">Hello</a>
            [1] => <a href="/world">World!</a>

        )

)
*/

Again, it seems to be fine so far. But don’t let this trick you. The only reason it works is that the anchor tags are on separate lines, and by default the dot in PCRE does not match newline characters (more info on: ‘s’ modifier), so a single match cannot span two lines. If we encounter two anchor tags on the same line, it will no longer work as expected:

$html = '<a href="/hello">Hello</a> <a href="/world">World!</a>';

if (preg_match_all('/<a.*>.*<\/a>/',$html,$matches)) {

	print_r($matches);

}

/* output:
Array
(
    [0] => Array
        (
            [0] => <a href="/hello">Hello</a> <a href="/world">World!</a>

        )

)
*/

This time the pattern matches the first opening tag, the last closing tag, and everything in between as a single match, instead of making two separate matches. This is due to the default behavior being “greedy”.

“When greedy, the quantifiers (such as * or +) match as many characters as possible.”

If you add a question mark after the quantifier (.*?) it becomes “ungreedy”:

$html = '<a href="/hello">Hello</a> <a href="/world">World!</a>';

// note the ?'s after the *'s
if (preg_match_all('/<a.*?>.*?<\/a>/',$html,$matches)) {

	print_r($matches);

}

/* output:
Array
(
    [0] => Array
        (
            [0] => <a href="/hello">Hello</a>
            [1] => <a href="/world">World!</a>

        )

)
*/

Now the result is correct. Another way to trigger the ungreedy behavior is to use the U pattern modifier.
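As a rough sketch of the U modifier, the snippet below runs a pattern with plain quantifiers against two anchor tags on one line. The sample markup is an assumption made for this example.

```php
<?php
// Sample markup, assumed for this sketch: two anchors on one line.
$html = '<a href="/hello">Hello</a> <a href="/world">World!</a>';

// With U, a plain * is ungreedy (and *? would become greedy instead).
preg_match_all('/<a.*>.*<\/a>/U', $html, $matches);

print_r($matches[0]);
// prints two separate matches, one per anchor tag
```

Because U flips the meaning of every quantifier in the pattern, the explicit ? form is usually easier for the next reader to follow.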


Lookahead and Lookbehind Assertions

A lookahead assertion checks whether a pattern matches immediately after the current position, without making that text part of the match. This is easier to explain through an example.

The following pattern first matches for ‘foo’, and then it checks to see if it is followed by ‘bar’:

$pattern = '/foo(?=bar)/';

preg_match($pattern,'Hello foo'); // false
preg_match($pattern,'Hello foobar'); // true

It may not seem very useful, as we could have simply checked for ‘foobar’ instead. However, it is also possible to use lookaheads for making negative assertions. The following example matches ‘foo’, only if it is NOT followed by ‘bar’.

$pattern = '/foo(?!bar)/';

preg_match($pattern,'Hello foo'); // true
preg_match($pattern,'Hello foobar'); // false
preg_match($pattern,'Hello foobaz'); // true

Lookbehind assertions work similarly, but they look for patterns before the current match. You may use (?<= for positive assertions, and (?<! for negative assertions.

The following pattern matches ‘bar’ only if it is not preceded by ‘foo’.

$pattern = '/(?<!foo)bar/';

preg_match($pattern,'Hello bar'); // true
preg_match($pattern,'Hello foobar'); // false
preg_match($pattern,'Hello bazbar'); // true
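Lookarounds are zero-width: they test the surrounding text but never become part of the match. The sketch below illustrates this with three small cases; the sample strings are invented for this example.

```php
<?php
// The lookahead is checked but not consumed: only 'foo' is in the match.
preg_match('/foo(?=bar)/', 'Hello foobar', $m);
echo $m[0], "\n"; // prints: foo

// Positive lookbehind: digits that directly follow a dollar sign.
preg_match('/(?<=\$)\d+/', 'Total: $25', $price);
echo $price[0], "\n"; // prints: 25

// Zero-width positions can be replaced too: insert a comma at every
// position that is followed by a multiple of three digits.
$formatted = preg_replace('/\B(?=(\d{3})+(?!\d))/', ',', '1234567');
echo $formatted, "\n"; // prints: 1,234,567
```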

Conditional (If-Then-Else) Patterns

Regular expressions provide the functionality for checking certain conditions. The format is as follows:

(?(condition)true-pattern|false-pattern)

or

(?(condition)true-pattern)

The condition can be a number. In which case it refers to a previously captured subpattern.

For example we can use this to check for opening and closing angle brackets:

$pattern = '/^(<)?[a-z]+(?(1)>)$/';

preg_match($pattern, '<test>'); // true
preg_match($pattern, '<foo'); // false
preg_match($pattern, 'bar>'); // false
preg_match($pattern, 'hello'); // true

In the example above, ‘1’ refers to the subpattern (<), which is also optional since it is followed by a question mark. The pattern requires a closing bracket only if that subpattern actually matched.

The condition can also be an assertion:

// if it begins with 'q', it must begin with 'qu'
// else it must begin with 'f'
$pattern = '/^(?(?=q)qu|f)/';

preg_match($pattern, 'quake'); // true
preg_match($pattern, 'qwerty'); // false
preg_match($pattern, 'foo'); // true
preg_match($pattern, 'bar'); // false
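A conditional can also be combined with a backreference. The following sketch uses a made-up rule: a lowercase word may be wrapped in quotes, but the closing quote must be the same character as the opening one.

```php
<?php
// Hypothetical rule: an optionally quoted lowercase word, where
// (?(1)\1) demands the same quote again only if group 1 matched.
$pattern = '/^(["\'])?[a-z]+(?(1)\1)$/';

$samples = array('"abc"', "'abc'", 'abc', '"abc', '"abc\'');

$results = array();
foreach ($samples as $sample) {
    $results[$sample] = preg_match($pattern, $sample);
    echo $sample, ' => ', $results[$sample] ? 'match' : 'no match', "\n";
}
```

The first three samples match. The last two do not, because an opening quote was captured and the same quote is missing at the end.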

Filtering Patterns

There are various reasons for input filtering when developing web applications. We filter data before inserting it into a database, or outputting it to the browser. Similarly, it is necessary to filter any arbitrary string before including it in a regular expression. PHP provides a function named preg_quote to do the job.

In the following example we use a string that contains a special character (*).

$word = '*world*';

$text = 'Hello *world*!';

preg_match('/'.$word.'/', $text); // causes a warning
// the second argument makes preg_quote escape the delimiter too
preg_match('/'.preg_quote($word, '/').'/', $text); // true

The same thing can also be accomplished by enclosing the string between \Q and \E. Any special character after \Q is treated literally until \E.

$word = '*world*';

$text = 'Hello *world*!';

preg_match('/\Q'.$word.'\E/', $text); // true

However, this second method is not 100% safe, as the string itself can contain \E.
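Here is a brief sketch of why the delimiter argument of preg_quote matters. The search term is a hypothetical user-supplied value that contains both the delimiter and a metacharacter.

```php
<?php
$text = 'Path: a/b*c';

// Hypothetical user-supplied search term containing both the
// delimiter (/) and a regex metacharacter (*).
$word = 'a/b*c';

// Passing the delimiter as the second argument escapes it as well.
$pattern = '/' . preg_quote($word, '/') . '/';

echo $pattern, "\n";                    // prints: /a\/b\*c/
$found = preg_match($pattern, $text);
echo $found, "\n";                      // prints: 1
```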


Non-capturing Subpatterns

Subpatterns, enclosed by parentheses, get captured into an array so that we can use them later if needed. But there is a way to NOT capture them also.

Let’s start with a very simple example:

preg_match('/(f.*)(b.*)/', 'Hello foobar', $matches);

echo "f* => " . $matches[1]; // prints 'f* => foo'
echo "b* => " . $matches[2]; // prints 'b* => bar'

Now let’s make a small change by adding another subpattern (H.*) to the front:

preg_match('/(H.*) (f.*)(b.*)/', 'Hello foobar', $matches);

echo "f* => " . $matches[1]; // prints 'f* => Hello'
echo "b* => " . $matches[2]; // prints 'b* => foo'

The $matches array was changed, which could cause the script to stop working properly, depending on what we do with those variables in the code. Now we have to find every occurrence of the $matches array in the code, and adjust the index number accordingly.

If we are not really interested in the contents of the new subpattern we just added, we can make it ‘non-capturing’ like this:

preg_match('/(?:H.*) (f.*)(b.*)/', 'Hello foobar', $matches);

echo "f* => " . $matches[1]; // prints 'f* => foo'
echo "b* => " . $matches[2]; // prints 'b* => bar'

By adding ‘?:’ at the beginning of the subpattern, we no longer capture it in the $matches array, so the other array values do not get shifted.
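Non-capturing groups are most useful when the parentheses are needed for grouping, for instance around an alternation. The pattern below is an assumed example, not taken from a real project.

```php
<?php
// The group keeps the alternation local to the property name,
// but only the number is worth capturing.
$pattern = '/(?:width|height):\s*(\d+)px/';

preg_match($pattern, 'height: 120px', $matches);

echo $matches[1], "\n";     // prints: 120
echo count($matches), "\n"; // prints: 2 (full match plus one capture)
```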


Named Subpatterns

There is another method for preventing pitfalls like in the previous example. We can actually give names to each subpattern, so that we can reference them later on using those names instead of array index numbers. This is the format: (?P<name>pattern)

We could rewrite the first example in the previous section, like this:

preg_match('/(?P<fstar>f.*)(?P<bstar>b.*)/', 'Hello foobar', $matches);

echo "f* => " . $matches['fstar']; // prints 'f* => foo'
echo "b* => " . $matches['bstar']; // prints 'b* => bar'

Now we can add another subpattern, without disturbing the existing matches in the $matches array:

preg_match('/(?P<hi>H.*) (?P<fstar>f.*)(?P<bstar>b.*)/', 'Hello foobar', $matches);

echo "f* => " . $matches['fstar']; // prints 'f* => foo'
echo "b* => " . $matches['bstar']; // prints 'b* => bar'

echo "h* => " . $matches['hi']; // prints 'h* => Hello'
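A slightly more practical sketch of named subpatterns: pulling the parts out of a date. The input string is invented for this example. PCRE also accepts the shorter (?<name>pattern) form.

```php
<?php
// Hypothetical input: an ISO-style date embedded in a sentence.
$pattern = '/(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})/';

preg_match($pattern, 'Published on 2010-03-15', $m);

echo $m['year'], '/', $m['month'], '/', $m['day'], "\n"; // prints: 2010/03/15

// Each named group is also stored under its numeric index.
echo $m[1], "\n"; // prints: 2010
```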

Don’t Reinvent the Wheel

Perhaps it’s most important to know when NOT to use regular expressions. There are many situations where you can find existing utilities that you can use instead.

Parsing [X]HTML

A poster at Stackoverflow has a brilliant explanation on why we should not use regular expressions to parse [X]HTML.

…dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of corrupt entities…

Joking aside, it is a good idea to take some time and figure out what kind of XML or HTML parsers are available, and how they work. For example, PHP offers multiple extensions related to XML (and HTML).

Example: Getting the second link url in an HTML page

// loadHTML() is an instance method; calling it statically fails in PHP 8
$doc = new DOMDocument();
$doc->loadHTML('
	<html>
	<body>Test
		<a href="http://www.nettuts.com">First link</a>
		<a href="http://net.tutsplus.com">Second link</a>
	</body>
	</html>
');

echo $doc->getElementsByTagName('a')
		->item(1)
		->getAttribute('href');

// prints: http://net.tutsplus.com
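Along the same lines, here is a sketch that collects every link URL instead of a single one. It assumes the DOM extension is available and silences libxml warnings, since real-world markup is rarely perfectly valid.

```php
<?php
$html = '<html><body>
    <a href="http://www.nettuts.com">First link</a>
    <a href="http://net.tutsplus.com">Second link</a>
</body></html>';

$doc = new DOMDocument();

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$urls = array();
foreach ($doc->getElementsByTagName('a') as $link) {
    $urls[] = $link->getAttribute('href');
}

print_r($urls);
```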

Validating Form Input

Again, you can use existing functions to validate user inputs, such as form submissions.

if (!filter_var($_POST['email'], FILTER_VALIDATE_EMAIL)) {

	$errors []= "Please enter a valid e-mail.";
}
// get supported filters
print_r(filter_list());

/* output
Array
(
    [0] => int
    [1] => boolean
    [2] => float
    [3] => validate_regexp
    [4] => validate_url
    [5] => validate_email
    [6] => validate_ip
    [7] => string
    [8] => stripped
    [9] => encoded
    [10] => special_chars
    [11] => unsafe_raw
    [12] => email
    [13] => url
    [14] => number_int
    [15] => number_float
    [16] => magic_quotes
    [17] => callback
)
*/
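Filters can take options as well. As a small sketch, the snippet below validates an integer within a range; the age limits and the input value are assumptions made for this example.

```php
<?php
// Hypothetical form value: an age field that must be between 18 and 120.
$input = '42';

$age = filter_var($input, FILTER_VALIDATE_INT, array(
    'options' => array('min_range' => 18, 'max_range' => 120),
));

if ($age === false) {
    echo "Please enter a valid age.\n";
} else {
    echo "Age: $age\n"; // prints: Age: 42
}
```

On success, filter_var returns the value converted to an integer, so compare against false with === rather than relying on truthiness.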

More info: PHP Data Filtering

Other

Here are some other utilities to keep in mind, before using regular expressions:


Thanks so much for reading!

Discussion 82 Comments

  1. Jason says:

    Good stuff Burak, thank you for this!

  2. Nike says:

    Great tutorial, thanks.

  3. Good right up about regular expressions, I am learning the same features on Perl.

  4. Zoran says:

    Thank you for the tutorial!
    Regular expressions are very powerful tool in every programming language and very handy as well.

    • Davidmoreen says:

      That is true, if you have ever used Perl, regex makes up for a lot of the functionality. Well… maybe not a lot. Point is there a bunch of power within regular expressions/

  5. Mohamed Zahran says:

    I need a pattern to match PHP variables, I forgot how to do it, can anybody help me, please?

  6. Patrick says:

    Yeah, some very useful hints in there, especially conditional and non-capturing Patterns.

    But i think, you could write down some notices and hints for the function preg_replace, because you can use arrays ro replace multiple values in a bunch.

    Btw, one question… I’ve tested preg_replace vs. str_replace. Okay, last one is much more simply to use for simple replaces, but i really couldn’t find any plausible reason (other than the named one) to use str_replace to relace simple parts of a string. I read some articles about the performance benefits when using str_replace vs. preg_replace, but i couldn’t came to the same result. Is this a PHP 4 thing with no effect on PHP 5?

    • There is a huge performance difference between the two functions.

      One of them (str_replace) will just search for a string and replace it when it finds an occurrence.

      The other (preg_replace) will run the string (or regex) as a RegEx even though it’s just a string.

      It often pays off using 2 str_replace in place of a single preg_replace.

      • Patrick says:

        Thanks!

        But, this is it what i mean. I have tested this – replace a simple string with a simple string using str_replace vs. preg replace. I couldn’t see any difference between both functions. I used 10.000 replaces, both functions need nearly the same time to ran through this test. its just a little difference between 0.001 up to 0.003 seconds without any apache/php optimizing. So, it’s not the big performance difference as i expected.

        I really don’t wanna say “its totally irrelevant which function is better”, but i simply was a bit confused about this result.

      • Burak says:
        Author

        Regular expression function will work pretty fast, if you use a literal string. The performance difference should be very small.

        However, if you use preg_replace with just any string, you might encounter issues with special characters, so you have to filter the string first. Unless you already know for sure the match string contains no special characters.

        See the “Filtering Patterns” section of this article.

  7. Rick says:

    Thanks! Really good post! I always find it hard to get my head around these regular expressions but with your article in my archive I have more faith in challenging them! :-) Thanks again!

  8. Pablo Viquez says:

    Good stuff! nice tutorial.

  9. Philo says:

    Great tutorial! :)

  10. Aziz Light says:

    That is a GREAT tutorial! Thanks a lot Burak :)

  11. urlless says:

    Really good tutorial!
    Thank you. Burak

  12. Canyon says:

    Hello Barak Another great tutorial, :)
    For the part ‘using callback’, maybe it will be more fine to use ‘U’ (Ungreedy) in the regex and a very intereesting tip for matching results is use negative pattern

      '#\[([^]]*)\]#' 

    which is fastest
    I’ve seen a site under construct today ;-) Regards.

  13. “When greedy, the quantifiers (such as * or +) match as many character as possible.”

    That is incorrect. The following definitions should better explain quantifiers:

    ? (Zero or one of the preceding element.)
    * (Zero or more of the preceding element.)
    + (One or more of the preceding element.)
    {n} (Exactly n of the preceding element.)
    {n,} (n or more of the preceding element.)
    {m,n} (Between m and n of the preceding element.)

    Also, with non capturing subpatterns, if you don’t want to capture a pattern in the array, I believe you can simply exclude the parenthesis:

    /(H.*) (f.*)(b.*)/

    Becomes:

    /H.* (f.*)(b.*)/

    • Burak says:
      Author

      #1. I am not sure what you mean is incorrect about the greediness. For example, ‘*’ means zero or more. The number will be as high as possible when in ‘greedy’ mode, and it will be as low possible, when in ‘ungreedy’ mode. The example I gave demonstrated this.

      #2. Subpatterns sometimes need to be within parentheses. The example I gave did not have to be. So yeah, they could have been removed. But in many cases they have to be there for grouping purposes. There is a reason the ‘non-capturing’ feature is there; we can’t consider it useless.

      • Hi Burak,

        It was more that the explanation did not provide all the details — I suppose not necessarily that it was really incorrect. Since greedy quantifiers include the latter three examples I gave, and some include a zero result, then saying there is no bounds on the limit I thought needed clarification.

        Brian

  14. RegExp says:

    Great post for PHP regular expresion techniques. Stumbled…

  15. Musa says:

    I think you need to define $data as global within function in “Using Callbacks Section” :p (irrelevant although)
    Nice Post, book marked

  16. Saif Bechan says:

    Nice post. I didn’t know about the callbacks. I have to switch to this method in my template classes. I didn’t even realize I was looping trough my content over and over again.

  17. coool. great tut….!!! i was waiting 4 that tut…….. coool!!!

  18. xRommelx says:

    really good stuff

  19. Khalil says:

    Great Tips… waiting for CodeIgniter Tuts ;)

  20. mehdy says:

    Nice clean and useful tutorial, thank you so much Burak Guzel
    can you introduce an official reference for Regular Expressions Specially to use in preg_match ?
    thanks again

  21. Shay Falador says:

    Great article!
    Just the right amount of information and examples :)

  22. Deoxys says:

    Adding Comments:

    Doesn’t it have to be

    # optional '1-', '1.' or '1 '

    ?

  23. Deoxys says:

    And what happend to the 4th line of the Conditional Patterns example?
    That looks a bit wrong…

  24. Lindrian says:

    Definitely can’t agree on saying this is an “Advanced” tutorial, but it will serve rookies well!

    • I wouldn’t call myself a regex master but would definitely agree with you. This seems more like a get started with Regex then advanced regex tutorial to me. However I would love to see a true advanced regex tutorial. Common pitfalls would be a nice addition.

  25. Christophor S. says:

    As a PHP developer regular expressions is something that I hate dealing with, this was a great tut. Thanks.

  26. Nipun says:

    Nice tutorial… waiting Advanced Regular Expression Tips and Techniques part-2….hope it will soon…

  27. Zopieux says:

    There is a funny mess in the code following “For example we can use this to check for opening and closing angle brackets”. However this article is useful, thanks.

  28. Ron says:

    Great Tut !!!

    Thanks.

  29. wasabi says:

    I found this post great. As my complications with coding, arise from
    organization and streamlining of the code; not with understanding
    the syntax and behavior.

  30. Irene says:

    Great!
    but It is difficult T_T*

  31. RonnieSan says:

    The opposite of “greedy” is known as “lazy.” It’s lazy because it finds the first match then gives up.

  32. Lindrian says:

    When you talk about lookarounds you forget to mention to mention the single
    most important thing about them: they are zero-width, often called lookaround zero-width assertion. Since this is an advanced tutorial you should explain this and give decent
    smples.

  33. Arturo says:

    Not too good as a tut, but good as a “Hello! there’s more to Regex”. This is exactly the regex stuff I was missing in my brain. Thx.

  34. Josh says:

    Good stuff, i didn’t know about filter_var function, thanks!

  35. I like the embedded comments idea, but why not use string concatenation and multiple rows instead and have the comments (in the respective programming language’s style) outside the regular expression string? That way you don’t slow down the regex parser (unless those are precompiled regex when building – precompiling at launch time still takes some small penalty at startup)

  36. TheAL says:

    Never been a fan of regular expressions, but ya gotta learn them. And they can be super helpful, however much they may sometimes look like garbles of nonsense. I first dabbled with them years back when I picked up Perl. They seem to be just as rampant in PHP. So thanks for this!

  37. Guiltouf says:

    Powerful, but complicated syntax, I think regular expressions would be always the dark side of the force for me ^^

    So thanks for this tip page ! After reading, I was just a little less dumb than before, but what a step !

  38. Thank you very much, great article, it can help to understand Regular Expressions in many languages, so they have a common behavior and common syntax. Some languages like Perl just have seasoned Regular Expressions, but the core of them is the same :)

  39. Henrik says:

    Great! I’ve always been pretty confident about regex – but I learnt it on my own, so there are a few things that I knew how to use, with little explanation on -why- they worked.

    For the life of me (and google :P) I simply couldn’t get as straight-forward an explanation as you’ve writen here (particularly with look ahead/behind assertions).

    Thanks!

  40. Awesome list , thank you .

  41. Barbarossa says:

    Very good tips, thanks a lot.

    The conditional patterns interested me the most. However, it seems not to in with javascript. Do you (or does someone) know if there’s any equivalent for (?(condition)true|false) in javascript ?

  42. Jorge says:

    Great post!

  43. twohawks says:

    Awesome article. THank you so much ;^)

  44. rob says:

    A good adjunct to this tutorial would be a graph on the expense/time intensity of each regular expression.
    For instance negative assertions are quite expensive.

  45. This is great and has been very helpful! Thanks for sharing.

  46. Kholid says:

    Why not just using matches[0] ? since you don’t have a subpattern?

  47. rob says:

    Great article. Great tips. Thanks!

  48. Guilherme Ventura says:

    Nice article, Regular Expressions are complicated before you really learn the concept.

    Good job, and thanks!
