8 Regular Expressions You Should Know

Regular expressions are a language of their own. When you learn a new programming language, they’re this little sub-language that makes no sense at first glance. Many times you have to read another tutorial, article, or book just to understand the “simple” pattern described. Today, we’ll review eight regular expressions that you should know for your next coding project.


Background Info on Regular Expressions

This is what Wikipedia has to say about them:

In computing, regular expressions provide a concise and flexible means for identifying strings of text of interest, such as particular characters, words, or patterns of characters. Regular expressions (abbreviated as regex or regexp, with plural forms regexes, regexps, or regexen) are written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.

Now, that doesn’t really tell me much about the actual patterns. The regexes I’ll be going over today contain characters such as \w, \s, \1, and many others that represent something totally different from what they look like.

If you’d like to learn a little about regular expressions before you continue reading this article, I’d suggest watching the Regular Expressions for Dummies screencast series.

The eight regular expressions we’ll be going over today will allow you to match a(n): username, password, email, hex value (like #fff or #000), slug, URL, IP address, and an HTML tag. As the list goes down, the regular expressions get more and more confusing. The pictures for each regex in the beginning are easy to follow, but the last four are more easily understood by reading the explanation.

The key thing to remember about regular expressions is that they are almost read forwards and backwards at the same time. This sentence will make more sense when we talk about matching HTML tags.

Note: The delimiters used in the regular expressions are forward slashes, “/”. Each pattern begins and ends with a delimiter. If a forward slash appears in a regex, we must escape it with a backslash: “\/”.


1. Matching a Username

Matching a username

Pattern:

/^[a-z0-9_-]{3,16}$/

Description:

We begin by telling the parser to find the beginning of the string (^), followed by any lowercase letter (a-z), number (0-9), an underscore, or a hyphen. Next, {3,16} makes sure that there are at least 3 of those characters, but no more than 16. Finally, we want the end of the string ($).

String that matches:

my-us3r_n4m3

String that doesn’t match:

th1s1s-wayt00_l0ngt0beausername (too long)
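
Here is a minimal sketch of how this pattern might be used in a program. Python and the helper name is_valid_username are my own choices for illustration; nothing in this article depends on them. Python writes the pattern as a raw string, so the slash delimiters are dropped.

```python
import re

# The username pattern from above, without the / delimiters.
USERNAME_RE = re.compile(r'^[a-z0-9_-]{3,16}$')

def is_valid_username(name):
    """Return True if the whole string satisfies the username pattern."""
    # fullmatch() insists that the match covers the entire string,
    # so a trailing newline is rejected as well.
    return USERNAME_RE.fullmatch(name) is not None

print(is_valid_username('my-us3r_n4m3'))
print(is_valid_username('th1s1s-wayt00_l0ngt0beausername'))
```

The first call accepts the matching string above; the second rejects the long one because it has more than 16 characters.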


2. Matching a Password

Matching a password

Pattern:

/^[a-z0-9_-]{6,18}$/

Description:

Matching a password is very similar to matching a username. The only difference is that instead of 3 to 16 letters, numbers, underscores, or hyphens, we want 6 to 18 of them ({6,18}).

String that matches:

myp4ssw0rd

String that doesn’t match:

mypa$$w0rd (contains a dollar sign)


3. Matching a Hex Value

Matching a hex value

Pattern:

/^#?([a-f0-9]{6}|[a-f0-9]{3})$/

Description:

We begin by telling the parser to find the beginning of the string (^). Next, a number sign is optional because it is followed by a question mark. The question mark tells the parser that the preceding character (in this case a number sign) is optional, but to be “greedy” and capture it if it’s there. Next, inside the first group (first group of parentheses), we can have two different situations. The first is any lowercase letter between a and f or a number six times. The vertical bar tells us that we can also have three lowercase letters between a and f or numbers instead. Finally, we want the end of the string ($).

I put the six-character alternative first so that the parser tries the longer form first for a hex value like #ffffff. With the ^ and $ anchors in place, either order matches the same strings. The order only matters if the anchors are removed: with the three-character alternative first, the parser would pick up #fff and leave the other three f’s unmatched.

String that matches:

#a3c113

String that doesn’t match:

#4d82h4 (contains the letter h)
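
A small sketch, again in Python (my choice for illustration, as is the name is_hex_color), that runs the pattern against the two strings above. It also shows the one situation where the order of the two alternatives makes a visible difference: an unanchored search.

```python
import re

# The hex pattern from above, without the / delimiters.
HEX_RE = re.compile(r'^#?([a-f0-9]{6}|[a-f0-9]{3})$')

def is_hex_color(value):
    """Return True if value is a 3- or 6-digit lowercase hex color."""
    return HEX_RE.fullmatch(value) is not None

print(is_hex_color('#a3c113'))
print(is_hex_color('#4d82h4'))

# Without ^ and $, the first alternative that fits wins.
three_first = re.search(r'#?([a-f0-9]{3}|[a-f0-9]{6})', '#ffffff')
six_first = re.search(r'#?([a-f0-9]{6}|[a-f0-9]{3})', '#ffffff')
print(three_first.group(0))  # stops after three digits
print(six_first.group(0))    # takes all six digits
```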


4. Matching a Slug

Matching a slug

Pattern:

/^[a-z0-9-]+$/

Description:

You will be using this regex if you ever have to work with mod_rewrite and pretty URLs. We begin by telling the parser to find the beginning of the string (^), followed by one or more (the plus sign) lowercase letters, numbers, or hyphens. Finally, we want the end of the string ($).

String that matches:

my-title-here

String that doesn’t match:

my_title_here (contains underscores)


5. Matching an Email

Matching an email

Pattern:

/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/

Description:

We begin by telling the parser to find the beginning of the string (^). Inside the first group, we match one or more lowercase letters, numbers, underscores, dots, or hyphens. I have escaped the dot because, outside of brackets, a non-escaped dot means any character (inside brackets the escape is harmless but not required). Directly after that, there must be an at sign. Next is the domain name, which must be one or more lowercase letters, numbers, dots, or hyphens. Then another (escaped) dot, with the extension being two to six letters or dots. I have 2 to 6 because of the country specific TLDs (.ny.us or .co.uk). Finally, we want the end of the string ($).

String that matches:

john@doe.com

String that doesn’t match:

john@doe.something (TLD is too long)
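
To see what the three groups actually capture, here is a rough Python sketch. The language and the helper name split_email are assumptions made for illustration only. It is meant to show how the groups behave, not to serve as a complete email validator.

```python
import re

# The email pattern from above, without the / delimiters.
EMAIL_RE = re.compile(r'^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$')

def split_email(address):
    """Return (name, domain, extension), or None if the pattern fails."""
    match = EMAIL_RE.fullmatch(address)
    return match.groups() if match else None

print(split_email('john@doe.com'))
print(split_email('john@doe.co.uk'))
print(split_email('john@doe.something'))
```

Because the domain group is greedy, john@doe.co.uk is split as the domain doe.co and the extension uk. The pattern has no plus sign in the first group, so an address like name+ext@gmail.com is rejected.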


6. Matching a URL

Matching a URL

Pattern:

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

Description:

This regex is almost like taking the ending part of the above regex, slapping it between “http://” and some file structure at the end. It sounds a lot simpler than it really is. To start off, we search for the beginning of the line with the caret.

The first capturing group is entirely optional. It allows the URL to begin with “http://”, “https://”, or neither of them. I have a question mark after the s to allow URLs that have http or https. In order to make this entire group optional, I just added a question mark to the end of it.

Next is the domain name: one or more numbers, letters, dots, or hyphens followed by another dot then two to six letters or dots. The following section is the optional files and directories. Inside the group, we want to match any number of forward slashes, letters, numbers, underscores, spaces, dots, or hyphens. Then we say that this group can be matched as many times as we want. Pretty much this allows multiple directories to be matched along with a file at the end. I have used the star instead of the question mark because the star says zero or more, not zero or one. If a question mark were used there, only one file/directory could be matched.

Then a trailing slash is matched, but it can be optional. Finally we end with the end of the line.

String that matches:

http://net.tutsplus.com/about

String that doesn’t match:

http://google.com/some/file!.html (contains an exclamation point)
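
A short sketch of the pattern at work, in Python (my choice of language; the helper name parse_url is hypothetical). Forward slashes need no escaping in a Python raw string, so I have left those backslashes out. I only feed it short inputs, because a starred group wrapped in another star can backtrack heavily on long strings that do not match.

```python
import re

# The URL pattern from above; "/" needs no escaping in Python.
URL_RE = re.compile(r'^(https?://)?([\da-z\.-]+)\.([a-z\.]{2,6})([/\w \.-]*)*/?$')

def parse_url(url):
    """Return (scheme, domain, extension), or None if the pattern fails."""
    match = URL_RE.match(url)
    if match is None:
        return None
    # Group 1 is None when the optional scheme is absent.
    return match.group(1), match.group(2), match.group(3)

print(parse_url('http://net.tutsplus.com/about'))
print(parse_url('net.tutsplus.com'))
print(parse_url('http://google.com/some/file!.html'))
```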


7. Matching an IP Address

Matching an IP address

Pattern:

/^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/

Description:

Now, I’m not going to lie, I didn’t write this regex; I got it from here. Now, that doesn’t mean that I can’t rip it apart character for character.

The first capture group really isn’t a captured group because ?: was placed inside, which tells the parser to not capture this group (more on this in the last regex). We also want this non-captured group to be repeated three times (the {3} at the end of the group). This group contains another group, a subgroup, and a literal dot. The parser looks for a match in the subgroup then a dot to move on.

The subgroup is also another non-capture group. It’s just a bunch of character sets (things inside brackets): the string “25” followed by a number between 0 and 5; or the string “2” and a number between 0 and 4 and any number; or an optional zero or one followed by two numbers, with the second being optional.

After we match three of those, it’s onto the next non-capturing group. This one wants: the string “25” followed by a number between 0 and 5; or the string “2” with a number between 0 and 4 and another number at the end; or an optional zero or one followed by two numbers, with the second being optional.

We end this confusing regex with the end of the string.

String that matches:

73.60.124.136 (no, that is not my IP address :P)

String that doesn’t match:

256.60.124.136 (after “25”, the third digit must be a number between zero and five)
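
Since the same subgroup appears twice, one way to make the pattern easier to read is to build it from a named piece. This is only a sketch in Python; the names OCTET, IP_RE, and is_ipv4 are mine, not part of the original regex. The assembled pattern is character for character the one shown above.

```python
import re

# One number from 0 to 255, written as the non-capturing subgroup above.
OCTET = r'(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'

# Three "number plus dot" repetitions, then a final number.
IP_RE = re.compile(r'^(?:' + OCTET + r'\.){3}' + OCTET + r'$')

def is_ipv4(text):
    """Return True if text is four dot-separated numbers from 0 to 255."""
    return IP_RE.fullmatch(text) is not None

print(is_ipv4('73.60.124.136'))
print(is_ipv4('256.60.124.136'))
```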


8. Matching an HTML Tag

Matching an HTML tag

Pattern:

/^<([a-z]+)([^>]+)*(?:>(.*)<\/\1>|\s+\/>)$/

Description:

One of the more useful regexes on the list. It matches any HTML tag with the content inside. As usual, we begin with the start of the line.

First comes the tag’s name. It must be one or more letters long. This is the first capture group, and it comes in handy when we have to grab the closing tag. Next come the tag’s attributes. This is any character but a greater than sign (>). Since this is optional, but I want to match more than one character, the star is used. The plus sign makes up the attribute and value, and the star says as many attributes as you want.

Next comes a non-capturing group. Inside, it will contain either a greater than sign, some content, and a closing tag; or some spaces, a forward slash, and a greater than sign. The first option looks for a greater than sign followed by any number of characters, and the closing tag. \1 is used, which represents the content that was captured in the first capturing group. In this case it was the tag’s name. Now, if that couldn’t be matched we want to look for a self closing tag (like an img, br, or hr tag). This needs to have one or more spaces followed by “/>”.

The regex is ended with the end of the line.

String that matches:

<a href="http://net.tutsplus.com/">Nettuts+</a>

String that doesn’t match:

<img src="img.jpg" alt="My image>" /> (attributes can’t contain greater than signs)
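
The backreference is the part that is hardest to picture, so here is a rough Python sketch of it. This is a simplified variant, not the exact pattern above: I have written the attribute group as [^>]* (any run of characters that are not a greater than sign, as the description words it) to keep backtracking cheap. The names TAG_RE and parse_tag are hypothetical.

```python
import re

# Group 1: tag name. Group 2: attributes. Group 3: inner content.
# \1 must repeat whatever group 1 captured, so the closing tag has to agree.
TAG_RE = re.compile(r'^<([a-z]+)([^>]*)(?:>(.*)</\1>|\s+/>)$')

def parse_tag(html):
    """Return (name, attributes, content), or None if the pattern fails."""
    match = TAG_RE.match(html)
    return match.groups() if match else None

print(parse_tag('<a href="http://net.tutsplus.com/">Nettuts+</a>'))
print(parse_tag('<br />'))
print(parse_tag('<a href="x">text</b>'))
print(parse_tag('<img src="img.jpg" alt="My image>" />'))
```

The third call fails because the closing tag is b while group 1 captured a. For the self closing br tag, the content group takes no part in the match, so Python reports it as None.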


Conclusion

I hope that you have grasped the ideas behind regular expressions a little bit better. Hopefully you’ll be using these regexes in future projects! Many times you won’t need to decipher a regex character by character, but doing so sometimes helps you learn. Just remember: don’t be afraid of regular expressions. It might not seem like it, but they make your life a lot easier. Just try and pull out a tag’s name from a string without regular expressions! ;)


  • karabey
  • http://drostie.org/ Drostie

    While I appreciate the effort, you should delete example 2 (the password regular expression) as soon as possible so that people on the Internet do not follow your (evil) recommended practice.

    You should not store my password in your database in cleartext anyway, and you should not place any limitations on my password except possibly for some sort of “too easily guessed” limitation — and really, that’s generally *my* problem, not yours.

    Databases should store a salt and a verifier. Salts can be generated in PHP by base64_encode(openssl_random_pseudo_bytes(24)) and will be 32 ASCII characters. Verifiers can be generated in PHP by hash('sha256', "$salt:$password") and will be 64 ASCII characters. Store (salt, verifier) in the database and you’re safe. This is a very simple procedure and it allows someone to safely use arbitrary characters in their password, while protecting you from database compromise: just getting someone’s verifier is not enough to log in as them, and there are no good ways to speed up the password-guessing.

    • Larz Conwell

      They never said they were storing passwords in clear text, checking input passwords against regex is a good way to test the accuracy of a good password. Also limitations on passwords are a good practice to make sure users are using strong passwords, so rainbow attacks aren’t as feasible.

      Also if your DB gets hacked and all the passwords are leaked it would be the site owners fault because it was their responsibility to maintain enough security to stop attacks. So it would be their problem.

      Also the implementation you used above is not very good at all. Instead of rolling your own method you should check out bcrypt/scrypt and/or PBKDF2. Most bcrypt libraries also have a built in salting mechanism.

  • Jakub Nesetril

    While it’s a nice resource to teach people Regular Expressions, almost every single one of these regexps is a poor example: URLs, emails, HTML tags – all these should be parsed by sophisticated parsers, as the variation can be great and you’re unlikely to write a fully conforming parser yourself. For example your email address RE won’t even recognize common name+ext@gmail.com.

    In other cases your REs are unnecessarily restrictive: usernames should allow uppercase, passwords should not place limits on characters used, etc…

    Again – these are nice examples to explain how REs work, but please, please, choose different examples and don’t get people used to these bad habits.

  • http://qfox.ru qfox

    /^#?([a-f0-9]{6}|[a-f0-9]{3})$/
    can be rewritten as
    /^#?(([a-f0-9]{3}){1,2})$/
    its actually the same but slightly easier

  • Dominic Son

    One of the coolest writeups on regular expression! I mean, dang, regex has color?

  • http://saurini.com Rob Saurini

    /(?<=[0-9]{4}\/[0-9]{2}\/[0-9]{2}\/)[0-9a-z-]+/

    Given a WordPress url such as http://www.yoursite.com/2012/04/30/your-awesome-post/comments, the above regex will return only 'your-awesome-post'. I'd highly recommend using a tool like http://gskinner.com/RegExr/ to test out your regexes as you type. It's quite helpful.

  • Larz Conwell

    If using Ruby and maybe on other platforms/languages as well, instead of using the chars ^$ to denote beginning and end of lines, you should use \A\z to denote the real end instead.

    using ^$ you can easily exploit regex checks because it checks every line while \A\z checks beginning and ends of the actual string.

    using: ^$
    alert(‘some sort of XSS here’); <– This will not be checked and will be saved with it
    MyUsername <– Will make the regex still pass

  • Ripul

    URL regex certainly does not cater to many possibilities in URL’s. I agree with Rob above.

  • George

    Hi Vasili, the email regular expression still allows for dots on the end.

    Thanks

  • http://altreus.blogspot.com Altreus

    Literally every one of these except number 3 is wrong, and number 3 is badly worded.

    Most of them are wrong because they discount non-ASCII characters.

    The regex for matching an email is 500 characters long and invalid in many dialects: hence it is impossible for all practical purposes to match an email address with a regex.

    It is impossible to match HTML with a regex.

    A “hex value” is /^[[:xdigit:]]+$/. Yours deals with hex colours, which is 3 or 6 hex digits, and maybe a hash – although who could imagine that there are 3- or 6-letter words in the range [a-f]!

    /^[[:xdigit:]]{3}([[:xdigit:]]{3})?$/

    Stop trying to use regexes as a generic tool. Now you have two problems.

  • Diane Mitchlin

    Regular expression matching is ALWAYS wrong. If you ever see preg_match() in code, you can immediately assume that it is buggy. I have never, ever, in over 15 years of software development, seen regex matching done in a way that doesn’t result in bugs. Ever.

    Regular expression replacement is a different beast altogether. I have no problems with simple whitelist character filters with preg_replace(). You are filtering out bad input into something expecting clean input. But you can’t use it to validate an e-mail address or do much else with it. Which is exactly how regular expressions are supposed to be used.

    “But I want to do e-mail address validation,” you whine. Then get a library to do it that uses a state engine to correctly parse each little rinky-dinky detail of the RFCs for e-mail. And if that library uses preg_match(), it is doing it wrong. Doing e-mail correctly is ridiculously hard even though it is something that seems simple. Doing it wrong is easy, lazy, and results in broken websites. So, go get a library that does it right so you don’t piss off your users when they type in “myemail+youareaspammer@gmail.com”. You will thank me later.

    All the other examples can and should be done by parsing the string yourself. A regex match is an inferior way to write software.

    • http://jonathanweatherhead.com Jonathan Weatherhead

      Diane, it’s not that RegEx’s lack the expressive power (a DFA is a state-driven parser) it’s more that the RFC spec for URLs and emails is really complicated despite the deceptively simple format and so a rather large RegEx must be written to handle all the fine points. I do agree with the general direction of the comments however; great introductory article on RegEx but not to be taken seriously for production code.

      Larz Conwell, there is a RegEx flag for handling multi-line strings.

    • Ítalo

      :P

  • BZ

    Hello There,

    Why don’t you validate email with the filter_var function instead of a difficult preg_match?
    Take a look at this:

    filter_var('test@test.com', FILTER_VALIDATE_EMAIL);

    Returns the address if it is valid, or false otherwise.

  • Ali Akbar Panahi

    It’s Really Good.

    in URL Matching it accept : http://a.co.m/hi

    i Think it’s better Regex for that : ^(https?:\/\/)?([\da-z\.-]+)\.([a-z]{2,6})\/+([\/\w \.-]*)*\/?$

  • themookieb

    great post!!!!! i love the simple breakdown of a complicated subject. graphics helped out!

  • pdxrod

    The URL regex above leads to an infinite loop in Ruby 1.9.3 with Rails 3.2.9:

    $ rails c
    Loading development environment (Rails 3.2.9)
    1.9.3p125 :001 > URL_REG = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
    => /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
    1.9.3p125 :002 > LONG_URL = "http://www.google.com/hostednews/ap/article/ALeqM5iwQW7HgPFajZJSBwsJivEBBeetOQ?docId=29fa56719a3c43cfa6766bebc7a62da2"
    => "http://www.google.com/hostednews/ap/article/ALeqM5iwQW7HgPFajZJSBwsJivEBBeetOQ?docId=29fa56719a3c43cfa6766bebc7a62da2"
    1.9.3p125 :003 > LONG_URL =~ URL_REG

  • Guest

    Email pattern should be /^([a-z0-9_\.+-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/ otherwise you won’t match e+mail@domain.com which is a correct address.

  • Zlatko

    Please please please make a note next to most of the regexes used that they’re wrong, simplified versions and that they’re only here to help you teach some of regex concepts.

  • http://yuriybabenko.com Yuriy Babenko

    There’s no point memorizing or otherwise ‘knowing’ any REGEX. Learn REGEX so you can read it just like any other code and you’ll be way better off than memorizing solutions. Never hurts to have common expressions saved as easily accessible snippets, though.

  • Dotty

    “I have escaped the dot because a non-escaped dot means any character.”

    Not true: in a character class (as here), the escaping rules are different, and dot means a literal dot.

    • Dotty

      It’s superfluous, and that makes it confusing. As someone reading the code, I’m wondering if you added a \ by mistake (and meant just [.]) or forgot that \ still needs to be escaped here (and meant [\\.]).

      Granted, some people are paranoid and just escape everything, but you don’t seem to have any problem with leaving other special characters (like / or -) bare inside a character class when it’s technically not required, which would suggest to me that you probably meant [\.].

      In this article, you made a big diagram showing what it was supposed to mean, plus a couple paragraphs of text, but no regular expressions in real programs actually have either of those.

  • http://www.facebook.com/sergiu.negara Sergiu Nes Negara

    Seriously, why should one use these wrong patterns?

  • BRICK

    Also for the URL pattern, it should match GET variables as well!

  • Lapa

    Why doesn’t the password 333333 work with example #2? How do I write the code to have any range (i.e., from 3-10 for example) of any combination of alphanumeric characters match the password?

  • justme

    Images help a Ton! Thanks for your time on this!

  • yarick123

    “Matching a URL” – so good as nothing. There are much more schemata as “http” and “https”, user name and password can be also in an URL,..

  • guest

    thank you…