8 Regular Expressions You Should Know

Regular expressions are a language of their own. When you learn a new programming language, they’re this little sub-language that makes no sense at first glance. Many times you have to read another tutorial, article, or book just to understand the “simple” pattern described. Today, we’ll review eight regular expressions that you should know for your next coding project.


Background Info on Regular Expressions

This is what Wikipedia has to say about them:

In computing, regular expressions provide a concise and flexible means for identifying strings of text of interest, such as particular characters, words, or patterns of characters. Regular expressions (abbreviated as regex or regexp, with plural forms regexes, regexps, or regexen) are written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.

Now, that doesn’t really tell me much about the actual patterns. The regexes I’ll be going over today contain characters such as \w, \s, \1, and many others that represent something totally different from what they look like.

If you’d like to learn a little about regular expressions before you continue reading this article, I’d suggest watching the Regular Expressions for Dummies screencast series.

The eight regular expressions we’ll be going over today will allow you to match a(n): username, password, hex value (like #fff or #000), slug, email, URL, IP address, and HTML tag. As the list goes down, the regular expressions get more and more confusing. The pictures for the first few regexes are easy to follow, but the last four are more easily understood by reading the explanation.

The key thing to remember about regular expressions is that they are almost read forwards and backwards at the same time. This sentence will make more sense when we talk about matching HTML tags.

Note: The delimiters used in the regular expressions are forward slashes, “/”. Each pattern begins and ends with a delimiter. If a forward slash appears in a regex, we must escape it with a backslash: “\/”.


1. Matching a Username

Pattern:

/^[a-z0-9_-]{3,16}$/

Description:

We begin by telling the parser to find the beginning of the string (^), followed by any lowercase letter (a-z), number (0-9), underscore, or hyphen. Next, {3,16} makes sure that there are at least 3 of those characters, but no more than 16. Finally, we want the end of the string ($).

String that matches:

my-us3r_n4m3

String that doesn’t match:

th1s1s-wayt00_l0ngt0beausername (too long)
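
As a rough sketch, this is how the pattern could be tried out in Python. The function name is_valid_username is made up for illustration, and Python's re module takes the pattern without the forward-slash delimiters.

```python
import re

# The article's pattern, minus the / delimiters (Python does not use them).
USERNAME = re.compile(r"^[a-z0-9_-]{3,16}$")

def is_valid_username(name: str) -> bool:
    """Return True when the whole string is 3 to 16 allowed characters."""
    # fullmatch also rejects a trailing newline, which a bare $ would tolerate.
    return USERNAME.fullmatch(name) is not None

print(is_valid_username("my-us3r_n4m3"))                     # True
print(is_valid_username("th1s1s-wayt00_l0ngt0beausername"))  # False (31 characters)
```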


2. Matching a Password

Pattern:

/^[a-z0-9_-]{6,18}$/

Description:

Matching a password is very similar to matching a username. The only difference is that instead of 3 to 16 letters, numbers, underscores, or hyphens, we want 6 to 18 of them ({6,18}).

String that matches:

myp4ssw0rd

String that doesn’t match:

mypa$$w0rd (contains a dollar sign)
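
The effect of the {6,18} quantifier is easiest to see at its boundaries. The sketch below assumes Python, and the helper name is_valid_password is hypothetical. It probes the lengths just inside and just outside the allowed range.

```python
import re

# Same character class as the username pattern, with different length bounds.
PASSWORD = re.compile(r"^[a-z0-9_-]{6,18}$")

def is_valid_password(candidate: str) -> bool:
    """Return True when the whole string is 6 to 18 allowed characters."""
    return PASSWORD.fullmatch(candidate) is not None

# Lengths 5 and 19 fall outside {6,18}; lengths 6 and 18 sit on its edges.
for length in (5, 6, 18, 19):
    print(length, is_valid_password("a" * length))

print(is_valid_password("myp4ssw0rd"))  # True
print(is_valid_password("mypa$$w0rd"))  # False: $ is not in the character class
```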


3. Matching a Hex Value

Pattern:

/^#?([a-f0-9]{6}|[a-f0-9]{3})$/

Description:

We begin by telling the parser to find the beginning of the string (^). Next, a number sign is optional because it is followed by a question mark. The question mark tells the parser that the preceding character (in this case a number sign) is optional, but to be “greedy” and capture it if it’s there. Next, inside the first group (the first set of parentheses), we can have two different situations. The first is six characters, each a lowercase letter between a and f or a number. The vertical bar tells us that we can instead have three such characters. Finally, we want the end of the string ($).

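I put the six-character alternative first so that the parser tries to capture a full hex value like #ffffff before falling back to the short form. In this pattern the order does not change the result, because the $ forces the parser to backtrack and try the other alternative when characters are left over. Without the $, though, a pattern with the three-character alternative first would only pick up #fff and not the other three f’s.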

String that matches:

#a3c113

String that doesn’t match:

#4d82h4 (contains the letter h)
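
The following Python sketch checks the pattern and then tries out the order of the two alternatives with and without the $ anchor. The helper name is_hex_color and the two reordered patterns are illustrative only; they are not part of the pattern above.

```python
import re

HEX = re.compile(r"^#?([a-f0-9]{6}|[a-f0-9]{3})$")

def is_hex_color(value: str) -> bool:
    """Return True for a 3- or 6-digit lowercase hex value, '#' optional."""
    return HEX.fullmatch(value) is not None

print(is_hex_color("#a3c113"))  # True
print(is_hex_color("fff"))      # True: the number sign is optional
print(is_hex_color("#4d82h4"))  # False: h is outside a-f

# Three-character alternative first, no $ anchor: the match stops early.
unanchored = re.match(r"#?([a-f0-9]{3}|[a-f0-9]{6})", "#ffffff")
print(unanchored.group(0))      # #fff

# Same order with the anchors: $ fails after three characters, so the
# engine backtracks into the six-character alternative.
anchored = re.match(r"^#?([a-f0-9]{3}|[a-f0-9]{6})$", "#ffffff")
print(anchored.group(0))        # #ffffff
```

To accept uppercase values such as #A3C113 as well, the pattern can be compiled with re.IGNORECASE.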


4. Matching a Slug

Pattern:

/^[a-z0-9-]+$/

Description:

You will be using this regex if you ever have to work with mod_rewrite and pretty URLs. A slug is the human-readable part of such a URL, usually derived from a title (my-title-here, for example). We begin by telling the parser to find the beginning of the string (^), followed by one or more (the plus sign) lowercase letters, numbers, or hyphens. Finally, we want the end of the string ($).

String that matches:

my-title-here

String that doesn’t match:

my_title_here (contains underscores)
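
A small Python sketch of the check follows, together with one possible way of producing a string that passes it. The slugify helper is an assumption added for illustration; it is not something the pattern itself provides.

```python
import re

SLUG = re.compile(r"^[a-z0-9-]+$")

def is_slug(value: str) -> bool:
    """Return True when the string is only lowercase letters, digits, hyphens."""
    return SLUG.fullmatch(value) is not None

def slugify(title: str) -> str:
    """Hypothetical helper: lowercase the title and turn every run of other
    characters into a single hyphen, trimming hyphens at both ends."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

print(is_slug("my-title-here"))   # True
print(is_slug("my_title_here"))   # False: underscores are not allowed
print(slugify("My Title Here!"))  # my-title-here
```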


5. Matching an Email

Pattern:

/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/

Description:

We begin by telling the parser to find the beginning of the string (^). Inside the first group, we match one or more lowercase letters, numbers, underscores, dots, or hyphens. I have escaped the dot because a non-escaped dot means any character (inside brackets the escape is not strictly needed, but it is harmless). Directly after that, there must be an at sign. Next is the domain name, which must be one or more lowercase letters, numbers, dots, or hyphens. Then another (escaped) dot, with the extension being two to six letters or dots. I allow 2 to 6 because of the country-specific TLDs (top-level domains) such as .ny.us or .co.uk. Finally, we want the end of the string ($).

String that matches:

john@doe.com

String that doesn’t match:

john@doe.something (TLD is too long)
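
Here is a hedged Python sketch that runs the pattern against a few addresses, including some edge cases worth knowing about. The name match_email and its ignore_case parameter are invented for this example.

```python
import re

EMAIL_PATTERN = r"^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$"

def match_email(address: str, ignore_case: bool = False):
    """Return the match object for a whole-string match, or None."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.fullmatch(EMAIL_PATTERN, address, flags)

match = match_email("john@doe.com")
print(match.groups())                     # ('john', 'doe', 'com')
print(match_email("john@doe.something"))  # None: nine letters after the last dot

# Edge cases of the pattern as written:
print(match_email("Test@test.com"))                                # None: uppercase T
print(match_email("Test@test.com", ignore_case=True) is not None)  # True
print(match_email("john+doe@mymail.com"))                          # None: + is not allowed
print(match_email("xyz@domain.com.") is not None)                  # True: [a-z\.] accepts "com."
```

The last line shows that a trailing dot slips through, because the extension class allows dots as well as letters.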


6. Matching a URL

Pattern:

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

Description:

This regex is almost like taking the ending part of the above regex, slapping it between “http://” and some file structure at the end. It sounds a lot simpler than it really is. To start off, we search for the beginning of the line with the caret.

The first capturing group is entirely optional. It allows the URL to begin with “http://”, “https://”, or neither of them. I have a question mark after the s to allow URLs that have http or https. In order to make this entire group optional, I just added a question mark to the end of it.

Next is the domain name: one or more numbers, letters, dots, or hyphens followed by another dot then two to six letters or dots. The following section is the optional files and directories. Inside the group, we want to match any number of forward slashes, letters, numbers, underscores, spaces, dots, or hyphens. Then we say that this group can be matched as many times as we want. This allows multiple directories to be matched along with a file at the end. I have used the star instead of the question mark because the star says zero or more, not zero or one. If a question mark were used there, only one file/directory could be matched.

Then a trailing slash is matched, but it is optional. Finally, we end with the end of the line.

String that matches:

http://net.tutsplus.com/about

String that doesn’t match:

http://google.com/some/file!.html (contains an exclamation point)
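
The sketch below, in Python, shows which part of a URL lands in which capturing group. Only the pattern comes from the article; the variable names are illustrative.

```python
import re

URL = re.compile(r"^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$")

match = URL.fullmatch("http://net.tutsplus.com/about")
# Groups 1 to 3: scheme, domain name, extension.
print(match.group(1, 2, 3))  # ('http://', 'net.tutsplus', 'com')

# The scheme group is optional, so a bare host name also matches.
print(URL.fullmatch("net.tutsplus.com") is not None)       # True

# The exclamation point is in none of the character classes.
print(URL.fullmatch("http://google.com/some/file!.html"))  # None
```

Note that the dot between the domain name and the extension sits outside every group, so gluing the captured groups back together does not reproduce the original URL.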


7. Matching an IP Address

Pattern:

/^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/

Description:

Now, I’m not going to lie, I didn’t write this regex; I got it from here. Now, that doesn’t mean that I can’t rip it apart character for character.

The first capture group really isn’t a captured group because ?: was placed inside which tells the parser to not capture this group (more on this in the last regex). We also want this non-captured group to be repeated three times (the {3} at the end of the group). This group contains another group, a subgroup, and a literal dot. The parser looks for a match in the subgroup then a dot to move on.

The subgroup is also a non-capturing group. It’s just a bunch of character sets (things inside brackets): the string “25” followed by a digit between 0 and 5; or the string “2” followed by a digit between 0 and 4 and then any digit; or an optional zero or one followed by one or two digits.

After we match three of those, it’s on to the next non-capturing group. This one wants the same thing: the string “25” followed by a digit between 0 and 5; or the string “2” with a digit between 0 and 4 and another digit at the end; or an optional zero or one followed by one or two digits.

We end this confusing regex with the end of the string.

String that matches:

73.60.124.136 (no, that is not my IP address :P)

String that doesn’t match:

256.60.124.136 (256 is out of range: after “25” the last digit must be between zero and five)
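
One way to make this regex easier to read is to build it from its repeated piece. The Python sketch below does that. The names OCTET, IPV4 and is_ipv4 are made up for this example, and the assembled pattern is character for character the one shown above.

```python
import re

# One number from 0 to 255: 250-255, 200-249, or anything up to 199.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

# Three "octet plus dot" groups followed by a final octet.
IPV4 = re.compile(rf"^(?:{OCTET}\.){{3}}{OCTET}$")

def is_ipv4(value: str) -> bool:
    """Return True when the whole string is a dotted-quad IPv4 address."""
    return IPV4.fullmatch(value) is not None

print(is_ipv4("73.60.124.136"))   # True
print(is_ipv4("256.60.124.136"))  # False: 256 is out of range
print(is_ipv4("1.2.3"))           # False: only three numbers
```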


8. Matching an HTML Tag

Pattern:

/^<([a-z]+)([^>]+)*(?:>(.*)<\/\1>|\s+\/>)$/

Description:

One of the more useful regexes on the list. It matches any HTML tag with the content inside. As usual, we begin with the start of the line.

First comes the tag’s name. It must be one or more letters long. This is the first capture group; it comes in handy when we have to grab the closing tag. Next come the tag’s attributes. This is any character but a greater than sign (>). Since this is optional, but I want to match more than one character, the star is used. The plus sign makes up the attribute and value, and the star says as many attributes as you want.

Next comes the third group, which is non-capturing. Inside, it will contain either a greater than sign, some content, and a closing tag; or some spaces, a forward slash, and a greater than sign. The first option looks for a greater than sign followed by any number of characters, and the closing tag. \1 is used, which represents the content that was captured in the first capturing group. In this case it was the tag’s name. Now, if that couldn’t be matched, we want to look for a self-closing tag (like an img, br, or hr tag). This needs to have one or more spaces followed by “/>”.

The regex is ended with the end of the line.

String that matches:

<a href="http://net.tutsplus.com/">Nettuts+</a>

String that doesn’t match:

<img src="img.jpg" alt="My image>" /> (attributes can’t contain greater than signs)
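
The backreference \1 is the interesting part of this pattern, so here is a small Python sketch that isolates it. It assumes the attribute group is written as [^>], which is what the description says (“any character but a greater than sign”). The variable names are illustrative.

```python
import re

# Group 1: tag name. Group 2: attributes. Group 3: content between the tags.
TAG = re.compile(r"^<([a-z]+)([^>]+)*(?:>(.*)<\/\1>|\s+\/>)$")

link = TAG.fullmatch('<a href="http://net.tutsplus.com/">Nettuts+</a>')
print(link.group(1))  # a
print(link.group(3))  # Nettuts+

# A self-closing tag takes the second branch: whitespace followed by "/>".
print(TAG.fullmatch("<br />") is not None)  # True

# \1 must repeat the opening tag's name, so a mismatched closing tag fails.
print(TAG.fullmatch("<a>text</b>"))         # None
```

The img example is left out of this sketch on purpose: a nested quantifier such as ([^>]+)* can make a backtracking engine take a very long time to give up on a long string that fails late.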


Conclusion

I hope that you have grasped the ideas behind regular expressions a little bit better. Hopefully you’ll be using these regexes in future projects! Many times you won’t need to decipher a regex character by character, but doing so sometimes helps you learn. Just remember, don’t be afraid of regular expressions. They might not seem like it, but they make your life a lot easier. Just try and pull out a tag’s name from a string without regular expressions! ;)


  • amir

    really useful tips abt regx

  • rizza

    This is what I am looking for

    Great post

  • http://www.nouveller.com/ Benjamin Reid

    I keep coming back to these regex’s for a class I’m writing, there so helpful!

    :)

  • http://ehussain.in Hussain Cutpiecewala

    Awesome..

  • http://brianary.blogspot.com/ Brianary

    Good introduction to regular expressions, but as production code, not so good. :(

    1. Simple username: /^[a-z0-9]+([_.-][a-z0-9]+)*$/i since you probably don’t want consecutive underscores, dots, or dashes and don’t want them to start the username. It really depends on whether you really need to create 7-bit ASCII usernames, though, which isn’t particularly international-friendly. Also reasonable: /^\w+(\S\w+)*$/ .
    2. Don’t validate passwords. That’s not a good idea.
    3. Remember to match the hex values case-insensitively.
    4. “Slug” is OK.
    5. Same problems as the username match. Try /^[a-z0-9]+([_.+-][a-z0-9]+)*@([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z0-9]{3,6}$/i . This will also match address+tag@gmail.com, john.q.public@us-1.example.org, &c., but not —@—.– or _@_.org . You should also be sure to exclude /@example\.(org|net|com)$/ so that you don’t get any phony RFC 2606 addresses (you could also do this with a negative lookahead assertion).
    6. Same problems as the email matching. Try /([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z0-9]{3,6}(\/([^ \t\n!*"'();:@&=+$,/?%#\[\]]|%[a-f0-9]{2}|%u[a-f0-9]{4})+)*(\?([^ \t\n!*"'();:@&=+$,/?%#\[\]]|%[a-f0-9]{2}|%u[a-f0-9]{4})*)?(#([^ \t\n!*"'();:@&=+$,/?%#\[\]]|%[a-f0-9]{2}|%u[a-f0-9]{4})*)?$/i . You don’t want _.com or –_.info or __.___ . You should also check against RFC 2606 here, too. You also want to make sure there is no whitespace, and that the correct characters are used according to the spec.
    7. IP address is OK, though you could also allow for integer notation.
    8. Don’t use regular expressions to parse HTML or XML. These are not regular languages, and this will always lead to code that will eventually break. It’s OK for the occasional ad-hoc Perl one-liner, but you shouldn’t be maintaining regex code for markup parsing. There are libraries for that.

  • http://www.mycrazydream.net mycrazydream

    These are very helpful to someone that has no clue about regex, but some fall short of accuracy. Rather, they will match what each determines to match, but other things as well. In other words, they are not specific enough. And that is what regex is all about. I want to capture what I want. Only that. Nothing else.

    But all in all good work

    • AppreciateThings

      I bet you dont know shit about regex compared to the top 1% of regex knowledgeable people out there. So quit acting like a dick and STFU.

  • http://www.mycrazydream.net mycrazydream

    But I have to continue in lieu of the comment to not “parse HTML or XML.” Seriously? Not only would that be seriously limiting but almost every great javascript framework in use on the web relies on the highest degree of pattern matching in the DOM to return the elements one needs to control. In other words, I take that warning to be, “don’t try to use regex where data on the internet is concerned.” Laughable.

    • http://brianary.blogspot.com/ Brianary

      It’s just not the ideal tool for the job. Parsers are available in any language for markup. And the beauty of the DOM is that it *is* a parser. You may want to double-check the claim that “almost every great javascript framework” uses regex to parse the DOM.

      Give me any regex and I will find either valid HTML/XML or working tag soup (stuff that isn’t valid, but looks right in current browsers) that won’t match correctly.

      Sure, it’ll usually work, and you can keep patching it each time you run across breaking code or the code changes unexpectedly, but eventually that will become your full-time job, and that’s not laughable.

  • http://zverek.net vovkin

    Just great!
    It’s really what I need.

  • http://redpyxll.com/redcode RedPyxll

    Great!
    Linked to it on my new code blog here: http://redpyxll.com/archives/408

  • Paw

    One of the best posts EVER!

  • http://www.hysia.com hysia

    Awesome post. I like it very much!

  • http://www.icpep.org Allan

    The email isn’t really working when you have
    IT does not consider it as a tag. But its good though.

  • http://www.ualberta.ca/~hwsamuel Hamman Samuel

    This is the best tutorial on regular expressions on the whole world wide web! I just begun learning them, and this place helped me the most

  • MR

    Actually validating an email with a regexp is a bit more complicated, as explained here: http://www.regular-expressions.info/email.html.

    For instance, your expression wouldn’t match john+doe@mymail.com though it’s a perfectly valid and RFC compliant address.

    And don’t forget the /i (case-insensitive) modifier.

  • Sumit

    Great Dude…

    I got details which I could get before today.
    Keep It Up.
    Also publish some more tutorials on PHP and OOPS with some basic examples of web applications implemented in 3-tier or N-tier architectures. I want to know basic of it. If any body of you all have links to those websites please mail me on my mail id: joshisumitnet@yahoo.com
    Thanks in advance. I want more about Tiered Architecture examples.

  • http://spotdex.com/ David Moreen

    Vasili I love you man, you just saved my booty on a project!

  • http://www.hiittech.com/ faraz

    very nice its save my time

  • Debo

    great post, Thank you!! Vasili.

  • http://sobujarefin.wordpress.com Sabuj Arefin

    very usefull post, thanks a lot.

  • ani

    awesome!

  • http://www.binarydreams.biz jon

    Yes i love this.

  • Arpan

    Great it is.

  • nXqd

    Awesome post . Very nice image explanation :)
    Thank you so much

  • http://www.usaanabolic.com Anabolic

    Very nice tutorial and very easy Thanks Anabolic

  • BigJim

    Problem with the email filter. If you enter

    xyz@domain.com.

    (note the trailing period) it passes. I have been trying to find a regex that does not allow the trailing dot. Any ideas?

  • http://www.programcreek.com Ryan

    Thanks a lot. I got some ideas for my site.

  • http://www.usaanabolic.com/ Anabolic

    Its Awesome I like ideas I will use them thanks

  • http://www.hiittech.com/ hiittech

    I got now complete idea about 8 Regular Expressions Thanks a lot

  • Elite Hussar

    Just awesome

  • Neha

    Hello

    actually i am trying to validate url.I used above mentioned function every thing is working well except “http://www.aaa” this.Can you please help me out?

  • http://flexdiary.blogspot.com Amy

    It would be nice if you defined some of the terms you used in this article. For instance, after reading the passage about how to recognize a “slug,” I still have no idea what a slug is in this context. Of course I’d know a slug if I found it in the garden (and would feed it to my chickens). And what does “TLD is too long” mean? What is the spell out of TLD?

    I’m a bit unclear on why you would want to prohibit strong passwords containing special characters. Could you elaborate?

    • http://davgarcia.com David Garcia

      TLD is Top Level Domain which is the last portion of a domain name such as .com or .co.uk.

  • craig

    In first example, what “$” sign is for ?

    • draciP

      “…and finally the end of the line” ;-)

  • http://www.integralwebsolutions.co.za/blog.aspx Robert Bravery

    This is great work. For some reason regex escapes me. But you have made it seem so simple. I like the way you isolate each aspect of the expression. Can’t say that it will stick in my brain. But it’s worth a try

  • Ajay Sharma

    hi,

    The examples you have shown are really good to understand regular expressions.

    Can you please tell me the books where i can learn good regular expressions or some site with regular expressions tutorials.

    Best Regards,
    Ajay

  • http://mattquinlan.com Matthew Quinlan

    Be careful using the URL matching regex when doing search/replace. $1$2$3$4 does not recreate the URL as you might expect because the DOTs / periods between the elements of the servername/domainname are not part of the numbered expressions.

  • Jamil Shah Afridi

    Great tutorial
    thank you so much

  • http://starikovs.com Vacheslav

    Hi! Cool creative pictures that explain these regexps :)

  • http://foleyoloro@hotmail.com Foley Oloro

    This is very explanatory. Awesome! Thanks.

  • Reynald

    Thank you very much!! bookmarking this for future referencing. :)

  • kd

    loved the graphical representation. great work

  • http://blog.tinogomes.com Tino Gomes

    IMHO, the unique validation you can use, via regexp, is to check minimum length, so…

    /^.{6,}$/

    but, the strong of that, you should use some function to check it…

  • http://azinkey.in AZinkey

    Hey!

    please clear me whats the role of () in patter
    means why we use this symbol

  • JiminP

    Whoops.

  • Abhishek

    This tutorial stands apart from any other tutorial available on the net. Coz of the graphical explanation. It is really amazing. Thank you!!!

    I was trying to build a regular expression which can do following:
    There must be atleast 1 dash i.e. – in the string anywhere.
    That is:
    It should match:
    1. -ABCD
    2. A-B
    3. A-B-C
    4. A–B-C
    5. 1-A–C
    It should not match the following
    6. ABCD
    7. 1ABCD
    etc etc
    Any ideas guruz!!

  • shahana

    well its really nice
    i want to validat date in the form of
    dd-mm-yy
    and
    dd/mm/yy
    wt shoud i do?

  • Carol

    How would I match “123″, but only when *NOT* surrounded by “abc” and “xyz”?

    1. abc123xyz (not matched)
    2. aaa123zzz (matched)
    3. aa123zxz (matched)
    4. aba123zdz (matched)
    5. aca123zaz (matched)

    • http://www.jeffrey-way.com Jeffrey Way

      (?<!abc)123(?!xyz)

  • http://www.webproxyblog.com Carl

    This is a fantastic post for someone like myself who’s spent a few days looking at php regular expressions.

    As many commentators have pointed out, you might not necessarily use regex to do these functions, but because everyone understands that an IP should be x.x.x.x, an email should be like some.thing@address.com – that’s exactly why these are working examples on how to use the (?:xx) and stuff I’ve been accumulating.

    Many thanks for helping me wrap things up nicely. Now to try a few practical examples of my own :)

  • http://cliklabs.com Anthony Fulginiti

    I don’t know if this was previously posted, but almost all of your regex’s should be updated to allow capital letters as well..

    Email (/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/)
    - This will validate: test@test.com
    - This will not validate: Test@test.com (common for phone’s entering emails)

    Change it to (/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/).. At least for preg_match in PHP.

    Otherwise great guide!

    • http://cliklabs.com Anthony Fulginiti

      Apologies, the corrected one is:

      (/^([a-zA-Z0-9_\.-]+)@([\da-zA-Z\.-]+)\.([a-zA-Z\.]{2,6})$/)

  • Jamie

    I love the visuals, but please don’t teach folks to validate passwords. It’s a very bad idea. The only reason you would want to validate a password is if you’re storing it in plain text… Which you should never do! md5 or SHA doesn’t care what you give it as an input, so it shouldn’t matter. I know guys who use spaces, dollar signs, slashes, you name it, in passwords. Regexing it just reduces the security.

    • http://www.theprojectxblog.net/ ProjectX

      I agree with Jamie. Regexing does reduce the security.

      • NegLewis

        Not quite so true.
        This is a concept.
        RegEx can increase security too if it’s done correctly.

    • Hashman

      I can think of lots of reasons to validate passwords. For example, if you let users type any symbol, there’s a much higher chance they won’t be able to type it next time (when they’re on a different keyboard, on their phone, can’t remember the key combo, etc.). This particular regex is far too strict, but the principle is sound. For another, it can be used to ensure a minimum length (as done here).

      I really don’t know what password storage has to do with password validation. You can validate a password with a regex, and then hash it for storage. The two are completely independent concepts.

  • fouzia

    very useful