Base What? A Practical Introduction to Base Encoding

Tutorial Details
  • Difficulty: Intermediate
  • Completion Time: 15 Minutes

At a young age, we learn to count on our fingers: first 1 to 5, then 1 to 10, and maybe, if you’re a particularly enterprising toddler, on to 20, 30, and beyond. No one ever tells us that we are actually making a fairly sophisticated mathematical assumption: we are counting in Base10.

In this article, we’ll start by gaining a more rounded understanding of Base10 and its structure, then we will discuss binary (Base2, the building blocks of computing). Finally, we’ll finish things up by talking about Base32 and Base64. At each stage we will discuss the advantages and uses for each type.


Why Base10

So, why did we choose Base10? It’s not because the digits 0-9 exist; they are actually a result of the choice to use Base10. Most likely, it is because of the learning process described above: we have 10 fingers, which makes the system much easier to grasp.

So, let’s talk a bit about how Base10 actually is structured. This will be the foundation of understanding that we’ll use in the subsequent discussion.

Starting at 0, we count up to 9 in the “1’s” column. Once the ones column reaches 9, it is full, so we add 1 to the next column to the left and reset the ones column to 0. For all intents and purposes, we can assume that there are an infinite number of leading zeros before the first significant column; in other words, “000008” is the same as “8”. Each time a column fills up, the column to its left increases by one, and the full column starts over at 0. Specifically, the 1s column runs from 0 to 9, and then another ten is added to the tens column. If the tens column and the 1s column are both at 9, the next increment adds 1 to the 100’s column, and so forth. We all know this piece of the puzzle.

Consider the number 1020. Starting from the right, we can understand this as “0*1 + 2*10 + 0*100 + 1*1000”. Now, consider the number 5,378. We can understand this as “8*1 + 7*10 + 3*100 + 5*1000”. The value contributed by each column in Base10, then, can be generalized as follows:

(the digit found in the column) * (10 raised to the power of (the column number, counted from the right, minus 1))

The number itself is the sum of these values over all of its columns.

Therefore, if there is a 6 in the 5th column from the right, 10^4*6 = 60,000.

We can see that this is a generalizable formula for understanding all base systems.
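The column formula above can be sketched in a few lines of Python. This is only an illustration: the function name `positional_value` and the choice to pass digits as a list are assumptions made for this example, not something defined in the article.

```python
def positional_value(digits, base):
    """Sum digit * base**(column - 1), counting columns from the right."""
    total = 0
    # reversed() makes the rightmost digit column 1, the next one column 2, ...
    for column, digit in enumerate(reversed(digits), start=1):
        total += digit * base ** (column - 1)
    return total


print(positional_value([1, 0, 2, 0], 10))     # 1020
print(positional_value([5, 3, 7, 8], 10))     # 5378
print(positional_value([6, 0, 0, 0, 0], 10))  # 60000
```

Because `base` is a parameter, the same function works unchanged for any other base.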


Base2 (Binary)

The next system we will talk about is Base2, or binary. Binary consists of two digits, 0 and 1. This lends itself well to computing for many reasons, most fundamentally because computers rely on switches that have two states: on or off. Binary is the most basic system needed for all logical operations (think “true” and “false”).

So, how does binary work? Take the formula from above, and instead of using ten, use two. And on that note, this is why these systems are referred to as Base(N).

(the digit found in the column) * (2 raised to the power of (the column number, counted from the right, minus 1))

So, let’s take the arbitrary number 1001101 in binary, and apply this formula.

(1 * 1) + (0 * 2) + (1 * 4) + (1 * 8) + (0 * 16) + (0 * 32) + (1 * 64) = 77
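As a quick check of the arithmetic above, here is a small Python sketch that builds the same list of (digit, column weight) pairs and compares the sum with Python’s built-in binary parser. The variable names are just for illustration.

```python
bits = "1001101"

# Pair each bit with its column weight (1, 2, 4, ...), starting from the right.
terms = [(int(bit), 2 ** power) for power, bit in enumerate(reversed(bits))]
total = sum(bit * weight for bit, weight in terms)

print(terms)         # [(1, 1), (0, 2), (1, 4), (1, 8), (0, 16), (0, 32), (1, 64)]
print(total)         # 77
print(int(bits, 2))  # 77, the built-in conversion agrees
```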

“Wait!”, you’re thinking. “If binary is all that computers are made of, how would you write letters in binary?” Good question. This actually brings us to our introduction of Base16.


Base16

Let’s, for a moment, imagine that we had 11 fingers. We would naturally be using Base11. Hard as that is to picture, what implications would it have? Perhaps the most important one is that we would have another increment beyond 9 in the 1s column. But it wouldn’t be written “10”, because 10 isn’t confined to the 1s column. It would instead be a single-digit representation of 10. And, in fact, that is exactly how letters function in base systems beyond Base10 up to Base62, with some caveats (which we’ll get to later when we talk about Base32).

Let’s imagine using Base11, but substitute a capital A for the single-digit “10” we discussed above. How would we write the number 54?

Since the second column from the right is the “11’s” column, we begin by dividing 54 by eleven, which gives us 4 with a remainder of 10. The 4 goes in the 11’s column and the remainder goes in the 1s column. If “A” represents 10, then in Base11 the number 54 is written as 4A.

Let’s do that in reverse, with the formula we used previously.

(the digit found in the column) * (11 raised to the power of (the column number, counted from the right, minus 1))

In this case, that would mean:

(1 * A) + (4 * 11)

Now, substitute 10 for A:

(1*10) + (4*11) = 54
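The divide-and-take-the-remainder step used above generalizes to numbers with more than two columns. The following Python sketch shows one way to do it; the function name `to_base` and the `DIGITS` string (0-9 followed by A-Z, so “A” stands for 10) are assumptions chosen for this example.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def to_base(number, base):
    """Convert a non-negative integer to a digit string by repeated division."""
    if not 2 <= base <= len(DIGITS):
        raise ValueError("base must be between 2 and 36")
    if number == 0:
        return "0"
    out = []
    while number > 0:
        # The remainder is the digit for the current (rightmost unfilled) column.
        number, remainder = divmod(number, base)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))


print(to_base(54, 11))  # 4A
print(int("4A", 11))    # 54, converting back with the built-in parser
```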

Hexadecimal

How is this useful, you’re wondering? Base11 may not necessarily be useful (unless you have some kind of data structure that would benefit from a Base11 system). However, Base16 is used throughout computer systems for multiple purposes. Also known as hexadecimal, Base16 uses the numbers 0-9 followed by the letters a-f (not case-sensitive). In particular, you will see hexadecimals used to define RGB colors in CSS (and in most color-picker widgets on desktop software), with two digits for each of the channels red, green, and blue.

So, for instance, #A79104 would produce r = A7, g = 91, b = 04. In decimals, this would be equivalent to r = 167, g = 145, b = 4; the resulting color would be a golden yellow. Two hexadecimal digits put together can represent 256 different numbers, and thus there are 256^3 (16,777,216) possible number combinations in the RGB hexadecimal system, represented by only 6 characters (or 3 if you use the shortcut method, where each of three digits is implicitly doubled; e.g. #37d == #3377dd).
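To make the color example concrete, here is a minimal Python sketch that splits a CSS-style hex color into its three channels and handles the three-digit shortcut. The function name `parse_hex_color` is hypothetical, and the sketch ignores other CSS color forms.

```python
def parse_hex_color(color):
    """Return (red, green, blue) as integers from a #RRGGBB or #RGB string."""
    value = color.lstrip("#")
    if len(value) == 3:
        # Shortcut form: each digit is implicitly doubled (#37d -> #3377dd).
        value = "".join(ch * 2 for ch in value)
    if len(value) != 6:
        raise ValueError("expected 3 or 6 hexadecimal digits")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


print(parse_hex_color("#A79104"))  # (167, 145, 4)
print(parse_hex_color("#37d"))     # (51, 119, 221)
print(256 ** 3)                    # 16777216 possible colors
```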

Base16 is often used in assembly languages, which are the lowest-level programming languages most developers can work in. Because hexadecimal numbers are easy to convert to binary, they are an easier way to write assembly code instructions.

Note: The same is generally true of the popularity of Base32 and Base64; these encodings are used because they map naturally onto binary data (32 and 64 are powers of 2), and because there are at least 64 safe characters (but not 128 safe characters) on almost every computer.

For a hexadecimal example, take the number 1100 in hexadecimal, which is equivalent to 4352 in decimal. The same number in binary is 0001 0001 0000 0000. Converting from hexadecimal to binary is a simple operation of using a conversion table, where 0 in hexadecimal is 0000 in binary and F in hexadecimal is 1111 in binary.

Note that the 0’s to the left of the first 1 are simply empty columns. Fundamentally, they are not needed; however, you will encounter binary written this way almost exclusively. This practice is called padding. It is used because the length of the data is otherwise unknown, which could cause problems when multiple pieces of data are transmitted together; with padding, each group is guaranteed to be a fixed size, for instance 4 bits for every hexadecimal digit. Padding also occurs in other commonly used, specification-based encoding schemes; in particular, Base32 and Base64 both use the equals sign (“=”) for padding.
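The conversion-table approach described above can be sketched in Python as follows. The names `NIBBLES` and `hex_to_binary` are made up for this example; the table is generated rather than typed out, but it is the same 16-row lookup, with every entry padded to 4 bits.

```python
# Lookup table: one hexadecimal digit -> its zero-padded 4-bit binary form.
NIBBLES = {format(n, "X"): format(n, "04b") for n in range(16)}


def hex_to_binary(hex_string):
    """Translate each hex digit through the table and join the 4-bit groups."""
    return " ".join(NIBBLES[ch] for ch in hex_string.upper())


print(NIBBLES["0"], NIBBLES["F"])  # 0000 1111
print(hex_to_binary("1100"))       # 0001 0001 0000 0000
print(int("1100", 16))             # 4352
```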


Base32

One might assume that Base32 is the numbers 0-9 and then the first 22 letters of the alphabet (up to V).

Remember when we mentioned the caveat above? This is the caveat: the most commonly accepted Base32 definition is actually an encoding that starts with the 26 letters of the alphabet and ends with the numbers 2-7. This is defined in the Internet Engineering Task Force’s Request for Comments (RFC) 4648, which also defines Base16 and Base64. Note that the difference is that the encoding for 0 is A, not 0. To encode a string in Base32, the following steps are applied.

First, the string to be encoded is split into 5-byte blocks (40 bits in binary). Each letter is stored as one 8-bit byte, so for every 5 letters, there are 40 bits. (ASCII itself defines only 128 characters and needs just 7 bits; the eighth bit is a leading 0. A full 8-bit byte can hold 256 different values.)

Next, divide these 40 bits into 8 five-bit blocks; so, for every 5 letters, there are 8 blocks to encode in Base32. Map each of these blocks to a character in the Base32 alphabet. For instance, if the five-bit block is 00010 (or decimal 2), the mapped character is the letter C. If the five-bit block is 01010 (decimal 10), the mapped character is the letter K.

Let’s apply these steps to the string “yessir”.

Character    ASCII Decimal    8-bit ASCII Binary
y            121              01111001
e            101              01100101
s            115              01110011
s            115              01110011
i            105              01101001
r            114              01110010

Let’s take the binary representations and concatenate them now, splitting them into 5-bit groups:

01111  00101  10010    10111  00110  11100  11011  01001
01110  010(00)  null   null   null   null   null   null

A note on the above: because the specification defines that the encoding must be done in chunks of eight 5-bit pieces, we have to pad with 0 bits if the number of bits isn’t divisible by 5 (hence the 010(00) on the second line) and with = if the number of chunks isn’t divisible by 8. The “null” values will be replaced by the padding character, “=”.

Each of these 5-bit binary numbers maps to a character in the 32-character alphabet; specifically, the output for yessir would be PFSXG43JOI======
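The steps above can be sketched in Python. This is a teaching version that works on a string of “0” and “1” characters to mirror the hand calculation, not an efficient implementation; the names `BASE32_ALPHABET` and `base32_encode` are assumptions for this example. The result is compared against `base64.b32encode` from Python’s standard library.

```python
import base64

BASE32_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"


def base32_encode(data: bytes) -> str:
    # Write every byte as 8 bits and concatenate them.
    bits = "".join(format(byte, "08b") for byte in data)
    # Pad with 0 bits on the right until the length is divisible by 5.
    bits += "0" * (-len(bits) % 5)
    # Map each 5-bit group to a character of the alphabet.
    chars = [BASE32_ALPHABET[int(bits[i:i + 5], 2)] for i in range(0, len(bits), 5)]
    # Pad with "=" until the number of characters is divisible by 8.
    chars.extend("=" * (-len(chars) % 8))
    return "".join(chars)


print(base32_encode(b"yessir"))               # PFSXG43JOI======
print(base64.b32encode(b"yessir").decode())   # PFSXG43JOI======
```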

A similar process is followed for Base64, with a few fundamental differences. The Base64 alphabet includes the letters A-Z and a-z, the digits 0-9, and the symbols + and /. As mentioned previously, the “=” symbol is used for padding. Compared with Base32, the main differences are that letters are case-sensitive, all ten digits are used (instead of the subset 2-7), and the symbols + and / are added.

The Base64 encoding process takes 24-bit blocks (3 letters) and breaks them into four 6-bit chunks, mapping each resulting binary number to the Base64 alphabet. So, let’s take a look at our previous example, the string “yessir”.

8-bit binary: 01111001 01100101 01110011 01110011 01101001 01110010
6-bit chunks: 011110 010110 010101 110011 011100 110110 100101 110010
Base64:       eWVzc2ly

There are a few important things to note. First, Base64 is case-sensitive. Second, because the number of bits (48) was divisible by 6, no bit-padding was necessary. The number of 6-bit chunks was divisible by four as well (which also means that the number of input characters was divisible by 3), so no null (“=”) padding was necessary either.
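The same bit-string approach works for Base64, with 6-bit groups and output padded to a multiple of four characters. As before, this is a sketch for illustration (the names `BASE64_ALPHABET` and `base64_encode` are assumptions), checked against `base64.b64encode` from the standard library.

```python
import base64
import string

BASE64_ALPHABET = (
    string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
)


def base64_encode(data: bytes) -> str:
    # Write every byte as 8 bits and concatenate them.
    bits = "".join(format(byte, "08b") for byte in data)
    # Pad with 0 bits on the right until the length is divisible by 6.
    bits += "0" * (-len(bits) % 6)
    # Map each 6-bit group to a character of the alphabet.
    chars = [BASE64_ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)]
    # Pad with "=" until the number of characters is divisible by 4.
    chars.extend("=" * (-len(chars) % 4))
    return "".join(chars)


print(base64_encode(b"yessir"))  # eWVzc2ly
print(base64_encode(b"ye"))      # eWU= (16 bits need bit-padding and one "=")
```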


A Summary of Base16, Base32, and Base64

These binary-friendly bases are leveraged throughout programming structures. Binary data is encoded in these bases to preserve the fidelity of a transfer and to guard against errors that might arise from accidentally transferring un-encoded binary data. They rely on standards-based tables of characters, and are only guaranteed to work if both the encoder and decoder use the same table; for instance, there are widely accepted modified versions of Base32, including one by Douglas Crockford that changes some of the acceptable characters, among other things leaving out the letter “U” to avoid unintentional obscenity.


Encoding in Practice

In addition to hexadecimal numbers, which are used regularly for CSS colors, Base32 and Base64 are used all over the web. Though the official encoding process for Base32 and Base64 bloats the size of a string, encoding numbers in Base64 or Base32 can be very beneficial for things like URL shortening, where a URL might point to /foo/id. Consider the following decimal numbers and their Base32 and Base64 equivalents.

Decimal         Base32      Base64
20              U           U
50              BS          y
967             6H          PH
745619          WYET        C2CT
7241930         G5AGK       boDK
798312345192    XHPR7LTI    LnfH65o

As you can see, there are significant advantages to using Base64 or Base32 for number shortening. When every character counts, using these base encodings allows you to save characters. In many cases, the encoded number is about half the length of the non-encoded number.
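Note that shortening a number this way treats the Base32 or Base64 alphabet as a set of digits, which is different from running the byte-oriented encoding over a string. A minimal Python sketch of the idea follows; the function name `encode_int` and the alphabet constants are assumptions for this example, using the RFC 4648 character order.

```python
import string

BASE32_ALPHABET = string.ascii_uppercase + "234567"
BASE64_ALPHABET = (
    string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
)


def encode_int(number, alphabet):
    """Write a non-negative integer using the alphabet's characters as digits."""
    base = len(alphabet)
    if number == 0:
        return alphabet[0]
    out = []
    while number > 0:
        number, remainder = divmod(number, base)
        out.append(alphabet[remainder])
    return "".join(reversed(out))


for n in (20, 50, 967, 745619, 7241930, 798312345192):
    print(n, encode_int(n, BASE32_ALPHABET), encode_int(n, BASE64_ALPHABET))
# Last line printed: 798312345192 XHPR7LTI LnfH65o
```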


A Note On Base62 and URL-Modified Base64

If you Base64 encode the number 959, the result is O/. Of course, this isn’t a URL-safe value because of the “/”, so a URL pointing to O/ would not be decoded as O/, but as O (which is the decimal value 14). It would also defeat the purpose to percent-encode the “/” (as %2F, since its ASCII code is decimal 47), as that lengthens the URL significantly. Two main solutions have arisen to address this issue. One is a URL-safe variant of Base64 that replaces + and / with - and _, respectively; in this variant the = padding is commonly omitted as well. The other option is to go to a Base62 encoding, which retains almost all of the benefits of Base64 and removes the + and /. However, Base62 encoding is not as easily applicable as a substitute for binary transmission, and therefore is far less popular.
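Here is a small Python sketch of the problem and the URL-safe fix. The helper `encode_int` and the two alphabet constants are hypothetical names for this example; `urllib.parse.quote` and `base64.urlsafe_b64encode` are part of Python’s standard library. The two sample bytes were picked only because their standard encoding happens to contain both + and /.

```python
import base64
import string
from urllib.parse import quote

LETTERS_AND_DIGITS = string.ascii_uppercase + string.ascii_lowercase + string.digits
STANDARD = LETTERS_AND_DIGITS + "+/"
URL_SAFE = LETTERS_AND_DIGITS + "-_"


def encode_int(number, alphabet):
    """Write a non-negative integer using the alphabet's characters as digits."""
    base = len(alphabet)
    if number == 0:
        return alphabet[0]
    out = []
    while number > 0:
        number, remainder = divmod(number, base)
        out.append(alphabet[remainder])
    return "".join(reversed(out))


standard_id = encode_int(959, STANDARD)
url_safe_id = encode_int(959, URL_SAFE)
print(standard_id)                  # O/
print(quote(standard_id, safe=""))  # O%2F, two characters longer
print(url_safe_id)                  # O_

# The standard library offers the same alphabet swap for byte data.
raw = bytes([0xFB, 0xFF])
print(base64.b64encode(raw))          # b'+/8='
print(base64.urlsafe_b64encode(raw))  # b'-_8='
```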


Conclusion

That wraps it up! Now, you have a fundamental knowledge of base systems, particularly as they apply to the encoding of binary data. What other types of web applications would you find uses for these encodings?

  • tpk

    very, very good article! :)

  • http://www.google.dk Lol

    Cool story bro

  • Ian

    I find this fascinating. Thanks for the extra info.

    It can be fun (for me anyway) to practice subnetting and writing base64/62 encoded messages using a conversion chart, pencil, and paper to get a deeper understanding. It’s fun like when kids used to get secret message decoder rings and pretend to be secret agents.

    Base62 can be used to get the full range of upper and lower case letters in a url shortner app.

  • http://www.codeconquest.com Charles @ CodeConquest.com

    It’s definitely important to understand systems like base 16, because of its use in CSS hex codes. I wrote an article on my blog about how to understand the way hex codes are generated, check it out at:

    http://codeconquest.com/how-to-understand-the-hex-color-codes-used-in-css/

  • http://andrewensley.com/ Andrew Ensley

    This post obliges me to post the following comic (best link I could find for it):

    http://cowbirdsinlove.com/43

    Every base is base 10 :-p

    • Will M

      There are 10 types of people in this world.

      Those that understand binary and those that don’t

  • http://www.elimcmakin.com Eli McMakin

    Does bit.ly use Base64?

  • ruurd

    Foo! Besides base-10 you’re terribly biased towards bases that are a power of two. What about base-3, base-7, base-11 or base-lowercase-26 :-) Oh and really one could do with a bit of zero-based indexing. Numbering starting with 1 is so GW-BASIC… <gdl&rh/>

  • Jeremy Lindblom

    Thank you for explaining these concepts in a clear manner. The article ended up being more interesting than I originally expected as well. Great job.

  • http://phpmaster.com/building-your-own-url-shortener/ Alex Fraundorf

    Excellent article that clearly explains different base numbering systems.
    I actually wrote an article for PHPmaster last month on how to build a URL shortener with a custom numbering base (approximately base-50) which uses 0-9 and upper/lowercase letters minus vowels (to prevent unintended words from being created) and removing letters that are easily confused with each other (like capital i and lowercase L).
    You can check out the article at http://phpmaster.com/building-your-own-url-shortener/ if you’re interested.
    Thank you for the article. I really enjoyed it!
    Keep up the good work.

  • Jason Killian

    I actually wrote a similar article for ActiveTuts a little while back, figured I might as well link to it here: http://active.tutsplus.com/tutorials/coding-concepts/number-systems-an-introduction-to-binary-hexadecimal-and-more/

    The articles do have a different focus and hopefully mine can give you a slightly different perspective on the same topic if you want another look at it.

  • Romit

    Gr8! Thanks for good info! :)

    A little off topic, but, can anyone suggest about best method to encode/encrypt query string variables?
    If it is Mcrypt, which algorithm is suggested – does someone have any working code around?

  • Chris

    This finally made me understand hex. Thanks.

  • Abhijit

    Very good and interesting article. BTW, “We all know this piece of the pizzle” – I think that should have been “puzzle”.

  • http://clippingpathcenter.com shamim

    super tutorials and thanks to share..

  • http://google.com phpman

    did iconcube use this to encode the php code ?

  • Curt Rostenbach

    The Aztec used base 20. I presume because they counted their toes as well.

  • Denis

    In your grid in section Encoding in Practice, you have Base16 and Base32 as headers, but I think you meant Base32 and Base64…

  • dies irae

    i was lost at the first important paragraph: “Starting at 0, we count up to 9, filling the “1’s” column. …”
    sry, this is not well written. i think a little hallway testing would have been a good idea. you’re writing for someone who already understands, not for someone who doesn’t.

  • josponce

    This post is very good. Thanks for explaining this; I use it but never understood it.
    XD

  • Ken

    Thanks for a nice article. One error though:

    “Letters are represented by 8 bit blocks in ASCII (the standard for computers), so for every 5 letters, there are 40 bits. (This 8-bit definition for each letter allows for a total of 255 characters in ASCII.)”

    ASCII is defined as 7-bit, not 8-bit and consist only of 128 characters. The upper 128 are considered extensions (ie. ANSI) and may vary from system to system. Only the bottom 128 (the ASCII codes) will be equal on all systems. https://en.wikipedia.org/wiki/ASCII

  • Stan Marsh

    Nice article,
    Now I know what goes on behind those base convertors,
    Here is one of those ‘base64’ encoder I personally use (to obfuscate my php code):
    http://www.dynamicguru.com/tools/base64_encode_decode.php

  • http://www.marisurak.sk/ Márius Rak

    Hi, is there some posibility of usage as ID in MySQL with all the stuff like auto-increment?

  • http://www.marisurak.sk/ Márius Rak

    Hi, Is there some posibility of use Base32 as ID in MySQL? With all the stuff like auto-increment and unique key. Thanks

  • Christian Sciberras

    I was under the impression that the most popular bases in use were binary(2), octal(8), decimal(10), hexadecimal(16) and tetrasexagesimal(64).

    While you can have any base system in theory, I’ve never personally seen code with base 32.

    Also, you can actually use any kind of numeral system, as long as you have enough characters.

    There’s also no reason why you can’t do this encoding with “unsafe characters”, unless your displaying the result in a plain-text format (such as HTTP header section, URLs, etc). Such a large numerical system makes it easier to store huge patterns of text, but pretty bad at small patterns, I’m pretty sure there’s a mathematical ratio between compression rate and the base system.

  • Chris

    I’m confused about the Base 11 example.

    In base 10 you break it down as follows:

    The number 60,000 is represented using the equation (extra parenthesis added for clarity)

    (10 ^ (power of the column from the right -1)) * (the number found in the column)
    (10 ^ (5 – 1)) * 6
    (10 ^ 4) * 6
    10000 * 6
    60,000

    In base 11 the number 54 would be calculated as follows:

    (11 ^ (2 – 1)) * 5
    (11 ^ 1) * 5
    11 * 5 = 55

    …and…

    (4 ^ (1 – 1)) * 4
    (4 ^ 0) * 4
    1 * 4 = 4

    55 + 4 = 59

    But then the ‘10’ is represented by ‘A’, so how does it fit in with the literal interpretation of the equation?

    It appears you left out a couple steps to get to (1 * A) + (4 * 11).

    Can you break this down for the rest of us?

    Thanks.

  • Ron

    Your article is a great primer, and would probably be easier for persons trying to grasp the concept if you were consistent in the application of your basic formula in the examples that you give, the basic formula being:

    <quote>(Base raised to the power of the column from the right -1) * (the number found in the column)</quote>

    Almost all of your examples are exactly the opposite, e.g.:

    <quote>Now, consider the number 5,378. We can understand this as “8*1 + 7*10 + 3*100 + 5*1000″</quote>

    If it followed the order of operations in your formula, it should be “1*8 + 10*7 + 100*3 + 1000*5″

    The only exception being the Base2 example where for some reason you mix both styles.

  • My, myself and I

    Really good!

  • http://www.facebook.com/jasper.vandergronde Jasper van der Gronde

    In the sub-topic base-32 there is a fault, should correct it for mis-understanding ;-)
    It’s correct that the character “y” represents the ASCI II Decimal 89, but the 8-bit binary is wrong.
    Corrected:
    Character | ASCI II Decimal | 8-bit ASCI II Binary
    y | 89 | 01011001

  • CUP

    A 42-year-old once told me that he was 21 base 21.

  • Eric

    The ascii value for y is 121. This article worked for me.