All your Base64 are belong to us...
I've used Base64 plenty of times without really knowing why. One popular use is storing small icons in the text of CSS files, avoiding the need to make multiple requests to the server. JSON web tokens store payloads encoded in Base64. It's also used a lot in cryptography (but please, PLEASE don't confuse Base64 encoding with encryption!). I encountered it recently when revisiting the first cryptopals challenge. Every time I've knowingly Base64 encoded something, it was because whatever I was using required it, but I still couldn't really wrap my head around why I was using that particular scheme.
One question led to another, and off I was chasing this strange white rabbit. In many ways, this is a frivolous exercise. You don't need to know any of this to Base64 encode something. Existing libraries in most of your favorite languages will do that for you without too much fuss. Even so, it's not a bad exercise to take a trip to the lower levels of how things work every now and then. By the end of this post, it should be fairly easy to answer the question: should I Base64 encode this? You'll know the trade offs and the strengths of Base64 vs sending information in pure binary format. You'll know why, if you send a just under 5MB Base64 encoded email attachment, and you have a 5MB hard limit, your message will still be too big. And you'll understand why 64 is the magic number for this sort of thing.
Let's get to it!
Basic encoding and decoding
Say you want to transfer an image, but you know that at least some of the places it's going to have to pass through only speak in regular old text. How do you make sure it gets from point A to point B without getting easily mangled?
Well, at its core, that image is going to be a bunch of 1
s and 0
s (or, in other words, binary information). So, you could just make those 1
s and 0
s into text and send that.
1010111010111010101111101011110101010110
0101010110101101010101101001100110101011
0111010100100110101010001010010010100101
...and so on.
We don't even have to worry about the line endings here. They're meaningless. We'll throw them away on the other side. All we're interested in are those 1
s and 0
s, which we'll turn back into real bits on the other end.
(I have no idea whether the above would be in any section of a valid image... as you may suspect, I just randomly typed 1
s and 0
s.)
Problem solved?
Well, sort of. You could do that. And on the other side, you could take the textual 1
s and 0
s and turn them back into actual bits. However, it's not going to be very efficient.
Each character is going to take at least a byte of bandwidth. That's 8 bits. 8 times the amount of information you're actually sending. So a 1MB image will now take 8MB of bandwidth to send in this way.
Perhaps you remember that binary, as base 2, is pretty easy to convert into hexadecimal format. That's just taking every possible sequence of 4 of those bits (there will be 24 == 16 of those) and assigning a digit from 0-9 or a letter from A-F to it. So you put two of those together and you're able to send a byte of information.
AEBABEBD56
55AD5699AB
7526A8A4A5
Each of those basic ASCII characters you send are going to be a byte. Now you're only having to send information that's twice as big as your image. We're making progress!
But wait a minute. There are way more ASCII characters at our disposal than those 16. Why don't we use more of them to reduce our overhead? In fact, why are we even doing these gymnastics in the first place? Why don't we just use the entire ASCII character set and send our image with no encoding overhead at all?
Easy corruption
If you take a look at the first few entries in the ASCII table, you may see a problem:
0 NUL Null
1 SOH Start of Heading
2 STX Start of Text
3 ETX End of Text
4 EOT End of Transmission
5 ENQ Enquiry
6 ACK Acknowledgement
7 BEL Bell
8 BS Backspace
9 HT Horizontal Tab
10 LF Line Feed
None of those are characters you'd expect to be able to see. Anything along the way that might remove a character (perhaps guarding against malicious input) or interpret it differently would be able to corrupt your image. And certainly, we'd be very surprised if someone was able to copy and paste the text representation and then decode it into the original image. What would it even mean to copy a backspace to the clipboard?
So trying to go for no overhead at all is a recipe for disaster. Even if we weren't worried about humans copying and pasting the data, we'd be crossing our fingers every step of the way that we don't encounter anything that can only handle readable text.
How about only the readable characters?
There are 95 printable characters in ASCII. What if we used all of those? The more characters we have available, the more information we should be able to encode, right? Well, sort of, but here's the problem. Remember that we're ultimately trying to efficiently encode binary information. In addition to making it as efficient as possible to send back and forth, we also want to make it as easy as possible to encode and decode back to binary.
If we used all 95 characters we could, let's think of what would happen when we reached the threshold of a single character:
95th value == 1011111
96th value == 1100000
95 characters allow us to potentially send 7 bits of information, meaning we'd be using an 8 bit character to send the 7 bits we actually want to send, resulting in an approximately 14% overhead. However, there would still be 33 combinations (27 - 95) of 1
s and 0
s in every one of our image's 7-bit chunks that we need to be able to represent. The next possible value we'd need to represent by using a 2nd character would be the 96th value, which would still have 7 bits. That's pretty weird. Now we have to read two characters to determine whether or not the 7 bits represented by the 1st character is the real value? This problem is just going to keep compounding, and it would create huge problems any time we're trying to stream data. If we can't know that what we've got so far is a valid representation of bits until we read the next character, that means we won't really be able to know it's valid until we read the last character.
In fact, any base that isn't a power of 2 is going to have this problem. Let's look at converting something in base 10 to base 2.
10 (base 10) == 1010 (base 2)
1010 (base 10) == 1111110010 (base 2)
10
turns into 1010
in binary. Put 2 10
s in sequence, which we'd read as 1010
in our regular base 10 system, and you don't get 10101010
in binary. You get 1111110010
. There's not even a single 1010
pattern in that.
Looking at a similar situation, but with a base 16 encoding, and you can see the encoding/decoding advantages of a base that's a power of 2. Here 5F5F
5F (base 16) == 01011111 (base 2)
5F5F (base 16) == 0101111101011111 (base 2)
Here 5F5F
, which is 5F
twice, is also the 01011111
sequence twice. With every single character we get, we get a representation of every possible pattern of 1
s and 0
s in a chunk of 4 bits. So once we get that hexadecimal character, those 4 bits are figured out, and we can safely move onto the next. But now we're back to sending 2 times the size of our image.
Lowered expectations
Okay, so we know we can't really make use of those 95 printable characters without creating other unacceptable problems. How many can we make use of? Well, we know we can make use of 16 characters to represent every possible combination of 4 bits.
We'd need 25 == 32 characters to represent every possible combination of 5 bits.
We'd need 26 == 64 characters to represent every possible combination of 6 bits.
And we'd need 27 == 128 characters to represent every possible combination of 7 bits.
We know already that we have a maximum of 95 printable characters to work with, so 7 bits is just too high. We're going to have to settle for 6 bits.
Conclusion
In other words, Base64 is the most compact way to transfer binary information using printable ASCII characters, and that's why we use it and not Base128 or Base95 or any other base when we need to do this sort of thing.
That's also why anything you encode will be about 33% bigger. You're using 8 bits for every 6 bits you have to send. So if you want to avoid the overhead of multiple requests for a bunch of small files, it might be a good trade off. But if you're sending several TB of information, you probably want to use a pure binary format, even if it's a little less flexible.