One if by land, one-zero if by sea

June 4, 2007 – 5:27 am

How bits represent information and form the basis of computing.

An installment in a series of posts on basic computing concepts for beginning programmers.

As the general public has come into daily contact with computers, people have been disabused of their former notions that computers ‘think’ and ‘know’ things. Sadly, for most of us, the mystifying metaphor of human thought has not been replaced by some better conception of how computers work. While many people have begun to correctly think of computers as merely electrical and mechanical devices, not only do most people remain ignorant of how all the ‘gears’ work, they still can’t fathom how a computer could be made up of anything like gears, whether electronic ones or otherwise. And while we have been told to think of computers as ‘machines that do math’, most of us can’t fathom how math transforms into pictures, audio, video, user interfaces, games, or even just text.

To demystify the greater part of how computers work, you don’t have to learn all that much about computer hardware or electronics. While these subjects are fantastically and fascinatingly complex in their own right, the role of computer hardware essentially comes down to performing a handful of simple tasks when instructed to do so (copy, add, subtract, and compare data, etc.); most of the complications in hardware have to do with getting the hardware to do its simple job faster.

So it is the sequence of instructions—the software—fed to the hardware which explains the better part of the story, as this is what turns computers from calculating automatons into useful and seemingly intelligent devices. To explain software, the best place to start is in how bits represent information, for a piece of software is ultimately just a bunch of instructions and data expressed as bits, and manipulating bits is at heart all hardware does. So yes, sadly, every discussion of programming must begin with the subject of data, a topic as profoundly uninteresting as reading the phone book. Only severe autistic cases get into programming to manage data (no offense, serious autistic cases!) — the rest of us want to get our computers to do something, so do we really have to talk about something basically inert? Yes, quite simply because reading and writing data is exactly how computers do anything.

What is a bit?

As everyone these days knows, a bit is simply a thing that holds one of two states (represented with the symbols ‘0’ and ‘1’) and which can alternate between these two states, so any computer data is a series of bits, e.g. 00101111101011101111111100110100. The actual physical mechanism of ‘holding a bit state’ varies from one computer technology to another: memory chips use either capacitors or transistors; optical discs (CDs and DVDs) use microscopic pits or marks read and written by lasers; floppy disks and hard drives use magnetized regions on magnetically sensitive surfaces. But really, a bit is just an abstraction, not any particular tangible thing, so in fact, bits can even be found outside of computers, e.g. a flag on the side of a mailbox can be considered a bit because it holds one of two states, up or down.

The simplicity of bits is what makes them a good, universal, lowest-common-denominator representation of data. In fact, a bit is the smallest unit of information possible: you might think that something which held only one state would be the smallest unit of information, but you would be wrong because such a thing would not convey any distinctions, and without distinctions, you have no semantic content.

Quantities of bits

Before discussing exactly how bits represent complex information, we should clear up some confusion around the terminology for expressing quantities of bits.

A single bit by itself can’t represent much, so we usually concern ourselves with series of multiple bits, and certain quantities of bits have names:

byte = 8 bits
nybble (or nibble) = 4 bits

The term ‘nybble’ is used quite rarely, but ‘byte’ is used perhaps even more frequently than ‘bit’.

(A byte is actually not always 8 bits: properly speaking, the size of a byte for a particular system is the size of ‘the smallest addressable unit of memory’, i.e. the size of the cells into which memory is divided up; it is these cells which can be independently read and modified. The memory of some systems, especially some older ones, is divided into cells of some size other than 8 bits, but 8-bit bytes are found in almost all systems made in the last 30 years, including PCs. There’s nothing intrinsically special about the quantity 8, except that it has the virtue of being not too big and not too small while also being a power of two.)

We use Greek prefixes to indicate bits in certain quantities of powers of ten:

1 kilobit (Kb) = 10^3 bits = 1,000 bits
1 megabit (Mb) = 10^6 bits = 1,000,000 bits
1 gigabit (Gb) = 10^9 bits = 1,000,000,000 bits

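…well, not quite. These powers-of-ten definitions are the official meanings of the prefixes, and they are the ones generally used for network speeds and by disk manufacturers. By long-standing convention, though, programmers and computer scientists, especially when talking about memory, often mean the nearest power of two instead:

1 kilobit (Kb) = 2^10 bits = 1,024 bits
1 megabit (Mb) = 2^20 bits = 1,048,576 bits
1 gigabit (Gb) = 2^30 bits = 1,073,741,824 bits

(To remove the ambiguity, the IEC standard gives the power-of-two quantities their own prefixes, kibi (Ki), mebi (Mi), and gibi (Gi), but these names are still not universally used. In informal contexts, programmers very often use whichever definition is convenient; just don’t think a computer won’t notice the difference.)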

The Greek prefixes can be used to indicate quantities of bytes as well, such as saying ‘one kilobyte’ to mean 1,000 (or 1,024) bytes.

Pay particular attention in abbreviations to whether the ‘b’ is capitalized or not: lowercase ‘b’ means ‘bit’, but uppercase ‘B’ means ‘byte’, e.g. ‘1 Kb’ is 1,024 bits, but ‘1 KB’ is 1,024 bytes. If you’re not paying attention, you could misinterpret a quantity by a factor of eight! (You’ll also see the ‘k’, ‘m’, and ‘g’ in lower case, but this doesn’t have any significance.)
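To make the factor of eight concrete, here is a minimal sketch in Python. The constant and function names are made up for illustration, and it assumes the usual 8-bit byte:

```python
BITS_PER_BYTE = 8        # assumption: the usual 8-bit byte

KILO_DECIMAL = 10**3     # power-of-ten definition: 1,000
KILO_BINARY = 2**10      # power-of-two definition: 1,024

def kilobits_to_bits(kb, kilo=KILO_BINARY):
    """Convert a quantity in kilobits (Kb) to bits."""
    return kb * kilo

def kilobytes_to_bits(kB, kilo=KILO_BINARY):
    """Convert a quantity in kilobytes (KB) to bits."""
    return kB * kilo * BITS_PER_BYTE

print(kilobits_to_bits(1))    # 1024
print(kilobytes_to_bits(1))   # 8192, eight times as many
print(2**30 - 10**9)          # 73741824, the gap between the two 'giga' definitions
```

Passing kilo=KILO_DECIMAL gives the power-of-ten answers instead; the gap between the two definitions grows with each prefix.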

For some obscure reason, when talking about quantities of stored data, the convention is to use bytes, kilobytes, megabytes, and gigabytes, but when talking about data throughput (such as in the context of data transfer rates over a network or between computer components), the convention is to use bits, kilobits, megabits, and gigabits.
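Because storage is quoted in bytes and throughput in bits, estimating a transfer time means converting one to the other. A rough sketch, assuming 8-bit bytes and an idealized link with no overhead (the function name is hypothetical):

```python
BITS_PER_BYTE = 8  # assumption: 8-bit bytes

def transfer_seconds(file_size_bytes, link_bits_per_second):
    """Ideal time to move a file over a link, ignoring all protocol overhead."""
    return file_size_bytes * BITS_PER_BYTE / link_bits_per_second

# A 5,000,000-byte file over a link rated at 8,000,000 bits per second:
print(transfer_seconds(5_000_000, 8_000_000))  # 5.0
```

Forgetting the conversion would suggest the file arrives in well under a second, which is the factor-of-eight mistake in practice.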

Character sets

So how do bits represent information humans care about? Well, in the case of text, the relation between a particular string of bits and a text character is arbitrary, e.g. we could decide that the bit string 10111000 should designate the Roman character capital ‘J’, and as long as all of our hardware and software in the correct contexts treat that bit string as if it represents ‘J’, it doesn’t matter that there’s no logical reason for doing so.

So to represent text as bits, we designate a unique string of bits for every character we wish to use, and this set of designations is called a character set. The most widely used character set in the Western world is called ASCII (American Standard Code for Information Interchange), which contains 128 characters, each mapped to its own 7-bit string, e.g. 1001101 in ASCII represents upper case ‘M’ while 1100010 represents lower case ‘b’.
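You can check the two ASCII examples yourself. The sketch below uses Python’s built-in ord and chr, which convert between a character and its assigned number; the helper names ascii_bits and ascii_char are made up for illustration:

```python
def ascii_bits(ch):
    """Return the 7-bit ASCII bit string assigned to a single character."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    return format(code, "07b")   # pad with leading zeros to 7 bits

def ascii_char(bits):
    """Return the character assigned to a 7-bit ASCII bit string."""
    return chr(int(bits, 2))

print(ascii_bits("M"))         # 1001101
print(ascii_bits("b"))         # 1100010
print(ascii_char("1001101"))   # M
```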

For decades, virtually all programs written for English-speakers have used ASCII, but because ASCII doesn’t contain characters needed in other cultures, other locales used alternatives, so for many years, a hodge-podge of character sets prevailed world-wide. This began to change in the 1990s with the introduction of a universal character set, called Unicode. Unicode reserves enough space for 1,114,112 different characters, which means each character can be designated by a 21-bit string. As old software is being replaced by new software, Unicode is gradually being adopted as the replacement for all other character sets, including ASCII.

1,114,112 is more characters than is needed to contain all the characters of every language in the world (including even Chinese, Japanese, and Korean), and in fact, only a fraction of that space (fewer than two hundred thousand characters) is currently designated in Unicode, leaving many 21-bit strings available for future addition of characters. Some of the characters in Unicode are not language characters at all but rather symbols used for other purposes, such as math or musical notation.
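The figures above can be checked from Python, whose strings are sequences of Unicode characters. In this sketch, sys.maxunicode is the largest number Unicode can assign to a character, and code_point_bits is a made-up helper name:

```python
import sys
import unicodedata

# Size of the Unicode space, and how many bits are needed to cover it.
print(sys.maxunicode + 1)             # 1114112
print(sys.maxunicode.bit_length())    # 21

def code_point_bits(ch):
    """Return the number assigned to a character as a 21-bit string."""
    return format(ord(ch), "021b")

# A Roman letter, a Chinese character, and a musical symbol.
for ch in "J中♪":
    print(ch, ord(ch), code_point_bits(ch), unicodedata.name(ch))
```

Note that this shows the number assigned to each character, not how a file stores it; in practice Unicode text is usually written out using an encoding such as UTF-8 rather than as raw 21-bit strings.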

Numbers

Some kinds of information, such as text characters, get by using arbitrary assignment of pieces of information to their representations, but other kinds of data are suited for a logical system. Numbers are best represented using a logical system for a variety of reasons, most obvious among them the fact that there is an infinite range of numbers, so it’s just impossible to give each number an arbitrarily selected bit string; using a logical set of rules for representing numbers allows us to encode as bits any number of any size in a consistent and predictable way.

So what is this logical system for representing numbers? Quite simply, a string of bits (11001, for instance) is a number, but in binary form rather than the decimal form with which you’re familiar: binary is nothing but a numbering system (i.e. a way of expressing quantity) that works just as well as decimal. Unfortunately, because people use decimal all their lives, it becomes so ingrained that they can’t see how it works and therefore have a hard time imagining any alternative. Though the details of how binary works are really very simple once you understand them, the concept of an alternative number system is famously hard to convey succinctly and successfully to the uninitiated, so it’s something we’ll gloss over here. For the duration, just take it on faith that there is, for instance, a logical reason why 35 is expressed as the bit string 100011.
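Until then, you can let Python do the conversion for you. A small sketch using only built-ins (the two function names are made up):

```python
def bits_to_number(bits):
    """Interpret a bit string as a non-negative whole number written in binary."""
    return int(bits, 2)

def number_to_bits(n):
    """Write a non-negative whole number as a bit string."""
    return format(n, "b")

print(bits_to_number("100011"))   # 35
print(bits_to_number("11001"))    # 25
print(number_to_bits(35))         # 100011
```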

Once we have a logical correlation between any bit string and a number, it then makes sense to think of arbitrary assignments in terms of numbers rather than bit strings, e.g. if, in a character set, ‘G’ is assigned to the bit string 1000111, then it can also be said to be assigned to the decimal number which corresponds to that bit string, 71, and this is in fact how we normally think of such arbitrary assignments.
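Seen this way, a character set is just a table from numbers to characters. Here is a toy illustration; TOY_CHARSET and decode are hypothetical names, and the three entries happen to match ASCII:

```python
# A toy character set, written as a table from numbers to characters.
TOY_CHARSET = {71: "G", 77: "M", 98: "b"}

def decode(numbers, charset=TOY_CHARSET):
    """Look up each number in the character set and join the results."""
    return "".join(charset[n] for n in numbers)

print(int("1000111", 2))      # 71
print(decode([71, 77, 98]))   # GMb
print(chr(71), ord("G"))      # G 71
```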

The perfect ambiguity of bits

Whatever manner is used to encode our information as bits, whether logical or arbitrary, it’s important to understand that the meaning of any string of bits is not intrinsic to the bits themselves: the meaning of any string of bits ultimately relies upon agreement between the writer of the bits and the reader as to how to interpret the bits. This is really no different from human languages, where the words of a language only have meaning because of (mostly informal) established agreements between a community of speakers of that language. It bears illustration, though, because this is not how most people commonly think of meaning. Consider:

Imagine I write 7 decimal digits on a piece of paper. Is it a phone number, the population of Milwaukee, or the number of angels on the head of a pin? Now imagine a bank that uses 14-digit account numbers. If I write down a series of 28 digits with no spaces or separators, how many phone numbers and how many bank account numbers do I have? Well, I may have known what I meant at the time I wrote those numbers down, but nothing in the data tells anyone—including myself ten minutes from then—what the numbers mean at all.

This same problem exists in computers. Consider a sequence of bits: 001110100111111110100100. The first thing not discernible is how the bits should be grouped: is this meant to be interpreted as three bytes, or six nibbles, or 5 bits followed by 19 bits, or what? Just as bad, the bits themselves indicate nothing of what kind of data they’re supposed to represent, whether numbers or text or otherwise, nor how that kind of data is encoded.

The lesson here is that, for data to be interpreted correctly, the thing doing the interpreting has to assume the length, encoding, and location of the data, and the only way to make these assumptions correctly is to strictly keep track of where data is placed and what it was supposed to mean when you placed it there.
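To see how much rides on those assumptions, here is a sketch that reads the same 24 bits under each of the groupings mentioned above, treating every group as a binary number. The helper split_bits is a made-up name:

```python
BITS = "001110100111111110100100"

def split_bits(bits, widths):
    """Cut a bit string into consecutive fields and read each as a number."""
    if sum(widths) != len(bits):
        raise ValueError("field widths must cover the bit string exactly")
    fields, pos = [], 0
    for w in widths:
        fields.append(int(bits[pos:pos + w], 2))
        pos += w
    return fields

print(split_bits(BITS, [8, 8, 8]))   # [58, 127, 164]
print(split_bits(BITS, [4] * 6))     # [3, 10, 7, 15, 10, 4]
print(split_bits(BITS, [5, 19]))     # [7, 163748]
print(split_bits(BITS, [24]))        # [3833764]
```

Four different answers from the same bits, and nothing in the bits says which one (if any) the writer intended.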

What ensures then that, say, a file of ASCII text is interpreted as ASCII text and not treated as a bunch of numbers or perhaps as text of a different character set? Nothing! In fact, you can do the reverse: take any file and open it in a text editor to see it as ASCII text; if the file wasn’t intended to be human-readable ASCII text, you’ll almost certainly just see a sea of garbage like:

ïT¬?ì?Rïûê” ïBHïT$4ïz,ì?+ïH?;-t¦@¶ ïL$4ïQ,ï ël$?¤î? ï?ï£$+ + à +~@ï°ìV ?+Bn+K$+B?+K,¦ -+?+ LAâ-\;-|+ïåä? ¦? ;-ëL$?¤îH? ì« ? ël$$ïU ïD

…at least, it would be remarkably surprising if you opened a random file not intended to contain text but found that it happened to contain long sections of English, or any other human language, for that matter (well, actually, many files not intended to be read as text have text data embedded within them, so you will often see some strings of human language in such files). So be clear that, while nothing stops you from reading a piece of binary data as representing some kind of data which it wasn’t intended to be, barring remarkable coincidence, doing so just produces garbage.
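You can produce this kind of garbage on purpose. The sketch below packs four numbers into raw bytes with Python’s struct module and then reads those bytes as if they were text. It assumes the Latin-1 character set as a stand-in for what a text editor might do, since strict ASCII assigns nothing to byte values above 127:

```python
import struct

# Four numbers stored as binary data (little-endian 64-bit floats), not as text.
data = struct.pack("<4d", 3.14159, 2.71828, 1.41421, 1.61803)

# Latin-1 assigns some character to every byte value, so decoding never fails;
# it just yields nonsense.
as_text = data.decode("latin-1")
print(repr(as_text))

# Reading the same bytes with the intended interpretation recovers the numbers.
print(struct.unpack("<4d", data))
```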

Pretty pictures

A very large majority of all programming deals only with number and text data, but bits are also used to represent sexier kinds of data, namely images, video, and audio. At this point, how bits could possibly represent such information may still seem mysterious, so we should at least broach the matter by briefly illustrating how to represent images.

Quite simply, a computer image is just like the big scoreboard-grids of lights at sports arenas except that the individual emitters of light are much smaller than light bulbs, producing a much finer image. A computer-screen image is essentially a grid of discrete light-emitting points, called pixels, so to produce a certain image, we have to get each pixel to emit the right color.

(Of course, monitors don’t contain any light bulbs, but the physical process isn’t important to us, and besides, the actual physical process is very different between CRT monitors—Cathode-Ray Tubes, the bulky monitors that are a foot or more deep in dimension but which are now going out of use—and LCD’s—Liquid-Crystal Displays, the monitors less than an inch deep in dimension that are used in all laptops and have nearly taken over all new monitor sales for desktops.)

The solution, like with characters, is to use an arbitrary assignment: imagine that we establish a mapping of numbers to colors, and imagine that knowledge of this mapping is hardwired into our monitor; if I then feed a number to the monitor, it could set a pixel to the corresponding color. But which pixel does it set? Well, the ‘next’ pixel: a monitor is hard-wired to set the colors of its pixels in a certain order, drawing lines pixel-by-pixel, left-to-right starting from the top of the screen, moving down line-by-line, and cycling back up to the top to repeat the process. So data is fed into a monitor sequentially, thereby drawing the image sequentially, pixel-by-pixel; it’s just done so fast you don’t see the process.
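As a toy model of this process, the sketch below takes a stream of numbers and places the corresponding colors on a tiny grid in the order just described. The palette, the screen size, and the function name are all invented for illustration; a real monitor’s color encoding is not this simple:

```python
WIDTH, HEIGHT = 4, 3   # a tiny hypothetical screen

# Hypothetical palette: an arbitrary mapping of numbers to colors.
PALETTE = {0: "black", 1: "white", 2: "red", 3: "blue"}

def draw(stream):
    """Place colors left-to-right, top-to-bottom, wrapping back to the top."""
    screen = [[None] * WIDTH for _ in range(HEIGHT)]
    for i, value in enumerate(stream):
        # The 'next' pixel: position in the stream determines row and column.
        row, col = divmod(i % (WIDTH * HEIGHT), WIDTH)
        screen[row][col] = PALETTE[value]
    return screen

stream = [0, 0, 1, 1,
          2, 2, 3, 3,
          1, 0, 1, 0]
for line in draw(stream):
    print(line)
```

Note that the stream carries no coordinates at all; where a color lands is determined entirely by its position in the sequence.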

The monitor image is updated from the computer at a fixed rate, usually sixty times a second or more, whether or not the image changes at all. Now, it would be horribly wasteful to have the CPU do this job itself of constantly feeding the monitor data just to show a still screen, so this responsibility is offloaded onto a specially purposed device, the video controller. An essential component of the video controller is its dedicated memory, called the framebuffer, wherein the current image to display is stored, and many times a second, the video controller transmits the contents of its framebuffer to the monitor; left to itself, the video controller can handle feeding the current image data in the framebuffer to the screen all by itself without attention from the CPU. In fact, from a programmer’s perspective, the data in the framebuffer is the image on the screen, and therefore, to change the image on the screen, the programmer simply instructs the CPU to modify the data in the framebuffer, causing a different image to be seen the next time the video controller sends the monitor the contents of the framebuffer.
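The division of labor can be sketched in a few lines. Here a bytearray plays the framebuffer, set_pixel is what a program does, and refresh stands in for the video controller; all the names and the one-byte-per-pixel layout are assumptions made for this toy model:

```python
WIDTH, HEIGHT = 4, 2

# The framebuffer: one byte (a color number) per pixel, stored row by row.
framebuffer = bytearray(WIDTH * HEIGHT)

def set_pixel(x, y, color):
    """What a program does: write a color number into the framebuffer."""
    framebuffer[y * WIDTH + x] = color

def refresh():
    """Stand-in for the video controller: send the whole buffer, row by row."""
    return [list(framebuffer[r * WIDTH:(r + 1) * WIDTH]) for r in range(HEIGHT)]

print(refresh())     # [[0, 0, 0, 0], [0, 0, 0, 0]]
set_pixel(2, 1, 7)   # the program changes one value in memory...
print(refresh())     # [[0, 0, 0, 0], [0, 0, 7, 0]]
```

The program never talks to the screen directly; it only changes memory, and the change shows up on the next refresh.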

The details of binary number representation and the ASCII and Unicode character sets will be presented in later posts.
