The Inuit have 100 words for ‘array’

June 27, 2007 – 5:32 am

A learner’s guide to the very important concepts of ‘arrays‘ and ‘associative arrays‘ and the very confusing, overlapping terminology thereof.

In programming, the term ‘array’, in its most general sense, means ‘a sequence of units of data’, but confusingly, a preponderance of terms all fit that same definition, each with its own variation on the theme. This wouldn’t be so bad if programmers and programming languages could decide amongst themselves which connotations belong to which terms, but in truth, there is no definitive usage that keeps them all straight. At best, the various different things called ‘array’ can be classified by a few properties:

  • Is the number of elements (a.k.a. the units of data) in the array fixed at the array’s creation time, or can the number of elements grow and/or shrink after creation?
  • Are the elements homogeneous (all of the same kind) or heterogeneous (of different kinds)?
  • Are the elements contiguous in memory (i.e. do the elements all sit directly adjacent to each other)?
  • Do we care about the order? While the elements of an array are always indexed numerically (i.e. each element has a place in line relative to the others), we may simply want to use an array as a collection of things without regard to the order of its elements.

In any case, here are the strongest meanings of each term as best as I can piece together:

  • The dominant use of the term ‘array‘ itself comes from the feature called ‘array’ in the C language and languages strongly influenced by C (which includes C++, Java, and C#). In this usage, an array is an ordered, homogeneous, contiguous, fixed number of elements. While not being able to mix different types of elements together in one array and not being able to add additional elements to an array after creating it makes these arrays bothersome to work with, C’s arrays purposefully forsake these features for performance and memory-usage advantages: by being homogeneous, fixed in length, and contiguous, a C array takes up a minimal amount of memory and generally requires less processing work to access and/or modify its elements. (As it happens, it is from the C language that the convention began of indexing the elements of an array starting from 0 rather than 1, and most of today’s languages stick with that convention even though it initially feels unnatural to students of programming.)
  • A ‘list’ is an array which is not fixed in length: elements can be added to the end of the list or inserted or removed at any point in the list. Lists are not necessarily homogeneous nor necessarily heterogeneous: in most languages, you can create either kind of list. By allowing growth after creation, a list is generally more expensive performance- and memory-wise than C-style arrays; for instance, consider if you want to add elements to the list but the memory space at the end of the list is being occupied by some other data: to accommodate more elements, the whole list would have to be moved elsewhere where there’s more space, something quite expensive to do. (As I’ll describe in a follow-up post, there are two basic ways to implement lists, called ‘array lists’ and ‘linked lists’, both with their own performance trade-offs.) In languages where maximizing performance and memory conservation is not a primary design goal of the language (this includes Python, Ruby, Perl, and Javascript), lists are used in place of arrays for their flexibility; in C, C++, Java, or C# programming, however, lists are typically only used when really necessary.
  • There are a number of interchangeable synonyms for ‘list’, including ‘dynamic array’ and ‘growable array’.
  • A ’set’ is a collection of things in which no element can be the same as any other of the collection’s elements. You could simply use an array or list as a set, but if you then want to make sure no elements are ever found more than once in the array or list, you would have to write logic that enforces that rule when an element of the array/list gets added or modified.
  • The term ’sequence’ doesn’t have any predominant use, but it is sometimes used as a generic term for an ordered collection. Some languages co-opt the term for some particular context, e.g. Java has a notion of ‘character sequences’ in its libraries, and Python classifies some of its types as ’sequences’.
  • The term ’string’ is virtually always used to mean an array of characters, a.k.a. a piece of text data. However, ’string’ is very, very occasionally used in a more generic sense to mean a homogeneous sequence of some type other than characters. (I’ve seen this usage in the context of assembly programming, but not to my recollection in the context of high-level languages.)

An ‘associative array’, though also a kind of data collection, is actually a rather different thing than an ‘array’ or any ‘array’-like thing already discussed. [Hereafter, I will usually use the synonym ‘dictionary’ for ‘associative array’, as it avoids confusion with ‘array’.] Each element of a dictionary is comprised of two pieces of data, one the ‘key’ and the other its associated ‘value’, together called a ‘key-value pair’. It isn’t necessary for either the keys or values to be homogeneous in type, and it’s perfectly fine for two or more values to be identical, but no two keys can be identical. The idea is that, while the elements of an array are located in the array by numerical index, the elements of a dictionary are located in the dictionary by key: we store a value in the dictionary by associating it with a key, and then we retrieve it from the dictionary by asking for the value associated with that key.

Probably the most commonly used type of dictionary is one with text strings for keys because it’s just very useful to be able to store and retrieve data by some meaningful bit of text, e.g. I could store people’s ages by their names:

  • key: “John Lennon” value: 67
  • key: “Paul McCartney” value: 65
  • key: “Ringo Starr” value: 67
  • key: “George Harrison” value: 64

Now to look up George Harrison’s age, I ask the dictionary for the value associated with the string “George Harrison” and get back the integer 64.

Again, any kind of object can be used for a key. While text strings are most commonly used, I could also use integers, e.g. I could store the names of people by their ages:

  • key: 67 value: [”John Lennon”, “Ringo Starr”]
  • key: 65 value: [”Paul McCartney”]
  • key: 64 value: [”George Harrison”]

(We account for the possibility of multiple people having the same name, so we store our values as arrays of strings (as indicated by the [ ] syntax) not just individual strings.)

If you’re going to look people’s ages up by their names and look their names up by their ages, then it might actually make sense to have both of these dictionaries even though it means storing the data twice over. If I only had a dictionary of age-by-name, looking up names by age would require creating a new list and then checking every element of the dictionary, adding to the list each name associated with the age I’m looking for. If my age-by-name dictionary is very large, this would make looking up names by age much more expensive performance-wise than if I had a names-by-age dictionary to use (as I’ll explain in a later post, dictionaries are almost always implemented in a manner that makes finding values-by-key very fast).

Now, if you’re going to associate values with integers, why not just use an array? Well with an array, if I have an element at index 78, then I must also have places in memory for indexes 0 to 77 whether I use those indexes or not. In contrast, a dictionary typically only takes up little more memory than is needed to store all its elements (again, as I’ll discuss in a later post).

Understand that, even if a dictionary has integers for keys, it is still considered to be ‘unordered’—there is no first element, no last element, no in-between elements—each element is the same as any other as far as “position” in the dictionary is concerned. In practice, of course, the key-value pairs sit in memory in some order, but if you cared about that order, you would use an array instead. Most implementations of a dictionary provide some means of getting an array of all the dictionary’s keys, thereby allowing a way to iterate over every value in the dictionary, but the order of the keys in this array produced from the dictionary is random.

You might be wondering why keys must be unique. It’s true that allowing multiple keys could be useful, e.g. if I ask for the value associated with key x when there are multiple keys x, I could get back an array of all values associated with x. Such dictionaries don’t exist because:

  1. If I want to change the value of key x, I would somehow have to specify which key x I meant.
  2. It’s conceptually simpler to pack together all the values you want to associate with key x into an array and then associate that array with unique key x.
  3. Unique keys make the implementation simple and efficient.

‘Dictionary’ is just one synonym for ‘associative array’; like with ‘array’, there is a preponderance of synonyms and near-synonyms for ‘associative array’, including:

  • dictionary: A straight synonym and the preferred term of Python programmers.
  • table: Basically a straight synonym for ‘associative array’, though be careful that ‘table’ is just as often used by programmers to mean a ‘database table’ or a table of information (like a row-by-column chart of figures in a document—not really a programming concept, but a lot of code deals with presenting such tables to users).
  • string table: Like ‘table’, but implies that all the keys are strings and possibly that all the values are strings too.
  • lookup table: A straight synonym in general use and probably the least ambiguous term you could use other than ‘associative array’ itself.
  • map: A straight synonym and the preferred term of users of some languages. In C++, ‘map’ implies an associative array in which the keys are kept sorted (the criteria of how to sort the keys must be supplied by you when you create the map, for the map doesn’t necessarily know how to sort the kinds of objects you supply for keys).
  • hash, hashtable, hashmap: Basically all synonyms for ‘associative array’ except the ‘hash’ part refers to a technique used in implementing associative arrays (again, something I’ll discuss in a later post); just be clear that the terms ‘hash’ and ‘hashing’ are not exclusively associated with associative arrays, as hashing is a fundamental technique used in many areas of programming.

“Quality”, i.e. Why software is hard

June 13, 2007 – 1:56 am

Yahoo’s Javascript guru, Douglas Crockford, has another excellent video talk (watchable in-browser or as a download), this time a survey of software engineering titled “Quality”. While general pontifications of this nature are common, Crockford’s strikes a nice balance between breadth and concision and between correctness and novelty (not too dull, not too narrow, etc.), and, in fact, the talk would be quite watchable and interesting for neophyte programmers and perhaps even non-programmers. On the downside, Crockford doesn’t really give prescriptions, but that follows from his main point: we still haven’t really solved or mitigated some hard problems (and perhaps we never will); for now, the only real consensus we have is that you’re better off aware of these issues than not.

The only Javascript speed test that really matters.

June 11, 2007 – 11:46 pm

More bullshit performance claims from Apple?

So Apple, out of nowhere, has released Safari for Windows. Among the reactions today, Joel Spolsky jumped the gun, complaining about how Apple Safari for Windows loads very slowly, only later to retract—apparently, Safari for Windows stopped loading slowly for Joel after the first couple times he launched it. Still, I understand Joel’s lack of faith, given Apple’s history of dubious performance claims (e.g. ‘PowerPc’s are super computers’) and the multimedia shit-fest they inflict upon Windows users just seeking Quicktime playback and control over their iPods.

I decided to try Safari myself, and while I never saw anything but zippy load times, the rendering and Javascript performance didn’t impress me. Surely rendering and Javascript engines claiming a 1.6x speed advantage over those in Firefox should translate into more responsive dynamic reflows or at least some kind of visible benefit, yes? Well, in casual browsing, I couldn’t see any difference. Then I tried the only Javascript speed test that really matters: dragging in Google Maps.

The verdict in purely subjective testing: dragging the map in Safari, compared to in IE, is a noticeable improvement, but is crap compared to Firefox (even with 7 plug-ins running, including Firebug, which has some bug/feature that eats your memory as you drag the map).

So even if Apple truly conducted its benchmarks honestly and with an inquisitive bent towards finding the user-subjective truth, they’re simply wrong: Safari is not The World’s Fastest Browser (TM).

UPDATE:

The experience I reported was when dragging a Google map with a large display area, but the difference between the browsers becomes less pronounced for a smaller area map (though Safari always seems to have annoying brief stutters).

When it comes to adding and removing many markers from a Google map, Safari does far better than Firefox: adding ~60 markers to the view took almost 10 seconds in Firefox but only 2 seconds in Safari. The problem is that, once you have that many markers on your map, it can be dragged with usable smoothness in Firefox but not at all in Safari (you can still click-and-drag, but the map won’t move until you release, at which point it will warp to its destination).

Reverse ad blindness: web usability tip #7523154

June 7, 2007 – 9:50 pm

In a post on user search behaviors, Jeff Atwood links to this report on how quick users are to judge websites. Ironically, upon viewing that report, a full minute passed before I noticed the left column did not consist of Google ads but rather of intra-site links, and in fact, there are no ads on the page at all. I guess the lesson is that, while some sites deliberately blur the visual distinctions between their ads and content, other sites—even those without ads—should be careful not to do the same accidentally.

Nationalschriftattribut

June 5, 2007 – 11:37 pm

I concur with Ken Arnold that stylistic choice in languages should be stamped out at the parser/compiler level. In fact, the two programming languages I have on the drawing board, Pidgin (an educational language) and Animus (a more Lisp-ish Python), both do just that. Animus is strict for the reasons Arnold mentions, but Pidgin is strict mainly for the sake of learning simplicity. In designing Animus, the only legitimate place for style I found was in deciding how to spread a busy expression on to multiple lines. For example, consider a complex expression in prefix notation:

(foo a b (bar c d (moo) e f g) h i j k l)

This may seem pretty reasonable, but only because the names are unrealistically short. A more realistic version would be considerably uglier, so Animus allows you to split such an expression on to multiple lines (without getting into the particulars, Animus is indentation sensitive and has rules about leaving some parentheses implicit):

foo a b
bar c d (moo) e f g
,h i j k l

The problem here is that this is only one way of splitting the expression up, so programmers have many options in choosing which calls and which arguments to emphasize by choosing which should start their own line. I’m not really sure yet if this is a good or bad thing. You might argue that complex expressions should be discouraged in the first place by requiring such things to be refactored into multiple separate expressions using intermediate variables. However, this would make Python-style dictionary and list literals much less useful unless you made them special cases exempt from the normal rules. In any case, I’m open to introducing a rule that discourages abuses, but I think such a thing should wait for real code examples; in the meantime, it’s best to err on the side of flexibility in this area.

One if by land, one-zero if by sea

June 4, 2007 – 5:27 am

How bits represent information and form the basis of computing.

An installment in a series of posts on basic computing concepts for beginning programmers.

As the general public has come into daily contact with computers, people have been disabused of their former notions that computers ‘think’ and ‘know’ things. Sadly, for most of us, the mystifying metaphor of human thought has not been replaced by some better conception of how computers work. While many people have begun to correctly think of computers as merely electrical and mechanical devices, not only do most people remain ignorant of how all the ‘gears’ work, they still can’t fathom how a computer could be made up of anything like gears, whether electronic ones or otherwise. And while we have been told to think of computers as ‘machines that do math’, most of us can’t fathom how math transforms into pictures, audio, video, user interfaces, games, or even just text.

To demystify the greater part of how computers work, you don’t have to learn all too much about computer hardware or electronics, for while these subjects are fantastically and fascinatingly complex in their own right, the role of computer hardware essentially comes down to performing a handful of simple tasks when instructed to do so—copy , add, subtract, and compare data, etc; most of the complications in hardware have to do with getting the hardware to do its simple job faster.

So it is the sequence of instructions—the software—fed to the hardware which explains the better part of the story, as this is what turns computers from calculating automatons into useful and seemingly intelligent devices. To explain software, the best place to start is in how bits represent information, for a piece of software is ultimately just a bunch of instructions and data expressed as bits, and manipulating bits is at heart all hardware does. So yes, sadly, every discussion of programming must begin with the subject of data, a topic as profoundly uninteresting as reading the phone book. Only severe autistic cases get into programming to manage data (no offense, serious autistic cases!) — the rest of us want to get our computers to do something, so do we really have to talk about something basically inert? Yes, quite simply because reading and writing data is exactly how computers do anything.

What is a bit?

As everyone these days knows, a bit is simply a thing that holds one of two states (represented with the symbols ‘0′ and ‘1′) and which can alternate between these two states, so any computer data is a series of bits, e.g. 00101111101011101111111100110100. The actual physical mechanism of ‘holding a bit state’ varies from one computer technology to another—memory chips use either capacitors or transistors; optical discs (CD’s and DVD’s) use microscopic grooves read and written by lasers; floppy disks and hard drives use charges on magnetically sensitive surfaces—but really, a bit is just an abstraction, not any particular tangible thing, so in fact, bits can even be found outside of computers, e.g. a flag on the side of a mailbox can be considered a bit because it holds one of two states, up or down.

The simplicity of bits is what makes them a good, universal, lowest-common-denominator representation of data. In fact, a bit is the smallest unit of information possible: you might think that something which held only one state would be the smallest unit of information, but you would be wrong because such a thing would not convey any distinctions, and without distinctions, you have no semantic content (see here).

Quantities of bits

Before discussing exactly how bits represent complex information, we should clear up some confusion around the terminology for expressing quantities of bits:

A single bit by itself can’t represent much, so we usually concern ourselves with series of multiple bits, and certain quantities of bits have names:

byte = 8 bits
nybble (or nibble) = 4 bits

The term ‘nybble’ is used quite rarely, but ‘byte’ is used perhaps even more frequently than ‘bit’.

(A byte is actually not always 8 bits: properly speaking, the size of a byte for a particular system refers to the size of ‘the smallest addressable unit of memory’ (i.e. the size of the cells into which memory is divided up; it is these cells which can be independently read and modified). The memory of some systems, especially some older ones, is divided into cells of some size other than 8 bits, but 8-bit bytes are found in almost all systems made in the last 30 years, including PC’s. There’s nothing intrinsically special about the quantity 8, except it has the virtue of being not too big and not too small while also being a power of two.)

We use Greek prefixes to indicate bits in certain quantities of powers of ten:

1 kilobit (Kb) = 10^3 bits = 1,000 bits
1 megabit (Mb) = 10^6 bits = 1,000,000 bits
1 gigabit (Gb) = 10^9 bits = 1,000,000,000 bits

…well, not quite. These are the popular (read, lazy), rounded-off definitions. The stricter system used by computer scientists and programmers defines these quantities in powers of two:

1 kilobit (Kb) = 2^10 bits = 1,024 bits
1 megabit (Mb) = 2^20 bits = 1,048,576 bits
1 gigabit (Gb) = 2^30 bits = 1,073,741,824 bits

(Actually, programmers very often use the lazy definitions in informal contexts—just don’t think a computer won’t notice the difference.)

The Greek prefixes can be used to indicate quantities of bytes as well, such as saying ‘one kilobyte’ to mean 1,000 (or 1,024) bytes.

Pay particular attention to abbreviations for whether the ‘b’ is capitalized or not: lowercase ‘b’, means ‘bit’, but uppercase ‘B’ means ‘byte’ e.g. ‘1 Kb’ is 1,024 bits, but ‘1 KB’ is 1,024 bytes. If you’re not paying attention, you could misinterpret a quantity of bits by a factor of eight! (You’ll also see the ‘k’, ‘m’, and ‘g’ in lower case, but this doesn’t have any significance.)

For some obscure reason, when talking about quantities of stored data, the convention is to use bytes, kilobytes, megabytes, and gigabytes, but when talking about data throughput (such as in the context of data transfer rates over a network or between computer components), the convention is to use bits, kilobits, megabits, and gigabits.

Character sets

So how do bits represent information humans care about? Well in the case of text, the relation between a particular string of bits and a text character is arbitrary, e.g. we could decide that the bit string 10111000 should designate the Roman character capital ‘J’, and as long as all of our hardware and software in the correct contexts treated that bit string as if it represented ‘J’, then it doesn’t matter that there’s no logical reason for doing so.

So to represent text as bits, we designate a unique string of bits for every character we wish to use, and this set of designations is called a character set. The most widely used character set in the Western world is called ASCII (American Standard Code for Information Interchange), which contains 128 characters, each mapped to its own 7-bit string, e.g. 1001101 in ASCII represents upper case ‘M’ while 1100010 represents lower case ‘b’.

For decades, virtually all programs written for English-speakers have used ASCII, but because ASCII doesn’t contain characters needed in other cultures, other locales used alternatives, so for many years, a hodge-podge of character sets prevailed world-wide. This began to change in the 90’s with the introduction of a universal character set, called Unicode. Unicode reserves enough space for 1,114,112 different characters, which means each character is designated a 21-bit string. As old software is being replaced by new software, Unicode is gradually being adopted as the replacement for all other character sets, including ASCII.

1,114,112 is more characters than is needed to contain all the characters of every language in the world (including even Chinese, Japanese, and Korean), and in fact, only a few hundred thousand characters are currently designated in Unicode, leaving many 21-bit strings available for future addition of characters. Some of the characters in Unicode are not language characters at all but rather symbols used for other purposes, such as math or musical notation.

Numbers

Some kinds of information, such as text characters, get by using arbitrary assignment of pieces of information to their representations, but other kinds of data are suited for a logical system. Numbers are best represented using a logical system for a variety of reasons, most obvious among them the fact that there is an infinite range of numbers, so it’s just impossible to give each number an arbitrarily selected bit string; using a logical set of rules for representing numbers allows us to encode as bits any number of any size in a consistent and predictable way.

So what is this logical system for representing numbers? Quite simply, a string of bits—11001, for instance—is a number, but in binary form rather than the decimal form with which you’re familiar: binary is nothing but a numbering system (i.e. way of expressing quantity) that works just as well as decimal. Unfortunately, while people use decimal all their lives, it becomes so ingrained that they can’t see how it works and therefore have a hard time imagining any alternative. Though the details of how binary works are really very simple once you understand them, the concept of an alternative number system is famously hard to convey succinctly and successfully to those uninitiated, so it’s something we’ll gloss over here. For the duration, just take it on faith that there is, for instance, a logical reason why 35 is expressed as the bit string 100011.

Once we have a logical correlation between any bit string and a number, it then makes sense to think of arbitrary assignments in terms of numbers rather than bit strings, e.g. if, in a character set, ‘G’ is assigned to the bit string 1000111, then it can also be said to be assigned to the decimal number which corresponds to that bit string, 71, and this is in fact how we normally think of such arbitrary assignments.

The perfect ambiguity of bits

Whatever manner is used to encode our information as bits, whether logical or arbitrary, it’s important to understand that the meaning of any string of bits is not intrinsic to the bits themselves: the meaning of any string of bits ultimately relies upon agreement between the writer of the bits and the reader as to how to interpret the bits. This is really no different from human languages, where the words of a language only have meaning because of (mostly informal) established agreements between a community of speakers of that language. It bears illustration, though, because this is not how most people commonly think of meaning. Consider:

Imagine I write 7 decimal digits on a piece of paper. Is it a phone number, the population of Milwaukee, or the number of angels on the head of a pin? Now imagine a bank that uses 14-digit account numbers. If I write down a series of 28 digits with no spaces or separators, how many phone numbers and how many bank account numbers do I have? Well, I may have known what I meant at the time I wrote those numbers down, but nothing in the data tells anyone—including myself ten minutes from then—what the numbers mean at all.

This same problem exists in computers. Consider a sequence of bits: 001110100111111110100100. The first thing not discernible is how the bits should be grouped: is this meant to be interpreted as three bytes, or six nibbles, or 5 bits followed by 19 bits, or what? Just as bad, the bits themselves indicate nothing of what kind of data they’re supposed to represent, whether numbers or text or otherwise, nor how that kind of data is encoded.

The lesson here is that, for data to be interpreted correctly, the thing doing the interpreting has to assume the length, encoding, and location of the data, and the only way to make these assumptions correctly is to strictly keep track of where data is placed and what it was supposed to mean when you placed it there.

What ensures then that, say, a file of ASCII text is interpreted as ASCII text and not treated as a bunch of numbers or perhaps as text of a different character set? Nothing! In fact, you can do the reverse: take any file and open it in a text editor to see it as ASCII text; if the file wasn’t intended to be human-readable ASCII text, you’ll almost certainly just see a sea of garbage like:

ïT¬?ì?Rïûê” ïBHïT$4ïz,ì?+ïH?;-t¦@¶ ïL$4ïQ,ï ël$?¤î? ï?ï£$+ + à +~@ï°ìV ?+Bn+K$+B?+K,¦ -+?+ LAâ-\;-|+ïåä? ¦? ;-ëL$?¤îH? ì« ? ël$$ïU ïD

…at least, it would be remarkably surprising if you opened a random file not intended to contain text but found that it happened to contain long sections of English, or any other human language, for that matter (well, actually, many files not intended to be read as text have text data embedded within them, so you will often see some strings of human language in such files). So be clear that, while nothing stops you from reading a piece of binary data as representing some kind of data which it wasn’t intended to be, barring remarkable coincidence, doing so just produces garbage.

Pretty pictures

A very large majority of all programming deals only with number and text data, but bits are also used to represent sexier kinds of data, namely images, video, and audio. At this point, how bits could possibly represent such information may still seem mysterious, so we should at least breach the matter by briefly illustrating how to represent images.

Quite simply, a computer image is just like the big scoreboard-grids of lights at sports arenas except that the individual emitters of light are much smaller than light bulbs, producing a much finer image. A computer-screen image is essentially a grid of discrete light-emitting points, called pixels, so to produce a certain image, we have to get each pixel to emit the right color.

(Of course, monitors don’t contain any light bulbs, but the physical process isn’t important to us, and besides, the actual physical process is very different between CRT monitors—Cathode-Ray Tubes, the bulky monitors that are a foot or more deep in dimension but which are now going out of use—and LCD’s—Liquid-Crystal Displays, the monitors less than an inch deep in dimension that are used in all laptops and have nearly taken over all new monitor sales for desktops.)

The solution, like with characters, is to use an arbitrary assignment: imagine that we establish a mapping of numbers to colors, and imagine that knowledge of this mapping is hardwired into our monitor; if I then feed a number to the monitor, it could set a pixel to the corresponding color. But which pixel does it set? Well, the ‘next’ pixel: a monitor is hard-wired to set the colors of its pixels in a certain order, drawing lines pixel-by-pixel, left-to-right starting from the top of the screen, moving down line-by-line, and cycling back up to the top to repeat the process. So data is fed into a monitor sequentially, thereby drawing the image sequentially, pixel-by-pixel; it’s just done so fast you don’t see the process.

The monitor image is updated from the computer at a fixed rate, usually sixty times a second or more, whether or not the image changes at all. Now, it would be horribly wasteful to have the CPU do this job itself of constantly feeding the monitor data just to show a still screen, so this responsibility is offloaded onto a specially purposed device, the video controller. An essential component of the video controller is its dedicated memory, called the framebuffer, wherein the current image to display is stored, and many times a second, the video controller transmits the contents of its framebuffer to the monitor; left to itself, the video controller can handle feeding the current image data in the framebuffer to the screen all by itself without attention from the CPU. In fact, from a programmer’s perspective, the data in the framebuffer is the image on the screen, and therefore, to change the image on the screen, the programmer simply instructs the CPU to modify the data in the framebuffer, causing a different image to be seen the next time the video controller sends the monitor the contents of the framebuffer.

The details of binary number representation and the ASCII and Unicode character sets will be presented in later posts.

It’s just a trick so we let our guard down!

June 2, 2007 – 12:26 am

Here.

What does learning to program involve?

May 28, 2007 – 3:33 am

If you’re not already an initiate to programming, you may not be clear what you’re getting into when you set out to learn to program, so here’s an overview.

Most obviously, you must learn a programming language, which comes down to learning four things:

  1. a syntax: The syntax of a language is the set of rules for how code of that language is written.

  2. a standard library: Programming in all languages is in some way modular, i.e. a piece of code can use other pieces of code. Rather than write all of a program’s code from scratch, programmers use pre-existing code from code libraries. Virtually all modern languages each have their own standard library, a library which is included as a default part of the language. There typically exist many libraries for a language, but only the language’s standard library is universally known and used by all programmers of that language.

  3. code idioms: Given a syntax and a standard library, there then exists in a language some informal set of idioms (reoccurring patterns) of how to code in the language which programmers write over and over again. For various reasons, these idioms can’t be expressed as library code, but while they are not built into the language or officially specified in the language specification, every good programmer uses established idioms wherever possible. An idiom finds favor because it has been proven over time to be the best and/or most obvious way of doing something. (I use the term ‘idiom’ to mean a ’small scale pattern’, something which is expressed in a single place in code. In contrast, the term ‘pattern’ is used to refer to ‘design patterns‘, a specific set of large-scale design strategies used in programming that are generally not particular to any language. Design patterns are also a very good thing to learn, but they aren’t a high priority early in your education.) You won’t hear much talk about idioms among programmers because, in all, there just aren’t that many of them in most languages, and they’re fairly natural to pick up.

  4. a set of tools: To write their programs, programmers use tools (programs):

    1. programmers do most of their work typing code in text-editor programs
    2. code typed by the programmer needs to be translated into a form runnable by the computer; this translation is performed by compiler, interpreter, assembler, and linker programs (which of these are used depends upon the programming language)
    3. debugger programs allow programmers to see exactly what happens when their programs run so they can ferret out bugs
    4. profiler programs help programmers test the performance of their programs and see how long it takes for portions of code to execute so that they can see what portions of code perhaps need to be optimized
    5. in some areas of programming, code generators can sometimes help speed up development by doing some of the coding work for you
    6. version control systems help programmers keep track of the changes they make to their code as they develop it, making it easier to coordinate with other programmers and revert back to earlier versions of their code
    7. programs which integrate some or all of the functionality listed above are called IDE’s (Integrated Developer Environments)

(By the way, don’t worry about tool costs: very high-quality tools are freely available for download for nearly all languages used today; in fact, with the notable exception of Microsoft languages—C# and Visual Basic—the preferred tools are the free and open source ones, while commercial tools are increasingly disfavored and are disappearing.)

So the question you probably have now is, ‘Which language should I learn?’ The correct answer is that you should learn more than one:

  • C: I start with C mainly because: 1) it is the lingua franca of programming; 2) learners can understand what’s really going on in the computer when they run their program written in C; and 3) most features of its syntax show up nearly verbatim in other major languages, such as Java. For purely pedantic reasons, C is a language you really must know, even if you never do any serious programming in it—you’re a better programmer simply knowing what programming in C is like. On the other hand, I strongly discourage you from attempting to write anything but trivial programs in C early in your learning phase, for it takes a good amount of programming experience and knowledge to write even mildly-sophisticated end-user programs (such as programs with graphical user interfaces) in C. Once you attain a decent amount of comfort with basic C, you’re much better off focusing on some other language, such as Java or Python.
  • Java is steadily replacing C as the lingua franca of programming. It is an object-oriented language with a very large standard library, so writing programs is significantly easier to do in Java than in C. Java is the most common choice of a first language to teach students today because Java protects programmers from misusing memory. I think this choice is a mistake, however, because it’s important as a programmer to understand what exactly Java is doing for you, and you can’t really understand that unless you study C. Worse, Java has a rather complex set of advanced syntactical features, and the set of tools and libraries typically used by experienced Java programmers is large and complex. These facets make it hard for learners to get up to speed with real-world code examples. Despite these flaws, Java, unlike C, is a good language to attempt writing your first real programs in.
  • Python is a language with an exceptionally elegant syntax which avoids clutter and nasty ‘corner cases’ (meaning situations where the language has to have complex rules to cover all possible cases). Expressing your ideas in Python is arguably easier than in any other language out there, so writing useful programs in Python is as easy as it gets. On the downside, Python code runs very slow compared to C and Java code, so if you want to be a ‘real programmer’ that can write processing-intensive programs, knowing Python is maybe not enough.

I consider these three languages the most essential to learn, not just because of their populatiry but because of how well they suit their roles. The best order to tackle them in is either C then Java then Python or the other way around: if you’re impatient and want to write something neat fairly early in your education, start with Python; if you’re patient and insist on understand how things really work, start with C.

Here are some other languages to be aware of:

  • C# (pronounced ‘C sharp’) is essentially Microsoft’s imitation of Java. Despite what the name implies, it is only tangentially related to C. While an improvement over Java in some ways, it’s really not different enough from Java to really make that much of a difference which of them you learn. (Many programmers will insist you stay away from Microsoft’s proprietary technologies, though there actually is an open source implementation of C# available called Mono.) I recommend Java because it’s the older, more established language with the broader reach and because better free tools are available for it. However, once you get comfortable with Java, taking a look at C# wouldn’t hurt because a lot of software companies out there are Microsoft shops, and most of them will be using C# by now.
  • C++ is essentially C with object-oriented programming features grafted on. The result is messy but not without its virtues. Because of its performance advantages, C++ is used for end-user applications with high performance requirements, notably games. However, if you’re learning programming to make games, you’re kidding yourself if you start with C++: it’ll be years before you’ll develop the skills to make games complex enough to stress a modern system, so you should focus your learning on an easier to use language. As you develop better program-design skills, they will translate to C++ when the time comes; in the mean time, you should broaden your general programming knowledge.
  • Visual Basic: The basic problem with Visual Basic is that it’s really not that basic at all any more. In its origins, VB was a language that was going to bring programming to the masses. Since then, the features of ‘real’ programming languages, like C++ and Java, have been grafted on to VB in ugly ways, resulting in a language which is just about as complex but not as elegant and therefore not used much except for writing ‘in-house’ software (software which is written for the internal use of a company/organization). The joke goes that a serious programmer doesn’t dare publish programs written in VB to the wider public lest he gets laughed at for using VB. Despite its unfair reputation as a toy language, VB can be used to do real work—there’s just little reason to choose it over Java and C# on the merits. (This judgment is possibly just aesthetic bias: I and most other programmers much prefer the look of Java and C# code to that of VB.)
  • Ruby has been stealing glory from Python lately, largely because of the popularity of ‘Ruby on Rails’, a framework for developing web programming (a framework is a glorified, overgrown library). In the end, the two languages are extremely similar such that learning one will make learning the other quite simple. Ruby arguably has a few feature advantages, but Python takes the edge in elegance and carefully thought out syntax. If you want to do web programming, I recommend the Django framework for Python as a replacement for Rails.
  • Perl is like Python’s deformed cousin. It’s a convoluted mess dearly loved by its partisans but inhospitable to first-time programmers. Currently, Perl is in limbo as the long awaited version 6 has been years late in coming; meanwhile, Python and Ruby have been stealing Perl’s niche, and I predict Perl will never recover, eventually fading into perenial also-ran status.
  • PHP, due to the popularity of the web, is the first language of a surprising number of people these days. This is unfortunate because PHP has little reason for existing except to plug a hole that shouldn’t have existed in the first place. In almost all respets, it’s otherwise an undistinguished amalgam of standard language features taken from other languages and inelegantly smashed together. Still, PHP is a popular choice for writing open source web applications, such as the blogging software that runs this site (WordPress) and the MediaWiki software that runs Wikipedia. I find this unfortunate, but PHP had the right features at the right time, and there’s no arguing with good timing.
  • Javascript, despite the name, is a totally distinct language from Java. A Javascript interpreter is embedded in today’s web browsers, so a webpage can include code that makes the page dynamic, i.e. Google Maps is a webpage with Javascript code that updates the page’s content as the user clicks buttons and drags the map. Javascript is a better language than many give it credit for, but it is only really used in the context of the browser, and quite honestly, Javascript is only used at all because it’s the only language the browser runs—you don’t have a choice of other languages if you want to add programmatic features to a webpage. Javascript would actually make a great first language except that it’s tied to the browser, so as a practical matter, learning Javascript requires learning all sorts of messy, browser-specific stuff that distracts from the essence of programming.

Of course, learning to program involves quite a bit more than just learning a language. Additionally, there’s:

  • terminology / way of speaking

  • best coding practices (especially how to write readable code)

  • design (even if you know a language inside out and write meticulously crafted code, you may still be clueless about writing programs in the large; without learning good design practices, your programs will become unreadable and unmodifiable rats nests as they grow in size)

  • a bit of computer science (especially fundamental data structures and algorithms)

  • optimization strategies

  • hardware concepts

  • operating system concepts

A note about this last bullet: whatever the language, the real difficulty of learning standard libraries (aside from the fact that many of them are very large) lies in the fact that their most important parts are heavily intertwined with concepts not particular to the language, e.g. every standard library includes code for reading and writing files, but files are really an operating system concept, not a language concept. Unfortunately, standard library documentation has the tendency to assume the reader is already familiar with these underlying concepts because it assumes an audience of programmers experienced in other languages; when the documentation does make an attempt at elucidating these concepts, it tends to do a half-assed job; even when the documentation explains these underlying concepts well, this is still a suboptimal way of learning because it confuses the concepts with the particulars of that particular library and language. So you should try to learn OS concepts in their own right, independently of any language; there are a few good books about operating system design, in particular, Modern Operating Systems (2nd edition), by Andrew S. Tanenbaum and (for when you’re comfortable with C) Linux Kernel Development, by Robert Love.

This same advice—’learn concepts, not libraries’—applies to a few other domains you’ll encounter, including networking, databases, and web standards, e.g. before devoting a lot of time to learning how to work with a database using Java’s standard libraries, learn about databases independently of any language.

I should end by mentioning ‘Pidgin’, my in-the-works language intended strictly for educational purposes. (Yes, the Gaim people are using the name ‘Pidgin’, but I came up with the name earlier, besides which, it’s a name most appropriate for a language.) My LearnProgramming.tv project has been on hiatus, partly because I’ve been reconsidering its approach ever since I had the epiphany that what’s really needed is a better language for beginners. There’s nothing to show yet, but implementation shouldn’t take long, as the syntax is extremely simple (it uses prefix notation for expressions) and I’m simply translating into Python. If you’re looking to learn to program, please leave a comment—I’m looking for some guinea pigs.

The United States is doomed

May 27, 2007 – 5:03 pm

Glenn Greenwald

This unbelievably irrational, even stupid, concept has arisen and has now taken root — that to cut off funds for the war means that, one day, our troops are going to be in the middle of a vicious fire-fight and suddenly they will run out of bullets — or run out of gas or armor — because Nancy Pelosi refused to pay for the things they need to protect themselves, and so they are going to find themselves in the middle of the Iraq war with no supplies and no money to pay for what they need. That is just one of those grossly distorting, idiotic myths the media allows to become immovably lodged in our political discourse and which infects our political analysis and prevents any sort of rational examination of our options.

The only way Congress cutting off funds for the war would have a negative effect on troop security is if the President keeps the military in theatre past the time when the previously alloted funds run out, but so lodged is the conventional wisdom, that were Congress to cut off funds and then the President to keep the military in theatre anyway, Congress would get the blame for any consequences.

…[T]he notion that de-funding constitutes a failure to support the troops — in a way that, say, timetables do not — is just inane, not even in the realm of basic rationality or coherence.

And yet exactly this nonsensical notion was permitted not only to take hold, but to become unchallengeable conventional wisdom in our public debate over the war. The whole debate we just had was centrally premised on an idea that is not merely unpersuasive, but factually false, just ridiculous on its face. That a blatant myth could be outcome-determinative in such an important debate is a depressingly commonplace indictment of our dysfunctional media and political institutions.

And how about an indictment of the American public? Democrats are scared to challenge conventional wisdom because that makes them eggheads who think they’re smarter than Joe Six Pack. Were a Democratic politician to undergo a true Bulworth transformation, they wouldn’t just push against the boundaries set by the media, they would have the courage to call voters on their stupidity: ‘Fuck you, voters, for believing that. Fuck you for not paying attention.’ That’s the crass version, of course, but I’m serious: the only way to solve our problems is to elevate the debate, which means shaming the media and the public for keeping the debate mired in trivialities and for not entertaining arguments more than two clauses long.

The naturalistic (language) fallacy

May 26, 2007 – 9:26 pm

Why natural language in programming languages is a fool’s game.

  1. The main purpose of a formal language is that it is free of the ambiguities found in natural human languages.
  2. While naturalistic language may give off signals of familiarity and thereby boost a programming language’s approachability, the gain is more than offset by the increased complexity which naturalism introduces, complexity which learners will very quickly be confronted with. Naturalism acts as a mask for language complexity, not a substitute for it.
  3. I for one am skeptical of the ‘cult of smartness’ found in programming culture, but even I will say that a learner new to programming just isn’t going to cut it if they can’t adapt to the concept of a formal language: if you’re hanging on to those bits of typical programming languages that happen to be English-like (e.g. control words like ‘if-else’), then you just aren’t getting it.
  4. The usual bits of English-like syntax found in our languages are often more misleading than helpful.

This last point can be illustrated with the keyword ‘while’. The keyword ‘while’ obviously was introduced with a sentence in mind like, ‘Do this stuff while this condition remains true.’ Problem is, this sentence, as understood by the typical English speaker who is not already versed in a C-like language, would not convey what ‘while’ actually does; the sentence only sounds good enough as a description of ‘while’ to programmers already familiar with the ‘while’ construct.

A more accurate sentence for describing ‘while’ would be, ‘If this condition is true, do this stuff, otherwise move on; after every time you do the stuff, test the condition again, doing the stuff if true or moving on, ad infinitum.’ Notice the absence of the word ‘while’. Also notice that the sentence which the originators of ‘while’ did have in mind is not what a newbie to the language is going to see—they will just see the word ‘while’.

Not only is ‘while’ not really evocative of its function, it easily misleads: a very common misconception learners get is that a ‘while’ loop will end the moment the condition is made untrue inside the loop rather than when the block completes and the condition tested again; many learners likely get this misconception because it is consistent with the sentence, ‘Do this stuff while this condition remains true.’ Learners would be better served were ‘while’ instead arbitrarily named ‘friedchicken’ (perhaps ‘kfc’ to save on typing).

Now, of course, not all English-like elements are so poorly chosen—’if’ is pretty sensibly named, though I would have gone with ‘otherwise’ in place of ‘else’—but the acceptable cases are few enough that one can hardly say there is much benefit in attempting to conform to an ideal of naturalism. You’ll likely just do more harm than good.