A learner’s guide to the very important concepts of ‘arrays‘ and ‘associative arrays‘ and the very confusing, overlapping terminology thereof.
In programming, the term ‘array’, in its most general sense, means ‘a sequence of units of data’, but confusingly, a preponderance of terms all fit that same definition, each with its own variation on the theme. This wouldn’t be so bad if programmers and programming languages could decide amongst themselves which connotations belong to which terms, but in truth, there is no definitive usage that keeps them all straight. At best, the various different things called ‘array’ can be classified by a few properties:
- Is the number of elements (a.k.a. the units of data) in the array fixed at the array’s creation time, or can the number of elements grow and/or shrink after creation?
- Are the elements homogeneous (all of the same kind) or heterogeneous (of different kinds)?
- Are the elements contiguous in memory (i.e. do the elements all sit directly adjacent to each other)?
- Do we care about the order? While the elements of an array are always indexed numerically (i.e. each element has a place in line relative to the others), we may simply want to use an array as a collection of things without regard to the order of its elements.
In any case, here are the strongest meanings of each term as best as I can piece together:
- The dominant use of the term ‘array‘ itself comes from the feature called ‘array’ in the C language and languages strongly influenced by C (which includes C++, Java, and C#). In this usage, an array is an ordered, homogeneous, contiguous, fixed number of elements. While not being able to mix different types of elements together in one array and not being able to add additional elements to an array after creating it makes these arrays bothersome to work with, C’s arrays purposefully forsake these features for performance and memory-usage advantages: by being homogeneous, fixed in length, and contiguous, a C array takes up a minimal amount of memory and generally requires less processing work to access and/or modify its elements. (As it happens, it is from the C language that the convention began of indexing the elements of an array starting from 0 rather than 1, and most of today’s languages stick with that convention even though it initially feels unnatural to students of programming.)
- A ‘list’ is an array which is not fixed in length: elements can be added to the end of the list or inserted or removed at any point in the list. Lists are not necessarily homogeneous nor necessarily heterogeneous: in most languages, you can create either kind of list. By allowing growth after creation, a list is generally more expensive performance- and memory-wise than C-style arrays; for instance, consider if you want to add elements to the list but the memory space at the end of the list is being occupied by some other data: to accommodate more elements, the whole list would have to be moved elsewhere where there’s more space, something quite expensive to do. (As I’ll describe in a follow-up post, there are two basic ways to implement lists, called ‘array lists’ and ‘linked lists’, both with their own performance trade-offs.) In languages where maximizing performance and memory conservation is not a primary design goal of the language (this includes Python, Ruby, Perl, and Javascript), lists are used in place of arrays for their flexibility; in C, C++, Java, or C# programming, however, lists are typically only used when really necessary.
- There are a number of interchangeable synonyms for ‘list’, including ‘dynamic array’ and ‘growable array’.
- A ’set’ is a collection of things in which no element can be the same as any other of the collection’s elements. You could simply use an array or list as a set, but if you then want to make sure no elements are ever found more than once in the array or list, you would have to write logic that enforces that rule when an element of the array/list gets added or modified.
- The term ’sequence’ doesn’t have any predominant use, but it is sometimes used as a generic term for an ordered collection. Some languages co-opt the term for some particular context, e.g. Java has a notion of ‘character sequences’ in its libraries, and Python classifies some of its types as ’sequences’.
- The term ’string’ is virtually always used to mean an array of characters, a.k.a. a piece of text data. However, ’string’ is very, very occasionally used in a more generic sense to mean a homogeneous sequence of some type other than characters. (I’ve seen this usage in the context of assembly programming, but not to my recollection in the context of high-level languages.)
An ‘associative array’, though also a kind of data collection, is actually a rather different thing than an ‘array’ or any ‘array’-like thing already discussed. [Hereafter, I will usually use the synonym 'dictionary' for 'associative array', as it avoids confusion with 'array'.] Each element of a dictionary is comprised of two pieces of data, one the ‘key’ and the other its associated ‘value’, together called a ‘key-value pair’. It isn’t necessary for either the keys or values to be homogeneous in type, and it’s perfectly fine for two or more values to be identical, but no two keys can be identical. The idea is that, while the elements of an array are located in the array by numerical index, the elements of a dictionary are located in the dictionary by key: we store a value in the dictionary by associating it with a key, and then we retrieve it from the dictionary by asking for the value associated with that key.
Probably the most commonly used type of dictionary is one with text strings for keys because it’s just very useful to be able to store and retrieve data by some meaningful bit of text, e.g. I could store people’s ages by their names:
- key: “John Lennon” value: 67
- key: “Paul McCartney” value: 65
- key: “Ringo Starr” value: 67
- key: “George Harrison” value: 64
Now to look up George Harrison’s age, I ask the dictionary for the value associated with the string “George Harrison” and get back the integer 64.
Again, any kind of object can be used for a key. While text strings are most commonly used, I could also use integers, e.g. I could store the names of people by their ages:
- key: 67 value: ["John Lennon", "Ringo Starr"]
- key: 65 value: ["Paul McCartney"]
- key: 64 value: ["George Harrison"]
(We account for the possibility of multiple people having the same name, so we store our values as arrays of strings (as indicated by the [ ] syntax) not just individual strings.)
If you’re going to look people’s ages up by their names and look their names up by their ages, then it might actually make sense to have both of these dictionaries even though it means storing the data twice over. If I only had a dictionary of age-by-name, looking up names by age would require creating a new list and then checking every element of the dictionary, adding to the list each name associated with the age I’m looking for. If my age-by-name dictionary is very large, this would make looking up names by age much more expensive performance-wise than if I had a names-by-age dictionary to use (as I’ll explain in a later post, dictionaries are almost always implemented in a manner that makes finding values-by-key very fast).
Now, if you’re going to associate values with integers, why not just use an array? Well with an array, if I have an element at index 78, then I must also have places in memory for indexes 0 to 77 whether I use those indexes or not. In contrast, a dictionary typically only takes up little more memory than is needed to store all its elements (again, as I’ll discuss in a later post).
Understand that, even if a dictionary has integers for keys, it is still considered to be ‘unordered’—there is no first element, no last element, no in-between elements—each element is the same as any other as far as “position” in the dictionary is concerned. In practice, of course, the key-value pairs sit in memory in some order, but if you cared about that order, you would use an array instead. Most implementations of a dictionary provide some means of getting an array of all the dictionary’s keys, thereby allowing a way to iterate over every value in the dictionary, but the order of the keys in this array produced from the dictionary is random.
You might be wondering why keys must be unique. It’s true that allowing multiple keys could be useful, e.g. if I ask for the value associated with key x when there are multiple keys x, I could get back an array of all values associated with x. Such dictionaries don’t exist because:
- If I want to change the value of key x, I would somehow have to specify which key x I meant.
- It’s conceptually simpler to pack together all the values you want to associate with key x into an array and then associate that array with unique key x.
- Unique keys make the implementation simple and efficient.
‘Dictionary’ is just one synonym for ‘associative array’; like with ‘array’, there is a preponderance of synonyms and near-synonyms for ‘associative array’, including:
- dictionary: A straight synonym and the preferred term of Python programmers.
- table: Basically a straight synonym for ‘associative array’, though be careful that ‘table’ is just as often used by programmers to mean a ‘database table’ or a table of information (like a row-by-column chart of figures in a document—not really a programming concept, but a lot of code deals with presenting such tables to users).
- string table: Like ‘table’, but implies that all the keys are strings and possibly that all the values are strings too.
- lookup table: A straight synonym in general use and probably the least ambiguous term you could use other than ‘associative array’ itself.
- map: A straight synonym and the preferred term of users of some languages. In C++, ‘map’ implies an associative array in which the keys are kept sorted (the criteria of how to sort the keys must be supplied by you when you create the map, for the map doesn’t necessarily know how to sort the kinds of objects you supply for keys).
- hash, hashtable, hashmap: Basically all synonyms for ‘associative array’ except the ‘hash’ part refers to a technique used in implementing associative arrays (again, something I’ll discuss in a later post); just be clear that the terms ‘hash’ and ‘hashing’ are not exclusively associated with associative arrays, as hashing is a fundamental technique used in many areas of programming.