A learner’s guide to the terminology and concepts of software build processes.
What’s the difference between an assembler, a compiler, and an interpreter, and what’s a linker?
Let’s start with the clearest case. An assembler is a program which translates ‘assembly language’ code into processor instructions (a.k.a. ‘machine instructions’/'machine code’, a.k.a. ‘native instructions’/'native code’). What’s assembly language? ‘Assembly’, ‘assembler’, or ‘asm’ for short, is the generic name given to all low-level languages. Now what’s a low-level language? Well, whereas in high-level languages, each line of source code typically translates into more than one processor instruction, in an assembly language, each line directly corresponds to one single processor instruction. Assembly offers the programmer exact control: what you write is exactly what gets executed, instruction-by-instruction.
Because different processors understand different sets of instructions, the assembler language you use must be particular to the processor platform you intend to run your program on. For instance, if you are targeting a processor that uses the x86 instruction set (which includes Intel and AMD processors), then you would use an x86 assembler.
So why write assembly? On the downside, writing your code one processor instruction at a time is far more tedious than writing the functionally equivalent code in a high-level language. Moreover, assembly language can’t protect you from even the most basic errors and allows you to do dangerous things like trying to read memory that doesn’t belong to your program (something which the OS and the processor conspire to stop your program from doing by halting your program when it tries to do such things). So not only is programming assembly like using tweezers to move a hill of sand, the tweezers are slippery and sharp. Producing complex, reasonably bug-free programs entirely in assembly is very hard and generally just hasn’t been done since the late-80′s.
On the upside, the exact control provided by assembly allows for optimizations simply not possible in high-level languages. While compilers and interpreters have gotten quite smart, they very, very rarely, if ever, produce the fastest possible code, leaving room for a human to do better. Again, writing a program entirely in assembly is simply too impractical given the size of most modern programs; however, if a key portion of your code is a bottleneck, it might be beneficial to rewrite that piece of code in assembly and then invoke it from your high-level language code.
Assembly retains one other important role. Some important processor instructions will never be generated by the output of a high-level language, so it is left to assembly code to allow access to those instructions. For instance, on most processors, system calls can only be invoked using a particular instruction, but there’s nothing you can write in C code which will make the C compiler spit out that instruction—it’s simply something (consciously) missing from the semantics of the language; therefore, to make a system call in C, a piece of assembly code that uses the system call instruction is written in a way that the code, when assembled, can be invoked from your C code. For instance, when you open a file in C with the C standard library’s ‘fopen’ function, depending upon your implementation of C, that function either calls a function written in assembly or is itself written in assembly, and that assembly function contains the instruction to invoke the system call that opens a file. (A ‘system call’ is a function provided by the operating system that can’t be invoked like a normal function because it exists in the operating system’s protected memory space; the OS and processor conspire to protect this memory space from direct access by ordinary programs because otherwise it would be possible for ordinary programs to bring down the whole system out of incompetence or do malevolent things like read files they aren’t supposed to be able to access. So, processors typically provide a system-call-invoking instruction which allows ordinary programs to invoke code at OS-defined specific addresses in the OS’s protected memory space. By allowing the execution of ordinary programs to enter this memory area only at specific points, the OS can prevent any funny business.)
Assemblers used to be a much bigger deal back in the DOS days when most programmers worked in assembly, but those days are gone. Today, assembly work is rarely done except by developers of operating systems and device drivers, and whereas there used to be many assemblers for Intel-compatible processors, today there are only a few real options (on the upside, they are all now free downloads):
- MASM (Microsoft Macro Assembler)
- GAS (GNU Assembler)
- FASM (Flat Assembler)
- NASM (Netwide Assembler)
Aside from these options, some C compilers feature mechanisms to embed assembly code amongst the C code. For instance, the C compiler in the GCC (GNU Compiler Collection) allows you to embed GAS assembly code using a special directive. (Understand, this and similar mechanisms in other C and C++ compilers are not official parts of either the C or C++ languages.)
Now, whereas high-level languages, such as Java, C, or C++, are typically highly standardized, the assembler languages for a particular processor may diverge significantly in syntax, e.g. while most assemblers on the x86 platform tend to follow the syntax established by Intel in its processor manuals (with the notable exception of GAS), they still have many sizable differences.
A high-level assembler is an assembler with some high-level-language-like conveniences thrown in. MASM arguably fits into this category, but the best example is certainly HLA (High Level Assembly), an assembler language originally conceived as a teaching tool.
A compiler is a program which translates high-level language code—called the source—into some other form (usually processor instructions)—called the target. Whereas assemblers do basically a verbatim, one-to-one translation—like a translation from English to Pig-Latin—compilers typically have a considerably more sophisticated task—more like a translation from English to Latin. So whereas the whole point of assembly generally is that the programmer controls the exact sequence of instructions, compilers only guarantee that the code they spit out is functionally equivalent to the semantics expressed in the source. Moreover, compilers generally attempt to optimize the code they produce, making the end result correspond even less directly to the source.
Just as assemblers are particular to the precise assembly syntax they can translate, compilers are specific to the high-level language(s) they can translate, i.e. a compiler for the C language can translate C code but not Pascal code. Also like assemblers, compilers are particular to the processor platform(s) which they can target (except some compilers don’t spit out processor instructions at all but rather some kind of ‘intermediate code’, as I’ll discuss later).
Consider the case of the C language. Like with assembly, there used to be a wide variety of C compilers used back in the 80′s and 90′s, but today the market has sorted out, and there are only a few notable C compilers. The two most important are:
- GCC (GNU Compiler Collection): Originally called the GNU C Compiler, GCC now supports many languages other than C and C++. GCC can target dozens of processor platforms, including all the most popular ones.
- Microsoft Visual C++: Despite the name, Visual C++ supports C as well as C++. Visual C++ only targets the Intel-compatible platforms: x86, x64, and Itanium. (Technically, ‘Visual C++’ is actually the name of Microsoft’s IDE (Integrated Developer Environment), but there isn’t a more commonly used name for Microsoft’s C or C++ compilers.)
The source code of all but the smallest programs is written spread across multiple files, and in most languages, these files are treated as separate ‘compilation units’, i.e. they are compiled independently of each other. When a compiler produces processor instructions, the resulting code is called ‘object code’, and the resulting files are called ‘object files’. While some operating systems, including Unix systems, will allow an object file to be run as a program (i.e. it will happily load the file and begin execution of its instructions), this is of limited use because, to make a complete program, the object files need to be ‘linked’ together:
In a program, the code in one source file makes a reference to code in other files and/or is referenced by code in other files: a program is a web of source files which make external references to each other, and so the source files depend upon each other. (If a source file does not reference other files and itself does not get referenced by other files, then it can’t have any effect on or be affected by the rest of the code, so it can’t be said to be a part of the same program.) Still, each source file is compiled separately, meaning that, when processing one source file, the compiler has no knowledge of the files referenced by the source code; consequently, when the compiler encounters an external reference in the source code, all it can do is leave a ‘stub’ in the object code allowing the connection to be patched later. Patching together the external reference stubs of one object file to another is precisely the job of a linker. It is the linker that takes many object files and produces from them an executable file (e.g. an .exe file on Windows).
Whereas assemblers and compilers translate code into other forms of code, an interpreter is a program that translates code into action, i.e. an interpreter reads code and does what it says, right then and there. If you intend your program to be run via an interpreter, then every user must have both your program and the interpreter to run it, and your program is then started by starting the interpreter and telling it to run your program. (This may sound unfriendly to naive users, but the installation and starting of the interpreter can be disguised from users such that they install and run your program like any other.)
Because interpretation happens every time you run the program as you run it, interpretation introduces a significant performance overhead. This cost can be mitigated using what I call the ‘hybrid model’. First, the source code is compiled into some intermediate form (i.e. code which is more like processor instructions than high-level code but which is not executable by the processor), and then, to run the program, an interpreter executes this intermediate code. (In this model, the linking of the compilation units is typically done by the interpreter every time the program is run.)
A further refinement of the hybrid model is to use a JIT (Just-in-time) compiler. You use a JIT compiler as you would an interpreter—you run your program by feeding the JIT compiler some form of code (usually intermediate code)—but the JIT compiler compiles code into processor instructions and runs that instead of interpreting the code. Despite the time spent to perform this compilation (typically reflected in a longer program load time), JIT compiling is usually considerably faster than using interpretation: using a JIT compiler with the hybrid model is typically only 10%-20% less performant than were the code ‘natively compiled’ (compiled into an executable and run as such), compared to 70-100% slower for interpreting intermediate code. [The term "performant" is used by programmers to mean 'fast performing' or 'acceptably performing', but you won't find it any dictionary---yet.] Some claim that, in a few cases, a sufficiently smart JIT compiler can run code faster than the same program compiled into an executable because the JIT compiler can make optimizations only discoverable at runtime. (The comparative performance of JIT compiling versus native compiling is a hotly debated topic. While most concede native compilation almost always produces better performance, it’s debated how much of a performance hit JIT compiling introduces.)
Understand that, whether using the hybrid model or not, an interpreted program is limited by its interpreter. Just as programs executed by the OS can only do what the OS allows them to do, interpreted programs can only do what their interpreter allows them to do. This has potential security benefits: as the theory goes, users can download programs and run them in an interpreter without having to trust those programs because the interpreter can block its programs from accessing files on the system and/or using the network connection, etc. In such schemes, the interpreter is often called a VM (virtual machine) because, as far as the programs which it runs are concerned, it looks and acts much like a full computer system. In practice, truly secure virtual machines aren’t quite a reality, for real VM’s have bugs which malicious programs they run can exploit to breach the limitations imposed by the VM; consequently, users should still be careful of which programs they download and run, even if the program is run in a VM.
Another often-cited benefit of interpretation is that, as long as an appropriate interpreter for your language exists on all the platforms you wish to run your program on, you only need to write the program once. This is often called ‘write once, run anywhere’. This argument made a bit more sense when computers were slower and so compilation took considerably longer, making compiling your program for all target platforms a bit more bothersome, but aversion to this inconvenience doesn’t really explain why interpreted programs are considered so much more portable. The real reason writing your program for an interpreted environment makes it generally easier to get it working on multiple platforms is that the interpreter acts as a layer of indirection between your program and the OS, so the interpreter can handle the messy particulars of dealing with variances between OS’s, e.g. the process of opening a file often differs from one OS to the other, but your program only has to tell the interpreter to open a file, and the interpreter in turn deals with the particulars of the OS.
The portability advantage of interpretation holds out as long as your program uses functionality that is available and works consistently on all of your target platforms. A notorious problem area is GUI’s (Graphical User Interfaces): many GUI ‘widgets‘ (windows, menus, scrollbars, drop-down menus, etc.) simply don’t look and act the same on Windows, Macs, and Linux desktops. Attempts to provide a cross-platform means of writing GUI code have to date only been partially successful.
In principle, any language can be either interpreted or compiled, but in practice, languages are designed with a particular model in mind. For instance, were you to interpret C language code, you would defeat the purposes of using C in the first place (mainly performance and greater machine control), and so this just isn’t done (though I bet someone somewhere has done it—someone somewhere has done everything, no matter how strange or daft). Another language, Java, was conceived and implemented to use the hybrid model; ‘native compilers’ (compilers that spit out processor instructions) for Java exist, but aren’t used very often because the performance benefits generally aren’t significant enough to be worth the downsides.
Thus endeth the lesson.