Compilation Basics

This article attempts to explain the basics of building from source: what “source code” is, what a compiler does, and the functions of libraries, make and ./configure.

Source code, binaries, and compilation

When a programmer writes a program, they will typically write it in a language like C, C++ or Java. These, like most programming languages, are a set of instructions for a computer, written in a “longhand” form which resembles a cross between natural language (“while <this is true>, do <that>”) and mathematical notation. This is known as the source code. Some languages, like Perl, Python and Lisp, have a program with them (called the interpreter) which reads this source code, and follows the instructions directly. However, this is a slow process, since the interpreter, like a human interpreter, has to convert between the human-readable source code and the “natural” language that the computer understands as it goes along.
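
To make this concrete (using Python as the interpreted language; the file name is just an example), the interpreter reads the source file and carries out its instructions directly, and no separate translated file is ever produced:

{{{
$ cat hello.py
print("Hello, world")
$ python hello.py
Hello, world
}}}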

To avoid this slow as-you-go translation process, some languages (C, C++, Java, Ada, Pascal) are not interpreted, but instead have a program which does the translation process all at once, and produces a separate file which can be executed directly by the hardware of the machine. This pre-translated file is known as a binary or executable. The translation process is known as compilation. The binary form of the program consists of a set of instructions in the underlying language of the CPU (known as its instruction set).
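
By contrast, here is a minimal sketch of compiling and running a one-file C program with gcc (the file and program names are made up for illustration):

{{{
$ cat hello.c
#include <stdio.h>

int main(void)
{
    printf("Hello, world\n");
    return 0;
}
$ gcc -o hello hello.c    # compilation: translate the source into a binary
$ ./hello                 # the binary runs directly, no translator needed
Hello, world
}}}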

Note that interpreted languages can be run on any machine (e.g. x86 PC, PowerPC/Mac, Sun Sparc), provided that the machine has a suitable interpreter on it. Compiled languages require a different binary to be made for each different machine type that they have to be run on (since the underlying machine architectures, and hence CPU instruction sets are different).

Source files for C and C++ typically have extensions of .c, .cc, .C, .cxx, .h, .hh and .hxx. Java source files are .java. Other languages are more variable in their file extensions, and much rarer — 99.9% of all programs written in compiled languages you are likely to encounter are written in C, C++ or Java.

Linkers

A large and complex piece of software, like Apache, or the Linux kernel, may contain millions of lines of source code. It may even be written in several different languages: OpenOffice.org has components written in C, C++, Java and Python, for example. A single source file containing the 4.3 million lines of code of the Linux kernel would be completely unmaintainable. For this reason (and some others — see later), compiled languages have a technique where the code can be split up into separate files, compiled separately, and then knitted together to form the finished binary.

When a program is split up into separate source files, the source files can’t be entirely independent of each other (else there would be no connection between them). Therefore, if one file defines a function:

makestuff.c

{{{
void createFile(const char *filename) {
    /* create a file here */
}
}}}

and another file uses it:

usestuff.c

{{{
void useFile(const char *name) {
    createFile(name);
    /* Do stuff to the new file */
}
}}}

then there is a connection between the two files.

Now, when a source file is compiled, the compiler produces what’s known as an object file. These typically have a .o extension. Object files contain mostly compiled binary code, corresponding to the source code in the relevant source file. As mentioned above, though, the object files can’t be complete in themselves, as they don’t contain all of the information they need. So, in our example above, usestuff.c doesn’t know where the createFile() function is — it only knows that such a function must exist. Likewise, makestuff.c doesn’t know how the createFile() function will be used — it only knows that such a function has a particular definition.

To connect these two files together, there must be a process which takes the function call in usestuff.c, and connects it to the function definition in makestuff.c. This process is called linking, and is accomplished by a program called a linker. A linker takes a collection of object files (.o files) and connects them together into a single executable binary file.

So, the process looks like this:

{{{
makestuff.c ------compiler------> makestuff.o -----+
                                                   |
                                                 linker-----> dostuff (binary executable)
                                                   |
usestuff.c -------compiler------> usestuff.o ------+
}}}
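
In terms of actual commands, the same picture looks something like this with gcc (a sketch; the -c flag means “compile to an object file, but don’t link yet”):

{{{
$ gcc -c makestuff.c                      # produces makestuff.o
$ gcc -c usestuff.c                       # produces usestuff.o
$ gcc -o dostuff makestuff.o usestuff.o   # the linker step: knit the .o files together
}}}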

Make

In a large project, it would be quite hard on the programmer to run the compiler and linker by hand each time they changed something in the code and wanted to test it. So, a program called make was developed. Make knows about using programs to make one file from other files (as the compiler and linker do). It is designed to look at the time-stamps on files, and if, say, the .o file is older than the corresponding .c file, it will rebuild the .o file. Then, since the executable is then older than the .o file, it will run the linker. Make knows some basic rules (how to make a .o from a .c or .cc), and can be taught others (how to make “dostuff” from “makestuff.o” and “usestuff.o”). It is controlled by a file containing the rules it needs to be taught, which is usually called Makefile. Unless told otherwise (with the -f option), make will look for a file called Makefile to read its rules from, and will complain if it can’t find it.
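
A minimal Makefile for the dostuff example above might look like this (a sketch; in practice make’s built-in rules can infer the .c-to-.o steps by themselves). Each rule names a target, the files it depends on, and the command (indented with a tab) used to build it:

{{{
dostuff: makestuff.o usestuff.o
	gcc -o dostuff makestuff.o usestuff.o

makestuff.o: makestuff.c
	gcc -c makestuff.c

usestuff.o: usestuff.c
	gcc -c usestuff.c
}}}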

Make can be given multiple sets of rules to do different things. The default rule (invoked by running just make) will usually build all of the binaries in a project. It will probably also try to build any documentation, if that documentation needs to be built (e.g. from DocBook or LaTeX sources). The most common rule other than “build everything” that you will encounter is a rule to install all of the built files so that they can be accessed by the system. This rule is normally called “install”, and is run using make install. Other rules might include make uninstall and make clean. The latter will remove any generated files from the source tree, such as executables and .o files, and will return it to a “clean” state similar to the state it was in before a compilation was attempted. make clean can be useful if a program (or programmer) gets itself confused halfway through a compilation process.
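
Continuing the sketch above, install and clean rules might be added to the Makefile like this (the install path is just an example):

{{{
install: dostuff
	cp dostuff /usr/local/bin/dostuff   # usually run via sudo

clean:
	rm -f dostuff makestuff.o usestuff.o

.PHONY: install clean   # these targets name actions, not real files
}}}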

Libraries and headers

Sometimes, there are useful functions which can be re-used in a lot of different programs, ranging from simple things like formatting text output through to complex jobs like interfacing with X or playing MP3s. It is convenient to be able to package a collection of such functions into a single unit, from which a programmer can use the functions without having to care about how they are written. Such a package of functions is called a library.

To use a library in a program, the compiler (and the programmer) needs to know what functions are available in the library. The list of functions in a particular library is known as the API (Application Programming Interface). The definition of the API for a library (in C and C++) is contained in a set of files known as header files or simply headers. These let both the programmer and the compiler know what functions are available from the library. Header files in C and C++ typically have a .h, .hh, or .hxx extension.
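
As a small example, to use the square-root function from the standard C maths library, a program includes the math.h header (which declares sqrt() for the benefit of the programmer and the compiler alike):

{{{
#include <math.h>    /* the header: declares sqrt() */
#include <stdio.h>

int main(void)
{
    printf("%f\n", sqrt(2.0));
    return 0;
}
}}}

This would be compiled with something like gcc -o root root.c -lm, where -lm tells the linker to link against the maths library itself.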

The compiled binary code for the library must also be made available. This comes in two forms:

  1. a static library, with a .a extension, which is supplied to the linker when the program is built. If a program is statically linked, then it carries its own copy of the library code with it.

  2. a dynamic or shared library, with a .so extension, which is made available to the program when it is run. A program that uses shared libraries must have those libraries available to it when it is run. However, shared libraries save a lot of space, as there only needs to be one copy of the library on the system to service many programs.

In the Windows world, shared libraries are known as Dynamic Link Libraries, or DLLs.
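
As a sketch (the library and file names here are invented for illustration), both forms can be produced from the same compiled code:

{{{
$ gcc -c -fPIC stuff.c                  # -fPIC is needed for the shared case
$ ar rcs libstuff.a stuff.o             # 1. a static library
$ gcc -shared -o libstuff.so stuff.o    # 2. a shared library
$ gcc -o prog main.c -L. -lstuff        # link a program against the library
}}}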

./configure and the autotools

Building programs for diverse and disparate systems is a complex job. Not all versions of Unix provide the same functions (API) to the programmer. Not all systems will have all of the optional libraries that might be wanted to build a program. The person building the project may not want all of the optional features that it provides.

To solve this problem (and provide a whole set of other problems — but that’s another rant), the GNU project has written a set of programs collectively called the AutoTools. The two main programs are Autoconf and Automake.

Autoconf generates configure scripts. A configure script (usually called “configure”) attempts to identify all the relevant quirks of the machine on which it’s being run (which OS functions are available; what peculiar parameters need to be passed to the compiler). It also checks that the relevant libraries needed for building the software are present, by looking for the relevant header files. The configure script then modifies specific “template” files to reflect the results of its investigations, and to ensure that the compilation of the program succeeds. The template files normally have a .in extension, and the filled-in output is written to the same filename without the .in on the end.
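
A minimal configure.ac for a small C project might look something like this (a sketch; the project name is made up):

{{{
AC_INIT([dostuff], [1.0])
AM_INIT_AUTOMAKE([foreign])
AC_PROG_CC                    # find a working C compiler
AC_CONFIG_FILES([Makefile])   # templates to fill in: Makefile.in -> Makefile
AC_OUTPUT
}}}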

Automake writes complex configurations for make, attempting to simplify the process for the author of software. Automake reads .am files, typically Makefile.am, and produces a .in file from it, which can then be processed by the configure script.
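
The corresponding Makefile.am for the dostuff example could be as short as this (again a sketch; automake generates all of the compile and link rules itself):

{{{
bin_PROGRAMS = dostuff
dostuff_SOURCES = makestuff.c usestuff.c
}}}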

Summarising this:

{{{
                                              configure.ac
                                                    |
                                                autoconf
                                                    |
                                                    v
Makefile.am -----automake----> Makefile.in ------configure------> Makefile
}}}

You will normally find that distributed source packages contain a Makefile.in and a configure script already, so there’s (normally) no need to have automake and autoconf installed.

The configure script has a whole load of options available in it. It is often useful to look at the output of ./configure --help to see what options are available. With autoconf-generated configure scripts, you will always find a --prefix option. This allows you to change the location to which the compiled program will be installed. So, configuring the source of a package with:

$ ./configure

will ultimately install the package under /usr/local. If you want it in, say, /home/me/stuff, then:

$ ./configure --prefix=/home/me/stuff

will do the trick. Other options will allow you to prevent or allow the use of specific libraries for particular parts of the software (support for a particular image format, or for encrypted communications, say). Some options may allow you to specify the location of particular libraries, if ./configure can’t find them for itself. Most of the time you won’t need these other options.

The “default” build process

The documentation for most autotools-enabled packages has the following mantra in it, or something similar:

{{{
$ ./configure
$ make
$ sudo make install
}}}

From the preceding description, you should be able to work out that these three commands do the following:

  1. Check for installed libraries and for peculiarities of the operating system, and set up the build appropriately.
  2. Using the build rules generated from the previous step, build all of the source code in the project, and link it together into the relevant binaries.
  3. Install the binaries, and any other support files they might need, into the location defined in the ./configure step earlier.

(See also /CompileFromSource)

Java

The above description applies, more or less, to Java code as well as to C/C++ code. However, in Java, the source files are .java; object files are .class instead of .o; there is no direct equivalent to a fully-compiled binary in Java — everything is an object file (.class), and object files with the right code in them may be run as binaries. A .jar file is a package collecting together a number of files (.class files; other data or resources that a program might need) into a convenient package. .jar files can be thought of as the equivalent of libraries, executables, or an entire installed package, depending on how they are used.
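
As a quick sketch (the class and file names are invented for illustration), the Java equivalents of the compile-and-package steps look like this:

{{{
$ javac UseStuff.java                           # compile: produces UseStuff.class
$ jar cfe dostuff.jar UseStuff UseStuff.class   # package into a .jar, with UseStuff as the entry point
$ java -jar dostuff.jar                         # run the packaged program
}}}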

Instead of make, most large Java projects use a program called “ant”, which performs exactly the same functions as make, but with a different syntax for the control file (build.xml instead of Makefile). There is no real need for automake and autoconf in Java projects, since Java programs run on a virtual machine – an idealised machine that does not exist in hardware. Instead, a program called a JVM (Java Virtual Machine) is used to translate the .class files from the idealised middle ground to a form the computer can understand every time the Java program is run. This process is much quicker than compiling a program from source, and the programs run faster than interpreted programs – it provides a nice middle ground between compiled code and interpreted code. Additionally, only one set of program files need be distributed – in theory, they will run on any machine with a JVM.
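
A minimal build.xml for ant might look like this (a sketch; the directory names are just examples):

{{{
<project name="dostuff" default="compile">
    <target name="compile">
        <mkdir dir="build"/>
        <javac srcdir="src" destdir="build" includeantruntime="false"/>
    </target>
    <target name="clean">
        <delete dir="build"/>
    </target>
</project>
}}}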
