LLVM: Low-Level Virtual Machine

About

This tutorial gives a brief, basic introduction to LLVM as used in UIowa:CS:4980:0002, Topics in Computer Science II: Dependable System Design, taught by Dr. Guanpeng Li.
Specifically, it aims to give concise descriptions of the following topics:

  • How to Install LLVM?

  • What is LLVM?

  • What is LLVM IR? How to read LLVM IR?

  • What is LLVM Pass? How to write and compile an LLVM Pass?

  • Some useful LLVM Tools.

LLVM Installation

The following commands show how to install LLVM 3.4 on an Ubuntu 16.04 machine.
$ # Note that LLVM source code of other versions can be found at https://github.com/llvm/llvm-project.
$ git clone https://github.com/zjuacompiler/llvm.git          # download LLVM 3.4 source code
$ cd llvm                                                     # change directory to the cloned folder
$ ./configure --enable-optimized --disable-assertions --enable-targets=host --with-python="/usr/bin/python2"
                                                              # configure dependencies before you build LLVM
$ mkdir build && cd build                                     # create a separate "build" folder for the build output
$ cmake ..                                                    # generate the build files; ".." is the path to the source code
$ make -j$(nproc)                                             # build LLVM in parallel, one job per available processing unit
Check whether the build was successful.
$ cd $YOUR-LOCAL-PATH$/llvm/build/bin/ && ls                  # list all compiled LLVM executable binaries
bugpoint   FileUpdate  lli-child-target  llvm-bcanalyzer  llvm-c-test  llvm-dwarfdump  llvm-lit  llvm-mcmarkup  llvm-ranlib   llvm-size        llvm-tblgen  obj2yaml  yaml-bench
count      llc         llvm-ar           llvm-config      llvm-diff    llvm-extract    llvm-lto  llvm-nm        llvm-readobj  llvm-stress      macho-dump   opt
FileCheck  lli         llvm-as           llvm-cov         llvm-dis     llvm-link       llvm-mc   llvm-objdump   llvm-rtdyld   llvm-symbolizer  not          yaml2obj
$ # This output indicates that LLVM 3.4 was built successfully on your machine :)

LLVM

[Figure: LLVM compiler architecture]

LLVM (Low-Level Virtual Machine) is a compiler architecture [1], as shown in the figure above. Its main use is to serve as the shared Optimizer (language-agnostic optimization) and Backend (machine code generation) for multiple programming languages.

Compared with a traditional compiler (Frontend, Optimizer, Backend), the design of LLVM is very flexible across programming languages and hardware. Supporting a new language only needs a new frontend, and supporting a new hardware device only needs a new backend. The LLVM Optimizer is a general module shared by all of them: its optimizations operate on the well-structured LLVM IR (Intermediate Representation), which serves as the common standard.

Besides, LLVM contains many sub-projects, including Clang, a lightweight compiler frontend for C, C++, Objective-C, and Objective-C++ that you may have seen on your MacBook.

LLVM IR

What is LLVM IR?

[Figure: LLVM IR between the front-end, middle-end, and back-end]

LLVM IR is the low-level intermediate representation used by the LLVM compiler framework [2], as shown in the figure above. An LLVM-based compiler can be split into three components, the front-end, the middle-end, and the back-end, each with a specific task that takes IR as input and/or produces IR as output.

  • Front-end: compiles the source language to IR.

  • Middle-end: optimizes the IR. This step is very powerful: it is where you can implement many interesting code transformations and program analyses.

  • Back-end: compiles IR to machine code.

You can think of LLVM IR as a platform-independent assembly language with an infinite number of function-local registers. However, LLVM IR is not machine code; it sits one step above assembly. Some of its features look like a high-level language (e.g. functions and strong typing), while others look like low-level assembly (e.g. branching, basic-blocks, instructions).

How to read LLVM IR?

[Figure: one LLVM IR function drawn as a graph of basic-blocks]

LLVM IR has both a human-readable text format (.ll) and a binary format (.bc, the bitcode format). The two formats can be converted into each other easily, and we mainly focus on .ll here.
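
One practical way to tell the two formats apart: a raw bitcode file begins with the four magic bytes 'B', 'C', 0xC0, 0xDE, whereas a .ll file is plain text. The helper below is only a minimal sketch of that check. The name looksLikeBitcode is hypothetical (it is not an LLVM API), and the sketch assumes raw bitcode, ignoring the alternative wrapper header that can precede bitcode on some platforms.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true when the buffer starts with the raw bitcode magic bytes 'B' 'C' 0xC0 0xDE.
// Hypothetical helper for illustration; not part of LLVM.
bool looksLikeBitcode(const std::vector<unsigned char> &bytes) {
    static const unsigned char kMagic[4] = {0x42, 0x43, 0xC0, 0xDE};
    if (bytes.size() < 4)
        return false;                       // too short to hold the magic number
    for (std::size_t i = 0; i < 4; ++i)
        if (bytes[i] != kMagic[i])
            return false;
    return true;
}

// Usage example: a bitcode-style header is recognized, plain IR text is not.
void exampleLooksLikeBitcode() {
    const std::vector<unsigned char> bitcodeHeader = {0x42, 0x43, 0xC0, 0xDE};
    const std::vector<unsigned char> textualIr = {'d', 'e', 'f', 'i', 'n', 'e'};
    assert(looksLikeBitcode(bitcodeHeader));
    assert(!looksLikeBitcode(textualIr));
}
```

A check like this is handy because the file extension alone is not reliable: a tool can write bitcode into a file whose name ends in .ll.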

LLVM IR has three major components: function, basic-block (BB), and instruction.

  • Function: One LLVM IR file contains one or more functions, such as the function shown in the figure above. Note that function names in IR are LLVM-level names and do not always match the names in the source code (C++ names, for example, are mangled), although main keeps its name.

  • Basic-Block: One function contains one or more basic-blocks, and each node in the figure above represents a basic-block. The basic-block is also the smallest unit at which the dynamic footprint of a program execution is recorded.

  • Instruction: One basic-block contains one or more instructions. Every instruction has its own instruction type and operands.
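
The containment relation in the list above (file, then functions, then basic-blocks, then instructions) can be sketched with plain C++ structs. This is only a toy model under stated assumptions: ToyModule, ToyFunction, ToyBasicBlock, ToyInstruction, countInstructions, and makeToyModule are all hypothetical names, not the real LLVM classes, and the model stores nothing but an opcode string per instruction.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-ins for the IR hierarchy. These are NOT the real LLVM classes.
struct ToyInstruction {
    std::string opcode;                        // instruction type, e.g. "load" or "call"
};
struct ToyBasicBlock {
    std::vector<ToyInstruction> instructions;  // kept in program order
};
struct ToyFunction {
    std::string name;
    std::vector<ToyBasicBlock> blocks;
};
struct ToyModule {                             // stands for one IR file
    std::vector<ToyFunction> functions;
};

// Walk function -> basic-block and add up the instructions of every block.
std::size_t countInstructions(const ToyModule &toyModule) {
    std::size_t total = 0;
    for (const ToyFunction &function : toyModule.functions)
        for (const ToyBasicBlock &block : function.blocks)
            total += block.instructions.size();
    return total;
}

// Build a module with one function, "main", that has a single four-instruction block.
ToyModule makeToyModule() {
    ToyBasicBlock entry;
    for (const char *opcode : {"load", "mul", "store", "ret"})
        entry.instructions.push_back(ToyInstruction{opcode});

    ToyFunction mainFunction;
    mainFunction.name = "main";
    mainFunction.blocks.push_back(entry);

    ToyModule toyModule;
    toyModule.functions.push_back(mainFunction);
    assert(countInstructions(toyModule) == 4);  // sanity check: load, mul, store, ret
    return toyModule;
}
```

Calling countInstructions(makeToyModule()) walks the same three levels that a real pass walks with LLVM's own iterators; only the container types differ.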

Here is a toy example of compiling C code to LLVM IR. The IR has only one function, main(). This function has only one basic-block, and this basic-block has four instructions, whose types are “load”, “mul”, “store”, and “ret”.

Original C code
int variable = 21;

int main()
{
    variable = variable * 2;
    return variable;
}
LLVM IR (readable format)
@variable = global i32 21          ; define a global variable; in LLVM IR, global names start with '@'

define i32 @main() {
    %1 = load i32, i32* @variable  ; load the global variable; in LLVM IR, local names start with '%'
    %2 = mul i32 %1, 2             ; multiply the loaded value by 2
    store i32 %2, i32* @variable   ; store instruction that writes the result back to the global variable
    ret i32 %2
}
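
To make the reading of each instruction concrete, the sketch below hand-translates the IR above back into C++, one statement per IR instruction. It is an illustration rather than compiler output: the names ir_main, v1, v2, and exampleRun are hypothetical, and each local is declared const to mirror the fact that every %-value in the IR is assigned exactly once.

```cpp
#include <cassert>

int variable = 21;             // @variable = global i32 21

// Hand translation of @main. The name ir_main is hypothetical and avoids clashing with a real main().
int ir_main() {
    const int v1 = variable;   // %1 = load i32, i32* @variable   : read memory into a virtual register
    const int v2 = v1 * 2;     // %2 = mul i32 %1, 2              : arithmetic on registers only
    variable = v2;             // store i32 %2, i32* @variable    : write the register back to memory
    return v2;                 // ret i32 %2
}

// Usage example: one call doubles the global and returns the new value.
void exampleRun() {
    assert(variable == 21);
    assert(ir_main() == 42);
    assert(variable == 42);
}
```

Note how the single C statement variable = variable * 2 becomes three IR instructions: memory is only touched by load and store, while mul works purely on register values.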

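To learn more about the LLVM IR grammar, please see this document [3]. Note that the load syntax with an explicit result type shown above is the one used by LLVM 3.7 and later; LLVM 3.4 writes the same instruction as %1 = load i32* @variable.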

LLVM Pass

What is LLVM Pass?

The LLVM Pass framework is an important component of the LLVM infrastructure: passes perform analyses, transformations, and optimizations at the LLVM IR level. As you may have noticed in the figures above, LLVM Passes are executed by the optimizer. In principle, a pass can apply any transformation to a given piece of LLVM IR, even rewriting one program (e.g. bubble sort) into a completely different one (e.g. quick sort), as long as you can implement the logic. Because a pass has access to every function, basic-block, and instruction of the IR, it is also a very effective tool for analyzing program code.
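
The difference between an analysis pass and a transformation pass can be sketched without LLVM. The toy below is a minimal illustration under stated assumptions: ToyInstruction, ToyModule, ToyPass, CountCalls, MulToShift, and runPasses are hypothetical names, and the "IR" is just a flat list of opcode strings. What it borrows from LLVM is one convention: a pass returns true only if it modified the IR.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy IR: a flat list of instructions. These types are illustrative, not LLVM classes.
struct ToyInstruction {
    std::string opcode;   // "mul", "shl", "call", ...
    int operand;          // constant right-hand operand, if the instruction has one
};
using ToyModule = std::vector<ToyInstruction>;

// A pass reports whether it changed the IR.
struct ToyPass {
    virtual ~ToyPass() = default;
    virtual bool run(ToyModule &ir) = 0;
};

// Analysis pass: only inspects the IR, so it always returns false.
struct CountCalls : ToyPass {
    std::size_t calls = 0;
    bool run(ToyModule &ir) override {
        for (const ToyInstruction &inst : ir)
            if (inst.opcode == "call")
                ++calls;
        return false;
    }
};

// Transformation pass: rewrites "mul x, 2" into the equivalent "shl x, 1".
struct MulToShift : ToyPass {
    bool run(ToyModule &ir) override {
        bool changed = false;
        for (ToyInstruction &inst : ir) {
            if (inst.opcode == "mul" && inst.operand == 2) {
                inst.opcode = "shl";
                inst.operand = 1;
                changed = true;
            }
        }
        return changed;
    }
};

// Minimal pass runner: runs every pass in order and reports whether any of them modified the IR.
bool runPasses(ToyModule &ir, const std::vector<ToyPass *> &passes) {
    bool changed = false;
    for (ToyPass *pass : passes) {
        assert(pass != nullptr);
        changed = pass->run(ir) || changed;
    }
    return changed;
}
```

Running both passes over a list such as call, mul 2, call, mul 3 leaves the counter at 2, rewrites only the multiplication by 2, and makes runPasses return true; running the counter alone returns false because nothing was changed.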

How to write and compile an LLVM Pass?

Let's write a first LLVM Pass that performs a simple program analysis: counting the “call” instructions in a program's IR.

1. Create LLVM Pass Folder
$ # We assume that you have already installed LLVM 3.4 by following the steps above.
$ cd $YOUR-LOCAL-PATH$/llvm/lib/Transforms                               # change directory to the main folder that contains the LLVM Passes
$ ls                                                                     # list files and folders in current path
CMakeLists.txt  InstCombine      IPO            Makefile  Scalar  Vectorize
Hello           Instrumentation  LLVMBuild.txt  ObjCARC   Utils
$ # Apart from the files CMakeLists.txt, LLVMBuild.txt, and Makefile, each of the remaining folders holds an individual LLVM Pass or a group of related passes.
$ mkdir CallCount                                                        # create folder for our target LLVM Pass

Next, add the CallCount folder to the CMakeLists.txt in the current directory so that it is picked up when LLVM is built. The modified CMakeLists.txt is shown below:

add_subdirectory(Utils)
add_subdirectory(Instrumentation)
add_subdirectory(InstCombine)
add_subdirectory(Scalar)
add_subdirectory(IPO)
add_subdirectory(Vectorize)
add_subdirectory(Hello)
add_subdirectory(ObjCARC)
add_subdirectory(CallCount)                                              # add path of folder that contains our target LLVM Pass
2. Write LLVM Pass
$ cd CallCount                                                           # change directory to target folder
$ touch Hello.cpp                                                        # create C++ source file of target LLVM Pass

Hello.cpp defines the logic of our pass, which counts the “call” instructions in a program's IR. The content of Hello.cpp is shown below:

#include "llvm/ADT/Statistic.h"
#include "llvm/IR/Function.h"
#include "llvm/Pass.h"
#include "llvm/Support/raw_ostream.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/Type.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Instruction.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/Support/InstIterator.h"

#include <iostream>
#include <map>
#include <list>
#include <vector>
#include <set>

using namespace llvm;

namespace{

    /****** analysis pass ********/
    struct CallCount : public ModulePass{                                                       // This pass is derived from ModulePass.
                                                                                                // LLVM also offers other pass classes, such as:
                                                                                                //     CallGraphSCCPass, FunctionPass, LoopPass, and RegionPass.
        static char ID;
        int call_count;                                                                         // Member variable that records the number of call instructions seen so far.

        CallCount() : ModulePass(ID), call_count(0) {}

        virtual bool runOnModule(Module &M){                                                    // Each program IR is loaded as one Module.
                                                                                                // You can regard this as the main function of this pass.
            for(Module::iterator F = M.begin(), FE = M.end(); F != FE; ++F){                    // Iterate over each function in this Module.
                for(Function::iterator BB = F->begin(), BE = F->end(); BB != BE; ++BB){         // Iterate over each basic-block in the current function.
                    CallCount::runOnBasicBlock(BB, M.getContext());                             // Visit each instruction in the current basic-block.
                }
            }
            return false;                                                                       // false: this pass only analyzes the Module and does not modify it.
        }

        virtual bool runOnBasicBlock(Function::iterator &BB, LLVMContext &context){             // Helper used above; its input is the current basic-block.
            for(BasicBlock::iterator BI = BB->begin(), BE = BB->end(); BI != BE; ++BI){         // Iterate over each instruction in the current basic-block.
                unsigned opcode = BI->getOpcode();                                              // Get the opcode of the current instruction:
                                                                                                //     The opcode is the unique number of each instruction type.
                                                                                                //     More details can be found at $YOUR-LOCAL-PATH$/llvm/include/llvm/IR/Instruction.def
                if(opcode == Instruction::Call){                                                // Count the instruction if its type is "call" (opcode 49 in LLVM 3.4).
                    call_count++;
                    outs() << call_count << '\n';                                               // Print the running total after each call instruction.
                }
            }
            return false;                                                                       // Nothing was modified.
        }
    };
}
char CallCount::ID = 0;
static RegisterPass<CallCount> X("CallCount", "Count call type instructions in given program IR", false, false);
                                                                                                // "CallCount" is the flag that selects this pass when it is loaded by the opt command.
3. Prepare for LLVM Pass Compilation
$ # Usually, the pass folder (i.e. CallCount here) should contain three files:
$ #     Hello.cpp       --  The source code of this LLVM Pass, written above. Of course, you can give it another name.
$ #     CMakeLists.txt  --  Tells CMake to build the source code into the LLVM Pass, a loadable module in .so format.
$ #     Makefile        --  The build rules for this pass within the whole LLVM project.
$ # So let's create the CMakeLists.txt and Makefile.
$ touch CMakeLists.txt Makefile

We first edit CMakeLists.txt so that it contains the following:

add_llvm_loadable_module(CallCount      # The name of compiled LLVM Pass. So the output will be CallCount.so after the compilation.
    Hello.cpp                           # The source code of LLVM Pass.
)

Then, we edit Makefile so that it contains the following:

LEVEL = ../../..
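# Also the name of the compiled LLVM Pass. This should be consistent with the name in CMakeLists.txt.
# Keep this comment on its own line: make would keep the spaces before a trailing "#" as part of the value.
LIBRARYNAME = CallCount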
LOADABLE_MODULE = 1
USEDLIBS =

# If we don't need RTTI or EH, there's no reason to export anything
# from the hello plugin.
ifneq ($(REQUIRES_RTTI), 1)
ifneq ($(REQUIRES_EH), 1)
EXPORTED_SYMBOL_FILE = $(PROJ_SRC_DIR)/Hello.exports
endif
endif

include $(LEVEL)/Makefile.common

We are almost done: the LLVM Pass is now ready for compilation.

4. Compile LLVM Pass
$ cd $YOUR-LOCAL-PATH$/llvm/build                                        # change directory to where we build this LLVM project
$ make -j$(nproc)                                                        # rebuild LLVM in parallel so that the new pass is compiled

Once the compilation is done, the target LLVM Pass CallCount.so can be found at $YOUR-LOCAL-PATH$/llvm/build/lib/.
To load the LLVM Pass for analyzing a program IR (e.g. pathfinder.ll), you can execute the following commands:

$ $YOUR-LOCAL-PATH$/llvm/build/bin/opt -load $YOUR-LOCAL-PATH$/llvm/build/lib/CallCount.so pathfinder.ll -CallCount -o output.ll
$ # The last number printed is the number of "call" type instructions in pathfinder.ll :)

The source folder of this LLVM Pass can be found at git repository [4].

Some Useful LLVM Tools

clang: LLVM C compiler
$ clang hello-world.c -o hello-world                                     # Compile C code to executable binary
$ clang -S -emit-llvm hello-world.c -o hello-world.ll                    # Compile C code to readable IR
clang++: LLVM C++ compiler
$ clang++ hello-world.cpp -o hello-world                                 # Compile C++ code to executable binary
$ clang++ -S -emit-llvm hello-world.cpp -o hello-world.ll                # Compile C++ code to readable IR
llvm-as: Assembler
$ llvm-as hello-world.ll -o hello-world.bc                               # Assemble readable IR into bitcode format
llvm-dis: Disassembler
$ llvm-dis hello-world.bc -o hello-world.ll                              # Disassemble bitcode format into readable IR
llvm-link: Linker
$ llvm-link -S hello.ll world.ll -o hello-world.ll                       # Link two IRs into a unified one
llc: Static compiler
$ llc hello-world.ll -o hello-world.s                                    # Compile IR into assembly code for a specified architecture
lli: Directly execute IR
$ lli pathfinder.ll 1000 10                                              # Directly execute Rodinia-pathfinder IR with input "1000 10" using a just-in-time compiler
                                                                         # This IR can be found at git repository [4].
opt: Optimizer
$ opt -load ./CallCount.so pathfinder.ll -CallCount -o output.ll         # Load an LLVM Pass and run it for code analysis, transformation, or optimization.
                                                                         # CallCount.so is the LLVM Pass we want to load.
                                                                         # -CallCount is the unique flag under which this Pass is registered.
                                                                         # Despite its .ll name, output.ll is in bitcode format, because opt writes bitcode unless -S is given.
                                                                         # It can be disassembled to readable IR via llvm-dis.

Contributing

This document was written by Yafan Huang. It has not been revised yet, so some descriptions may be a bit confusing. If you have any questions after reading this document, please feel free to email him at yafan-huang@uiowa.edu.

References

[1] LLVM Documents
[2] Blog: LLVM IR and Go
[3] Mapping High Level Constructs to LLVM IR
[4] Github Repo: LLFI-Quick-Start