Bazel Primer

Introduction ¹

Bazel is an open-source build/test framework similar to Maven, Make, and Gradle.

It features:

Human-readable, high-level build language
Fast and reliable via caching
Scalable
Extensible to other languages and frameworks

This post is a set of reading notes on the official documentation for Bazel version 3.4.0. You can skip these intros and jump directly to the sample repo to get started.

Bazel Setup

Follow the instructions here to install the latest release for your system.

Concepts

In general, Bazel builds software from source code organized in a directory called a workspace.

Source files in the workspace are organized in a nested hierarchy of packages.

Each package is a directory containing a set of related source files + one BUILD file for that package.

A simple example of a C++ project structure for one package is shown below:

.
├── README.md
├── WORKSPACE
└── main
    ├── BUILD
    └── hello-world.cc
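For a layout like this, the BUILD file in main could hold a single rule. The snippet below is a minimal sketch: the original tree does not show the file's contents, and the target name hello-world is an assumption chosen to match the source file.

```starlark
# main/BUILD (assumed contents; not shown in the tree above)
cc_binary(
    name = "hello-world",         # target name, also used for the executable
    srcs = ["hello-world.cc"],    # source file in the same package
)
```

With this in place, running bazel build //main:hello-world from the workspace root should build the binary.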

Workspace

A workspace is a directory containing your source files and symbolic links to other directories that contain the build output.

Have a look at the following project structure after Bazel has built the target:

.
├── README.md
├── WORKSPACE
├── bazel-bin -> /private/var/tmp/_bazel_mxin/a122b7b4d9e8cf33d3804073143b4e06/execroot/__main__/bazel-out/darwin-fastbuild/bin
├── bazel-out -> /private/var/tmp/_bazel_mxin/a122b7b4d9e8cf33d3804073143b4e06/execroot/__main__/bazel-out
├── bazel-stage1 -> /private/var/tmp/_bazel_mxin/a122b7b4d9e8cf33d3804073143b4e06/execroot/__main__
├── bazel-testlogs -> /private/var/tmp/_bazel_mxin/a122b7b4d9e8cf33d3804073143b4e06/execroot/__main__/bazel-out/darwin-fastbuild/testlogs
└── main
    ├── BUILD
    └── hello-world.cc

Note the new symbolic links created from that build.

Bazel identifies a directory as a workspace root by searching for a file named WORKSPACE or WORKSPACE.bazel. The file may be empty or may contain references to external dependencies required to build the outputs. If both WORKSPACE and WORKSPACE.bazel exist, Bazel will ignore the WORKSPACE file.

If a subdirectory under the workspace root contains its own WORKSPACE file, Bazel simply ignores that subdirectory, since it forms another workspace. In other words, Bazel does not support nested workspaces.

Packages

As mentioned earlier, source files are usually organized in a nested hierarchy of packages.

Conceptually, a package is

the primary unit of code organization in a repository
a collection of logically related files
a specification of the dependencies among these files

In reality (ps: joking), it is a subdirectory containing a BUILD or BUILD.bazel file beneath the workspace root. A package includes all files + all subdirectories beneath the package root, except those that themselves contain a BUILD (or BUILD.bazel) file, which become subpackages in this case.

For example, the directory tree below contains two packages, my/app and its subpackage my/app/tests; my/app/data is not a package but a directory belonging to my/app.

src/my/app/BUILD
src/my/app/app.cc
src/my/app/data/input.txt
src/my/app/tests/BUILD
src/my/app/tests/test.cc
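To make the package boundary concrete, here is a sketch of what the two BUILD files might contain. The contents are assumptions, not taken from the original, and the labels assume that src/ is the workspace root so that the package names match the ones given above.

```starlark
# src/my/app/BUILD (assumed contents; defines package my/app)
cc_library(
    name = "app",
    srcs = ["app.cc"],
    # data/ has no BUILD file of its own, so this file belongs to package my/app.
    data = ["data/input.txt"],
    # Allow subpackages such as my/app/tests to depend on this target.
    visibility = ["//my/app:__subpackages__"],
)

# src/my/app/tests/BUILD (assumed contents; this file makes tests/ its own package)
cc_test(
    name = "test",
    srcs = ["test.cc"],
    deps = ["//my/app"],
)
```

Because tests/ has its own BUILD file, test.cc belongs to my/app/tests and cannot be listed as a plain file in the srcs of a rule in my/app.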

Repositories

In the introduction to packages above, we mentioned the repository, so what is it? We know GitHub repos as a way of organizing source code; a Bazel repository is a similar concept.

Bazel defines the root of the main repository as the directory containing the WORKSPACE file, also called @.

We can have dependent external repositories like googletest and these external repos are defined in the main repo’s WORKSPACE file using workspace rules.

Note that external repos are repos themselves, which means they have their own WORKSPACE file as well! However, these WORKSPACE files are ignored by Bazel and hence those transitively dependent repos are not added automatically.
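As a rough sketch, an external repo such as googletest could be declared in the main repo's WORKSPACE file with the http_archive workspace rule. The repo name, archive URL and prefix below are assumptions for illustration; a real setup would pin a version and add a sha256 checksum.

```starlark
# WORKSPACE (main repo)
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

http_archive(
    name = "googletest",  # assumed name; targets are then addressed as @googletest//...
    # Assumed URL and prefix, shown only to illustrate the shape of the rule.
    urls = ["https://github.com/google/googletest/archive/release-1.10.0.zip"],
    strip_prefix = "googletest-release-1.10.0",
)
```

Since the WORKSPACE file inside the external repo is ignored, any repo that it depends on in turn would also need its own entry here.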

Targets

Within a package, we define elements as targets. The name of a target is referred to as its label.

Target categories include:

files
rules
package groups (less numerous)

Files

We can further divide files into:

Source files
- usually written by the efforts of people and checked in to the repo
Generated files (or Derived files)
- not checked in to the repo but are generated by the build tool from source files according to specific rules

Rules

A rule specifies the relationship between inputs and outputs and the necessary steps to derive the outputs from the inputs.

Attributes

Each rule has a set of attributes. Which attributes apply to a given rule, and what each of them means, depend on the rule’s class. Each attribute has a name and a type.

For example, common attribute types are:

integer
label
list of labels
string
list of strings
output label
list of output labels

Not all attributes need to be specified in every rule (i.e. some attributes are optional). Attributes thus form a dictionary from keys (names) to optional, typed values.
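The sketch below annotates each attribute of two C++ rules with its type. The target and file names are hypothetical; the attributes themselves (srcs, deps, copts, shard_count) are standard ones for these rules.

```starlark
cc_library(
    name = "app",                 # string
    srcs = ["app.cc"],            # list of labels
)

cc_test(
    name = "app_test",            # string
    srcs = ["app_test.cc"],       # list of labels
    deps = [":app"],              # list of labels
    copts = ["-Wall"],            # list of strings
    shard_count = 2,              # integer
    # Optional attributes that are not needed, such as data or tags, are left out.
)
```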

Below we introduce several common attributes.

`name` attribute

Every rule has a name attribute of type string, whose value must be a syntactically valid target name as explained below (labels section).

In some cases, a rule’s name is somewhat arbitrary such as for genrules.

In other cases, the name is significant. For example, for *_binary and *_test rules, the name attribute determines the name of the executable produced by the build.

`srcs` attribute

This attribute has type list of labels, which means its value, if present, is a list of labels with each being the name of a target that is an input to this rule.

`outs` attribute

This attribute has type list of output labels. It is similar to the srcs attribute but differs in two significant ways:

due to the invariant that the outputs of a rule belong to the same package as the rule itself (see the Outputs section below), output labels cannot include a package component and must be in one of the “relative” forms (discussed below in the labels section)
the relationship implied by an (ordinary) label attribute is inverse to that implied by an output label: a rule depends on its srcs, whereas a rule is depended on by its outs.

The two types of label attributes (srcs and outs) thus assign direction to the edges between targets, giving rise to a dependency graph. This directed acyclic graph (DAG) over targets, also known as the target graph or build dependency graph, is the domain over which the Bazel Query tool operates.
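A genrule makes the two directions easy to see. This is a minimal sketch with hypothetical file names; the command simply copies the input to the output.

```starlark
genrule(
    name = "copy_input",           # fairly arbitrary name for the rule itself
    srcs = ["input.txt"],          # label of an input: the rule depends on it
    outs = ["input_copy.txt"],     # output label: no package part, same package as the rule
    cmd = "cp $< $@",              # $< is the single input, $@ the single output
)
```

The resulting edges run from input_copy.txt to copy_input (the file is produced by the rule) and from copy_input to input.txt (the rule depends on the file).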

Inputs

The inputs may be source files, generated files, or even other rules. Allowing generated files as the inputs means outputs of one rule may be the inputs to another rule, thus allowing rule chaining. Allowing other rules to be the inputs of one rule is more complex and language/rule-dependent.

For example, a C++ library rule A may have another C++ library rule B as input. The effect of this dependency is that B’s header files are available to A during compilation, B’s symbols are available to A during linking, and B’s runtime data is available to A during execution.

Note that a rule’s inputs may come from another package.
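Both kinds of input can be sketched in one BUILD file. All names below are hypothetical: a genrule produces a header, a library consumes that generated file, and a second library takes the first one as an input through deps.

```starlark
# A generated file: the output of this rule becomes an input of the next one.
genrule(
    name = "gen_config_header",
    srcs = ["config.h.in"],
    outs = ["config.h"],
    cmd = "cp $< $@",
)

cc_library(
    name = "b",
    srcs = ["b.cc"],
    hdrs = [
        "b.h",
        "config.h",    # generated by gen_config_header above
    ],
)

# Rule "a" takes rule "b" as an input.
cc_library(
    name = "a",
    srcs = ["a.cc"],
    hdrs = ["a.h"],
    deps = [":b"],     # a label from another package, e.g. "//other/pkg:b", would also work
)
```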

Outputs

The outputs are usually generated files, and these files always belong to the same package as the rule itself.

Class (or Categories)

A rule can be one of many different kinds or classes, based on the type of output it produces: for example, rules that produce compiled executables and libs, test executables, and other supported outputs.

Package groups

A package group is a set of packages whose purpose is to limit accessibility of certain rules.

It is defined by the package_group function and neither generates nor consumes files.
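A package group is typically used through a rule's visibility attribute. The sketch below uses hypothetical names and assumes the group should cover my/app and everything beneath it.

```starlark
package_group(
    name = "app_internal",
    packages = [
        "//my/app/...",    # my/app and all of its subpackages
    ],
)

cc_library(
    name = "helpers",
    srcs = ["helpers.cc"],
    hdrs = ["helpers.h"],
    # Only targets in packages belonging to the group may depend on this library.
    visibility = [":app_internal"],
)
```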

Labels

As mentioned in the targets intro above, a target’s name is its label and the label uniquely identifies the target.

A typical label in canonical form looks like:

@myrepo//my/app/main:app_binary

Note that @myrepo is the repo’s identifier.

Usually a label refers to a target in the same repo, and hence we can omit the repo identifier and write it as:

//my/app/main:app_binary

A label starts with // and consists of two parts separated by a colon (:):

package name
- e.g. my/app/main in the above example
target name
- e.g. app_binary in the above example

A label’s second part (i.e. the target name) can be omitted if the target name is the same as the last component of the package name. Such short-form labels are just an abbreviation and these two forms are equivalent.

For example, if we have label //my/app:app, we can also write it as //my/app.

Quick quiz:

What are the types of the following representations:

my/app
- a package named my/app
//my/app
- a target under my/app package, with its label in short form; the target name is assumed to be app
//my/app:app
- a target under my/app package, with target name app
@myrepo//my/app/main:app_binary
- a target under repo myrepo, package my/app/main, target name app_binary

We can shorten the label identifier even further! Within the BUILD file for package my/app, we can omit the package-name part of labels for this package’s targets, similar to relative paths…

For example, if we have targets //my/app, //my/app:app_binary, we can refer to them in the file my/app/BUILD as

//my/app:app or //my/app or :app or app
//my/app:app_binary or :app_binary or app_binary
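In a BUILD file this might look as follows. The source file names are hypothetical; the point is only the label in deps.

```starlark
# my/app/BUILD
cc_library(
    name = "app",
    srcs = ["app.cc"],
)

cc_binary(
    name = "app_binary",
    srcs = ["main.cc"],
    # "//my/app:app", "//my/app", ":app" and "app" all name the same target here.
    deps = [":app"],
)
```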

Don’t be confused by all these forms! Just remember to be consistent in the label style you use.

Usually the colon : is omitted for file targets, but retained for rule targets. This allows us to reference files by their unadorned name relative to the package directory in the package’s BUILD file, e.g.

generate.cc
testdata/input.txt

If you want to reference targets outside the current package in the BUILD file, you need to refer to them using their complete label.

For example, if there is another package named my/test and you want to refer to a file in the package my/app from my/test’s BUILD file, you need to use //my/app:generate.cc.

If you refer to a target with an incorrect label, you may get errors such as “crosses a package boundary”.
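A sketch of the cross-package case, with a hypothetical test target. It assumes my/app makes the file available with exports_files, which is the usual way for a package to let others refer to one of its source files.

```starlark
# my/app/BUILD
exports_files(["generate.cc"])     # let other packages refer to this source file

# my/test/BUILD
cc_test(
    name = "generate_test",
    srcs = [
        "generate_test.cc",        # file in this package: plain relative name
        "//my/app:generate.cc",    # file in another package: complete label
    ],
)
```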

Labels starting with @// are references to the main repo and still work even from external repos.

Therefore @//a/b/c is different from //a/b/c when referenced from an external repo. The former refers back to the main repo while the latter looks for target //a/b/c in the current external repo itself.

This subtle difference can be especially important when you write rules in the main repo that refer to targets in the main repo, but these rules will be used from external repos.

I know the label syntax is strict, but Bazel intentionally enforces it for many reasons. The precise details can be found here.

The `BUILD` files

In the above sections, we discussed packages, targets, labels, and the build dependency graph abstractly. They are the building blocks of Bazel and can be found in a BUILD file.

A BUILD file defines a package and is evaluated as a sequential list of statements written in Starlark, an imperative language.

By saying “sequential list”, we emphasize the order does matter, especially for variables. Variables must be defined before they are used.

At the same time, the relative order of rule declarations is immaterial: all that matters is which rules were declared, and with what values, by the time package evaluation completes.

So, in simple BUILD files that consist only of rule declarations, these declarations can be re-ordered freely without changing the behavior.
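The following sketch, with hypothetical names, shows both points: the variable has to come before its uses, while the two rules could swap places without changing anything.

```starlark
# Variables: order matters. COMMON_COPTS must be assigned before any rule uses it.
COMMON_COPTS = ["-Wall", "-Werror"]

# Rules: order does not matter. app_binary refers to :app, which is declared further down.
cc_binary(
    name = "app_binary",
    srcs = ["main.cc"],
    copts = COMMON_COPTS,
    deps = [":app"],
)

cc_library(
    name = "app",
    srcs = ["app.cc"],
    hdrs = ["app.h"],
    copts = COMMON_COPTS,
)
```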

Limitations

no function definitions, for statements or if statements (list comprehensions and if expressions are still allowed), to encourage a clean separation between code and data
- functions should be declared in .bzl files instead
no *args or **kwargs arguments
- have to list all the arguments explicitly
unable to perform arbitrary I/O
- hence the interpretation of BUILD files is hermetic i.e. dependent only on a known set of inputs, which is essential for ensuring that builds are reproducible
should be written using only ASCII characters

Best practices

use comments liberally to document the role of each build target, whether or not it is intended for public use and to document the role of the package itself
- since BUILD files need to be updated whenever the dependencies of the underlying code change, and are typically maintained by multiple people on a team

Bazel extensions

Bazel extensions are files ending in .bzl.

As mentioned in the BUILD file limitations, such files can be used to define new rules, functions or constants. Use a load statement in the BUILD file to import a symbol from an extension.

E.g. the following code loads the file foo/bar/file.bzl and adds the some_library symbol to the environment.

load("//foo/bar:file.bzl", "some_library")

load also supports additional arguments to import multiple symbols.

Limitations of load statement:

arguments must be string literals (i.e. no variables)
load statements must appear at the top-level (i.e. cannot be in function body)
the first argument is a label (discussed above) identifying the .bzl file (i.e. a file target). If it is a relative label,
- it is resolved with respect to the package (not the directory) containing the file that does the loading.
- it should use a leading :

Another typical usage of load is to assign different names (i.e. aliases) to the imported symbols:

E.g.

load("//foo/bar:file.bzl", library_alias = "some_library")

# multiple symbols and a mix of aliases and regular symbol names
load(":my_rules.bzl", "some_rule", nice_alias = "some_other_rule")

In a .bzl file, symbols starting with _ are not exported and thus cannot be loaded from another file.
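To illustrate, here is a sketch of what foo/bar/file.bzl and a BUILD file that loads it might contain. Both files are hypothetical, and the example assumes foo/bar is itself a package (i.e. it has a BUILD file) so that the label //foo/bar:file.bzl resolves.

```starlark
# ---- foo/bar/file.bzl (hypothetical contents) ----

# Leading underscore: private to this file, cannot be loaded from elsewhere.
_DEFAULT_COPTS = ["-Wall"]

# No leading underscore: exported, so other files can load it.
def some_library(name, srcs, deps = []):
    """Declares a cc_library that always uses the default compiler options."""
    native.cc_library(
        name = name,
        srcs = srcs,
        deps = deps,
        copts = _DEFAULT_COPTS,
    )

# ---- my/app/BUILD (hypothetical caller) ----
load("//foo/bar:file.bzl", "some_library")

some_library(
    name = "app",
    srcs = ["app.cc"],
)
```

Trying to load "_DEFAULT_COPTS" from the BUILD file would fail, because the symbol is not exported.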

Build rules

The majority of Bazel build rules come in families, grouped by language. For example, cc_binary, cc_library and cc_test are the build rules for C++ binaries, libraries, and tests.

As you can imagine, the naming scheme for other languages is similar, with a different prefix that identifies the language, e.g. java_* for Java. The suffix identifies the role of the rule:

*_binary rules build executables. The executable is placed in the build tool’s binary output tree at the path corresponding to the rule’s label, so //my:program will appear at $(BINDIR)/my/program.
*_test rules are a specialization of *_binary rules and are used for automated testing.
- a test is a program that returns 0 on success
- a test can only open files beneath its runfiles tree at runtime
*_library rules specify separately-compiled modules in the given programming language. Libraries can depend on other libs, and binaries and tests can depend on libs.
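A sketch of the C++ family in one package, using hypothetical file and target names except for program, which matches the //my:program example above.

```starlark
# my/BUILD
cc_library(
    name = "greeter",
    srcs = ["greeter.cc"],
    hdrs = ["greeter.h"],
)

cc_binary(
    name = "program",              # //my:program, built to $(BINDIR)/my/program
    srcs = ["main.cc"],
    deps = [":greeter"],           # a binary depending on a library
)

cc_test(
    name = "greeter_test",
    srcs = ["greeter_test.cc"],
    deps = [":greeter"],           # a test depending on the same library
)
```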

Dependencies

We discussed the dependency graph in the above sections; it models the “depends on” relationship among targets.

A target A depends on a target B if B is needed by A at build or execution time.

With the dependency graph defined, we further define a target’s direct dependencies as those direct neighbors in the dependency graph, i.e. targets reachable by a path of length 1 in the DAG. Similarly, a target’s transitive dependencies are those targets on which it depends via a path through the graph.

In the context of builds, there are two types of dependency graphs:

the graph of actual dependencies
- a target X is actually dependent on target Y if and only if Y must be present, built and up-to-date in order for X to be built correctly.
  - built could mean generated, processed, compiled, linked, archived, compressed, executed, or any other kinds of tasks that routinely occur during a build.
the graph of declared dependencies
- a target X has a declared dependency on target Y if and only if there’s a dependency edge from X to Y in the package of X.

In order to have a correct build, the actual dependency graph (denoted by A) must be a subgraph of the declared dependency graph (denoted by D), i.e. every pair of directly-connected nodes in A must also be directly connected in D. We therefore say D is an overapproximation of A.

What all this means is that BUILD file writers should try to make D as close to A as possible: every rule must explicitly declare all of its actual direct dependencies to the build system, and no more.
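A small sketch with three hypothetical libraries shows what declaring every actual direct dependency means. It assumes a.cc includes both b.h and c.h.

```starlark
cc_library(
    name = "c",
    srcs = ["c.cc"],
    hdrs = ["c.h"],
)

cc_library(
    name = "b",
    srcs = ["b.cc"],
    hdrs = ["b.h"],
    deps = [":c"],
)

cc_library(
    name = "a",
    srcs = ["a.cc"],    # assumed to include both b.h and c.h
    hdrs = ["a.h"],
    # a.cc uses c.h directly, so :c is an actual direct dependency and must be declared,
    # even though it is already reachable transitively through :b.
    deps = [
        ":b",
        ":c",
    ],
)
```

If a listed only :b, the build might still succeed today, but it would rely on an edge that exists in A and not in D, and could break as soon as b stops depending on c.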

Types of dependencies

Most build rules have 3 attributes for specifying different kinds of generic dependencies: srcs, deps, and data. Other attributes also exist for rule-specific kinds of dependencies e.g. compiler, resources, etc.

srcs dependencies
- represent files directly consumed by the rule or rules that output source files
deps dependencies
- rules pointing to separately-compiled modules providing header files, symbols, libraries, data, etc.

data dependencies
- the build system runs tests in an isolated directory where only files listed as data are available
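As a sketch, a test that reads a file at runtime would list that file under data. The target and file names are hypothetical.

```starlark
cc_test(
    name = "app_test",
    srcs = ["app_test.cc"],           # files compiled into the test
    deps = [":app"],                  # hypothetical library under test
    data = ["testdata/input.txt"],    # made available to the test at runtime
)
```

If testdata/input.txt were left out of data, the test could not count on finding the file when it runs, even though the file exists in the source tree.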