You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
243 lines
9.6 KiB
243 lines
9.6 KiB
|
|
# Starlark in Go: Implementation
|
|
|
|
This document (a work in progress) describes some of the design
|
|
choices of the Go implementation of Starlark.
|
|
|
|
* [Scanner](#scanner)
|
|
* [Parser](#parser)
|
|
* [Resolver](#resolver)
|
|
* [Evaluator](#evaluator)
|
|
* [Data types](#data-types)
|
|
* [Freezing](#freezing)
|
|
* [Testing](#testing)
|
|
|
|
|
|
## Scanner
|
|
|
|
The scanner is derived from Russ Cox's
|
|
[buildifier](https://github.com/bazelbuild/buildtools/tree/master/buildifier)
|
|
tool, which pretty-prints Bazel BUILD files.
|
|
|
|
Most of the work happens in `(*scanner).nextToken`.
|
|
|
|
## Parser
|
|
|
|
The parser is hand-written recursive-descent parser. It uses the
|
|
technique of [precedence
|
|
climbing](http://www.engr.mun.ca/~theo/Misc/exp_parsing.htm#climbing)
|
|
to reduce the number of productions.
|
|
|
|
In some places the parser accepts a larger set of programs than are
|
|
strictly valid, leaving the task of rejecting them to the subsequent
|
|
resolver pass. For example, in the function call `f(a, b=c)` the
|
|
parser accepts any expression for `a` and `b`, even though `b` may
|
|
legally be only an identifier. For the parser to distinguish these
|
|
cases would require additional lookahead.
|
|
|
|
## Resolver
|
|
|
|
The resolver reports structural errors in the program, such as the use
|
|
of `break` and `continue` outside of a loop.
|
|
|
|
Starlark has stricter syntactic limitations than Python. For example,
|
|
it does not permit `for` loops or `if` statements at top level, nor
|
|
does it permit global variables to be bound more than once.
|
|
These limitations come from the Bazel project's desire to make it easy
|
|
to identify the sole statement that defines each global, permitting
|
|
accurate cross-reference documentation.
|
|
|
|
In addition, the resolver validates all variable names, classifying
|
|
them as references to universal, global, local, or free variables.
|
|
Local and free variables are mapped to a small integer, allowing the
|
|
evaluator to use an efficient (flat) representation for the
|
|
environment.
|
|
|
|
Not all features of the Go implementation are "standard" (that is,
|
|
supported by Bazel's Java implementation), at least for now, so
|
|
non-standard features such as `lambda`, `float`, and `set`
|
|
are flag-controlled. The resolver reports
|
|
any uses of dialect features that have not been enabled.
|
|
|
|
|
|
## Evaluator
|
|
|
|
### Data types
|
|
|
|
<b>Integers:</b> Integers are representing using `big.Int`, an
|
|
arbitrary precision integer. This representation was chosen because,
|
|
for many applications, Starlark must be able to handle without loss
|
|
protocol buffer values containing signed and unsigned 64-bit integers,
|
|
which requires 65 bits of precision.
|
|
|
|
Small integers (<256) are preallocated, but all other values require
|
|
memory allocation. Integer performance is relatively poor, but it
|
|
matters little for Bazel-like workloads which depend much
|
|
more on lists of strings than on integers. (Recall that a typical loop
|
|
over a list in Starlark does not materialize the loop index as an `int`.)
|
|
|
|
An optimization worth trying would be to represent integers using
|
|
either an `int32` or `big.Int`, with the `big.Int` used only when
|
|
`int32` does not suffice. Using `int32`, not `int64`, for "small"
|
|
numbers would make it easier to detect overflow from operations like
|
|
`int32 * int32`, which would trigger the use of `big.Int`.
|
|
|
|
<b>Floating point</b>:
|
|
Floating point numbers are represented using Go's `float64`.
|
|
Again, `float` support is required to support protocol buffers. The
|
|
existence of floating-point NaN and its infamous comparison behavior
|
|
(`NaN != NaN`) had many ramifications for the API, since we cannot
|
|
assume the result of an ordered comparison is either less than,
|
|
greater than, or equal: it may also fail.
|
|
|
|
<b>Strings</b>:
|
|
|
|
TODO: discuss UTF-8 and string.bytes method.
|
|
|
|
<b>Dictionaries and sets</b>:
|
|
Starlark dictionaries have predictable iteration order.
|
|
Furthermore, many Starlark values are hashable in Starlark even though
|
|
the Go values that represent them are not hashable in Go: big
|
|
integers, for example.
|
|
Consequently, we cannot use Go maps to implement Starlark's dictionary.
|
|
|
|
We use a simple hash table whose buckets are linked lists, each
|
|
element of which holds up to 8 key/value pairs. In a well-distributed
|
|
table the list should rarely exceed length 1. In addition, each
|
|
key/value item is part of doubly-linked list that maintains the
|
|
insertion order of the elements for iteration.
|
|
|
|
<b>Struct:</b>
|
|
The `starlarkstruct` Go package provides a non-standard Starlark
|
|
extension data type, `struct`, that maps field identifiers to
|
|
arbitrary values. Fields are accessed using dot notation: `y = s.f`.
|
|
This data type is extensively used in Bazel, but its specification is
|
|
currently evolving.
|
|
|
|
Starlark has no `class` mechanism, nor equivalent of Python's
|
|
`namedtuple`, though it is likely that future versions will support
|
|
some way to define a record data type of several fields, with a
|
|
representation more efficient than a hash table.
|
|
|
|
|
|
### Freezing
|
|
|
|
All mutable values created during module initialization are _frozen_
|
|
upon its completion. It is this property that permits a Starlark module
|
|
to be referenced by two Starlark threads running concurrently (such as
|
|
the initialization threads of two other modules) without the
|
|
possibility of a data race.
|
|
|
|
The Go implementation supports freezing by storing an additional
|
|
"frozen" Boolean variable in each mutable object. Once this flag is set,
|
|
all subsequent attempts at mutation fail. Every value defines a
|
|
Freeze method that sets its own frozen flag if not already set, and
|
|
calls Freeze for each value that it contains.
|
|
For example, when a list is frozen, it freezes each of its elements;
|
|
when a dictionary is frozen, it freezes each of its keys and values;
|
|
and when a function value is frozen, it freezes each of the free
|
|
variables and parameter default values implicitly referenced by its closure.
|
|
Application-defined types must also follow this discipline.
|
|
|
|
The freeze mechanism in the Go implementation is finer grained than in
|
|
the Java implementation: in effect, the latter has one "frozen" flag
|
|
per module, and every value holds a reference to the frozen flag of
|
|
its module. This makes setting the frozen flag more efficient---a
|
|
simple bit flip, no need to traverse the object graph---but coarser
|
|
grained. Also, it complicates the API slightly because to construct a
|
|
list, say, requires a reference to the frozen flag it should use.
|
|
|
|
The Go implementation would also permit the freeze operation to be
|
|
exposed to the program, for example as a built-in function.
|
|
This has proven valuable in writing tests of the freeze mechanism
|
|
itself, but is otherwise mostly a curiosity.
|
|
|
|
|
|
### Fail-fast iterators
|
|
|
|
In some languages (such as Go), a program may mutate a data structure
|
|
while iterating over it; for example, a range loop over a map may
|
|
delete map elements. In other languages (such as Java), iterators do
|
|
extra bookkeeping so that modification of the underlying collection
|
|
invalidates the iterator, and the next attempt to use it fails.
|
|
This often helps to detect subtle mistakes.
|
|
|
|
Starlark takes this a step further. Instead of mutation of the
|
|
collection invalidating the iterator, the act of iterating makes the
|
|
collection temporarily immutable, so that an attempt to, say, delete a
|
|
dict element while looping over the dict, will fail. The error is
|
|
reported against the delete operation, not the iteration.
|
|
|
|
This is implemented by having each mutable iterable value record a
|
|
counter of active iterators. Starting a loop increments this counter,
|
|
and completing a loop decrements it. A collection with a nonzero
|
|
counter behaves as if frozen. If the collection is actually frozen,
|
|
the counter bookkeeping is unnecessary. (Consequently, iterator
|
|
bookkeeping is needed only while objects are still mutable, before
|
|
they can have been published to another thread, and thus no
|
|
synchronization is necessary.)
|
|
|
|
A consequence of this design is that in the Go API, it is imperative
|
|
to call `Done` on each iterator once it is no longer needed.
|
|
|
|
```
|
|
TODO
|
|
starlark.Value interface and subinterfaces
|
|
argument passing to builtins: UnpackArgs, UnpackPositionalArgs.
|
|
```
|
|
|
|
<b>Evaluation strategy:</b>
|
|
The evaluator uses a simple recursive tree walk, returning a value or
|
|
an error for each expression. We have experimented with just-in-time
|
|
compilation of syntax trees to bytecode, but two limitations in the
|
|
current Go compiler prevent this strategy from outperforming the
|
|
tree-walking evaluator.
|
|
|
|
First, the Go compiler does not generate a "computed goto" for a
|
|
switch statement ([Go issue
|
|
5496](https://github.com/golang/go/issues/5496)). A bytecode
|
|
interpreter's main loop is a for-loop around a switch statement with
|
|
dozens or hundreds of cases, and the speed with which each case can be
|
|
dispatched strongly affects overall performance.
|
|
Currently, a switch statement generates a binary tree of ordered
|
|
comparisons, requiring several branches instead of one.
|
|
|
|
Second, the Go compiler's escape analysis assumes that the underlying
|
|
array from a `make([]Value, n)` allocation always escapes
|
|
([Go issue 20533](https://github.com/golang/go/issues/20533)).
|
|
Because the bytecode interpreter's operand stack has a non-constant
|
|
length, it must be allocated with `make`. The resulting allocation
|
|
adds to the cost of each Starlark function call; this can be tolerated
|
|
by amortizing one very large stack allocation across many calls.
|
|
More problematic appears to be the cost of the additional GC write
|
|
barriers incurred by every VM operation: every intermediate result is
|
|
saved to the VM's operand stack, which is on the heap.
|
|
By contrast, intermediate results in the tree-walking evaluator are
|
|
never stored to the heap.
|
|
|
|
```
|
|
TODO
|
|
frames, backtraces, errors.
|
|
threads
|
|
Print
|
|
Load
|
|
```
|
|
|
|
## Testing
|
|
|
|
```
|
|
TODO
|
|
starlarktest package
|
|
`assert` module
|
|
starlarkstruct
|
|
integration with Go testing.T
|
|
```
|
|
|
|
|
|
## TODO
|
|
|
|
|
|
```
|
|
Discuss practical separation of code and data.
|
|
```
|