Incremental compilation
This document outlines how incremental computation might be done in the Isograph compiler. It won't necessarily be exactly how it is implemented.
Why
In watch mode and when running the compiler as a language server, we want to efficiently recompute values in response to changes in input. When an input changes, we could recompile everything from scratch, but it would be much more efficient to recompute the minimal section of the computation graph that actually needs to change.
Consider this computation graph, which we might encounter if we have both file1 and file2 open simultaneously:
syntax highlighting (file1) -> iso literal ASTs in (file1) -> file text (file1)
syntax highlighting (file2) -> iso literal ASTs in (file2) -> file text (file2)
If the text of file1 changes, we don't need to recompute syntax highlighting (file2), and can just reuse the last computed value.
But even more subtly, if the file 1 text changes, but in a way that doesn't affect the AST (e.g. the user adds some JavaScript after the last iso literal), then we can recompute the "iso literal ASTs in (file1)" node, determine that the result didn't change, and thus avoid recomputing the "syntax highlighting (file1)" node.
But how can that be implemented?
Types of nodes
There are two types of nodes in the above graph: source nodes and computed nodes.
Source nodes
The "file text" nodes are the only source nodes shown above. A source node is the "raw input". A source node contains:
- its value, and
time_set
, the "epoch" when it was last set- note: the epoch is a monotonically increasing
u32
, stored in the Database
- note: the epoch is a monotonically increasing
What happens when a source node changes?
When a source node is updated, it's value is compared to the old one. If different, we increment the Database's epoch, update the stored value and set time_modified
to the current epoch.
Computed nodes
A computed node is a deterministic function that always receives, as its first parameter, the database (as &mut self
), and as its second parameter, receives a single struct of parameters. It must always call db.calculate
immediately. This is not enforced, but could be via #[derive]
macro.
fn syntax_highlighting(parent_db_view: &mut DatabaseView, file_name: FileName) -> &SyntaxHighlighting {
parent_db_view.calculate<T>(
"syntax_highlighting",
file_name: T,
// inner function:
|self_db_view, file_name: &FileName /* &T */| {
let asts = get_asts(self_db_view, file_name); // tracked
calculate_syntax_highlighting_from_asts(asts) // pure
}
)
}
// Or if we have a proc macro that does the db_view.calculate stuff:
#[track]
fn syntax_highlighting(db_view: &mut DatabaseView, file_name: &FileName) -> SyntaxHighlighting {
let asts = get_asts(nested_db_view, file_name);
calculate_syntax_highlighting_from_asts(asts) // this is a SyntaxHighlighting
}
Here, db_view.calculate
is called, which creates a derived node for the tuple ("syntax_highlighting", file_name)
.
What does calculate
do when called initially?
If db.calculate
is called for a given (name, param)
tuple, and no derived node for the (name, param)
tuple exists in the Database, it will:
- call the inner function and store its value in the Database at
(name, param)
, while - tracking its dependencies,
- i.e. track whenever
DatabaseView::calculate
is called within that inner function, e.g. viaget_asts_in(nested_db_view, ...)
.
- i.e. track whenever
- set the derived node's
time_calculated
to the maximumtime_calculated
/time_set
epoch of the dependencies (e0
), (this can beu32::MAX
if there are no dependencies),- Note: this is because the value would have been the same if this fn was called as early as
e0
- Note: this is because the value would have been the same if this fn was called as early as
- and set the derived node's
time_verified
to the max of the current epoche1
and oftime_calculated
(which can beu32::MAX
).
In the derived node in the Database for ("syntax_highlighting", file_name)
, it will track that get_asts_in
was called (which is tracked because it called DatabaseView::calculate
).
The resulting value (a SyntaxHighlighting
) is stored in the database, and a reference to it is returned to the caller.
What about when calculate
is called again?
If calculate
is called again (at epoch e2
), we "verify" the node. Assume that the node has time_calculated = e0
and time_verified = e1
(notably, e1 !== e2
).
We verify each dependency, which returns time_calculated/time_set
for each dependency.
If the dependencies all have time_calculated/time_set <= e1
, then, we return a reference to the last calculated value.
If not, then we call the inner function, and compare the new result to the old result. If new_result != old_result
, we update the stored value and set time_calculated
to the max time_calculated/time_set
epoch of the dependencies.
In either case, we set time_verified
to e2
.
What if we verify a node and time_verified === current_epoch
?
This may occur if no changes to source files have occured! So, we can just return the same value. This might occur, for example, if we want hover information for a node and syntax highlighting. Both can use the same AST.
Example
Consider this dependency tree:
hover info syntax highlighting go to definition
\ / /
AST --------------------/
|
source text
Hover info is called
- source text is set,
time_set = e0
- several other changes occur, epoch is now
e1
. - hover info is calculated. The derived node isn't in the Database, so we call its inner function, which
- causes AST to be calculated. That derived node isn't in the Database, so we call its inner function, which checks the source text (
time_set = e0
) - AST has
time_verified = e1
andtime_set = e0
- hover info is calculated, which also has
time_verified = e1
andtime_set = e0
.
Source text is updated and syntax highlighting is called
- source text is set,
time_set = e2
- syntax highlighting is calculated. The derived node isn't in the Database, so we call its inner function, which
- causes AST to be calculated. The derived node is in the Database, and
e2 > ast.time_verified
(e1
), so we check its dependencies. They have maxtime_set = e2
, which is greater thane1
(AST'stime_verified
), so we recalculate AST by calling its inner function. AST now hastime_verified = e2
andtime_set = e2
. - syntax highlighting is calculated, which also has
time_verified
e2
andtime_set
e2
No updates, but go to definition is called
- go to definition is calculated. The derived node isn't in the Database, so we call its inner function.
- This calls
calculate_ast(db_view)
. Theast
derived node is in the Database, andast.time_verified === current_epoch
, so we return the current value ofast
. - go to definition's inner function is completed, so that has
time_verified = e2
andtime_calculated = e2
.
Implementation details
Database
- Stores the global epoch (an incrementing counter)
- Has a
derived_nodes: HashMap<(&'static str, HashString)> -> DerivedNode
HashString
is the params hashed
- Has
source_nodes
? But these maybe could be separate
DerivedNode
result: Box<dyn Any>
- perhaps
params: Box<dyn Any>
(if we need to recover them) - the
time_verified
epoch - the
time_calculated
epoch.- Invariant:
time_calculated <= time_verified
- Invariant:
- perhaps:
time_accessed
epoch (for garbage collection) Vec<Dependency>
Dependencies
A sketch of how to store dependencies
struct DatabaseView<'parent, 'db: 'parent> {
database: &'db Database,
dependencies: Vec<Box<dyn Fn() -> bool>> + 'parent
parent_view: &'parent DatabaseView
}
impl<'parent, 'db: 'parent> DatabaseView<'parent, 'db> {
fn calculate<TParam: Any, TOutput: Any>(
&'parent mut self,
static_key: &'static str,
param: TParam,
inner_fn: impl Fn(&'db mut DatabaseView /* self db view */, &TParam) -> TOutput + 'parent;
) -> &'db TOutput {
let parent_db_view = self;
// tell parent_db_view that "syntax_highlighting" was called
parent_db_view.dependencies.push(Box::new(DerivedNodeDependency {
static_key,
param_hash: hash(param),
recalculate_inner_no_params: Box::new(|database_view| {
inner_fn(database_view, param)
}),
}));
if let Some(mut previously_derived_node) = self.database.get_mut(static_key, param) {
let has_changed = false;
for revalidate_dependency in previously_derived_node.dependencies.iter_mut() {
has_changed = revalidate_dependency();
if has_changed {
break;
}
}
if has_changed {
let nested_view = self.nested_view();
let derived_node_value = Box::new(inner_fn(&mut nested_view, ¶m));
parent_db_view.max_time_calculated = std::cmp::max(
parent_db_view.max_time_calculated,
nested_view.max_time_calculated,
);
// database.store returns a reference to the derived_node_value
return self.database.store(static_key, param, DerivedNode {
derived_node_value, // Box<dyn Any>,
time_verified: self.database.epoch,
time_calculated: nested_view.max_time_calculated,
dependencies: nested_view.dependencies,
}).downcast_ref::<TOutput>().expect("Expected to be the correct type")
} else {
return &previously_derived_node.derived_node_value;
}
} else {
panic!("same thing as above");
}
}
fn nested_view(&mut self) -> DatabaseView {
Self {
database: self.database,
dependencies: vec![],
max_time_calculated: 0,
}
}
}
impl Database {
fn revalidate<'db>(
&'db mut self,
parent_static_key: &'static str,
parent_param: HashString,
parent_recalculate_inner: impl Fn(&'db mut DatabaseView) -> Box<dyn Any + Eq> + 'db
) -> ChangeStatus {
match self.get_mut(static_key, param) {
Some(previously_derived_node) => {
// check each dependency
// if all dependencies are not invalidated, return HasNotChanged
// if any dependency is invalidated, recalculate_inner, compare, and
// if different, store and return HasChanged
// always set revalidated at to current epoch
let dependency_change_status = ChangeStatus::Unchanged;
for dependency in previously_derived_node.dependencies.iter_mut() {
if let Some(dependency_pdn) = self.get_mut(dependency.static_key, dependency.param_hash) {
if (dependency_pdn.time_validated === self.current_epoch) {
continue;
} else {
// potentially
dependency_change_status = ChangeStatus::Changed;
break;
}
} else {
panic!("Expected nested dep to exist");
}
}
match dependency_change_status {
ChangeStatus::Unchanged => {
return ChangeStatus::Unchanged
},
ChangeStatus::Changed => {
let mut database_view = self.database_view();
let new_value = parent_recalculate_inner(&mut database_view);
previously_derived_node.dependencies = database_view.dependencies;
previously_derived_node.time_verified = self.current_epoch;
if new_value != previously_derived_node.derived_node_value {
previously_derived_node.derived_node_value = new_value;
previously_derived_node.time_calculated = database_view.max_time_calculated;
return ChangeStatus::Changed
}
return ChangeStatus::Unchanged
},
}
}
None => {
panic!("Expected to exist if revalidating")
}
}
}
}
enum ChangeStatus {
Changed,
Unchanged,
}
DatabaseView
Notes
- Perhaps we can find a way to have typed storage, instead of
Box<dyn Any>
everywhere. - Perhaps we should intern all the results.
- How would garbage collection occur? It seems safe to drop items from the
derived_nodes
map, since they will simply be recalculated. (It is not safe to drop fromderived_node_inner_functions
)
To discuss
- references
- hash maps
- tracked inputs
unneeded
derived_node_inner_functions: HashMap<&'static str, Box<dyn Fn(&mut Database, Param) -> Box<dyn Any>>>
- This looks difficult to statically type.
- Why is this struct necessary?
- e.g. when checking
hover_info
, we need to check all of its dependencies (without callinghover_info.inner_function
!) This check could find thatAST
is stale (i.e. has dependencies that are updated). At this point, we need to callAST
's inner function. At this point, we compareast.new_value == ast.old_value
, and if these are the same, we can short circuit and avoid callinghover_info.inner_function
. - Thus, we need the ability to call
ast.inner_function
without going throughhover_info.inner_function
.
- e.g. when checking