DeclEngine insert-elimination breaks recursion detection and causes stack overflow on recursive functions
Eliminating re-inserts of same method declarations (same after monomorphization and updating constant expressions) in method_application::monomorphize_method resulted in a side-effect of recursion detection not working anymore. Apparently, the detection, for unknown reasons, depends on the unwanted behavior of having same method declarations inserted several times.
The elimination of re-inserts is done in #7659.
Below is the summary of an unsuccessful attempt done by Claude Opus 4.8 to analyze and fix the root cause.
Summary
As part of reducing pressure on the DeclEngine, monomorphize_method (and the
same pattern elsewhere) was changed to skip insert(...).with_parent(...)
when the monomorphization/transformation reports no change, returning the
original DeclRef unchanged.
A side effect of that change is that recursive functions are no longer
detected. Instead of the clean CompileError::RecursiveCall /
CompileError::RecursiveCallChain, the compiler now overflows its stack
while compiling a program that contains a recursive function.
Reverting only the monomorphize_method hunk restores the clean error, which
isolates the regression to insert/with_parent-elimination.
Related: recursion is unsupported in Sway, see
#3018.
Reproducer
test/src/e2e_vm_tests/test_programs/should_fail/impl_self_recursive — forc build
overflows the stack. The test has been disabled as a mitigation (see
Mitigation); its main does not even call the recursive
methods, yet the build still overflows.
Minimal version:
script;
struct Foo {}
impl Foo {
pub fn rec_0(self) -> u32 { self.rec_0() }
}
fn main() -> u32 { 0 }
Symptom and where it overflows
Pipeline (relevant part): TyProgram::type_check →
check_recursive (the pass that emits RecursiveCall) → run_decl_checks →
collect_types_metadata → control-flow analysis → IR generation → ASM.
Verified with phase markers:
- With the optimization,
check_recursive passes and emits nothing, and the
overflow happens in a later phase (IR generation).
- With the optimization reverted,
check_recursive emits the expected
RecursiveCall and compilation stops cleanly — no overflow.
So check_recursive is the intended single gate, and the bug is that the
optimization makes it miss the recursion. Everything downstream assumes
recursion was already rejected.
Root cause
Finding A — the recursion-graph normalization is coupled to the with_parent side effect
check_recursive (sway-core/src/semantic_analysis/type_check_analysis.rs)
builds a dependency graph in which every function is one node and a call is an
edge; recursion shows up as a cycle (a self-loop in the simplest case). By the
time the typed AST is analyzed, a callee is frequently a monomorphized copy of
the original declaration with a fresh DeclId, so it is normalized back to its
origin node by walking the DeclEngine's parents chain (recorded via
with_parent).
The optimization's no-op path returns the original DeclRef without
inserting and re-parenting a copy. The recursive call's fn_ref then has no
parents link back to the function being analyzed, so it normalizes to a
different node, the self-loop never forms, and the cycle is invisible.
A robust normalization key that does not depend on this side effect is the
function's parsed declaration id (ParsedDeclId), which is shared by every
typed function derived from the same source function (it is copied on every
DeclEngine::insert). Adding it as a fallback in get_node_for_fn_decl
(consulted only after the existing parents-chain and items_node_stack
lookups fail) fixes Finding A and is harmless to the passing cases. However,
it does not fix the bug on its own — see Finding B.
Finding B — the dominant problem: the recursive declarations are absent from what check_recursive walks
check_recursive iterates the typed module tree (root_module, its submodules,
and each module's all_nodes). Instrumenting that traversal shows:
- With the optimization, the script's own declarations (
struct Foo,
impl Foo, main, rec_0..rec_5) are absent from the module tree
check_recursive walks. In instrumented runs it visited ~60 functions, all
from std, and none of main/rec_*/impl Foo; root_module.all_nodes
contained only import side-effects.
- With the optimization reverted, those declarations are present
(impl_for=Foo, …) and the recursion is detected.
- This holds whether or not the recursive method is reachable from
main
(live vs dead): the script's code is missing from the traversal either way.
Crucially, those declarations still exist in the DeclEngine and are processed
by later phases (which is what overflows). So the typed AST that
check_recursive walks and the set the later phases process have diverged.
We have not pinned Finding B to an exact line. Leads:
- The construction of the typed module tree /
all_nodes (or a
reachability/collection step feeding it) appears to rely on the parents
graph, which the no-op path no longer populates.
- This correlates with the observation that DCA (dead-code analysis) accuracy
improved when inserts were eliminated (e.g. commit c25fe18a7…, which
surfaced an additional valid "struct field never accessed" warning). The same
insert-elimination is changing which declarations are considered/retained in
the typed program, which is consistent with declarations dropping out of the
module tree.
The downstream is multi-site (why a localized IR guard is not enough)
Because check_recursive is bypassed, multiple later passes that
independently walk the call graph are exposed. Confirmed/observed so far:
PanickingFunctionCache::can_panic (sway-core/src/ir_generation.rs), called
at the start of every compile_function (ir_generation/compile.rs, gated on
backtrace != None): walks callees recursively and writes its cache only
after computing the result, so a recursive function recurses forever. It
already carries a // TODO: Add support for recursive functions ... #3018
comment on the FunctionApplication arm.
CompiledFunctionCache::get_compiled_function: same "inline callee, cache
after compiling" shape.
- At least one earlier site: with the test as-is, the overflow happens
before any function is IR-compiled. Instrumenting compile_function shows it
is reached for trivial programs (prints main_0, helper_1) but is reached
zero times for the recursive test — i.e. compile_program / compile_script
overflows during setup, before main is even compiled. This site was not
localized.
Guarding can_panic + get_compiled_function (break the cycle in can_panic,
emit RecursiveCall from get_compiled_function) was implemented and did not
stop the overflow — it just moved to the earlier, unlocalized site. Guarding
individual sites is therefore whack-a-mole; the correct fix is at the single gate
(check_recursive).
What was tried
| Attempt |
Result |
Revert only the monomorphize_method hunk |
Clean RecursiveCall, no overflow (confirms the cause) |
Parsed-id fallback in get_node_for_fn_decl (Finding A) |
Correct + harmless, but does not fix the bug (Finding B: code never reaches check_recursive) |
| Global parsed-id keying of graph nodes |
Fixes Finding A but regresses std (analysis-time stack overflow from over-merging monomorphizations) |
IR guard at can_panic + get_compiled_function |
Does not stop the overflow (dominant site is earlier; multi-site) |
Recommended fix
- Proper fix (single gate): make
check_recursive see the recursive
declarations again — i.e. find and fix why the optimization drops them from
the typed module tree (Finding B). Optionally also land the parsed-id fallback
(Finding A) so detection no longer depends on the with_parent side effect.
- Backstop (optional, defensive): guard the IR-gen recursion sites
(can_panic, get_compiled_function, and whichever earlier site overflows
first) so that a recursive function that ever slips past check_recursive
yields a RecursiveCall error rather than a stack overflow. This is the
#3018 direction and is whack-a-mole without (1).
Severity / priority
Low practical impact: developer-written recursion is rare and is normally
rejected; the regression turns a clean compile error into a compiler crash
(stack overflow). Acceptable to defer behind the DeclEngine optimization work,
but the crash-instead-of-error behavior should be tracked.
Mitigation applied
test/src/e2e_vm_tests/test_programs/should_fail/impl_self_recursive/test.toml
was set to category = "disabled" with an explanatory comment. The original
# check: expectations are retained in the file as documentation of the intended
behavior. Re-enable (category = "fail") once detection is restored. No
test.workaround.toml was added because there is no truthful expected output —
the build currently crashes with a stack overflow.
DeclEngineinsert-elimination breaks recursion detection and causes stack overflow on recursive functionsEliminating re-inserts of same method declarations (same after monomorphization and updating constant expressions) in
method_application::monomorphize_methodresulted in a side-effect of recursion detection not working anymore. Apparently, the detection, for unknown reasons, depends on the unwanted behavior of having same method declarations inserted several times.The elimination of re-inserts is done in #7659.
Below is the summary of an unsuccessful attempt done by Claude Opus 4.8 to analyze and fix the root cause.
Summary
As part of reducing pressure on the
DeclEngine,monomorphize_method(and thesame pattern elsewhere) was changed to skip
insert(...).with_parent(...)when the monomorphization/transformation reports no change, returning the
original
DeclRefunchanged.A side effect of that change is that recursive functions are no longer
detected. Instead of the clean
CompileError::RecursiveCall/CompileError::RecursiveCallChain, the compiler now overflows its stackwhile compiling a program that contains a recursive function.
Reverting only the
monomorphize_methodhunk restores the clean error, whichisolates the regression to insert/
with_parent-elimination.Related: recursion is unsupported in Sway, see
#3018.
Reproducer
test/src/e2e_vm_tests/test_programs/should_fail/impl_self_recursive—forc buildoverflows the stack. The test has been disabled as a mitigation (see
Mitigation); its
maindoes not even call the recursivemethods, yet the build still overflows.
Minimal version:
Symptom and where it overflows
Pipeline (relevant part):
TyProgram::type_check→check_recursive(the pass that emitsRecursiveCall) →run_decl_checks→collect_types_metadata→ control-flow analysis → IR generation → ASM.Verified with phase markers:
check_recursivepasses and emits nothing, and theoverflow happens in a later phase (IR generation).
check_recursiveemits the expectedRecursiveCalland compilation stops cleanly — no overflow.So
check_recursiveis the intended single gate, and the bug is that theoptimization makes it miss the recursion. Everything downstream assumes
recursion was already rejected.
Root cause
Finding A — the recursion-graph normalization is coupled to the
with_parentside effectcheck_recursive(sway-core/src/semantic_analysis/type_check_analysis.rs)builds a dependency graph in which every function is one node and a call is an
edge; recursion shows up as a cycle (a self-loop in the simplest case). By the
time the typed AST is analyzed, a callee is frequently a monomorphized copy of
the original declaration with a fresh
DeclId, so it is normalized back to itsorigin node by walking the
DeclEngine'sparentschain (recorded viawith_parent).The optimization's no-op path returns the original
DeclRefwithoutinserting and re-parenting a copy. The recursive call's
fn_refthen has noparentslink back to the function being analyzed, so it normalizes to adifferent node, the self-loop never forms, and the cycle is invisible.
A robust normalization key that does not depend on this side effect is the
function's parsed declaration id (
ParsedDeclId), which is shared by everytyped function derived from the same source function (it is copied on every
DeclEngine::insert). Adding it as a fallback inget_node_for_fn_decl(consulted only after the existing
parents-chain anditems_node_stacklookups fail) fixes Finding A and is harmless to the passing cases. However,
it does not fix the bug on its own — see Finding B.
Finding B — the dominant problem: the recursive declarations are absent from what
check_recursivewalkscheck_recursiveiterates the typed module tree (root_module, its submodules,and each module's
all_nodes). Instrumenting that traversal shows:struct Foo,impl Foo,main,rec_0..rec_5) are absent from the module treecheck_recursivewalks. In instrumented runs it visited ~60 functions, allfrom
std, and none ofmain/rec_*/impl Foo;root_module.all_nodescontained only import side-effects.
(
impl_for=Foo, …) and the recursion is detected.main(live vs dead): the script's code is missing from the traversal either way.
Crucially, those declarations still exist in the
DeclEngineand are processedby later phases (which is what overflows). So the typed AST that
check_recursivewalks and the set the later phases process have diverged.We have not pinned Finding B to an exact line. Leads:
all_nodes(or areachability/collection step feeding it) appears to rely on the
parentsgraph, which the no-op path no longer populates.
improved when inserts were eliminated (e.g. commit
c25fe18a7…, whichsurfaced an additional valid "struct field never accessed" warning). The same
insert-elimination is changing which declarations are considered/retained in
the typed program, which is consistent with declarations dropping out of the
module tree.
The downstream is multi-site (why a localized IR guard is not enough)
Because
check_recursiveis bypassed, multiple later passes thatindependently walk the call graph are exposed. Confirmed/observed so far:
PanickingFunctionCache::can_panic(sway-core/src/ir_generation.rs), calledat the start of every
compile_function(ir_generation/compile.rs, gated onbacktrace != None): walks callees recursively and writes its cache onlyafter computing the result, so a recursive function recurses forever. It
already carries a
// TODO: Add support for recursive functions ... #3018comment on the
FunctionApplicationarm.CompiledFunctionCache::get_compiled_function: same "inline callee, cacheafter compiling" shape.
before any function is IR-compiled. Instrumenting
compile_functionshows itis reached for trivial programs (prints
main_0,helper_1) but is reachedzero times for the recursive test — i.e.
compile_program/compile_scriptoverflows during setup, before
mainis even compiled. This site was notlocalized.
Guarding
can_panic+get_compiled_function(break the cycle incan_panic,emit
RecursiveCallfromget_compiled_function) was implemented and did notstop the overflow — it just moved to the earlier, unlocalized site. Guarding
individual sites is therefore whack-a-mole; the correct fix is at the single gate
(
check_recursive).What was tried
monomorphize_methodhunkRecursiveCall, no overflow (confirms the cause)get_node_for_fn_decl(Finding A)check_recursive)std(analysis-time stack overflow from over-merging monomorphizations)can_panic+get_compiled_functionRecommended fix
check_recursivesee the recursivedeclarations again — i.e. find and fix why the optimization drops them from
the typed module tree (Finding B). Optionally also land the parsed-id fallback
(Finding A) so detection no longer depends on the
with_parentside effect.(
can_panic,get_compiled_function, and whichever earlier site overflowsfirst) so that a recursive function that ever slips past
check_recursiveyields a
RecursiveCallerror rather than a stack overflow. This is the#3018direction and is whack-a-mole without (1).Severity / priority
Low practical impact: developer-written recursion is rare and is normally
rejected; the regression turns a clean compile error into a compiler crash
(stack overflow). Acceptable to defer behind the
DeclEngineoptimization work,but the crash-instead-of-error behavior should be tracked.
Mitigation applied
test/src/e2e_vm_tests/test_programs/should_fail/impl_self_recursive/test.tomlwas set to
category = "disabled"with an explanatory comment. The original# check:expectations are retained in the file as documentation of the intendedbehavior. Re-enable (
category = "fail") once detection is restored. Notest.workaround.tomlwas added because there is no truthful expected output —the build currently crashes with a stack overflow.