Skip to content

DeclEngine insert-elimination in monomorphize_method breaks recursion detection #7658

Description

@ironcev

DeclEngine insert-elimination breaks recursion detection and causes stack overflow on recursive functions

Eliminating re-inserts of same method declarations (same after monomorphization and updating constant expressions) in method_application::monomorphize_method resulted in a side-effect of recursion detection not working anymore. Apparently, the detection, for unknown reasons, depends on the unwanted behavior of having same method declarations inserted several times.

The elimination of re-inserts is done in #7659.

Below is the summary of an unsuccessful attempt done by Claude Opus 4.8 to analyze and fix the root cause.

Summary

As part of reducing pressure on the DeclEngine, monomorphize_method (and the
same pattern elsewhere) was changed to skip insert(...).with_parent(...)
when the monomorphization/transformation reports no change, returning the
original DeclRef unchanged.

A side effect of that change is that recursive functions are no longer
detected
. Instead of the clean CompileError::RecursiveCall /
CompileError::RecursiveCallChain, the compiler now overflows its stack
while compiling a program that contains a recursive function.

Reverting only the monomorphize_method hunk restores the clean error, which
isolates the regression to insert/with_parent-elimination.

Related: recursion is unsupported in Sway, see
#3018.

Reproducer

test/src/e2e_vm_tests/test_programs/should_fail/impl_self_recursiveforc build
overflows the stack. The test has been disabled as a mitigation (see
Mitigation); its main does not even call the recursive
methods, yet the build still overflows.

Minimal version:

script;
struct Foo {}
impl Foo {
    pub fn rec_0(self) -> u32 { self.rec_0() }
}
fn main() -> u32 { 0 }

Symptom and where it overflows

Pipeline (relevant part): TyProgram::type_check
check_recursive (the pass that emits RecursiveCall) → run_decl_checks
collect_types_metadata → control-flow analysis → IR generation → ASM.

Verified with phase markers:

  • With the optimization, check_recursive passes and emits nothing, and the
    overflow happens in a later phase (IR generation).
  • With the optimization reverted, check_recursive emits the expected
    RecursiveCall and compilation stops cleanly — no overflow.

So check_recursive is the intended single gate, and the bug is that the
optimization makes it miss the recursion. Everything downstream assumes
recursion was already rejected.

Root cause

Finding A — the recursion-graph normalization is coupled to the with_parent side effect

check_recursive (sway-core/src/semantic_analysis/type_check_analysis.rs)
builds a dependency graph in which every function is one node and a call is an
edge; recursion shows up as a cycle (a self-loop in the simplest case). By the
time the typed AST is analyzed, a callee is frequently a monomorphized copy of
the original declaration with a fresh DeclId, so it is normalized back to its
origin node by walking the DeclEngine's parents chain (recorded via
with_parent).

The optimization's no-op path returns the original DeclRef without
inserting and re-parenting a copy. The recursive call's fn_ref then has no
parents link back to the function being analyzed, so it normalizes to a
different node, the self-loop never forms, and the cycle is invisible.

A robust normalization key that does not depend on this side effect is the
function's parsed declaration id (ParsedDeclId), which is shared by every
typed function derived from the same source function (it is copied on every
DeclEngine::insert). Adding it as a fallback in get_node_for_fn_decl
(consulted only after the existing parents-chain and items_node_stack
lookups fail) fixes Finding A and is harmless to the passing cases. However,
it does not fix the bug on its own — see Finding B.

Finding B — the dominant problem: the recursive declarations are absent from what check_recursive walks

check_recursive iterates the typed module tree (root_module, its submodules,
and each module's all_nodes). Instrumenting that traversal shows:

  • With the optimization, the script's own declarations (struct Foo,
    impl Foo, main, rec_0..rec_5) are absent from the module tree
    check_recursive walks. In instrumented runs it visited ~60 functions, all
    from std, and none of main/rec_*/impl Foo; root_module.all_nodes
    contained only import side-effects.
  • With the optimization reverted, those declarations are present
    (impl_for=Foo, …) and the recursion is detected.
  • This holds whether or not the recursive method is reachable from main
    (live vs dead): the script's code is missing from the traversal either way.

Crucially, those declarations still exist in the DeclEngine and are processed
by later phases
(which is what overflows). So the typed AST that
check_recursive walks and the set the later phases process have diverged.

We have not pinned Finding B to an exact line. Leads:

  • The construction of the typed module tree / all_nodes (or a
    reachability/collection step feeding it) appears to rely on the parents
    graph, which the no-op path no longer populates.
  • This correlates with the observation that DCA (dead-code analysis) accuracy
    improved
    when inserts were eliminated (e.g. commit c25fe18a7…, which
    surfaced an additional valid "struct field never accessed" warning). The same
    insert-elimination is changing which declarations are considered/retained in
    the typed program, which is consistent with declarations dropping out of the
    module tree.

The downstream is multi-site (why a localized IR guard is not enough)

Because check_recursive is bypassed, multiple later passes that
independently walk the call graph are exposed. Confirmed/observed so far:

  • PanickingFunctionCache::can_panic (sway-core/src/ir_generation.rs), called
    at the start of every compile_function (ir_generation/compile.rs, gated on
    backtrace != None): walks callees recursively and writes its cache only
    after computing the result, so a recursive function recurses forever. It
    already carries a // TODO: Add support for recursive functions ... #3018
    comment on the FunctionApplication arm.
  • CompiledFunctionCache::get_compiled_function: same "inline callee, cache
    after compiling" shape.
  • At least one earlier site: with the test as-is, the overflow happens
    before any function is IR-compiled. Instrumenting compile_function shows it
    is reached for trivial programs (prints main_0, helper_1) but is reached
    zero times for the recursive test — i.e. compile_program / compile_script
    overflows during setup, before main is even compiled. This site was not
    localized.

Guarding can_panic + get_compiled_function (break the cycle in can_panic,
emit RecursiveCall from get_compiled_function) was implemented and did not
stop the overflow
— it just moved to the earlier, unlocalized site. Guarding
individual sites is therefore whack-a-mole; the correct fix is at the single gate
(check_recursive).

What was tried

Attempt Result
Revert only the monomorphize_method hunk Clean RecursiveCall, no overflow (confirms the cause)
Parsed-id fallback in get_node_for_fn_decl (Finding A) Correct + harmless, but does not fix the bug (Finding B: code never reaches check_recursive)
Global parsed-id keying of graph nodes Fixes Finding A but regresses std (analysis-time stack overflow from over-merging monomorphizations)
IR guard at can_panic + get_compiled_function Does not stop the overflow (dominant site is earlier; multi-site)

Recommended fix

  1. Proper fix (single gate): make check_recursive see the recursive
    declarations again — i.e. find and fix why the optimization drops them from
    the typed module tree (Finding B). Optionally also land the parsed-id fallback
    (Finding A) so detection no longer depends on the with_parent side effect.
  2. Backstop (optional, defensive): guard the IR-gen recursion sites
    (can_panic, get_compiled_function, and whichever earlier site overflows
    first) so that a recursive function that ever slips past check_recursive
    yields a RecursiveCall error rather than a stack overflow. This is the
    #3018 direction and is whack-a-mole without (1).

Severity / priority

Low practical impact: developer-written recursion is rare and is normally
rejected; the regression turns a clean compile error into a compiler crash
(stack overflow). Acceptable to defer behind the DeclEngine optimization work,
but the crash-instead-of-error behavior should be tracked.

Mitigation applied

test/src/e2e_vm_tests/test_programs/should_fail/impl_self_recursive/test.toml
was set to category = "disabled" with an explanatory comment. The original
# check: expectations are retained in the file as documentation of the intended
behavior. Re-enable (category = "fail") once detection is restored. No
test.workaround.toml was added because there is no truthful expected output —
the build currently crashes with a stack overflow.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingcompilerGeneral compiler. Should eventually become more specific as the issue is triagedcompiler: frontendEverything to do with type checking, control flow analysis, and everything between parsing and IRgen

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions