Changes from 4 commits
4 changes: 3 additions & 1 deletion .github/workflows/labeler.yml
@@ -1,7 +1,7 @@
name: labeler

on:
pull_request:
pull_request_target:

Contributor

A change to this file is scope creep. Please revert it; if you want this change, open a separate PR with a rationale.

Author

Reverted. Will open a separate PR with rationale if we want to pursue that change.

types:
- opened
- reopened
@@ -14,12 +14,14 @@ on:
permissions:
contents: read
pull-requests: write
issues: write

jobs:
label:
permissions:
contents: read
pull-requests: write
issues: write
runs-on: blacksmith-4vcpu-ubuntu-2404
steps:
- uses: actions/checkout@v5
19 changes: 16 additions & 3 deletions conn/node.go
@@ -79,7 +79,20 @@ type Node struct {
}

// NewNode returns a new Node instance.
func NewNode(rc *pb.RaftContext, store *raftwal.DiskStorage, tlsConfig *tls.Config) *Node {
// electionTick controls how many ticks (each 100ms) before an election is triggered.
// If electionTick <= 0, defaults to 20 (i.e., 2s election timeout).
func NewNode(rc *pb.RaftContext, store *raftwal.DiskStorage, tlsConfig *tls.Config,
electionTick int) *Node {

const heartbeatTick = 1 // 100ms per tick
if electionTick <= 0 {

Contributor

The current guard only handles electionTick <= 0. But HeartbeatTick is hardcoded to 1 just below, and etcd raft requires ElectionTick > HeartbeatTick. If an operator sets election-tick=1, Config.validate() returns "election tick must be greater than heartbeat tick" and newRaft panics during StartNode: a cryptic crash on boot that never mentions the flag they set.

Since this is the only place a raft.Config is built (every Alpha node and Zero flow through NewNode), the heartbeat floor of 1 applies everywhere, so election-tick can never legally be 1. Better to fail fast with a clear message than let raft panic.

Author

Good catch, this has been addressed in the latest commit.

Added validation at lines 91–94:

const heartbeatTick = 1 // 100ms per tick

if electionTick <= 0 {
    electionTick = 20
}

if electionTick <= heartbeatTick {
    glog.Fatalf(
        "election-tick=%d is invalid: must be greater than heartbeat-tick (%d). "+
            "Recommended minimum is 10 (1s election timeout).",
        electionTick,
        heartbeatTick,
    )
}

Now, if a user sets:

--raft "election-tick=1"

the process fails fast during startup with a clear validation error instead of encountering a cryptic Raft panic later during initialization.

Thanks for reviewing. Let me know if any changes or further explanation are required.

Author

On second thought, do you think it would make sense to add a stricter fail-safe and reject election tick values below 10 altogether? Since the recommendation is already 10 (1s election timeout), allowing smaller values may not be particularly useful and could lead to unstable configurations.

Similarly, would it be worth introducing an upper bound as well (for example, around 24 hours) to prevent accidentally misconfigured values resulting in extremely long election timeouts?

I'd be interested in your thoughts on both of these. Do you prefer keeping the validation minimal (only ensuring it's greater than the heartbeat tick), or would you favor enforcing a practical operating range for election tick values?

Contributor

Hi @LakshimiRam-073, thanks for adding the validation. The fail-fast check on electionTick <= heartbeatTick makes sense to me since that's a hard correctness requirement anyway (raft.Config.validate() returns an error if it's violated, and newRaft panics on that error).

For the two follow-up questions, I'd personally keep the validation fairly minimal and avoid enforcing a practical range.

For the lower bound of 10, I don't think we should reject smaller values. One of the main reasons for exposing this flag is to give operators control over election timing, and there are valid use cases for running below 10. For example, integration/CI tests often benefit from much faster failover, single-node development setups don't have the same stability concerns, and some low-latency deployments may intentionally optimize for sub-second failover. etcd itself only requires electionTick > heartbeatTick. Also, since we run with PreVote: true, we're already protected from one of the more problematic failure modes where a flaky follower can repeatedly disrupt a healthy leader.

I'd also avoid adding an upper bound. Any limit we pick, 24 hours or otherwise, is ultimately arbitrary. If someone configures an extremely large election timeout, they've effectively decided they want leader changes to be very rare or even handled manually. That's unusual, but it could still be intentional. Hard limits in cases like this often end up causing more frustration than value.

That said, I don't love silently accepting very small values either. Since raft randomizes the timeout in [electionTick, 2*electionTick), setting electionTick=2 means an election can start after missing only a couple of heartbeats. At that point, something as simple as a GC pause or brief network hiccup could trigger unnecessary leader elections.

Instead of rejecting those values, I'd lean toward logging a warning:

if electionTick < 10 {
    glog.Warningf("election-tick=%d gives a %dms election timeout. Values below 10 (1s) "+
        "may cause spurious leader elections under GC pauses or network jitter.",
        electionTick, electionTick*100)
}

That way we preserve flexibility while still making operators aware of the tradeoffs.

@LakshimiRam-073 LakshimiRam-073 Jun 15, 2026

Author

Updated in commit "fix(raft): warn on negative and low election-tick values".

I kept the hard correctness check on electionTick <= heartbeatTick, and preserved flexibility for smaller valid values by logging a warning instead of rejecting them. Negative values now warn and fall back to the default, while 0 continues to mean unset/default.
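
The rules described in this comment could be sketched roughly as follows. This is not the code from the commit, which is not shown on this page. `resolveElectionTick` and its constants are hypothetical names, and the sketch returns an error and a list of warnings instead of calling glog, so that the behavior can be exercised in a test.

```go
package main

import "fmt"

const (
	heartbeatTick       = 1  // one tick per heartbeat, as in the PR
	defaultElectionTick = 20 // 2s at 100ms per tick
	lowTickThreshold    = 10 // below this (1s), warn about spurious elections
)

// resolveElectionTick is a hypothetical helper, not the PR's implementation.
// It returns the effective election tick, any warnings the caller should log,
// and an error for values that raft itself would reject.
func resolveElectionTick(requested int) (int, []string, error) {
	var warnings []string
	tick := requested
	switch {
	case requested == 0:
		// 0 means "unset": use the default silently.
		tick = defaultElectionTick
	case requested < 0:
		// Negative values are not meaningful: warn and fall back to the default.
		warnings = append(warnings, fmt.Sprintf(
			"election-tick=%d is negative; using default %d", requested, defaultElectionTick))
		tick = defaultElectionTick
	}
	if tick <= heartbeatTick {
		// Hard correctness requirement: the election tick must exceed the heartbeat tick.
		return 0, warnings, fmt.Errorf(
			"election-tick=%d is invalid: must be greater than heartbeat-tick (%d)",
			tick, heartbeatTick)
	}
	if tick < lowTickThreshold {
		warnings = append(warnings, fmt.Sprintf(
			"election-tick=%d gives a %dms election timeout; values below %d may cause spurious leader elections",
			tick, tick*100, lowTickThreshold))
	}
	return tick, warnings, nil
}

func main() {
	for _, in := range []int{0, -5, 1, 2, 20} {
		tick, warnings, err := resolveElectionTick(in)
		if err != nil {
			fmt.Printf("input %d: rejected: %v\n", in, err)
			continue
		}
		for _, w := range warnings {
			fmt.Printf("input %d: warning: %s\n", in, w)
		}
		fmt.Printf("input %d: election tick %d\n", in, tick)
	}
}
```

In the PR itself the invalid case ends the process via glog.Fatalf; returning an error here is only a choice made to keep the sketch testable.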

electionTick = 20
}
if electionTick <= heartbeatTick {
glog.Fatalf("election-tick=%d is invalid: must be greater than heartbeat-tick (%d). "+
"Recommended minimum is 10 (1s election timeout).", electionTick, heartbeatTick)
}

snap, err := store.Snapshot()
x.Check(err)

@@ -90,8 +103,8 @@ func NewNode(rc *pb.RaftContext, store *raftwal.DiskStorage, tlsConfig *tls.Conf
Store: store,
Cfg: &raft.Config{
ID: rc.Id,
ElectionTick: 20, // 2s if we call Tick() every 100 ms.
HeartbeatTick: 1, // 100ms if we call Tick() every 100 ms.
ElectionTick: electionTick, // Default 2s if tick is 100ms.
HeartbeatTick: heartbeatTick, // 100ms if we call Tick() every 100 ms.
Storage: store,
MaxInflightMsgs: 256,
MaxSizePerMsg: 256 << 10, // 256 KB should allow more batching.
2 changes: 1 addition & 1 deletion conn/node_test.go
@@ -53,7 +53,7 @@ func TestProposal(t *testing.T) {
store := raftwal.Init(dir)

rc := &pb.RaftContext{Id: 1}
n := NewNode(rc, store, nil)
n := NewNode(rc, store, nil, 0)

peers := []raft.Peer{{ID: n.Id}}
n.SetRaft(raft.StartNode(n.Cfg, peers))
3 changes: 3 additions & 0 deletions dgraph/cmd/alpha/run.go
@@ -164,6 +164,9 @@ they form a Raft group and provide synchronous replication.
"to 0 to disable duration based snapshot.").
Flag("pending-proposals",
"Number of pending mutation proposals. Useful for rate limiting.").
Flag("election-tick",
"Number of ticks (each 100ms) before a follower starts an election. "+
"Default 20 means 2s election timeout. Increase in high-latency networks.").
String())

flag.String("security", worker.SecurityDefaults, z.NewSuperFlagHelp(worker.SecurityDefaults).
2 changes: 1 addition & 1 deletion dgraph/cmd/zero/raft.go
@@ -37,7 +37,7 @@ import (
)

const (
raftDefaults = "idx=1; learner=false;"
raftDefaults = "idx=1; learner=false; election-tick=20;"
)

var proposalKey uint64
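
As a rough illustration of how a default such as election-tick=20 can be overridden by a user-supplied --raft string, here is a stdlib-only sketch. It is a stand-in for the superflag handling used in this PR, not its actual implementation; `parseKV` and `electionTickFrom` are hypothetical names, and the real code reads the value with GetInt64("election-tick").

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseKV is a hypothetical helper that splits a "key=value; key=value;"
// string into a map, ignoring empty segments and surrounding whitespace.
func parseKV(s string) map[string]string {
	out := make(map[string]string)
	for _, pair := range strings.Split(s, ";") {
		pair = strings.TrimSpace(pair)
		if pair == "" {
			continue
		}
		key, val, _ := strings.Cut(pair, "=")
		out[strings.TrimSpace(key)] = strings.TrimSpace(val)
	}
	return out
}

// electionTickFrom overlays the user's options on the defaults and returns
// the resulting election-tick as an int.
func electionTickFrom(defaults, user string) (int, error) {
	merged := parseKV(defaults)
	for k, v := range parseKV(user) {
		merged[k] = v
	}
	return strconv.Atoi(merged["election-tick"])
}

func main() {
	const raftDefaults = "idx=1; learner=false; election-tick=20;"

	tick, err := electionTickFrom(raftDefaults, "election-tick=50")
	if err != nil {
		fmt.Println("bad election-tick:", err)
		return
	}
	fmt.Println("effective election-tick:", tick)
}
```

With an empty user string the sketch falls back to the default of 20, which mirrors why the PR adds election-tick=20 to the defaults strings.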
6 changes: 5 additions & 1 deletion dgraph/cmd/zero/run.go
@@ -105,6 +105,9 @@ instances to achieve high-availability.
Flag("learner",
`Make this Zero a "learner" node. In learner mode, this Zero will not participate `+
"in Raft elections. This can be used to achieve a read-only replica.").
Flag("election-tick",
"Number of ticks (each 100ms) before a follower starts an election. "+
"Default 20 means 2s election timeout. Increase in high-latency networks.").
String())

flag.String("audit", worker.AuditDefaults, z.NewSuperFlagHelp(worker.AuditDefaults).
@@ -160,7 +163,8 @@ func (st *state) serveGRPC(l net.Listener, store *raftwal.DiskStorage) {
Group: 0,
IsLearner: opts.raft.GetBool("learner"),
}
m := conn.NewNode(&rc, store, opts.tlsClientConfig)
electionTick := opts.raft.GetInt64("election-tick")
m := conn.NewNode(&rc, store, opts.tlsClientConfig, int(electionTick))

// Zero followers should not be forwarding proposals to the leader, to avoid txn commits which
// were calculated in a previous Zero leader.
3 changes: 2 additions & 1 deletion worker/draft.go
@@ -264,7 +264,8 @@ func newNode(store *raftwal.DiskStorage, gid uint32, id uint64, myAddr string) *
IsLearner: isLearner,
}
glog.Infof("RaftContext: %+v\n", rc)
m := conn.NewNode(rc, store, x.WorkerConfig.TLSClientConfig)
electionTick := x.WorkerConfig.Raft.GetInt64("election-tick")
m := conn.NewNode(rc, store, x.WorkerConfig.TLSClientConfig, int(electionTick))

n := &node{
Node: m,
2 changes: 1 addition & 1 deletion worker/server_state.go
@@ -31,7 +31,7 @@ const (
AuditDefaults = `compress=false; days=10; size=100; dir=; output=; encrypt-file=;`
BadgerDefaults = `compression=snappy; numgoroutines=8;`
RaftDefaults = `learner=false; snapshot-after-entries=10000; ` +
`snapshot-after-duration=30m; pending-proposals=256; idx=; group=;`
`snapshot-after-duration=30m; pending-proposals=256; idx=; group=; election-tick=20;`
SecurityDefaults = `token=; whitelist=;`
CDCDefaults = `file=; kafka=; sasl_user=; sasl_password=; ca_cert=; client_cert=; ` +
`client_key=; sasl-mechanism=PLAIN; tls=false;`