Cache Consistency Development Guide

This page is for developers maintaining server-side cache code. It defines which paths require strong consistency, how Redis version fences act as authoritative freshness boundaries, and what rules new caches must follow.

Core rule: asynchronous invalidation is for convergence, not correctness. Authorization, access control, room settings, playback state, membership, and resource-existence paths must remain correct even when a node has not received an invalidation event.

Components

Component	Purpose	Code entry point
L1 cache	Per-node memory cache that avoids repeated local reads	`moka` cache, `RoomSettingsCache`, `PlaybackStateCache`
Redis L2 cache	Shared cross-node cache that reduces PostgreSQL read load	`synctv-core/src/cache/l2_backend.rs`
Redis version fence	Authoritative freshness version for a logical resource	`synctv-core/src/cache/consistency.rs`
PostgreSQL row version	Durable optimistic-lock version of business state	repository-layer `version` columns
Invalidation stream	Clears other nodes’ local caches sooner	`CacheInvalidationRuntime`

Redis version fences are the decision point for strong reads. L1 and L2 values must carry versions; a strong read may return cached data only when the cached version satisfies the fence.

Strong Domains

CacheDomain defines logical resources that can be guarded by Redis fences:

Domain	Scope	Current strategy
`RoomSettings(room_id)`	Room password, join policy, approval policy, role default permissions, and room access behavior	Redis allocates versions, DB stores exact versions, L1/L2 use version-aware writes
`Playback(room_id)`	Current playback state, resets, autoplay, and playback state after media cleanup	Redis allocates versions, DB stores exact versions, L2 uses state `version` CAS
`Permission(room_id, user_id)`	One member’s effective permissions	Member-level mutations advance the user fence through reservations; strong reads validate both user fence and room-settings fence
`RoomMembership(room_id, user_id)`	Membership, kick, leave, and post-leave access boundaries	If cached later, it must first join the fence protocol; current critical paths are DB-authoritative
`MediaResource(room_id, media_id)`	Media existence, ownership, and access after deletion	If cached later, it must first join the fence protocol; current critical paths are DB-authoritative
`Playlist(room_id, playlist_id)`	Playlist existence, ownership, and access after deletion	If cached later, it must first join the fence protocol; current critical paths are DB-authoritative
`UserAuthSecurity(user_id)`	Ban, deletion, password version, token revocation, OAuth/passkey/session state	If cached, it must fail closed or join the fence protocol

Do not design domains around API routes. A domain should represent business state that changes and is validated together.

Strong Read Protocol

A strong read must follow this logic:

Read the Redis fence.
If Redis is unavailable or the fence store is not authoritative, authorization and access-control paths must bypass cache and read PostgreSQL; they must not trust old cache.
Check L1. Return it only when cached.version >= fence.
Check L2. Return it only when cached.version >= fence.
Read PostgreSQL and refresh cache with a version-aware write.

Pseudocode:

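// Step 1: read the authoritative freshness boundary for this domain.
// Propagating the error fails closed. Authorization paths may instead fall
// back to PostgreSQL, but must never serve cache without a fence.
let fence = version_fence.current_version(&domain).await?;

// Step 3: L1 may answer only if its version satisfies the fence.
if let Some(value) = l1.get(key).await {
    if value.version >= fence {
        return Ok(value);
    }
}

// Step 4: the same check applies to the shared L2 entry.
if let Some(value) = l2.get(key).await? {
    if value.version >= fence {
        return Ok(value);
    }
}

// Step 5: PostgreSQL is authoritative; refresh cache with a version-aware write.
let value = repository.load_with_version(key).await?;
cache.set_if_version_at_least(key, value.clone()).await?;
Ok(value)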
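
The pseudocode above leaves out the unavailable and pending cases from the protocol steps. The following is a minimal, synchronous, in-memory sketch of the full decision, not the SyncTV implementation: `FenceRead`, `Source`, `Caches`, and `refresh` are hypothetical names, and plain `HashMap`s stand in for L1, L2, and PostgreSQL.

```rust
use std::collections::HashMap;

/// A cached value tagged with the source version it was loaded at.
#[derive(Clone, Debug, PartialEq)]
struct Versioned {
    version: u64,
    data: String,
}

/// What the fence store reports for a domain (hypothetical shape).
enum FenceRead {
    /// Committed fence with no write in flight.
    Committed(u64),
    /// A write has reserved a version but not finalized it.
    Pending,
    /// Redis is unreachable or the store is not authoritative.
    Unavailable,
}

/// Where a strong read was served from.
#[derive(Debug, PartialEq)]
enum Source {
    L1,
    L2,
    Db,
}

struct Caches {
    l1: HashMap<String, Versioned>,
    l2: HashMap<String, Versioned>,
    db: HashMap<String, Versioned>, // stands in for PostgreSQL
}

/// Version-aware refresh: never replace a newer cached value with an older one.
fn refresh(cache: &mut HashMap<String, Versioned>, key: &str, fresh: &Versioned) {
    let newer_cached = cache.get(key).map_or(false, |c| c.version > fresh.version);
    if !newer_cached {
        cache.insert(key.to_string(), fresh.clone());
    }
}

impl Caches {
    fn strong_read(&mut self, key: &str, fence: FenceRead) -> Option<(Versioned, Source)> {
        // Only a committed fence lets cache participate at all.
        if let FenceRead::Committed(fence) = fence {
            if let Some(v) = self.l1.get(key) {
                if v.version >= fence {
                    return Some((v.clone(), Source::L1));
                }
            }
            if let Some(v) = self.l2.get(key) {
                if v.version >= fence {
                    return Some((v.clone(), Source::L2));
                }
            }
        }
        // Pending fence, unavailable fence, or stale cache: read the database.
        let fresh = self.db.get(key)?.clone();
        refresh(&mut self.l1, key, &fresh);
        refresh(&mut self.l2, key, &fresh);
        Some((fresh, Source::Db))
    }
}
```

In this sketch a pending or unavailable fence never consults L1 or L2, which matches the rule that authorization paths must not trust old cache.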

Do not use simple cache-first logic in strong reads. Cache-first is only acceptable for paths explicitly marked eventual and low risk.

Write Protocol

For resources with business row versions, Redis is the version allocator:

Read the current DB version from PostgreSQL.
Begin a fence write through ConsistencyCoordinator. In one atomic step, the Redis/local fence state checks whether the current committed or pending fence is already ahead of the observed DB version and, if it is not, reserves a pending version.
Commit the PostgreSQL optimistic-lock update with that exact reserved version.
Commit the same fence reservation token after the database transaction commits. If the DB CAS or transaction fails, abort only the matching pending reservation.
Write L2/L1 with set_if_version_at_least.
Publish invalidation and realtime events so other nodes converge sooner.

This order prevents the unsafe state: PostgreSQL has the new version while Redis fence still exposes the old version.
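
The begin/commit/abort steps can be illustrated with a small in-memory state machine. This is a sketch under stated assumptions rather than the `ConsistencyCoordinator` API: the `Fence`, `Reservation`, and `BeginError` types are hypothetical, the sketch allows a single pending slot per domain, and it reserves `observed_db_version + 1`.

```rust
/// Token and version handed out by `begin_write` (hypothetical shape).
#[derive(Clone, Debug, PartialEq)]
struct Reservation {
    token: u64,
    version: u64,
}

/// In-memory stand-in for one domain's fence state.
#[derive(Debug, Default)]
struct Fence {
    committed: u64,
    /// The single in-flight reservation, if any.
    pending: Option<Reservation>,
    next_token: u64,
}

#[derive(Debug, PartialEq)]
enum BeginError {
    /// The fence is already ahead of the DB version the caller observed.
    ObservedVersionStale,
    /// Another write holds the pending slot.
    PendingExists,
}

impl Fence {
    /// Check and reserve in one step; a real store must do this atomically.
    fn begin_write(&mut self, observed_db_version: u64) -> Result<Reservation, BeginError> {
        if self.pending.is_some() {
            return Err(BeginError::PendingExists);
        }
        if self.committed > observed_db_version {
            return Err(BeginError::ObservedVersionStale);
        }
        self.next_token += 1;
        let reservation = Reservation {
            token: self.next_token,
            version: observed_db_version + 1,
        };
        self.pending = Some(reservation.clone());
        Ok(reservation)
    }

    /// Call only after the PostgreSQL transaction has committed.
    fn commit_write(&mut self, reservation: &Reservation) -> bool {
        if self.pending.as_ref() != Some(reservation) {
            return false;
        }
        self.committed = reservation.version;
        self.pending = None;
        true
    }

    /// Clears only the matching reservation; the committed fence is untouched.
    fn abort_write(&mut self, reservation: &Reservation) -> bool {
        if self.pending.as_ref() != Some(reservation) {
            return false;
        }
        self.pending = None;
        true
    }

    /// Fence usable for cache validation; `None` means bypass cache.
    fn cache_fence(&self) -> Option<u64> {
        if self.pending.is_some() {
            None
        } else {
            Some(self.committed)
        }
    }
}
```

While a reservation is pending, `cache_fence` returns `None`, so a strong read in this sketch would go to PostgreSQL instead of validating cache against an old committed fence.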

Redis may be left holding a pending version with no matching DB commit, for example after a CAS conflict, transaction rollback, process crash, or outbox failure. Strong reads must bypass cache and read PostgreSQL while pending exists. This is fail-safe; the cost is that the domain temporarily loses cache hits.

The current implementation keeps committed/pending state in the fence store and ConsistencyCoordinator. Strong reads fall back to DB when pending exists, and tokenized room-settings, playback, membership, member-role, and member-permission writes commit the matching reservation after the database commit.

Read-time repair and the bootstrapped background repair worker both repair by comparing the PostgreSQL row version with the pending version:

If DB has reached the pending version, finalize pending.
If DB has not reached the pending version and the pending lease has expired, expire the abandoned pending reservation.
If DB has not reached the pending version and the lease has not expired, keep pending.

A local timeout alone must not abort pending; the repair must also compare the PostgreSQL version.
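
The repair rule reduces to a three-way decision. The sketch below assumes the caller has already read the PostgreSQL row version and evaluated the lease; `RepairAction` and `repair_action` are hypothetical names used only for illustration.

```rust
/// Outcome of comparing a pending reservation with the durable DB version.
#[derive(Debug, PartialEq)]
enum RepairAction {
    /// DB already holds the reserved version: promote pending to committed.
    FinalizePending,
    /// DB never reached the reserved version and the lease ran out.
    ExpirePending,
    /// The write may still be in flight: leave the reservation alone.
    KeepPending,
}

fn repair_action(db_version: u64, pending_version: u64, lease_expired: bool) -> RepairAction {
    if db_version >= pending_version {
        // The DB comparison comes first, so an expired lease can never
        // discard a reservation whose write actually committed.
        RepairAction::FinalizePending
    } else if lease_expired {
        RepairAction::ExpirePending
    } else {
        RepairAction::KeepPending
    }
}
```

Checking the DB version before the lease is what keeps a timeout alone from aborting a reservation that has a matching commit.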

Business services should not call the low-level fence store directly. New strong-consistency paths must begin/commit/abort reservations, seed, or record DB fallback through ConsistencyCoordinator. This keeps metrics, error classification, and the pending/committed fence protocol behind a single replaceable entry point.

Reservation Lifecycle

A SyncTV fence reservation is not part of the PostgreSQL transaction. Rolling back a DB transaction does not clear a pending reservation from Redis/local fence state. Every reservation therefore needs an explicit owner, and that owner must cover every exit path.

Mandatory rules:

After begin_*write succeeds, the reservation must immediately be owned by the current function, a local owner/collector, or a return value that successfully transfers ownership to the caller.
Before ownership is transferred to the caller, every later ?, return Err(...), CAS miss, outbox failure, auxiliary cleanup failure, and transaction commit failure must abort the matching reservation first.
If a helper creates a reservation, that helper must clean up its own failure paths. The caller can only clean up reservations that were successfully returned.
Batch reservation code must use a collector/owner pattern. If reservation N+1 fails, the first N reservations must be aborted immediately.
Fence commit may happen only after the PostgreSQL transaction has committed. Do not expose a pending reservation as committed before the durable DB fact exists.
Fence commit failure is a post-commit repair problem. Do not try to “roll back” a DB-committed business fact by aborting the version after commit.

Forbidden pattern:

let reservation = begin_write().await?;
// Every `?` below can return early and leave the reservation pending.
write_db_row().await?;
delete_auxiliary_rows().await?;
tx.commit().await?;
commit_write(&reservation).await?;

Correct code must explicitly close error exits:

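let reservation = begin_write().await?;

// Run every fallible DB step inside one block so a single exit handles errors.
let result: Result<_> = async {
    write_db_row().await?;
    delete_auxiliary_rows().await?;
    Ok(())
}
.await;

// Any failure before commit aborts the matching reservation.
if let Err(error) = result {
    abort_write(reservation.as_ref()).await;
    return Err(error);
}

// A failed transaction commit also leaves no durable fact, so abort.
if let Err(error) = tx.commit().await {
    abort_write(reservation.as_ref()).await;
    return Err(error.into());
}

// The DB fact is durable now. A failure here is a post-commit repair problem;
// do not abort the reservation at this point.
commit_write(reservation.as_ref(), db_version).await?;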
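
For the batch rule, a collector can own every reservation until ownership is transferred. The following synchronous sketch uses `Drop` to abort whatever is still held; it is an illustration under assumptions, not the SyncTV code. `ReservationCollector`, `AbortLog`, and `reserve_all` are hypothetical, tokens are plain `u64`s, and the abort is recorded in a shared log instead of calling a fence store. Real abort calls are async, so production code would need an explicit async cleanup path rather than `Drop`.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Records aborted tokens; stands in for the real fence store.
type AbortLog = Rc<RefCell<Vec<u64>>>;

/// Owns reservations until they are handed to the caller.
struct ReservationCollector {
    tokens: Vec<u64>,
    log: AbortLog,
}

impl ReservationCollector {
    fn new(log: AbortLog) -> Self {
        ReservationCollector { tokens: Vec::new(), log }
    }

    /// Takes ownership of a reservation if the begin step succeeded.
    fn reserve(&mut self, begun: Result<u64, String>) -> Result<(), String> {
        let token = begun?;
        self.tokens.push(token);
        Ok(())
    }

    /// Transfers ownership to the caller; nothing is left to abort on drop.
    fn into_reservations(mut self) -> Vec<u64> {
        std::mem::take(&mut self.tokens)
    }
}

impl Drop for ReservationCollector {
    fn drop(&mut self) {
        // Anything still owned here was never transferred, so abort it.
        for token in self.tokens.drain(..) {
            self.log.borrow_mut().push(token);
        }
    }
}

/// If reservation N+1 fails, the first N are aborted when the collector drops.
fn reserve_all(begun: Vec<Result<u64, String>>, log: AbortLog) -> Result<Vec<u64>, String> {
    let mut collector = ReservationCollector::new(log);
    for item in begun {
        collector.reserve(item)?;
    }
    Ok(collector.into_reservations())
}
```

The design choice is that the early return through `?` needs no cleanup code of its own: the collector is the owner, so every exit path before `into_reservations` aborts the reservations collected so far.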

Before changing strong-consistency write paths, audit reservation ownership with source search and inspect every begin site that the change can affect:

rg -n "begin_.*write|begin_observed_write|VersionFenceReservation" synctv-core/src/service synctv-core/src/cache
rg -n "abort_.*write|commit_.*write|commit_reserved_write|abort_reserved_write" synctv-core/src/service synctv-core/src/cache

This search does not prove correctness. Reviewers must inspect every relevant begin site and verify owner transfer, every ? / return Err path before transfer, transaction commit failure handling, post-commit finalization, and cache invalidation.

Papers and open-source systems provide principles, not a drop-in implementation for this codebase. Spanner, etcd, and Kubernetes watch-cache designs keep version proofs inside one controlled system. SyncTV currently spans PostgreSQL transactions and Redis/local fence state without a global transaction manager, so service code must explicitly maintain pending reservation ownership, abort, and commit.

L2 Writes

Redis L2 must not be overwritten unconditionally. Any reload-from-DB path that writes L2 must use a version-aware write:

cache.set_if_version_at_least(key, value).await?;

This prevents a racing read from writing version N back into Redis after a write path has already committed version N+1.
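
A minimal sketch of the comparison behind a version-aware write, using a `HashMap` in place of Redis. The `Entry` type and this free-function signature are assumptions for illustration; the real L2 write must perform the compare and the set atomically on the Redis side.

```rust
use std::collections::HashMap;

/// A cache entry tagged with its source version.
#[derive(Clone, Debug, PartialEq)]
struct Entry {
    version: u64,
    payload: String,
}

/// Stores `incoming` unless the cache already holds a strictly newer version.
/// Returns `true` when the write was applied.
fn set_if_version_at_least(
    cache: &mut HashMap<String, Entry>,
    key: &str,
    incoming: Entry,
) -> bool {
    let newer_cached = cache
        .get(key)
        .map_or(false, |existing| existing.version > incoming.version);
    if newer_cached {
        // A racing reload carrying an older version is rejected.
        return false;
    }
    cache.insert(key.to_string(), incoming);
    true
}
```

In this sketch an equal version is accepted, so an idempotent reload of the current version still succeeds, while version N arriving after N+1 is dropped.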

Permission Cache

Effective permissions are not stored as an independent snapshot table. They are computed at read time:

effective_permissions =
  f(global_defaults, room_settings.role_defaults, room_member.role, member_overrides)

Permission cache entries therefore store two versions:

Field	Source	Meaning
`user_version`	`Permission(room_id, user_id)` fence	Freshness of the member’s own role and overrides
`room_settings_version`	PostgreSQL `_settings` row version	Room settings version used when computing this permission value

A strong permission read may return cache only when both checks pass:

cached.user_version >= Redis Permission(room_id, user_id) fence
cached.room_settings_version >= Redis RoomSettings(room_id) fence

Changing one member’s role or permission overrides advances only that member’s Permission(room_id, user_id) fence.

Room default permissions are part of RoomSettings. After a settings write advances the RoomSettings(room_id) fence, old permission cache entries are rejected because their room_settings_version no longer satisfies the new fence. invalidate_room_cache(room_id) only performs local clearing and broadcast convergence; it is not the correctness mechanism.
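
The two-version check in this section can be sketched as a single predicate. The struct and function names below are hypothetical, and `allowed_actions` is a placeholder payload; only the two comparisons come from the rules above.

```rust
/// A cached effective-permission value with both source versions.
struct CachedPermission {
    user_version: u64,
    room_settings_version: u64,
    /// Placeholder for the computed permission value.
    allowed_actions: u64,
}

/// Current Redis fences for the two domains a permission depends on.
struct PermissionFences {
    permission: u64,
    room_settings: u64,
}

/// A strong read may serve the cached entry only when both fences are satisfied.
fn may_serve_cached(cached: &CachedPermission, fences: &PermissionFences) -> bool {
    cached.user_version >= fences.permission
        && cached.room_settings_version >= fences.room_settings
}
```

Advancing only the room-settings fence is enough to reject every member's cached entry for that room, without touching any per-user fence.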

Role of Invalidation

Redis Streams, local broadcast, and PostgreSQL notifications are convergence mechanisms:

Reduce stale L1 residency.
Reduce the chance that the next strong read falls back to DB.
Trigger re-evaluation of Realtime resource observations.

They are not the source of strong consistency. When adding a strong path, design the fence and version validation first, then add invalidation as an optimization.

New Cache Design Rules

New or changed caches must satisfy these constraints:

Constraint	Rule
Authorization, access control, existence checks, and critical user-visible state	Use the strong/fence protocol; if the path cannot join the fence protocol, keep it DB-authoritative
Cached value version source	Prefer a business row version; derived values store the source versions used in computation
Relationship between Redis fence and DB version	Strong reads must not see a committed fence that lags the DB; install a pending reservation before the DB commit
L2 overwrite semantics	All writes use `set_if_version_at_least`; an older reload cannot overwrite a newer value
Redis unavailable semantics	Authorization paths fail closed or bypass cache and read DB
Async invalidation semantics	Invalidation is only a convergence optimization, never a correctness dependency
Service integration	Use `ConsistencyCoordinator` for fence access; do not call low-level `VersionFenceStore` primitives directly from service code

Observability

Consistency metrics are used to detect safe-but-degraded reads and write paths that need repair:

Metric	Meaning
`cache_fence_operations_total{domain,operation,result}`	Success, conflict, timeout, and error counts for current-version reads, begin/commit/abort, and seed operations
`cache_db_fallback_total{domain,reason}`	Strong reads that fell back to PostgreSQL because of missing fences, stale cache, L2 errors, and similar reasons
`cache_stale_write_reject_total{cache_type,level}`	Version-aware cache writes rejected because L1/L2 already held a newer value
`cache_fence_pending{domain}`	Whether a domain currently has a pending fence reservation
`cache_fence_repair_total{domain,result}`	Read-time PostgreSQL fallback repair outcomes for advancing/finalizing fences
`cache_fence_db_compare{domain,relation}`	Redis fence vs PostgreSQL version relation observed during repair/patrol, such as `fence_behind_db`, `fence_ahead_db`, or `pending_ahead_db`
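
As an illustration of how the `relation` label on `cache_fence_db_compare` might be derived, the sketch below classifies one observation. The three named labels come from the table above; the `in_sync` label, the function name, and the precedence given to a pending reservation are assumptions, not the documented behavior.

```rust
/// Classifies the fence-versus-DB relation for one observation.
fn db_compare_relation(committed_fence: u64, pending: Option<u64>, db_version: u64) -> &'static str {
    // Assumption: a pending reservation ahead of the DB is reported first.
    if let Some(pending_version) = pending {
        if pending_version > db_version {
            return "pending_ahead_db";
        }
    }
    if committed_fence < db_version {
        // The unsafe direction: cache could pass a fence that lags the DB.
        "fence_behind_db"
    } else if committed_fence > db_version {
        // The safe direction: cached values fail the fence and reads hit the DB.
        "fence_ahead_db"
    } else {
        // Hypothetical label for the steady state.
        "in_sync"
    }
}
```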

Test Requirements

Strong-consistency cache changes should cover these cases:

L1 contains an old value, Redis fence has advanced, and the strong read rejects L1.
L2 contains an old value, Redis fence has advanced, and the strong read rejects L2.
The write path reserves a Redis version and stores that exact version in DB.
An older reload cannot overwrite a newer L2 value.
A derived cache is rejected after an upstream source version changes.

Permission changes should also cover:

A member-level permission mutation affects only that user’s permission fence.
A room default permission mutation makes old permission cache entries fail the room-settings fence check.

Research Comparison And Current Conclusions

This design review used at least 15 modern papers, production writeups, and popular open-source system documents:

Source	Useful idea	SyncTV conclusion
Scaling Memcache at Facebook	Leases, invalidation fanout, hot-key protection, and treating cache as an operated system	SyncTV treats invalidation as convergence; production operation should observe fence lag, CAS skips, and DB fallback
TAO: Facebook’s Distributed Data Store for the Social Graph	Object-oriented cache/version structure over graph data	`CacheDomain` should follow business resources; derived values must store source versions
RAMP-TAO	Multi-object reads must avoid fractured visibility	Permission cache derives from member rows and room settings, so it must store both source versions
Polaris / Cache Made Consistent	Production cache consistency needs independent detection, not only code review	Redis fence vs DB version lag and rejected old L2 writes are core consistency signals
Amazon Dynamo	Object versioning and explicit conflict handling	Redis fence may be ahead of DB, but cache entries must carry real source versions, not only fence versions
Google Spanner	Monotonic timestamps and external consistency depend on clear commit ordering	SyncTV is not a global transaction system; strong guarantees are per domain through Redis monotonic fences
Cloud Spanner external consistency docs	Strong and stale reads are explicit modes	SyncTV must keep strong and eventual APIs clearly separated
Calvin	Decide transaction order before execution	Reserving Redis version before DB write is correct; a fence ahead of DB is a safe cache-miss state
RAMP transactions	Read-atomic metadata is required for multi-source derived reads	Permission cache must store member version and room settings version, not a single logical invalidation version
FaRM	High-performance transactions still need validation	L2 CAS and DB optimistic locks are both required; unconditional set is not acceptable
Kubernetes API concepts	`resourceVersion` is used for change detection and consistency requirements	`RoomSettings.version`, the no-client-cache version, and `RoomMember.version` should be cache source versions
Kubernetes consistent reads from cache	Consistent cache reads require progress/version proof	SyncTV L1/L2 can serve strong reads only after satisfying Redis fence
etcd API guarantees	Linearizable and serializable/stale reads are separate modes	Redis failure must not make authorization paths trust stale cache; use DB fallback or fail closed
Envoy xDS protocol	Version + nonce avoids ACK/NACK races	Realtime and invalidation can later expose observed-version debug fields, but correctness must not depend on ACKs
CockroachDB follower reads	Stale reads are explicit consistent historical reads	SyncTV eventual paths are only for low-risk reads, never authorization or access control
TiDB stale read	Historical reads require TSO/safe-point boundaries	Stale-read modes require an explicit staleness bound, not an implicit TTL guarantee
Cassandra LWT	CAS/linearizable writes are useful for critical conditional updates	SyncTV must keep DB optimistic locks plus Redis L2 version CAS for critical cache writes
Redis Lua scripting	Redis scripts can provide atomic compare-and-set boundaries	`set_version_at_least` and L2 Lua CAS provide Redis-side atomic version boundaries

Design conclusions:

Topic	Conclusion
Derived caches	Derived cache entries store actual source versions, not the current Redis fence as if it were a source version. Permission cache stores `RoomMember.version` and room settings row version.
Write ordering	The fence exposed to strong reads must not lag DB; Redis failure must not silently complete a strong-consistency write. Room settings, playback, and membership/permission/role writes begin a pending fence first, then write exact DB versions.
Multi-source strong reads	Strong reads that need multiple fences must evaluate cache hits against one coherent freshness boundary.
Production observability	Consistency observability covers fence lag, CAS rejects, DB fallback, and Redis fence unavailable events.
Delete semantics	Delete, leave, and kick transitions require explicit version semantics; member deletion writes a PostgreSQL lifecycle marker as the tombstone version, while strong reads still use DB authority for non-member authorization and never cache a successful authorization result for a removed member.
Eventual paths	Eventual APIs and strong APIs have different consistency contracts; authorization and access control must not use eventual reads.