Cache Consistency Development Guide
This page is for developers maintaining server-side cache code. It defines which paths require strong consistency, how Redis version fences act as authoritative freshness boundaries, and what rules new caches must follow.
Core rule: asynchronous invalidation is for convergence, not correctness. Authorization, access control, room settings, playback state, membership, and resource-existence paths must remain correct even when a node has not received an invalidation event.
Components
| Component | Purpose | Code entry point |
|---|---|---|
| L1 cache | Per-node memory cache that avoids repeated local reads | moka cache, RoomSettingsCache, PlaybackStateCache |
| Redis L2 cache | Shared cross-node cache that reduces PostgreSQL read load | synctv-core/src/cache/l2_backend.rs |
| Redis version fence | Authoritative freshness version for a logical resource | synctv-core/src/cache/consistency.rs |
| PostgreSQL row version | Durable optimistic-lock version of business state | repository-layer version columns |
| invalidation stream | Clears other nodes’ local caches sooner | CacheInvalidationRuntime |
Redis version fences are the decision point for strong reads. L1 and L2 values must carry versions; a strong read may return cached data only when the cached version satisfies the fence.
Strong Domains
CacheDomain defines logical resources that can be guarded by Redis fences:
| Domain | Scope | Current strategy |
|---|---|---|
| `RoomSettings(room_id)` | Room password, join policy, approval policy, role default permissions, and room access behavior | Redis allocates versions, DB stores exact versions, L1/L2 use version-aware writes |
| `Playback(room_id)` | Current playback state, resets, autoplay, and playback state after media cleanup | Redis allocates versions, DB stores exact versions, L2 uses state version CAS |
| `Permission(room_id, user_id)` | One member’s effective permissions | Member-level mutations advance the user fence through reservations; strong reads validate both user fence and room-settings fence |
| `RoomMembership(room_id, user_id)` | Membership, kick, leave, and post-leave access boundaries | If cached later, it must first join the fence protocol; current critical paths are DB-authoritative |
| `MediaResource(room_id, media_id)` | Media existence, ownership, and access after deletion | If cached later, it must first join the fence protocol; current critical paths are DB-authoritative |
| `Playlist(room_id, playlist_id)` | Playlist existence, ownership, and access after deletion | If cached later, it must first join the fence protocol; current critical paths are DB-authoritative |
| `UserAuthSecurity(user_id)` | Ban, deletion, password version, token revocation, OAuth/passkey/session state | If cached, it must fail closed or join the fence protocol |
Do not design domains around API routes. A domain should represent business state that changes and is validated together.
Strong Read Protocol
A strong read must follow this logic:
- Read the Redis fence.
- If Redis is unavailable or the fence store is not authoritative, authorization and access-control paths must bypass cache and read PostgreSQL; they must not trust old cache.
- Check L1. Return it only when `cached.version >= fence`.
- Check L2. Return it only when `cached.version >= fence`.
- Read PostgreSQL and refresh cache with a version-aware write.
Pseudocode:

    let fence = version_fence.current_version(&domain).await?;

    if let Some(value) = l1.get(key).await {
        if value.version >= fence {
            return Ok(value);
        }
    }

    if let Some(value) = l2.get(key).await? {
        if value.version >= fence {
            return Ok(value);
        }
    }

    let value = repository.load_with_version(key).await?;
    cache.set_if_version_at_least(key, value.clone()).await?;
    Ok(value)

Do not use simple cache-first logic in strong reads. Cache-first is only acceptable for paths explicitly marked eventual and low risk.
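The read steps above, including the Redis-unavailable and pending cases that the pseudocode hides behind `?`, can be sketched synchronously with in-memory maps. This is a minimal sketch under assumptions, not the SyncTV implementation: `FenceState`, `Versioned`, `Source`, `strong_read`, and the `HashMap` stand-ins for L1, L2, and PostgreSQL are all hypothetical names chosen for illustration.

```rust
use std::collections::HashMap;

/// A cached or stored value tagged with its source version.
#[derive(Clone, Debug, PartialEq)]
pub struct Versioned {
    pub version: u64,
    pub payload: String,
}

/// Hypothetical view of what the fence store reports for one domain.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum FenceState {
    /// Redis is unreachable or the fence store is not authoritative.
    Unavailable,
    /// A write has reserved a version that is not committed yet.
    Pending,
    /// Latest committed fence version.
    Committed(u64),
}

/// Which layer served the read.
#[derive(Debug, PartialEq)]
pub enum Source {
    L1,
    L2,
    Database,
}

/// Version-aware write: an older value never replaces a newer one.
fn set_if_version_at_least(cache: &mut HashMap<String, Versioned>, key: &str, value: Versioned) {
    if let Some(existing) = cache.get(key) {
        if existing.version > value.version {
            return;
        }
    }
    cache.insert(key.to_string(), value);
}

pub fn strong_read(
    fence: FenceState,
    key: &str,
    l1: &mut HashMap<String, Versioned>,
    l2: &mut HashMap<String, Versioned>,
    db: &HashMap<String, Versioned>,
) -> Option<(Versioned, Source)> {
    let fence_version = match fence {
        FenceState::Committed(version) => version,
        // No trustworthy freshness boundary: bypass both caches and do not
        // refresh them (an assumption of this sketch).
        FenceState::Unavailable | FenceState::Pending => {
            return db.get(key).cloned().map(|value| (value, Source::Database));
        }
    };
    if let Some(value) = l1.get(key) {
        if value.version >= fence_version {
            return Some((value.clone(), Source::L1));
        }
    }
    if let Some(value) = l2.get(key) {
        if value.version >= fence_version {
            return Some((value.clone(), Source::L2));
        }
    }
    // Both caches are missing or stale: read the database and refresh.
    let fresh = db.get(key)?.clone();
    set_if_version_at_least(l1, key, fresh.clone());
    set_if_version_at_least(l2, key, fresh.clone());
    Some((fresh, Source::Database))
}
```

With a cached version 3 and a fence at 3, the sketch serves from L1. Once the fence advances to 4, the same entry is rejected and the read falls through to the database, which then refreshes both cache layers.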
Write Protocol
For resources with business row versions, Redis is the version allocator:
- Read the current DB version from PostgreSQL.
- Use `ConsistencyCoordinator` to begin a fence write so Redis/local fence state atomically checks whether the current committed or pending fence is already ahead of the observed DB version and reserves a pending version.
- Commit the PostgreSQL optimistic-lock update with that exact reserved version.
- Commit the same fence reservation token after the database transaction commits. If the DB CAS or transaction fails, abort only the matching pending reservation.
- Write L2/L1 with `set_if_version_at_least`.
- Publish invalidation and realtime events so other nodes converge sooner.
This order prevents the unsafe state: PostgreSQL has the new version while Redis fence still exposes the old version.
Redis may hold a pending state. For example, a CAS conflict, transaction rollback, process crash, or outbox failure may leave a pending version without a matching DB commit. Strong reads must bypass cache and read PostgreSQL while pending exists. That is fail-safe, with the cost that the domain temporarily loses cache hits.
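The reserve, DB-commit, fence-commit ordering can be illustrated with an in-memory fence. This is a sketch under stated assumptions rather than the real `ConsistencyCoordinator` API: `Fence`, `Reservation`, `BeginError`, `Row`, and `write_row` are hypothetical, the sketch rejects a second writer while a pending reservation exists, and the `db_commit_ok` flag simulates the PostgreSQL CAS outcome. Cache writes and invalidation publishing are omitted.

```rust
/// Token plus the exact version the DB row must be written with.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Reservation {
    pub token: u64,
    pub version: u64,
}

#[derive(Debug, PartialEq)]
pub enum BeginError {
    /// The committed fence is already ahead of the DB version the writer observed.
    FenceAhead { fence: u64 },
    /// Another write already holds a pending reservation (sketch assumption).
    PendingExists,
}

/// In-memory stand-in for one domain's fence state.
#[derive(Debug, Default)]
pub struct Fence {
    committed: u64,
    pending: Option<Reservation>,
    next_token: u64,
}

impl Fence {
    /// Check the observed DB version and reserve the next version in one step.
    pub fn begin_write(&mut self, observed_db_version: u64) -> Result<Reservation, BeginError> {
        if self.pending.is_some() {
            return Err(BeginError::PendingExists);
        }
        if self.committed > observed_db_version {
            return Err(BeginError::FenceAhead { fence: self.committed });
        }
        self.next_token += 1;
        let reservation = Reservation { token: self.next_token, version: observed_db_version + 1 };
        self.pending = Some(reservation);
        Ok(reservation)
    }

    /// Promote the matching pending reservation to committed.
    pub fn commit_write(&mut self, reservation: Reservation) -> bool {
        if self.pending == Some(reservation) {
            self.committed = reservation.version;
            self.pending = None;
            true
        } else {
            false
        }
    }

    /// Drop only the matching pending reservation.
    pub fn abort_write(&mut self, reservation: Reservation) -> bool {
        if self.pending == Some(reservation) {
            self.pending = None;
            true
        } else {
            false
        }
    }

    pub fn committed(&self) -> u64 {
        self.committed
    }

    pub fn has_pending(&self) -> bool {
        self.pending.is_some()
    }
}

/// Stand-in for a PostgreSQL row with an optimistic-lock version column.
pub struct Row {
    pub version: u64,
    pub value: String,
}

pub fn write_row(fence: &mut Fence, row: &mut Row, new_value: &str, db_commit_ok: bool) -> Result<u64, String> {
    let observed = row.version; // read the current DB version
    let reservation = fence.begin_write(observed).map_err(|e| format!("{:?}", e))?;
    if !db_commit_ok {
        // DB CAS or transaction failed: abort only the matching reservation.
        fence.abort_write(reservation);
        return Err("database commit failed".to_string());
    }
    row.version = reservation.version; // DB stores the exact reserved version
    row.value = new_value.to_string();
    fence.commit_write(reservation); // only after the DB commit
    Ok(reservation.version)
}
```

In this sketch the committed fence never advances before the row does, and a failed database commit leaves the fence at its previous committed version with no pending reservation left behind.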
The current implementation keeps committed/pending state in the fence store and ConsistencyCoordinator: strong reads fall back to DB when pending exists, and tokenized room-settings, playback, membership, member-role, and member-permission writes commit the matching reservation after the database commit. Read-time repair and the bootstrapped background repair worker both repair by comparing the PostgreSQL row version with the pending version:
- If DB has reached the pending version, finalize pending.
- If DB has not reached the pending version and the pending lease has expired, expire the abandoned pending reservation.
- If DB has not reached the pending version and the lease has not expired, keep pending.

A local timeout alone must not abort pending; the repair must also compare the PostgreSQL version.
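The three repair outcomes reduce to a small pure decision function. The sketch below is illustrative only; `RepairAction` and `repair_action` are hypothetical names, and the real repair worker also has to perform the corresponding fence operations.

```rust
/// Outcome of comparing a pending reservation with the PostgreSQL row version.
#[derive(Debug, PartialEq)]
pub enum RepairAction {
    /// The DB reached the pending version: promote pending to committed.
    FinalizePending,
    /// The DB never reached it and the lease ran out: drop the reservation.
    ExpirePending,
    /// The writer may still be in flight: leave the reservation alone.
    KeepPending,
}

pub fn repair_action(db_version: u64, pending_version: u64, lease_expired: bool) -> RepairAction {
    if db_version >= pending_version {
        // The DB comparison comes first, so an expired lease never discards
        // a version that was durably committed.
        RepairAction::FinalizePending
    } else if lease_expired {
        RepairAction::ExpirePending
    } else {
        RepairAction::KeepPending
    }
}
```

Checking the database version before the lease is what keeps a timeout from aborting a reservation whose write actually committed.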
Business services should not call the low-level fence store directly. New strong-consistency paths must begin/commit/abort reservations, seed, or record DB fallback through ConsistencyCoordinator. This keeps metrics, error classification, and the pending/committed fence protocol behind one replacement point.
Reservation Lifecycle
A SyncTV fence reservation is not part of the PostgreSQL transaction. Rolling back a DB transaction does not clear a pending reservation from Redis/local fence state. Every reservation therefore needs an explicit owner, and that owner must cover every exit path.
Mandatory rules:
- After `begin_*write` succeeds, the reservation must immediately be owned by the current function, a local owner/collector, or a return value that successfully transfers ownership to the caller.
- Before ownership is transferred to the caller, every later `?`, `return Err(...)`, CAS miss, outbox failure, auxiliary cleanup failure, and transaction commit failure must abort the matching reservation first.
- If a helper creates a reservation, that helper must clean up its own failure paths. The caller can only clean up reservations that were successfully returned.
- Batch reservation code must use a collector/owner pattern. If reservation N+1 fails, the first N reservations must be aborted immediately.
- Fence commit may happen only after the PostgreSQL transaction has committed. Do not expose a pending reservation as committed before the durable DB fact exists.
- Fence commit failure is a post-commit repair problem. Do not try to “roll back” a DB-committed business fact by aborting the version after commit.
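The batch rule above (if reservation N+1 fails, abort the first N) can be sketched with a collector that owns every reservation it has taken so far. This is a synchronous illustration under assumptions: `FenceOps`, `ReservationCollector`, and the `RecordingFence` fake are hypothetical and do not mirror the real coordinator signatures.

```rust
/// Hypothetical fence interface with only the calls this sketch needs.
pub trait FenceOps {
    fn begin_write(&mut self, domain: &str) -> Result<u64, String>;
    fn abort_write(&mut self, domain: &str, token: u64);
}

/// Owns every reservation taken so far until ownership is handed on.
pub struct ReservationCollector {
    held: Vec<(String, u64)>,
}

impl ReservationCollector {
    /// Reserve all domains, or abort the ones already reserved and fail.
    pub fn reserve_all<F: FenceOps>(fence: &mut F, domains: &[&str]) -> Result<Self, String> {
        let mut held: Vec<(String, u64)> = Vec::new();
        for domain in domains {
            match fence.begin_write(domain) {
                Ok(token) => held.push((domain.to_string(), token)),
                Err(error) => {
                    // Reservation N+1 failed: abort the first N immediately.
                    for (reserved_domain, token) in held.drain(..) {
                        fence.abort_write(&reserved_domain, token);
                    }
                    return Err(error);
                }
            }
        }
        Ok(Self { held })
    }

    /// Abort everything still owned, for example after a failed DB commit.
    pub fn abort_all<F: FenceOps>(self, fence: &mut F) {
        for (domain, token) in self.held {
            fence.abort_write(&domain, token);
        }
    }
}

/// In-memory fake that records which reservations are still pending.
#[derive(Default)]
pub struct RecordingFence {
    next_token: u64,
    pub fail_on: Option<String>,
    pub pending: Vec<(String, u64)>,
}

impl FenceOps for RecordingFence {
    fn begin_write(&mut self, domain: &str) -> Result<u64, String> {
        if self.fail_on.as_deref() == Some(domain) {
            return Err(format!("cannot reserve {}", domain));
        }
        self.next_token += 1;
        self.pending.push((domain.to_string(), self.next_token));
        Ok(self.next_token)
    }

    fn abort_write(&mut self, domain: &str, token: u64) {
        self.pending.retain(|(d, t)| !(d.as_str() == domain && *t == token));
    }
}
```

A caller that receives the collector becomes the owner and must either commit each reservation after the database commit or call `abort_all` on every error exit before that point.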
Forbidden pattern:

    let reservation = begin_write().await?;
    // Each `?` between begin_write and commit_write can return early
    // without aborting the pending reservation.
    write_db_row().await?;
    delete_auxiliary_rows().await?;
    tx.commit().await?;
    commit_write(&reservation).await?;

Correct code must explicitly close error exits:

    let reservation = begin_write().await?;

    let result: Result<_> = async {
        write_db_row().await?;
        delete_auxiliary_rows().await?;
        Ok(())
    }
    .await;

    if let Err(error) = result {
        abort_write(reservation.as_ref()).await;
        return Err(error);
    }

    if let Err(error) = tx.commit().await {
        abort_write(reservation.as_ref()).await;
        return Err(error.into());
    }

    commit_write(reservation.as_ref(), db_version).await?;

Before changing strong-consistency write paths, audit reservation ownership with source search and inspect every begin site that the change can affect:

    rg -n "begin_.*write|begin_observed_write|VersionFenceReservation" synctv-core/src/service synctv-core/src/cache
    rg -n "abort_.*write|commit_.*write|commit_reserved_write|abort_reserved_write" synctv-core/src/service synctv-core/src/cache

This search does not prove correctness. Reviewers must inspect every relevant begin site and verify owner transfer, every `?` / `return Err` path before transfer, transaction commit failure handling, post-commit finalization, and cache invalidation.
Papers and open-source systems provide principles, not a drop-in implementation for this codebase. Spanner, etcd, and Kubernetes watch-cache designs keep version proofs inside one controlled system. SyncTV currently spans PostgreSQL transactions and Redis/local fence state without a global transaction manager, so service code must explicitly maintain pending reservation ownership, abort, and commit.
L2 Writes
Redis L2 must not be overwritten unconditionally. Any reload-from-DB path that writes L2 must use a version-aware write:

    cache.set_if_version_at_least(key, value).await?;

This prevents a racing read from writing version N back into Redis after a write path has already committed version N+1.
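The comparison behind a version-aware write can be shown on a plain map. This is a minimal sketch, assuming a single-threaded `HashMap` as a stand-in for Redis L2; the `Entry` type and the free function below are hypothetical, and against real Redis the compare and the write would have to run as one atomic step.

```rust
use std::collections::HashMap;

/// Cache entry tagged with the version it was loaded at.
#[derive(Clone, Debug, PartialEq)]
pub struct Entry {
    pub version: u64,
    pub payload: String,
}

/// Store `incoming` unless the cache already holds a strictly newer version.
/// Returns true when the value was written.
pub fn set_if_version_at_least(store: &mut HashMap<String, Entry>, key: &str, incoming: Entry) -> bool {
    if let Some(existing) = store.get(key) {
        if existing.version > incoming.version {
            // A newer value is already cached: reject the stale write.
            return false;
        }
    }
    store.insert(key.to_string(), incoming);
    true
}
```

If a write path has stored version 5 and a slow reload then arrives carrying version 4, the call returns `false` and version 5 stays in place.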
Permission Cache
Effective permissions are not stored as an independent snapshot table. They are computed at read time:

    effective_permissions = f(global_defaults, room_settings.role_defaults, room_member.role, member_overrides)

Permission cache entries therefore store two versions:
| Field | Source | Meaning |
|---|---|---|
| `user_version` | Permission(room_id, user_id) fence | Freshness of the member’s own role and overrides |
| `room_settings_version` | PostgreSQL _settings row version | Room settings version used when computing this permission value |
A strong permission read may return cache only when both checks pass:

    cached.user_version >= Redis Permission(room_id, user_id) fence
    cached.room_settings_version >= Redis RoomSettings(room_id) fence

Changing one member’s role or permission overrides advances only that member’s Permission(room_id, user_id) fence.
Room default permissions are part of RoomSettings. After a settings write advances the RoomSettings(room_id) fence, old permission cache entries are rejected because their room_settings_version no longer satisfies the new fence. invalidate_room_cache(room_id) only performs local clearing and broadcast convergence; it is not the correctness mechanism.
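The two-fence check can be written as a single predicate. The sketch below is illustrative; `CachedPermission`, its `bits` field, and `permission_cache_hit` are hypothetical names, and the fence values are assumed to have been read from Redis already.

```rust
/// Permission cache entry carrying both source versions.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct CachedPermission {
    pub user_version: u64,
    pub room_settings_version: u64,
    /// Computed effective permissions (representation is an assumption).
    pub bits: u64,
}

/// A cached entry is usable only when it satisfies both fences.
pub fn permission_cache_hit(cached: &CachedPermission, user_fence: u64, room_settings_fence: u64) -> bool {
    cached.user_version >= user_fence && cached.room_settings_version >= room_settings_fence
}
```

Advancing only the `RoomSettings(room_id)` fence rejects every member's cached entry for that room, while advancing one `Permission(room_id, user_id)` fence rejects only that member's entry.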
Role of Invalidation
Redis Streams, local broadcast, and PostgreSQL notifications are convergence mechanisms:
- Reduce stale L1 residency.
- Reduce the chance that the next strong read falls back to DB.
- Drive Realtime resource observation re-evaluation.
They are not the source of strong consistency. When adding a strong path, design the fence and version validation first, then add invalidation as an optimization.
New Cache Design Rules
New or changed caches must satisfy these constraints:
| Constraint | Rule |
|---|---|
| Authorization, access control, existence checks, and critical user-visible state | Use the strong/fence protocol; if the path cannot join the fence protocol, keep it DB-authoritative |
| Cached value version source | Prefer a business row version; derived values store the source versions used in computation |
| Relationship between Redis fence and DB version | Strong reads must not see a committed fence that lags the DB; install a pending reservation before the DB commit |
| L2 overwrite semantics | All writes use set_if_version_at_least; an older reload cannot overwrite a newer value |
| Redis unavailable semantics | Authorization paths fail closed or bypass cache and read DB |
| Async invalidation semantics | Invalidation is only a convergence optimization, never a correctness dependency |
| Service integration | Use ConsistencyCoordinator for fence access; do not call low-level VersionFenceStore primitives directly from service code |
Observability
Consistency metrics are used to detect safe-but-degraded reads and write paths that need repair:
| Metric | Meaning |
|---|---|
| `cache_fence_operations_total{domain,operation,result}` | Success, conflict, timeout, and error counts for current-version reads, begin/commit/abort, and seed operations |
| `cache_db_fallback_total{domain,reason}` | Strong reads that fell back to PostgreSQL because of missing fences, stale cache, L2 errors, and similar reasons |
| `cache_stale_write_reject_total{cache_type,level}` | Version-aware cache writes rejected because L1/L2 already held a newer value |
| `cache_fence_pending{domain}` | Whether a domain currently has a pending fence reservation |
| `cache_fence_repair_total{domain,result}` | Read-time PostgreSQL fallback repair outcomes for advancing/finalizing fences |
| `cache_fence_db_compare{domain,relation}` | Redis fence vs PostgreSQL version relation observed during repair/patrol, such as fence_behind_db, fence_ahead_db, or pending_ahead_db |
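One plausible way to derive the `relation` label for `cache_fence_db_compare` is a pure comparison function. This is a sketch under assumptions: the function name, the precedence of the pending check, and the `in_sync` label are hypothetical, and only the other three labels appear in the table above.

```rust
/// Classify the fence against the PostgreSQL row version for a metric label.
pub fn fence_db_relation(committed_fence: u64, pending: Option<u64>, db_version: u64) -> &'static str {
    // Assumption: a pending reservation ahead of the DB is reported first.
    if let Some(pending_version) = pending {
        if pending_version > db_version {
            return "pending_ahead_db";
        }
    }
    if committed_fence < db_version {
        // The DB holds a newer version than the fence exposes to strong reads.
        "fence_behind_db"
    } else if committed_fence > db_version {
        "fence_ahead_db"
    } else {
        // Hypothetical label for the steady state.
        "in_sync"
    }
}
```

Of these, `fence_behind_db` corresponds to the state the write ordering is designed to prevent, so it is the label most worth alerting on.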
Test Requirements
Strong-consistency cache changes should cover these cases:
- L1 contains an old value, Redis fence has advanced, and the strong read rejects L1.
- L2 contains an old value, Redis fence has advanced, and the strong read rejects L2.
- The write path reserves a Redis version and stores that exact version in DB.
- An older reload cannot overwrite a newer L2 value.
- A derived cache is rejected after an upstream source version changes.
Permission changes should also cover:
- A member-level permission mutation affects only that user’s permission fence.
- A room default permission mutation makes old permission cache entries fail the room-settings fence check.
Research Comparison And Current Conclusions
This design review used at least 15 modern papers, production writeups, and popular open-source system documents:
| Source | Useful idea | SyncTV conclusion |
|---|---|---|
| Scaling Memcache at Facebook | Leases, invalidation fanout, hot-key protection, and treating cache as an operated system | SyncTV treats invalidation as convergence; production operation should observe fence lag, CAS skips, and DB fallback |
| TAO: Facebook’s Distributed Data Store for the Social Graph | Object-oriented cache/version structure over graph data | CacheDomain should follow business resources; derived values must store source versions |
| RAMP-TAO | Multi-object reads must avoid fractured visibility | Permission cache derives from member rows and room settings, so it must store both source versions |
| Polaris / Cache Made Consistent | Production cache consistency needs independent detection, not only code review | Redis fence vs DB version lag and rejected old L2 writes are core consistency signals |
| Amazon Dynamo | Object versioning and explicit conflict handling | Redis fence may be ahead of DB, but cache entries must carry real source versions, not only fence versions |
| Google Spanner | Monotonic timestamps and external consistency depend on clear commit ordering | SyncTV is not a global transaction system; strong guarantees are per domain through Redis monotonic fences |
| Cloud Spanner external consistency docs | Strong and stale reads are explicit modes | SyncTV must keep strong and eventual APIs clearly separated |
| Calvin | Decide transaction order before execution | Reserving Redis version before DB write is correct; a fence ahead of DB is a safe cache-miss state |
| RAMP transactions | Read-atomic metadata is required for multi-source derived reads | Permission cache must store member version and room settings version, not a single logical invalidation version |
| FaRM | High-performance transactions still need validation | L2 CAS and DB optimistic locks are both required; unconditional set is not acceptable |
| Kubernetes API concepts | resourceVersion is used for change detection and consistency requirements | RoomSettings.version and RoomMember.version should be cache source versions (no client-side cache version) |
| Kubernetes consistent reads from cache | Consistent cache reads require progress/version proof | SyncTV L1/L2 can serve strong reads only after satisfying Redis fence |
| etcd API guarantees | Linearizable and serializable/stale reads are separate modes | Redis failure must not make authorization paths trust stale cache; use DB fallback or fail closed |
| Envoy xDS protocol | Version + nonce avoids ACK/NACK races | Realtime and invalidation can later expose observed-version debug fields, but correctness must not depend on ACKs |
| CockroachDB follower reads | Stale reads are explicit consistent historical reads | SyncTV eventual paths are only for low-risk reads, never authorization or access control |
| TiDB stale read | Historical reads require TSO/safe-point boundaries | Stale-read modes require an explicit staleness bound, not an implicit TTL guarantee |
| Cassandra LWT | CAS/linearizable writes are useful for critical conditional updates | SyncTV must keep DB optimistic locks plus Redis L2 version CAS for critical cache writes |
| Redis Lua scripting | Redis scripts can provide atomic compare-and-set boundaries | set_version_at_least and L2 Lua CAS provide Redis-side atomic version boundaries |
Design conclusions:
| Topic | Conclusion |
|---|---|
| Derived caches | Derived cache entries store actual source versions, not the current Redis fence as if it were a source version. Permission cache stores RoomMember.version and room settings row version. |
| Write ordering | The fence exposed to strong reads must not lag DB; Redis failure must not silently complete a strong-consistency write. Room settings, playback, and membership/permission/role writes begin a pending fence first, then write exact DB versions. |
| Multi-source strong reads | Strong reads that need multiple fences must evaluate cache hits against one coherent freshness boundary. |
| Production observability | Consistency observability covers fence lag, CAS rejects, DB fallback, and Redis fence unavailable events. |
| Delete semantics | Delete, leave, and kick transitions require explicit version semantics; member deletion writes a PostgreSQL lifecycle marker as the tombstone version, while strong reads still use DB authority for non-member authorization and never cache a successful authorization result for a removed member. |
| Eventual paths | Eventual APIs and strong APIs have different consistency contracts; authorization and access control must not use eventual reads. |