Cache Consistency Development Guide

This page is for developers maintaining server-side cache code. It defines which paths require strong consistency, how Redis version fences act as authoritative freshness boundaries, and what rules new caches must follow.

Core rule: asynchronous invalidation is for convergence, not correctness. Authorization, access control, room settings, playback state, membership, and resource-existence paths must remain correct even when a node has not received an invalidation event.

| Component | Purpose | Code entry point |
|---|---|---|
| L1 cache | Per-node memory cache that avoids repeated local reads | moka cache, RoomSettingsCache, PlaybackStateCache |
| Redis L2 cache | Shared cross-node cache that reduces PostgreSQL read load | synctv-core/src/cache/l2_backend.rs |
| Redis version fence | Authoritative freshness version for a logical resource | synctv-core/src/cache/consistency.rs |
| PostgreSQL row version | Durable optimistic-lock version of business state | repository-layer version columns |
| Invalidation stream | Clears other nodes’ local caches sooner | CacheInvalidationRuntime |

Redis version fences are the decision point for strong reads. L1 and L2 values must carry versions; a strong read may return cached data only when the cached version satisfies the fence.

CacheDomain defines logical resources that can be guarded by Redis fences:

| Domain | Scope | Current strategy |
|---|---|---|
| RoomSettings(room_id) | Room password, join policy, approval policy, role default permissions, and room access behavior | Redis allocates versions, DB stores exact versions, L1/L2 use version-aware writes |
| Playback(room_id) | Current playback state, resets, autoplay, and playback state after media cleanup | Redis allocates versions, DB stores exact versions, L2 uses state version CAS |
| Permission(room_id, user_id) | One member’s effective permissions | Member-level mutations advance the user fence through reservations; strong reads validate both user fence and room-settings fence |
| RoomMembership(room_id, user_id) | Membership, kick, leave, and post-leave access boundaries | If cached later, it must first join the fence protocol; current critical paths are DB-authoritative |
| MediaResource(room_id, media_id) | Media existence, ownership, and access after deletion | If cached later, it must first join the fence protocol; current critical paths are DB-authoritative |
| Playlist(room_id, playlist_id) | Playlist existence, ownership, and access after deletion | If cached later, it must first join the fence protocol; current critical paths are DB-authoritative |
| UserAuthSecurity(user_id) | Ban, deletion, password version, token revocation, OAuth/passkey/session state | If cached, it must fail closed or join the fence protocol |

Do not design domains around API routes. A domain should represent business state that changes and is validated together.

A strong read must follow this logic:

  1. Read the Redis fence.
  2. If Redis is unavailable or the fence store is not authoritative, authorization and access-control paths must bypass cache and read PostgreSQL; they must not trust old cache.
  3. Check L1. Return it only when cached.version >= fence.
  4. Check L2. Return it only when cached.version >= fence.
  5. Read PostgreSQL and refresh cache with a version-aware write.

Pseudocode:

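    // Read the authoritative fence first. On authorization and access-control
    // paths, a fence error means "bypass cache and read PostgreSQL", never
    // "trust whatever is cached".
    let fence = version_fence.current_version(&domain).await?;

    // L1 is usable only when it is at least as new as the fence.
    if let Some(value) = l1.get(key).await {
        if value.version >= fence {
            return Ok(value);
        }
    }

    // The same version check applies to the shared L2 cache.
    if let Some(value) = l2.get(key).await? {
        if value.version >= fence {
            return Ok(value);
        }
    }

    // Miss or stale entry: load from PostgreSQL, then refresh the caches with
    // a version-aware write so an older reload cannot replace a newer value.
    let value = repository.load_with_version(key).await?;
    cache.set_if_version_at_least(key, value.clone()).await?;
    Ok(value)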

Do not use simple cache-first logic in strong reads. Cache-first is only acceptable for paths explicitly marked eventual and low risk.
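
The read logic above can be exercised in isolation with a small in-memory model. The following is a minimal sketch, not the SyncTV implementation: `Stores`, `FenceState`, `Source`, and `Versioned` are hypothetical names, and plain `HashMap`s stand in for moka, Redis, and PostgreSQL. It also assumes that a pending or unavailable fence skips the cache refresh as well as the cache lookup.

```rust
use std::collections::HashMap;

/// A value tagged with the row version it was loaded at.
#[derive(Clone, Debug, PartialEq)]
struct Versioned {
    version: u64,
    data: String,
}

/// What the fence store reports for one domain (hypothetical shape).
#[derive(Clone, Copy, Debug)]
enum FenceState {
    /// Authoritative committed version with no write in flight.
    Committed(u64),
    /// A reservation is pending, so cached data cannot be trusted.
    Pending,
    /// Redis is unreachable or the fence store is not authoritative.
    Unavailable,
}

/// Where a strong read got its answer from.
#[derive(Debug, PartialEq)]
enum Source {
    L1,
    L2,
    Database,
}

/// Toy stand-ins: plain maps instead of moka, Redis, and PostgreSQL.
#[derive(Default)]
struct Stores {
    l1: HashMap<String, Versioned>,
    l2: HashMap<String, Versioned>,
    db: HashMap<String, Versioned>,
}

/// Version-aware write: keep the existing entry when it is strictly newer.
fn set_if_version_at_least(map: &mut HashMap<String, Versioned>, key: &str, value: &Versioned) {
    let newer_exists = map.get(key).map_or(false, |e| e.version > value.version);
    if !newer_exists {
        map.insert(key.to_string(), value.clone());
    }
}

impl Stores {
    fn strong_read(&mut self, key: &str, fence: FenceState) -> Option<(Versioned, Source)> {
        // Only a committed, authoritative fence permits cache hits.
        let fence = match fence {
            FenceState::Committed(version) => Some(version),
            FenceState::Pending | FenceState::Unavailable => None,
        };
        if let Some(fence) = fence {
            if let Some(hit) = self.l1.get(key).filter(|v| v.version >= fence) {
                return Some((hit.clone(), Source::L1));
            }
            if let Some(hit) = self.l2.get(key).filter(|v| v.version >= fence) {
                return Some((hit.clone(), Source::L2));
            }
        }
        // Stale cache, pending fence, or no fence: the database is authoritative.
        let value = self.db.get(key)?.clone();
        if fence.is_some() {
            set_if_version_at_least(&mut self.l2, key, &value);
            set_if_version_at_least(&mut self.l1, key, &value);
        }
        Some((value, Source::Database))
    }
}
```

In this model, an L1 entry at version 3 is rejected once the fence reports version 4, the read falls through to the database, and the next read with the same fence is served from L1 again.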

For resources with business row versions, Redis is the version allocator:

  1. Read the current DB version from PostgreSQL.
  2. Begin a fence write through ConsistencyCoordinator. In one atomic step, the Redis/local fence state checks whether the current committed or pending fence is already ahead of the observed DB version and reserves a pending version.
  3. Commit the PostgreSQL optimistic-lock update with that exact reserved version.
  4. Commit the same fence reservation token after the database transaction commits. If the DB CAS or transaction fails, abort only the matching pending reservation.
  5. Write L2/L1 with set_if_version_at_least.
  6. Publish invalidation and realtime events so other nodes converge sooner.

This order prevents the unsafe state: PostgreSQL has the new version while Redis fence still exposes the old version.
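
The reservation lifecycle behind this ordering can be modeled as a small state machine. This is a minimal sketch under stated assumptions, not the ConsistencyCoordinator API: `Fence`, `Reservation`, and `BeginError` are hypothetical names, the state lives in memory rather than in Redis, and the sketch assumes that a fence already ahead of the observed DB version, or an existing pending reservation, is reported to the caller as a conflict.

```rust
/// A pending version handed to exactly one writer.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Reservation {
    token: u64,
    version: u64,
}

#[derive(Debug, PartialEq)]
enum BeginError {
    /// Another writer already holds a pending reservation.
    PendingExists,
    /// The committed fence is ahead of the DB version the caller observed.
    FenceAhead { fence: u64, observed: u64 },
}

/// In-memory stand-in for one domain's fence state.
#[derive(Debug, Default)]
struct Fence {
    committed: u64,
    pending: Option<Reservation>,
    next_token: u64,
}

impl Fence {
    /// Check the observed DB version and reserve the next version in one step.
    fn begin_write(&mut self, observed_db_version: u64) -> Result<Reservation, BeginError> {
        if self.pending.is_some() {
            return Err(BeginError::PendingExists);
        }
        if self.committed > observed_db_version {
            return Err(BeginError::FenceAhead {
                fence: self.committed,
                observed: observed_db_version,
            });
        }
        self.next_token += 1;
        let reservation = Reservation {
            token: self.next_token,
            version: observed_db_version + 1,
        };
        self.pending = Some(reservation);
        Ok(reservation)
    }

    /// Publish the reserved version. Call only after the DB transaction commits.
    /// Returns false when the token does not match the pending reservation.
    fn commit_write(&mut self, reservation: Reservation) -> bool {
        if self.pending == Some(reservation) {
            self.committed = reservation.version;
            self.pending = None;
            true
        } else {
            false
        }
    }

    /// Abort only the matching pending reservation.
    fn abort_write(&mut self, reservation: Reservation) -> bool {
        if self.pending == Some(reservation) {
            self.pending = None;
            true
        } else {
            false
        }
    }

    /// The version a strong read may compare against; None while pending.
    fn readable_version(&self) -> Option<u64> {
        if self.pending.is_some() {
            None
        } else {
            Some(self.committed)
        }
    }
}
```

Between `begin_write` and `commit_write`, `readable_version` returns `None`, so strong reads in this model cannot validate cached data and must go to the database. That is the property that keeps the fence from lagging behind a committed DB version.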

Redis may hold a pending state. For example, a CAS conflict, transaction rollback, process crash, or outbox failure may leave a pending version without a matching DB commit. Strong reads must bypass cache and read PostgreSQL while pending exists. This is fail-safe; the cost is that the domain temporarily loses cache hits.

The current implementation keeps committed/pending state in the fence store and ConsistencyCoordinator. Strong reads fall back to the DB while pending exists, and tokenized room-settings, playback, membership, member-role, and member-permission writes commit the matching reservation after the database commit.

Read-time repair and the bootstrapped background repair worker both resolve a pending reservation by comparing the PostgreSQL row version with the pending version:

  • If the DB has reached the pending version, finalize pending.
  • If the DB has not reached the pending version and the pending lease has expired, expire the abandoned pending reservation.
  • If the DB has not reached the pending version and the lease has not expired, keep pending.

A local timeout alone must not abort pending; the repair must also compare the PostgreSQL version.
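
The repair rule is a pure decision over three inputs, which makes it straightforward to unit test. A minimal sketch, assuming the lease is tracked as a millisecond deadline; `RepairAction`, `decide_repair`, and the parameter names are hypothetical and not taken from the codebase:

```rust
#[derive(Debug, PartialEq)]
enum RepairAction {
    /// The DB reached the pending version: promote pending to committed.
    FinalizePending,
    /// The DB never reached it and the lease ran out: drop the reservation.
    ExpirePending,
    /// The writer may still be in flight: leave pending in place.
    KeepPending,
}

/// Decide how to repair a pending reservation.
/// The DB comparison always comes first; lease expiry alone never finalizes
/// or aborts a reservation whose version the DB has already reached.
fn decide_repair(
    db_version: u64,
    pending_version: u64,
    now_ms: u64,
    lease_deadline_ms: u64,
) -> RepairAction {
    if db_version >= pending_version {
        RepairAction::FinalizePending
    } else if now_ms >= lease_deadline_ms {
        RepairAction::ExpirePending
    } else {
        RepairAction::KeepPending
    }
}
```

Checking the DB version before the lease matters: a writer whose DB commit succeeded but whose fence commit was lost must be finalized even after its lease has expired.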

Business services should not call the low-level fence store directly. New strong-consistency paths must begin/commit/abort reservations, seed, or record DB fallback through ConsistencyCoordinator. This keeps metrics, error classification, and the pending/committed fence protocol behind one replacement point.

A SyncTV fence reservation is not part of the PostgreSQL transaction. Rolling back a DB transaction does not clear a pending reservation from Redis/local fence state. Every reservation therefore needs an explicit owner, and that owner must cover every exit path.

Mandatory rules:

  • After begin_*write succeeds, the reservation must immediately be owned by the current function, a local owner/collector, or a return value that successfully transfers ownership to the caller.
  • Before ownership is transferred to the caller, every later ?, return Err(...), CAS miss, outbox failure, auxiliary cleanup failure, and transaction commit failure must abort the matching reservation first.
  • If a helper creates a reservation, that helper must clean up its own failure paths. The caller can only clean up reservations that were successfully returned.
  • Batch reservation code must use a collector/owner pattern. If reservation N+1 fails, the first N reservations must be aborted immediately.
  • Fence commit may happen only after the PostgreSQL transaction has committed. Do not expose a pending reservation as committed before the durable DB fact exists.
  • Fence commit failure is a post-commit repair problem. Do not try to “roll back” a DB-committed business fact by aborting the version after commit.

Forbidden pattern:

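    // Every `?` after begin_write returns early and leaks the pending reservation.
    let reservation = begin_write().await?;
    write_db_row().await?;            // error exit: reservation is never aborted
    delete_auxiliary_rows().await?;   // error exit: reservation is never aborted
    tx.commit().await?;               // error exit: reservation is never aborted
    commit_write(&reservation).await?;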

Correct code must explicitly close error exits:

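    let reservation = begin_write().await?;

    // Run the fallible DB work inside one block so there is a single error exit.
    let result: Result<_> = async {
        write_db_row().await?;
        delete_auxiliary_rows().await?;
        Ok(())
    }
    .await;

    // DB work failed: abort the matching reservation before returning.
    if let Err(error) = result {
        abort_write(reservation.as_ref()).await;
        return Err(error);
    }

    // Commit failed: no durable DB fact exists, so abort the reservation.
    if let Err(error) = tx.commit().await {
        abort_write(reservation.as_ref()).await;
        return Err(error.into());
    }

    // The DB fact is durable. A failure here is a post-commit repair problem,
    // so do not abort the reservation.
    commit_write(reservation.as_ref(), db_version).await?;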
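
The batch rule (if reservation N+1 fails, abort the first N) can be sketched with a collector that owns every token until the whole batch succeeds. This is an illustrative in-memory model: `FenceStore`, `reserve_all`, and the `fail_on_key` failure switch are hypothetical, and the real begin/abort calls are asynchronous.

```rust
/// Toy fence store that tracks pending tokens and can be told to fail.
struct FenceStore {
    next_token: u64,
    pending: Vec<u64>,
    /// Test hook: begin_write fails for this key.
    fail_on_key: Option<String>,
}

impl FenceStore {
    fn begin_write(&mut self, key: &str) -> Result<u64, String> {
        if self.fail_on_key.as_deref() == Some(key) {
            return Err(format!("fence unavailable for {key}"));
        }
        self.next_token += 1;
        let token = self.next_token;
        self.pending.push(token);
        Ok(token)
    }

    fn abort_write(&mut self, token: u64) {
        self.pending.retain(|t| *t != token);
    }
}

/// Reserve every key or none. Ownership of the tokens moves to the caller
/// only on success; on failure this helper cleans up what it created.
fn reserve_all(store: &mut FenceStore, keys: &[&str]) -> Result<Vec<u64>, String> {
    let mut owned = Vec::with_capacity(keys.len());
    for &key in keys {
        match store.begin_write(key) {
            Ok(token) => owned.push(token),
            Err(error) => {
                // Abort the reservations collected so far before returning.
                for token in owned.drain(..) {
                    store.abort_write(token);
                }
                return Err(error);
            }
        }
    }
    Ok(owned)
}
```

The helper follows the rule that whoever creates a reservation cleans up its own failure paths: the caller never sees a partially reserved batch and only becomes responsible for tokens that were returned successfully.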

Before changing strong-consistency write paths, audit reservation ownership with source search and inspect every begin site that the change can affect:

    rg -n "begin_.*write|begin_observed_write|VersionFenceReservation" synctv-core/src/service synctv-core/src/cache
    rg -n "abort_.*write|commit_.*write|commit_reserved_write|abort_reserved_write" synctv-core/src/service synctv-core/src/cache

This search does not prove correctness. Reviewers must inspect every relevant begin site and verify owner transfer, every ? / return Err path before transfer, transaction commit failure handling, post-commit finalization, and cache invalidation.

Papers and open-source systems provide principles, not a drop-in implementation for this codebase. Spanner, etcd, and Kubernetes watch-cache designs keep version proofs inside one controlled system. SyncTV currently spans PostgreSQL transactions and Redis/local fence state without a global transaction manager, so service code must explicitly maintain pending reservation ownership, abort, and commit.

Redis L2 must not be overwritten unconditionally. Any reload-from-DB path that writes L2 must use a version-aware write:

    cache.set_if_version_at_least(key, value).await?;

This prevents a racing read from writing version N back into Redis after a write path has already committed version N+1.
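
The race can be reproduced with a map standing in for Redis. A minimal sketch, assuming the comparison is "reject only when the stored version is strictly newer"; the `Entry` type and the boolean return value are illustrative, and a real L2 write would need the compare and the set to run atomically on the Redis side:

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
struct Entry {
    version: u64,
    payload: String,
}

/// Write `incoming` unless the cache already holds a strictly newer version.
/// Returns true when the write was applied.
fn set_if_version_at_least(
    cache: &mut HashMap<String, Entry>,
    key: &str,
    incoming: Entry,
) -> bool {
    let newer_exists = cache
        .get(key)
        .map_or(false, |existing| existing.version > incoming.version);
    if newer_exists {
        return false;
    }
    cache.insert(key.to_string(), incoming);
    true
}
```

If a write path stores version 2 and a slow reader then tries to store the version 1 row it loaded earlier, the second call returns `false` and the version 2 entry stays in place.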

Effective permissions are not stored as an independent snapshot table. They are computed at read time:

    effective_permissions =
        f(global_defaults, room_settings.role_defaults, room_member.role, member_overrides)

Permission cache entries therefore store two versions:

| Field | Source | Meaning |
|---|---|---|
| user_version | Permission(room_id, user_id) fence | Freshness of the member’s own role and overrides |
| room_settings_version | PostgreSQL _settings row version | Room settings version used when computing this permission value |

A strong permission read may return cache only when both checks pass:

cached.user_version >= Redis Permission(room_id, user_id) fence
cached.room_settings_version >= Redis RoomSettings(room_id) fence

Changing one member’s role or permission overrides advances only that member’s Permission(room_id, user_id) fence.

Room default permissions are part of RoomSettings. After a settings write advances the RoomSettings(room_id) fence, old permission cache entries are rejected because their room_settings_version no longer satisfies the new fence. invalidate_room_cache(room_id) only performs local clearing and broadcast convergence; it is not the correctness mechanism.
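
The two-fence check can be written as a single predicate. A minimal sketch: `CachedPermission`, `is_fresh`, and the `bits` payload are hypothetical names, while the two version fields follow the table above.

```rust
/// A cached effective-permission value with the source versions it was
/// computed from.
#[derive(Debug, Clone)]
struct CachedPermission {
    user_version: u64,
    room_settings_version: u64,
    /// Illustrative payload: the computed permission bits.
    bits: u64,
}

/// A cache hit is valid only when both source versions satisfy their fences.
fn is_fresh(cached: &CachedPermission, user_fence: u64, room_settings_fence: u64) -> bool {
    cached.user_version >= user_fence && cached.room_settings_version >= room_settings_fence
}
```

Because each entry records the room-settings version it was derived from, advancing the RoomSettings(room_id) fence rejects every member's cached permissions in that room without touching any per-user fence.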

Redis Streams, local broadcast, and PostgreSQL notifications are convergence mechanisms:

  • Reduce stale L1 residency.
  • Reduce the chance that the next strong read falls back to DB.
  • Drive Realtime resource observation re-evaluation.

They are not the source of strong consistency. When adding a strong path, design the fence and version validation first, then add invalidation as an optimization.

New or changed caches must satisfy these constraints:

| Constraint | Rule |
|---|---|
| Authorization, access control, existence checks, and critical user-visible state | Use the strong/fence protocol; if the path cannot join the fence protocol, keep it DB-authoritative |
| Cached value version source | Prefer a business row version; derived values store the source versions used in computation |
| Relationship between Redis fence and DB version | Strong reads must not see a committed fence that lags the DB; install a pending reservation before the DB commit |
| L2 overwrite semantics | All writes use set_if_version_at_least; an older reload cannot overwrite a newer value |
| Redis unavailable semantics | Authorization paths fail closed or bypass cache and read DB |
| Async invalidation semantics | Invalidation is only a convergence optimization, never a correctness dependency |
| Service integration | Use ConsistencyCoordinator for fence access; do not call low-level VersionFenceStore primitives directly from service code |

Consistency metrics are used to detect safe-but-degraded reads and write paths that need repair:

| Metric | Meaning |
|---|---|
| cache_fence_operations_total{domain,operation,result} | Success, conflict, timeout, and error counts for current-version reads, begin/commit/abort, and seed operations |
| cache_db_fallback_total{domain,reason} | Strong reads that fell back to PostgreSQL because of missing fences, stale cache, L2 errors, and similar reasons |
| cache_stale_write_reject_total{cache_type,level} | Version-aware cache writes rejected because L1/L2 already held a newer value |
| cache_fence_pending{domain} | Whether a domain currently has a pending fence reservation |
| cache_fence_repair_total{domain,result} | Read-time PostgreSQL fallback repair outcomes for advancing/finalizing fences |
| cache_fence_db_compare{domain,relation} | Redis fence vs PostgreSQL version relation observed during repair/patrol, such as fence_behind_db, fence_ahead_db, or pending_ahead_db |

Strong-consistency cache changes should cover these cases:

  • L1 contains an old value, Redis fence has advanced, and the strong read rejects L1.
  • L2 contains an old value, Redis fence has advanced, and the strong read rejects L2.
  • The write path reserves a Redis version and stores that exact version in DB.
  • An older reload cannot overwrite a newer L2 value.
  • A derived cache is rejected after an upstream source version changes.

Permission changes should also cover:

  • A member-level permission mutation affects only that user’s permission fence.
  • A room default permission mutation makes old permission cache entries fail the room-settings fence check.

Research Comparison And Current Conclusions

This design review used at least 15 modern papers, production writeups, and popular open-source system documents:

| Source | Useful idea | SyncTV conclusion |
|---|---|---|
| Scaling Memcache at Facebook | Leases, invalidation fanout, hot-key protection, and treating cache as an operated system | SyncTV treats invalidation as convergence; production operation should observe fence lag, CAS skips, and DB fallback |
| TAO: Facebook’s Distributed Data Store for the Social Graph | Object-oriented cache/version structure over graph data | CacheDomain should follow business resources; derived values must store source versions |
| RAMP-TAO | Multi-object reads must avoid fractured visibility | Permission cache derives from member rows and room settings, so it must store both source versions |
| Polaris / Cache Made Consistent | Production cache consistency needs independent detection, not only code review | Redis fence vs DB version lag and rejected old L2 writes are core consistency signals |
| Amazon Dynamo | Object versioning and explicit conflict handling | Redis fence may be ahead of DB, but cache entries must carry real source versions, not only fence versions |
| Google Spanner | Monotonic timestamps and external consistency depend on clear commit ordering | SyncTV is not a global transaction system; strong guarantees are per domain through Redis monotonic fences |
| Cloud Spanner external consistency docs | Strong and stale reads are explicit modes | SyncTV must keep strong and eventual APIs clearly separated |
| Calvin | Decide transaction order before execution | Reserving Redis version before DB write is correct; a fence ahead of DB is a safe cache-miss state |
| RAMP transactions | Read-atomic metadata is required for multi-source derived reads | Permission cache must store member version and room settings version, not a single logical invalidation version |
| FaRM | High-performance transactions still need validation | L2 CAS and DB optimistic locks are both required; unconditional set is not acceptable |
| Kubernetes API concepts | resourceVersion is used for change detection and consistency requirements | RoomSettings.version, (no client-side cache version), and RoomMember.version should be cache source versions |
| Kubernetes consistent reads from cache | Consistent cache reads require progress/version proof | SyncTV L1/L2 can serve strong reads only after satisfying Redis fence |
| etcd API guarantees | Linearizable and serializable/stale reads are separate modes | Redis failure must not make authorization paths trust stale cache; use DB fallback or fail closed |
| Envoy xDS protocol | Version + nonce avoids ACK/NACK races | Realtime and invalidation can later expose observed-version debug fields, but correctness must not depend on ACKs |
| CockroachDB follower reads | Stale reads are explicit consistent historical reads | SyncTV eventual paths are only for low-risk reads, never authorization or access control |
| TiDB stale read | Historical reads require TSO/safe-point boundaries | Stale-read modes require an explicit staleness bound, not an implicit TTL guarantee |
| Cassandra LWT | CAS/linearizable writes are useful for critical conditional updates | SyncTV must keep DB optimistic locks plus Redis L2 version CAS for critical cache writes |
| Redis Lua scripting | Redis scripts can provide atomic compare-and-set boundaries | set_version_at_least and L2 Lua CAS provide Redis-side atomic version boundaries |

Design conclusions:

| Topic | Conclusion |
|---|---|
| Derived caches | Derived cache entries store actual source versions, not the current Redis fence as if it were a source version. Permission cache stores RoomMember.version and room settings row version. |
| Write ordering | The fence exposed to strong reads must not lag DB; Redis failure must not silently complete a strong-consistency write. Room settings, playback, and membership/permission/role writes begin a pending fence first, then write exact DB versions. |
| Multi-source strong reads | Strong reads that need multiple fences must evaluate cache hits against one coherent freshness boundary. |
| Production observability | Consistency observability covers fence lag, CAS rejects, DB fallback, and Redis fence unavailable events. |
| Delete semantics | Delete, leave, and kick transitions require explicit version semantics; member deletion writes a PostgreSQL lifecycle marker as the tombstone version, while strong reads still use DB authority for non-member authorization and never cache a successful authorization result for a removed member. |
| Eventual paths | Eventual APIs and strong APIs have different consistency contracts; authorization and access control must not use eventual reads. |