200ms p99 cold latency at 100B+ scale, with <1GB of RAM per billion vectors

goals:

Scale ⬆️ · recall ⬆️ · latency ⬇️ · cost ⬇️ · memory ⬇️
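A quick back-of-envelope calculation shows how tight the "<1GB of RAM per billion vectors" target is. The 768-dimensional float32 embedding below is an assumed example, not a figure from the source.

$$
\frac{1\ \text{GB}}{10^{9}\ \text{vectors}} = \frac{10^{9}\ \text{bytes}}{10^{9}\ \text{vectors}} = 1\ \text{byte per vector}
$$

$$
768 \times 4\ \text{bytes} = 3072\ \text{bytes per vector} \quad\Rightarrow\quad \frac{1}{3072} \approx 0.03\%\ \text{of the raw vectors fit in RAM}
$$

At 100B vectors the same budget is under 100 GB of RAM in total, so nearly all vector data has to stay on object storage, with RAM reserved for a small index structure (for example, cluster centroids).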

indexing:

HNSW:

IVF:
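A minimal IVF (inverted file) sketch: k-means picks centroids, every vector goes into the inverted list of its nearest centroid, and a query scans only the `n_probe` lists whose centroids are closest. Everything here (`IVFIndex`, `kmeans`, `n_probe`) is a hypothetical illustration in pure Python, not turbopuffer's code; whether turbopuffer lays out clusters exactly this way is an assumption.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=10, seed=0):
    """Plain Lloyd's k-means; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            clusters[nearest].append(v)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


class IVFIndex:
    def __init__(self, vectors, k):
        self.centroids = kmeans(vectors, k)
        # Inverted lists: centroid id -> [(vector_id, vector), ...].
        # Assumption: on object storage each list could be one contiguous object,
        # so a query costs roughly n_probe reads.
        self.lists = {c: [] for c in range(k)}
        for vid, v in enumerate(vectors):
            self.lists[self._nearest_centroids(v, 1)[0]].append((vid, v))

    def _nearest_centroids(self, q, n):
        order = sorted(range(len(self.centroids)),
                       key=lambda c: squared_distance(q, self.centroids[c]))
        return order[:n]

    def search(self, q, top_k=3, n_probe=2):
        """Approximate search: only the n_probe closest clusters are scanned."""
        candidates = []
        for c in self._nearest_centroids(q, n_probe):
            candidates.extend(self.lists[c])
        candidates.sort(key=lambda item: squared_distance(q, item[1]))
        return [vid for vid, _ in candidates[:top_k]]
```

The centroids are tiny compared with the vectors, which is what makes this family of indexes attractive under a strict RAM budget: centroids can stay in memory while the lists live on S3. Raising `n_probe` trades latency for recall.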

LIRE:

limitations and problems faced by turbopuffer:

  • Glob filters and regexes can be expensive. Glob `tpuf*` is compiled down to an optimized prefix scan, whereas Glob `*tpuf*` or IGlob may scan every document in the namespace. Contact us if you're seeing performance issues for your workload; we can likely suggest alternatives (e.g. using full-text search or a different filter). This is not a fundamental limitation, and we plan to introduce indexes for these types of queries soon.
  • Use Python with C bindings. If you're using the Python turbopuffer client, use the `turbopuffer[fast]` package rather than the base package. It includes C binaries that can dramatically improve ingestion throughput by using a faster JSON serializer.
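The Glob point above comes down to whether the pattern has a literal prefix. Over keys kept in sorted order, `tpuf*` can binary-search to the first key starting with `tpuf` and stop at the first key that does not, while `*tpuf*` offers nothing to narrow on. This is a minimal sketch: `literal_prefix` and `glob_filter` are hypothetical helpers, not turbopuffer's implementation, and Python's `fnmatch` syntax is assumed to approximate turbopuffer's Glob semantics.

```python
import bisect
from fnmatch import fnmatchcase

GLOB_META = set("*?[")


def literal_prefix(pattern):
    """Return the characters before the first glob metacharacter."""
    for i, ch in enumerate(pattern):
        if ch in GLOB_META:
            return pattern[:i]
    return pattern


def glob_filter(sorted_keys, pattern):
    """Return (matches, keys_examined) for a glob over lexicographically sorted keys."""
    prefix = literal_prefix(pattern)
    if prefix:
        # Only keys starting with the prefix can match: jump to the range by binary search.
        lo = bisect.bisect_left(sorted_keys, prefix)
        hi = lo
        while hi < len(sorted_keys) and sorted_keys[hi].startswith(prefix):
            hi += 1
        candidates = sorted_keys[lo:hi]
    else:
        # Leading wildcard: no prefix to narrow on, so every key is a candidate.
        candidates = sorted_keys
    matches = [k for k in candidates if fnmatchcase(k, pattern)]
    return matches, len(candidates)
```

With keys `alpha`, `my-tpuf`, `tpuf-1`, `tpuf-2`, `turbo`, `zeta`, the pattern `tpuf*` examines 2 keys and `*tpuf*` examines all 6. A case-insensitive IGlob also cannot reuse a case-sensitive sort order, which is one plausible reason it falls back to scanning as well.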

https://turbopuffer.com/blog/native-filtering

Core Idea: “Write a new version, read the merged view”

| Traditional DB (mutable) | S3 + LSM (immutable) |
| --- | --- |
| `UPDATE row SET x=5` → modify disk block | `INSERT (key, x=5, ts=NOW)` → new SSTable file |
| Read → one block | Read → merge latest versions on-the-fly |
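The right-hand column of the table can be modeled in a few lines: every write becomes a new immutable segment (standing in for an SSTable file on S3), reads merge all segments and keep the version with the highest timestamp, and compaction rewrites the segments into one. This is a toy, in-memory sketch; `ImmutableLSM` and `Record` are hypothetical names, a logical counter replaces `ts=NOW`, and real engines add sorted lookups, bloom filters, and tiered compaction.

```python
import itertools
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Record:
    key: str
    value: Optional[Any]  # None marks a delete (tombstone)
    ts: int


class ImmutableLSM:
    def __init__(self):
        self.segments = []  # list of immutable tuples of Records, one per "SSTable file"
        self._clock = itertools.count(1)  # logical timestamp instead of wall-clock NOW

    def write(self, batch):
        """Write a batch as a brand-new segment; existing segments are never modified."""
        ts = next(self._clock)
        self.segments.append(tuple(Record(k, v, ts) for k, v in sorted(batch.items())))

    def get(self, key):
        """Merge on read: the version with the highest timestamp wins."""
        latest = None
        for segment in self.segments:
            for rec in segment:
                if rec.key == key and (latest is None or rec.ts > latest.ts):
                    latest = rec
        return None if latest is None else latest.value

    def compact(self):
        """Replace all segments with one new segment holding only the latest live versions."""
        latest = {}
        for segment in self.segments:
            for rec in segment:
                if rec.key not in latest or rec.ts > latest[rec.key].ts:
                    latest[rec.key] = rec
        live = (r for r in latest.values() if r.value is not None)
        self.segments = [tuple(sorted(live, key=lambda r: r.key))]
```

An update never touches the old segment, so a crash mid-write leaves earlier data intact; the price is that reads must merge versions until compaction runs.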

Immutability is not a bug — it’s the foundation of durability.

Quickwit (https://github.com/quickwit-oss/quickwit) takes a similar approach on S3, but its records are immutable.

S3 / GCS / Azure Blob Storage

!!! tip "Lowest cost, highest latency"

    - **Latency** ⇒ Highest latency of the options, and p95 is substantially worse than p50. Expect results on the order of several hundred milliseconds.
    - **Scalability** ⇒ Effectively unlimited storage; however, QPS is limited by S3 concurrency limits.
    - **Cost** ⇒ Lowest (an order of magnitude cheaper than the other options).
    - **Reliability/Availability** ⇒ Highly available, as blob stores like S3 are critical infrastructure that forms the backbone of the internet.