Concepts
Columnstore Table
Columnar storage dramatically improves analytical query performance by storing data vertically, allowing for massive compression and precise data retrieval. When querying only specific columns, the vectorized engine can load and process only the required data, skipping irrelevant columns entirely. This approach enables DuckDBs vectorized execution to process multiple data points simultaneously, transforming columnar storage into a high-speed data processing engine.
What is Iceberg?
Iceberg is becoming the agreed-upon storage format for analytic workloads. It is just some metadata over Parquet files. This metadata gives it full table semantics. All execution engines like Starrocks, Snowflake, Trino, DuckDB, Presto, Spark, Polars can query Iceberg Tables directly.
What does pg_mooncake mean by Iceberg Columnstore table?
In Postgres, the columnstore table is just like a regular table. Outside of Postgres, we write data and metadata in a format that is directly readable by these engines.
TLDR: You do not have to replicate data out of Postgres to query it as a Python dataframe.
If data is stored in S3, how do we make it fast?
We support cache on writes. We will soon be supporting cache on reads. This is how we can sub second performance. For the best peformance, co-locate your object store bucket and Postgres instance.
Why are we doing this?
1. Run Analytics on Live Application Data in PostgreSQL Run transactions on columnstore tables, ensuring up-to-date analytics without managing individual Parquet files.
2. Write PostgreSQL Tables to Your Lake or Lakehouse Make application data accessible as native tables for data engineering and science outside of PostgreSQL, without complex ETL, CDC or stitching of files.
3. Query and Update existing Lakehouse Tables Natively in PostgreSQL (coming soon) Connect existing Lakehouse catalogs and expose them directly as columnstore tables in PostgreSQL.