# Load data into a columnstore table

Ensure the DDL of your columnstore table matches the schema of the source data.
## From a PostgreSQL table
```sql
INSERT INTO users_columnstore
SELECT * FROM users_rowstore;
```
Note: columnstore tables are not suited for small, frequent writes and updates.
We recommend batching writes, for example with a cron job or with triggers on a staging table. Reach out to us for help.
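A minimal sketch of the staging-table approach (table names are illustrative): small writes land in a rowstore staging table, and a periodic job flushes them into the columnstore table in one batch.

```sql
-- Rowstore staging table receives the small, frequent writes.
CREATE TABLE users_staging (LIKE users_rowstore);

-- Periodic flush (e.g. run from a cron job):
-- move all accumulated rows in a single transaction.
BEGIN;
INSERT INTO users_columnstore SELECT * FROM users_staging;
TRUNCATE users_staging;
COMMIT;
```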
## From Parquet / CSV / JSON files
To load files from S3, first set up your cloud storage credentials.
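For reference, a sketch of credential setup, assuming the `mooncake.create_secret` helper; the exact signature may differ by version, so consult the cloud storage setup guide. All values below are placeholders.

```sql
-- Register S3 credentials so COPY and mooncake.read_* can access the bucket.
-- Placeholder values: replace with your own key, secret, and region.
SELECT mooncake.create_secret(
  'my_s3_secret', 'S3',
  '<AWS_ACCESS_KEY_ID>', '<AWS_SECRET_ACCESS_KEY>',
  '{"REGION": "us-east-1"}'
);
```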
1. Query a file with `mooncake.read_parquet`, `mooncake.read_csv`, or `mooncake.read_json`.

```sql
SELECT n FROM mooncake.read_parquet('s3://bucket/file.parquet') AS (n int);
```

Note: you must define the file's schema in the `AS` clause. If your `mooncake.read_parquet` query does not involve a columnstore table, you may first need to run:

```sql
LOAD 'pg_mooncake';
```
2. Load a file into a columnstore table.

```sql
COPY users_columnstore FROM 's3://bucket/file.<extension>';
```
3. Load multiple files using a wildcard path:

```sql
COPY users_columnstore FROM 's3://bucket/folder/*.<extension>';
```
## From a Hugging Face dataset

Loading from Hugging Face works like loading from S3; replace the `s3://` prefix with `hf://`.
1. Find a Hugging Face dataset, for example `lmsys_chat_1m_clean`.
2. Navigate to its files and find the path to a data file:

   https://huggingface.co/datasets/OpenLeecher/lmsys_chat_1m_clean/blob/main/data/train-00000-of-00004.parquet

3. Add the `hf://` prefix and remove `/blob/main`:

```sql
COPY users_columnstore FROM 'hf://datasets/OpenLeecher/lmsys_chat_1m_clean/data/train-00000-of-00004.parquet';
```
4. Join a Hugging Face dataset with an existing PostgreSQL table.

```sql
SELECT u.*, p.*
FROM users_rowstore u
JOIN mooncake.read_parquet('hf://datasets/OpenLeecher/lmsys_chat_1m_clean/data/train-00000-of-00004.parquet')
  AS p(id int, ...)  -- define the rest of the file's schema here
ON u.id = p.id;
```
## From existing Iceberg / Delta tables

1. Read existing tables with `iceberg_scan` or `delta_scan`:

```sql
SELECT n FROM iceberg_scan('s3://bucket/table') AS (n int);
```
2. Load an Iceberg table into a columnstore table.

```sql
COPY users_columnstore FROM 's3://bucket/table';
-- or, equivalently:
INSERT INTO users_columnstore
SELECT n FROM iceberg_scan('s3://bucket/table') AS (n int);
```