Load data into a columnstore table

Ensure that the DDL of your columnstore table matches the schema of your data source.

From a PostgreSQL table

```sql
INSERT INTO users_columnstore
SELECT * FROM users_rowstore;
```

Note: columnstore tables are not suited for small writes and updates.

We recommend batching writes using a cron job, or triggers on a staging table. Reach out to us for help.
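One way to batch writes is a staging pattern: small writes land in an ordinary rowstore table, and a periodic job flushes them to the columnstore in a single transaction. A minimal sketch, assuming a hypothetical users_staging table with the same columns as users_columnstore:

```sql
-- Hypothetical staging table for small, frequent writes.
CREATE TABLE users_staging (LIKE users_rowstore);

-- Run periodically (e.g. from a cron job): flush the batch, then clear it.
BEGIN;
INSERT INTO users_columnstore SELECT * FROM users_staging;
TRUNCATE users_staging;
COMMIT;
```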

From Parquet / CSV / JSON files

To load files from S3, set up cloud storage credentials.

1. Query a file with mooncake.read_parquet, mooncake.read_csv, or mooncake.read_json.

```sql
SELECT n FROM mooncake.read_parquet('s3://bucket/file.parquet') AS (n int);
```

Note: you must define the schema of the file in the AS (...) clause. If your query does not explicitly involve a columnstore table, you may first need to run:

```sql
LOAD 'pg_mooncake';
```
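For files with more than one column, the AS clause lists each column with its type. A hedged sketch, assuming the file contains id and name columns:

```sql
-- Assumes file.csv contains columns id (int) and name (text).
SELECT id, name
FROM mooncake.read_csv('s3://bucket/file.csv') AS (id int, name text);
```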
 
2. Load a file into a columnstore table.

```sql
COPY users_columnstore FROM 's3://bucket/file.<extension>';
```

3. Load multiple files using a wildcard path:

```sql
COPY users_columnstore FROM 's3://bucket/folder/*.<extension>';
```

From a Hugging Face dataset

Loading from Hugging Face is similar to loading from S3: replace the s3:// prefix with hf://.

1. Find a Hugging Face dataset, such as lmsys_chat_1m_clean.

2. Navigate to its files and find the path to a data file:

https://huggingface.co/datasets/OpenLeecher/lmsys_chat_1m_clean/blob/main/data/train-00000-of-00004.parquet

3. Add the hf:// prefix and remove /blob/main:

```sql
COPY users_columnstore FROM 'hf://datasets/OpenLeecher/lmsys_chat_1m_clean/data/train-00000-of-00004.parquet';
```

4. Join the Hugging Face dataset with an existing PostgreSQL table.

```sql
SELECT u.*, p.*
FROM users_rowstore u
JOIN mooncake.read_parquet('hf://datasets/OpenLeecher/lmsys_chat_1m_clean/data/train-00000-of-00004.parquet') AS p(id int, ...)
ON u.id = p.id;
```

From existing Iceberg/Delta tables

1. Read existing tables with iceberg_scan or delta_scan:

```sql
SELECT n FROM iceberg_scan('s3://bucket/table') AS (n int);
```

2. Load an Iceberg table into a columnstore table with COPY, or insert rows selected from a scan:

```sql
COPY users_columnstore FROM 's3://bucket/table';

INSERT INTO users_columnstore ...;
```
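The INSERT path can be sketched more concretely; a hedged example assuming the scanned table exposes a single int column n, matching the scan in step 1:

```sql
-- Sketch: load rows via INSERT ... SELECT from an Iceberg scan.
INSERT INTO users_columnstore
SELECT n FROM iceberg_scan('s3://bucket/table') AS (n int);
```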