Migrate Data Lake data from Apache Iceberg to Memgraph
Overview
Migrating data from an Apache Iceberg table stored in a data lake to Memgraph can be efficiently done using Dremio as a query engine and Memgraph’s migrate module. This setup eliminates the need for manual data exports and enables real-time data streaming into Memgraph.
Why Apache Iceberg?
Apache Iceberg is a high-performance table format designed for data lakes. It provides:
- Schema evolution without rewriting files.
- Time travel for querying historical data versions (see the example after this list).
- Partition evolution to optimize queries.
- ACID transactions for reliability.
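Time travel, for example, can be used directly from SQL once a table exists. This is a hedged sketch using Dremio's AT SNAPSHOT / AT TIMESTAMP syntax against the persons table created later in this guide; the snapshot ID is a placeholder:

-- Query the table as of a specific snapshot (placeholder ID)
SELECT * FROM minio.iceberg.persons AT SNAPSHOT '4132119532727284872';
-- Query the table as it looked at a given point in time
SELECT * FROM minio.iceberg.persons AT TIMESTAMP '2024-01-01 00:00:00';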
Why use Dremio?
Dremio acts as a query engine for your Iceberg data stored in MinIO or another object storage solution. It provides:
- Federated querying across multiple data sources.
- SQL-based access to Iceberg tables.
- Optimized query performance through Apache Arrow.
Benefits of using Arrow Flight for migration
Arrow Flight is a high-performance data transport built on gRPC. It enables:
- Efficient data streaming from Dremio to Memgraph.
- Compression over the network, reducing bandwidth usage.
- Parallel execution of queries for faster ingestion.
- Schema-aware migration, preserving types and structures.
With Memgraph’s migrate module, we can directly stream Apache Iceberg data from Dremio into Memgraph without using CSV exports.
Setting up MinIO and Dremio
We will use Docker Compose to set up MinIO (for object storage) and Dremio (for querying the data lake).
Docker Compose configuration
version: '3.8'
services:
  minio:
    image: minio/minio
    container_name: minio
    environment:
      MINIO_ROOT_USER: admin
      MINIO_ROOT_PASSWORD: password
    ports:
      - "9000:9000"
      - "9001:9001"
    command: server /data --console-address ":9001"
    volumes:
      - minio_data:/data
  dremio:
    image: dremio/dremio-oss
    container_name: dremio
    ports:
      - "9047:9047"   # Dremio Web UI
      - "31010:31010" # JDBC port
      - "32010:32010" # Arrow Flight port
      - "45678:45678" # Internal port
    environment:
      - DREMIO_JAVA_SERVER_EXTRA_OPTS=-Dservices.coordinator.enabled=true -Dservices.executor.enabled=true
      - DREMIO_USERNAME=admin
      - DREMIO_PASSWORD=admin
    volumes:
      - dremio_data:/opt/dremio/data
      - dremio_conf:/opt/dremio/conf
volumes:
  minio_data:
  dremio_data:
  dremio_conf:
- Start MinIO and Dremio:
docker-compose up -d
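Before moving on, you can confirm that both containers are up and watch Dremio finish starting:

docker-compose ps
docker-compose logs -f dremio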
- Access the MinIO console at http://localhost:9001 and create a bucket named iceberg (or create it with the mc client, as shown below). Once created, the bucket shows up in the MinIO console.
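If you prefer the command line over the console UI, the bucket can also be created with MinIO's mc client. A minimal sketch, where the alias name local is arbitrary:

mc alias set local http://localhost:9000 admin password
mc mb local/iceberg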
- Connect MinIO to Dremio through the Dremio UI at http://localhost:9047. Log in to Dremio and create a MinIO data source. Since MinIO is an S3-compatible object storage, the data source to select is Amazon S3. In order to connect to MinIO, we need to:
  - tick the Enable compatibility mode option
  - add the property fs.s3a.path.style.access and set it to true
  - add the property fs.s3a.endpoint and set it to minio:9000, since minio is the service name in our Docker Compose network

The same source can also be created through Dremio's REST API, as sketched below.
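This is a minimal sketch, assuming Dremio's v3 catalog API and its S3 source schema; field names such as compatibilityMode and propertyList may vary between Dremio versions, so verify them against the Dremio documentation. Use the credentials of the first user you created in the Dremio UI (the migration query later in this guide uses admin / admin1!!):

# Log in and extract the auth token (requires jq)
TOKEN=$(curl -s -X POST http://localhost:9047/apiv2/login \
  -H "Content-Type: application/json" \
  -d '{"userName": "admin", "password": "admin1!!"}' | jq -r .token)

# Create the MinIO source as an S3-compatible source
curl -s -X POST http://localhost:9047/api/v3/catalog \
  -H "Authorization: _dremio$TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "entityType": "source",
        "name": "minio",
        "type": "S3",
        "config": {
          "accessKey": "admin",
          "accessSecret": "password",
          "compatibilityMode": true,
          "propertyList": [
            {"name": "fs.s3a.path.style.access", "value": "true"},
            {"name": "fs.s3a.endpoint", "value": "minio:9000"}
          ]
        }
      }'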
For this example, we will create a dummy table using Dremio and populate it with two persons, as shown in the SQL below.
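This sketch assumes the minio source and the iceberg bucket created in the previous steps; the two rows are example data matching the properties used in the migration query later on:

CREATE TABLE minio.iceberg.persons (name VARCHAR, age INT);
INSERT INTO minio.iceberg.persons VALUES ('Alice', 25), ('Bob', 30);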
Migrating Iceberg data to Memgraph
Once MinIO and Dremio are set up and an Iceberg table (persons) is available, we can use Memgraph's migrate module to transfer the data. We will use the Arrow Flight migration procedure, since it enables fast transfer of the Arrow-compatible format over the gRPC protocol.
Executing the migration query in Memgraph Lab
Run the following Cypher query in Memgraph Lab:
CALL migrate.arrow_flight(
    "SELECT * FROM minio.iceberg.persons;",
    {
        host: "localhost",
        port: 32010,
        username: "admin",
        password: "admin1!!"
    }
) YIELD row
WITH row
CREATE (:Person {age: row.age, name: row.name});
Explanation:
- SELECT * FROM minio.iceberg.persons; queries the Iceberg table in Dremio.
- Arrow Flight is used to stream the results efficiently into Memgraph.
- Memgraph ingests each row and creates Person nodes with age and name properties.
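Note that re-running this query would create duplicate Person nodes. A common variation, sketched here under the assumption that name can serve as a unique key, uses MERGE so repeated migrations stay idempotent:

CALL migrate.arrow_flight(
    "SELECT * FROM minio.iceberg.persons;",
    {host: "localhost", port: 32010, username: "admin", password: "admin1!!"}
) YIELD row
MERGE (p:Person {name: row.name})
SET p.age = row.age;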
Make sure that Memgraph can reach the Arrow Flight gRPC port (32010) exposed by Dremio for data transfer; one way to ensure this is sketched below.
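If Memgraph runs as a container, localhost will refer to the container itself, so one option is to attach Memgraph to the same Docker network as Dremio. This sketch assumes the memgraph/memgraph-platform image (which bundles the MAGE migrate module and Memgraph Lab) and Docker Compose's default network naming; check the actual network name with docker network ls:

# Run Memgraph on the same network as Dremio (network name is an assumption)
docker run -d --name memgraph \
  --network iceberg_default \
  -p 7687:7687 -p 3000:3000 \
  memgraph/memgraph-platform

With this setup, use host: "dremio" instead of "localhost" in the migration query, since containers on a shared network address each other by service name.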
Visualizing the migration
Once the query is executed, you will see the Person nodes successfully ingested into Memgraph, and you can explore the resulting graph in Memgraph Lab.
Conclusion
By using Apache Iceberg, Dremio, and Memgraph’s migrate module, we achieve:
- Streaming data ingestion without CSV exports.
- Fast, compressed data transfers over Arrow Flight.
- Real-time migration from object storage to Memgraph.
- Seamless integration with modern data lakes.
What’s next?
Memgraph’s flexible migration capabilities allow you to connect to any data source and ingest data in real time. 🚀 For the list of supported data sources, please check our migrate module.
Don’t see an available data source? Please contact us on Discord and we will add support for it in a matter of days!