
Migrate data from an Apache Iceberg data lake to Memgraph

Overview

Migrating data from an Apache Iceberg table stored in a data lake to Memgraph can be done efficiently by using Dremio as a query engine together with Memgraph’s migrate module. This setup eliminates the need for manual data exports and enables data to be streamed directly into Memgraph.

Why Apache Iceberg?

Apache Iceberg is a high-performance table format designed for data lakes. It provides:

  • Schema evolution without rewriting files.
  • Time travel for querying historical data versions.
  • Partition evolution to optimize queries.
  • ACID transactions for reliability.

Why use Dremio?

Dremio acts as a query engine for your Iceberg data stored in MinIO or another object storage solution. It provides:

  • Federated querying across multiple data sources.
  • SQL-based access to Iceberg tables.
  • Optimized query performance through Apache Arrow.

Benefits of using Arrow Flight for migration

Arrow Flight is a high-performance data transport built on gRPC. It enables:

  • Efficient data streaming from Dremio to Memgraph.
  • Compression over the network, reducing bandwidth usage.
  • Parallel execution of queries for faster ingestion.
  • Schema-aware migration, preserving types and structures.

With Memgraph’s migrate module, we can stream Apache Iceberg data directly from Dremio into Memgraph, without intermediate CSV exports.


Setting up MinIO and Dremio

We will use Docker Compose to set up MinIO (for object storage) and Dremio (for querying the data lake).

Docker Compose configuration

version: '3.8'
 
services:
  minio:
    image: minio/minio
    container_name: minio
    environment:
      MINIO_ROOT_USER: admin
      MINIO_ROOT_PASSWORD: password
    ports:
      - "9000:9000"
      - "9001:9001"
    command: server /data --console-address ":9001"
    volumes:
      - minio_data:/data
 
  dremio:
    image: dremio/dremio-oss
    container_name: dremio
    ports:
      - "9047:9047"  # Dremio Web UI
      - "31010:31010" # JDBC port
      - "32010:32010" # ODBC port
      - "45678:45678" # Internal port
    environment:
      - DREMIO_JAVA_SERVER_EXTRA_OPTS=-Dservices.coordinator.enabled=true -Dservices.executor.enabled=true
      - DREMIO_USERNAME=admin
      - DREMIO_PASSWORD=admin
    volumes:
      - dremio_data:/opt/dremio/data
      - dremio_conf:/opt/dremio/conf
 
volumes:
  minio_data:
  dremio_data:
  dremio_conf:

  1. Start MinIO and Dremio:
    docker-compose up -d
  2. Access MinIO at http://localhost:9001 and create a bucket named iceberg.

We will first create a bucket in the MinIO object storage.

We can see that the bucket has been successfully added to MinIO.
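
If you prefer the command line, the bucket can also be created with the MinIO client (mc). This is a sketch: the alias name local is just an illustration, and the credentials come from the Compose file above.

mc alias set local http://localhost:9000 admin password
mc mb local/iceberg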

  3. Connect MinIO to Dremio through the Dremio UI at http://localhost:9047.

Log in to Dremio (on the first start, you will be asked to create an admin user):

Create a MinIO data source. MinIO is S3-compatible object storage, so select Amazon S3 as the source type.

In order to connect to MinIO, we need to:

  • enter the MinIO credentials from the Compose file (admin / password) as the AWS access key and secret key
  • tick Enable compatibility mode
  • add the property fs.s3a.path.style.access and set it to true
  • add the property fs.s3a.endpoint and set it to minio:9000, since minio is the service name in our Docker Compose network

For this example, we will create a dummy table using Dremio and populate it with two persons.
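
A hedged SQL sketch of this step is below: the table path and the name/age columns match the migration query used later, while the two sample rows are assumptions for illustration. On recent Dremio versions, CREATE TABLE on a filesystem source writes an Apache Iceberg table by default.

CREATE TABLE minio.iceberg.persons (name VARCHAR, age INT);

-- Two sample rows (assumed values, for illustration only).
INSERT INTO minio.iceberg.persons (name, age) VALUES
  ('Alice', 30),
  ('Bob', 25);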


Migrating Iceberg data to Memgraph

Once MinIO and Dremio are set up and an Iceberg table (persons) is available, we can use Memgraph’s migrate module to transfer the data. We will use the Arrow Flight migration procedure, since it streams data in the Arrow columnar format over the gRPC protocol for fast transfer.

Executing the migration query in Memgraph Lab

Run the following Cypher query in Memgraph Lab:

CALL migrate.arrow_flight(
  "SELECT * FROM minio.iceberg.persons;",
  {
    host: "localhost", 
    port: 32010, 
    username: "admin", 
    password: "admin1!!"
  }
) YIELD row
WITH row
CREATE (:Person {age: row.age, name: row.name});

Explanation:

  • SELECT * FROM minio.iceberg.persons; → queries the Iceberg table in Dremio.
  • host and port point to Dremio’s Arrow Flight endpoint (32010), and username/password are the credentials created on the first login to Dremio.
  • Arrow Flight is used for streaming the results efficiently into Memgraph.
  • Memgraph ingests each row and creates Person nodes with age and name properties.
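
Note that re-running the query above with CREATE will duplicate nodes. If you need the migration to be idempotent, a MERGE-based sketch with the same connection settings looks like this:

CALL migrate.arrow_flight(
  "SELECT * FROM minio.iceberg.persons;",
  {host: "localhost", port: 32010, username: "admin", password: "admin1!!"}
) YIELD row
MERGE (p:Person {name: row.name})
SET p.age = row.age;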

Make sure that Memgraph can reach the gRPC (Arrow Flight) port exposed by Dremio for the data transfer; in this setup, that is port 32010.
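
If Memgraph itself runs in Docker, one option is to attach it to the Compose network so Dremio is reachable by its service name. The network name below is an assumption (Docker Compose typically names it <project>_default; check with docker network ls), and memgraph/memgraph-mage is the image that ships the migrate module:

docker run -it --rm -p 7687:7687 --network <project>_default memgraph/memgraph-mage

In that case, set host to "dremio" instead of "localhost" in the migration query.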


Visualizing the migration

Once executed, you will see nodes successfully ingested into Memgraph:
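
You can also verify the result with a quick Cypher query in Memgraph Lab:

MATCH (p:Person)
RETURN p.name AS name, p.age AS age;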


Conclusion

By using Apache Iceberg, Dremio, and Memgraph’s migrate module, we achieve:

  • Streaming data ingestion without CSV exports.
  • Fast, compressed data transfers over Arrow Flight.
  • Real-time migration from object storage to Memgraph.
  • Seamless integration with modern data lakes.

What’s next?

Memgraph’s flexible migration capabilities allow you to connect to any data source and ingest data in real time. 🚀 For the list of supported data sources, please check our migrate module.

Don’t see the data source you need? Please contact us on Discord and we will enable it in a matter of days!