Spark Catalogs Overview
A Spark Catalog is a component in Apache Spark that manages metadata for tables and databases within a Spark session. It allows for the creation, deletion, and querying of tables, as well as access to their schemas and properties.
IOMETE allows users to configure multiple Spark Catalogs to access data from both IOMETE systems and external sources. This configuration supports lakehouse queries and Spark jobs, streamlining data retrieval and management across diverse environments.
Catalog Overview
To view all configured Spark Catalogs, navigate to the Settings menu item and switch to the Spark Catalogs tab.
In the list below, you'll find the configured Spark catalogs. The default catalog created by the system is called spark_catalog.
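If you prefer to verify this from a Spark session rather than the console, a minimal sketch is shown below, assuming an active PySpark session named spark and a Spark version that supports the SHOW CATALOGS statement:
# List the catalogs visible to the current Spark session.
# The default catalog created by the system should appear as spark_catalog.
spark.sql("SHOW CATALOGS").show()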
Configuring Managed Catalogs
IOMETE enables the creation of new catalogs that operate as Iceberg REST Catalogs, managed internally by the IOMETE system.
These catalogs are accessible to external tools via the path /catalogs/<catalog name>.
They seamlessly integrate with existing data security rules to enforce role-based access control (RBAC).
For authentication, a session token or a personal access token (PAT) is required.
Creating a new Managed Catalog
To add a new catalog, click the Create button in the top right corner. You'll see the following screen:
Catalog Types Overview
IOMETE supports both internally managed catalogs and external catalogs that can be integrated for a seamless data experience across different environments. Below is an overview of the various types of catalogs you can configure.
Internal Spark Catalogs
These catalogs are managed internally by the IOMETE system. An internal catalog is an Iceberg catalog exposed over a REST API connection; it is fully managed within the IOMETE platform and provides an easy way to manage metadata through the Iceberg REST API.
External Spark Catalogs
IOMETE seamlessly connects to external Iceberg catalogs, allowing you to integrate with your existing data infrastructure while continuing to manage metadata and data in external systems. This integration lets you leverage IOMETE’s powerful capabilities without the need to relocate your data.
Supported external Iceberg catalogs include REST, AWS Glue, and Nessie catalogs. To learn how to configure each type of catalog, follow the links below.
New Catalog in SQL Editor
After creating a catalog, open the SQL Editor to view the Spark catalog that has been created.
Check out the SQL Editor section in the left panel. If you don't see the created catalog, refresh the list using the refresh button.
In the image shown, some databases and tables have already been created inside the catalog.
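For reference, the snippet below sketches how such databases and tables could be created; the catalog, database, and table names are illustrative placeholders, and the same SQL statements can be run directly in the SQL Editor.
# Illustrative placeholders: replace my_catalog, sales, and orders with your own names.
spark.sql("CREATE DATABASE IF NOT EXISTS my_catalog.sales")
spark.sql("""
    CREATE TABLE IF NOT EXISTS my_catalog.sales.orders (
        id BIGINT,
        amount DOUBLE
    ) USING iceberg
""")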
Connecting to the IOMETE REST Catalog from External Applications
IOMETE exposes the Iceberg REST URL at the subpath /catalogs/<catalog name> on the same domain as the IOMETE console.
For instance, if you wish to query a catalog named my_catalog and your console is available at https://console.iomete.internal, you can use tools like cURL, Postman, or others to send requests to https://console.iomete.internal/catalogs/my_catalog/v1/config.
All REST endpoints are compliant with the Iceberg REST OpenAPI specification.
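As a quick illustration, the snippet below sketches such a request in Python, assuming the token is sent as a Bearer token; the console URL, catalog name, and token value are placeholders.
import requests

# Placeholders: replace with your console URL, catalog name, and a valid
# session or personal access token (PAT).
console_url = "https://console.iomete.internal"
catalog_name = "my_catalog"
token = "<personal access token>"

# /v1/config is defined by the Iceberg REST OpenAPI specification and returns
# the catalog's default and override configuration.
response = requests.get(
    f"{console_url}/catalogs/{catalog_name}/v1/config",
    headers={"Authorization": f"Bearer {token}"},
)
print(response.status_code, response.json())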
To connect to the REST Catalog via PySpark, use the following configuration. Be sure to fill in the required fields. If you are using HTTPS with certificates, you will also need to provide a valid truststore and its associated password.
pyspark \
--conf spark.sql.catalog.spark_catalog.uri="<http|https>://<url running IOMETE console>/catalogs/my_catalog" \
--conf spark.sql.catalog.spark_catalog.token="<(personal) access token>" \
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.spark_catalog.warehouse="<path-to-your-s3-compatible warehouse>" \
--conf spark.sql.defaultCatalog=spark_catalog \
--conf spark.sql.catalog.spark_catalog.catalog-impl=org.apache.iceberg.rest.RESTCatalog \
--conf spark.sql.catalog.spark_catalog.s3.endpoint="<s3 compatible storage endpoint>" \
--conf spark.sql.catalog.spark_catalog.s3.client.region="<region of your s3 compatible storage, default is us-east-1>" \
--conf spark.sql.catalog.spark_catalog.s3.access-key-id="<access key for your s3 compatible storage>" \
--conf spark.sql.catalog.spark_catalog.s3.secret-access-key="<secret key for your s3 compatible storage>" \
--conf spark.sql.catalog.spark_catalog.s3.path-style-access=true \
--conf "spark.driver.extraJavaOptions=-Djavax.net.ssl.trustStore=<path-to-truststore> -Djavax.net.ssl.trustStorePassword=<truststore password>" \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--packages org.apache.iceberg:iceberg-spark-extensions-3.4_2.12:1.6.1,org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.6.1,org.apache.iceberg:iceberg-aws-bundle:1.6.1,software.amazon.awssdk:s3:2.27.21
Please note that you will need a valid session or PAT token to pass authentication. Additionally, the user associated with the token must have the necessary Access Policies on the catalog, namespace, and table you wish to operate on.
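Once the pyspark shell above has started and the token has been accepted, a short sanity check might look like the following; the namespace and table names are placeholders.
# List the namespaces in the connected catalog and read a few rows from a table.
spark.sql("SHOW NAMESPACES IN spark_catalog").show()
spark.sql("SELECT * FROM spark_catalog.sales.orders LIMIT 10").show()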