
File Streaming


Continuously transfer files into Iceberg tables.

File formats

The following file formats have been tested:

  • CSV

Job creation

  • In the left sidebar menu, choose Spark Jobs
  • Click Create

Specify the following parameters (these are examples, you can change them based on your preference):

  • Name: file-streaming-job
  • Docker image: iomete/iomete_file_streaming_job:0.2.0
  • Main application file: local:///app/driver.py
  • Environment variables: LOG_LEVEL: INFO or ERROR
Environment variables

You can use Environment variables to store sensitive values such as passwords and secrets, then reference them in your config file using the ${DB_PASSWORD} syntax.
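For example, a placeholder like ${DB_PASSWORD} is replaced with the variable's value before the config is used. Below is a minimal Python sketch of that substitution mechanism, not IOMETE's actual implementation; DB_PASSWORD is just an illustrative name:

import os
import re

def resolve_env(text: str) -> str:
    # Replace every ${NAME} placeholder with the NAME environment variable.
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ[m.group(1)], text)

os.environ["DB_PASSWORD"] = "s3cr3t"            # set via the job's Environment variables
print(resolve_env("password: ${DB_PASSWORD}"))  # -> password: s3cr3t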


Config file

  • Config file: Scroll down, expand the Application configurations section, click Add config file, and paste the following configuration.

{
    file: {
        format: csv,
        path: "files/",
        max_files_per_trigger: 1,
        latest_first: false,
        max_file_age: "7d"
    },
    database: {
        schema: default,
        table: awesome_csv_addresses
    },
    processing_time: {
        interval: 5,
        unit: seconds # or minutes
    }
}

Configuration properties

file

Required properties for connecting to and reading the file source.

  • format The format of the source files.
  • path The source directory to read files from.
  • max_files_per_trigger The maximum number of files to process per trigger.
  • latest_first Whether to process the latest new files first; useful when there is a large backlog of files.
  • max_file_age The maximum age of files to be processed.

database

Destination database properties.

  • schema The schema (database) to store the data into.
  • table The destination table.

processing_time

How often incoming data is persisted to Iceberg (see the sketch after this list).

  • interval Processing trigger interval.
  • unit Processing trigger unit: seconds or minutes.
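To make these properties concrete, here is a minimal PySpark sketch of the kind of streaming pipeline they describe. This is an illustration only, not the contents of driver.py; the column schema, checkpoint location, and Iceberg catalog setup are assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("file-streaming-job").getOrCreate()

# Streaming CSV sources need an explicit schema; these columns are assumed.
schema = StructType([
    StructField("name", StringType()),
    StructField("address", StringType()),
])

stream = (
    spark.readStream
    .format("csv")                    # file.format
    .schema(schema)
    .option("maxFilesPerTrigger", 1)  # file.max_files_per_trigger
    .option("latestFirst", "false")   # file.latest_first
    .option("maxFileAge", "7d")       # file.max_file_age
    .load("files/")                   # file.path
)

query = (
    stream.writeStream
    .trigger(processingTime="5 seconds")                                # processing_time
    .option("checkpointLocation", "checkpoints/awesome_csv_addresses")  # assumed path
    .toTable("default.awesome_csv_addresses")                           # database.schema + table
)
query.awaitTermination()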

Create Spark Job - Deployment

Deployment preferences.

Create Spark Job - Instance

Instance and environment variable parameters.

Create Spark Job - Application Config

Job config.

Tests

Prepare the dev environment

virtualenv .env # or: python3 -m venv .env
source .env/bin/activate

pip install -e ".[dev]"

Run tests

python3 -m pytest # or just pytest
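pytest collects any function whose name starts with test_ from files named test_*.py. As a point of orientation, a test module might look like this (a hypothetical example; the repository's actual tests differ):

# tests/test_trigger.py (hypothetical)
def test_trigger_string_combines_interval_and_unit():
    interval, unit = 5, "seconds"
    assert f"{interval} {unit}" == "5 seconds"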