File Streaming

Continuously transfer files into Iceberg tables.

File formats

Tested file formats.

Job creation

In the left sidebar menu, choose Spark Jobs.
Click Create.

Specify the following parameters (these are examples; change them to suit your setup):

Name: file-streaming-job
Docker image: iomete/iomete_file_streaming_job:0.2.0
Main application file: local:///app/driver.py
Environment variables: LOG_LEVEL: INFO or ERROR

Environment variables

You can use environment variables to store sensitive values such as passwords and secrets. You can then reference these variables in your config file using the ${DB_PASSWORD} syntax.
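The placeholder syntax can be illustrated with a minimal Python sketch. It uses the standard library's string.Template and is only an assumption about the behaviour; the job itself may resolve placeholders through its config parser instead. The resolve_env helper and the DB_PASSWORD value are hypothetical.

```python
import os
from string import Template


def resolve_env(text, env=None):
    """Replace ${NAME} placeholders in text with values from env."""
    env = os.environ if env is None else env
    # substitute() raises KeyError when a referenced variable is missing,
    # so a misconfigured job fails early instead of passing "${...}" through.
    return Template(text).substitute(env)


if __name__ == "__main__":
    os.environ["DB_PASSWORD"] = "s3cret"  # hypothetical value for the demo
    print(resolve_env('password: "${DB_PASSWORD}"'))
```

Failing on a missing variable is a deliberate choice in this sketch: a literal "${DB_PASSWORD}" reaching the destination is harder to diagnose than an error at startup.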

Config file

Scroll down, expand the Application configurations section, click Add config file, and paste the following configuration. The example uses a relaxed, JSON-like syntax with unquoted keys and # comments.

{
  file: {
    format: csv,
    path: "files/",
    max_files_per_trigger: 1,
    latest_first: false,
    max_file_age: "7d"
  }
  database: {
    schema: default,
    table: awesome_csv_addresses
  }
  processing_time: {
    interval: 5
    unit: seconds # or minutes
  }
}
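Before submitting a job, it can help to check a config for missing sections or an unsupported unit. The sketch below does this on a plain Python dict that mirrors the example above. Which keys are treated as required is an assumption made for illustration, and validate_config is a hypothetical helper, not part of the job.

```python
ALLOWED_UNITS = {"seconds", "minutes"}  # the two units named in this document

# Assumed for illustration: the keys this sketch treats as mandatory.
REQUIRED_KEYS = {
    "file": ["format", "path"],
    "database": ["schema", "table"],
    "processing_time": ["interval", "unit"],
}


def validate_config(cfg):
    """Return a list of problems found in a file-streaming config dict."""
    problems = []
    for section, keys in REQUIRED_KEYS.items():
        if section not in cfg:
            problems.append(f"missing section: {section}")
            continue
        for key in keys:
            if key not in cfg[section]:
                problems.append(f"missing key: {section}.{key}")
    unit = cfg.get("processing_time", {}).get("unit")
    if unit is not None and unit not in ALLOWED_UNITS:
        problems.append(f"unsupported unit: {unit}")
    return problems


# The example config from above, written as a Python dict.
example = {
    "file": {
        "format": "csv",
        "path": "files/",
        "max_files_per_trigger": 1,
        "latest_first": False,
        "max_file_age": "7d",
    },
    "database": {"schema": "default", "table": "awesome_csv_addresses"},
    "processing_time": {"interval": 5, "unit": "seconds"},
}

if __name__ == "__main__":
    print(validate_config(example))
```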

Configuration properties

file

Required properties for connecting to and configuring the file source.

`format`: The file format.
`path`: The path of the source directory to read files from.
`max_files_per_trigger`: Maximum number of files to process per trigger.
`latest_first`: Whether to process the newest files first; useful when there is a large backlog of files.
`max_file_age`: Maximum age of files to be processed.

database

Destination database properties.

`schema`: The schema (database) to write to.
`table`: The destination table.

processing_time

How often incoming data is persisted to Iceberg.

`interval`: Processing trigger interval.
`unit`: Processing trigger unit: seconds or minutes.
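Several of these properties resemble options of the Spark Structured Streaming file source (maxFilesPerTrigger, latestFirst, maxFileAge) and its processing-time trigger. The sketch below shows one way the config values could be translated. The mapping is an assumption about how the job uses them, not a description of its implementation, and both helper functions are hypothetical.

```python
# Assumed mapping from config keys to Spark file source option names.
OPTION_NAMES = {
    "max_files_per_trigger": "maxFilesPerTrigger",
    "latest_first": "latestFirst",
    "max_file_age": "maxFileAge",
}


def file_source_options(file_cfg):
    """Translate the `file` section into Spark reader options (as strings)."""
    options = {}
    for key, spark_key in OPTION_NAMES.items():
        if key in file_cfg:
            value = file_cfg[key]
            # Render booleans as "true"/"false"; everything else via str().
            options[spark_key] = str(value).lower() if isinstance(value, bool) else str(value)
    return options


def trigger_interval(processing_time):
    """Build a trigger string such as "5 seconds" from interval and unit."""
    return f"{processing_time['interval']} {processing_time['unit']}"


if __name__ == "__main__":
    file_cfg = {"format": "csv", "path": "files/", "max_files_per_trigger": 1,
                "latest_first": False, "max_file_age": "7d"}
    print(file_source_options(file_cfg))
    print(trigger_interval({"interval": 5, "unit": "seconds"}))
    # With PySpark available, the values could be applied roughly like this:
    #   reader = spark.readStream.format(file_cfg["format"]).options(**opts)
    #   writer = df.writeStream.trigger(processingTime=trigger_interval(pt))
```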

[Figure: Create Spark Job - Deployment. Deployment preferences.]

Note: You can use environment variables to store sensitive data such as passwords and secrets. You can then reference these variables in your config file using the ${ENV_NAME} syntax.

[Figure: Create Spark Job - Instance. Instance and environment variable parameters.]

[Figure: Create Spark Job - Application Config. Job config.]

Tests

Prepare the dev environment

virtualenv .env  # or: python3 -m venv .env
source .env/bin/activate

pip install -e ".[dev]"

Run test

python3 -m pytest # or just pytest
