Cleanup Untracked Table Folders Job


The Cleanup Untracked Table Folders Job helps find table folders that still exist in object storage even though the corresponding tables are no longer active in the catalog.

Use this job when tables were dropped without purging their physical files. It can run in dry-run mode to report candidate folders, and it deletes folders only when deletion is explicitly enabled.

Warning: This job can delete data from object storage. Always run it in dry-run mode first, review the candidates and the audit table, and enable deletion only after confirming the output.

Installation

You can deploy this job from the Marketplace or create the job template manually. The Marketplace flow is recommended because it pre-populates the image, main class, and default configuration.

Marketplace

  • Open Job Templates from the left sidebar.
  • Click Marketplace to open the list of preconfigured Marketplace jobs.
  • Find cleanup-untracked-table-folders, open the actions menu, and click Deploy.
  • The Marketplace template opens a pre-filled Create New Job form. Review the defaults and update the configuration for your environment.

Manual Setup

Create a job template manually with New Job Template if you do not use the Marketplace flow.

Specify the following parameters when creating the job manually:

| Field | Value |
|---|---|
| Name | cleanup-untracked-table-folders |
| Docker image | iomete.azurecr.io/iomete/cleanup-untracked-table-folders:0.1.0 |
| Main class | com.iomete.cleanup.untrackedtablefolders.App |
| Main application file | spark-internal |
| Config file path | /etc/configs/application.json |
| Config file name | application.json |

Running the Job

The Marketplace template includes a default configuration under the Config Maps tab. Review the databases and safety settings before creating the job.

Dry-Run Example

Use dry-run mode first to inspect candidates without deleting data.

{
  "catalog": "spark_catalog",
  "databases": ["analytics"],
  "exclude_paths": [],
  "exclude_database_folders": [],
  "older_than_hours": 24,
  "dry_run": true,
  "delete_enabled": false,
  "max_candidate_folders_per_database": 10,
  "collect_size_statistics": true
}

Delete-Enabled Example

Enable deletion only after reviewing the dry-run output.

{
  "catalog": "spark_catalog",
  "databases": ["analytics"],
  "exclude_paths": [
    "s3a://bucket/data/analytics/protected_table"
  ],
  "exclude_database_folders": [],
  "older_than_hours": 24,
  "dry_run": false,
  "delete_enabled": true,
  "max_candidate_folders_per_database": 5,
  "collect_size_statistics": true
}

Configuration Fields

| Field | Default | Description |
|---|---|---|
| catalog | spark_catalog | Catalog to inspect. |
| databases | required | List of databases to scan. The job does not scan all databases automatically. |
| exclude_paths | [] | Full object-storage paths that must never be selected as candidates. |
| exclude_database_folders | [] | Database-relative folder exclusions in the format database.folder. |
| older_than_hours | 24 | Candidate folders must be older than this cutoff. |
| dry_run | true | Reports candidates without deleting data. |
| delete_enabled | false | Must be true together with dry_run=false before deletion can happen. |
| max_candidate_folders_per_database | 10 | Safety limit. If more candidates are detected for a database, the database is skipped. |
| collect_size_statistics | true | Estimates candidate and deleted-folder sizes for audit reporting. |

Recommended Workflow

  1. Run with dry_run=true and delete_enabled=false.
  2. Review the driver logs and audit table output.
  3. Add exclude_paths or exclude_database_folders for anything that must be protected.
  4. Keep max_candidate_folders_per_database low for the first delete-enabled run.
  5. Enable deletion only after the dry-run output is reviewed.
  6. Query the audit table after the run.
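
Before submitting a configuration, it can help to check locally what the documented defaults and the two-flag deletion rule imply. The following is a minimal sketch, not the job's own code: `load_config` and `deletion_possible` are hypothetical helper names, and the defaults are copied from the Configuration Fields table.

```python
import json

# Defaults copied from the Configuration Fields table.
DEFAULTS = {
    "catalog": "spark_catalog",
    "exclude_paths": [],
    "exclude_database_folders": [],
    "older_than_hours": 24,
    "dry_run": True,
    "delete_enabled": False,
    "max_candidate_folders_per_database": 10,
    "collect_size_statistics": True,
}


def load_config(text):
    """Parse application.json text and fill in the documented defaults.

    Hypothetical helper for local review; the job's real parser may differ.
    """
    config = {**DEFAULTS, **json.loads(text)}
    if not config.get("databases"):
        # "databases" is required: the job does not scan all databases.
        raise ValueError("'databases' must list at least one database")
    return config


def deletion_possible(config):
    """Deletion requires dry_run=false together with delete_enabled=true."""
    return config["dry_run"] is False and config["delete_enabled"] is True
```

With only `databases` set, `deletion_possible` returns `False`, which matches step 1 of the workflow. It returns `True` only when both flags are flipped explicitly.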

Safety Controls

The job is designed to fail closed. If it cannot prove that a path is safe to delete, it does not delete it.

| Safety check | Purpose |
|---|---|
| dry_run=true by default | Reports candidates without deleting data. |
| delete_enabled=false by default | Requires explicit confirmation before destructive cleanup. |
| Explicit database list | Limits cleanup scope to configured databases. |
| Scan-root validation | Prevents scanning outside the database storage boundary. |
| Active table protection | Protects folders that are active table locations or contain active table locations. |
| Pre-delete recheck | Re-discovers active table locations immediately before deletion. |
| exclude_paths | Protects explicitly configured storage paths. |
| exclude_database_folders | Protects database-relative folders without requiring full object-storage paths. |
| Sentinel folder exclusion | Skips Spark/Hadoop staging folders such as _temporary and .spark-staging*. |
| older_than_hours | Filters out folders newer than the configured cutoff. |
| max_candidate_folders_per_database | Skips cleanup when too many candidates are detected. |
| Empty-database guard | Skips databases with no active table locations by default. |
| Audit table | Records success, skipped, and failed outcomes for each database. |
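
To make the interaction of these checks concrete, here is a rough sketch of how candidate selection for one database could be expressed. It is an illustration under assumptions, not the job's implementation: `select_candidates` and `protects` are hypothetical names, folder ages are passed in as plain timestamps, and excluded paths are assumed to protect a folder the same way active table locations do.

```python
from datetime import datetime, timedelta, timezone
import fnmatch

# Sentinel names taken from the table above.
SENTINEL_PATTERNS = ["_temporary", ".spark-staging*"]


def norm(path):
    """Drop trailing slashes so paths compare consistently."""
    return path.rstrip("/")


def protects(location, folder):
    """True if folder equals the protected location or contains it."""
    location, folder = norm(location), norm(folder)
    return location == folder or location.startswith(folder + "/")


def select_candidates(folders, active_locations, excluded_paths,
                      older_than_hours, max_candidates, now):
    """Return (status, status_reason, candidates) for one database.

    folders: list of (path, last_modified) for immediate storage folders.
    """
    if not active_locations:
        # Empty-database guard.
        return "SKIPPED", "no_active_tables_in_database", []
    cutoff = now - timedelta(hours=older_than_hours)
    candidates = []
    for path, modified in folders:
        name = norm(path).rsplit("/", 1)[-1]
        if any(fnmatch.fnmatchcase(name, p) for p in SENTINEL_PATTERNS):
            continue  # Spark/Hadoop staging folder
        if any(protects(loc, path) for loc in active_locations):
            continue  # active table location, or contains one
        if any(protects(ex, path) for ex in excluded_paths):
            continue  # explicitly excluded (assumed same matching rule)
        if modified >= cutoff:
            continue  # newer than the older_than_hours cutoff
        candidates.append(norm(path))
    if len(candidates) > max_candidates:
        # Safety limit exceeded: nothing in this database is deleted.
        return "SKIPPED", "too_many_candidate_folders", candidates
    return "SUCCESS", None, candidates
```

Every check in the sketch either skips a folder or skips the whole database, so any uncertainty leaves data in place. That mirrors the fail-closed design described above.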

Excluding Paths

Use exclude_paths when you know the full object-storage path to protect:

{
  "exclude_paths": [
    "s3a://bucket/data/analytics/manual_archive"
  ]
}

Use exclude_database_folders when you want to protect a folder relative to a configured database:

{
  "exclude_database_folders": [
    "analytics.manual_archive"
  ]
}

Both forms are included in the effective excluded-path list written to the audit table.
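
As an illustration of how the two forms might combine into one effective list, the sketch below resolves each database.folder entry against the database's storage location. The resolution rule (folder appended to the database root) and the names `effective_excluded_paths` and `database_locations` are assumptions made for this example; the source does not describe the job's exact logic.

```python
def effective_excluded_paths(config, database_locations):
    """Merge exclude_paths and exclude_database_folders into full paths.

    database_locations: hypothetical dict of database name -> storage root,
    such as the location discovered from the catalog.
    """
    paths = [p.rstrip("/") for p in config.get("exclude_paths", [])]
    for entry in config.get("exclude_database_folders", []):
        database, sep, folder = entry.partition(".")
        if not sep or not database or not folder:
            raise ValueError(f"expected database.folder, got {entry!r}")
        if database not in database_locations:
            # Assumption: an unresolved exclusion is an error, not ignored.
            raise ValueError(f"unknown database in exclusion: {entry!r}")
        root = database_locations[database].rstrip("/")
        paths.append(f"{root}/{folder}")
    # De-duplicate so the same folder given in both forms appears once.
    return sorted(set(paths))
```

Given a database root of s3a://bucket/data/analytics, the entry analytics.manual_archive and the full path s3a://bucket/data/analytics/manual_archive resolve to the same excluded path in this sketch.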

Audit Table

The job writes audit rows to:

spark_catalog.iomete_system_db.cleanup_untracked_table_folder_runs

Use the audit table to review candidates, skipped databases, deleted folders, runtime metadata, and failure details.

Important Audit Columns

| Column | Meaning |
|---|---|
| run_id | ID shared by all database rows from the same job execution. |
| spark_app_id | Spark application ID for the job run. |
| runtime_compute_id | Platform runtime compute/run identifier, when available. |
| runtime_compute_namespace | Kubernetes namespace where the Spark driver ran, when available. |
| runtime_domain | IOMETE domain from the runtime environment, when available. |
| runtime_user | Spark runtime/run-as user from the driver environment, usually SPARK_USER. |
| external_job_id | Stable platform Job ID shown in the UI, when exposed by the platform runtime. |
| platform_started_by | Platform user who started this specific job run, when exposed by the platform runtime. |
| catalog_name | Catalog scanned by the job. |
| database_name | Database scanned by this audit row. |
| operation | Cleanup operation name. |
| dry_run | Whether the run was report-only. |
| delete_enabled | Whether destructive deletion was explicitly enabled. |
| older_than_hours | Age threshold used for candidate detection. |
| cutoff_time | Timestamp cutoff derived from older_than_hours. |
| max_candidate_folders_per_database | Candidate-count safety limit used for this run. |
| excluded_paths | Effective excluded paths used during candidate detection. |
| status | SUCCESS, SKIPPED, or FAILED. |
| status_reason | Machine-readable reason for skipped or failed outcomes. |
| error_message | Error details for failed or intentionally skipped outcomes. |
| discovered_database_location | Database location discovered from the catalog. |
| storage_scan_location | Object-storage root scanned by the job. |
| active_table_count | Number of active tables discovered for the database. |
| storage_folder_count | Number of immediate storage folders discovered. |
| candidate_folder_count | Number of folders selected as cleanup candidates. |
| deleted_folder_count | Number of folders deleted. |
| candidate_folders | Candidate folder paths. |
| deleted_folders | Deleted folder paths. |
| candidate_object_count | Estimated object count across candidate folders when size statistics are collected. |
| candidate_total_size_bytes | Estimated candidate size when size statistics are collected. |
| deleted_object_count | Object count for deleted folders when size statistics are collected. |
| deleted_total_size_bytes | Deleted size when size statistics are collected. |
| metrics | Additional contextual metadata for troubleshooting. |
| start_time | Database processing start time. |
| end_time | Database processing end time. |

Runtime Identity Fields

The audit table separates who the Spark job runs as, who started the run from the platform, and which platform job/run the audit row belongs to.

| Field | Example | Source | Meaning |
|---|---|---|---|
| runtime_user | hasan | SPARK_USER from the driver environment | Spark runtime/run-as user. This is the user the Spark job executes as. |
| platform_started_by | fde_admin | IOMETE_JOB_STARTED_BY, when exposed by the platform runtime | Platform user who started this specific run in the UI or platform backend. This can differ from runtime_user. |
| external_job_id | 3e9342e7-89d3-42f8-a603-e00b783cca17 | IOMETE_EXTERNAL_JOB_ID, when exposed by the platform runtime | Stable platform Job ID for the job definition/template. This stays the same across multiple runs of the same job. |
| runtime_compute_id | afceb977-cd1e-4ff8-89e3-2f7876436ecf | IOMETE_COMPUTE_ID, when available | Specific platform runtime/application identifier for one execution. This changes for each run. |
| runtime_compute_namespace | spark-resources-1 | IOMETE_COMPUTE_NAMESPACE, when available | Kubernetes namespace where the Spark driver ran. |
| runtime_domain | fde | IOMETE_DOMAIN, when available | IOMETE domain associated with the runtime. |
| spark_app_id | spark-ffbd89771a0a4a248b947676b943557e | Spark application context | Spark application ID used to correlate with Spark logs/history. |
| run_id | 2edb19ec-8cb0-4d27-8965-ffb4912bfa60 | Generated by this cleanup job | Cleanup job run ID. It groups all per-database audit rows from the same execution. |

Example audit query:

SELECT
  start_time,
  database_name,
  status,
  status_reason,
  dry_run,
  delete_enabled,
  candidate_folder_count,
  deleted_folder_count,
  runtime_user,
  platform_started_by,
  external_job_id
FROM spark_catalog.iomete_system_db.cleanup_untracked_table_folder_runs
ORDER BY start_time DESC
LIMIT 20;

Status Values

| Status | Description |
|---|---|
| SUCCESS | The database was processed normally. Candidates may or may not have been found. |
| SKIPPED | The job intentionally refused to process or delete for a safety reason. |
| FAILED | An unexpected failure occurred while processing the database. |

Common status_reason values include:

| Reason | Description |
|---|---|
| database_not_found | Configured database does not exist or cannot be discovered. |
| database_location_missing | Database exists but has no usable storage location. |
| no_active_tables_in_database | Database has no active table locations, so cleanup is skipped by default. |
| too_many_candidate_folders | Candidate count exceeded max_candidate_folders_per_database. |
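
When reviewing a run, it is often enough to count rows per status and list the databases that were not processed normally. The sketch below assumes the audit rows for one run_id have already been fetched into a list of dictionaries keyed by the audit column names. `summarize_run` is a hypothetical helper, not part of the job.

```python
from collections import Counter


def summarize_run(rows):
    """Summarize the per-database audit rows of a single run.

    rows: list of dicts keyed by audit column names (status,
    status_reason, database_name, candidate_folder_count, ...).
    """
    by_status = Counter(row["status"] for row in rows)
    # Anything other than SUCCESS deserves a look before enabling deletion.
    needs_review = [
        (row["database_name"], row["status"], row.get("status_reason"))
        for row in rows
        if row["status"] != "SUCCESS"
    ]
    return {
        "by_status": dict(by_status),
        "needs_review": needs_review,
        # Treat missing or NULL counts as zero.
        "candidate_folders": sum(
            row.get("candidate_folder_count") or 0 for row in rows
        ),
        "deleted_folders": sum(
            row.get("deleted_folder_count") or 0 for row in rows
        ),
    }
```

For a dry run, `deleted_folders` should be zero. Any entry in `needs_review` points to a database whose status_reason and error_message are worth reading in the audit table.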

Summary

Use this job to review and optionally clean table folders that remain in object storage after tables are no longer tracked by the catalog. Always start with dry-run mode, review the audit table, and enable deletion only after confirming that the candidates are safe to remove.