Volumes

IOMETE lets you customize the Kubernetes volume types attached to Spark workloads.


Volume list

To see the Volume list, go to the Settings menu and click on the Volumes tab. You can edit an existing volume or create a new one.

Volumes | IOMETE

Create volume

To create a new volume, click the create button.

After that, you'll see the following options for configuration.

There are three types:

  • Host Path: Mounts a directory directly from the host (node) machine's filesystem into a pod.
  • On Demand PVC: Mounts a dynamically created persistent volume claim per executor.
  • tmpfs (RAM-backed filesystem): Mounts a temporary directory in the container's memory. More details are provided below.

If the selected type is Host Path:

  • Host Path: Provide the host directory to mount.
    Host Path create | IOMETE
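Under the hood, a host-path volume maps to Spark's standard Kubernetes volume properties, which IOMETE configures for you. A rough sketch of the equivalent raw Spark configuration (the volume name and paths are illustrative):

```properties
# Mount /mnt/data from the node into each executor at /data
# ("hostvol", /mnt/data, and /data are example values)
spark.kubernetes.executor.volumes.hostPath.hostvol.mount.path=/data
spark.kubernetes.executor.volumes.hostPath.hostvol.options.path=/mnt/data
```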

If the selected type is On Demand PVC:

  • Storage class name: The name of the Kubernetes storage class. The available storage classes can be listed with the kubectl get storageclass command.

  • Max size: Maximum storage capacity requested in a PVC per executor.

    On Demand PVC create | IOMETE
  • Resource tags: Custom name/value pairs that you can assign to IOMETE resources. Tags enable you to categorize and organize your resources effectively.
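The On Demand option corresponds to Spark's OnDemand claim-name convention for per-executor PVCs, which IOMETE sets up for you. A rough sketch of the equivalent raw Spark configuration (the volume name, storage class, size, and mount path are illustrative):

```properties
# Dynamically create one PVC per executor
# ("data", gp2, 50Gi, and /data are example values)
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=OnDemand
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.storageClass=gp2
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit=50Gi
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/data
```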

Using volumes

You can use volumes in Lakehouses, Spark Connect Clusters, and Spark Jobs. Let's navigate to the Lakehouse create page. Here, you'll find a Volume option where you can select one of the configured volumes.

Lakehouse Volume select | IOMETE

Delete volume

To delete a volume, locate it in the volumes list, click the Edit button, and then click the delete button below the inputs. You'll see a confirmation message; click "Yes, delete" to confirm the deletion.

Volume delete | IOMETE

tmpfs

The tmpfs storage type can be used for certain high-performance, ephemeral data storage needs. tmpfs is a RAM-backed filesystem that provides very fast read and write operations, making it ideal for use cases where speed is critical.

Kubernetes emptyDir Volumes with tmpfs

In Kubernetes, emptyDir volumes are a type of ephemeral storage that is created when a Pod is assigned to a node. This volume is initially empty and can be accessed by all containers within the same Pod. The data stored in an emptyDir volume is deleted when the Pod is removed from the node, making it a suitable option for temporary storage needs.

Using tmpfs with emptyDir

By default, emptyDir volumes use the node's backing storage, such as disk, SSD, or network storage. However, to achieve higher performance, particularly in environments where disk-based storage might be a bottleneck (e.g., diskless nodes or remote network storage), you can configure the emptyDir to use tmpfs.

To enable tmpfs, the emptyDir.medium field is set to "Memory" in the Kubernetes Pod specification. This configuration mounts the emptyDir volume as a RAM-backed filesystem (tmpfs). While this provides significant performance benefits due to the speed of RAM, it is important to note that data stored in tmpfs will consume memory from the container's memory allocation. This means that the size of files stored in this volume counts against the container's memory limit.
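A minimal Pod sketch illustrating this configuration (the pod name, image, mount path, and sizes are example values):

```yaml
# emptyDir backed by RAM (tmpfs) via medium: Memory
apiVersion: v1
kind: Pod
metadata:
  name: tmpfs-demo
spec:
  containers:
    - name: app
      image: busybox
      command: ["sh", "-c", "sleep 3600"]
      volumeMounts:
        - name: scratch
          mountPath: /scratch
      resources:
        limits:
          memory: 1Gi   # files written to /scratch count against this limit
  volumes:
    - name: scratch
      emptyDir:
        medium: Memory   # RAM-backed instead of node disk
        sizeLimit: 512Mi
```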

Spark Integration with tmpfs

In our Spark workloads, tmpfs is used for local storage to avoid performance degradation caused by heavy I/O operations on remote storage. By default, Spark utilizes Kubernetes' emptyDir volumes for local storage, which uses the node's backing storage.

Enabling tmpfs in Spark

To configure Spark to use tmpfs for local storage, we set the spark.kubernetes.local.dirs.tmpfs=true property in the Spark configuration. This ensures that the emptyDir volumes are RAM-backed, significantly speeding up operations that rely heavily on local storage.
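In raw Spark configuration terms, this is a single property:

```properties
# Back Spark's local scratch directories (emptyDir volumes) with tmpfs
spark.kubernetes.local.dirs.tmpfs=true
```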

Memory Management Considerations

Since tmpfs uses the container's memory, enabling it for Spark's local storage will increase the memory usage of your Spark pods. To accommodate this, you may need to adjust the memory overhead allocated to Spark's driver and executor pods by configuring spark.{driver,executor}.memoryOverheadFactor. This ensures that your Spark workloads have enough memory to handle both their regular tasks and the additional overhead introduced by using tmpfs.
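A sketch of the corresponding configuration (the factor value is illustrative; Spark's default for JVM workloads is 0.1):

```properties
# Reserve extra non-heap headroom to absorb tmpfs usage
# (0.2 = 20% of the pod's memory request; tune to your workload)
spark.driver.memoryOverheadFactor=0.2
spark.executor.memoryOverheadFactor=0.2
```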

Use Cases

  • Scratch Space for Computations: tmpfs provides an ideal location for temporary storage during complex computations, such as disk-based merge sorts, where speed is crucial.

  • Checkpointing: It is beneficial for storing intermediate data that needs to be quickly accessed and processed before being discarded.

  • Content Staging: In scenarios where one container fetches data and another serves it, tmpfs can be used to stage files rapidly before they are served.
