Virtual Lakehouses

A virtual lakehouse is a cluster of compute resources that provides the CPU and memory needed to process queries. Table data files are stored in cloud object storage (S3), which acts as shared storage: multiple virtual lakehouse clusters can read the same data while their compute stays isolated. IOMETE uses Apache Spark as the data lakehouse query engine, with ACID support.


info

In production environments, workloads often need to be isolated, for example to keep batch ETL jobs from slowing down ad-hoc analytical queries. Because data is decoupled from the virtual lakehouse and shared, it is possible to create multiple lakehouse clusters to isolate workloads and to turn clusters on or off as needed to save costs. Cluster size can be chosen to match requirements and workloads.

Create a new Lakehouse

1. Go to the Lakehouses page and click the Create New button.

Create New Lakehouse

2. Give the new lakehouse a name under Name.

Lakehouse name

3. In the Type section, choose a type.

Lakehouse type
info

Type defines the maximum number of executors/workers that Spark can scale up to. Read more about Spark executors here.


4. In the Driver section, select a driver.

Lakehouse driver
info

The Spark driver runs continuously until the lakehouse is stopped manually. The driver is responsible for managing executors/workers and connections. If the driver is stopped, no connections can be established to the lakehouse.


5. In the Executor section, select an executor.

Lakehouse executor
info

Executors are responsible for executing the queries. They are scaled up and down automatically based on the auto-scale parameter. Keep auto-scale on to minimize lakehouse costs.


6. In the Auto scale section, set auto-scale.

Lakehouse Auto Scale
info

Executors are scaled down after the specified period of inactivity and scaled up automatically on demand (scale-up takes around 10-15 seconds). It is recommended to keep auto-scale on to minimize monthly costs.

Auto-scale can be disabled by clicking the checkbox on the left.

Lakehouse Auto Scale Check

7. Optionally add a description in the Description field, then click the Create button.

Lakehouse Description
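The sizing and auto-scale choices from steps 3 to 6 can be summarized in a small sketch. This is not IOMETE's implementation: `AutoScaleSettings`, `target_executors`, the field names, and the 600-second default are hypothetical, and the behaviour with auto-scale turned off (executors stay at the maximum for the type) is an assumption made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class AutoScaleSettings:
    """Hypothetical settings object; the field names are not IOMETE's API."""
    max_executors: int         # upper bound implied by the lakehouse Type
    auto_scale: bool = True    # the auto-scale checkbox
    idle_timeout_s: int = 600  # inactivity period before scale-down (assumed value)


def target_executors(settings: AutoScaleSettings,
                     executors_needed: int,
                     idle_seconds: float,
                     current_executors: int) -> int:
    """Return how many executors the cluster should aim for."""
    if not settings.auto_scale:
        # Assumption: with auto-scale off, executors are never scaled down.
        return settings.max_executors
    if executors_needed > 0:
        # Scale up on demand, but never beyond the maximum for the Type,
        # and do not shrink while there is still work to do.
        return min(max(executors_needed, current_executors),
                   settings.max_executors)
    if idle_seconds >= settings.idle_timeout_s:
        # No workload for the configured period: release all executors.
        return 0
    # Idle, but not for long enough yet: keep what is running.
    return current_executors
```

For example, with `AutoScaleSettings(max_executors=4)`, a request for 10 executors is capped at 4, and an idle cluster drops to 0 executors once the timeout has passed, leaving only the driver running.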

🎉 🎉🎉 Tadaa! The newly created test-lakehouse details view is shown.

Lakehouse Detail
  1. Navigation buttons

    • Spark UI - this button takes us to the Spark Jobs information.
    • Edit - this button takes us to the editing form.
    • Terminate / Start - buttons to stop and start the lakehouse.
  2. Lakehouse's general information.

  3. Lakehouse statuses

    info

    For more details about lakehouse statuses, click here.

  4. Connection details - this section shows the connection details for the lakehouse, for instance for Python, JDBC, and other connections.

  5. Audit logs - this section shows the lakehouse's start/stop logs.

  6. Delete - this button removes the lakehouse.
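As an illustration of how JDBC connection details are typically put together, here is a minimal sketch that assembles a HiveServer2-style JDBC URL in HTTP transport mode. It assumes the lakehouse exposes a HiveServer2-compatible endpoint, which this page does not state. The function `build_hive_jdbc_url`, the host, and the `httpPath` value are all hypothetical; the real values should be copied from the connection details section of the lakehouse.

```python
def build_hive_jdbc_url(host: str, port: int, database: str,
                        http_path: str, use_ssl: bool = True) -> str:
    """Assemble a HiveServer2 JDBC URL for HTTP transport mode.

    Hypothetical helper: the parameter values must come from the
    lakehouse's connection details, not from this sketch.
    """
    session_params = [
        "transportMode=http",
        f"httpPath={http_path}",
        f"ssl={'true' if use_ssl else 'false'}",
    ]
    # Hive JDBC format: jdbc:hive2://<host>:<port>/<db>;<key=value;...>
    return f"jdbc:hive2://{host}:{port}/{database};" + ";".join(session_params)


# Placeholder host and path, used for illustration only.
url = build_hive_jdbc_url("example.invalid", 443, "default",
                          "lakehouse/test-lakehouse")
print(url)
```

Connections can only be established while the driver is running, so a URL like this is usable only when the lakehouse is accepting connections.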

Lakehouse Statuses

info

To understand the lakehouse cluster's statuses, we first need to understand the cluster components. A lakehouse cluster consists of a driver and executors.

  • Driver: the gateway that accepts and keeps connections, plans executions, and orchestrates executors.
  • Executors: the components that do the actual processing.

Statuses

A lakehouse cluster can be in one of the following statuses:

| Status | Description | Driver | Executors | Accepting connections | Cost charging |
| --- | --- | --- | --- | --- | --- |
| Stopped | Cluster is completely turned off. | Not running | No executors running | No | No |
| Pending | Cluster has just been started manually and is waiting for resources for the driver. | Not running, waiting for resources | No executors running | No | No |
| Suspended | Occurs when auto-scale is enabled on the cluster. When the cluster has no workload, it scales down and turns off the executors to avoid charges. Only the driver is running. When the driver receives a query, it starts executors to handle the processing. | Running | No executors running | Yes | Only for driver |
| Scaling-up | Occurs when auto-scale is enabled on the cluster. The cluster scales up executors based on workload needs, up to the maximum cluster size. | Running | 0 or some already running, and new executors are being started | Yes | For driver and already running executors |
| Running | Cluster is in the running state. | Running | Running | Yes | For driver and running executors |
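
The statuses above can be read as a function of the driver and executor state. The following sketch encodes that reading. The names `Status`, `StatusInfo`, `STATUS_INFO`, and `derive_status` are hypothetical and not part of any IOMETE API, and the sketch does not model the case of a running driver with auto-scale disabled and no executors.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    STOPPED = "Stopped"
    PENDING = "Pending"
    SUSPENDED = "Suspended"
    SCALING_UP = "Scaling-up"
    RUNNING = "Running"


@dataclass(frozen=True)
class StatusInfo:
    driver_running: bool
    accepts_connections: bool
    billed: tuple  # components that incur cost in this status


# Properties taken from the status descriptions in this section.
STATUS_INFO = {
    Status.STOPPED: StatusInfo(False, False, ()),
    Status.PENDING: StatusInfo(False, False, ()),
    Status.SUSPENDED: StatusInfo(True, True, ("driver",)),
    Status.SCALING_UP: StatusInfo(True, True, ("driver", "running executors")),
    Status.RUNNING: StatusInfo(True, True, ("driver", "running executors")),
}


def derive_status(driver_running: bool, start_requested: bool,
                  running_executors: int, starting_executors: int) -> Status:
    """Map component state to a status (illustrative reading of the table)."""
    if not driver_running:
        # A start was requested but the driver is still waiting for resources.
        return Status.PENDING if start_requested else Status.STOPPED
    if starting_executors > 0:
        return Status.SCALING_UP
    if running_executors > 0:
        return Status.RUNNING
    # Driver alone: executors were released after inactivity.
    return Status.SUSPENDED
```

A client could use `STATUS_INFO[status].accepts_connections` to decide whether to connect right away or wait for the driver to come up.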