A virtual lakehouse is a cluster of compute resources that provides the CPU and memory needed to process queries. Table data files are stored in cloud data storage (S3), which acts as shared storage: multiple virtual lakehouse clusters can use the same data while keeping their compute isolated. IOMETE uses Apache Spark as its data lakehouse query engine, with ACID support.
In production environments, workloads often need to be isolated, for example to keep batch ETL jobs from slowing down ad-hoc analytical queries. Because data is decoupled from compute and shared across virtual lakehouses, you can create multiple lakehouse clusters to isolate workloads and turn clusters on or off as needed to save costs. Cluster size can be chosen to match requirements and workloads.
Create a new Lakehouse
1. Go to Lakehouses and click the button to create a new lakehouse.
2. Give the new lakehouse a name under Name.
3. Under the Type section, choose a type.
Type defines the maximum number of executors/workers that Spark can scale up to. Read more about Spark executors here.
4. Under the Driver section, select a driver.
The Spark driver runs continuously until the lakehouse is stopped manually. The driver is responsible for managing executors/workers and connections. If it is stopped, no connections can be established to the lakehouse.
5. Under the Executor section, select an executor.
Executors are responsible for executing queries. They are scaled up and down automatically based on the auto-scale parameter. Keep auto-scale on to minimize lakehouse costs.
6. Under the Auto scale section, set auto scale.
Executors are scaled down after the specified period of inactivity and scaled up again automatically on demand (scale-up takes around 10-15 seconds with a hot pool, or 1 to 2 minutes by default). It is recommended to keep auto-scale on to minimize monthly costs.
To disable auto scale, click the checkbox on the left side.
7. Optionally add a description in the description field, then click the Create button.
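The choices made in the steps above can be pictured as a small configuration object plus a scaling rule. The following is only a sketch for intuition: `LakehouseSpec`, `target_executors`, and all field names are hypothetical and not part of any IOMETE API, and the behavior with auto scale disabled (executors stay at the maximum) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class LakehouseSpec:
    """Hypothetical summary of the form fields; not an IOMETE API object."""
    name: str
    max_executors: int              # upper limit implied by the chosen Type
    auto_scale: bool = True         # the Auto scale checkbox
    idle_timeout_minutes: int = 5   # inactivity period before scale-down (made-up default)
    description: str = ""           # optional


def target_executors(spec: LakehouseSpec, current: int,
                     demanded: int, idle_minutes: float) -> int:
    """Return how many executors should run under this toy model."""
    if not spec.auto_scale:
        # Assumption: with auto scale off, executors are never scaled down.
        return spec.max_executors
    if demanded > 0:
        # Scale on demand, never beyond the limit set by Type.
        return min(demanded, spec.max_executors)
    if idle_minutes >= spec.idle_timeout_minutes:
        # Idle for long enough: scale down to zero executors.
        return 0
    # Idle, but the timeout has not elapsed yet: keep what is running.
    return current


spec = LakehouseSpec(name="test-lakehouse", max_executors=4, idle_timeout_minutes=10)
print(target_executors(spec, current=2, demanded=0, idle_minutes=3))
print(target_executors(spec, current=2, demanded=0, idle_minutes=10))
print(target_executors(spec, current=0, demanded=6, idle_minutes=0))
```

In this model a burst of demand is capped at the Type limit, and an idle cluster drops to zero executors only once the inactivity period has fully elapsed.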
🎉🎉🎉 Tadaa! The details view of the newly created test-lakehouse is shown.
- Spark UI - this button takes us to the Spark jobs information.
- Edit - this button takes us to the editing form.
- Terminate / Start - these buttons stop and start the lakehouse.
Lakehouse's general information.
For more details about lakehouse statuses, click here.
Connection details - this section shows the connection details for Python, JDBC, and other connection types.
Audit logs - this section shows the lakehouse's start/stop logs.
Delete - this button removes the lakehouse.
Lakehouse Cluster Statuses
To effectively manage and monitor your lakehouse cluster, you need to understand its two main components: the Driver and the Executors.
- The Driver acts as the control center, managing connections and orchestrating tasks.
- Executors carry out the actual data processing.
What Each Driver Status Means
- Stopped: The Driver is offline and not accepting any connections.
- Starting: The Driver is booting up.
- Active: The Driver is running and ready to accept connections.
- Failed: The Driver couldn't start. Contact support for assistance.
You're only charged for the Driver when it's in the 'Active' state.
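To make the billing rule concrete, here is a minimal sketch that models the four Driver statuses and sums only the time spent in the 'Active' state. The names `DriverStatus`, `accepts_connections`, and `driver_cost`, as well as the hourly rate, are illustrative assumptions and not IOMETE identifiers or prices.

```python
from enum import Enum


class DriverStatus(Enum):
    STOPPED = "Stopped"
    STARTING = "Starting"
    ACTIVE = "Active"
    FAILED = "Failed"


def accepts_connections(status: DriverStatus) -> bool:
    # Only an Active driver is ready to accept connections.
    return status is DriverStatus.ACTIVE


def driver_cost(timeline, hourly_rate: float) -> float:
    """Sum the cost of a timeline of (status, hours) pairs.

    Only hours spent in the Active state are charged.
    """
    active_hours = sum(hours for status, hours in timeline
                       if status is DriverStatus.ACTIVE)
    return active_hours * hourly_rate


# Hypothetical day: a short start-up, 8 active hours, then stopped.
timeline = [
    (DriverStatus.STARTING, 0.05),
    (DriverStatus.ACTIVE, 8.0),
    (DriverStatus.STOPPED, 16.0),
]
print(driver_cost(timeline, hourly_rate=0.5))  # rate is a made-up number
```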
What Each Executor Status Means
- No Running Executors: No executor is active. This happens when auto-scale is configured: if there is no workload for the configured auto-suspend time, the cluster scales down to zero. Executors scale up again automatically based on demand.
- Pending: Executors are scheduled to start and are waiting for resources.
- Running: Executors are active and processing data.
- Running 1/4: One out of four Executors is active. The cluster scales down to save costs when the workload is light.
- Running 1/4 Pending 3/4: One Executor is active, and three are waiting to start due to an increase in workload.
- Running 4/4: All Executors are active, and the cluster is at full capacity.
You're only billed for Executors when they're in the 'Running' state.
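The status labels above follow a regular pattern, so a monitoring script could turn them into counts. The sketch below assumes the labels appear exactly as written in this list; `ExecutorCounts` and `parse_executor_status` are hypothetical helpers, not part of IOMETE.

```python
import re
from typing import NamedTuple


class ExecutorCounts(NamedTuple):
    running: int   # executors in the billable 'Running' state
    pending: int   # executors waiting for resources
    total: int     # denominator shown in the label (0 if not shown)


# Matches fragments such as "Running 1/4" or "Pending 3/4".
_PART = re.compile(r"(Running|Pending)\s+(\d+)/(\d+)")


def parse_executor_status(text: str) -> ExecutorCounts:
    """Parse a label like 'Running 1/4 Pending 3/4' into counts."""
    if text.strip() == "No Running Executors":
        # The cluster has scaled down to zero; no total is shown.
        return ExecutorCounts(0, 0, 0)
    counts = {"Running": 0, "Pending": 0}
    total = 0
    for state, n, of in _PART.findall(text):
        counts[state] = int(n)
        total = int(of)
    if total == 0:
        # Labels without explicit counts cannot be turned into numbers.
        raise ValueError(f"no executor counts found in {text!r}")
    return ExecutorCounts(counts["Running"], counts["Pending"], total)


for label in ("No Running Executors", "Running 1/4",
              "Running 1/4 Pending 3/4", "Running 4/4"):
    print(label, "->", parse_executor_status(label))
```

Since only 'Running' executors are billed, the `running` field is the number that matters for cost at any given moment.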
By default, scaling up usually takes 1 to 2 minutes, depending on various factors like the cloud provider's response time and resource availability.
In cloud environments, you can utilize IOMETE to establish a hot pool of preconfigured resources. This markedly accelerates the scaling process, reducing the scale-up time to a mere 10 to 15 seconds. Contact support to learn more about this feature.