
Apache Spark 3.5.0 | What is new?

Aytan Jalilova · 6 min read

We are excited to announce the release of Apache Spark 3.5.0! This release is packed with new features and improvements that make Spark more accessible, versatile, and efficient than ever before.

New features and improvements

  • Spark Connect: General availability of the Scala client, support for distributed training and inference, and parity of the Pandas API on Spark.
  • PySpark and SQL functionality: SQL IDENTIFIER clause, named argument support for SQL function calls, SQL function support for HyperLogLog approximate aggregations, and Python user-defined table functions.
  • Distributed training with DeepSpeed: Simplified configuration and improved performance.
  • Structured Streaming: Watermark propagation among operators and the new dropDuplicatesWithinWatermark operation.

Accessibility

  • Spark Connect: Spark Connect decouples the client from the Spark cluster, making it easier to connect to Spark from other programming languages and frameworks. The general availability of the Scala client and support for distributed training and inference make Spark Connect even more powerful and versatile (see the connection sketch after this list).
  • Parity of the Pandas API on Spark: The Pandas API on Spark provides a familiar and easy-to-use interface for data analysis tasks. Bringing it to parity with the latest Pandas release makes it easier for users to migrate from Pandas to Spark and to use Spark for their data analysis workflows.
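As a minimal sketch of what connecting through Spark Connect looks like from Python (the endpoint sc://localhost:15002 is an assumed placeholder; point it at your own Spark Connect server):

```python
# A minimal Spark Connect sketch; the endpoint below is an assumed placeholder.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .remote("sc://localhost:15002")  # Spark Connect endpoint instead of a local master
    .getOrCreate()
)

# The query plan is built client-side and executed on the remote cluster.
spark.range(10).selectExpr("id", "id * 2 AS doubled").show()
```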

Versatility

  • New PySpark and SQL functionality: The new SQL IDENTIFIER clause, named argument support for SQL function calls, and SQL function support for HyperLogLog approximate aggregations make Spark more expressive and powerful. Python user-defined table functions provide a new way to extend the Spark SQL API, making it easier to integrate Spark with other Python libraries and frameworks (see the first sketch after this list).
  • Distributed training with DeepSpeed: DeepSpeed is a library that accelerates distributed training of deep learning models. Spark 3.5.0 simplifies the configuration of DeepSpeed training and improves performance for DeepSpeed-based training jobs (see the distributor sketch below).
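To make the new SQL and PySpark features concrete, here is a minimal sketch; the table name, the mask() arguments, and the SplitWords UDTF are illustrative assumptions, not part of the release notes:

```python
# A sketch of the new SQL/PySpark features; table and UDTF names are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, udtf

spark = SparkSession.builder.getOrCreate()
spark.range(3).createOrReplaceTempView("my_table")  # assumed example table

# SQL IDENTIFIER clause: bind a parameter as a table identifier instead of
# splicing strings into the query text.
spark.sql("SELECT * FROM IDENTIFIER(:tbl)", args={"tbl": "my_table"}).show()

# Named arguments in SQL function calls, shown with the built-in mask().
spark.sql("SELECT mask('AbCD123-@$#', lowerChar => 'q', digitChar => 'd')").show()

# A Python user-defined table function: each input row can yield many rows.
@udtf(returnType="word: string, length: int")
class SplitWords:
    def eval(self, text: str):
        for word in text.split():
            yield word, len(word)

SplitWords(lit("hello spark world")).show()
```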
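And a hedged sketch of the DeepspeedTorchDistributor added in this release; the train() function, resource counts, and config path are illustrative assumptions:

```python
# A hedged sketch of DeepspeedTorchDistributor usage; train() and the
# resource settings are assumptions for illustration.
from pyspark.ml.deepspeed.deepspeed_distributor import DeepspeedTorchDistributor

def train():
    # Your DeepSpeed/PyTorch training loop would go here.
    ...

distributor = DeepspeedTorchDistributor(
    numGpus=2,                                # GPUs per node (assumed)
    nnodes=2,                                 # number of nodes (assumed)
    localMode=False,                          # run on executors, not the driver
    useGpu=True,
    deepspeedConfig="deepspeed_config.json",  # assumed path to a DeepSpeed config
)
distributor.run(train)
```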

Efficiency

  • Watermark propagation among operators: Watermarks are used to identify late data in Structured Streaming. Spark 3.5.0 propagates watermarks among operators, so a downstream stateful operator can use the watermark produced upstream, enabling pipelines that chain stateful operations (for example, a stream-stream join followed by a windowed aggregation).
  • dropDuplicatesWithinWatermark operations: The dropDuplicatesWithinWatermark operation drops duplicate rows from a stream when the duplicates arrive within the watermark delay, even if their event-time timestamps are not identical. This can be useful for applications such as anomaly detection and real-time analytics (see the sketch after this list).
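A minimal sketch of dropDuplicatesWithinWatermark on a streaming DataFrame; the rate source and the eventId column are assumptions for illustration:

```python
# Deduplicate a stream within the watermark delay; eventId is an assumed key.
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()

events = (
    spark.readStream.format("rate").load()        # synthetic stream: timestamp, value
    .withColumn("eventId", expr("value % 100"))   # fabricate a duplicate-prone key
    .withWatermark("timestamp", "10 minutes")     # tolerate 10 minutes of lateness
)

# Rows with the same eventId arriving within the watermark delay are dropped,
# even if their event-time timestamps differ slightly.
deduped = events.dropDuplicatesWithinWatermark(["eventId"])

query = deduped.writeStream.format("console").start()
```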


New Apache Spark 3.5.0 built-in test helpers and HLL sketches

HLL (HyperLogLog) sketches are a probabilistic data structure for estimating the number of distinct elements in a large dataset. The new built-in test helpers, such as assertDataFrameEqual in the pyspark.testing package, make it easier to write unit tests that verify DataFrame results, including the estimates produced from HLL sketches.
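A minimal sketch of the new testing utilities (the example DataFrames are assumptions for illustration):

```python
# A minimal sketch of the pyspark.testing helpers added in Spark 3.5.
from pyspark.sql import SparkSession
from pyspark.testing import assertDataFrameEqual

spark = SparkSession.builder.getOrCreate()

actual = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])
expected = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])

# Raises a descriptive assertion error if the rows or schemas differ.
assertDataFrameEqual(actual, expected)
```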

Accessing HLL sketches directly is more powerful than using the approx_count_distinct() function because it gives developers more control over the sketch parameters and how the results are interpreted. For example, developers can use HLL sketches to estimate the number of distinct elements in a specific subset of a dataset, or to estimate the number of distinct elements in a dataset that is constantly changing.
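A minimal sketch contrasting the two approaches, using the new hll_sketch_agg, hll_sketch_estimate, and hll_union_agg functions (the visits data is an assumption for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    approx_count_distinct,
    hll_sketch_agg,
    hll_sketch_estimate,
    hll_union_agg,
)

spark = SparkSession.builder.getOrCreate()

# Assumed example data: (user_id, day) visit records.
visits = spark.createDataFrame(
    [("u1", "mon"), ("u2", "mon"), ("u1", "tue"), ("u3", "tue")],
    ["user_id", "day"],
)

# One-shot estimate: simple, but the underlying sketch is not reusable.
visits.select(approx_count_distinct("user_id")).show()

# Direct sketch access: build one sketch per day (lgConfigK tunes precision),
# then merge the sketches and estimate distinct users across all days.
daily = visits.groupBy("day").agg(hll_sketch_agg("user_id", 12).alias("sketch"))
daily.select(hll_sketch_estimate(hll_union_agg("sketch"))).show()
```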

Here are some specific examples of how developers can use HLL sketches in Spark 3.5.0:

  • Estimating the number of unique visitors to a website over a period of time.
  • Identifying the most popular products in an online store.
  • Tracking the spread of a disease through a population.
  • Detecting fraud in financial transactions.

Overall, the built-in test helpers and HLL sketches in Spark 3.5.0 give developers a more powerful and flexible way to estimate the number of distinct elements in large datasets.

Here is a comparison of the approx_count_distinct() function and accessing HLL sketches directly:

Feature                          approx_count_distinct()   Accessing HLL sketches directly
Control over sketch parameters   Low                       High
Interpretation of results        Fixed                     Flexible
Use cases                        Simple estimation         More complex tasks


Overall, accessing HLL sketches directly is more powerful and flexible than using the approx_count_distinct() function. It is especially useful for tasks that require more control over the sketch parameters or a flexible interpretation of the results.

Deploying Apache Spark on-premises

Apache Spark is a key component of IOMETE, a fully-managed lakehouse platform that runs in cloud, hybrid, and on-premises environments. Spark provides the underlying processing engine for IOMETE, enabling users to perform a wide range of data processing and analytics tasks, including:

  • Batch processing: Spark can efficiently process large datasets using in-memory caching and optimized query execution.
  • Interactive queries: Spark can be used to run interactive queries against data lakes, providing users with immediate access to insights.
  • Streaming data processing: Spark can be used to process real-time data streams, such as those generated by IoT devices or social media platforms.
  • Machine learning: Spark offers a rich set of libraries and APIs for machine learning tasks, such as classification, regression, and clustering.

IOMETE provides a number of features that make it easy to use Spark for data processing and analytics:

  • User-friendly interface: IOMETE provides a user-friendly interface for writing and executing Spark code.
  • Integrated notebook service: IOMETE includes a notebook service that makes it easy to develop and run Spark code interactively.
  • Seamless integration with popular tools: IOMETE integrates seamlessly with popular big data tools and libraries, such as DBT, Airflow, and MLflow.
  • Optimized resource utilization: IOMETE optimizes the allocation and utilization of computing resources for Spark jobs.

Overall, Apache Spark is a powerful and versatile data processing engine that is well-suited for use in IOMETE. Spark's capabilities are enhanced by IOMETE's user-friendly interface, integrated notebook service, and seamless integration with popular tools and libraries.

Here are some specific examples of how Apache Spark can be used in IOMETE:

  • A data engineer could use Spark to build a pipeline to process and analyze data from a variety of sources, such as databases, cloud storage, and IoT devices.
  • A data scientist could use Spark to train a machine learning model to predict customer churn or identify fraud.
  • A business analyst could use Spark to run interactive queries against a data lake to investigate trends and identify opportunities.

Overall, you can use Spark and IOMETE together to get the most from your data architecture, whether you deploy on-premises, in the cloud, or in a hybrid environment.