
Migrating from Apache Spark 3 to Spark 4: what changes and how to prepare

Rob Gibbon at Canonical has put together a guide for anyone running data and analytics workloads on Apache Spark who now has to make the jump to version 4. This isn’t a drop-in upgrade. Spark 4 breaks compatibility in a few places worth checking before you touch production.

The first thing you’ll hit is the version requirements. Scala 2.12 is no longer supported, so you need to move to Scala 2.13. On the Java side, Java 17 becomes the default and Java 21 also works; anything below Java 17 is no longer supported. If your pipeline carries old dependencies, this is where most of the work lands. For the Scala part, the post points to Scalafix, an open source tool that automates a good chunk of the code changes.
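A quick pre-flight check on each node can catch an old JVM before any job does. The sketch below is one way to do it in plain Python; the function names and the `MIN_JAVA_MAJOR` constant are made up for illustration, and it only parses the version banner rather than inspecting the cluster.

```python
import re

# Minimum Java major version for Spark 4, as stated in the article.
MIN_JAVA_MAJOR = 17


def java_major_version(version_output: str) -> int:
    """Extract the major Java version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("could not find a Java version string")
    major, minor = match.group(1), match.group(2)
    # Legacy numbering: "1.8.0_292" means Java 8.
    if major == "1" and minor is not None:
        return int(minor)
    return int(major)


def is_supported(version_output: str) -> bool:
    """True when the reported JVM meets the minimum major version."""
    return java_major_version(version_output) >= MIN_JAVA_MAJOR
```

Note that `java -version` writes its banner to stderr, so if you capture it with `subprocess.run(["java", "-version"], capture_output=True, text=True)`, read the `stderr` attribute of the result.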

Breaking changes

The headline change is that ANSI SQL mode is now on by default. Until now, a division by zero or an invalid type cast returned NULL and the query carried on. In Spark 4, those same operations throw a runtime exception. It’s stricter and more correct, but if your code relied on the old behaviour you’ll see failures where there weren’t any before.
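To make the difference concrete, here is a toy model of the two behaviours in plain Python. It is not Spark code and does not reproduce Spark's actual cast rules; `divide`, `cast_to_int` and `try_divide` are illustrative names, with `None` standing in for SQL NULL.

```python
from typing import Optional


def divide(a: float, b: float, ansi: bool) -> Optional[float]:
    """Model of `a / b`: NULL in legacy mode, an error in ANSI mode."""
    if b == 0:
        if ansi:
            raise ArithmeticError("division by zero")
        return None  # legacy behaviour: silently produce NULL
    return a / b


def cast_to_int(text: str, ansi: bool) -> Optional[int]:
    """Model of casting a string to an integer under the two modes."""
    try:
        return int(text.strip())
    except ValueError:
        if ansi:
            raise  # ANSI mode: the invalid cast fails the query
        return None  # legacy behaviour: NULL


def try_divide(a: float, b: float) -> Optional[float]:
    """Model of a `try_`-style function: NULL on failure in either mode."""
    return divide(a, b, ansi=False)
```

The last function shows why the `try_` family matters for the migration: it keeps the NULL-on-failure result explicit at the call sites that need it, instead of depending on a session-wide default.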

Logging changes too: structured JSON output is now the default. If you have monitoring systems that parse plain-text logs, plan to adjust them. On top of that, javax.* imports for the servlet and XML APIs must be replaced with their jakarta.* equivalents, and RocksDB becomes the default backend for shuffle and state management in structured streaming.
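During the transition a log collector may see both formats at once. The sketch below is one way to accept either; `parse_log_line` is a hypothetical helper, the JSON field names (`ts`, `level`, `logger`, `msg`) are an assumption to verify against your own Spark 4 output, and the plain-text pattern assumes the classic `yy/MM/dd HH:mm:ss LEVEL Logger: message` console layout.

```python
import json
import re
from typing import Optional

# Classic plain-text layout, e.g. "24/05/01 12:00:00 INFO SparkContext: message".
PLAIN_TEXT = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<logger>\S+): (?P<msg>.*)$"
)


def parse_log_line(line: str) -> Optional[dict]:
    """Return a dict with ts/level/logger/msg, or None if the line is unrecognised."""
    line = line.strip()
    if not line:
        return None
    if line.startswith("{"):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            record = None
        if isinstance(record, dict):
            # Field names are assumed; missing keys come back as None.
            return {
                "ts": record.get("ts"),
                "level": record.get("level"),
                "logger": record.get("logger"),
                "msg": record.get("msg"),
            }
    match = PLAIN_TEXT.match(line)
    return match.groupdict() if match else None
```

Normalising both formats to the same dictionary keeps downstream alert rules unchanged while clusters are upgraded one at a time.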

What you gain

It’s not all cleanup. Spark 4 adds the VARIANT data type for semi-structured data such as JSON, with better query performance. There’s also SQL pipe syntax using the |> operator to chain operations, and procedural SQL with multi-statement scripts, local variables and control flow (IF/WHILE logic). On the AI side, it brings vector data types and optimized LLM batch inference.
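For a feel of what the pipe syntax looks like, the sketch below assembles a query string from a list of steps. The `pipe_query` helper, the `orders` table and its columns are all hypothetical; the snippet only builds text and does not run anything against Spark.

```python
from typing import List


def pipe_query(source: str, steps: List[str]) -> str:
    """Build a pipe-syntax query: a FROM clause followed by one `|>` step per line."""
    lines = [f"FROM {source}"]
    lines += [f"|> {step}" for step in steps]
    return "\n".join(lines)


# Hypothetical table and columns, chosen only to show the shape of the query.
query = pipe_query(
    "orders",
    [
        "WHERE amount > 100",
        "AGGREGATE SUM(amount) AS total GROUP BY customer_id",
        "ORDER BY total DESC",
    ],
)
print(query)
```

Each step reads in the order it executes, which is the main appeal over nesting subqueries.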

How to approach the migration

The guide recommends doing it in phases rather than flipping everything at once:

  1. Foundational changes. Upgrade Java and Scala, and validate third-party connectors.
  2. Runtime validation. Turn on ANSI SQL mode while still on Spark 3 to surface problems early, and replace risky operations with safe alternatives like try_cast.
  3. Enabler modernization. Test the new RocksDB defaults for streaming and validate your CREATE TABLE statements.
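Phases 2 and 3 largely come down to configuration you can try on an existing Spark 3 cluster. As a rough sketch, the settings could be kept per phase and rendered into `spark-submit` flags. The two configuration keys are standard Spark settings; the dictionary names and the `spark_submit_flags` helper are made up for illustration.

```python
from typing import Dict, List

# Phase 2: opt in to ANSI SQL behaviour ahead of the upgrade.
PHASE_2_CONF: Dict[str, str] = {
    "spark.sql.ansi.enabled": "true",
}

# Phase 3: explicitly select the RocksDB state store for streaming tests.
PHASE_3_CONF: Dict[str, str] = {
    "spark.sql.streaming.stateStore.providerClass": (
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider"
    ),
}


def spark_submit_flags(conf: Dict[str, str]) -> List[str]:
    """Render a conf mapping as `--conf key=value` arguments, sorted by key."""
    flags: List[str] = []
    for key in sorted(conf):
        flags += ["--conf", f"{key}={conf[key]}"]
    return flags
```

Keeping each phase's settings in its own mapping makes it easy to enable one change at a time and to see which one a failing job reacts to.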

For rollout order, the advice is to start with batch workloads, then streaming, and leave analytical and decision-support workloads for last. That order makes sense: if something goes wrong, a batch job is easy to re-run and its results are easy to compare against the previous version.
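Comparing a batch job's output across the two versions can be as simple as an order-insensitive diff of the result rows. The sketch below assumes the rows have already been collected as tuples from each run; `diff_rows` is an illustrative name, not part of any Spark API.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, List


def diff_rows(
    baseline: Iterable[Hashable], candidate: Iterable[Hashable]
) -> Dict[str, List[Hashable]]:
    """Compare two result sets as multisets, ignoring row order."""
    base, cand = Counter(baseline), Counter(candidate)
    return {
        # Rows the Spark 3 run produced that the Spark 4 run did not.
        "missing": sorted((base - cand).elements(), key=repr),
        # Rows that appear only in the Spark 4 run.
        "unexpected": sorted((cand - base).elements(), key=repr),
    }


spark3_rows = [(1, "a"), (2, "b"), (2, "b")]
spark4_rows = [(2, "b"), (1, "a"), (3, None)]
report = diff_rows(spark3_rows, spark4_rows)
```

Using a multiset rather than a set means a duplicated or dropped row shows up in the report, which is the kind of difference a change in NULL handling can introduce.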

If you run Spark on Kubernetes, Canonical offers its Charmed solution for Apache Spark with Java 17 and 21 pre-configured, plus support options. As tends to be the case with its catalogue, it slots in naturally if you already work on Ubuntu.

Source

Guide published by Rob Gibbon on the Canonical blog: Migrating from Apache Spark 3 to Spark 4. Content and product by Canonical.