Published April 28, 2025
I want to talk about Spark. It's still around. It's still widely in use. It's still a vital skill on a data engineering resume. There are tons of tutorials, blogs, guides, talks, and books on Spark. A shocking number of them are just about skew. A shocking number of them are self-aggrandizing with record counts (1 trillion records!) without disclosing actual data volumes. Some of them are conveniently vague about whether they moved a petabyte in a single job run or stored it over several years of daily batches. And, did you know you should try to avoid shuffles!
And yet, I still turn to them when I find a finicky Spark job. Oh jeez, what am I supposed to do? Which parameter do I tune? Do I need spark.default.parallelism or spark.executor.cores, and why doesn't EMR default maximizeResourceAllocation to true? Does anyone even use multi-tenant clusters anymore?
Frankly, I'm sick of the self-doubt. I've been using Spark in earnest since 2018. Before 2018, I was helping a team research Spark in an environment with heavy red tape and personally delivered the message that we didn't really need Spark because the database was only 60GB. Since then I've written and maintained a wrapper around EMR with convenience functions in Spark for my data lake, and I am currently working on a similarly large EMR wrapper. I've used Spark on Docker, EMR, and Glue. I've seen Spark Streaming used to replace Lambda- and Flink-based streaming jobs. I've seen EMR used as a parachute for Glue jobs that don't scale just right. Full disclosure, I've moved terabytes of data in a single run (I know, what a chump) and petabytes over years.
I will not guarantee an answer to your scalability problem and you shouldn't expect one! But the reason isn't that we aren't good enough to fix it. If you are reading this, it's probably because all the simple technical fixes aren't working and because Spark, and the way that companies use Spark, is sometimes outrageously complex. You would need to train a huge model to learn the optimal values for all the variables going into a job's performance. The other issue, as I alluded to earlier, is that our sources of information are far too narrowly focused to be of value in the real world. THAT is what I want to address in this article. So buckle up, this is a long one!
Skew is now YouTube clickbait (but so is everything, no shade to these two).
Layers. Layers and layers of complexity. OMG distributed systems complexity. What many don't realize or, at least, fail to plan for is that even the simplest Spark application is a distributed system. Wikipedia defines distributed computing as "computer systems whose inter-communicating components are located on different networked computers". Even if you use a heavily abstracted and managed deployment of Spark on AWS Glue, for example, there are servers running operating systems behind the scenes. They're probably virtualized servers too! On top of that, you have a network stack (which is unreliable). Then you have several applications working in concert (HDFS, Hadoop, YARN, Spark). Then you have the integrations for your data, such as object storage or database connections, which are usually running in scalable and highly available systems themselves. Each of these domains is subject to your organization's unique configurations (data organization and formats, security model, attempted Spark configs, etc.). Then you have your application, which we'll talk more about in a bit. Lastly, you have orchestration tools, ETL job timing and sequencing, and the interplay between the many jobs and data stores in your organization.
Here's just a little image of what's going on:
That's plenty, but let's get back to your application. Your Spark application, at a minimum, initializes a session, retrieves some data, applies some processing, and stores the result. The simplest case of this would lead to an embarrassingly parallel application, which is ideal. Beyond that, developers are tempted to organize their code better with functions. Using regular Python, Java, or Scala functions in a Spark application is deceptive though. Spark's lazy evaluation makes it easy to lose track of performance implications while developing locally. On top of that, these functions rarely have simple parameters and outputs: they usually accept and emit DataFrames or RDDs full of tabular data, and that data needs to be sampled and scrubbed or generated to be useful locally.
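Here's a minimal sketch of what I mean, with hypothetical table paths and function names. The transform function looks innocent, but every line in it is lazy; nothing actually executes until the write at the end, which is where the performance surprises show up.

```python
from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as F


def enrich_orders(orders: DataFrame, customers: DataFrame) -> DataFrame:
    """Hypothetical transform: builds a query plan, executes nothing yet."""
    return (
        orders.join(customers, "customer_id")                          # lazy
              .withColumn("net", F.col("amount") - F.col("discount"))  # lazy
              .groupBy("region")
              .agg(F.sum("net").alias("net_revenue"))                  # still lazy
    )


if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    orders = spark.read.parquet("s3://example-bucket/orders/")        # placeholder paths
    customers = spark.read.parquet("s3://example-bucket/customers/")
    result = enrich_orders(orders, customers)
    # The action: the join, the shuffle, and any skew all surface right here.
    result.write.mode("overwrite").parquet("s3://example-bucket/daily_revenue/")
```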
Even if you write great functions and test them well, you might be an avid SQL-avoider (my term for my former software engineering self before years of data engineering brought me to the light). You may find that thinking declaratively instead of iteratively just doesn't make sense for your problem and that a User-Defined Function is the way to go, even in Python, even without Apache Arrow.
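If you do go the UDF route in Python, the serialization choice matters. As a rough illustration (the column and function names are made up, and it assumes pandas and PyArrow are installed on the cluster), a plain Python UDF ships rows to Python workers one at a time, while a pandas UDF works on Arrow-backed column batches:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "x")


# Plain Python UDF: rows are pickled and processed one at a time in a Python worker.
@udf(DoubleType())
def slow_square(x):
    return float(x) ** 2


# Pandas (Arrow-backed) UDF: operates on whole column batches, far less overhead.
@pandas_udf(DoubleType())
def fast_square(x: pd.Series) -> pd.Series:
    return x.astype(float) ** 2


# Force evaluation of each version so the difference is visible in the Spark UI.
df.select(slow_square("x").alias("y")).agg({"y": "sum"}).show()
df.select(fast_square("x").alias("y")).agg({"y": "sum"}).show()
```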
You still need to test on production-like infrastructure and performance test. You should write code to manage backfills and reprocessing for outages too, but will you?
If there is one takeaway from this article, it is to acknowledge the complexity in your application. Please. For all of us.
Plan to upgrade. The Apache Spark migration guide is very useful. You can see upgrades from 2.4 to 3.5.4. This applies to AWS EMR Releases as well, which have their own support and end-of-life windows. Why is this so important? Because it's not always trivial and can lead to major performance improvements. Spark migrations are so non-trivial that Airbnb engineers have given presentations on the systems they built to manage them, and so did Pinterest engineers. Spark's major performance improvements come from the evolution of its self-optimizing abilities: from the Catalyst optimizer and Tungsten, to the Cost-Based Optimizer, and now Adaptive Query Execution (AQE).
New releases aren't coming out nightly, but that's all the more reason to have a plan to incorporate them. I have seen at least 3 large-scale Spark implementations and all of them failed to include upgrade planning in their roadmaps. Forming a plan depends heavily on whether you have a single- or multi-team effort to manage and how much centralization you need (and the corresponding control, observability, and response time you can manage). However, each application will need to be tested on the newer version and have its status, performance, and output validated (see Airbnb's example). You can potentially leverage your workflow tooling to automate this process (see Pinterest's example). In truth, the upgrade process is fairly straightforward, although time consuming and potentially expensive. The real challenges are in the details of each upgrade issue, but those are well documented in the migration guides and issue trackers. So there's no excuse not to plan!
Airbnb Migration Framework
Pinterest Migration State Machine
Again, plan to upgrade. Whether you are new to Spark or have been using it for years, you should communicate that this is a serious requirement, even more so than you might worry about your web service's dependencies.
As alluded to in the introduction, there's a lot of repetitive advice out there. So I'll start with the best tools I have ever used for Spark optimization:
Does this job need to exist?
Does this need to be a Spark job?
If the answer to either of those is no, you should remove or replace it with something simpler. Our companies always have enough legacy cruft and there's no reason to be attached to a Spark job if nobody uses the data it produces. And nothing beats a simple local development loop with code that executes instantly in a production-like environment right on your laptop.
When data is small or jobs can only be described in single-threaded ways, you should absolutely not use Spark - although you should always seriously question why something must be a single server if it's not training an ML model that isn't distributable. Spark's lazy execution can always strike when data is unknowingly pulled back to the driver node. More often than not, this happens in jobs with complex query plans that are too large to interpret, not simply from inexperience on the part of the developer.
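The classic trap looks something like this (big_df is a stand-in for any large production DataFrame): the code is perfectly legal, runs fine against a laptop-sized sample, and then quietly funnels the whole dataset through a single JVM in production.

```python
# Both of these materialize the entire result on the driver. On a small sample
# they're instant; on production data they become a single-node bottleneck or
# an OutOfMemoryError on the driver.
rows = big_df.collect()
pdf = big_df.toPandas()

# A bounded alternative while debugging: take a small, explicit slice.
preview = big_df.limit(1000).toPandas()
```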
If you are not occasionally frightened by what you propose to axe on your list of Spark jobs and datasets to maintain, be more ambitious. If you are met with resistance, do a one-off assessment of how much your under-utilized data sets cost your company and put it in a proposal for better data lineage tooling. Then, watch your ambitions become practical expectations.
Spark configuration parameters show up in blogs and StackOverflow posts and guides everywhere. Wouldn't it be great if you could tweak a simple parameter and cut your job time by a fraction? This isn't always the case though. Many managed services handle these for you, so overriding them can be counterproductive. Generally, deployment mode is a good thing to check, and there's no need to agonize over which one you need (it's always cluster). Also, use any simple optimizers from your cloud provider, like EMR's maximizeResourceAllocation, which are likely relevant whether your managed Spark service runs long-lived clusters or ephemeral ones. But most of the time, I see people encountering and using these settings haphazardly, and because Spark's tuning parameters are tightly interconnected, changing them is usually a bad idea unless you're absolutely sure of what they do.
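For the EMR case, those two checks amount to very little code. Here's a hedged sketch using boto3 (the instance types, release label, IAM roles, and script path are placeholders for whatever your organization actually uses): executor sizing is delegated to maximizeResourceAllocation and the driver runs on the cluster via --deploy-mode cluster, with no hand-tuned spark.executor.* settings in sight.

```python
import boto3

emr = boto3.client("emr")

# Hypothetical ephemeral, single-job cluster. EMR sizes executors from the
# instance types (maximizeResourceAllocation) and the driver runs on the cluster.
response = emr.run_job_flow(
    Name="example-ephemeral-cluster",
    ReleaseLabel="emr-7.1.0",
    Applications=[{"Name": "Spark"}],
    Configurations=[
        {
            "Classification": "spark",
            "Properties": {"maximizeResourceAllocation": "true"},
        }
    ],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the step finishes
    },
    Steps=[
        {
            "Name": "example-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "s3://example-bucket/jobs/example_job.py",
                ],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```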
Spark's SQL and DataFrame APIs include various evolutions of cost-based optimizers (Catalyst, CBO, AQE). Few people have applications simple enough, and the diligence (i.e. time), to use the RDD API and still get good performance. This is especially true for platform teams trying to enable and streamline Spark for other people. Spark SQL shouldn't be overlooked either. It has immense value in its portability.
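Before hand-tuning anything, it's worth confirming those optimizers are actually on. A minimal sketch (these are standard Spark 3.x settings, and AQE is enabled by default in recent releases, so this is mostly about making intent explicit):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aqe-example")
    .config("spark.sql.adaptive.enabled", "true")                      # Adaptive Query Execution
    .config("spark.sql.adaptive.skewJoin.enabled", "true")             # let AQE split skewed partitions
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")   # merge tiny shuffle partitions
    .getOrCreate()
)
```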
Tools like Airflow, Prefect, Dagster, AWS Step Functions and others are really useful for scheduling and ordering data processing jobs. They provide at least a small improvement to data lineage while greatly improving reliability and simplifying reprocessing. Specifically, the ability to retry a job is paramount. Network blips, spot loss, and other effectively random disruptions can cause a Spark job to fail (remember distributed computing?), and quite often simply retrying the job is suitable.
HOWEVER, orchestration is no substitute for fixing flaky jobs. Jobs that behave non-deterministically on their own, or that just don't scale well, will be masked by a retry policy. If you find yourself increasing retries beyond 3, start getting skeptical. If you need more than 10, you're doing it wrong. Just because you don't understand the problem or don't have time to investigate doesn't mean it's retryable. Only some issues are really best handled by retries, and knowing the difference is critical. At the very least, we want to hold our cloud providers accountable to the availability promises that they make and not give them a free pass with our time and credit cards.
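In orchestration terms, that advice is just a small, bounded retry policy. A sketch with Airflow (assuming a recent Airflow 2.x; the DAG id, schedule, and script path are hypothetical): a couple of retries with a delay covers spot loss and network blips, while double-digit retries would only be hiding a real problem.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_spark_job",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={
        "retries": 3,                          # enough for transient failures
        "retry_delay": timedelta(minutes=10),  # give spot capacity a chance to recover
    },
) as dag:
    run_job = BashOperator(
        task_id="spark_submit",
        bash_command=(
            "spark-submit --deploy-mode cluster "
            "s3://example-bucket/jobs/example_job.py"
        ),
    )
```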
Another interesting pattern that I see emerge from time to time is addressing problems with the workflow tool that could be addressed within the Spark job. For example, if you have a major skew issue, you need to address it directly. If you can't address it at a business level, address it with Spark (salting and hashing, like the wealth of those tutorials online are hoping you will in exchange for an extra view). Think of your driver node and Spark application as your first orchestrator, before Airflow or Step Functions. This will allow you to leverage locality of reference and in-memory processing (the reasons Spark was created!).
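For completeness, here's what the much-blogged-about salting trick boils down to, as a rough sketch. It assumes an existing spark session, a hypothetical facts DataFrame heavily skewed on key, and a dims DataFrame too big to broadcast but small enough to replicate a handful of times (and note that in Spark 3.x, AQE's skew-join handling may make this unnecessary):

```python
import pyspark.sql.functions as F

SALT_BUCKETS = 16  # tune to the severity of the skew

# Spread each hot key across SALT_BUCKETS sub-keys on the large, skewed side.
salted_facts = facts.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Replicate the smaller side once per salt value so every sub-key still matches.
salted_dims = dims.crossJoin(
    spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
)

# Join on the original key plus the salt, then drop the helper column.
joined = salted_facts.join(salted_dims, on=["key", "salt"]).drop("salt")
```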
If you must process skewed data with separate Spark applications (you don't), use dedicated and fully-isolated ephemeral clusters. Otherwise, you'll need to use HDFS or shared tables to get any performance benefits. In most cases, splitting jobs will require you to repeatedly spend expensive compute time being I/O bound or starved for resources and wasting time on context switching. In these cases, no amount of repartition, coalesce, cache, or persist calls will save you; they'll just make it worse.
Address skew. Avoid shuffles, full table scans, repeated table reads, and redundant writes. Use broadcast tables. Avoid UDFs, and if you must use one, use a better serializer like Arrow. Be very judicious with functions in general. And again, check for small data, single-threading, and client mode.
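The broadcast piece of that list is a one-liner, so here it is as a small sketch (table and column names are hypothetical): ship the small dimension table to every executor rather than shuffling the large fact table.

```python
from pyspark.sql.functions import broadcast

# Hint Spark to replicate the small side to every executor instead of
# shuffling the large side across the cluster.
result = large_facts.join(broadcast(small_dims), "customer_id")

# AQE/CBO will often choose a broadcast join on its own when the small side is
# under spark.sql.autoBroadcastJoinThreshold, but the explicit hint documents intent.
```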
Out of this list, table scans are probably the most frustrating. Most of these fixes take only a few lines of code, but table scans caused by mismatched access patterns are harder: they are hard to detect at scale, hard to correct, and often require cross-team coordination to address. And if you can't predicate-pushdown your way to quick jobs, you'll have a lot of trouble with testing too, which will just add to the frustration. This is where your metadata, table stats, and fancy data governance tools will seriously save you some stress.
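As a quick illustration of what a mismatched access pattern means in practice, assume a hypothetical events table partitioned by a ds date column. Whether the partition filter makes it into the scan is the difference between reading one day and touching the whole table, and explain() will show you which one you got:

```python
# Scans every date partition: the event_type predicate can be pushed into the
# Parquet reader, but every partition's files still get listed and touched.
all_events = spark.read.parquet("s3://example-bucket/events/")
purchases = all_events.filter("event_type = 'purchase'")

# The partition filter prunes the scan down to a single day before anything heavy runs.
one_day = (
    spark.read.parquet("s3://example-bucket/events/")
         .filter("ds = '2025-04-01'")
         .filter("event_type = 'purchase'")
)
one_day.explain(True)  # look for PartitionFilters / PushedFilters in the physical plan
```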
The learning curve for the Spark UI is steep. You have to understand the difference between an application, a job, a stage, and a task. You need to know about drivers and executors and why they're separate from nodes. You need to know the difference between your code and a query plan, and between a logical and a physical plan. You'll also need to know the quirks of the UI, like when it is lagging, when it is just broken, or when it just looks broken because your app is broken.
Be patient! This is much better than looking at logs and it is quite literally all the information that you need. The biggest frustration I see from people is thinking that there's going to be something obvious to fix like a big red error or a huge spike in a utilization graph (by the way, Ganglia is helpful but differently licensed from Spark). The truth is that Spark performance issues are often due to a sub-optimal query plan that has more to do with your code or data organization than some obvious bottleneck in your cluster tuning. That's why the visual inspectors in the UI are very useful.
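One low-friction habit that helps: print the plan yourself before you ever open the UI. A tiny sketch (the DataFrames are placeholders); explain in formatted mode gives you the same physical plan the UI's SQL tab visualizes, in a readable layout.

```python
# Build the query lazily, then inspect the plan without running the job.
report = (
    orders.join(customers, "customer_id")
          .groupBy("region")
          .count()
)
report.explain(mode="formatted")  # Spark 3.x: readable, numbered physical plan
```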
If you find that your UI is flooded with small stages or tasks, or totally empty, it's a clue. In the first case, you may have actually over-scaled your cluster, forcing small chunks of data everywhere, which can lead to lots of overhead. In the second case, you may be able to find application failures in YARN or elsewhere that are more straightforward to solve once you find them.
For cloud users, switching server instance types (especially to ARM-based processors) or using a serverless option can be a quick win. Some services, like AWS Glue, offer automated tuning recommendations based on job code and logs, but these can be limited to just a few versions of Spark. Just be aware that these options are not magic, and while they may provide ongoing benefit, they are one-time improvements. Once you go serverless or ARM, you've used all your options until AWS thinks up something new. They aren't a substitute for other best practices, and they can also introduce compatibility challenges or unexpected cost increases if not applied carefully.
Step one to tackling any Spark issue is to think like a scientist, then an engineer. Think of Spark as a highly multi-variate system that you are observing to better understand (because it is). Now think like an engineer. If you have a singular goal in mind (say, reducing run time) and you change multiple variables, how will you repeat your success elsewhere? Do you repeat them all? Controlling your variables and testing one at a time is painstaking for sure but so is going in circles while racking up thousands of dollars in cloud fees and not getting any answers. Or worse, thinking you have an answer only to repeat the same haphazard process again in 6 months. The worst feeling I have while debugging any complex system is when it just starts working again and I don't know why (yes, I know I have a problem).
Here's a quick list of key variables to control or test:
Spark/Glue/EMR version
Spark configuration parameters (one at a time)
Infrastructure configurations (also one at a time)
Application code
Underlying data tables (even hourly/daily fluctuation can confound a performance test)
Multi-tenancy on your cluster (I would simply avoid this altogether)
If you have a large number of applications to manage, you'll probably find whatever out-of-the-box observability you have is lacking. This is especially true if you have short-lived ephemeral/transient clusters, since they are a newer paradigm. You need to be able to track job status and run times at a glance and across runs. You want to see how company-wide changes impacted overall failure rates, running times, and data quality. This will require additional tools, instrumentation, and automation to make possible.
Whether your team is managing one job or 1,000, you need a way to test them. For many, this means a few meaningless unit tests followed by a staging deployment, a manual run in staging, a production deployment, and crossed fingers while you wait for the next run. Why is this?
Scalability and real-world data complexity can limit the usefulness of local testing.
Lazy evaluation and data evolution make unit tests hard to write and maintain.
If you want to get ahead of these problems, you need to prioritize testing with performance in mind.
Here are a few steps that facilitate this approach:
Parameterize your data output locations as early in your toolchain as possible.
Parameterize scalability limits as a percentage if possible or as a record count if not.
Make your critical code path easy to run both interactively in a notebook and in production.
Parameterizing data output is fairly self-explanatory. You don't want to hardcode your output to your production table.
Parameterizing scalability limits is less obvious. It's one thing to toggle the input table to a non-production environment, however, these environments can sometimes be poorly governed and have unrealistic data. They also may have a fixed or trivial data volume, which doesn't enable realistic profiling. Since most data lakes can easily accommodate extra read capacity, it's useful to have the job be capable of limiting itself while still using production data. This starts with limiting which partitions are read and introducing filters early in the query plan so that maximum processing logic is performed on the smaller data subset. Percentages are more useful to others who are unfamiliar with the relationship between record count and scale but may require code to convert to record counts. Having data reading limits not only lets you run your jobs faster, it lets you use production data for better data validation and to see if your application runtime grows linearly with its data volume.
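Here's roughly what that looks like in code, with hypothetical knobs (sample_fraction and days) and a hypothetical ds partition column. A test run might pass sample_fraction=0.01, days=1 and still read real production data, while the filters land early enough in the plan to keep the run cheap.

```python
from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as F


def read_events(spark: SparkSession, sample_fraction: float = 1.0, days: int = 30) -> DataFrame:
    """Read production data, but let callers shrink it for tests and profiling."""
    cutoff = F.date_format(F.date_sub(F.current_date(), days), "yyyy-MM-dd")
    df = (
        spark.read.parquet("s3://example-bucket/events/")
             # Prune partitions first so the limit applies before heavy processing.
             .filter(F.col("ds") >= cutoff)
    )
    if sample_fraction < 1.0:
        # Deterministic sample so repeated test runs are comparable.
        df = df.sample(fraction=sample_fraction, seed=42)
    return df
```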
Making your critical code path easy to run interactively is also very useful. Writing Spark code in a Jupyter notebook on a live cluster can be a great way to continuously assess job performance while writing or refactoring code. Parameterizing code as described above is the first step. Moving most of your code for reading and processing data into a function, while leaving Spark session initialization and data writes in the main function, is the goal. Beyond that, making it easy to copy/paste that function, or to build, push, and reinstall the package, is also essential. While running on a live cluster can cost money and take more time than unit tests, it is also much more comprehensive and realistic and, most importantly, puts performance testing at the start of the workflow and keeps it there throughout.
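Put together with the reader sketched above, the shape of the job ends up something like this (function and path names are still hypothetical): everything interesting lives in transform(), which imports cleanly into a notebook on a live cluster, while main() stays a thin production wrapper.

```python
from pyspark.sql import DataFrame, SparkSession


def transform(events: DataFrame) -> DataFrame:
    """All the interesting logic: importable in a notebook, callable in production."""
    return events.groupBy("region").count()


def main() -> None:
    spark = SparkSession.builder.getOrCreate()
    events = read_events(spark)  # the parameterized reader from the previous sketch
    transform(events).write.mode("overwrite").parquet("s3://example-bucket/region_counts/")


if __name__ == "__main__":
    main()
```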
I am going to plug Spark SQL again. It is sadly unpopular with a lot of engineers outside of the data community.
One reason is that SQL is often treated by engineers as less expressive. I am personally of the belief that this comes from unfamiliarity. I know that Ruby is considered highly expressive, but since I haven't used it in 10 years, I don't think I would feel that way today. At the same time, I have seen folks who argue against SQL also argue for declarative programming with Infrastructure as Code (e.g. Docker, CloudFormation, Terraform). When it comes to managing and scaling big data workloads, the problems are usually not hard to express but rather hard to scale and manage. In those cases, it's better to focus on portability.
That leads us to the next reason people avoid SQL, which is that it almost invariably has some dialect. This can increase the learning curve and decrease interoperability. However, ANSI SQL is generally the base for most dialects, and it is certainly easier to port Spark SQL to pgSQL or Trino/Presto/Athena than it is to port RDD or DataFrame code in Python to pgSQL!
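A small example of the portability argument, using a hypothetical orders table registered as a view. The statement below is close enough to ANSI SQL that it would run with little or no change on Trino/Athena or Postgres, which is not something you can say about the equivalent DataFrame or RDD code:

```python
daily_revenue = spark.sql("""
    SELECT region,
           date_trunc('day', order_ts) AS order_day,
           SUM(amount)                 AS revenue
    FROM orders
    WHERE order_ts >= DATE '2025-04-01'
    GROUP BY region, date_trunc('day', order_ts)
""")
```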
I once wrote an "algorithm" for identifying high-activity bounding boxes for robots in SQL. It wasn't easy and it was hard to scale but we were able to port it from Postgres to Athena which made a massive improvement in cost, scale, and complexity and allowed us to standardize on existing tools. While that feature is now working in realtime with more custom infrastructure to make things cheaper and faster, the investment in that infrastructure would have been wildly impractical before we knew this feature was possible and marketable. So, while I will continue to advocate for SQL, even for really hard problems, I still wouldn't recommend Spark in this scenario. I would have needed to use a UDF, manage packaging dependencies for the EMR runtime environment, reconfigure data serialization, and worry more about skew.
This comes back to the idea that Spark is super complex on its own and that you should use it for embarrassingly parallel problems. When you start needing a ton of expressiveness to transform the data or solve a problem, you may want to check your understanding of the problem or break it into smaller problems. Unpacking a finicky legacy Spark job is expensive and time consuming, so being diligent upfront, focusing on the big picture, and staying flexible in the long run are key.
You can even make a clustering algorithm in SQL.
So, I have stated my case that all the sage wisdom about Spark is useful but that it sometimes assumes that the context is a black box, which is often untrue and very limiting. I've proposed what I view as useful mental strategies for reasoning about Spark optimization challenges. If you have thoughts or questions about this strategy guide, I'd love to hear them!