When you’re optimizing a data lake, your only metrics are cost and time. You’re working with
expensive clusters of worker machines, and you can approximate cost as:
COST = NUMBER_OF_WORKERS * COST_OF_WORKERS * TIME
Memory/CPU usage, spill, and shuffle are useful symptoms for diagnosing problems, but the real objective is ensuring the job completes in an appropriate amount of time and at a reasonable cost.
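To make the tradeoff concrete, here is a minimal back-of-the-envelope sketch of that formula in Python; the worker counts and the $0.50/hour price are hypothetical. It shows why adding workers can lower total cost when a job parallelizes well.

```python
# Hypothetical numbers: back-of-the-envelope cost comparison.
def job_cost(num_workers: int, cost_per_worker_hour: float, hours: float) -> float:
    """COST = NUMBER_OF_WORKERS * COST_OF_WORKERS * TIME"""
    return num_workers * cost_per_worker_hour * hours

# Scenario A: 10 workers at $0.50/hour for 4 hours.
print(job_cost(10, 0.50, 4.0))   # 20.0
# Scenario B: 20 workers at $0.50/hour for 1.5 hours (the job parallelizes well).
print(job_cost(20, 0.50, 1.5))   # 15.0 -- more workers, less wall-clock time, lower cost
```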
Spill
First, consider spill. Spill means a worker ran out of memory, so data is written to disk and must be read back again later.
Disk access is slow and costly.
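As a rough illustration, here is a hedged PySpark sketch of one common spill remediation: splitting the work into more, smaller shuffle partitions so each task holds less data in memory. It assumes an existing SparkSession named `spark` and a large DataFrame `df`; the column names `customer_id` and `amount` are hypothetical.

```python
from pyspark.sql import functions as F

# More (smaller) shuffle partitions -> smaller per-task memory footprint -> less spill.
spark.conf.set("spark.sql.shuffle.partitions", "800")  # Spark's default is 200

result = (
    df.repartition(800, "customer_id")        # hypothetical key column
      .groupBy("customer_id")
      .agg(F.sum("amount").alias("total"))    # hypothetical aggregation
)
```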
Shuffle
Next, think about shuffle. Shuffle happens when data sits on the wrong machines and
must be redistributed across workers, adding significant network time. Shuffle itself is not necessarily a problem, but heavy shuffling is a symptom to minimize.
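One hedged example of avoiding an expensive shuffle is a broadcast join in PySpark: instead of shuffling both tables by the join key, the small table is copied to every worker so the large table stays where it is. The `orders` and `countries` DataFrames and the `country_code` column below are hypothetical.

```python
from pyspark.sql.functions import broadcast

# A plain join shuffles both sides across the network by the join key.
shuffled = orders.join(countries, on="country_code")

# Broadcasting the small lookup table ships it to every worker once,
# so the large `orders` table never moves.
broadcast_joined = orders.join(broadcast(countries), on="country_code")
```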
Resources (CPU/Memory)
Memory and CPU usage should be monitored, but the focus should be on efficient completion.
The ideal scenario is that compute is at 100% and memory is fully utilized, so every resource you pay for is doing work.
Using fewer vCPUs or less memory will always save cost, as long as the reduction doesn't add significant time.
Tune your cluster with common sense. If you expect a memory-intensive task, add memory-weighted machines; otherwise, use compute-heavy machines to crunch through.
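As a sketch of what memory-weighted tuning can look like in practice, the snippet below sets standard Spark executor properties when building a session; the values are illustrative only, and on many platforms these settings are chosen at cluster-creation time instead.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("memory_heavy_aggregation")        # hypothetical job name
    .config("spark.executor.memory", "16g")     # more memory per executor
    .config("spark.executor.cores", "4")        # fewer, larger executors
    .getOrCreate()
)
```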
Clusters have a minimum and maximum number of workers, which helps manage cost and is how you scale. Well-planned
jobs scale by expanding the number of workers, and idle machines are always a wasted expense.
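A minimal sketch of such a min-max range is shown below as a plain Python dict; the field names follow the common autoscaling pattern (e.g. Databricks-style min/max workers) but are illustrative only and vary by platform.

```python
# Illustrative cluster spec; field names and instance type are hypothetical.
cluster_spec = {
    "node_type": "r5.xlarge",       # hypothetical memory-weighted instance
    "autoscale": {
        "min_workers": 2,           # floor keeps small runs cheap
        "max_workers": 16,          # ceiling caps cost on big runs
    },
}
```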
Common traps include relying on single-machine libraries (like Python's pandas) or writing code outside of Spark's frameworks that can't be distributed across cluster workers.
Bigger isn't always faster, and underutilized specialized hardware (high-memory machines or GPUs) comes with a big price tag.
Use frameworks like Spark, distributed DataFrames, or SQL so that workloads can be distributed across machines.
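To illustrate the trap, here is a hedged side-by-side: the pandas version pulls the whole dataset into one machine's memory, while the Spark DataFrame version spreads the same aggregation across workers. The path and column names are hypothetical, and the Spark half assumes an existing SparkSession named `spark`.

```python
import pandas as pd

# Single-machine: pandas loads everything into one worker's memory.
pdf = pd.read_parquet("s3://bucket/events.parquet")      # hypothetical path
daily_pd = pdf.groupby("event_date")["revenue"].sum()

# Distributed: the same aggregation as a Spark DataFrame runs across all workers.
sdf = spark.read.parquet("s3://bucket/events.parquet")
daily_spark = sdf.groupBy("event_date").sum("revenue")
```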
Effectively using available resources and adjusting the number of workers allows you to keep processes scalable and cost-efficient.
Clusters are expected to be temporary: they spin up for a job and are turned off as soon as possible.
Running them 24 hours a day for small, frequent tasks is an expensive proposition in a data lake.
Additionally, after jobs are done, there is an idle period before the machines shut off.
A low idle-shutoff setting ensures clusters turn off quickly when they're done.
This sacrifices anything cached in memory, but brings quick savings.
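To put a number on the idle window, here is a small sketch using the same cost formula; the cluster size, price, and run frequency are hypothetical.

```python
# Hypothetical numbers: what idle time costs per day before shutoff kicks in.
def idle_cost(num_workers: int, cost_per_worker_hour: float,
              idle_minutes: float, runs_per_day: int) -> float:
    return num_workers * cost_per_worker_hour * (idle_minutes / 60) * runs_per_day

# A 20-worker cluster at $0.50/worker-hour that runs 4 jobs a day:
print(idle_cost(20, 0.50, 60, 4))   # 40.0 per day with a 60-minute idle shutoff
print(idle_cost(20, 0.50, 10, 4))   # ~6.7 per day with a 10-minute idle shutoff
```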