Spark Internals: Understanding Tungsten (Part 1)
Briefly

"When Apache Spark first hit the big data scene, it felt like absolute magic. It dethroned Hadoop's MapReduce by processing data in memory, making it lightning fast."
"The JVM treats memory like a luxury hotel where every single guest gets a massive penthouse. In Java, an object is incredibly bloated, leading to wasted memory."
"Garbage Collection (GC) becomes a nightmare as Spark creates millions of objects that survive, causing the JVM to panic and trigger massive, multi-second 'stop-the-world' pauses."
"Databricks engineers discovered that the performance bottlenecks in Spark were not due to disk I/O or network speed, as hardware had evolved significantly."
Apache Spark transformed big data processing by enabling in-memory computation, significantly outperforming Hadoop's MapReduce. As adoption grew, however, the Java Virtual Machine (JVM) itself became a bottleneck: Java objects carry substantial per-object overhead, inflating memory consumption and forcing frequent garbage collection, whose long "stop-the-world" pauses introduced latency that undermined performance. Databricks engineers found that Spark's bottlenecks were no longer disk I/O or network speed, since hardware in those areas had improved dramatically; the limiting factors had shifted to CPU and memory efficiency, which is the problem Tungsten set out to address.
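To make the object-bloat claim concrete, here is a small sketch of the classic motivating example: estimating how much heap a 4-character Java String like "abcd" occupies on a 64-bit JVM with compressed object pointers. The layout numbers (12-byte object header, 8-byte alignment, a UTF-16 `char[]` backing array as in pre-Java-9 JVMs) are typical assumptions, not guarantees; exact sizes vary by JVM version and flags, and the class and method names below are purely illustrative.

```java
// Estimate the heap footprint of a small Java String on a 64-bit JVM
// with compressed oops. All sizes are typical layout assumptions.
public class StringOverhead {

    // Round up to the JVM's 8-byte object alignment.
    private static int align8(int n) {
        return (n + 7) & ~7;
    }

    public static int estimateStringBytes(int chars) {
        // The String object itself:
        int header = 12;     // mark word + compressed class pointer
        int hashField = 4;   // cached hashCode field
        int arrayRef = 4;    // compressed reference to the backing array
        int stringSize = align8(header + hashField + arrayRef);

        // The backing char[] (pre-Java-9 UTF-16 layout):
        int arrayHeader = 16;      // 12-byte header + 4-byte length field
        int charData = chars * 2;  // 2 bytes per UTF-16 char
        int arraySize = align8(arrayHeader + charData);

        return stringSize + arraySize;
    }

    public static void main(String[] args) {
        // 4 bytes of actual ASCII data cost roughly 48 bytes of heap.
        System.out.println("\"abcd\" ~ " + estimateStringBytes(4) + " bytes");
    }
}
```

Under these assumptions, 4 bytes of payload balloon to roughly 48 bytes of heap, a 12x overhead; multiplied across millions of row objects, this is exactly the waste (and the GC pressure) that motivated Tungsten's move to explicitly managed binary memory.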
Read at Medium