
"Discover Weekly, Release Radar, Daily Mix and Wrapped are all algorithmic and personalized product features of Spotify. They are powered by analyzing over 800 billion events per day, across 18,000 data pipelines (4,000 of which run daily) and with over 2,000 engineers involved. Data is not just a byproduct of the service; it is part of the product itself."
"Managing data at this scale introduces significant challenges: volumes, orchestration, observability, data quality, privacy and costs. In the early 2010s, Spotify had a fragmented data processing stack that had evolved over time. It started with Luigi, the open-source framework developed by Spotify to orchestrate Python MapReduce jobs on Hadoop. While simple and productive, it was untyped and too slow for complex machine learning workloads."
"As data volumes grew, Scalding was introduced. It is a Scala library from Twitter built on Hadoop MapReduce. It offered a more concise, type-safe, and performant approach, but remained limited to batch processing with no support for streaming. For real-time use cases such as ad targeting and new user onboarding, Storm was adopted. However, its low-level APIs made it impractical for building complex applications."
"Spark was introduced for iterative machine learning workloads thanks to its in-memory processing capabilities, but it proved difficult to tune at Spotify's scale. For ad hoc queries by analysts and product managers, Hive was used. It translates queries into MapReduce jobs but with significant I/O overhead. The result was a collection of separate systems to maintain, while physical infrastructure was reaching its limits. The turning point came with the migration to Google Cloud."
Discover Weekly, Release Radar, Daily Mix, and Wrapped rely on algorithmic personalization powered by analyzing over 800 billion events per day across 18,000 data pipelines, with thousands of engineers supporting the work. Data is treated as a core product component rather than a byproduct. Managing this scale creates challenges in volume, orchestration, observability, data quality, privacy, and cost. In the early 2010s, processing relied on Luigi to orchestrate Python MapReduce jobs on Hadoop, then on Scalding for type-safe, performant batch processing and on Storm for real-time needs. Spark supported iterative machine learning but was hard to tune at scale, while Hive enabled ad hoc queries at the cost of high I/O overhead. Infrastructure limits and system fragmentation drove a migration to Google Cloud and adoption of the Dataflow Model for unified batch and streaming.
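The Dataflow Model is available in open source as Apache Beam, whose pipelines can run on Google Cloud Dataflow among other runners. A minimal sketch of the unified style follows (the file names, the tab-separated "track_id, epoch_seconds" layout, and the 60-second window are assumptions for illustration):

```python
import apache_beam as beam
from apache_beam import window

# The same windowed-count pipeline shape serves batch and streaming:
# only the source changes (a bounded file here; an unbounded Pub/Sub
# topic would make it a streaming job without touching the rest).
with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("events.tsv")
        | "Parse" >> beam.Map(lambda line: line.split("\t"))
        # Attach event-time timestamps from the (assumed) epoch-seconds
        # column so windows reflect when events happened, not when they
        # were read.
        | "Timestamp" >> beam.Map(
            lambda cols: window.TimestampedValue(cols[0], float(cols[1])))
        | "Window" >> beam.WindowInto(window.FixedWindows(60))
        | "Count" >> beam.combiners.Count.PerElement()
        | "Format" >> beam.Map(lambda kv: f"{kv[0]}\t{kv[1]}")
        | "Write" >> beam.io.WriteToText("counts")
    )
```

The model makes explicit what is computed, where in event time, when results are emitted, and how refinements relate, which is what "unified batch and streaming" refers to here.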
#data-engineering #streaming-and-batch-processing #machine-learning-infrastructure #distributed-systems #cloud-migration
Read at Medium