Magika 1.0: Smarter, Faster File Detection with Rust and AI

"Google highlights that many of the newly added file types are specialized text-based file types that were previously difficult to detect. These include Dockerfiles, TOML, HCL, Bazel files and many more. Magika 1.0 can also distinguish between source code files written in Swift, Kotlin, TypeScript, Dart, Web Assembly, and Zig (zig). Additionally, it supports file types commonly used in data science, such as Jupyter Notebooks, Numpy arrays, PyTorch models, ONNX files, and others."

"The sheer volume of data represented a challenge in itself: Our training dataset grew to over 3TB when uncompressed, which required an efficient processing pipeline. To handle this, we leveraged our recently released SedPack dataset library. This tool allows us to stream and decompress this large dataset directly to memory during training, bypassing potential I/O bottlenecks and making the process feasible."

Magika 1.0 is a substantial rewrite of Google's open-source file type detection system, rebuilt in Rust to improve speed and security. The release expands supported formats to more than 200, up from 100, adding many specialized text-based and data-science file types. The tool now distinguishes closely related formats such as TypeScript versus JavaScript and TSV versus CSV. Engineers trained a specialized AI model on a dataset that grew to over 3 TB uncompressed, using SedPack to stream and decompress the data directly into memory during training. Underrepresented formats were augmented with synthetic examples generated with Gemini to improve coverage.
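The idea of streaming and decompressing training data directly into memory can be illustrated with a minimal sketch. This is not SedPack's API: it uses Python's standard `gzip` module as a stand-in, and the names `make_shard` and `stream_records`, the shard layout (newline-delimited records), and the sample data are all hypothetical, chosen only for illustration.

```python
import gzip
import io
from typing import Iterable, Iterator


def make_shard(records: Iterable[bytes]) -> bytes:
    """Compress newline-delimited records into one in-memory gzip shard.

    Hypothetical shard format; records must not contain newlines.
    """
    return gzip.compress(b"\n".join(records) + b"\n")


def stream_records(shards: Iterable[bytes]) -> Iterator[bytes]:
    """Yield records one at a time, decompressing each shard lazily.

    Decompression happens in memory, so no uncompressed copy of the
    dataset is ever written to disk.
    """
    for shard in shards:
        with gzip.GzipFile(fileobj=io.BytesIO(shard)) as fh:
            for line in fh:
                yield line.rstrip(b"\n")


# Two small compressed shards standing in for a much larger dataset.
shards = [make_shard([b"a", b"b"]), make_shard([b"c"])]
records = list(stream_records(shards))
```

Because `stream_records` is a generator, a training loop built on it would only ever hold the shard currently being read, which is roughly the property the quoted passage credits with avoiding I/O bottlenecks.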

#file-type-detection #rust #ai-models #dataset-engineering

Read at InfoQ


