"Bots crawl a virtually infinite number of endpoints on our Git repositories (as opposed to downloading an archive or a shallow clone), including Tor Browser, our fork of Firefox and a massive repository. At first, we tried various methods: robots.txt, blocking user agents, and finally blocking entire networks, for which I wrote asncounter. That worked for a while. But now blocking entire networks doesn't work: the bots come back some other way, typically through shady proxy networks, which is kind of ironic considering we're essentially running the largest proxy network in the world."
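The network-blocking approach rests on first measuring which networks dominate the traffic. As a rough illustration of that idea (this is not the real asncounter, and the IPs and prefix length are invented for the example), one can tally access-log client IPs by network prefix and surface the busiest ones:

```python
# Illustrative sketch (not the actual asncounter tool): tally request
# counts per /24 network prefix from a list of client IPs, to spot the
# networks responsible for the bulk of crawler traffic.
from collections import Counter
from ipaddress import ip_interface, ip_network

def top_networks(client_ips, prefix=24, n=3):
    """Group client IPs into /prefix networks and return the busiest."""
    counts = Counter(
        ip_interface(f"{ip}/{prefix}").network for ip in client_ips
    )
    return counts.most_common(n)

# Hypothetical access-log sample: three hits from one /24, one from another.
ips = ["203.0.113.5", "203.0.113.9", "203.0.113.77", "198.51.100.4"]
for net, hits in top_networks(ips):
    print(net, hits)
```

A real tool would map IPs to autonomous systems rather than fixed-size prefixes, which is precisely what makes the countermeasure fragile: once an AS is blocked, traffic simply reappears from another one.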
"Out of desperation, we've forced users to accept cookies when visiting our site. We haven't deployed Anubis yet: we worry that bots have broken it anyway and that it doesn't really defend against a well-funded attacker, something Pretix warned about as early as 2025. (We have a whole discussion of those tools here.) But even the cookie wall, predictably, has failed. I suspect what we call bots are now really agents: they run full web browsers, JavaScript included, so a feeble cookie is no match for the massive bot armies."
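The cookie wall described above amounts to a trivially low bar: serve a challenge page to any request without the cookie, and wave through any request that carries it. A minimal sketch of that check (the cookie name and helper are invented for illustration, not taken from the site's actual code):

```python
# Minimal sketch (assumed, not the site's real implementation) of a
# cookie gate: requests lacking the gate cookie get challenged; requests
# bearing it pass through. Any client that runs a full browser, or any
# HTTP client that simply stores cookies, clears this bar trivially.
from http import cookies

def is_gated(cookie_header: str, gate_name: str = "visited") -> bool:
    """Return True if the request must be challenged (cookie absent)."""
    jar = cookies.SimpleCookie()
    jar.load(cookie_header or "")
    return gate_name not in jar

print(is_gated(""))                       # no cookie: challenge
print(is_gated("visited=1; theme=dark"))  # cookie present: pass
```

An agent driving a headless browser executes the JavaScript that sets the cookie and retries automatically, which is why this defense fails against the traffic described here.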
"Side note on the LLM "order of battle": we often underestimate the size of that army. The cloud was huge even before LLMs, serving about two thirds of the web. Even larger swaths of clients, like government and corporate databases, have moved to the cloud, onto shared but private infrastructure with massive spare capacity readily available to anyone who pays."
Bot traffic has targeted GitLab repositories by crawling many endpoints rather than using archives or shallow clones. Initial defenses such as robots.txt, blocking user agents, and blocking entire networks provided only temporary relief. Attackers returned through proxy networks, including routes that mirror the organization’s own proxy-like infrastructure. Users were forced to use cookies, but browser-based agents running full JavaScript defeated cookie-based protection. The scale of the bot “army” is underestimated because cloud infrastructure already serves most of the web and provides large shared capacity to anyone who can pay. This environment enables well-resourced automated clients to operate at high volume and adapt quickly.