My favourite thing about Anubis is that (in default configuration) it completely bypasses the actual challenge altogether if you set the User-Agent header to curl.
E.g. if you open this in a browser, you'll get the challenge: https://code.ffmpeg.org/FFmpeg/FFmpeg/commit/13ce36fef98a3f4...
But if you fetch the same URL with curl, you get the page content straight away.
I'm pretty sure this gets abused by AI scrapers a lot. If you're running Anubis, take a moment to configure it properly, or better, put together something that's less annoying for your visitors, like the OP did.
rezonant 4 hours ago [-]
By design, it only challenges user agents with "Mozilla" in their name, because user agents that say otherwise are already identifying themselves. If Anubis makes the bots change their user agents, it has done its job, as that traffic can now be addressed directly.
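For illustration, the gating described here amounts to something like the following. This is a rough sketch in Go (the language Anubis itself is written in), not its actual policy engine, and the cookie check is a made-up stand-in:

package main

import (
	"net/http"
	"strings"
)

// hasChallengeCookie is a made-up stand-in for "this client already passed
// the challenge" (Anubis uses a signed cookie; the details differ).
func hasChallengeCookie(r *http.Request) bool {
	c, err := r.Cookie("challenge-pass")
	return err == nil && c.Value != ""
}

// gate only challenges clients that claim to be browsers ("Mozilla" in the
// User-Agent). Anything else already identifies itself and is passed through
// to be handled by ordinary means (robots.txt, rate limits, blocks).
func gate(site, challenge http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if strings.Contains(r.UserAgent(), "Mozilla") && !hasChallengeCookie(r) {
			challenge.ServeHTTP(w, r)
			return
		}
		site.ServeHTTP(w, r)
	})
}

func main() {
	site := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("real content\n"))
	})
	challenge := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "solve the challenge first", http.StatusUnauthorized)
	})
	http.ListenAndServe(":8080", gate(site, challenge))
}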
samlinnfer 12 minutes ago [-]
This has basically been Wikipedia's bot policy for a long long time.
If you run a bot, you should identify it via the User-Agent header: https://foundation.wikimedia.org/wiki/Policy:Wikimedia_Found...
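For what that looks like in practice, here is a minimal sketch of a crawler identifying itself; the bot name and contact URL are invented for the example:

package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// A well-behaved crawler announces who it is and how to reach its
	// operator. The bot name and contact URL below are invented.
	req, err := http.NewRequest("GET", "https://example.org/", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("User-Agent", "ExampleDocsBot/1.0 (+https://example.org/bot; bots@example.org)")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, len(body), "bytes")
}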
Anubis should be something that doesn't inconvenience all the real humans that visit your site.
I work with ffmpeg so I have to access their bugtracker and mailing list site sometimes. Every few days, I'm hit with the Anubis block. And 1/3 - 1/5 of the time, it fails completely. The other times, it delays me by a few seconds. Over time, this has turned me sour on the Anubis project, which was initially something I supported.
opan 2 hours ago [-]
I only had issues with it on GNOME's bug tracker and could work around it with a UA change, meanwhile Cloudflare challenges are often unpassable in qutebrowser no matter what I do.
throwaway290 3 hours ago [-]
I understand why ffmpeg does it. No one is expected to pay for it. Until this age of LLMs, when bot traffic became dominant on the web, the ffmpeg site was probably an acceptable expense. But they probably don't want to be an unpaid data provider for big LLM operators who get to extract a few bucks from their users.
It's like airplane checkin. Are we inconvenienced? Yes. Who is there to blame? Probably not the airline or the company who provides the services. Probably people who want to fly without a ticket or bring in explosives.
As long as Anubis project and people on it don't try to play both sides and don't make the LLM situation worse (mafia racket style), I think if it works it works.
TJSomething 31 minutes ago [-]
I know it's beside the point, but I think a chunk of the reason for many of the security measures in airports is that creating the appearance of security increases people's willingness to fly.
mariusor 3 hours ago [-]
I don't understand the hate when people look at a countermeasure against unethical shit and complain about it instead of being upset at the unethical shit. And it's funny when it happens elsewhere too, like cookie banners being blamed on GDPR rather than on the scumminess of some web operators.
elashri 51 minutes ago [-]
I don't understand why some people don't realize that you can be upset about a status quo in which both sides of the equation suck. You can hate a thing and also the countermeasure that someone deploys against it. These are not mutually exclusive.
mariusor 39 minutes ago [-]
I didn't see the parent being upset about both sides on this one. I don't see it implied anywhere that they even considered it.
elashri 30 minutes ago [-]
>which was initially something I supported.
That quote is a strong indication that they see it this way.
bakql 1 hours ago [-]
[flagged]
trenchpilgrim 1 hours ago [-]
Unfortunately in countries like Brazil and India, where a majority of humans collectively live, better computers are taxed at extremely high rates and are practically unaffordable.
bakql 1 hours ago [-]
[flagged]
uqers 5 hours ago [-]
> Unfortunately, the price LLM companies would have to pay to scrape every single Anubis deployment out there is approximately $0.00.
The math on the site linked here as a source for this claim is incorrect. The author of that site assumes that scrapers will keep track of the access tokens for a week, but most internet-wide scrapers don't do so. The whole purpose of Anubis is to be expensive for bots that repeatedly request the same site multiple times a second.
drum55 5 hours ago [-]
The "cost" of executing the JavaScript proof of work is fairly irrelevant; the whole concept just doesn't hold up to a pessimistic inspection. Anubis requires users to do a trivial amount of SHA-256 hashing in slow JavaScript, while a scraper can do the same work much faster in native code; simply game over. It's the same reason we don't use hashcash for email: the amount of proof of work a user will tolerate is much lower than the amount a professional can apply. If this tool provides any benefit, it's because it is obscure and non-standard.
When reviewing it I noticed that the author carried the common misunderstanding that "difficulty" in proof of work is simply the number of leading zero bytes in a hash, which limits the granularity to powers of two. I realize that some of this is the cost of working in JavaScript, but the hottest code path seems to be written extremely inefficiently.
for (;;) {
  // Hash the challenge data concatenated with the current nonce.
  const hashBuffer = await calculateSHA256(data + nonce);
  const hashArray = new Uint8Array(hashBuffer);
  // Check whether the first `requiredZeroBytes` bytes are all zero.
  let isValid = true;
  for (let i = 0; i < requiredZeroBytes; i++) {
    if (hashArray[i] !== 0) {
      isValid = false;
      break;
    }
  }
  // (Rest of the loop, elided here: return the nonce if isValid,
  // otherwise increment it and hash again.)
}
It wouldn't be an exaggeration to say that a native implementation of this with even a hair of optimization could make the "proof of work" less time-intensive than the SSL handshake.
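To make the native-code gap concrete, here is a minimal solver sketch in Go (the language Anubis itself is written in). It assumes the puzzle is exactly what the quoted JS checks, i.e. find a nonce such that SHA-256(data + nonce) starts with requiredZeroBytes zero bytes; it is not Anubis's actual code:

package main

import (
	"crypto/sha256"
	"fmt"
	"strconv"
)

// solve finds a nonce such that sha256(data + nonce) begins with
// zeroBytes zero bytes -- the same condition the quoted JS loop checks.
func solve(data string, zeroBytes int) (int, [32]byte) {
	for nonce := 0; ; nonce++ {
		hash := sha256.Sum256([]byte(data + strconv.Itoa(nonce)))
		ok := true
		for i := 0; i < zeroBytes; i++ {
			if hash[i] != 0 {
				ok = false
				break
			}
		}
		if ok {
			return nonce, hash
		}
	}
}

func main() {
	// Two leading zero bytes means roughly 65,536 expected attempts;
	// native SHA-256 gets through that in tens of milliseconds at most
	// on a single ordinary core.
	nonce, hash := solve("example-challenge", 2)
	fmt.Printf("nonce=%d hash=%x\n", nonce, hash)
}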
jsnell 3 hours ago [-]
That is not a productive way of thinking about it, because it will lead you to the conclusion that all you need is a smarter proof of work algorithm. One that's GPU-resistant, ASIC-resistant, and native code resistant. That's not the case.
Proof of work can't function as a counter-abuse challenge even if you assume that the attackers have no advantage over the legitimate users (e.g. both are running exactly the same JS implementation of the challenge). The economics just can't work. The core problem is that the attackers pay in CPU time, which is fungible and incredibly cheap, while the real users pay in user-observable latency which is hellishly expensive.
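To put rough, purely illustrative numbers on that asymmetry (all assumed, none measured): say the challenge costs a real visitor about 5 seconds of waiting and a tuned native solver about 20 ms of CPU. At roughly $0.05 per vCPU-hour, 20 ms is about $0.0000003, so a million challenge-protected pages cost a scraper on the order of 30 cents of compute, while a million legitimate visits burn around five million seconds of human attention. The attacker's side of the ledger rounds to zero; the defender's does not.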
aniviacat 3 hours ago [-]
They do use SubtleCrypto digest [0] in secure contexts, which does the hashing natively.
Specifically for Firefox [1] they switch to the JavaScript fallback because that's actually faster [2] (because of overhead probably):
> One of the biggest sources of lag in Firefox has been eliminated: the use of WebCrypto. Now whenever Anubis detects the client is using Firefox (or Pale Moon), it will swap over to a pure-JS implementation of SHA-256 for speed.
[0] https://developer.mozilla.org/en-US/docs/Web/API/SubtleCrypt...
[1] https://github.com/TecharoHQ/anubis/blob/main/web/js/algorit...
[2] https://github.com/TecharoHQ/anubis/releases/tag/v1.22.0
Right, but that's the point. It's not that the idea is bad. It's that PoW is the wrong fit for it. Internet-wide scrapers don't keep state? Ok, then force clients to do something that requires keeping state. You don't need to grind SHA2 puzzles to do that; you don't need to grind anything at all.
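As one minimal sketch of what "require state, not work" could look like (this is not something Anubis does, just an illustration of the idea): hand out an HMAC-signed cookie on first contact and refuse clients that won't hold on to it. Cookie name, lifetime, and status code below are arbitrary choices.

package main

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"net/http"
	"strconv"
	"strings"
	"time"
)

var key = make([]byte, 32) // per-deployment secret

func init() { rand.Read(key) }

// sign issues "expiry.hexmac": nothing to grind, the client just has to
// keep it and send it back.
func sign(expiry int64) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(strconv.FormatInt(expiry, 10)))
	return strconv.FormatInt(expiry, 10) + "." + hex.EncodeToString(mac.Sum(nil))
}

// valid checks the expiry and recomputes the MAC, comparing in constant time.
func valid(token string) bool {
	dot := strings.IndexByte(token, '.')
	if dot < 0 {
		return false
	}
	expiry, err := strconv.ParseInt(token[:dot], 10, 64)
	if err != nil || time.Now().Unix() > expiry {
		return false
	}
	return hmac.Equal([]byte(sign(expiry)), []byte(token))
}

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		c, err := r.Cookie("pass")
		if err != nil || !valid(c.Value) {
			// First contact: hand out a week-long token and make the client
			// come back with it (in practice behind a tiny JS or meta refresh).
			http.SetCookie(w, &http.Cookie{
				Name:  "pass",
				Value: sign(time.Now().Add(7 * 24 * time.Hour).Unix()),
				Path:  "/",
			})
			http.Error(w, "retry with the cookie you were just given", http.StatusUnauthorized)
			return
		}
		w.Write([]byte("real content\n"))
	})
	http.ListenAndServe(":8080", nil)
}

A scraper that keeps cookies gets through, of course, just as it would get through Anubis by caching the pass cookie; the point is only that the state requirement, not the SHA-256 grinding, is what does the work.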
echelon 5 hours ago [-]
This whole thing is pointless.
OpenAI Atlas defeats all of this by being the user's web browser. It gets between you and the user you're trying to serve, and it slurps up everything the user browses and sends it back for training.
The firewall is now moot.
The bigger AI company, Google, has already been doing this for decades. They were the middlemen between your reader and you, and that position is unassailable. Without them, you don't have readers.
At this point, the only people you're keeping out with LLM firewalls are the smaller players, which further entrenches the leaders.
OpenAI and Google want you to block everybody else.
happyopossum 4 hours ago [-]
> Google, has already been doing this for decades
Do you have any proof, or even circumstantial evidence to point to this being the case?
If Chrome actually scraped every site you ever visited and sent it off to Google, it'd be trivially simple to find some indication of that in network traffic, or heck, even in the Chromium code.
echelon 4 hours ago [-]
Sorry, I mean they sit in the middle of the customer relationship.
Who would dare block Google Search from indexing their site?
The relationship is adversarial, but necessary.
Dylan16807 4 hours ago [-]
Is it confirmed that site loads go into the training database?
But for anyone whose main concern is their server staying up, Atlas isn't a problem. It's not doing a million extra loads.
heavyset_go 4 hours ago [-]
> Is it confirmed that site loads go into the training database?
Would you trust OpenAI if they told you it doesn't?
If you would, would you also trust Meta to tell you if its multibillion dollar investment was trained on terabytes of pirated media the company downloaded over BitTorrent?
viraptor 52 minutes ago [-]
We don't have to take it on trust either way. If there's such a claim, surely someone can point to at least a pcap file with an unexplained connection, or to some decompiled code. Otherwise it's just a conspiracy theory.
valicord 5 hours ago [-]
The point is that the scrapers can easily bypass this if they cared to do so
uqers 5 hours ago [-]
How so?
tecoholic 5 hours ago [-]
Hmm… by setting the verified=1 cookie on every request to the website?
Am I missing something here? All this does is set an unencrypted cookie and reload the page right?
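For what it's worth, if the gate really is just a static verified=1 cookie as described here, bypassing it from a scraper is a one-liner on top of any HTTP client. A sketch, with the cookie name and URL assumed from this comment rather than checked against the OP's code:

package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Assumed from the comment above: the gate only checks for a
	// "verified=1" cookie. If so, a scraper just sends it on every
	// request -- no JavaScript, no reload dance.
	req, err := http.NewRequest("GET", "https://example.org/protected-page", nil)
	if err != nil {
		panic(err)
	}
	req.AddCookie(&http.Cookie{Name: "verified", Value: "1"})

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, len(body), "bytes")
}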
notpushkin 4 hours ago [-]
They could, but if this is slightly different from site to site, they’ll have to either do this for every site (annoying but possible if your site is important enough), or go ahead and run JS (which... I thought they do already, with plenty of sites still being SPAs?)
rezonant 4 hours ago [-]
I would be highly surprised if most of these bots aren't already running JavaScript; I'm confused by this unquestioned notion that they don't.
agnishom 44 minutes ago [-]
Exactly. I don't understand what computation you can afford to do in 10 seconds on a small number of cores that bots running in large data centers cannot.
juliangmp 3 minutes ago [-]
The point of Anubis isn't to make scraping impossible, but to make it more expensive.
gbuk2013 1 hours ago [-]
The Caddy config in the parent article uses status code 418. This is cute, but wouldn’t this break search engine indexing? Why not use 307 code?
paweladamczuk 3 hours ago [-]
Internet in its current form, where I can theoretically ping any web server on earth from my bedroom, doesn't seem sustainable. I think it will have to end at some point.
I can't fully articulate it but I feel like there is some game theory aspect of the current design that's just not compatible with the reality.
weinzierl 2 hours ago [-]
"Unfortunately, Cloudflare
is pretty much the only reliable way to protect against bots."
With footnote:
"I don’t know if they have any good competition, but “Cloudflare” here refers to all similar bot protection services."
That's the crux. Cloudflare is the default; no one seems to bother taking the risk with a competitor, for some reason. Competitors seem to exist, but when asked, people can't even name them.
(For what it's worth, I've been using AWS CloudFront, but I had to think a moment to remember its name.)
praptak 4 hours ago [-]
There are reasons to choose the slightly annoying solution on purpose though. I'm thinking of a political statement along the lines "We have a problem with asshole AI companies and here's how they make everyone's life slightly worse."
geokon 4 hours ago [-]
Big picture, why does everyone scrape the web?
Why doesn't one company do it and then resell the data? Is it a legal/liability issue? If you scrape, it's a legal grey area, but if you sell what you scrape, it's clearly copyright infringement?
utopiah 3 hours ago [-]
My bet is that they believe https://commoncrawl.org isn't good enough and, precisely as you are suggesting, the "rest" is where their competitive advantage might stem from.
Jackson__ 2 hours ago [-]
Thinking that there is anything worth scraping past the llm-apocalypse is pure hubris imo. It is slop city out there, and unless you have an impossibly perfect classifier to detect it, 99.9% of all the great new "content" you scrape will be AI written.
E: In fact this whole idea is so stupid that I am forced to consider if it is just a DDoS in the original sense. Scrape everything so hard it goes down, just so that your competitors can't.
jchw 4 hours ago [-]
I was briefly messing around with Pangolin, which is supposed to be a self-hosted Cloudflare Tunnels sort of thing. Pretty cool.
One thing I noticed though was that the Digital Ocean Marketplace image asks you if you want to install something called Crowdsec, which is described as a "multiplayer firewall", and while it is a paid service, it appears there is a community offering that is well-liked enough. I actually was really wondering what downsides it has (except for the obvious, which is that you are definitely trading some user privacy in service of security) but at least in principle the idea seems kind of a nice middleground between Cloudflare and nothing if it works and the business model holds up.
bootsmann 4 hours ago [-]
Not sure Crowdsec is fit for this purpose. It's more a fail2ban replacement than a DDoS challenge.
jchw 49 minutes ago [-]
One of the main ways that Cloudflare is able to avoid presenting CAPTCHAs to a lot of people while still filtering tons of non-human traffic is exactly that, though: just having a boatload of data across the Internet.
tptacek 5 hours ago [-]
This came up before (and this post links to the Tavis Ormandy post that kicked up the last firestorm about Anubis) and without myself shading the intent or the execution on Anubis, just from a CS perspective, I want to say again that the PoW thing Anubis uses doesn't make sense.
Work functions make sense in password hashes because they exploit an asymmetry: attackers will guess millions of invalid passwords for every validated guess, so the attacker bears most (really almost all) of the cost.
Work functions make sense in antispam systems for the same reason: spam "attacks" rely on the cost of an attempt being so low that it's efficient to target millions of victims in the expectation of just one hit.
Work functions make sense in Bitcoin because they function as a synchronization mechanism. There's nothing actually valorous about solving a SHA2 puzzle, but the puzzles give the whole protocol a clock.
Work functions don't make sense as a token tax; there's actually the opposite of the antispam asymmetry there. Every bot request to a web page yields tokens to the AI company. Legitimate users, who far outnumber the bots, are actually paying more of a cost.
None of this is to say that a serious anti-scraping firewall can't be built! I'm fond of pointing to how Youtube addressed this very similar problem, with a content protection system built in Javascript that was deliberately expensive to reverse engineer and which could surreptitiously probe the precise browser configuration a request to create a new Youtube account was using.
The next thing Anubis builds should be that, and when they do that, they should chuck the proof of work thing.
Gander5739 14 seconds ago [-]
But YouTube can still be scraped with yt-dlp, so apparently it wasn't enough.
mariusor 2 hours ago [-]
With all due respect, almost all I see in this thread is people looking down their noses at a proven solution and giving advice instead of doing the work. I can see how you are a _very important person_ with bills to pay and money to make, but at least have the humility to understand that the solution we got is better than the solution that could be better, if only there were someone else to think of it and build it.
gucci-on-fleek 5 hours ago [-]
> Work functions don't make sense as a token tax; there's actually the opposite of the antispam asymmetry there. Every bot request to a web page yields tokens to the AI company. Legitimate users, who far outnumber the bots, are actually paying more of a cost.
Agreed, residential proxies are far more expensive than compute, yet the bots seem to have no problem obtaining millions of residential IPs. So I'm not really sure why Anubis works—my best guess is that the bots have some sort of time limit for each page, and they haven't bothered to increase it for pages that use Anubis.
> with a content protection system built in Javascript that was deliberately expensive to reverse engineer and which could surreptitiously probe the precise browser configuration a request to create a new Youtube account was using.
> The next thing Anubis builds should be that, and when they do that, they should chuck the proof of work thing.
They did [0], but it doesn't work [1]. Of course, the Anubis implementation is much simpler than YouTube's, but (1) Anubis doesn't have dozens of employees who can test hundreds of browser/OS/version combinations to make sure that it doesn't inadvertently block human users, and (2) it's much trickier to design an open-source program that resists reverse-engineering than a closed-source program, and I wouldn't want to use Anubis if it went closed-source.
[0]: https://anubis.techaro.lol/docs/admin/configuration/challeng...
[1]: https://github.com/TecharoHQ/anubis/issues/1121
Google's content-protection system didn't simply make sure you could run client-side Javascript. It implemented an obfuscating virtual machine that, if I'm remembering right (I may be getting some of the details blurred with Blu-ray's BD+ scheme), built up a hash input of runtime artifacts. As I understand it, it was one person's work, not the work of a big team. The "source code" we're talking about here is clientside Javascript.
Either way: what Anubis does now --- just from a CS perspective, that's all --- doesn't make sense.
indrora 5 hours ago [-]
The problem is that increasingly, they are running JS.
In the ongoing arms race, we're likely to see simple things like this sort of check result in a handful of detection systems that look for "set a cookie" or at least "open the page in headless chrome and measure the cookies."
moebrowne 2 hours ago [-]
> increasingly, they are running JS.
Does anyone have any proof of this?
utopiah 3 hours ago [-]
> increasingly, they are running JS.
I mean, they have access to a mind-blowing amount of computing resources, and they have this fundamental belief (because it's convenient for their situation) that scale is everything, so why not use a fraction of those resources to improve the quality of the data by running JS too? Heck, if they have to run a full browser in a container, not even headless, they will.
typpilol 1 hours ago [-]
Chrome even released a DevTools MCP that gives any LLM full tool access to do anything in the browser.
Navigation, screenshots, etc. It has like 30 tools in it alone.
Now we can just run real browsers with LLMs attached. Idk how you even think about defeating that.
defraudbah 3 hours ago [-]
does your 12-liner show a cute anime girl? yeah, that's what I thought
viaoktavia 3 hours ago [-]
[dead]
gucci-on-fleek 5 hours ago [-]
> But it still works, right? People use Anubis because it actually stops LLM bots from scraping their site, so it must work, right?
> Yeah, but only because the LLM bots simply don’t run JavaScript.
I don't think that this is the case, because when Anubis itself switched from a proof-of-work to a different JavaScript-based challenge, my server got overloaded, but switching back to the PoW solution fixed it [0].
I also semi-hate Anubis since it required me to add JS to a website that used none before, but (1) it's the only thing that stopped the bot problem for me, (2) it's really easy to deploy, and (3) very few human visitors are incorrectly blocked by it (unlike Captchas or IP/ASN bans that have really high false-positive rates).
[0]: https://github.com/TecharoHQ/anubis/issues/1121
It seems that people do NOT understand it's already game over. Lost. When stuff was small and we had abusive actors, nobody cared: oh, just a few bad actors, nothing to worry about, they'll get bored and go away. No, they won't; they will grow and grow, and now even most of the good guys have turned bad, because there is no punishment for it. So, as I said, game over.
It's time to start building our own walled gardens, overlay VPN networks for humans. Put services there; if someone misbehaves? Ban their IP. Came back? Ban again. Came back? WTF? Ban the VPN provider. Just clean up the mess. Different networks can peer and exchange. Look, the Internet is just a network of networks, it's not that hard.
utopiah 3 hours ago [-]
"Yes, it works, and does so as effectively as Anubis, while not bothering your visitors with a 10-second page load time."
Cool... but I guess now we need a benchmark for such solutions. I don't know the author, and I roughly know the problem (as I self-host and most of my traffic now comes from AI scraper bots, not the usual indexing bots or, mind you, humans), but when there are numerous solutions to a multi-dimensional problem, I need a common way to compare them.
Yet another solution is always welcome, but without a way to compare them efficiently, it doesn't help me pick the right one for me.
hubraumhugo 3 hours ago [-]
What's the endgame of this increasing arms race? A gated web where you need to log in everywhere? Even more captchas and Cloudflare becoming the gateway to the internet? There must be a better way.
We're somehow still stuck with CAPTCHAs (and other challenges), a 25-year-old concept that wastes millions of human hours and billions in infra costs [0].
[0] https://arxiv.org/abs/2311.10911
How else would I inter my dead and make sure they get to the afterlife?
GauntletWizard 4 hours ago [-]
Anubis's design is copied from a great botnet protection mechanism: you serve the JavaScript cheaply from memory, and then the client is forced to do expensive compute in order to use your expensive compute. This works great at keeping attackers from wasting your time; it turns a 1:1000 amplification in compute costs into a 1000:1.
It is a shitty and obviously bad solution for preventing scraping traffic. The goal of scraping traffic isn't to overwhelm your site, it's to read it once. If you make it prohibitively expensive to read your site even once, nobody comes to it. If you make it only mildly expensive, nobody scraping cares.
Anubis is specifically DDOS protection, not generally anti-bot, aside from defeating basic bots that don't emulate a full browser. It's been cargo-culted in front of a bunch of websites because of the latter, but it was obviously not going to work for long.
viraptor 4 hours ago [-]
> The goal of scraping traffic isn't to overwhelm your site, it's to read it once.
If the authors of the scrapers actually cared about it, we wouldn't have this problem in the first place. Today the more appropriate description is: the goal is to scrape as much data as possible as quickly as possible, preferably before your site falls over. They really don't care about side effects beyond that. Search engines have an incentive to leave your site running. AI companies don't. (Maybe apart from Perplexity.)
reppap 2 hours ago [-]
First of all, Anubis isn't meant to protect simple websites that get read once. It's meant for things like a GitLab instance where AI bots are indexing every single commit of every single file, resulting in thousands if not millions of reads. And reading an Anubis page once isn't expensive either. So I don't really understand what point you are trying to make, as the premise seems completely wrong.
purple_turtle 2 hours ago [-]
Some people deployed Anubis not to stop scraping, but to stop scraping the same page multiple times per second.
GaryBluto 4 hours ago [-]
I find Anubis really useful to me because it effectively gives me a handy list of what projects are managed by incompetents and should be avoided. Mascot aside, as this article states, it's generally useless. Old timey snake oil had more functional applications.
herpessimplex10 4 hours ago [-]
Nothing makes me think a site is operated by unserious man-children more than seeing that anime cat-girl flash up on my screen for a few seconds.
viraptor 4 hours ago [-]
Or you know, they may be serious and grown up enough to not be bothered by an image of an anime cat girl. There's really nothing there to be offended by.
pkal 47 minutes ago [-]
I am also not a fan, but since a recent discussion on HN I have been thinking about what I don't like about it.
The conclusion I have come to is more general: I just personally don't like nerd culture. Having an anime girl (but the same would be the case for a Star Trek, My Little Pony/furry, etc.-themed site) signifies a kind of personality that I don't feel comfortable about, mainly due to the oblivious social awkwardness, but also due to "personal" habits of some people you meet in nerdy spaces. I guess there is something about not distinguishing between a public presentation and personal interests that this is reminiscent of. For instance: a guy can enjoy model trains, sure, but if he is your colleague at work and always just goes on about model trains (without considering whether this interests you or not!), then the fact that this subsumes his personality becomes a bore, but also indicative of a "poverty of personality" (sorry if that sounds harsh; what I am trying to get at is that a person is "richer" if their interests are not all determined by consumption habits). This is not to generalize that this is the case for everyone in these spaces; I am friends with nerdy people on an individual basis, but I am painfully aware that I don't fit in perfectly like the last piece of a jigsaw puzzle -- and increasingly have less of a desire to do so.
So for me at least this is not offence, but in addition to the above also some kind of reminder that there is a fundamental rift in how decency and intersocial relations are imagined between the people who share my interests and me, which does bother me. Having that cat-girl appear every time I open some site reminds me of this fact.
Does any of this make sense? The way you and others phrase objections to the objections makes it seem like anyone who dislikes this is some obsessive or bigoted weirdo, which I hope I don't make the impression of. (Hit me up, even privately off-HN if anyone wants to chat about this, especially if you disagree with me, this is a topic that I find interesting and want to better understand!)
GaryBluto 3 hours ago [-]
> grown up enough to not be bothered by an image of an anime cat girl
This is some real four-dimensional chess.
"You're the childish one for not wanting Japanese cartoons on software projects!"
tavavex 3 hours ago [-]
It's not just "not wanting" something, the original comment wasn't nearly that mild. It's being enraged by it to the extent of making petty, low, personal attacks on someone who steps just barely out of line of their preferred behaviors.
This whole comment chain solidifies my opinion that disgust is one of the driving human emotions. People feel initial, momentary disgust and only then explain it using the most solid justification that comes to mind, but the core disgust is unshakable and precedes all explanations. No one here has managed to procure any argument for why seeing a basic sketch in a certain style is objectively bad or harmful to someone, only that it's "weird" in some vague way. Basically, it goes against the primal instinct of how the person thinks the world "ought to work", therefore it's bad, end of story.
To me it seems obvious. The anime art style is in, especially in Western countries, especially^2 among younger people, and especially^3 among techy people. Ergo, you may see a mascot in that style once in a while in hobbyist projects. Doesn't seem like anything particularly objectionable to me.
krapp 27 minutes ago [-]
It isn't a driving human emotion. The world is full of serious businesses that use "cute" icons or employ anime-styled elements, and most people don't care. It's just a subset of tech and CS people who feel compelled to register their disdain at every opportunity.
And yet if you bring up that "Gimp" is an unserious name, or anything about RMS that's far more problematic than a cute cartoon, that same subset will defend it to the death.
herpessimplex10 4 hours ago [-]
It's not offensive; it's weird and you know it is.
Like a webserver returning an animated cock and balls on a 403 then acting like "we're all adults here why does anyone have a problem with this?"
viraptor 1 hours ago [-]
If those two are in any way close in your mind... maybe think about why a cartoon character and an explicit sexual image seem related? There's nothing weird about an anime cat girl to me, but I also don't see any relation to anything sexualised in it.
mariusor 2 hours ago [-]
Mister simplex, preferring that the small internet crumble under the shit LLM companies are piling upon it over being exposed once in a while to an inoffensive image of an anime character would make for a very worrisome set of priorities if I were you.
petesergeant 3 hours ago [-]
> an anime cat-girl
> an animated cock and balls
You don't see a difference between these things?
GaryBluto 3 hours ago [-]
While it's a bit of an extreme comparison, they're both weird, unprofessional imagery associated with things you wouldn't want to associate with a software project.
lakecresva 3 hours ago [-]
If it were just the imagery I don't think this would be such a huge flashpoint relative to something like tux or octocat (the github mascot).
squigz 2 hours ago [-]
Because it's really about the type of people (they think) watch anime, and their inability to separate this preconception from reality.
petesergeant 3 hours ago [-]
I guess I'm not seeing the special category that an anime cat girl sits in. Is there some kind of sex implication I'm just not aware of? Linux has a penguin, FreeBSD has a devil(!), OpenBSD has a blowfish, Go has the weird Gopher thing, Gnome has a foot...
Wikipedia suggests that there's an association with queer and trans youth, is that what's meant to make the cock-and-balls comparison work? But it also says it has a history back to 17th century Japan...
spinf97 3 hours ago [-]
I don't, actually. Why is it weird?
spinf97 3 hours ago [-]
Only man-children can be bothered by anime catgirls enough to post about it on Hacker News, so it says more about you tbh
GaryBluto 3 hours ago [-]
When did this notion arise that caring about things and wanting things to be professional is bad, or makes you a "man-child"? That would mean that practically everybody in human history has been a man-child. It feels like the whole world (even formerly professional areas) has decided to be casual, and it's frustrating to those who think these things matter.
brendoelfrendo 3 hours ago [-]
I want less professionalism, thanks. I think the idea that everything needs to be an emotionless product has been largely harmful to the internet as a place of community and expression.
GaryBluto 3 hours ago [-]
Professional =/= emotionless or product. I'd argue that early Linux, with all of Linus' rants, was more professional than most companies today.
I suppose it all comes down to what your definition of "professional" is.
watwut 2 hours ago [-]
> That would mean that practically everybody in human history has been a man-child.
I would argue that this statement is blatantly false. Currently, most people really do not care about the Anubis anime cat girl icon, which is actually a fairly tame and boring picture.
Throughout history, people have used all kinds of images for professional things, including stuff they found funny or cute.