

413 points by todsacerdoti 211 days ago | 93 comments

tomsmeding 210 days ago [-]

They do have a robots.txt [1] that disallows robot access to the spigot tree (as expected), but removing the /spigot/ part from the URL seems to still lead to Spigot. [2] The /~auj namespace is not disallowed in robots.txt, so even well-intentioned crawlers, if they somehow end up there, can get stuck in the infinite page zoo. That's not very nice.

[1]: https://www.ty-penguin.org.uk/robots.txt

[2]: https://www.ty-penguin.org.uk concatenated with /~auj/cheese (don't want to create links there)

bstsb 210 days ago [-]

previously the author wrote in a comment reply about not configuring robots.txt at all:

> I've not configured anything in my robots.txt and yes, this is an extreme position to take. But I don't much like the concept that it's my responsibility to configure my web site so that crawlers don't DOS it. In my opinion, a legitimate crawler ought not to be hitting a single web site at a sustained rate of > 15 requests per second.

yorwba 210 days ago [-]

The spigot doesn't seem to distinguish between crawlers that make more than 15 requests per second and those that make less. I think it would be nicer to throw up a "429 Too Many Requests" page when you think the load is too much and only poison crawlers that don't back off afterwards.
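A minimal sketch of the "429 first, poison only repeat offenders" idea above. It is not how Spigot works; the `Throttle` class, the `client` key and `STRIKES_BEFORE_POISON` are hypothetical names chosen for illustration, and only the 15 requests per second figure comes from the thread.

```python
import time
from collections import defaultdict, deque

RATE_LIMIT = 15            # requests per second, the threshold quoted in the thread
WINDOW = 1.0               # sliding window length in seconds
STRIKES_BEFORE_POISON = 3  # hypothetical: number of 429s sent before poisoning

class Throttle:
    def __init__(self):
        self.hits = defaultdict(deque)   # client -> timestamps of recent requests
        self.strikes = defaultdict(int)  # client -> over-limit requests seen so far

    def decide(self, client, now=None):
        """Return 'serve', 'throttle' (answer with 429) or 'poison'."""
        now = time.monotonic() if now is None else now
        recent = self.hits[client]
        recent.append(now)
        # Forget requests that have fallen out of the sliding window.
        while recent and now - recent[0] > WINDOW:
            recent.popleft()
        if len(recent) <= RATE_LIMIT:
            return "serve"
        # Over the limit: warn a few times, then start poisoning.
        self.strikes[client] += 1
        if self.strikes[client] > STRIKES_BEFORE_POISON:
            return "poison"
        return "throttle"
```

The sketch assumes a stable per-client key such as an IP address, so it does nothing against crawlers that spread their requests across many addresses.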

evgpbfhnr 210 days ago [-]

when crawlers use a botnet to make only one request per IP over a long period, that's not realistic to implement though.

DamonHD 210 days ago [-]

Almost no bot responds usefully to 429 that I have seen, and a few respond to it like 500 and 503 to speed up / retry / poll more.

dhosek 209 days ago [-]

Reminds me of a service I led the development on where we had to provide mocks for the front end to develop against as well as develop against mocks of an external service which wasn’t ready for us to use.

When we finally were able to do an end-to-end test, everything worked perfectly on the first try.

Except, the front end REST library, when given a 401 error when an incorrect auth code was sent, retried the request rather than reporting to the user that there was an error which meant that entering an incorrect auth code would lock the user out of their account immediately.

We ended up having to return all results with a 200 response regardless of the contents because of that broken library.

josephg 210 days ago [-]

> even well-intentioned crawlers, if they somehow end up there, can get stuck in the infinite page zoo. That's not very nice.

So? What duty do web site operators have to be "nice" to people scraping your website?

gary_0 210 days ago [-]

The Marginalia search engine or archive.org probably don't deserve such treatment--they're performing a public service that benefits everyone, for free. And it's generally not in one's best interests to serve a bunch of garbage to Google or Bing's crawlers, either.

marginalia_nu 210 days ago [-]

It's not really too big of a problem for a well-implemented crawler. You basically need to define an upper bound both in terms of document count and time for your crawls, since crawler traps are pretty common and have been around since the cretaceous.
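A rough sketch of the upper bounds described above, assuming a per-site crawl loop. The `crawl` function, its default budgets and the caller-supplied `fetch_links` callback are illustrative, not Marginalia's actual code.

```python
import time
from collections import deque

def crawl(start_url, fetch_links, max_docs=1000, max_seconds=60.0):
    """Breadth-first crawl of one site that stops at a document or time budget.

    fetch_links is a caller-supplied function: url -> iterable of linked URLs.
    Returns the number of documents fetched.
    """
    deadline = time.monotonic() + max_seconds
    seen = {start_url}
    queue = deque([start_url])
    fetched = 0
    # Either budget running out ends the crawl, however many links remain queued.
    while queue and fetched < max_docs and time.monotonic() < deadline:
        url = queue.popleft()
        fetched += 1
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched
```

With both budgets in place, an infinite page generator costs the crawler at most `max_docs` fetches or `max_seconds` of wall-clock time per site.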

darkwater 210 days ago [-]

If you have such a website, then you will just serve normal data. But it seems perfectly legit to serve fake random gibberish from your website if you want to. A human would just stop reading it.

suspended_state 210 days ago [-]

The point is that not every web crawler is out there to scrape websites.

andybak 210 days ago [-]

Unless you define "scrape" to be inherently nefarious - then surely they are? Isn't the definition of a web crawler based on scraping websites?

suspended_state 210 days ago [-]

I think that web scraping is usually understood as the act of extracting information from a website for ulterior, self-centered motives. However, it is clear that this ulterior motive cannot be assessed by a website owner. Only the observable behaviour of a data-collecting process can be categorized as morally good or bad. While the badly behaving people are usually also the ones with morally wrong motives, one doesn't entail the other. I chose to call the badly behaving ones scrapers, and the well-behaved ones crawlers.

That being said, the author is perhaps concerned by the growing number of collecting processes, which take a toll on his server, and thus chose to simply penalize them all.

jandrese 210 days ago [-]

I wonder if you could mess with AI input scrapers by adding fake captions to each image? I imagine something like:

    (big green blob)

    "My cat playing with his new catnip ball".


    (blue mess of an image)

    "Robins nesting"

Dwedit 210 days ago [-]

A well-written scraper would check the image against a CLIP model or other captioning model to see if the text there actually agrees with the image contents.

Simran-B 210 days ago [-]

Then captions that are somewhat believable? "Abstract digital art piece by F. U. Botts resembling wide landscapes in vibrant colors"

Someone 210 days ago [-]

Do scrapers actually do such things on every page they download? Sampling a small fraction of a site to check how trustworthy it is, I can see happen, but I would think they’d rather scrape many more pages than spend resources doing such checks on every page.

Or is the internet so full of garbage nowadays that it is necessary to do that on every page?

vintermann 209 days ago [-]

Ain't nobody got the processing time for that! Scraping is about more, more, more. If they do any filtering it'll be afterwards.

levzzz 210 days ago [-]

[dead]

marcod 211 days ago [-]

Reading about Spigot made me remember https://www.projecthoneypot.org/

I was very excited 20 years ago, every time I got emails from them saying that the scripts and donated MX records on my website had helped catch a harvester

> Regardless of how the rest of your day goes, here's something to be happy about -- today one of your donated MXs helped to identify a previously unknown email harvester (IP: 172.180.164.102). The harvester was caught a spam trap email address created with your donated MX:

notpushkin 210 days ago [-]

This is very neat. Honeypot scripts are fairly outdated though (and you can’t modify them according to ToS). The Python one only supports CGI and Zope out of the box, though I think you can make a wrapper to make it work with WSGI apps as well.

mrbluecoat 211 days ago [-]

> I felt sorry for its thankless quest and started thinking about how I could please it.

A refreshing (and amusing) attitude versus getting angry and venting on forums about aggressive crawlers.

ASalazarMX 211 days ago [-]

Helped without doubt by the capacity to inflict pain and garbage unto those nasty crawlers.

EspadaV9 211 days ago [-]

I like this one

https://www.ty-penguin.org.uk/~auj/spigot/pics/2025/03/25/fa...

Some kind of statement piece

creatonez 211 days ago [-]

For the full experience:

Firefox: Press F12, go to Network, click No Throttling > change it to GPRS

Chromium: Press F12, go to Network, click No Throttling > Custom > Add Profile > Set it to 20kbps and set the profile

extraduder_ire 210 days ago [-]

Good mention. There's probably some good art to be made by serving similar jpeg images with the speed limited server-side.

myelinsheep 211 days ago [-]

Anything with Shakespeare in it?

EspadaV9 210 days ago [-]

Looks like he didn't get time to finish

https://www.ty-penguin.org.uk/~auj/spigot/pics/2025/03/25/fa...

Terry Pratchett has one I'd like to think he'd approve of. Just a shame I'm unable to see the 8th colour, I'm sure it's in there somewhere.

https://www.ty-penguin.org.uk/~auj/spigot/pics/2025/03/25/fa...

Szpadel 210 days ago [-]

the worst offender I saw is meta.

they have facebookexternalhit bot (they sometimes use default python request user agent) that (as they documented) explicitly ignores robots.txt

it's (as they say) used to validate links if they contain malware. But if someone would like to serve malware the first thing they would do would be to serve innocent page to facebook AS and their user agent.

they also re-check every URL every month to validate if this still does not contain malware.

the issue is as follows: some bad actors spam Facebook with URLs to expensive endpoints (like some search with random filters) and Facebook then provides a free ddos service for your competition. they flood you with > 10 r/s for days every month.

kuschku 210 days ago [-]

Since when is 10r/s flooding?

That barely registers as a blip even if you're hosting your site on a single server.

Szpadel 210 days ago [-]

In our case this was very heavy specialized endpoint and because each request used different set of parameters could not benefit from caching (actually in this case it thrashed caches with useless entries).

This forced us to scale up. When handling one such bot costs more than the rest of the users and bots combined, that's an issue. Especially for our customers with smaller traffic.

This request rate varied from site to site, but it ranged from half to 75% of whole traffic and was basically saturating many servers for days if not blocked.

oofbey 209 days ago [-]

If you're serving static pages through nginx or something, then 10/sec is nothing. But if you're running python code to generate every page, it can add up fast.

anarazel 210 days ago [-]

That depends on what you're hosting. Good luck if it's e.g. a web interface for a bunch of git repositories with a long history. You can't cache effectively because there's too many pages and generating each page isn't cheap.

kazinator 210 days ago [-]

Faking a JPEG is not only less CPU intensive than making one properly, but by doing so you are fuzzing whatever malware is on the other end; if it is decoding the JPEG and isn't robust, it may well crash.

derefr 211 days ago [-]

> It seems quite likely that this is being done via a botnet - illegally abusing thousands of people's devices. Sigh.

Just because traffic is coming from thousands of devices on residential IPs, doesn't mean it's a botnet in the classical sense. It could just as well be people signing up for a "free VPN service" — or a tool that "generates passive income" for them — where the actual cost of running the software, is that you become an exit node for both other "free VPN service" users' traffic, and the traffic of users of the VPN's sibling commercial brand. (E.g. scrapers like this one.)

This scheme is known as "proxyware" — see https://www.trendmicro.com/en_ca/research/23/b/hijacking-you...

jeroenhd 210 days ago [-]

That's just a variant of a botnet that the users are willingly joining. Someone well-intentioned should probably redirect those IP addresses to a "you are part of a botnet" page just in case they find the website on a site like HN and don't know what their family members are up to.

Easiest way to deal with them is just to block them regardless, because the probability that someone who knows what to do about this software and why it's bad will read any particular botnetted website is close to zero.

cAtte_ 211 days ago [-]

sounds like a botnet to me

whatsupdog 210 days ago [-]

Botnet with extra steps.

derefr 211 days ago [-]

Eh. To me, a bot is something users don't know they're running, and would shut off if they knew it was there.

Proxyware is more like a crypto miner — the original kind, from back when crypto-mining was something a regular computer could feasibly do with pure CPU power. It's something users intentionally install and run and even maintain, because they see it as providing them some potential amount of value. Not a bot; just a P2P network client.

Compare/contrast: https://en.wikipedia.org/wiki/Winny / https://en.wikipedia.org/wiki/Share_(P2P) / https://en.wikipedia.org/wiki/Perfect_Dark_(P2P) — pieces of software which offer users a similar devil's bargain, but instead of "you get a VPN; we get to use your computer as a VPN", it's "you get to pirate things; we get to use your hard drive as a cache node in our distributed, encrypted-and-striped pirated media cache."

(And both of these are different still to something like BitTorrent, where the user only ever seeds what they themselves have previously leeched — which is much less questionable in terms of what sort of activity you're agreeing to play host to.)

tgsovlerkhgsel 211 days ago [-]

AFAIK much of the proxyware runs without the informed consent of the user. Sure, there may be some note on page 252 of the EULA of whatever adware the user downloaded, but most users wouldn't be aware of it.

ronsor 211 days ago [-]

because it is, but it's a legal botnet

superjan 210 days ago [-]

There is a particular pattern (block/tag marker) that is illegal in the compressed JPEG stream. If I recall correctly you should insert a 0x00 after a 0xFF byte in the output to avoid it. If there is interest I can follow up later (not today).

superjan 210 days ago [-]

Ok this is correct for traditional JPEG. Other flavors like Jpeg2000 use a similar (but lower overhead) version of this byte-stuffing to avoid JPEG markers from appearing in the compressed stream.

vintermann 209 days ago [-]

I remember there was a guy on compression forums who was very annoyed at this waste of coding space. If you're doing compression, shouldn't you make sure every encoded file decodes to a distinct output? He thought so, and made bijective versions of Huffman coding, arithmetic coding, LZ coding and (even more impressive) the BWT transform known from bzip2.

He was a bit crazy, but I liked that guy. Rest in peace, David A. Scott. Maybe there will be new uses for making all compression bijective over all byte streams.

ethan_smith 210 days ago [-]

You're referring to JPEG's byte stuffing rule: any 0xFF byte in the entropy-coded data must be followed by a 0x00 byte to prevent it from being interpreted as a marker segment.
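The byte-stuffing rule above fits in a couple of lines. This is a sketch with hypothetical helper names (`stuff_bytes`, `unstuff_bytes`); it deliberately ignores restart markers (0xFFD0 to 0xFFD7), which a real encoder may place in the scan data on purpose.

```python
def stuff_bytes(entropy_coded: bytes) -> bytes:
    """Insert 0x00 after every 0xFF so the scan data cannot contain a marker."""
    return entropy_coded.replace(b"\xff", b"\xff\x00")

def unstuff_bytes(scan_data: bytes) -> bytes:
    """Undo stuff_bytes: drop the 0x00 that follows each 0xFF."""
    return scan_data.replace(b"\xff\x00", b"\xff")
```

For random payloads, roughly one byte in 256 is 0xFF, so stuffing grows the data by well under one percent.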

thayne 210 days ago [-]

I'm curious how the author identifies the crawlers that use random User Agents and distinct IP addresses per request. Is there some other indicator that can be used to identify them?

On a different note, if the goal is to waste resources for the bot, one potential improvement could be to use very large images with repeating structure that compress extremely well as JPEGs for the templates, so that it takes more RAM and CPU to decode them with relatively little CPU and RAM required to generate them and bandwidth to transfer them.

bschwindHN 211 days ago [-]

You should generate fake but believable EXIF data to go along with your JPEGs too.

bigiain 210 days ago [-]

Fake exif data with lat/longs showing the image was taken inside Area 51 or The Cheyenne Mountain Complex or Guantanamo Bay...

russelg 211 days ago [-]

They're taking the valid JPEG headers from images already on their site, so it's possible those are already in place.

electroglyph 210 days ago [-]

there's no metadata in the example image

derektank 211 days ago [-]

From the headline that's actually what I was expecting the link to discuss

dheera 211 days ago [-]

> So the compressed data in a JPEG will look random, right?

I don't think JPEG data is compressed enough to be indistinguishable from random.

SD VAE with some bits lopped off gets you better compression than JPEG and yet the latents don't "look" random at all.

So you might think Huffman encoded JPEG coefficients "look" random when visualized as an image but that's only because they're not intended to be visualized that way.

maxbond 211 days ago [-]

Encoded JPEG data is random in the same way cows are spherical.

BlaDeKke 211 days ago [-]

Cows can be spherical.

bigiain 210 days ago [-]

And have uniform density.

anyfoo 210 days ago [-]

Yeah, but in practice you only get that in a perfect vacuum.

nasretdinov 210 days ago [-]

I imagine gravity matters more here than atmosphere

mhuffman 210 days ago [-]

I don't understand the reasoning behind the "feed them a bunch of trash" option when it seems that if you identify them (for example by ignoring a robots.txt file) you can just keep them hung up on network connections or similar without paying for infinite garbage for crawlers to ingest.

schroeding 210 days ago [-]

The "poisoning the data supply" angle seems to be a common motive, similar to tools like nightshade[1] (for actual images and not just garbage data).

[1] https://nightshade.cs.uchicago.edu/whatis.html

mhuffman 209 days ago [-]

I get that, but paying to do it?

Modified3019 211 days ago [-]

Love the effort.

That said, these seem to be heavily biased towards displaying green, so one “sanity” check would be if your bot is suddenly scraping thousands of green images, something might be up.

lvncelot 210 days ago [-]

Nature photographers around the world rejoice as their content becomes safe from scraping.

recursive 211 days ago [-]

Mission accomplished I guess

ykonstant 210 days ago [-]

Next we do it with red and blue :D

112233 210 days ago [-]

So how do I set up an instance of this beautiful flytrap? Do I need a valid personal blog, or can I plop something on cloudflare to spin on their edge?

ffsm8 210 days ago [-]

It's a flask app, he linked to it

https://github.com/gw1urf/spigot/

lblume 211 days ago [-]

Given that current LLMs do not consistently output total garbage, and can be used as judges in a fairly efficient way, I highly doubt this could even in theory have any impact on the capabilities of future models. Once (a) models are capable enough to distinguish between semi-plausible garbage and possibly relevant text and (b) companies are aware of the problem, I do not think data poisoning will be an issue at all.

jesprenj 211 days ago [-]

Yes, but you still waste their processing power.

Zecc 210 days ago [-]

> Once (a) models are capable enough to distinguish between semi-plausible garbage and possibly relevant text

https://xkcd.com/810/

immibis 211 days ago [-]

There's no evidence that the current global DDoS is related to AI.

ykonstant 210 days ago [-]

We have investigated nobody and found no evidence of malpractice!

lblume 210 days ago [-]

The linked page claims that most identified crawlers are related to scraping for training data of LLMs, which seems likely.

jeroenhd 210 days ago [-]

This makes me wonder if there are more efficient image formats that one might want to feed botnets. JPEG is highly complex, but PNG uses a relatively simple DEFLATE stream as well as some basic filters. Perhaps one could make a zip-bomb like PNG that only consists of a few bytes?

john01dav 210 days ago [-]

That might be challenging because you can trivially determine the decoded output size from the dimensions in pixels and the pixel format, so if the DEFLATE stream goes beyond that you can stop decoding and discard the image as malformed. Of course, some decoders may not do so and thus would be vulnerable.
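A sketch of that defence, assuming a non-interlaced PNG whose IHDR fields have already been parsed and whose IDAT chunks have been concatenated. The helper names `expected_raw_size` and `inflate_bounded` are made up for this example; the channel counts per colour type follow the PNG specification.

```python
import zlib

# Samples per pixel for each PNG colour type.
CHANNELS = {0: 1, 2: 3, 3: 1, 4: 2, 6: 4}

def expected_raw_size(width, height, bit_depth, color_type):
    """Bytes the IDAT stream should inflate to for a non-interlaced PNG."""
    bits_per_row = width * CHANNELS[color_type] * bit_depth
    row_bytes = (bits_per_row + 7) // 8
    return height * (1 + row_bytes)  # one filter-type byte precedes each scanline

def inflate_bounded(idat, limit):
    """Inflate at most `limit` bytes; raise if the stream holds more."""
    decoder = zlib.decompressobj()
    out = decoder.decompress(idat, limit + 1)  # ask for one byte past the limit
    if len(out) > limit:
        raise ValueError("stream inflates past the size implied by IHDR")
    return out
```

A decoder built this way never allocates more than the header promises, so it would also want its own cap on width times height, since the header itself can claim enormous dimensions.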

_ache_ 210 days ago [-]

Is it a problem though? I'm pretty sure that any check is on the file size of the PNG, not the actual dimensions of the image.

PNG has no practical limit on the image dimensions (4 bytes each). So I bet you can break at least one scraper bot with that.

sltkr 210 days ago [-]

DEFLATE has a rather low maximum compression ratio of 1:1032, so a file that would take 1 GB of memory uncompressed still needs to be about 1 MB.
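The 1:1032 figure follows from the format: a single match can emit at most 258 bytes and costs at least 2 bits (one for the length code, one for the distance code), so 8 bits in yield at most 258 × 8 / 2 = 1032 bytes out. A quick way to see how close zlib gets in practice, using all-zero input as the assumed best case:

```python
import zlib

raw = bytes(10_000_000)         # 10 MB of zero bytes, about as compressible as input gets
packed = zlib.compress(raw, 9)  # level 9: best compression zlib offers
ratio = len(raw) / len(packed)
print(len(packed), round(ratio))
```

The measured ratio stays below 1032 because of the zlib header, the checksum and per-block overhead.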

ZIP bombs rely on recursion or overlapping entries to achieve higher ratios, but the PNG format is too simple to allow such tricks (at least in the usual critical chunks that all decoders are required to support).

time0ut 210 days ago [-]

JPEG is fascinating and quite complex. Here is a really excellent high level explanation of how it works:

https://www.youtube.com/watch?v=0me3guauqOU

larcanio 210 days ago [-]

Happy to realize real heroes do exist.

puttycat 211 days ago [-]

> compression tends to increase the entropy of a bit stream.

Does it? Encryption increases entropy, but not sure about compression.

gregdeon 211 days ago [-]

Yes: the reason why some data can be compressed is because many of its bits are predictable, meaning that it has low entropy per bit.

JCBird1012 211 days ago [-]

I can see what was meant with that statement. I do think compression increases Shannon entropy by virtue of it removing repeating patterns of data - Shannon entropy per byte of compressed data increases since it’s now more “random” - all the non-random patterns have been compressed out.

Total information entropy - no. The amount of information conveyed remains the same.
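The per-byte claim can be checked with a histogram estimate. This sketch measures zeroth-order entropy only (each byte treated independently), and the word list and sample size are arbitrary choices for the demonstration.

```python
import math
import random
import zlib
from collections import Counter

def bits_per_byte(data: bytes) -> float:
    """Zeroth-order Shannon entropy of the byte histogram, in bits per byte."""
    counts = Counter(data)
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# Repetitive, predictable input: a small vocabulary repeated in random order.
rng = random.Random(0)
words = ["spigot", "crawler", "jpeg", "huffman", "robots", "garbage"]
text = " ".join(rng.choice(words) for _ in range(50_000)).encode()
packed = zlib.compress(text, 9)

print(f"original:   {bits_per_byte(text):.2f} bits/byte over {len(text)} bytes")
print(f"compressed: {bits_per_byte(packed):.2f} bits/byte over {len(packed)} bytes")
```

The compressed stream has fewer bytes, and each one is less predictable, which is the sense in which compression "increases entropy".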

gary_0 210 days ago [-]

Technically with lossy compression, the amount of information conveyed will likely change. It could even increase the amount of information of the decompressed image, for instance if you compress a cartoon with simple lines and colors, a lossy algorithm might introduce artifacts that appear as noise.

a-biad 210 days ago [-]

I am a bit confused about the context. What exactly is the point of exposing fake data to web crawlers?

whitten 210 days ago [-]

penalizing the web spider for scraping their site

gosub100 210 days ago [-]

They crawl for data, usually to train a model. Poisoning the model's training data makes it less useful and therefore less valuable.

sim7c00 210 days ago [-]

love how u speak about pleasing bots and them getting excited :D fun read, fun project. thanks!

jekwoooooe 210 days ago [-]

It’s our moral imperative to make crawling cost prohibitive and also poison LLM training.

BubbleRings 210 days ago [-]

Is there a reason you couldn’t generate your images by grabbing random rectangles of pixels from one source image and pasting them into a random location in another source image? Then you would have a fully valid JPEG that no AI could easily identify as generated junk. I guess that would require much more CPU than your current method, huh?

adgjlsfhk1 210 days ago [-]

Given the amount of money AI companies have, you need at least ~100x work amplification for this to begin to be a punishment.

hashishen 211 days ago [-]

the hero we needed and deserved

bvan 210 days ago [-]

Love this.

ardme 210 days ago [-]

Old man yells at cloud, then creates a labyrinth of mirrors for the images of the clouds to reflect back on each other.

Domainzsite 210 days ago [-]

This is pure internet mischief at its finest. Weaponizing fake JPEGs with valid structure and random payloads to burn botnet cycles? Brilliant. Love the tradeoff thinking: maximize crawler cost, minimize CPU. The Huffman bitmask tweak is chef’s kiss. Spigot feels like a spiritual successor to robots.txt flipping you off in binary.

rfrey 209 days ago [-]

But where is my chocolate cupcake recipe?
