▲Glassworm is back: A new wave of invisible Unicode attacks hits repositories (aikido.dev)

290 points by robinhouston 1 days ago | 182 comments

btown 1 days ago [-]

IMO while the bar is high to say "it's the responsibility of the repository operator itself to guard against a certain class of attack" - I think this qualifies. The same way GitHub provides Secret Scanning [0], it should alert upon spans of zero-width characters that are not used in a linguistically standard way (don't need an LLM for this, just n-tuples).

Sure, third-party services like the OP can provide bots that can scan. But if you create an ecosystem in which PRs can be submitted by threat actors, part of your commitment to the community should be to provide visibility into attacks that cannot be seen by the naked eye, and make that protection the norm rather than the exception.

[0] https://docs.github.com/en/get-started/learning-about-github...
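As a rough illustration of the "spans of zero-width characters" idea, here is a minimal sketch that reports runs of invisible code points above a length threshold. The function name, the `minRun` default, and the character list are assumptions made for illustration; this is not how GitHub or any existing scanner works, and a real tool would need a vetted character list.

```javascript
// Hypothetical helper, not GitHub's implementation: report runs of invisible
// code points that are at least `minRun` long.
// Assumption: "invisible" means zero-width/format characters plus the two
// variation-selector blocks. The list is illustrative, not exhaustive.
const INVISIBLE = /[\u200B-\u200F\u2060-\u2064\uFEFF\uFE00-\uFE0F\u{E0100}-\u{E01EF}]/u;

function findInvisibleRuns(text, minRun = 4) {
  const runs = [];
  let start = -1; // code-point index where the current run began
  let length = 0; // number of code points in the current run
  let index = 0;  // code-point index of the character being examined

  for (const ch of text) { // iterates by code point, not by UTF-16 unit
    if (INVISIBLE.test(ch)) {
      if (length === 0) start = index;
      length += 1;
    } else {
      if (length >= minRun) runs.push({ start, length });
      length = 0;
    }
    index += 1;
  }
  if (length >= minRun) runs.push({ start, length });
  return runs;
}
```

A single zero-width joiner inside an emoji sequence stays below the threshold, while a long hidden payload does not. The threshold is a tunable guess, not a linguistically derived value.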

andrewflnr 1 days ago [-]

Regardless of the thorny question of whether it's Github's responsibility, it sure would be a good thing for them to do ASAP.

godelski 1 days ago [-]

Here's the big reason GitHub should do it:

  It makes the product better

I know people love to talk money and costs and "value", but HN is a space for developers, not the business people. Our primary concern, as developers, is to make the product better. The business people need us to make the product better, keep the company growing, and beat out the competition. We need them to keep us from fixating on things that are useful but low priority and to ensure we keep having money. The contention between us is good; it keeps balance. It even ensures things keep getting better even if an effective monopoly forms as they still need us, the developers, to make the company continue growing (look at monopolies people aren't angry at and how they're different). And they need us more than we need them.

So I'd argue it's the responsibility of the developers, hired by GitHub, to create this feature because it makes the product better. Because that's the thing you've been hired for: to make the product better. Your concern isn't about the money, your concern is about the product. That's what you're hired for.

btown 1 days ago [-]

I'd say that this is also true from a money-and-costs-and-value perspective. Sure, all press is good press... but any number of stakeholders would agree that "we got some mindshare by proactively protecting against an emerging threat" is higher-ROI press than "Ars did a piece on how widespread this problem is, and we're mentioned in the context of our interface making the attack hard to detect."

And when the incremental cost to build a feature is low in an age of agentic AI, there should be no barrier to a member of the technical staff (and hopefully they're not divided into devs/test/PM like in decades past) putting a prototype together for this.

godelski 23 hours ago [-]

I agree and think it's extra important when you have specialized products. Experts are more sensitive to the little things.

Engineers and developers are especially sensitive. It's our job to find problems and fix them. I don't trust engineers that aren't a bit grumpy because it usually means they don't know what the problems are (just like when they don't dogfood). Though I'll also clarify that what distinguishes a grumpy engineer from your average redditor is that they have critiques rather than just complaints. Being critique-oriented means searching for solutions to problems; you can't just stop at problem identification.

  > And when the incremental cost to build a feature is low in an age of agentic AI

I'm not sure that's even necessary. A very quick but still helpful patch would be to display invisible characters. Just like we often do with whitespace characters. The diff can be a bit noisier and it's the perfect place for this even if you purposefully use invisible characters in your programming environment.
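To make the "display invisible characters" patch concrete, here is a minimal sketch that rewrites each hidden code point in a diff line as a visible `<U+XXXX>` marker. The name `showInvisible` and the choice of character classes are assumptions for illustration, not a description of any existing diff viewer.

```javascript
// Hypothetical rendering helper: replace each invisible code point with a
// visible <U+XXXX> marker, similar to how diff views show whitespace.
// Assumption: \p{Cf} (format characters) plus the variation-selector blocks
// cover the characters of interest. This list is illustrative only.
const HIDDEN_CHARS = /[\p{Cf}\uFE00-\uFE0F\u{E0100}-\u{E01EF}]/gu;

function showInvisible(line) {
  return line.replace(HIDDEN_CHARS, (ch) => {
    const hex = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
    return `<U+${hex}>`;
  });
}
```

Lines without hidden characters pass through unchanged, so the extra noise only appears where something would otherwise be invisible.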

Though we're also talking about an organization that couldn't merge a PR for a year that fixed a one liner. A mistake that should never have gotten through review. Seriously, who uses a while loop counter checking for equality?!? I'm still convinced they left the "bug" because it made them money

fingerlocks 14 hours ago [-]

>Though we're also talking about an organization that couldn't merge a PR for a year that fixed a one liner. A mistake that should never have gotten through review. Seriously, who uses a while loop counter checking for equality?!? I'm still convinced they left the "bug" because it made them money

What is this in reference to? I tried to search for it but only found this comment. “Github while loop fix that was in review for a year”?

rkagerer 22 hours ago [-]

At the end of the day it boils down to putting your users first.

Making the product better generally stems from acting in their interest, honing the tool you offer to provide the best possible experience, and making business decisions that respect their dignity.

Your comment talks a lot about product and I agree with it, I just mentioned this so we don't lose sight of the fact this is ultimately about people.

tapland 1 days ago [-]

Tldr: Yeah it would make it better!

godelski 1 days ago [-]

I hope I left the lead as the lead.

But I also think we've had a culture shift that's hurting our field. Where engineers are arguing about if we should implement certain features based on the monetary value (which are all fictional anyways). But that's not our job. At best, it's the job of the engineering manager to convince the business people that it has not only utility value, but monetary.

andrewflnr 21 hours ago [-]

> Your concern isn't about the money, your concern is about the product. That's what you're hired for.

According to whom? Certainly not the people who did the hiring.

I somewhat agree that developers should optimize for something other than pure monetary value, but it has nothing to do with the hiring relationship, just the moral duty to use what power you have to make the world better. In general, this can easily conflict with "what you're hired for."

In this case I think showing suspicious (or even all) invisible Unicode in PRs is even a monetarily valuable feature, so the moral angle is mostly moot. And I would put the primary moral burden on the product management either way, since they're the ones with the most power to affect the product, potentially either ordering the right thing to be done or stopping the devs when they try to do it on their own.

godelski 21 hours ago [-]

  > According to whom? Certainly not the people who did the hiring.

Actually yes, according to them. Maybe they'll say that you should also be concerned about the money but that just makes the business people redundant now doesn't it? So is it better if I clarify and say that the product is your primary concern?

As a developer you have a de facto primary concern with the product. They hire you to... develop. They do not hire you to manage finances, they hire you to manage the product. Doing both is more the job of the engineering manager. But as a developer your expertise is in developing. I don't think this is a crazy viewpoint.

You were hired for your technical skills, not your MBA.

  > In this case I think showing suspicious (or even all) invisible Unicode in PRs is even a monetarily valuable feature

I agree. Though I also think this is true for many things that improve the product.

Also note that I'm writing to my audience.

  >> but HN is a space for developers, not the business people.

How I communicate with management is different, but I'm exhausted when talking to fellow developers and the first question being about monetary value. That's not the first question in our side of things. Our first question is "is this useful?" or "does this improve the product?" If the answer is "yes" then I am /okay/ talking about monetary value. If it's easy to implement and helps the product, just implement it. If it requires time and the utility is valuable then yes, it helps to formulate an argument about monetary value since management doesn't understand any other language, but between developers that is a rather crazy place to start out (unless the proposal is clearly extremely costly. But then say "I don't think you'd ever convince management" instead of "okay, but what is the 'value' of that feature?"). If I wanted to talk to business people I'd talk to the business people, not another developer...

andrewflnr 21 hours ago [-]

They might say that your job is to make the product "better", and they might even think they mean it, but I think in practice you'll find that their definition of "better" as it relates to products is pretty closely related to money, and further that they are the authorities on what makes the product "better" so you should shut up and do what they say. If you want to make the product actually better, you're going to have to defy them occasionally. That's not what you were hired for, that's just being a human with principles.

godelski 20 hours ago [-]

To be frank, I tried to address your point with my comment about the audience.

I very much disagree that you start with money and work backwards to technical problems. I do not think this approach would make you efficient at solving problems nor at increasing profits for the business.

And I still firmly believe they need us more than we need them. At the end of the day this is why they want AI coding agents to work out but I do not think that even in the best situation we'll end up in any different of a situation than COBOL. You can make developers more efficient, but replacing them requires an entirely different set of skills.

An MBA-type, with no programming background, has a better chance getting their photos taken with their iPhone in a museum than they do replacing a developer. I'm sure there will be some successful at it, but exceptions do not define the rule.

andrewflnr 20 hours ago [-]

Talking about the audience completely misses my point. I'm not saying it's good to start with money and work back. I'm saying that's what companies actually do, and furthermore that's something the "dev audience" should understand about their employers.

> I do not think this approach would make you efficient at solving problems nor at increasing profits for the business.

If optimizing for profit doesn't result in profit, it's not the fault of the goal. That company was just incompetent. However many companies are, in fact, moderately competent, and optimizing for profit works fine for them. It even has a pretty heavy overlap with optimizing for good products, so that's nice.

It's fine. We agree on the ideal outcome in this situation.

jacquesm 1 days ago [-]

It absolutely is. They are simply spreading malware. You can't claim to be a 'dumb pipe' when your whole reason for existence is to make something people deemed 'too complex' simple enough for others to use, then you have an immediate responsibility to not only reduce complexity but to also ensure safety. Dumbing stuff down comes with a duty of care.

OJFord 10 hours ago [-]

They advertise that they do do it, they just don't/it doesn't work.

See commenter on their 2025 bounty for reporting it, won't-fix resolution: https://news.ycombinator.com/item?id=47393393

zzo38computer 1 days ago [-]

I think a "force visible ASCII for files whose names match a specific pattern" mode would be a simple thing to help. (You might be able to use the "encoding" command in the .gitattributes file for this, although I don't know if this would cause errors or warnings to be reported, and it might depend on the implementation.)

RVuRnvbM2e 13 hours ago [-]

Vigilant mode exists, and would have flagged the malicious commit as unverified in this case. Maybe it should be the default.

https://docs.github.com/en/authentication/managing-commit-si...

athrowaway3z 11 hours ago [-]

For some reason I was under the impression this was already the default.

I first heard about the possibility of this kind of attack >10 years ago, and I'll sometimes do a xxd if i'm feeling a bit paranoid.

iririririr 19 hours ago [-]

especially because it's literally a problem with their code viewer (and vscode, which is also theirs).

i see squares on a properly configured vim on xterm.

ocornut 1 days ago [-]

It baffles me that any maintainer would merge code like the one highlighted in the issue, without knowing what it does. That’s regardless of being or not being able to see the “invisible” characters. There’s a transforming function here and an eval() call.

The mere fact that a software maintainer would merge code without knowing what it does says more about the terrible state of software.

dspillett 23 hours ago [-]

> It baffles me that any maintainer would merge code like the one highlighted in the issue, without knowing what it does.

I don't know if it is relevant in any specific case being discussed here, but if the exploit route is via gaining access to the accounts of previously trusted submitters (or otherwise being able to impersonate them), it could be a case of a team with a pile of PRs to review (many of which are the sloppy unverified LLM output that is causing a problem for some popular projects) letting through an update from a trusted source that has been compromised.

It could correctly be argued that this is a problem caused by laziness and corner cutting, but it is still understandable because projects that are essentially run by a volunteer workforce have limited time resources available.

mmlb 23 hours ago [-]

In this instance the PR that was merged was from 6 years ago and was clear https://github.com/pedronauck/reworm/pull/28. Looks to me like a force push overwrote the commit that now exists in history since it was done 6y later.

globular-toast 14 hours ago [-]

So who force pushed and why?

pdonis 24 hours ago [-]

Wish I could upvote this more.

mmsc 21 hours ago [-]

GitHub advertises itself as warning about those Unicode characters: https://github.blog/changelog/2025-05-01-github-now-provides...

Of course, it doesn't work though. I reported this to their bug bounty, they paid me a bounty, and told me "we won't be fixing it": https://joshua.hu/2025-bug-bounty-stories-fail#githubs-utf-f...

The exact quote is "Thanks for the submission! We have reviewed your report and validated your findings. After internally assessing your report based on factors including the complexity of successfully exploiting the vulnerability, the potential data and information exposure, as well as the systems and users that would be impacted, we have determined that they do not present a significant security risk to be eligible under our rewards structure." The funny thing is, they actually gave me $500 and a lifetime GitHub Pro for the submission.

OJFord 10 hours ago [-]

That's bizarre. They won't be fixing it, and yet the changelog post is unretracted.

zzo38computer 1 days ago [-]

I use non-Unicode mode in the terminal emulator (and text editors, etc.), I use a non-Unicode locale, and I always use ASCII for most kinds of source code files (mainly C) (in some cases another character set, such as the PC character set, will be used, but usually it will be ASCII). Doing this will mitigate many of these attacks when maintaining your own software. I am apparently not the only one; I have seen others suggest similar things. (If you need non-ASCII text (e.g. for documentation) you might store it in separate files instead. If you only need a small number of non-ASCII characters in a few string literals, then you might use the \x escapes; add comments if necessary to explain them.)

The article is about JavaScript, although it can apply to other programming languages as well. However, even in JavaScript, you can use \u escapes in place of the non-ASCII characters. (One of my ideas for a programming language design intended as a better alternative to C is that it forces visible ASCII (and a few control characters, with some restrictions on their use), unless you specify by a directive or switch that you want to allow non-ASCII bytes.)
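A minimal sketch of the \u escape approach: the hypothetical helper below rewrites every non-ASCII character of a string's content as a JavaScript `\uXXXX` escape, so the text can be pasted into a string literal while the source file stays pure ASCII. The name `toAsciiEscapes` is made up for illustration, and the helper does not escape quotes or backslashes already present in the input.

```javascript
// Hypothetical helper: rewrite every non-ASCII character in a string as a
// JavaScript \u escape so the source file itself can stay pure ASCII.
function toAsciiEscapes(text) {
  let out = "";
  for (let i = 0; i < text.length; i += 1) {
    const unit = text.charCodeAt(i); // one UTF-16 code unit
    if (unit < 0x80) {
      out += text[i];
    } else {
      // Characters outside the BMP become two \uXXXX escapes (a surrogate
      // pair), which a JavaScript string literal decodes back to the same
      // character.
      out += "\\u" + unit.toString(16).toUpperCase().padStart(4, "0");
    }
  }
  return out;
}
```

An invisible character escaped this way shows up in review as plain, greppable ASCII text.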

amake 7 hours ago [-]

That’s great for you. Isn’t feasible for software development by teams that are native in a language with a non-Latin script.

TacticalCoder 23 hours ago [-]

> ... and I always use ASCII for most kinds of source code files

Same. And I enforce it. I've got scripts and hooks that enforce that source files only ever contain a subset of ASCII (not even all ASCII codes have their place in source code).

Unicode character strings are perfectly fine in resource files. You can build perfectly good i18n/l10n apps and webapps without ever using a single Unicode character in a source file. And if you really do need one, there's indeed ASCII escaping available in many languages.
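A check in the spirit of such hooks might look like the sketch below. The allowed subset (tab, line feed, carriage return, printable ASCII) is an assumption, since the comment does not say which subset is enforced, and `findDisallowedChars` is a hypothetical name.

```javascript
// Hypothetical check in the spirit of the hooks described above.
// Assumption: the allowed subset is tab, line feed, carriage return, and
// printable ASCII (0x20-0x7E). The commenter's actual subset is not stated.
function findDisallowedChars(source) {
  const problems = [];
  source.split("\n").forEach((line, lineIndex) => {
    let column = 0; // 1-based column, counted in code points
    for (const ch of line) {
      column += 1;
      const cp = ch.codePointAt(0);
      const allowed = cp === 0x09 || cp === 0x0d || (cp >= 0x20 && cp <= 0x7e);
      if (!allowed) {
        problems.push({ line: lineIndex + 1, column, codePoint: cp });
      }
    }
  });
  return problems;
}
```

A hook would run this over each staged source file and reject the commit when the returned list is non-empty, leaving resource files out of scope.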

Some shall complain that their name as "Author: ..." in comments cannot be written properly in ASCII. If I wanted to be facetious I'd say that soon we'll see:

    # Author: Claude Opus 27.2

and so the point shall be moot anyway.

userbinator 22 hours ago [-]

CP437 forever!

The biggest use of Unicode in source repos now might be LLM slop, so I certainly don't miss its absence at all.

nstart 17 hours ago [-]

I don't quite understand how this is working tbh. I looked at one of the affected repos, ironically named "reworm".

The malicious code was introduced in this commit - https://github.com/pedronauck/reworm/commit/d50cd8c8966893c6...

It says coauthored by dependabot and refers to a PR opened in 2020 (https://github.com/pedronauck/reworm/pull/28).

That PR itself was merged in 2020 here - https://github.com/pedronauck/reworm/commit/df8c1803c519f599...

But the commit with the worm (d50cd8c), re-introduces the same change from df8c180 to the file `yarn.lock`.

And when you look at the history of yarn.lock inside of github, all references to the original version bump (df8c180) are gone...? In fact if you look at the overall commit history, the clean df8c180 commit does not exist.

I'm struggling to understand what kind of shenanigans happened here exactly.

RVuRnvbM2e 13 hours ago [-]

Someone has maintainer/admin access to the repository and has force-pushed to master overwriting the git history.

Notice that the original commit is verified: https://github.com/pedronauck/reworm/commit/df8c1803c519f599...

While the malicious one is not: https://github.com/pedronauck/reworm/commit/d50cd8c8966893c6...

globular-toast 12 hours ago [-]

This reveals a deeper flaw in the whole git/npm pipeline (would apply to other systems like PyPI etc, not npm exclusively). These systems should operate on a "pull" model, not a push. The system should have rejected a build that wasn't derived from the latest in its repository. It would be quite easy in concept to set up one's own system to pull every source on npm and alert when the upstream has deviated.
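The "alert when the upstream has deviated" step can be sketched as a comparison between commit IDs recorded on an earlier pull and the IDs reachable now. Everything here is hypothetical: the function name, the input shape, and the idea that a monitor stores earlier IDs are assumptions, and how the IDs are fetched is deliberately left out.

```javascript
// Hypothetical monitor step: given the commit IDs recorded on an earlier pull
// and the IDs reachable from the branch now, list the ones that vanished.
// How the IDs are obtained (for example from `git rev-list`) is left out.
function findRewrittenCommits(previouslySeen, currentHistory) {
  const current = new Set(currentHistory);
  return previouslySeen.filter((id) => !current.has(id));
}
```

A non-empty result means history that was once published is no longer reachable, which is what a force push that replaces a commit would look like from the outside.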

jibal 13 hours ago [-]

The malicious code was added to package.json, not yarn.lock

vitus 1 days ago [-]

Looks like the repo owner force-pushed a bad commit to replace an existing one. But then, why not forge it to maintain the existing timestamp + author, e.g. via `git commit --amend -C df8c18`?

Innocuous PR (but do note the line about "pedronauck pushed a commit that referenced this pull request last week"): https://github.com/pedronauck/reworm/pull/28

Original commit: https://github.com/pedronauck/reworm/commit/df8c18

Amended commit: https://github.com/pedronauck/reworm/commit/d50cd8

Either way, pretty clear sign that the owner's creds (and possibly an entire machine) are compromised.

chrismorgan 1 days ago [-]

The value of the technique, I suppose, is that it hides a large payload a bit better. The part you can see stinks (a bunch of magic numbers and eval), but I suppose it’s still easier to overlook than a 9000-character line of hexadecimal (if still encoded or even decoded but still encrypted) or stuff mentioning Solana and Russian timezones (I just decoded and decrypted the payload out of curiosity).

But really, it still has to be injected after the fact. Even the most superficial code review should catch it.

vitus 1 days ago [-]

Agreed on all those fronts. I'm just dismayed by all the comments suggesting that maintainers just merged PRs with this trojan, when the attack vector implies a more mundane form of credential compromise (and not, as the article implies, AI being used to sneak malicious changes past code review at scale).

jeltz 1 days ago [-]

Yeah, the attack vector seems to be stolen credentials. I would be much more interested in an attack which actually uses Invisible characters as the main vector.

minus7 1 days ago [-]

The `eval` alone should be enough of a red flag

whizzter 18 hours ago [-]

Sadly, JS has ways around it that are far from obvious, since you can chain effects over multiple files that lead to running code.

Like the following example (you can paste it into node to verify), could be spread out over multiple source files to make it even harder to follow:

  // prelude 1, obfuscate the constructor property name to avoid raising simple analyser alarms
  const prefix = "construction".substring(0,7);
  const suffix = "tractor".substring(3);
  const obfuscatedConstructorName = prefix + suffix; // innocent looking, but we have the indexing name.

  // prelude 2, get the Function class by indexing a function object with our constructor property name (that does not show up in source-code)
  const existingFunction = ()=>"nothing here";
  const InnocentLookingClass = existingFunction[obfuscatedConstructorName];

  // payload decoding elsewhere (this is where we decode our nasty source)
  const nastyPayloadDisguisedAsData = "console.log('sourced string that could be malicious')";

  // Unrelated location where payload gets executed
  const hardToMissFun = new InnocentLookingClass(nastyPayloadDisguisedAsData);
  hardToMissFun(); // when this function is run somewhere.. the nasty things happen.

Unless you have a data-tracing verifier or a sandbox that is run continuously, it's going to be very hard to even come close to determining that arbitrary code is being evaluated in this example. There is not a single trace of eval, nor any sign that the property name constructor is used.

jeltz 1 days ago [-]

Yeah, I would have loved to see an example where it was not obvious that there is an exploit. Where it would be possible for a reviewer to actually miss it.

godelski 1 days ago [-]

I'm not a JS person, but taking the line at face value, shouldn't it do nothing? Which, if I understand correctly, should never be merged. Why would you merge no-ops?

kordlessagain 1 days ago [-]

No it’s not.

simonreiff 1 days ago [-]

OWASP disagrees: See https://cheatsheetseries.owasp.org/cheatsheets/Nodejs_Securi..., listing `eval()` first in its small list of examples of "JavaScript functions that are dangerous and should only be used where necessary or unavoidable". I'm unaware of any such uses, myself. I can't think of any scenario where I couldn't get what I wanted by using some combination of `vm`, the `Function` constructor, and a safe wrapper around `JSON.parse()` to do anything I might have considered doing unsafely with `eval()`. Yes, `eval()` is a blatant red flag and definitely should be avoided.
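One possible shape for a "safe wrapper around `JSON.parse()`" is sketched below. The name `safeJsonParse` and the result-object convention are assumptions; the only point illustrated is that untrusted text is parsed as data and never executed.

```javascript
// Hypothetical wrapper: parse untrusted text as data only, never as code.
// Returns a result object instead of throwing, so callers must handle failure.
function safeJsonParse(text) {
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch (err) {
    return { ok: false, error: err.message };
  }
}
```

Input such as `console.log(1)` is rejected as invalid JSON here, whereas passing the same string to `eval()` would run it.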

jacquesm 1 days ago [-]

While there are valid use cases for eval they are so rare that it should be disabled by default and strongly discouraged as a pattern. Only in very rare cases is eval the right choice and even then it will be fraught with risk.

godelski 1 days ago [-]

The parent didn't say "there's no legitimate uses of eval", they said "using eval should make people pay more attention." A red flag is a warning. An alert. Not a signal saying "this is 100% no doubt malicious code."

Yes, it's a red flag. Yes, there's legitimate uses. Yes, you should always interrogate evals more closely. All these are true

pavel_lishin 1 days ago [-]

When is an eval not at least a security "code smell"?

SahAssar 1 days ago [-]

It really is. There are very few proper use-cases for eval.

nswango 1 days ago [-]

For a long time the standard way of loading JSON was using eval.

bawolff 1 days ago [-]

Not that long, browsers implemented JSON.parse() back in 2009. JSON was only invented back in 2001 and took a while to become popular. It was a very short window more than a decade ago when eval made sense here.

Eval for json also led to other security issues like XSSI.

creatonez 23 hours ago [-]

Problem is, it took until around 2016 for IE6 to be fully dead, so people continued to justify these hacks for a long time. Horrifying times.

_flux 1 days ago [-]

And why do we not anymore make use of it, but instead implemented separate JSON loading functionality in JavaScript? Can you think of any reasons beyond performance?

bawolff 1 days ago [-]

I'd be surprised if there is a performance benefit of processing json with eval(). Browsers optimize the heck out of JSON.

fhars 23 hours ago [-]

You are arguing against the opposite of what the comment you answered to said.

bawolff 15 hours ago [-]

Am i? "Can you think of any reasons beyond performance?" implies that the comment author thinks performance would be a valid reason.

_flux 12 hours ago [-]

Quoting my original message:

> And why do we not anymore make use of it, but instead implemented separate JSON loading functionality in JavaScript?

In other words: I'm asking for reasons why was native JSON JavaScript module created, if we already had eval.

> Can you think of any reasons beyond performance?

One of the reasons is that native JSON parser is faster than eval: give some other reason.

bulbar 1 days ago [-]

Why did you opt for such a comment when a straightforward response without the belittling tone would have achieved the same?

_flux 1 days ago [-]

I actually gave it some thought. I had written the actual reason first, but I realized that the person I was responding to must know this, yet keeps arguing that eval is just fine.

I would say they are arguing that in bad faith, so I wanted to enter a dialogue where they are either forced to agree, or more likely, not respond at all.

creatonez 23 hours ago [-]

For IE7 support, yes... https://caniuse.com/mdn-javascript_builtins_json_parse

gnabgib 1 days ago [-]

Small discussion yesterday (9+9 points, 9+4 comments) https://news.ycombinator.com/item?id=47374479 https://news.ycombinator.com/item?id=47385244

bawolff 1 days ago [-]

I feel like the threat of this type of thing is really overstated.

Sure the payload is invisible (although tbh im surprised it is. PUA characters usually show up as boxes with hexcodes for me), but the part where you put an "empty" string through eval isn't.

If you are not reviewing your code enough to notice something as nonsensical as eval() of an empty string, would you really notice the non-obfuscated payload either?

Arrowmaster 2 hours ago [-]

Honestly I was expecting more. There are many languages that support Unicode in variable or function names and I expected it to be used there.

It sounds like Python only allows approved Unicode characters to start a variable name but if it allowed any you could do something like `nonprintable = lambda x: insert exploit code here`. If that was hidden in what looked like a blank line between other additions would you catch it?

I'm sure there's some other language out there that has similar syntax and lax Unicode rules this could be used in.

The solution is that this and many other Unicode formatting characters should be ignored and converted to a visible indicator in all code views when you expect plain text.

loumf 21 hours ago [-]

The threat is that you depend on this library or use the VS Code Extension.

tolciho 1 days ago [-]

Attacks employing invisible characters are not a new thing. Prior efforts here include terminal escape sequences, possibly hidden with CSS that if blindly copied and pasted would execute who knows what if the particular terminal allowed escape sequences to do too much (a common feature of featuritis) or the terminal had errors in its invisible character parsing code.

For data or code hiding the Acme::Bleach Perl module is an old example though by no means the oldest example of such. This is largely irrelevant given how relevant not learning from history is for most.

Invisible characters may also cause hard to debug issues, such as lpr(1) not working for a user, who turned out to have a control character hiding in their .cshrc. Such things as hex viewers and OCD levels of attention to detail are suggested.

DropDead 1 days ago [-]

Why didn't someone make an AV rule to find stuff like this? They are just plain text files.

nine_k 1 days ago [-]

The rule must be very simple: any occurrence of `eval()` should be a BIG RED FLAG. It should be handled like a live bomb, which it is.

Then, any appearance of unprintable characters should also be flagged. There are rather few legitimate uses of some zero-width characters, like ZWJ in emoji composition. Ideally all such characters should be inserted as \uNNNN escape sequences, and not literal characters.

Simple lint rules would suffice for that, with zero AI involvement.

hamburglar 1 days ago [-]

I think there’s debate (which I don’t want to participate in) over whether or not invisible characters have their uses in Unicode. But I hope we can all agree that invisible characters have no business in code, and banishing them is reasonable.

WalterBright 1 days ago [-]

> There are rather few legitimate uses of some zero-width characters, like ZWJ in emoji composition.

Emojis are another abomination that should be removed from Unicode. If you want pictures, use a gif.

_flux 1 days ago [-]

Arguably them being in Unicode is an accessibility issue, unless we thought to standardize GIF names, and then that already sounds a lot like Unicode.

WalterBright 1 days ago [-]

How is it an accessibility issue? HTML allows things like little gif files. I've done this myself when I wrote text that contained Egyptian hieroglyphs. It works just fine!

_flux 1 days ago [-]

I mean if you don't have sight.

WalterBright 1 days ago [-]

Then use words. Or tooltips (HTML supports that). I use tooltips on my web pages to support accessibility for screen readers. Unicode should not be attempting to badly reinvent HTML.

sghitbyabazooka 1 days ago [-]

( ꏿ ﹏ ꏿ ; )

hrmtst93837 8 hours ago [-]

Automatic escaping sounds nice until you need to grep or diff across repos and get buried in opaque escapes that turn ordinary review into unreadable junk. Once that lands in a repo, even routine deps updates can turn into edge-case mismatch roulette.

Lint zero-width chars, sure. But if the actual sink is runtime string injection, banning eval is only half a fix because Function and friends still get you to the same bad place while the linter congratulates itself.
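To illustrate why a textual ban is only a partial fix, here is a minimal sketch of a lint pass that flags the obvious spellings of string-to-code sinks. The sink list and the name `findCodeSinks` are assumptions for illustration; a regex over source text cannot see computed property access such as `fn["constr" + "uctor"]`.

```javascript
// Hypothetical lint pass: flag the textual spellings of string-to-code sinks.
// Assumption: a regex over source text is good enough for a first pass. It
// will miss computed access such as fn["constr" + "uctor"].
const CODE_SINKS = [
  { name: "eval", pattern: /\beval\s*\(/ },
  { name: "Function constructor", pattern: /\bFunction\s*\(/ },
  { name: "string timer", pattern: /\bset(?:Timeout|Interval)\s*\(\s*["'`]/ },
];

function findCodeSinks(source) {
  const hits = [];
  source.split("\n").forEach((line, i) => {
    for (const sink of CODE_SINKS) {
      if (sink.pattern.test(line)) hits.push({ line: i + 1, sink: sink.name });
    }
  });
  return hits;
}
```

The direct calls are caught, but an obfuscated lookup of the constructor property produces no hits, which is the gap described above.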

trollbridge 1 days ago [-]

In our repos, we have some basic stuff like ruff that runs, and that includes a hard error on any Unicode characters. We mostly did this after some un-fun times when byte order marks somehow ended up in a file and it made something fail.

I have considered allowing a short list that does not include emojis, joining characters, and so on - basically just currency symbols, accent marks, and everything else you'd find in CP-1252, but never got around to it.

abound 1 days ago [-]

Yeah it would have been nice to end with "and here's a five-line shell script to check if your project is likely affected". But to their credit, they do have an open-source tool [1], I'm just not willing to install a big blob of JavaScript to look for vulns in my other big blobs of JavaScript

[1] https://github.com/AikidoSec/safe-chain

nine_k 1 days ago [-]

Something like this should work, assuming your encoding is Unicode (normally UTF-8), which grep would interpret:

  grep -P '[\x{200B}\x{200C}\x{200D}\x{FEFF}]' code.ts

See https://stackoverflow.com/q/78129129/223424

charcircuit 1 days ago [-]

Isn't that what this article is about? Advertising an av rule in their product that catches this.

codechicago277 1 days ago [-]

I wonder if this could be used for prompt injection: if you copy and paste the seemingly empty string into an LLM, does it understand? Maybe the affected Unicode characters aren’t tokenized.

ancillary 8 hours ago [-]

There's at least one paper (though pretty recent) about it: https://arxiv.org/html/2603.00164v1

jibal 13 hours ago [-]

Yes, and that happens.

anesxvito 22 hours ago [-]

The scary part is how invisible this is in code review. Unicode direction overrides and zero-width characters don't show up in most editors by default. Anyone know a solid pre-commit hook config that catches this reliably?
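
One possible shape for such a hook, as a sketch rather than a vetted config: a script that receives file names as arguments (which is how the pre-commit framework invokes hooks, as far as I know) and exits non-zero on any format character or variation selector. The names `is_suspicious`, `scan_file`, and `main` are made up for this example.

```python
import sys
import unicodedata


def is_suspicious(ch: str) -> bool:
    """Flag format characters (category Cf: zero-width characters, the BOM,
    bidi controls) and variation selectors, which are category Mn and are
    therefore matched by code point range."""
    cp = ord(ch)
    return (
        unicodedata.category(ch) == "Cf"
        or 0xFE00 <= cp <= 0xFE0F
        or 0xE0100 <= cp <= 0xE01EF
    )


def scan_file(path: str) -> list[str]:
    """Return one 'path:line:col: U+XXXX' string per suspicious character."""
    findings = []
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line_no, line in enumerate(fh, start=1):
            for col, ch in enumerate(line, start=1):
                if is_suspicious(ch):
                    findings.append(f"{path}:{line_no}:{col}: U+{ord(ch):04X}")
    return findings


def main(paths: list[str]) -> int:
    findings = [item for path in paths for item in scan_file(path)]
    for item in findings:
        print(item)
    return 1 if findings else 0


if __name__ == "__main__" and len(sys.argv) > 1:
    # Only act when file names were passed on the command line.
    sys.exit(main(sys.argv[1:]))
```

This is deliberately strict: it will also flag emoji that use a zero-width joiner or a single variation selector, so a real setup would probably exclude localization files or add an allowlist.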

invalidusernam3 9 hours ago [-]

eval is the major red flag here

herpdyderp 19 hours ago [-]

I keep seeing this and wondering if the ESLint default rules against weird characters would catch this? But I can’t figure out how to check.

CGamesPlay 19 hours ago [-]

Appears not to. https://claude.ai/share/ac070cf5-0034-4f3c-9a8c-1c43a58eea36

Claude’s analysis seems solid here based on reading the snippets it tested.

A purpose-built linter could be cross-language, it’s pretty reasonable to blanket ban these characters entirely, or at least allowlist them.

P-MATRIX 16 hours ago [-]

This gets a lot worse when a coding agent is in the loop. A human at least has a review step—an autonomous agent that reads a Glassworm-infected file just acts on it. The fix probably needs to happen at the tool result layer, before the payload ever enters the agent's context, not just on what the agent writes out.
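
A rough sketch of what filtering at the tool result layer could look like, assuming the agent framework lets you post-process file contents before they reach the model. `sanitize_tool_result` is a hypothetical function, not part of any agent SDK.

```python
import unicodedata


def _is_hidden(ch: str) -> bool:
    """Format (Cf), private use (Co), and unassigned (Cn) characters, plus
    variation selectors, none of which normally render as visible glyphs."""
    cp = ord(ch)
    return (
        unicodedata.category(ch) in {"Cf", "Co", "Cn"}
        or 0xFE00 <= cp <= 0xFE0F
        or 0xE0100 <= cp <= 0xE01EF
    )


def sanitize_tool_result(text: str) -> tuple[str, int]:
    """Replace hidden characters with visible <U+XXXX> markers.

    Returns the rewritten text and the number of replacements, so the
    caller can refuse or flag a result that contained any.
    """
    out = []
    replaced = 0
    for ch in text:
        if _is_hidden(ch):
            out.append(f"<U+{ord(ch):04X}>")
            replaced += 1
        else:
            out.append(ch)
    return "".join(out), replaced


clean, count = sanitize_tool_result("const x = 1;\u200b\ufe01")
print(count, clean)
```

Making the characters visible instead of silently dropping them keeps the evidence in the context, so the model or a human can see that something was hidden there.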

mhitza 1 days ago [-]

Their button animations almost "crash" Firefox mobile. As soon as I reach them the entire page scrolls at single digit FPS.

WalterBright 1 days ago [-]

Unicode should be for visible characters. Invisible characters are an abomination. So are ways to hide text by using Unicode so-called "characters" to cause the cursor to go backwards.

Things that vanish on a printout should not be in Unicode.

Remove them from Unicode.

pvillano 1 days ago [-]

Unicode is "designed to support the use of text in all of the world's writing systems that can be digitized"

Unicode needs tab, space, form feed, and carriage return.

Unicode needs U+200E LEFT-TO-RIGHT MARK and U+200F RIGHT-TO-LEFT MARK to switch between left-to-right and right-to-left languages.

Unicode needs U+115F HANGUL CHOSEONG FILLER and U+1160 HANGUL JUNGSEONG FILLER to typeset Korean.

Unicode needs U+200C ZERO WIDTH NON-JOINER to encode that two characters should not be connected by a ligature.

Unicode needs U+200B ZERO WIDTH SPACE to indicate a word break opportunity without actually inserting a visible space.

Unicode needs MONGOLIAN FREE VARIATION SELECTORs to encode the traditional Mongolian alphabet.
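
For anyone who wants to check how these characters are classified, Python's `unicodedata` module exposes the name and general category of each code point. This is only an inspection sketch, and `describe` is a made-up helper.

```python
import unicodedata


def describe(code_points):
    """Map 'U+XXXX' to (general category, official name) for each code point."""
    return {
        f"U+{cp:04X}": (unicodedata.category(chr(cp)), unicodedata.name(chr(cp)))
        for cp in code_points
    }


# Code points taken from the list above.
for label, (category, name) in describe(
    [0x200B, 0x200C, 0x200E, 0x200F, 0x115F, 0x1160]
).items():
    print(label, category, name)
```

The zero-width and directional marks come back as format characters (category `Cf`). As far as I can tell the two Hangul fillers are classified as letters instead, so a filter that only looks at `Cf` would not catch them.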

WalterBright 1 days ago [-]

[flagged]

bulbar 1 days ago [-]

That's a very narrow view of the world. One example: In the past I have handled bilingual English-Arabic files with switches within the same line, and Arabic is written from right to left.

There are also languages that are written from top to bottom.

Unicode is not exclusively for coding, to the contrary, pretty sure it's only a small fraction of how Unicode is used.

> Somehow people didn't need invisible characters when printing books.

They didn't need computers either so "was seemingly not needed in the past" is not a good argument.

WalterBright 24 hours ago [-]

> That's a very narrow view of the world.

Yes, it is. Unicode has undergone major mission creep, thinking it is now a font language and a formatting language. Naturally, this has led to making it a vector for malicious actors. (The direction reversing thing has been used to insert malicious text that isn't visible to the reader.)

> Unicode is not exclusively for coding

I never mentioned coding.

> They didn't need computers

Unicode is for characters, not formatting. Formatting is what HTML is for, and many other formatting standards. Neither is it for meaning.

pibaker 1 days ago [-]

> That's a very narrow view of the world.

But not one that would surprise anyone familiar with WalterBright's antics on this website…

WalterBright 23 hours ago [-]

At least my antics do not include insulting people.

jmusall 1 days ago [-]

The fact is that there were so many character sets in use before Unicode because all these things were needed or at least wanted by a lot of people. Here's a great blog post by Nikita Prokopov about it: https://tonsky.me/blog/unicode/

WalterBright 23 hours ago [-]

Sometimes you gotta say no. Trying to please every harebrained idea leads to madness.

Normalized code point sequences are another WTF feature.

jmusall 22 hours ago [-]

Of course! I bet there are tons of ideas that didn't make it into Unicode, for better or worse. Where you draw the line is kind of arbitrary. You, personally, can of course opt out of all of that by restricting yourself to ASCII only, for example. But the rest of the world will continue to use Unicode.

WalterBright 20 hours ago [-]

> restricting yourself to ASCII only

My early compilers used code pages to work with Japanese, French and German customers. The original idea of Unicode was absolutely brilliant and I was all for it. D was an early total adopter of Unicode (C and C++ followed years later). I rejected code page support for D.

Its mission was to support all the letters in all the languages, which was a good straightforward mission. But then came fonts, formatting, layout, rendering, casing, sort ordering, normalization, combining, vote-for-my-letter-and-I'll-vote-for-yours, emoji, icons, semantic meanings, elvish, people who invent things and campaign to put them in so they'll leave a mark in history, ...

WalterBright 1 days ago [-]

    Look Ma
    xt! N !
    e tee S
    T larip

(No Unicode needed.)

chongli 1 days ago [-]

Unicode is for human beings, not machines.

WalterBright 24 hours ago [-]

How does invisible Unicode text fit into that?

chongli 23 hours ago [-]

It's not text, it's control characters, which have always been in character sets going back to ASCII.

WalterBright 23 hours ago [-]

ASCII having a few obsolete control characters does not justify opening the floodgates.

zorpner 1 hours ago [-]

Over 25% of the original ASCII specification is control characters.
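
That figure is easy to check: counting the code points below 128 whose general category is `Cc` (control) gives the share directly. A short Python check:

```python
import unicodedata

# ASCII covers code points 0 through 127; category "Cc" marks control characters.
controls = [cp for cp in range(128) if unicodedata.category(chr(cp)) == "Cc"]
share = len(controls) / 128

print(len(controls), f"{share:.1%}")
```

The count covers 0x00 through 0x1F plus DEL (0x7F). Space is classified as a separator, not a control character, so it is not included.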

WalterBright 56 minutes ago [-]

True. And nearly all of them are obsolete. Many were intended for control flow on an interactive terminal, which have long since passed into obsolescence. When was the last time you embedded a CTRL-C in text? The only ones that matter any more are newline and space.

luke-stanley 1 days ago [-]

So we need a new standard problem due to the complexity of the last standard? Isn't unicode supposed to be a superset of ASCII, which already has control characters like new space, CR, and new lines? xD

WalterBright 1 days ago [-]

The only ones people use any more are newline and space. A tab key is fine in your editor, but it's been more or less abandoned as a character. I haven't used a form feed character since the 1970s.

tetha 1 days ago [-]

That ship has sailed, but I consider Unicode a good thing, yet I consider it problematic to support Unicode in every domain.

I should be able to use Ü as a cursed smiley in text, and many more writing systems supported by Unicode support even more funny things. That's a good thing.

On the other hand, if technical and display file names (to GUI users) were separate, my need for crazy characters in file names, code bases and such would be very limited. Lower ASCII for actual file names consumed by technical people is sufficient to me.

WalterBright 24 hours ago [-]

> That ship has sailed

Sure, but more crazy stuff gets added all the time.

ted_dunning 20 hours ago [-]

No need to remove them. Just make them visible for applications that don't need to render every language. Make that behavior optional as well in case you really want to name characters with Hangul or Tibetan.

Some middle ground so that you can use Greek letters in Julia might be nice as well.

But I don't see any purpose in using the Private Use Areas (PUA) in programming.

WalterBright 1 days ago [-]

Another dum dum Unicode idea is having multiple code points with identical glyphs.

Rule of thumb: two Unicode sequences that look identical when printed should consist of the same code points.

estebank 1 days ago [-]

If anything, Unicode should have had more disambiguated characters. Han unification was a mistake, and lower case dotted Turkish i and upper case dotless Turkish I should exist so that toUpper and toLower didn't need to know/guess at a locale to work correctly.
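
The Turkish case problem shows up in any locale-independent case mapping. Python's `str.upper()` and `str.lower()` work that way, so they make a convenient illustration; `case_round_trip` is just a helper for this example.

```python
def case_round_trip(s: str) -> str:
    """Uppercase and then lowercase using Python's locale-independent mappings."""
    return s.upper().lower()


dotless_i = "\u0131"  # LATIN SMALL LETTER DOTLESS I
dotted_I = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE

# Turkish expects "i" to uppercase to the dotted capital, but the default
# mapping produces plain "I".
print("i".upper() == "I")

# The dotless lowercase letter uppercases to plain "I" and does not come back.
print(case_round_trip(dotless_i) == "i")

# The dotted capital lowercases to "i" followed by COMBINING DOT ABOVE.
print(dotted_I.lower() == "i\u0307", len(dotted_I.lower()))
```

Because plain `I` and `i` are shared between Turkish and every other Latin-script language, the correct mapping depends on information that is not in the string itself.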

WalterBright 23 hours ago [-]

Characters should not have invisible semantics.

nswango 1 days ago [-]

So you think that the letters in the Greek and Cyrillic alphabets which are printed identically to the Latin A should not exist?

And, for example, Greek words containing this letter should be encoded with a mix of Latin and Greek characters?
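
To make the three letters concrete: they are separate code points, and compatibility normalization does not merge them. The sketch below also includes a crude mixed-script check that reads the script from the first word of the character name. That is a heuristic for illustration only (the real Script property is not exposed by `unicodedata`), and `scripts_used` is a made-up name.

```python
import unicodedata

LOOKALIKES = ["\u0041", "\u0391", "\u0410"]  # Latin A, Greek Alpha, Cyrillic A

for ch in LOOKALIKES:
    # NFKC leaves each of the three unchanged, so they never compare equal.
    unchanged = unicodedata.normalize("NFKC", ch) == ch
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unchanged)


def scripts_used(word: str) -> set[str]:
    """Guess the scripts in `word` from the first word of each letter's name.

    Heuristic only: 'LATIN SMALL LETTER A' yields 'LATIN'.
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in word
        if ch.isalpha()
    }


print(scripts_used("paypal"))
print(scripts_used("p\u0430ypal"))  # second letter is CYRILLIC SMALL LETTER A
```

An identifier or domain name that mixes scripts like the second example is the usual signal that lookalike detection tools key on.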

WalterBright 1 days ago [-]

> So you think that the letters in the Greek and Cyrillic alphabets which are printed identically to the Latin A should not exist?

Yes. Unicode should not be about semantic meaning, it should be about the visual. Like text in a book.

> And, for example, Greek words containing this letter should be encoded with a mix of Latin and Greek characters?

Yup. Consider a printed book. How can you tell if a letter is a Greek letter or a Latin letter?

Those Unicode homonyms are a solution looking for a problem.

bawolff 1 days ago [-]

> Yes. Unicode should not be about semantic meaning, it should be about the visual. Like text in a book.

Do you think 1, l and I should be encoded as the same character, or does this logic only extend to characters pesky foreigners use?

WalterBright 23 hours ago [-]

They are visually distinct to the reader.

debazel 18 hours ago [-]

That is entirely dependent on the font.

sukilot 19 hours ago [-]

[dead]

Yokohiii 1 days ago [-]

Unicode is about semantics not appearance. If you don't need semantics then use something different.

WalterBright 23 hours ago [-]

> Unicode is about semantics not appearance.

And that's where it went off the rails into lala land. 'a' can have all kinds of distinct meanings. How are you going to make that work? It's hopeless.

Yokohiii 22 hours ago [-]

It already works.

Tell me what the problem is and what your proposed solution would be.

WalterBright 19 hours ago [-]

Infer the meaning from the context.

    a) it's a bullet point
    b) a+b means a is a variable
    c) apple means a means the sound "aaaah"
    d) ape means a means the sound "aye"
    e) 0xa means a means "10"
    f) "a" on my test paper means I did well on it
    g) grade "a" means I bought the good bolts
    h) "achtung" means it's a German "a"

I didn't need 8 different Unicode characters. And so on.

Yokohiii 17 hours ago [-]

Your trolling is really rock bottom. All this already works fine. Millions of times, each day. Just once a week it fails because someone messed up. Not an issue.

WalterBright 14 hours ago [-]

I showed that there is no need for semantic information about the glyphs. It's more compelling to demonstrate a need for semantic information rather than just asserting it.

Yokohiii 5 hours ago [-]

so you contradict yourself because your context window is exhausted?

WalterBright 3 hours ago [-]

Since you insist on being rude, I shall exit.

Muromec 1 days ago [-]

>Yup. Consider a printed book. How can you tell if a letter is a Greek letter or a Latin letter?

I can absolutely tell Cyrillic к from the Latin k and Latin u from the Cyrillic и.

>should not be about semantic meaning,

It's always better to be able to preserve more information in a text and not less.

WalterBright 23 hours ago [-]

> I can absolutely tell Cyrillic к from the Latin k and Latin u from the Cyrillic и.

They look visually distinct to me. I don't get your point.

> It's always better to be able to preserve more information in a text and not less.

Text should not lose information by printing it and then OCR'ing it.

ted_dunning 20 hours ago [-]

But these characters only look identical in some fonts. Are you saying that if you change font, some characters in a string should change appearance and others should not?

And what about the round-trip rule?

And ligatures? Aren't those a semantic distinction?

WalterBright 19 hours ago [-]

> But these characters only look identical in some fonts.

That's a problem with the fonts.

> And what about the round-trip rule?

Print Unicode on paper, then ocr it, and you'll get different Unicode. Oh, and normalization.

> ligatures

Generally an issue with rendering.

> semantic distinction

Unicode isn't about semantics (or shouldn't be). Consider 'a'. It's used for all kinds of meanings.

Yokohiii 1 days ago [-]

What about numbers? Would they be assigned to arabic only? I guess someone will be offended by that.

While at it we could also unify I, | and l. It's too confusing sometimes.

WalterBright 23 hours ago [-]

> While at it we could also unify I, | and l. It's too confusing sometimes.

They render differently, so it's not a problem.

ted_dunning 20 hours ago [-]

They only render differently in some fonts, on some displays.

Yokohiii 22 hours ago [-]

totally not true :D

WalterBright 19 hours ago [-]

Look again at its rendering!

jeltz 1 days ago [-]

I don't think that would help much. There are also characters which are similar but not the same and I don't think humans can spot the differences unless they are actively looking for them which most of the time people are not. If only one of two glyphs which are similar appear in the text nobody would likely notice, expectation bias will fuck you over.

WalterBright 1 days ago [-]

I wonder how anybody got by with printed books.

wcoenen 1 days ago [-]

As far as I know, glyphs are determined by the font and rendering engine. They're not in the Unicode standard.

WalterBright 23 hours ago [-]

Fraktur (font) and italic (rendering) are in the Unicode standard, although Hackernews will not render them. (I suspect that the Hackernews software filters out the nuttier Unicode stuff.)

ted_dunning 20 hours ago [-]

One of the ground rules of Unicode is the round trip rule. You have to be able to translate to and from Unicode without loss of information.

WalterBright 20 hours ago [-]

They threw that out the window with normalization.
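
For readers who have not run into it, here is what normalization does to code point sequences, shown with Python's `unicodedata.normalize`. The sketch only demonstrates the mechanics and does not settle whether the round-trip rule is violated.

```python
import unicodedata

# "e" followed by COMBINING ACUTE ACCENT versus the precomposed letter.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed), composed == "\u00e9")

# NFD undoes that composition, so this pair converts in both directions.
print(unicodedata.normalize("NFD", composed) == decomposed)

# ANGSTROM SIGN has a singleton canonical mapping: NFC replaces it with
# LATIN CAPITAL LETTER A WITH RING ABOVE, and no normalization form
# restores the original code point.
angstrom = "\u212b"
normalized = unicodedata.normalize("NFC", angstrom)
print(normalized == "\u00c5", normalized == angstrom)
```

Normalization is something an application opts into, so text that is never normalized keeps its original code points. The loss only occurs once a tool in the pipeline normalizes.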

abujazar 1 days ago [-]

Invisible characters are there for visible characters to be printed correctly...

WalterBright 1 days ago [-]

I'll grant that a space and a newline are necessary. The rest, nope.

abujazar 1 days ago [-]

You're talking about a subset of ASCII then. Unicode is supposed to support different languages and advanced typography, for which those characters are necessary. You can't write e.g. Arabic or Hebrew without those "unnecessary" invisible characters.

WalterBright 23 hours ago [-]

Please explain why an invisible zero width "character" is necessary.

ted_dunning 20 hours ago [-]

To prevent ligatures from forming when you need that.

WalterBright 19 hours ago [-]

That's the job of a typesetting language.

krior 20 hours ago [-]

To mark linewrapping-breakpoints in strings.

WalterBright 19 hours ago [-]

Leave typesetting to a proper typesetting language, like Latex.

slim 20 hours ago [-]

if you write كلب which is an Arabic word written right to left in the middle of an English sentence, you want to preserve the order of the characters in the stream for computer processing purposes. meaning the character ك must come before the ل and after the e and the space with respect to the memory layout. whereas when displayed, it must be inverted to be legible. the solution is to have an invisible character that indicates a switch in text direction. if you were wondering, the situation where you want to write text in a foreign language within your text is very common outside English speaking countries.

WalterBright 19 hours ago [-]

Look I'm writing sdrawkcab (amazingly, I did it without using Unicode!). Layout is the job of your text formatting program. It's easy to fix a text editor to support right-to-left text entry.

The switch in text direction has resulted in malicious code injection attacks, as the reversed text becomes invisible. I had to change my compiler to reject those Unicode characters for that reason. It can be used in other cases to have hidden, malicious text.
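
A check along those lines is short. The sketch below reports the explicit bidirectional embedding, override, and isolate controls in a piece of source text. It is an illustration of the idea, not the logic of any particular compiler, and `find_bidi_controls` is a made-up name.

```python
# Explicit directional formatting characters and their usual abbreviations.
BIDI_CONTROLS = {
    "\u202a": "LRE",  # LEFT-TO-RIGHT EMBEDDING
    "\u202b": "RLE",  # RIGHT-TO-LEFT EMBEDDING
    "\u202c": "PDF",  # POP DIRECTIONAL FORMATTING
    "\u202d": "LRO",  # LEFT-TO-RIGHT OVERRIDE
    "\u202e": "RLO",  # RIGHT-TO-LEFT OVERRIDE
    "\u2066": "LRI",  # LEFT-TO-RIGHT ISOLATE
    "\u2067": "RLI",  # RIGHT-TO-LEFT ISOLATE
    "\u2068": "FSI",  # FIRST STRONG ISOLATE
    "\u2069": "PDI",  # POP DIRECTIONAL ISOLATE
}


def find_bidi_controls(source: str) -> list[tuple[int, int, str]]:
    """Return (line, column, abbreviation) for each bidi control in `source`."""
    hits = []
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col, BIDI_CONTROLS[ch]))
    return hits


print(find_bidi_controls('x = "a\u202eb"'))
```

The directional marks U+200E and U+200F are left out here because they do not reorder the surrounding text on their own; a stricter policy could add them to the table.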

Have you checked your SQL code for invisible backwards text that injects malware?

estebank 5 hours ago [-]

> Look I'm writing sdrawkcab

How would that work with Text-To-Speech output?

WalterBright 1 hours ago [-]

Good question! Two possibilities:

1. Tell the TTS program that the text is RTOL.

2. If the TTS program can speak Arabic, it can detect RTOL Arabic text.

The only purpose for RTOL English I can think of is to insert hidden text for malicious purposes.

uhoh-itsmaciek 1 days ago [-]

>Remove them from Unicode.

Do you honestly think this is a workable solution?

WalterBright 1 days ago [-]

Yes, absolutely. See my other replies.

eviks 1 days ago [-]

So you'd remove space and tab from Unicode?

moritzruth 1 days ago [-]

greatidea,whoneedsspacesanyway

WalterBright 1 days ago [-]

Spaces appear on a printout.

ted_dunning 20 hours ago [-]

As do tabs, ems, ens and quads.

WalterBright 19 hours ago [-]

Unicode shouldn't be a typesetting language. The proper tool for that is Latex.

bawolff 1 days ago [-]

Good luck with that given there are invisible characters in ascii.

Also this attack doesn't seem to use invisible characters, just characters that don't have an assigned meaning.

WalterBright 19 hours ago [-]

The only problematic one is CR which can be used to hide text on a glass terminal with a tty interface. I'd get rid of it if I could.

faangguyindia 1 days ago [-]

Back in the day I was on hacking forums where a lot of script kiddies used to make malicious code.

I am wondering, now that they have LLMs, are people using them to make new kinds of malicious code that are more sophisticated than before?

Yokohiii 1 days ago [-]

In this case LLMs were obviously used to dress the code up as more legitimate, adding more human or project relevant noise. It's social engineering, but you leave the tedious bits to an LLM. The sophisticated part is the obscurity in the whole process, not the code.

rvnx 22 hours ago [-]

This shows the failure of human reviews alone; an LLM-based reviewer would have caught it. The two approaches are complementary.

rhysfonixone 11 hours ago [-]

Exactly this. I think a hybrid approach is going to be mandatory before long, if it's not already. A well-prompted frontier-lab LLM would catch things like this easily.

chairmansteve 1 days ago [-]

eval() used to be evil....

Are people using eval() in production code?

like_any_other 1 days ago [-]

Invisible characters, lookalike characters, reversing text order attacks [1].. the only way to use unicode safely seems to be by whitelisting a small subset of it.

And please, everyone arguing the code snippet should never have passed review - do you honestly believe this is the only kind of attack that can exploit invisible characters?

[1] https://attack.mitre.org/techniques/T1036/002/

NoMoreNicksLeft 1 days ago [-]

Why can't code editors have a default-on feature where they show any invisible character (other than newlines)? I seem to remember Sublime doing this at least in some cases... the characters were rendered as a lozenge shape with the hex value of the character.

Is there ever a circumstance where the invisible characters are both legitimate and you as a software developer wouldn't want to see them in the source code?

ted_dunning 20 hours ago [-]

Check out emacs for options like this.

And, yes, there is a circumstance if you want to include Arabic or Hebrew in comments or strings. You need the zero width left-right markers to make that work.

hananova 1 days ago [-]

My hot take is that all programming languages should go back to only accepting source code saved in 7-bit ASCII. With perhaps an exception for comments.

krior 20 hours ago [-]

Yeah, fuck those non-english-speaking peasants /s.

hananova 11 hours ago [-]

I'm a non-english-speaking peasant. I code in English, because it's the lingua franca of coding, and because they form the only characters that you can reliably use everywhere.

Besides, that's why the ban only extends to syntax and string literals (use escapes instead), and not comments.

From my experience, the only two nationalities that insist on mixing their native languages with the mostly English syntax of programming languages are the French and the Japanese. And they can just suck it up for the other 8 billion of us.

iam_circuit 21 hours ago [-]

[dead]

aneyadeng 1 days ago [-]

[dead]

diven_rastdus 17 hours ago [-]

[dead]

efilife 12 hours ago [-]

AI comment

nulltrace 16 hours ago [-]

Grepping your own source for variation selectors is the easy part. The problem is nobody's doing that on what they install. A compromised upstream package lands those characters in your node_modules and your CI never looks twice. `npm audit signatures` catches some supply chain stuff but not this. Honestly surprised no package manager has a "scan installed files for suspicious Unicode" step yet.
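
Until a package manager ships that step, a post-install scan is not much code. This sketch walks a directory tree and reports files containing a long run of consecutive variation selectors, on the assumption that a single selector after an emoji is legitimate while a long run suggests an encoded payload. The threshold of 8, the extension list, and the function names are all arbitrary choices for illustration.

```python
import os


def is_variation_selector(ch: str) -> bool:
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def longest_selector_run(text: str) -> int:
    """Length of the longest run of consecutive variation selectors."""
    longest = current = 0
    for ch in text:
        if is_variation_selector(ch):
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def scan_tree(root, min_run=8, extensions=(".js", ".mjs", ".cjs", ".ts", ".json")):
    """Return (path, run length) for files whose longest run is >= min_run."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="replace") as fh:
                    run = longest_selector_run(fh.read())
            except OSError:
                continue  # unreadable file: skip it and keep scanning
            if run >= min_run:
                hits.append((path, run))
    return hits
```

Running `scan_tree("node_modules")` in CI after install would cover the dependency tree. It reads every matching file in full, so large trees will take a while.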

robutsume 1 days ago [-]

[dead]

aplomb1026 1 days ago [-]

[dead]

robutsume 22 hours ago [-]

[dead]

NooneAtAll3 13 hours ago [-]

[dead]

max_ 1 days ago [-]

I don't have to worry about any of this.

My clawbot & other AI agents already have this figured out.

stainlu 19 hours ago [-]

[flagged]
