Forensic linguists use grammar, syntax and vocabulary to help crack cold cases (thedial.world)
hnbad 17 days ago [-]
Popular crime shows like CSI (and arguably detective novels even before that) have created a false impression that forensics is an exact science that can perfectly narrow down the list of suspects to the exact person who did it.

But of course in reality it is much more complex. All forms of forensic evidence are vulnerable to noise: sure, that one interesting artifact may be evidence but it also may not be, and the inverse may be true for that perfectly ordinary thing everyone missed. A linguistic quirk can be a piece of evidence but it can also be accidental (e.g. nowadays it might also result from bad predictive typing or autocomplete, or even more recently, as others have pointed out, LLMs).

So all of these are in effect just probabilistic filters. And filters are only useful when your sample includes your target (i.e. if the actual perpetrator is a suspect and you have adequate data about them). And even then they may produce not only false positives but also false negatives, and these errors may compound as you combine more filters.

Forensic linguistics can be useful when you have a small set of suspects that you absolutely know includes the actual perpetrator. But otherwise it can send you on a wild goose chase or hurt the innocent.
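
To put rough numbers on the "probabilistic filter" point, here is a minimal sketch using Bayes' rule. It assumes exactly one true author among equally likely candidates, and the 90% hit rate and 1% false-positive rate are made-up illustrative values, not measured properties of any real forensic method.

```python
def posterior_guilt(pool_size, true_positive_rate, false_positive_rate):
    """Probability that a flagged person is the real author.

    Assumes exactly one author hidden among `pool_size` equally likely
    candidates (an illustrative simplification).
    """
    prior = 1.0 / pool_size
    flagged_and_author = true_positive_rate * prior
    flagged_and_innocent = false_positive_rate * (1.0 - prior)
    return flagged_and_author / (flagged_and_author + flagged_and_innocent)


# Hypothetical filter: flags the true author 90% of the time and an
# innocent person 1% of the time.
for pool in (5, 1_000, 1_000_000):
    print(pool, posterior_guilt(pool, 0.9, 0.01))
```

With these assumed rates, a match among 5 suspects is strong evidence (about 96%), while the same match drawn from a pool of 1,000 is right only about 8% of the time.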

psunavy03 17 days ago [-]
Trial lawyers have written about the "CSI Effect," where the existence of such shows produces jurors who now expect trials to contain the types of flashy scientific evidence they see on TV, and become less likely to convict even obviously-guilty people when this type of evidence can't be realistically produced.
alanh 17 days ago [-]
Here's your clew (sic) that the article may not be so reliable: It credits the FBI's linguistic analysis for locating the Unabomber, when in reality, it was his own brother who said “hey, it sounds like Ted.”
wglb 16 days ago [-]
Articles at the time noted that a significant part of the ongoing investigation was the FBI linguistic analysis of the manifesto but that the brother's own reading led to the identification. This article seems to be shading the story.
giarc 17 days ago [-]
I don't totally agree with you but I see where you are coming from. The article states "Eventually, the linguistic evidence was strong enough to issue a search warrant..." which could be referring to the brother pointing out the similar writing style or the FBI's assessment pointing to someone from Chicago. It's not clear.
lqet 17 days ago [-]
> Spellings such as “wilfully” for “willfully” and “clew” for “clue” pointed to someone from the Chicago area, for example. Eventually, the linguistic evidence was strong enough to issue a search warrant for the home of a reclusive mathematician named Theodore Kaczynski, raised in Chicago but living in rural Montana

One thing Kaczynski's brother noticed as particularly idiosyncratic was the consistent use of the phrase "you can’t eat your cake and have it too", which is usually phrased as "you can't have your cake and eat it too".

wizzwizz4 17 days ago [-]
> Indeed, this used to be the most common form of the expression until the 1930s–1940s, when it was overtaken by the have-eat variant.

https://en.wikipedia.org/w/index.php?title=You_can%27t_have_...

yunruse 17 days ago [-]
I wonder if this has some sort of preference for ablaut reduplication [0]? I don't have the vowel phonetics to hand, but "have your cake and eat it" seems to flow a little more smoothly than "eat your cake and have it".

[0] https://en.wikipedia.org/wiki/Reduplication#English

lcnPylGDnU4H9OF 17 days ago [-]
It sounds like reduplication is about individual words being repeated rather than a phrase.

> In linguistics, reduplication is a morphological process in which the root or stem of a word, part of that or the whole word is repeated exactly or with a slight change.

The ablaut section also seems to suggest that "eat" should come before "have" anyway.

> In ablaut reduplications, the first vowel is almost always a high vowel or front vowel (typically ɪ as in hit) and the reduplicated vowel is a low vowel or back vowel (typically æ as in cat or ɒ as in top).

I suspect you feel that it flows more smoothly because it's more familiar. You have to stop a brain process that's become a bit automatic to say that phrase and instead say it slightly differently.

xandrius 17 days ago [-]
It does make more sense to me though: of course I can have my cake and then eat it. But I can't eat it and still have it afterwards.

So the order implies the temporality of the actions.

seanhunter 17 days ago [-]
^ unabomber posts on hackernews.

Joking aside, the key word which is sometimes implied rather than included is “too”. The order isn’t important. The saying is both things can’t simultaneously be true.

jonathanlb 17 days ago [-]
> The order isn’t important.

I would argue that it is, given the semantic shift that "to have" has undergone.

“To have” historically has had a more tangible sense of holding or owning something in a lasting, physical way. For example, in medieval and early modern English documents, “to have” frequently referred to holding physical property or goods in a manner implying true, ongoing possession. For instance, the formula “to have and to hold,” found in English property grants and other legal charters dating back to at least the 13th century, specifies that the grantee possesses the land not just in theory, but in continuing, tangible stewardship. This phrase does not simply mean ownership on paper—it affirms the right to keep and maintain the property indefinitely.

Today, "to have" is more abstract, and implies enjoying a condition or availability. In modern English we often use “to have” for intangible states, experiences, or conditions, rather than strictly physical possession. We say we “have time” or “have a headache,” meaning we experience or hold a certain condition, not that we own a concrete object. Saying “I have an idea” frames “idea” as something you possess, but it’s more about the existence of that thought rather than controlling a physical thing. We “have a meeting,” which implies an event scheduled for us to attend, not an object we keep. Over time, “to have” evolved to mark various states—emotional, temporal, conceptual—thus shedding some of its older, property-focused sense and becoming a flexible verb denoting conditions or availability.

So, because we interpret “to have” as less tied to tangible possession, the original logic (that once you eat the cake, you cannot still "have it") doesn’t strike the ear as sharply when we switch the word order.

I would suggest adding a time marker like “then” (e.g., “You cannot eat your cake and then have it, too.”) emphasizes the sequence and delineates that the action of eating precedes the attempt at possession.

ChiMan 17 days ago [-]
I think we can expect sophisticated criminals to start using AI to rewrite their correspondence.
erehweb 17 days ago [-]
Perhaps similarly to how ransom notes would be written with letters from newspapers.
alganet 17 days ago [-]
That is by itself a linguistic choice that can be analyzed.

"Hey, that guy only communicates using cutouts from magazines, what a strange choice"

LLMs introduce all kinds of linguistic choices, and you can focus on those choices.

runamuck 17 days ago [-]
I think so. Along those lines, did the underworld learn the lesson from Bernardo Provenzano (who got caught due to reliance on a Caesar cipher) and step up their encryption?
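
For context on why a Caesar cipher offers so little protection, here is a minimal sketch of the textbook letter-shift version and a brute-force attack on it. It is the generic scheme, not a reconstruction of the exact encoding used in Provenzano's notes, and the function names are illustrative.

```python
import string

ALPHABET = string.ascii_uppercase


def caesar_shift(text, shift):
    """Shift each A-Z letter by `shift` positions; leave other characters alone."""
    out = []
    for ch in text.upper():
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            out.append(ch)
    return "".join(out)


def brute_force(ciphertext):
    """Return all 26 candidate plaintexts; a reader can spot the right one by eye."""
    return [caesar_shift(ciphertext, -k) for k in range(26)]


secret = caesar_shift("MEET AT NOON", 3)
candidates = brute_force(secret)
```

Because there are only 26 possible shifts, the whole key space can be tried by hand, which is why the scheme fails against anyone who suspects it is in use.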
tgv 17 days ago [-]
> According to forensic linguists, we all use language in a uniquely identifiable way that can be as incriminating as a fingerprint.

That's a bold and unproven statement, made worse because we can't really see that fingerprint.

crote 17 days ago [-]
It sounds like a fairly accurate statement to me, considering that there isn't a solid scientifically-based foundation behind fingerprint matching either. They aren't quite as unique as we've often been led to believe, and matching them is highly subjective with the same expert often interpreting the same comparison differently when provided with a different story for context.

Fingerprint matching of course isn't completely useless, but it's not as solid as you'd hope either.

tgv 17 days ago [-]
But when two sets of fingerprints are different, you can be fairly sure they're from different people. When the percentage of some feature is 20% in one text and 30% in another, you still can't conclude anything. I write in different registers in contexts such as personal emails, professional emails to a large group, professional emails to a direct colleague, a quick post on the internet, an 'app' to a friend in another country, a text message on a phone, etc. I even write them in different languages. It's hard to imagine there's a well-defined, properly grounded model that can unite those yet distinguish them from written output by other people.

And now LLMs are going to add more noise to these features...

ret54 17 days ago [-]
There was this tool that unmasked HN users' alt accounts 2 years back, using stylometric analysis of a previous comment dump.

AFAIU, the more people know about it, the better their expectations about real account privacy will be.

https://news.ycombinator.com/item?id=33755016

alex-moon 16 days ago [-]
A friend of a friend is a forensic linguist. It's her party trick! When she meets someone for the first time she'll chat to them for a bit, then suddenly say where they're from. She can often get it down to the individual suburb, if they're from the UK, or the specific region of another country. She can also tell you where you moved from and to, where you studied, etc. Suffice it to say, it's fun for everyone involved.
sgarland 17 days ago [-]
I wonder if distros like Tails [0] are going to start shipping lightweight LLMs specifically to reword messages. Though I’m also not sure how low of a spec you can go and still run an LLM decent enough to not be excruciatingly slow.

[0]: https://tails.net/

seanhunter 17 days ago [-]
You don’t need a special distro to run a local LLM, and something like ollama running llama3 8b would be just fine to reword a message on a normal laptop. Inference (i.e. actually using a model) is much, much less compute-intensive than training or finetuning.

I strongly suggest people try ollama - it takes a few minutes to set up, download a local model and you’re up and running. https://ollama.com/
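
As a rough sketch of what "reword a message" could look like in practice, the snippet below calls Ollama's local HTTP endpoint using only the Python standard library. It assumes an Ollama server is already running on the default port (11434) and that a model tagged `llama3` has been pulled. The prompt wording and the `build_request` and `reword` helper names are illustrative, not part of Ollama.

```python
import json
import urllib.request

# Default address of a locally running Ollama server (assumed, not verified here).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(message, model="llama3"):
    """Build the HTTP request asking the model to restate `message` neutrally."""
    prompt = (
        "Rewrite the following message in plain, neutral wording. "
        "Keep the meaning and reply with the rewritten text only.\n\n" + message
    )
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reword(message, model="llama3"):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(message, model)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `reword("...")` only works while the server is up. Nothing leaves the machine, which is the point for the privacy use case discussed here.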

sgarland 16 days ago [-]
I know – I've been playing with Ollama on my MBP and it's great.

My specific mention of Tails was because it's designed as an ephemeral OS for the paranoid / extremely security conscious, so it would make sense that they would consider something that allows further obfuscation of a user's identity.

blakesterz 17 days ago [-]
This type of work was used to find Satoshi at least once. This is the one I remember:

https://likeinamirror.wordpress.com/2013/12/01/satoshi-nakam...

khafra 17 days ago [-]
Relatedly, if you have a decently sized writing sample on the Internet, LLMs can do this to you at scale: https://www.lesswrong.com/posts/doPbyzPgKdjedohud/the-case-f...
z3t4 17 days ago [-]
Does anyone have any recommendations for software that can analyze text messages and then tell whether two users are the same person?

The use case is competitive gaming, where a player can get a major advantage by using several accounts. The software would be used for screening, to detect accounts that are suspiciously alike.
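
Not a software recommendation, but a minimal sketch of one common stylometric baseline, character n-gram profiles compared by cosine similarity, might look like the following. The function names are illustrative, and any cutoff for "suspiciously alike" would have to be calibrated on real data. Raw scores also reflect shared topic and game jargon, not just authorship.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams after lowercasing and collapsing whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def style_similarity(text_a, text_b, n=3):
    """Score in [0, 1]; higher means the two texts share more character patterns."""
    return cosine_similarity(char_ngrams(text_a, n), char_ngrams(text_b, n))
```

In a screening setup, each account's messages would be concatenated into one profile, and only the highest-scoring pairs would go to a human for review, since a similarity score alone proves nothing.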

vzaliva 17 days ago [-]
The article touches on the impact of AI in this field but doesn’t mention a potential issue: the possibility of AI being used to rewrite text in a way that makes it unrecognisable and impossible to trace.
Temporary_31337 17 days ago [-]
Cool 20th-century story, but mass use of LLM-generated text will obfuscate any such individual differences. My kids already use LLMs extensively to verify and sometimes completely generate their homework.
llm_nerd 17 days ago [-]
A fun related tool someone posted on here once:

https://news.ycombinator.com/item?id=33755016

The tool seems to be dead, but I'm linking to it for the related discussion.

sidewndr46 17 days ago [-]
nice, between this and bite mark science no criminal is going to be able to escape punishment
TruffleLabs 17 days ago [-]
Reminds me of

"Eats, shoots, and leaves"

Or is it

"Eats shoots and leaves"?

;)

Book "Eats, Shoots & Leaves: The Zero Tolerance Approach to Punctuation"

https://en.wikipedia.org/wiki/Eats,_Shoots_&_Leaves

Suppafly 17 days ago [-]
or the hilarious "help your uncle jack off a horse" vs "help your uncle, Jack, off a horse" example. although it does also involve caps.
seanhunter 17 days ago [-]
Or the famous:

Time to eat grandma

Vs

Time to eat, grandma

karaterobot 17 days ago [-]
Yes, a comma can solve a crime. You see, the panda didn't actually shoot anyone, it was just eating shoots and leaves.