This was mentioned recently and I had a look[0]. It seems the benchmark is not quite saying what people think it's saying, and the paper even mentions it. The benchmark is constructed by using a model (DeepSeek-Coder-V2-Lite) to filter out questions below a certain difficulty. That may leave easier problems in place for "low-resource" languages such as Elixir and Racket, since the filter model struggles with those languages and fails to weed out the simple problems. From the actual paper:
> Section 3.3:
> Besides, since we use the moderately capable DeepSeek-Coder-V2-Lite to filter simple problems, the Pass@1 scores of top models on popular languages are relatively low. However, these models perform significantly better on low-resource languages. This indicates that the performance gap between models of different sizes is more pronounced on low-resource languages, likely because DeepSeek-Coder-V2-Lite struggles to filter out simple problems in these scenarios due to its limited capability in handling low-resource languages.
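Put differently, a problem only makes it into the benchmark if the weak filter model fails it, so for languages the filter model is bad at, even trivial problems survive. A minimal sketch of that idea (my own illustration, not the paper's actual pipeline):

    # Hypothetical sketch of the difficulty filter: keep only the
    # problems the weak model cannot solve. If the weak model fails
    # even trivial Elixir problems, trivial problems stay in the set.
    defmodule DifficultyFilter do
      def keep_hard(problems, weak_model_solves?) do
        Enum.reject(problems, weak_model_solves?)
      end
    end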
At the same time I have used Claude Code on an Elixir codebase and it's done a great job. But for me, it's unclear whether it would have done a worse job if I had picked any other stack.
[0]: https://news.ycombinator.com/item?id=46646007
Wanted to second this. Been using AI extensively on a relatively large Phoenix / Elixir code base, and it mostly produces excellent results.
The features of Elixir that lead to good software are amplified with LLMs.
One thing that I would perhaps add to the article (or emphasise) is the clarity and quality of error messages in Elixir. In my opinion they're some of the best in the game. The vast majority of the time the error gives enough information to fix the problem very quickly.
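For example, a bad argument to a guarded function already tells you which clause failed and what it was called with (illustrative snippet; Pricing and the exact message wording are just my example):

    defmodule Pricing do
      def discount(price) when is_number(price) and price > 0, do: price * 0.9
    end

    Pricing.discount(-5)
    # ** (FunctionClauseError) no function clause matching in Pricing.discount/1
    #
    #     The following arguments were given to Pricing.discount/1:
    #
    #         # 1
    #         -5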
podlp 19 hours ago
I tried Elixir a few months back with several different models (GPT, Claude, and Gemini). I’m not an Elixir or BEAM developer, but the results were quite poor. I rarely got it to generate syntactically correct Elixir (let alone idiomatic Elixir). It often hallucinated standard library functions that didn’t exist. Since I had very little prior experience, steering the models didn’t go well. I’ve since been using them for JS/TS, Kotlin/Java, and a few other tasks where I’m much more familiar.
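A typical example of the kind of thing I mean (illustrative, not from my actual sessions): a plausible-sounding Enum.average/1, which isn't in the standard library, instead of composing the functions that do exist:

    scores = [80, 90, 100]

    # Hallucinated call: Enum.average/1 does not exist
    # Enum.average(scores)

    # Working Elixir: compose Enum.sum/1 and Kernel.length/1
    average = Enum.sum(scores) / length(scores)
    IO.puts(average)  # prints 90.0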
My takeaway was that these models excel at popular languages where there’s ample training material, but struggle where the languages change rapidly or are relatively “niche.” I’m sure they’ve since gotten better, so perhaps my perception is already out of date.
pjm331 1 day ago
I’ve had a fantastic experience building out an internal AI agent service using Elixir and Phoenix, after only dabbling with it in side projects for almost a decade.
OTP fits agents like a glove.
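As a rough illustration of why (a minimal sketch, not our actual service; AgentSession and the placeholder reply are made up): each agent can be its own supervised GenServer holding conversation state, so one agent crashing mid-request doesn't take the others down.

    # Minimal sketch: one process per agent, each holding its own state.
    defmodule AgentSession do
      use GenServer

      def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

      @impl true
      def init(opts), do: {:ok, %{model: opts[:model], messages: []}}

      @impl true
      def handle_call({:user_message, text}, _from, state) do
        # The LLM call would go here; if it crashes, only this
        # process dies and a supervisor can restart it.
        reply = "(model reply to: #{text})"
        {:reply, reply, %{state | messages: [text | state.messages]}}
      end
    end

    {:ok, pid} = AgentSession.start_link(model: "claude")
    GenServer.call(pid, {:user_message, "summarize this ticket"})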
pjmlp 1 day ago
If it doesn't target GPUs with the same kind of tooling that already exists for C++, Python, and Julia, it isn't.
flexagoon 23 hours ago
Did you even open the article? It claims it's the best language to use with AI, not the best language for developing AI.
pjmlp 23 hours ago
I did. For me, "to use with AI" implies developing as well.