Next.js App Router + React Server Components Demo

new
past
show
ask
show
jobs
submit

▲Auto-Differentiating Any LLM Workflow: A Farewell to Manual Prompting (arxiv.org)

137 points by meame2010 261 days ago | 32 comments

snake_doc 257 days ago [-]

Holy unnecessary use of terminology to explain a reverse graph traversal. “Loss”, “gradients”, “differentiating”— no! stop!

This must be what AI hype actually is. Complete incoherent language to explain a very straight forward concept.

This is just: LLMs judging intermediate node outputs, and reverse traversing the graph while doing so until it modifies the original prompt.

currymj 257 days ago [-]

i am old enough to remember the opposite: people would try to sell deep learning to the mainstream ML community by pointing out that backprop is just message-passing on a Bayesian network with modified sum/product operations.

snake_doc 256 days ago [-]

Valid point, but at least that was mathematics. This paper isn’t even math, it’s a data control flow masquerading as math.

porridgeraisin 256 days ago [-]

Interestingly, backpropagation is the natural way I'd describe this process, not reverse graph traversal.

Background difference I suppose.

> This must be what AI hype actually is. Complete incoherent language to explain a very straightforward concept.

True, a lot of papers overdo the jargon just for hype purposes. My favorite funniest example is this one from Google Research (and universities) (have linked the paper review video below)

https://youtu.be/Pl8BET_K1mc

See the YouTube chapter about "Multidiffusion" (around 38minutes)

They spent multiple paragraphs formulating an "optimisation problem" which when peeled down amounts to taking the mean, just to be able to superficially cite their own previous paper.

Quite the sorry state of things.

upghost 257 days ago [-]

> - the task of crafting textual inputs to effectively direct LLMs -- remains difficult and labor-intensive

Damn, I knew we were lazy but describing prompting as labor-intensive is impressively lazy even to me.

Obviously reading the rest of the abstract was too labor intensive for me but I'm hoping I can just hook a probe up to my drool and it can infer my desires from that.

IanCal 257 days ago [-]

It is labour intensive to optimise a prompt though. Particularly for smaller models.

Setting parameters for any ml model is easy, but we'd call it labour intensive if we expected people to do it manually despite having evals. Instead we have ways of searching for and optimising settings. The methods for that are obvious for small cardinality discrete values or continuous variables. Less so for arbitrary text.

jyhong 257 days ago [-]

It is not labor-intensive if you want a prompt to work in 80%-90% of cases and humans are good at that. But it is labor-intensive if you want to make it to work at 99%. Then you need to go through many cases and "optimize" the prompt, which is the advantage of optimizer.

ionflow 257 days ago [-]

this made my day

xg15 258 days ago [-]

Just read the abstract so far. Sounds amazing, but just for the sake of understanding, what would be the inputs and outputs of such a system? If the prompt is generated, how do you tell the system what you'd like to have? And what is the ground truth that is trained against? Examples of the desired text?

amelius 258 days ago [-]

I think what they mean is intermediate prompts, i.e. the prompts that the system gives to itself when solving a problem that requires multiple stages.

xg15 258 days ago [-]

Ah, that makes sense. So (very) basically, they're putting a number of regular LLMs into a sort of compute chain/graph, where one LLM feeds into the other, then doing gradient descent on the whole chain at once, essentialy treating the boundaries between LLM n and LLM n+1 as "hidden layers"?

meame2010 257 days ago [-]

Author here. Yea, in this fashion. And it can create the feedback using llm as a backward engine

meame2010 261 days ago [-]

Implemented in AdalFlow:https://github.com/SylphAI-Inc/AdalFlow

hnuser123456 260 days ago [-]

Congrats on the paper! I read through some of the github docs and read through the paper, this sounds very impressive, but I'm trying to think of how to best use this in practice... is the idea that I could give some kind of high-level task/project description (like a Python project), and this framework would intelligently update its own prompting to avoid getting stuck and to continue "gaining skill" throughout the process of working on a task? Could this be used to build such a system? Very curious to learn more.

meame2010 260 days ago [-]

you need a training dataset, and a task pipeline that works. You can refer to this doc: https://adalflow.sylph.ai/use_cases/question_answering.html

hnuser123456 260 days ago [-]

Thank you, I missed the use cases section, that explains a lot. Nice documentation. Might play with this when I get home.

stevage 258 days ago [-]

Can anyone ELI5? Or at least a kind of layman's explanation?

rsfern 258 days ago [-]

Consider a complex LLM pipeline with multiple steps. each LLM evaluation has an associated prompt to shape the style/context of the response. Conventionally these prompts are treated like hyperparameters that have to be manually adjusted to get the desired behavior from the LLM.

This work introduces a way to treat these prompts like trainable parameters, updating them through automatic differentiation of some kind of supervised training loss.

For me it kind of feels like deep dream or style transfer, which use autograd to optimize the model inputs (instead of the parameters) to achieve some goal (like mixing the style and content of two input images)

alankarmisra 258 days ago [-]

This paper suggests that LLMs can be trained to handle multi-stage questioning by automatically optimizing prompts using feedback-based methods, improving their ability to process complex, multi-step interactions.

deadbabe 257 days ago [-]

less words makes more prompt

rfw300 257 days ago [-]

I always find myself baffled by “prompt optimization” frameworks. Do people really find themselves needing random perturbations of a fixed prompt to improve accuracy? It’s my experience that the challenging part of writing a prompt is figuring out what the task you want done is, and understanding which data you need to pass to the model to make the task achievable. None of that can be achieved by “optimizing” the prompt—the hard part is a layer of abstraction upward.

popalchemist 257 days ago [-]

It's useful in enterprise scenarios where you need a reliable outcome for some kind of programmatic task, and you are dealing with throughput of jobs in the thousands to hundreds of thousands.

joehewett 257 days ago [-]

depends what you're doing. If you're using ChatGPT via the UI for a one off question, sure. If you're prompting an LLM that is doing a critical task in production millions of times, minor improvements can have significant benefit

rfw300 257 days ago [-]

I have done the latter much more than the former. My experience has been the issues come from inputs that you don’t foresee, not reliability on in-distribution uses (which would be your “training” data for prompt optimization). And the worry is that this kind of optimization would lead to substantive revisions of the guidelines set out in the prompt, which could further compromise performance out of distribution.

To the extent that you need to eke out reliability on the margins, one is vastly better served by actual fine-tuning, which is available both for open-source models and most major proprietary models.

iceman_w 256 days ago [-]

I always thought that the point of instruction tuning and ability to use prompts to get the model to do 0 shot tasks was that you don't have to collect tons of example data/samples. The method proposed here requires you to have tons of data. If you have that, why not just fine tune the underlying model?

thom 258 days ago [-]

Wow, just when I’d accepted MIPRO in DSPy was magic, here we are. Things continue apace.

code_biologist 258 days ago [-]

Not light reading, but seems nifty: https://www.lycee.ai/blog/understanding-mipro-optimizer-dspy

meame2010 257 days ago [-]

Yup. The LLM-AutoDiff is just getting started. But it has proven generation-only without explicitly doing few-shot samples can be even more effective and create shorter final prompts

schappim 256 days ago [-]

Here is a basic implementation of this in Ruby: https://gist.github.com/schappim/ad8b4953486617c7f813751c8ee...

nullc 258 days ago [-]

requires a backwards trained llm no?

I don't think anyone has pretrained a remotely-close-to-SOTA sized backwards model.

seanhunter 258 days ago [-]

Haven’t read the paper yet just the abstract, but it sounds like it uses a backwards trained llm itself to generate prompts and examples but can do the autodiff on any llm.

meame2010 257 days ago [-]

We use gpt4o as the backward model. But I’m excited to try deepseek r1 as it has explicit reasoning available.

We are continuously adding more benchmarks to the paper with UTAustin.

Rendered at 03:33:04 GMT+0000 (Coordinated Universal Time) with Vercel.