Next.js App Router + React Server Components Demo

new
past
show
ask
show
jobs
submit

▲Extracting financial disclosure and police reports with OpenAI Structured Output (gist.github.com)

254 points by danso 265 days ago | 89 comments

synthc 260 days ago [-]

My first job (around 2010) was to extract events from financial news and police reports.

We built this huge system with tons of regexes, custom parsers, word lists, ontologies etc. It was a huge effort to get somewhat acceptable accuracy.

It is humbling to see that these days a 100 line Python script can do the same thing but better: AI has basically taken over my first job.

dataguy_ 260 days ago [-]

I can see this being true to a lot of old jobs, like my brother's first job that basically was to transcribe audio tapes. whisper can do it in no time, that's crazy.

danofsteel32 260 days ago [-]

I’ve had a similar experience extracting transactions from my PDF bank statements [1]. GPT-4o and GPT-4o-mini perform as well the janky regex parser I wrote a few years ago. The fact that they can zero shot the problem makes me think there’s a lot of bank statements in the training data.

[1] https://dandavis.dev/pnc-virtual-wallet-statement-parser.htm...

morkalork 260 days ago [-]

Well, your first job today would be writing that 100 line Python script then doing something 100x more interesting with the events than writing truck loads of regexs?

HWR_14 260 days ago [-]

No, his first job would be a more senior developer writing 100 line Python script instead of hiring an intern to write a truck load of RegExs. After that dev saved time just writing the script over mentoring/explaining/hiring the intern, that dev would then do the more interesting things with the events.

That is, his first job is now gone.

rcarmo 261 days ago [-]

I’ve had pretty dismal results doing the same with spreadsheets—even with the data nicely tagged (and numbers directly adjacent to the labels) GPT-4o would completely make up figures to satisfy the JSON schema passed to it. YMMV.

TrainedMonkey 261 days ago [-]

I wonder if adversarial model which looks at user input & LLM output and predicts whether output is accurate + maybe output what is not accurate. This worked pretty well for image generation.

alach11 260 days ago [-]

This is a common workflow for these sorts of problems. I've done similar a few times. The downside is the additional cost.

infecto 260 days ago [-]

On the flip side I have had a lot of success parsing spreadsheets and other tables into their markdown or similar representation and pulling data out of that quite accurately.

druskacik 260 days ago [-]

Data extraction is definitely one of the most useful functions of LLM, however, in my experience a large model is necessary for a reliable extraction - I tested smaller, open-weights models and the performance was not sufficient.

I wonder, did anyone try to fine-tune a model specifically for general formatted data extraction? My naive thinking is that this should be pretty doable - after all, it's basically just restructuring the content using mostly the same tokens as input.

The reason why this would be useful (in my case) is because while large LLMs are perfectly capable of extraction, I often need to run it on millions of texts, which would be too costly. That's the reason I usually end up creating a custom small model, which is faster and cheaper. But a general small extraction-focused LLM would solve this.

I thought about fine-tuning Llama3-1B or Qwen models on larger models outputs, but my focus is currently elsewhere.

msp26 260 days ago [-]

Have you looked into structured generation with a library like outlines? https://github.com/dottxt-ai/outlines

druskacik 260 days ago [-]

Yes! This library is great and definitely helps, but I still had problems with performance. For example, smaller models would still hallucinate when extracting a JSON field when the given field wasn't present in the text (I'd expect null, but it provided either an incorrect value from the text, or a totally made up value).

petercooper 260 days ago [-]

Jina did something related for extracting content from raw HTML and wrote about the techniques they used here: https://jina.ai/news/reader-lm-small-language-models-for-cle... .. in my tests, the 1.5bn model works extremely well, though the open model is non commercial.

chx 260 days ago [-]

How do you know the output has anything to do with the input? Hint: you don't. You are building a castle on quicksand. As always, the only thing LLMs are usable for:

https://hachyderm.io/@inthehands/112006855076082650

> You might be surprised to learn that I actually think LLMs have the potential to be not only fun but genuinely useful. “Show me some bullshit that would be typical in this context” can be a genuinely helpful question to have answered, in code and in natural language — for brainstorming, for seeing common conventions in an unfamiliar context, for having something crappy to react to.

> Alas, that does not remotely resemble how people are pitching this technology.

TrackerFF 261 days ago [-]

We used GPT 4o for more or less the same stuff. Got a boatload of scanned bills we had to digitize, and GPT really nailed the task. Made a schema, and just fed the model all the bills.

Worked better than any OCR we tried.

thenaturalist 261 days ago [-]

How are you going to find (not even talking about correcting) hallucinated errors?

If money is involved and the LLM produces hallucination errors, how do you handle monetary impacts of such errors?

How does that approach scale financially?

ClearAndPresent 261 days ago [-]

Indeed. I anticipate the next Post Office Scandal(1,2) attributed to LLMs.

1 https://en.wikipedia.org/wiki/British_Post_Office_scandal 2 https://www.postofficescandal.uk/

rolandog 260 days ago [-]

That's awful.

Reminds me of the Dutch childcare benefits scandal [0], where 26,000 families were unfairly labeled as having committed tax fraud (11,000 of which had been targeted via "risk profiling", as they had dual nationalities [1]). Bad policy + automation = disaster. The wikipedia article doesn't fully explain how some automated decisions were made (e.g. You had a typo in a form, therefore all previous benefits were clawed-back; if you owe more than €3.000,- then you're a fraudster and if you called to ask for clarification they wouldn't help you — you're officially labeled a fraudster, you see).

Edit: couldn't find a source for my last statement, but I remember hearing it in an episode of the great Dutch News podcast. I'll see if I can find it.

[0]: https://en.wikipedia.org/wiki/Dutch_childcare_benefits_scand...

[1]: https://www.dutchnews.nl/2021/02/full-scale-parliamentary-in...

rolandog 258 days ago [-]

Just for posterity, I couldn't find the specific podcast episode, but there are public statements from some of the victims [0] available online (translated):

> What personally hurt you the most about how you were treated? > Derya: 'The worst thing about all of this, I think, was that I was registered as a fraudster. But I didn't know anything about that. There was a legal process for it, but they blocked it by not telling me what OGS (intent gross negligence) entailed. They had given me the qualification OGS and that was reason for judges to send me home with rejections. I didn't get any help anywhere and only now do I realize that I didn't stand a chance. All those years I fought against the heaviest sanction they could impose on me and I didn't know anything. I worked for the government. I worked very hard. And yet I was faced with wage garnishment and had to use the food bank. If I had known that I was just a fraudster and that was why I was being treated like that, I wouldn't have exhausted myself to prove that I did work hard and could pay off my debts myself. I literally and figuratively worked myself to death. And the consequences are now huge. Unfortunately.'

[0]: https://www.bnnvara.nl/artikelen/hoe-gaat-het-nu-met-de-slac...

is_true 260 days ago [-]

We tried all models from openai and google to get data from images and all of them made "mistakes".

The images are tables with 4 columns and 10 rows of numbers and metadata above that are in a couple of fields. We had thousands of images already loaded and when we tried to check those previously loaded images we found quite a few errors.

infecto 260 days ago [-]

Multimodal LLMs are not up for these tasks imo. It can describe an image but its not great on tables and numbers. Now on the other hand, using something like Textract to get the text representation of the table and then feeding that into a LLM was a massive success for us.

is_true 260 days ago [-]

LLMs don't offer much value for our use case, almost all values are just numbers

infecto 260 days ago [-]

Then you should be using something like Textract or other tooling in that space. Multimodal LLMs are no replacement.

is_true 260 days ago [-]

We use opencv + tesseract and easyocr

thenaturalist 260 days ago [-]

Curious, did that make you "fall back" to more conservative OCR?

Or what else did you do to correct them?

is_true 260 days ago [-]

We already had an OCR solution. We were exploring models in case the information source changes

petercooper 260 days ago [-]

Not the OP, but if doing this at scale, I'd consider a quorum approach using several models and looking for a majority to agree (otherwise bump it for human review). You could also get two different approaches out of each model by using purely the model and external OCR + model and compare those too.

coredog64 260 days ago [-]

I’m working on a problem in this space, and that’s the approach I’m taking.

More detailed explanation: I have to OCR dense, handwritten data using technical codes. Luckily, the form designers included intermediate steps. The intermediate fields are amenable to Textract, so I can use a multimodal model to OCR the full table and then error check.

gloosx 261 days ago [-]

Did you finally balance out lol? If you didn't, would you approach finding a mistake by going through each bill manually?

marcell 261 days ago [-]

I’m making a free open source library for this, check it at http://github.com/fetchfox/fetchfox

MIT license. It’s just one line of code to get started: ‘fox.run(“get data from example.com”)’

thenaturalist 261 days ago [-]

How do you plan to address prompt injection/ poisoned data for a method that simply vacuums unchecked inputs into an LLM?

marcell 261 days ago [-]

It hasn’t been an issue yet, but I’m sure it will come up at some point. If you see a problem please file an issue.

thenaturalist 261 days ago [-]

So assuming it would be an issue, given that you’re building such a tool, what would your approach be?

If I put an invisible tag on my website and it tells your scraper to ignore all previous prompts, leak its entire history and send all future prompts and replies to a web address while staying silent about it, how would you handle that?

alach11 260 days ago [-]

A casual look at the source shows the architecture won't allow the attacks you're talking about. Since each request runs separately, there's no way for prompt injection on one request to influence a future request. Same thing for leaking history.

4ad 261 days ago [-]

What a sad state for humanity that we have to resort to this sort of OCR/scrapping instead of the original data being released in a machine readable format in the first place.

TrackerFF 261 days ago [-]

To be fair, there are some considerations here:

1) There's plenty of old data out there. Newspaper scans from the days before computers, or digitalization of the newspaper process. Or the original files simply got lost, so manually scanned pages is all you have.

2) There could be policies about making the data public, but in a way that discourages data scraping.

3) The providers of the data simply don't have the resources or incentives to develop a working API.

And many more.

blitzar 261 days ago [-]

What is even sadder is that this data (especially the more recent data) is entered first in machine readable formats then sliced and diced and spat out in a non-machine readable format.

jxramos 260 days ago [-]

I'd like to see financial transactions and purchases abide by some json format standard, metadata and a list of items with full product name, quantity purchased, total unit volume/amount of product, price, and unit price.

DrillShopper 260 days ago [-]

Yeah, wow, humanity is so stupid for not distributing the machine readable format for the local newspaper in 1920. Gosh we're just so dumb

tpswa 261 days ago [-]

Cool work! Correct me if I'm wrong, but I believe to use the new OpenAI structured output that's more reliable, the response_format should be "json_schema" instead of "json_object". It's been a lot more robust for me.

danso 261 days ago [-]

I may be reading the documentation wrong [0], but I think if you specify `json_schema`, you actually have to provide a schema. I get this error when I do `response_format={"type": "json_schema"}`:

     openai.BadRequestError: Error code: 400 - {'error': {'message': "Missing required parameter: 'response_format.json_schema'.", 'type': 'invalid_request_error', 'param': 'response_format.json_schema', 'code': 'missing_required_parameter'}}

I hadn't used OpenAI for data extraction before the announcement of Structured Outputs, so not sure if `type: json_object` did something different before. But supplying only it as the response format seems to be the (low effort) way to have the API infer the structure on its own

[0] https://platform.openai.com/docs/guides/structured-outputs/s...

ec109685 261 days ago [-]

I’ve been using jsonschema since forever with function calling. Does structured output just formalize things?

chaos_emergent 261 days ago [-]

function calling provides a "hint" in the form of a JSON schema for an LLM to follow. the models are trained to follow provided schemas. If you have really complicated or deeply nested models, they can become less stable at generating schema-conformant JSON.

Structured outputs apply a context-free grammar to the prediction generation so that, for each token generation, only tokens that generate a perfectly conformant JSON schema are considered.

The benefit of doing this is predictability, but there's a trade-off in prediction stability; apparently structured output can constrain the model to generate in a way that takes it off the "happy path" of how it assumes text should be generated.

Happy to link you to some papers I've skimmed on it if you're interested!

pmg0 261 days ago [-]

Could you share some of those papers? I had a great discussion with Marc Fischer from the LMQL team [0] on this topic while at ICML earlier this year. Their work recommended decoding to natural language templates with mad lib-style constraints to follow that “happy path” you refer to, instead of decoding to a (relatively more specific latent) JSON schema [1]. Since you provided a template and knew the targeted tokens for generation you could strip your structured content out of the message. This technique also allowed for beam search where you can optimize tokens which lead to the tokens contain your expected strings, avoiding some weird token concatenation process. Really cool stuff!

[0] https://lmql.ai/ [1] https://arxiv.org/abs/2311.04954

261 days ago [-]

throwup238 261 days ago [-]

Structured output uses "constrained decoding" under the hood. They convert the JSON schema to a context free grammar so that when the model samples tokens, invalid tokens are masked to have a probability of zero. It's much less likely to go off the rails.

philipwhiuk 260 days ago [-]

I'm deeply worried by the impact of hallucinations in this sort of tool.

beoberha 261 days ago [-]

Stuff like this shows how much better the commercial models are than local models. I’ve been playing around with fairly simple structured information extraction from news articles and fail to get any kind of consistent behavior from llama3.1:8b. Claude and chatGPT do exactly what I want without fail.

0tfoaij 261 days ago [-]

OpenAI stopped releasing information about their models after gpt-3, which was 175b, but the leaks and rumours that gpt-4 is an 8x220 billion parameter model are most certainly correct. 4o is likely a distilled 220b model. Other commercial offerings are going to be in the same ballpark. Comparing these to llama 3 8b is like comparing a bicycle or a car to a train or cruise ship when you need to transport a few dozen passengers at best. There are local models in the 70-240b range that are more than capable of competing with commercial offerings if you're willing to look at anything that isn't bleeding edge state of the art.

Baeocystin 261 days ago [-]

Any pointers on where we can check the best local models per amount of VRAM available? I only have consumer level cards available, but I would think something that just fits in to a 24Gb card should noticably outperform something scaled for an 8Gb card, yes?

fnord77 261 days ago [-]

lm studio tells you what models fit in your available RAM, with or without quantization

minimaxir 261 days ago [-]

The Berkeley Function-Calling Leaderboard tracks function calling/structured data performance from multiple models: https://gorilla.cs.berkeley.edu/leaderboard.html

Llama isn't on there but a few finetunes of it (Hermes) are OSS.

lolinder 261 days ago [-]

Llama 3 70B is on there, ranked 20.

int_19h 261 days ago [-]

Your problem isn't that you're using a local model. It's that you're using an 8b model. The stuff you're comparing it to is two orders of magnitude larger.

gdiamos 261 days ago [-]

I usually come to a different conclusion using the JSON output on Lamini, e.g. even with Llama 3.2 3B

https://lamini-ai.github.io/inference/json_output

Most of these models can read. If the relevant facts are in the prompt, they can almost always extract them correctly.

Of course bigger models do better on more complex tasks and reasoning unless you use finetuning or memory tuning.

dcreater 261 days ago [-]

You should probably disclose you're the founder of lamini.

Do you have any publicly available validation data demonstrating 100% json compliance?

gdiamos 260 days ago [-]

I am a founder. It’s not meant to be a secret.

Obviously I’m biased, but I also spend every day using tools like this.

Regarding json compliance, we have a formal grammar and a test suite. If you find a bug please report it. I’d appreciate having more test coverage.

A4ET8a8uTh0 261 days ago [-]

<< Stuff like this shows how much better the commercial models are than local models.

I did not reach the same conclusion so I would be curious if you could provide rationale/basis for your assessment in the link. I am playing with humble llama3 8b here and results for federal register type stuff ( without going into details ) was good for what I was expecting to be.. not great.

edit: Since you mentioned llama explicitly, could you talk a little about the data/source you are using for your resutls. You got me curious and I want to dig a little deeper.

kgeist 261 days ago [-]

In my tests, Llama 3.1 8b was way worse than Llama 2 13b or Solar 13b.

tpm 261 days ago [-]

In my experience the Qwen2-VL models are great at this.

thatcat 261 days ago [-]

I mean, those aren't comparable models. I wonder how the 405b version compares.

Tiberium 261 days ago [-]

You raise a valid point, but 4o is way smaller than 405B. And 4o mini that's described in the article is highly likely <30B (if we're talking dense models).

maleldil 261 days ago [-]

Is the size of OpenAI's models public, or is this guesswork?

qwe----3 261 days ago [-]

If your company has a lot of ex openai employees then you know ;)

And the public numbers are mostly right, the latest values are likely smaller now- they have been working on down sizing everything

1oooqooq 260 days ago [-]

if you're "parsing" structured or even semi structured data with a LLM.... sigh.

an true scotch engineer know tagged data goes into the other end. but I guess that doesn't align with openai target audience and business goals.

i guess that would be fine to clean the new training data... but then you risk extrapolating hallucinations

danso 260 days ago [-]

The financial disclosures example was meant to be a toy example; with the way U.S. House members file their disclosure reports now, everything should be in a relatively predictable PDF with underlying text [0], but that wasn't always the case [1]. I think this API would've been pretty helpful to orgs like OpenSecrets who in the past had to do record and enter this data manually.

(I wouldn't trust the API alone, but combine it with human readers/validators, i.e., let OpenAI do the data entry part, and have humans do the proofreading)

[0] https://disclosures-clerk.house.gov/public_disc/financial-pd...

[1] https://disclosures-clerk.house.gov/public_disc/financial-pd...

pooingcode 260 days ago [-]

I am a huge fan of using Stuctured Output to extract data.

Huge benefit that you can lock down model performance with as you fine-tune your prompt or extend out use cases.

I wrote about it here on my blog where i replaced a project’s prompt with Structured Output using Pydantic models https://amberwilliams.io/blogs/474b0361-cbc1-4fa5-b047-c042f...

Zaheer 261 days ago [-]

Made a small project to help extract structure from documents (pdf,jpg,etc -> JSON or CSV): https://datasqueeze.ai/

There's 10 free pages to extract if anyone wants to give it a try. I've found that just sending a pdf to models doesn't extract it properly especially with longer documents. Have tried to incorporate all best practices into this tool. It's a pet project for now. Lmk if you find it helpful!

261 days ago [-]

matchagaucho 261 days ago [-]

Similarly I've found old-school OCR is needed for more reliability.

MarkMarine 261 days ago [-]

I've been using this to OCR some photos I took of books and it's remarkable at it. My first pass was just a loop where I'd OCR, feed the text to the model and ask it to normalize into a schema but I found out just sending the image to the model and asking it to OCR and turn it into the shape of data I wanted was so much more accurate.

bagels 261 days ago [-]

Combining google's ocr with llm gives OCR superpowers. Tell the llm the text is from an ocr and ask it to correct it.

saturn8601 261 days ago [-]

That sounds like it could be very dangerous when the LLM gets it wrong...

bagels 261 days ago [-]

Depends what you're using it for. If you're relying on OCR, you've already got to accept some amount of error.

hackernewds 261 days ago [-]

Is this simply the OCR bits to feed to openai structured output?

artisandip7 261 days ago [-]

tried it works great, ty!

minimaxir 261 days ago [-]

> Note that this example simply passes a PNG screenshot of the PDF to OpenAI's API — results may be different/more efficient if you send it the actual PDF.

OpenAI's API only accepts images: https://platform.openai.com/docs/guides/vision

To my knowledge, all the LLM services that take in PDF input do their own text extraction of the PDF before feeding it to an LLM.

tyre 261 days ago [-]

or convert PDF to image and send that. We’ve done it for things that textract completely mangled, but sonnet has no problem. Especially tables built out of text characters from very old systems

ec109685 261 days ago [-]

I don’t think it does OCR. It’s able to use the structure of the PDF to guide the parsing.

mmsc 260 days ago [-]

Adding to the list of "now try it with"....

The SEC's EDGAR database (which is for SEC filings) is another nightmare ready to end. Extracting individual sections from a filing is, afaik, impossible pragmatically.

I tried making two parsers: https://github.com/MegaManSec/SEC-Feed-Parser and https://github.com/MegaManSec/SEC-sec-incident-notifier but they're just hacks.

Then just link it up to your automated investment platform and you're ready to go!

infecto 260 days ago [-]

Would you not want to read the XBRL from the filing? I thought those are now mandatory.

This is one of those interesting areas where its hard to innovate because the data is already available from most/all data vendors and its cheap and accurate enough that nobody is going to reinvent those processes but also too expensive for an individual to purchase.

derivagral 260 days ago [-]

My (admittedly aged) experience with XBRL is that each company was able to define its own fields/format within that spec, and that most didn't agree on common names for common fields. Parsing it wasn't fun.

infecto 260 days ago [-]

I have spotty education on the matter but I believe they all conform to a FASB taxonomy so there is at least a list of possible tags in use. I do wonder if any of the big data vendors actually use this though.

jsemrau 260 days ago [-]

The SEC has a well defined API with EDGAR.

https://jdsemrau.substack.com/p/mem0-building-a-sec-10k-anal...

jxramos 260 days ago [-]

Does anyone follow Vik's work? eg https://x.com/VikParuchuri/status/1846153661791011158

kiakiaa 259 days ago [-]

Fine-tuning smaller models specifically for data extraction could indeed save costs for large-scale tasks; I've found tools like FetchFox helpful for efficiently extracting data from websites using AI.

myflash13 260 days ago [-]

Is there an automated way to check results and reduce hallucinations? Would it help to do a second pass with another LLM as a sanity check to see if numbers match?

thibaut_barrere 260 days ago [-]

This is what I am implementing a the moment (together with sampling for errors).

Imanari 261 days ago [-]

Does anybody have experience with Azure Document Intelligence? How does it compare to OAIs extraction capabilities?

frays 260 days ago [-]

Very cool and real application of LLMs. Although hallucinations is still something to be very wary of.

andrewg4445 260 days ago [-]

This helped me a lot, thx.

black_13 261 days ago [-]

[dead]

Rendered at 22:06:36 GMT+0000 (Coordinated Universal Time) with Vercel.