Next.js App Router + React Server Components Demo

NHacker Next

new
past
show
ask
show
jobs
submit

▲SmolDocling: An ultra-compact VLM for end-to-end multi-modal document conversion (arxiv.org)

66 points by prats226 324 days ago | 12 comments

Oras 323 days ago [-]

After many posts on my feed, I decided to give it a spin.

The good: - Open source.

- Can run locally (Apple Silicon) at a fair speed.

- Image detection is good.

The bad:

- Not detecting tables.

- Text in a perfectly clean PDF (resume) is not detected.

I know its in preview, small and open source which is great, but its far from being usable.

daemonologist 324 days ago [-]

Well it's certainly small. Absolutely bombs my KTANE test though - poor character recognition, poor handling of even mildly complex tables, and prone to getting stuck in repetition loops. (Task was convert to docling, in the official HF space.)

That said, I'm definitely glad to see work in this area, particularly with open weights.

sitkack 323 days ago [-]

https://news.ycombinator.com/item?id=43431609 mentioned that it was fine tuned from https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Instruct

It would be interesting to see if fine tuning on your KTANE test improves your results?

dartos 323 days ago [-]

I may be missing something, but if you train a model to pass a specific test… isn’t it obvious that it would do better on that test?

I thought we called models with test data in their training set “poisoned”

sitkack 323 days ago [-]

That is how training is done. You don't train on all of your material or the tests are worthless.

dartos 323 days ago [-]

If you train on any test material, the tests are worthless.

t1amat 323 days ago [-]

Regarding the repetition loops, I found that adding the end of turn token to the stop param was enough. Documentation mentions detecting this.

But your point about quality stands. Separately, this model emits the docling XML format, not the JSON format, so as far as I know today that means you are using the Python flavored docling only, the JS variant does not support this yet (afaik).

jeyzolo 316 days ago [-]

It also can be used here:https://www.smoldocling.net, works well!

bugglebeetle 324 days ago [-]

What’s the best library for fine-tuning VLMs at the moment and do they support this architecture or that for the IBM Granite vision models? Document understanding tasks seem in special need of fine-tuning.

_ea1k 324 days ago [-]

It looks like the model itself is here: https://huggingface.co/ds4sd/SmolDocling-256M-preview

It was fine tuned from this: https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Instruct

There's an example of fine tuning the base that would likely be applicable to this one as well.

th0ma5 324 days ago [-]

Does seem comparable to Tesseract? I feel like the accuracy results are still not significantly improved as a whole.

nanoxid 323 days ago [-]

OCR is not the task being solved here, though. This is supposed to help you when dealing with complex layouts where text is not just read left-to-right, top-to-bottom.

But I agree that accurate OCR is kind of a prerequisite for adaptation.

momozolo 322 days ago [-]

[dead]

Rendered at 17:29:23 GMT+0000 (Coordinated Universal Time) with Vercel.