PaliGemma 2: Powerful Vision-Language Models, Simple Fine-Tuning (developers.googleblog.com)
minimaxir 6 days ago [-]
Hugging Face's blog post on the release is more technical: https://huggingface.co/blog/paligemma2
xnx 6 days ago [-]
Even more technical detail here: https://arxiv.org/html/2412.03555v1
timmg 6 days ago [-]
I recently wanted to try getting an LLM to help me organize my photos. (I'm someone who takes a lot of photos when I travel and then backs them up to a hard drive -- assuming someday I'll organize them :)

I created a prompt to try to get an LLM to do high-level organization:

> Help categorize this photo for organization. Please output in JSON.

> First, add a field called "type" that is one of: Wildlife, Architecture, Landscape, People, or Other. Pick the one that most appropriately reflects the subject of the photo. Use Other if you don't feel confident in your answer.

> Next, if it is Wildlife, please add another field called "animal" that gives a general type of animal that is the focus of the photo. Use large, common types like Elephant, Bird, Lion, Fish, Antelope, etc. Do not add this field if your confidence is low.

> If the type of animal is Bird, add a field called "bird" that gives the common type of bird, if you can clearly determine it. For example: Eagle, Hummingbird, Vulture, Crow, etc.

> If it is an Architecture photo, and you can determine with good confidence what specific building (or city) it is a photo of, please add a field called "place" with that name. (Name only, please -- no description).

I've tried it with llama-vision using Ollama, and it worked reasonably well for the top-level categories, a little less well at identifying specific birds or places. It also didn't always generate proper JSON (and sometimes added new fields to the JSON).

I also tried with Claude's API -- and it seemed to work perfectly (for a small sample size).

It will be interesting to try with PaliGemma and see what I get.

I have like 50k photos, so I don't want to pay $$$ for the Claude API to categorize them all. It will be cool someday (soon?) for an open-source DAM to have something like one of these models available to call locally.
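For anyone who wants to try the same thing, a Claude API call for one photo looks roughly like this (a minimal sketch; the model name, media type, and file path are placeholder assumptions, not the exact script used above):

    import base64
    import anthropic

    # The categorization prompt from the comment above, shortened here.
    PROMPT = "Help categorize this photo for organization. Please output in JSON. ..."

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    with open("photo.jpg", "rb") as f:  # placeholder path
        image_b64 = base64.standard_b64encode(f.read()).decode()

    message = client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumed model name
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/jpeg", "data": image_b64}},
                {"type": "text", "text": PROMPT},
            ],
        }],
    )
    print(message.content[0].text)  # should be the small JSON object described above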

warangal 6 days ago [-]
Disclaimer: I work on such a project[0]

I think a combination of CLIP and some face recognition may solve your issues! It just takes a path to your directory and can index all the images while preserving your folder hierarchy, along with high-quality face clustering. Indexing each image takes about 100 ms on a CPU. Every combination can then be mixed and matched from a single interface. It doesn't take much to try, as the dependencies are very minimal. There is a self-contained app for Windows too. I have been looking for feedback, so I'm just plugging it here in case someone has a use case.

[0] https://github.com/eagledot/hachi
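As a taste of the CLIP part, zero-shot matching against the parent's categories is only a few lines with the stock OpenAI checkpoint (a rough sketch, not hachi's actual pipeline; the checkpoint name and label prompts are assumptions):

    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    labels = ["wildlife", "architecture", "landscape", "people"]  # coarse buckets from the parent comment
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("photo.jpg")  # placeholder path
    inputs = processor(text=[f"a photo of {l}" for l in labels], images=image,
                       return_tensors="pt", padding=True)
    # Similarity of the image to each label prompt, normalized to probabilities.
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    print(dict(zip(labels, probs.tolist())))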

swyx 5 days ago [-]
> https://github.com/eagledot/hachi

Promising, but I want to see more before I try it - could you invest a little in your readme to list out the features, maybe do a little Loom demo?

warangal 5 days ago [-]
The readme for images is at https://github.com/eagledot/hachi/tree/main/images/readme.md and has more than enough details! It is supposed to be a search engine for all modalities; for now, `images` are supported!

As for a demo, I don't have the resources to host one. Would a video showcasing the features help?

Also, for Windows there is a portable app at https://github.com/eagledot/hachi/releases/download/v1.3/hac...

swyx 5 days ago [-]
Yeah, a simple Loom/YouTube video works well!
senko 6 days ago [-]
Simonw estimates it'd cost less than $10 to categorize 67k+ photos using Amazon Nova: https://simonwillison.net/2024/Dec/4/amazon-nova/#gamoa

I agree it'll still be cool to be able to do it all locally.

rsolva 6 days ago [-]
The photo-organizing software Ente [0] can do this, and it is packaged into a really neat product. I have not gotten around to trying the self-hosted version yet, but it is on my list!

[0] https://ente.io/ml

magicalhippo 6 days ago [-]
Recently played with something similar using Llama 3.2 Vision locally.

Worked pretty well, and it was decently fast if the model fit in GPU RAM.

The main issue was prompt adherence, which in my experience goes down significantly when you reduce the model size.

Llama 3.2 Vision seems to be tuned hard to provide a summary at the end, usually with some social commentary, and it was difficult to get the 11B model to avoid outputting it. Multiple if-this-then-that clauses, like in your prompt, were also often ignored by the 11B and smaller models compared to the 90B.

I've tried the Gemma 2 model before for assistant tasks, and was very pleased with the 9B performance. It had good prompt adherence and performed well on various tasks. So looking forward to trying this.

nulld3v 4 days ago [-]
Are you aware of Immich? It's open source and fairly polished, and I believe it does mostly everything you were trying to do: https://immich.app/docs/features/smart-search

I think the main missing part is that the classifications are not exposed to you in the UI. But I haven't used this feature much, and my instance isn't running right now (SSD failed, still trying to get it replaced), so I'm not able to check.

thorncorona 6 days ago [-]
None of this requires a multimodal LLM.

You can do this with traditional CNNs. For place, use Geospy or another geospatial model.
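For the coarse buckets, a plain ImageNet classifier already gets you most of the way. A minimal torchvision sketch (you would still have to map the 1000 ImageNet labels onto categories like Wildlife/Architecture yourself; the path is a placeholder):

    import torch
    from PIL import Image
    from torchvision import models

    weights = models.ResNet50_Weights.IMAGENET1K_V2
    model = models.resnet50(weights=weights).eval()
    preprocess = weights.transforms()  # matching resize/crop/normalize pipeline

    img = Image.open("photo.jpg").convert("RGB")  # placeholder path
    with torch.no_grad():
        logits = model(preprocess(img).unsqueeze(0))
    print(weights.meta["categories"][logits.argmax().item()])  # e.g. "hummingbird"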

visarga 6 days ago [-]
Use the JSON mode in Ollama.
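With the Python client, that looks roughly like this (the model tag and message shape are assumptions; format="json" is what keeps the output parseable):

    import json
    import ollama

    resp = ollama.chat(
        model="llama3.2-vision",   # assumed vision-capable model tag
        format="json",             # ask Ollama to constrain the output to valid JSON
        messages=[{
            "role": "user",
            "content": "Help categorize this photo for organization. Please output in JSON. ...",
            "images": ["photo.jpg"],  # placeholder path to the image on disk
        }],
    )
    print(json.loads(resp["message"]["content"]))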
pilooch 6 days ago [-]
PaliGemma proves easy to train and useful for fine-tuning. Its main drawback was not being able to handle multiple images without being partly retrained. This new version does not seem to support multiple images as input at once. Qwen2-VL does. That is typically useful for vision RAG.
sigmar 6 days ago [-]
It is probably hard to come up with good benchmarks for VLMs like this, but the "non-entailment sentences" benchmark seems ill-suited to me. The example non-entailment sentences included[1]: "There is a pile of horse manure in front of the horse." That is true if you mean "in front of the [photo subject] from the perspective of the camera," but I think they marked it as non-entailment because the pile is not in front of the horse's face(?)

[1] page 20 https://arxiv.org/pdf/2412.03555

turnsout 6 days ago [-]
Does anyone know how this stacks up against other multimodal vision models?
mountainriver 6 days ago [-]
They do an exceptionally poor job at evaluating it against competitors.
swyx 5 days ago [-]
how about leaderboards that can pop them in?
lofaszvanitt 2 days ago [-]
Why does that matter? Everyone has their own uses for a VLM; compare them on the task at hand.
dmvdoug 6 days ago [-]
Saw name, was expecting something to do with wreaking AI upon the Pali Canon (https://en.m.wikipedia.org/wiki/Pali_Canon).
exe34 6 days ago [-]
Does anyone know if they can output bounding box coordinates? Like "where is the ball" -> [50, 75, 150, 175].

So far CogVLM is the only one I've seen that works, but it's a bit of a pain to run.

xnx 6 days ago [-]
Yes: "The initial four location tokens represent the coordinate of the bounding box, ranging from 0 to 1023. These coordinates are independent of the aspect ratio, as the image is assumed to be resized to 1024 x 1024."

https://developers.googleblog.com/en/gemma-explained-paligem...
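Turning those tokens back into pixel boxes is just a scaling step. A rough sketch (the regex and helper are mine; the (y1, x1, y2, x2) order and the division by 1024 are assumptions based on the post and the comments below):

    import re

    # Four <locNNNN> tokens per box, followed by the label.
    LOC_RE = re.compile(r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)")

    def parse_detections(text, img_w, img_h):
        boxes = []
        for y1, x1, y2, x2, label in LOC_RE.findall(text):
            boxes.append({
                "label": label.strip(),
                "x1": int(x1) / 1024 * img_w,
                "y1": int(y1) / 1024 * img_h,
                "x2": int(x2) / 1024 * img_w,
                "y2": int(y2) / 1024 * img_h,
            })
        return boxes

    print(parse_detections("<loc0075><loc0050><loc0175><loc0150> ball", 640, 480))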

exe34 6 days ago [-]
Thank you! Will have a play with that.
__jl__ 6 days ago [-]
Gemini is surprisingly good at this. Look at example 5 here: https://developers.googleblog.com/en/7-examples-of-geminis-m...

They also have a colab notebook with more examples linked in the article.

exe34 6 days ago [-]
I meant weights-available ones, but thank you!
jsight 5 days ago [-]
Yes, just use "detect ball" as the prompt. It will give you y1,x1,y2,x2 coordinates on a scale of 0-1024. It is really good at this.

Unfortunately, without fine-tuning you can't have it just detect everything and return all detected objects, AFAICT.
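A minimal, untested sketch of that detection prompt with Hugging Face transformers (the checkpoint id is an assumption, and recent processor versions expect the <image> placeholder in the prompt):

    from PIL import Image
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    model_id = "google/paligemma2-3b-pt-448"  # assumed checkpoint id
    processor = AutoProcessor.from_pretrained(model_id)
    model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)  # default dtype/CPU for simplicity

    image = Image.open("ball.jpg")  # placeholder path
    inputs = processor(text="<image>detect ball", images=image, return_tensors="pt")
    generated = model.generate(**inputs, max_new_tokens=32, do_sample=False)

    # Decode only the newly generated tokens; they hold the <loc####> box and label.
    new_tokens = generated[0][inputs["input_ids"].shape[-1]:]
    print(processor.decode(new_tokens, skip_special_tokens=False))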

exe34 4 days ago [-]