I recently wanted to try to get an LLM to help me organize my photos. (I'm someone who takes a lot of photos when I travel and then backs them up to a hard drive -- assuming someday I'll organize them :)
I created a prompt to try to get an LLM to do high-level organization:
> Help categorize this photo for organization. Please output in JSON.
> First, add a field called "type" that is one of: Wildlife, Architecture, Landscape, People, or Other. Pick the one that most appropriately reflects the subject of the photo. Use Other if you don't feel confident in your answer.
> Next, if it is Wildlife, please add another field called "animal" that gives a general type of animal that is the focus of the photo. Use large, common types like Elephant, Bird, Lion, Fish, Antelope, etc. Do not add this field if your confidence is low.
> If the type of animal is Bird, add a field called "bird" that gives the common type of bird, if you can clearly determine it. For example: Eagle, Hummingbird, Vulture, Crow, etc.
> If it is an Architecture photo, and you can determine with good confidence what specific building (or city) it is a photo of, please add a field called "place" with that name. (Name only, please -- no description).
I've tried this with llama-vision using Ollama, and it worked reasonably well for the top-level categories -- a little less well for identifying specific birds or places. And it didn't always generate proper JSON (and sometimes added new fields to the JSON).
I also tried with Claude's API, and it seemed to work perfectly (for a small sample size).
It will be interesting to try with PaliGemma and see what I get.
I have like 50k photos, so I don't want to pay $$$ for the Claude API to categorize them all. It will be cool someday (soon?) for an open-source DAM to have something like one of these models available to call locally.
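A minimal sketch of that workflow, for reference: send the prompt plus one photo to a local model through Ollama's /api/chat endpoint, then keep only the fields the prompt actually asked for, since the model sometimes invents extra ones. The model tag and file paths below are placeholders.

```python
import base64
import json
import requests

# Placeholder: in practice this would be the full categorization prompt quoted above.
PROMPT = "Help categorize this photo for organization. Please output in JSON. ..."
ALLOWED = {"type", "animal", "bird", "place"}  # the only fields the prompt defines

def categorize(path):
    img_b64 = base64.b64encode(open(path, "rb").read()).decode()
    resp = requests.post("http://localhost:11434/api/chat", json={
        "model": "llama3.2-vision",  # placeholder model tag
        "messages": [{"role": "user", "content": PROMPT, "images": [img_b64]}],
        "stream": False,
    })
    raw = resp.json()["message"]["content"]
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not valid JSON this time; retry or skip the photo
    if not isinstance(data, dict):
        return None
    return {k: v for k, v in data.items() if k in ALLOWED}  # drop invented fields

print(categorize("/photos/IMG_0001.jpg"))  # placeholder path
```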
warangal 6 days ago
Disclaimer: I work on such a project[0]
I think a combination of CLIP and some face-recognition may solve your issues!
It just takes a path to your directory and can index all the images while preserving your folder hierarchy, along with high-quality face clustering. Indexing takes about 100 ms per image on a CPU.
Every combination can then be mixed and matched from a single interface. It doesn't take much to try, as the dependencies are very minimal. There is a self-contained app for Windows too. I have been looking for feedback, so I'm just plugging it here in case someone has a use case.
[0] https://github.com/eagledot/hachi
Promising, but I want to see more before I try it - could you invest a little in your README to list out features, maybe do a little Loom demo?
For a demo, I don't have many resources to host it - would a video showcasing the features help? Also, for Windows there is a portable app at https://github.com/eagledot/hachi/releases/download/v1.3/hac...
I agree it'll still be cool to be able to do it all locally.
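For the CLIP suggestion above, a minimal zero-shot sketch against the coarse buckets from the original prompt might look like this; the model choice, labels, and path are illustrative, using the Hugging Face transformers CLIP wrapper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

labels = ["wildlife", "architecture", "landscape", "people"]
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("/photos/IMG_0001.jpg")  # placeholder path
inputs = processor(text=[f"a photo of {label}" for label in labels],
                   images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
print(dict(zip(labels, probs.tolist())))  # e.g. {'wildlife': 0.91, ...}
```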
rsolva 6 days ago
The photo organizing software Ente [0] can do this, and it is packaged into a really neat product. I have not gotten around to trying the self-hosted version yet, but it is on my list!
[0] https://ente.io/ml
Recently played with something similar using Llama 3.2 Vision locally.
It worked pretty well, and was decently fast if the model fit in GPU RAM.
The main issue was prompt adherence. In my experience, prompt adherence drops significantly as you reduce the model size.
Llama 3.2 Vision seems to be tuned hard to provide a summary at the end, usually with some social commentary, and it was difficult to get the 8B model to avoid outputting it. Also, multiple if-this-then-that clauses, like in your prompt, were often ignored by the 8B and smaller models compared to the 90B.
I've tried the Gemma 2 model before for assistant tasks, and was very pleased with the 9B performance. It had good prompt adherence and performed well on various tasks. So looking forward to trying this.
I think the main part that's missing is that the classifications are not exposed to you in the UI. But I haven't used this feature much, and my instance isn't running right now (SSD failed, still trying to get it replaced), so I'm not able to check.
thorncorona 6 days ago
None of this requires a multimodal LLM.
You can do this with traditional CNNs. For place, use Geospy or another geospatial model.
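A sketch of the traditional-CNN route with a pretrained torchvision classifier. You would still need your own mapping from the 1000 ImageNet classes down to coarse buckets like Wildlife or Architecture; the path is a placeholder.

```python
import torch
from PIL import Image
from torchvision.models import ResNet50_Weights, resnet50

weights = ResNet50_Weights.DEFAULT           # pretrained ImageNet weights
model = resnet50(weights=weights).eval()
preprocess = weights.transforms()             # matching resize/crop/normalize

img = preprocess(Image.open("/photos/IMG_0001.jpg")).unsqueeze(0)  # placeholder path
with torch.no_grad():
    probs = model(img).softmax(dim=-1)[0]
top = probs.topk(5)
for p, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{weights.meta['categories'][idx]}: {p:.2f}")  # e.g. "African elephant: 0.87"
```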
visarga 6 days ago
Use the JSON mode in Ollama.
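A sketch of what that looks like against Ollama's REST API: setting format to "json" constrains generation to valid JSON, and newer releases also accept a full JSON schema there instead of the string. The model tag and path are placeholders.

```python
import base64
import json
import requests

img_b64 = base64.b64encode(open("/photos/IMG_0001.jpg", "rb").read()).decode()  # placeholder path
resp = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3.2-vision",  # placeholder model tag
    "prompt": 'Categorize this photo. Output JSON with a single field "type".',
    "images": [img_b64],
    "format": "json",            # JSON mode: the server only returns valid JSON
    "stream": False,
})
print(json.loads(resp.json()["response"]))
```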
pilooch 6 days ago
PaliGemma has proven easy to train and useful for fine-tuning. Its main drawback was not being able to handle multiple images without being partly retrained. This new version does not seem to support multiple images as input at once either.
Qwen2-VL does. This is typically useful for vision RAG.
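For comparison, roughly what multi-image input looks like with Qwen2-VL through transformers, following its model card; treat the model tag and the qwen_vl_utils helper as assumptions and double-check versions before relying on this.

```python
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # helper shipped by the Qwen team

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Two images in a single turn, the case PaliGemma does not handle out of the box.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///photos/a.jpg"},  # placeholder paths
        {"type": "image", "image": "file:///photos/b.jpg"},
        {"type": "text", "text": "Are these two photos of the same building?"},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)[0])
```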
sigmar 6 days ago
It is probably hard to come up with good benchmarks for VLMs like this, but I feel like the "Non entailment sentences" benchmark is ill-suited. The examples of non-entailment sentences included[1]: "There is a pile of horse manure in front of the horse." That is true if you mean "in front of the [photo subject] from the perspective of the camera," but I think they marked it as non-entailment because the pile is not in front of the horse's face(?)
[1] Page 20, https://arxiv.org/pdf/2412.03555
Does anyone know if they can output bounding box coordinates? Like "where is the ball" -> [50, 75, 150, 175].
So far CogVLM is the only one I've seen that works, but it's a bit of a pain to run.
xnx 6 days ago
Yes: "The initial four location tokens represent the coordinate of the bounding box, ranging from 0 to 1023. These coordinates are independent of the aspect ratio, as the image is assumed to be resized to 1024 x 1024."
https://developers.googleblog.com/en/gemma-explained-paligem...
They also have a Colab notebook with more examples linked in the article.
Unfortunately, without fine-tunes you can't have it just detect everything and return all detected objects, afaict.
https://huggingface.co/spaces/big-vision/paligemma