Trying Out Image Summarization with Ollama
A few months ago I got more serious about doing AI things at home and switched from using online services (Gemini and OpenAI) to locally-hosted models. While the change was partially motivated by a desire to wean myself off the big AI companies, I've also been wanting to do something with the unused GPU that my son left behind when he went off to college. I initially went down the HuggingFace path of manually downloading models and using the Transformers library to code up everything. More recently I've switched to using Ollama, which automates the whole process and gives a nice OpenAI-compatible API that I can talk to with code or hook up to OpenWebUI. My son's 12GB RTX 3080 barely qualifies as an entry-level GPU for AI work, but it's good enough that 3-5B parameter models are fairly responsive.
Lately I've been looking into how well vision models like Qwen2.5-VL and Gemma3 can summarize images. My motivation is that eventually I'd like to scan my home photos and have a vision LLM write a one-line text summary of each picture so I could then search for things via grep. Since I won't be getting a larger GPU anytime soon, I've been experimenting with smaller models like qwen2.5vl:7b, gemma3:12b-it-qat, and llama3.2-vision:11b. While Qwen is the only one that really works well with 12GB of VRAM, the others are small enough that my son's 8-core CPU can process prompts in 10-30 seconds. In this post I wanted to capture some of my observations on quality when using these smaller models.
Simple Prompting
After some trial and error, I settled on the prompt below. There's a battle story behind every clause of that prompt. I started with something similar to "describe this picture", but quickly found that all the models generated lengthy paragraphs about the image, usually with irrelevant details, extra formatting, and lead-in sentences (e.g., "Here is a description of the image:"). Adding "no formatting" removed the markdown and took away some of the writing primitives the models use to organize content into bullet lists. Adding "single sentence" usually constrained the models to shorter responses. Finally, adding "concisely" helped some of the models choose their details more carefully. While I noticed that "concisely" sometimes caused a model to omit key details it had reported in previous responses, I appreciate having the answer boiled down to as short a form as possible.
Write a single sentence with no formatting that concisely describes this picture.
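For reference, sending that prompt along with an image to Ollama's OpenAI-compatible endpoint looks roughly like the sketch below. The localhost URL is Ollama's default, and the image filename is just a placeholder, not one of my actual files.

import base64
from openai import OpenAI

# Ollama's OpenAI-compatible endpoint; the API key can be any placeholder string.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

PROMPT = "Write a single sentence with no formatting that concisely describes this picture."

# "rex.jpg" is a placeholder filename; the image goes in as a base64 data URL.
with open("rex.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="qwen2.5vl:7b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": PROMPT},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)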
Front Page Images
For testing, I used a few of the images that I've previously posted on the front page of this website. I wanted to pick images that the models had not seen before, as I've seen cases where a bad vision model can extract just enough information that a good LLM can build out a long and convincing text response. Below is a collage of the pictures I used. For each image I ran the above prompt multiple times to get a general idea of how well the models were working, and then extracted the responses for all the models in the last run.

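The testing loop itself was nothing fancy. Something along these lines, using the ollama Python package, is a reasonable sketch of the idea; the filenames and run count are placeholders rather than the exact script I used.

import ollama

MODELS = ["qwen2.5vl:7b", "gemma3:12b-it-qat", "llama3.2-vision:11b"]
IMAGES = ["rex.jpg", "bee.jpg", "gate.jpg", "velella.jpg"]  # placeholder filenames
PROMPT = "Write a single sentence with no formatting that concisely describes this picture."

for image in IMAGES:
    for model in MODELS:
        for run in range(3):  # repeat to get a feel for run-to-run variation
            result = ollama.chat(
                model=model,
                messages=[{"role": "user", "content": PROMPT, "images": [image]}],
            )
            print(f"{image}\t{model}\t{result['message']['content']}")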
Rex the Inflatable Cowboy
The first image I tried out was Rex, an inflatable cowboy that they use for the Livermore football team. It's a good test image because it's a common thing you'd see at sporting events, but it's not fully inflated, which makes it less obvious what's happening. Below are the answers the models gave me. Qwen and Gemma both did well, though Gemma thought this was a tennis court. In runs before I added "concisely", Qwen properly identified that the cowboy was named Rex. I'm disappointed in Llama's answer. It didn't mention a cowboy or an inflatable, and thought Rex was holding a fishing rod.
qwen2.5vl:7b
A large inflatable character resembling a cowboy is lying on the ground at a sports field during dusk.
gemma3:12b-it-qat
A large, deflated inflatable cowboy mascot labeled "Rex" lies on a tennis court during sunset, surrounded by people.
llama3.2-vision:11b
The image shows a cartoon-style illustration of a man with a large head and small body, wearing a brown hat and holding a fishing rod.
Fuzzy Bee
I expected the bee-on-a-flower picture to be easy for the models to figure out because it's a basic nature shot. While Qwen and Gemma did fine, I was disappointed that Llama did not identify the bee.
qwen2.5vl:7b
A bee hovers near delicate pink flowers on a branch.
gemma3:12b-it-qat
A honeybee is perched amongst a cluster of delicate pink flowers.
llama3.2-vision:11b
The image depicts a close-up of a pink flower with yellow stamens, showcasing its intricate details and vibrant colors.
Project Gate
I like the Los Alamos Project Main Gate picture as a test because it has some easy-to-read text that can provide historical context. All three models extracted the text off the sign, though Llama only gave the first three words in all caps. Interestingly, Qwen and Gemma identified this as a black and white photo, while Llama identified specific features as being black or white.
qwen2.5vl:7b
A black and white photo shows a small building with a sign reading "Los Alamos Project Main Gate" and a car parked nearby under a vast sky.
gemma3:12b-it-qat
This black and white photograph shows a guardhouse and sign reading "Los Alamos Project Main Gate" with a car parked nearby against a backdrop of mountains and desert landscape.
llama3.2-vision:11b
The image features a white sign with black text, which reads "LOS ALAMOS PROJECT" and is situated in front of a white building with a large window.
Curious as to whether the models could detect anachronisms, I asked some follow-up questions involving dates. The models consistently reported that the photo was likely from the 1950s or 1960s, citing the history of Los Alamos, the style of the buildings, and the use of black and white photography. Qwen noted that "the design of the car, with its rounded edges and distinct shape, aligns with vehicles from that time period," so I asked the question below about the car. I think it's interesting that once it focused on the car, it identified it as a minivan that is out of place.
What year was the car in this picture made?
The image you've provided is a black and white photograph of a building with a sign that reads "Los Alamos Project Main Gate." This is a historical image, and the car in the picture appears to be a minivan, which was not a common vehicle in the early years of the Los Alamos Project. The Los Alamos National Laboratory was established in 1943, and the earliest minivans were introduced in the 1980s. Therefore, the car in the picture is likely from the 1980s or later, as it does not match the design of vehicles from the 1940s or 1950s, which were the time periods when the Los Alamos Project was active.
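One way to ask these kinds of follow-ups is to keep the earlier exchange in the message history so the model still has the image in context. The sketch below shows the general idea with the ollama Python package; the filename is a placeholder and this isn't necessarily the exact way I ran it.

import ollama

PROMPT = "Write a single sentence with no formatting that concisely describes this picture."
FOLLOW_UP = "What year was the car in this picture made?"

# "gate.jpg" is a placeholder; the image only needs to be attached to the first message.
history = [{"role": "user", "content": PROMPT, "images": ["gate.jpg"]}]
first = ollama.chat(model="qwen2.5vl:7b", messages=history)

# Carry the model's answer forward so the follow-up question has the full context.
history.append({"role": "assistant", "content": first["message"]["content"]})
history.append({"role": "user", "content": FOLLOW_UP})
second = ollama.chat(model="qwen2.5vl:7b", messages=history)
print(second["message"]["content"])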
Jellyfish
The jellyfish picture seemed like a good test because Velella velella isn't all that common, but the beach should provide enough context for a model to estimate that it's probably some kind of jellyfish. Gemma did great. Qwen misidentified the jellyfish as glass. Llama didn't even realize this was a beach.
qwen2.5vl:7b
Two pieces of glass lie on a sandy beach, reflecting light and surrounded by small grains of sand.
gemma3:12b-it-qat
The picture shows several translucent blue jellyfish washed up on a sandy beach.
llama3.2-vision:11b
This image depicts a close-up of a small, dark gray stone with a rough texture and a hole in the center, featuring a lighter gray interior.
Thoughts
Nontechnical people have seen or used enough AI tools by now that most would shrug their shoulders and say "so what?" about these open models. The thing that is significant to me is that you can download them for free and use a low-end GPU to perform image-understanding tasks that would have been difficult to get done just a year ago. The commercial sites are much better and faster, but I don't want to send them my images or help those companies build up their empires more than I already have.
I was surprised at how much of a difference I saw between these small LLMs. Last fall I started using Llama 3.2 because it was the next big step after CLIP. Llama's been my go-to model for image work because the text side has always felt solid and you can scale up to a massive number of parameters when you have the hardware. In the above tests though, Llama was noticeably worse than Qwen and Gemma. It also took wandering walks off topic, which is a pain when you're trying to feed results into a large database.
I should emphasize that there's some unfairness in these tests because the Qwen model I used is only 7B parameters, while Gemma is 12B and Llama is 11B. The giant models for Llama and Gemma likely give better results, but again, I want to use the smallest model that I can. Looking forward, it'll be interesting to see how well Qwen3-VL does when it's released. When I plug the Project Gate picture into Qwen's online chat, the 32B parameter models automatically spot the minivan and estimate the picture is from the 1990s (update: this also worked in the 32B 2.5VL). Hopefully Qwen3 vision will have a smaller model that can do the same.