The idea that the CS field is about to be flooded with know-nothing vibe coders worries me. Last year when I tried out Codeium in VS Code, I thought it was neat but nothing to worry about since I had to do a lot of corrections to the code to get it to work properly. There have been a lot of stories in the news lately about Cursor and Claude Code though that made me wonder if things have gotten substantially better since then. The thing that pushed me over the edge this week though was that I started thinking about how long it's taking me to refactor some data products using Pandas. It's fun tinkering with Pandas, but almost always I'm writing some throwaway code that nobody will ever see. It takes a long time to do simple things and I still catch myself making dumb mistakes. It all seems like the kind of stuff an AI tool could do better.
On Friday I caved and downloaded a copy of Cursor for Linux on my Chromebook. YouTube is flooded with smiling videos on how to install and use Cursor. My summary is: they forked VS Code and modded it so there's an integrated prompt session on the side. You tell it what you want it to do next with your project, the prompt and code go to their servers, they make their own prompts, and then it all gets sent out to high-end AI engines like Claude and Gemini Pro. You pay them $20/month to use the service with throttling, or $200/month (sheesh) to get more prompts. The interface is very good: it does all the project creation stuff you want VS Code to do (e.g., create a venv, install packages, and add them to requirements), plus it checkpoints your state so you can undo things when things go wrong.
Bouncy Ball Game
I didn't have a great idea about what to build when trying out Cursor, so I fell back on the 1980s arcade game approach. I started by asking it to draw some bouncing vector point balls similar to the ones you see in 1990s Amiga demos. It created a project, installed pygame, and wrote some basic code that ran without any nudging. The balls were vector dots, but they were 2D, so I asked it to switch to 3D balls that rotated and spun. This actually worked, though it drew lines between the points that I then had it remove with another prompt. I went on to add some pinball-style bumpers, which took some more prompts to refine. From there I added a line at the bottom with one hole, then two, to make it more of a game. With additional prompts I had it add a score count, use more and smaller balls, and flash a game-over menu when the last ball went into a hole.
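For anyone who hasn't seen the effect, the usual way to draw it is to keep a list of 3D points on a sphere, rotate them a little each frame, and project them to 2D before plotting. This isn't the code Cursor generated, just a minimal sketch of the idea in pygame:

# Minimal sketch of a rotating 3D point ball that bounces around the window.
# Not the Cursor-generated code -- just the standard rotate-and-project idea.
import math
import pygame

pygame.init()
screen = pygame.display.set_mode((640, 480))
clock = pygame.time.Clock()

# Points roughly distributed on a unit sphere (latitude/longitude rings).
points = []
for i in range(1, 12):
    theta = math.pi * i / 12
    for j in range(16):
        phi = 2 * math.pi * j / 16
        points.append((math.sin(theta) * math.cos(phi),
                       math.sin(theta) * math.sin(phi),
                       math.cos(theta)))

pos = [320.0, 240.0]   # ball center in screen coordinates
vel = [3.0, 2.0]       # pixels per frame
angle = 0.0
radius = 80

running = True
while running:
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False

    # Bounce the whole ball off the window edges.
    for axis, limit in ((0, 640), (1, 480)):
        pos[axis] += vel[axis]
        if pos[axis] < radius or pos[axis] > limit - radius:
            vel[axis] = -vel[axis]

    angle += 0.02
    ca, sa = math.cos(angle), math.sin(angle)
    screen.fill((0, 0, 0))
    for x, y, z in points:
        # Rotate around the Y axis, then the X axis.
        x, z = x * ca - z * sa, x * sa + z * ca
        y, z = y * ca - z * sa, y * sa + z * ca
        # Simple perspective projection.
        scale = radius * 2.5 / (2.5 + z)
        px = int(pos[0] + x * scale)
        py = int(pos[1] + y * scale)
        pygame.draw.circle(screen, (255, 255, 255), (px, py), 2)

    pygame.display.flip()
    clock.tick(60)

pygame.quit()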
Admittedly, this should be an easy problem for AI because there are a ton of examples out there and pygame is well equipped to do a lot of the heavy lifting for you. I did get one Python error in the process, but copying the message into the chat motivated the AI to look at the code and realize there was a use-before-initialize error that it needed to fix. The code and comments seem pretty reasonable. I think the only annoyance was that there were some subtle collision problems in the game that didn't seem to be easy to fix (either through chat or by debugging the code). I hear that quirks like this often pop up with vibe code and that it's hard for anyone to go back and figure out how to fix them. My worry here is that the future will be filled with apps that look like they're working, but have random mistakes in them that cause significant problems when you do nonstandard things.
Platformer
I showed Cursor to my older son and asked him to write his own game so he knows what other people his age will be using. He chose to write a simple Mario-style platformer. The initial prompt yielded an Atari 2600 style game where a blue rectangle could wander around and jump from one green bar to another. It worked, but he quickly realized that some platforms were unreachable while others didn't stop you from falling through to the ground. He issued prompts to fix the problems, then went about adding coins, a scoring system, and a level-clearing message that included fireworks. He then asked it to switch the blob character to a Mario. Interestingly, the AI created a separate program that used pygame's shape primitives to draw a Mario character and save it to a PNG the game program could load and use. When he asked it to animate the character when it walked, it modified the image generator to make multiple images. At first it didn't look like the new images did anything different, but when we zoomed in on the pictures we realized it had put a single blue dot for a foot and toggled where it was when the character walked.
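The generator itself is a neat trick: draw onto an off-screen Surface with pygame's shape primitives and save it with pygame.image.save. The sketch below shows the approach; the shapes and filename are made up, not what Cursor actually produced.

# Sketch of generating a simple character sprite with pygame primitives and
# saving it to a PNG the game can load later.  The shapes and filename are
# illustrative only.
import pygame

pygame.init()
sprite = pygame.Surface((32, 48), pygame.SRCALPHA)  # transparent background

pygame.draw.rect(sprite, (200, 0, 0), (8, 0, 16, 10))     # red cap
pygame.draw.circle(sprite, (255, 200, 150), (16, 16), 8)  # face
pygame.draw.rect(sprite, (0, 0, 200), (10, 24, 12, 16))   # blue overalls
pygame.draw.circle(sprite, (0, 0, 255), (12, 44), 2)      # the infamous foot dot

pygame.image.save(sprite, "mario_frame1.png")

# The game can then load it with:
#   image = pygame.image.load("mario_frame1.png").convert_alpha()
pygame.quit()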
My son play tested the game and found that it mostly worked. There were a few frustrating bugs where if you hit a specific platform in a particular way you'd go through it. It reminded me of how the original Joust had a bug with the pterodactyl that people loved. Maybe the quirks of a game are what make it special. Still, it's tough figuring out where all the bugs might be in a program like this, let alone fixing them all.
PiAware Flight Receiver
My son was willing to put another 5 minutes into prompting things, so I had him ask Cursor to build some Python code to connect to a socket on my PiAware station. We gave it a specific IP/socket and told it that the port dumps out plane update information in a standard CSV format. When we told it to print out info every time it saw a new plane, it spewed out a lot of data that didn't look right. I realized it was getting misaligned because it wasn't handling the end-of-line boundaries properly. Once we explained the problem, it fixed the code and properly printed out new plane IDs every few seconds.
While ADSB traffic is pretty well documented, I was impressed that the AI wrote code to open a socket, parse CSV lines, and print out the useful information. Writing this kind of code by hand is fun, but there are enough tricky spots that by the time I finish it I don't want to do anything else with it.
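The final script boiled down to something like the sketch below. The host is a placeholder, and I'm assuming the standard dump1090 BaseStation feed on port 30003, where the ICAO hex ID is the fifth comma-separated field; adjust for your own setup.

# Sketch of the PiAware reader: connect to the CSV feed, split complete lines,
# and print each plane ID the first time we see it.  Host/port and the field
# index assume the usual dump1090 BaseStation output.
import socket

HOST = "192.168.1.50"   # placeholder -- use your PiAware station's address
PORT = 30003

seen = set()
buffer = ""

with socket.create_connection((HOST, PORT)) as sock:
    while True:
        data = sock.recv(4096).decode("ascii", errors="ignore")
        if not data:
            break
        buffer += data
        # Only process complete lines; keep any partial line for the next read.
        # (Mishandling this split was the bug in the first AI-generated version.)
        *lines, buffer = buffer.split("\n")
        for line in lines:
            fields = line.strip().split(",")
            if len(fields) > 4 and fields[4] and fields[4] not in seen:
                seen.add(fields[4])
                print("New plane:", fields[4])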
Thoughts
So, yes, just like other people are saying, Cursor is appealing because you can steer it to write code that would otherwise be tedious to write. It feels like cheating though, and it requires you to keep a close eye on what it is actually building. The more people use Cursor, Claude, and the like, the better I think these tools will get. I do wonder how much developers will be willing to pay for these services. I think a lot of new developers will get hooked on the AI tools and then flounder when the AI companies jack up their rates. I'm still going through the 2-week trial process, but I'm thinking I wouldn't mind paying $20/month to hack on personal projects faster, especially if it means canceling one of the streaming services I have but never use.
The other week while working in the back yard I noticed there were a surprising number of crows watching me from the neighboring yard. There were enough of them that I thought I should count them, but didn't because I knew they'd fly away before I'd finished. I took a picture instead and vowed that I'd ask the vision LLMs to tell me how many crows they could see. This task is interesting to me because it involves both object recognition and simple counting. I suspected that the models would easily recognize there were birds in the picture, given that the previous post found visual LLMs could identify people, places, and even inflatable mascots. However, I doubted that the models would be able to give me an accurate count, because they often have problems with math-related tasks.
How many crows do you see? 25?
I've gone over the picture and put red circles on all the crows I see (no, there aren't two by the pole. Zoom in on the bigger image to see its old wiring). My crow count is 25.
Traditional YOLO Struggled
Counting items in a picture is something that Computer Vision (CV) has done well for many years, so I started out by just running YOLO (You Only Look Once) as a baseline. I didn't want to spend much time working this out, so I asked Gemini 2.5 Flash to generate some Python code that uses a YOLO library to count the number of crows in an image. Interestingly, Gemini was smart enough to know that the stock YOLOv8 model's training data (COCO) doesn't include a specific "crow" class, so it changed the search term to "bird". The initial code it generated ran fine, but didn't find any birds in my picture. I bumped up the model from nano to medium to extra large (x). The largest model only found one bird with the default settings, so I lowered the confidence threshold from 0.5 to 0.1. The change enabled YOLO to spot 17 out of 25 birds, though some of the lower-confidence boxes looked questionable.
YOLO Birds at 0.10 Confidence
Previously I've had success with YOLO-World when looking for uncommon objects because it is much more flexible with search terms. However, when I asked Gemini to change the code over to YOLO-World, the extra large model with 0.10 confidence only identified 10 birds. Looking at the code, the only significant difference was that the search terms had been expanded to include both bird and crow. Searching for crow alone produced only 4 hits. YOLO-World also took longer to run than plain YOLO (e.g., 8s instead of 4s), so it's disappointing that it provided worse results. To be fair though, a lot of the birds just look like black bundles of something floating around the sky.
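For reference, the generated code boiled down to something like the sketch below. This is my reconstruction using the ultralytics package, not Gemini's actual output; the model files and the 0.10 threshold are the ones described above, and the image path is a placeholder.

# Sketch of the bird-counting baseline: plain YOLOv8 first, then YOLO-World
# with open-vocabulary classes.  A reconstruction, not the Gemini output.
from ultralytics import YOLO, YOLOWorld

IMAGE = "crows.jpg"   # placeholder filename

# Plain YOLOv8: COCO has no "crow" class, so count everything labeled "bird".
model = YOLO("yolov8x.pt")
results = model(IMAGE, conf=0.10)[0]
birds = [b for b in results.boxes if results.names[int(b.cls)] == "bird"]
print(f"YOLOv8x found {len(birds)} birds")

# YOLO-World: open vocabulary, so we can ask for "bird" and "crow" directly.
world = YOLOWorld("yolov8x-worldv2.pt")
world.set_classes(["bird", "crow"])
wresults = world(IMAGE, conf=0.10)[0]
print(f"YOLO-World found {len(wresults.boxes)} birds/crows")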
Gemini 2.5 Answer
Before going into how well local vision LLMs could perform, I went to Gemini and asked 2.5 Flash how many crows were in the picture. It initially answered "20-25" birds, but when I asked for a specific number it increased the count to 30-35 crows. When I then asked it to count individual crows it said 34 individual crows. It sounded reasonable, so I asked it to put a bounding box around each one. As the below image shows, it added the boxes, but at the same time it also inserted a bunch of new crows into the image. This answer is pretty awful.
Crow Count, Now with More Crows
Switching to Gemini 2.5 Pro changed the number to 23 crows. However, asking it to place boxes around the crows again resulted in crows being added to the picture. Additionally, each box contained multiple crows.
Local Models
Next, I used Ollama to do some local testing of Qwen2.5-VL, Gemma3, and Llama3.2-vision on my test image. I tried a few different prompts to see if I could trick the models into giving better answers:
A. How many crows are in this picture?
B. Very carefully count the number of crows in this picture.
C. Very carefully count how many birds are in this picture.
                      A.   B.   C.
qwen2.5vl:7b          14   20   20
gemma3:12b-it-qat     14   19   17
llama3.2-vision:11b    3    2    0
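If you want to reproduce this kind of test, a loop through the ollama Python package is the easiest route. The sketch below is roughly what I mean; the models and prompts match the table, the image path is a placeholder, and it's a reconstruction rather than my exact harness.

# Sketch of the counting benchmark: run each prompt against each local vision
# model through the ollama Python package.
import ollama

MODELS = ["qwen2.5vl:7b", "gemma3:12b-it-qat", "llama3.2-vision:11b"]
PROMPTS = {
    "A": "How many crows are in this picture?",
    "B": "Very carefully count the number of crows in this picture.",
    "C": "Very carefully count how many birds are in this picture.",
}
IMAGE = "crows.jpg"   # placeholder path to the test image

for model in MODELS:
    for label, prompt in PROMPTS.items():
        response = ollama.chat(
            model=model,
            messages=[{"role": "user", "content": prompt, "images": [IMAGE]}],
        )
        print(f"{model} [{label}]: {response['message']['content'].strip()}")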
Qwen and Gemma did ok in these tests, especially when I asked them to put a little more effort into the problem. Theorizing that the libraries might be downsizing the images and making the small crows even harder to recognize, I cropped the original test image and tried the prompts again (note: the cropped image is at the top of the page; the original image was zoomed out like the boxed image above). As seen below, the cropped image resulted in higher counts for both Qwen and Gemma.
                      A.   B.   C.
qwen2.5vl:7b          20   25   25
qwen2.5vl:32b         24   30   24
gemma3:12b-it-qat     25   32   30
llama3.2-vision:11b    0    0    2
The frustrating part about working with counting questions is that you usually don't get any explanation of how the model came up with its number. I did notice that when I switched to the larger, 32B Qwen model, the LLM would sometimes give me some positional information (e.g., "1 crow near the top left corner", "2 crows near the center top", etc.). However, the number of crows in the list was never the same as the number it reported back as its final answer (e.g., 21 vs. 30). When I compared the answers with what I see in the picture, the locations are definitely swizzled (e.g., it should be 2 in the top left and none in the center).
Thoughts
So, not great but not terrible. I'm impressed that the home versions of Qwen and Gemma were able to come up with answers that were in the right ballpark. It is troubling though that the models generated believable-looking evidence for their answers that was just wrong. I'm even more bothered by the visual results where crows were added to better match the answer.
One saving grace for Qwen that I'll have to write about later is that it is designed to produce bounding boxes (or grounding) for detected objects. I've verified that these boxes are usually correct, or at least include multiple instances of the desired item. I feel like this should be an essential capability of any visual LLM. We don't need any hallucinations, uh, around here.
A few months ago I got more serious about doing AI things at home and switched from using online services (Gemini and OpenAI) to locally-hosted models. While the change was partially motivated by a desire to wean myself off the big AI companies, I've also been wanting to do something with the unused GPU that my son left behind when he went off to college. I initially went down the HuggingFace path of manually downloading models and using the Transformers library to code up everything. More recently I've switched to using Ollama, which automates the whole process and provides a nice OpenAI-compatible API that I can talk to with code or hook up to OpenWebUI. My son's 12GB RTX 3080 barely qualifies as an entry-level GPU for AI work, but it's good enough that 3-5B parameter models are fairly responsive.
Lately I've been looking into how well vision models like Qwen2.5-VL and Gemma3 can summarize images. My motivation for this is that eventually I'd like to scan my home photos and have a vision LLM write a one-line text summary of each picture so I could then search for things via grep. Since I won't be getting a larger GPU anytime soon, I've been experimenting with smaller models like qwen2.5vl:7b, gemma3:12b-it-qat, and llama3.2-vision:11b. While Qwen is the only one that really works well with 12GB of VRAM, the others are small enough that my son's 8-core CPU can process prompts in 10-30 seconds. In this post I wanted to capture some of my observations on quality when using these smaller models.
Simple Prompting
After some trial and error, I settled on the below prompt. There's a battle story in every clause of that prompt. I initially started with something similar to "describe this picture", but quickly found that all the models generated lengthy paragraphs about the image, usually with irrelevant details, extra formatting, and lead-in sentences (e.g., "Here is a description of the image:"). Adding "no formatting" helped remove all the markdown formatting and took away some of the writing primitives for organizing content into bullet lists. Next, adding "single sentence" usually helped constrain the models to limit their responses. Finally, adding "concisely" helped some of the models choose their details more carefully. While I noticed that "concisely" sometimes caused a model to omit key details it had reported in previous responses, I appreciate having the answer boiled down to as short a form as possible.
Write a single sentence with no formatting that concisely
describes this picture.
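The end goal is to run this prompt over a whole photo directory and append the one-liners to a text file I can grep. A sketch of that loop through the ollama Python package is below; the paths, model choice, and tab-separated output format are just placeholders.

# Sketch of the end goal: caption every photo in a folder with the one-line
# prompt and append the results to a grep-able index file.
import pathlib
import ollama

PROMPT = ("Write a single sentence with no formatting that concisely "
          "describes this picture.")
PHOTO_DIR = pathlib.Path("~/photos").expanduser()   # placeholder folder

with open("photo_index.txt", "w") as index:
    for image in sorted(PHOTO_DIR.glob("*.jpg")):
        response = ollama.chat(
            model="qwen2.5vl:7b",
            messages=[{"role": "user", "content": PROMPT, "images": [str(image)]}],
        )
        summary = response["message"]["content"].strip().replace("\n", " ")
        index.write(f"{image.name}\t{summary}\n")   # one line per photo for grep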
Front Page Images
For testing, I used a few of the images that I've previously posted on the front page of this website. I wanted to pick images that the models had not seen before, as I've seen cases where a bad vision model can extract just enough information that a good LLM can build out a long and convincing text response. Below is a collage of the pictures I used. For each image I ran the above prompt multiple times to get a general idea of how well the models were working, and then extracted the responses for all the models in the last run.
Rex the Inflatable Cowboy
The first image I tried out was Rex, an inflatable cowboy that they use for the Livermore football team. It's a good test image because it's a common thing you'd see at sporting events, but it's not quite inflated in a way that makes it clear what's happening. Below are the answers the models gave me. Qwen and Gemma both did well, though Gemma thought this was a tennis court. In runs before I added "concisely", Qwen properly identified that the cowboy was named Rex. I'm disappointed in Llama's answer. It didn't mention cowboy or inflatable, and it thought Rex was holding a fishing rod.
qwen2.5vl:7b
A large inflatable character resembling a cowboy is lying on the
ground at a sports field during dusk.
gemma3:12b-it-qat
A large, deflated inflatable cowboy mascot labeled "Rex" lies on
a tennis court during sunset, surrounded by people.
llama3.2-vision:11b
The image shows a cartoon-style illustration of a man with a large
head and small body, wearing a brown hat and holding a fishing rod.
Fuzzy Bee
I expected the bee-on-a-flower picture to be easy for the models to figure out because it's a basic nature shot. While Qwen and Gemma did fine, I was disappointed that Llama did not identify the bee.
qwen2.5vl:7b
A bee hovers near delicate pink flowers on a branch.
gemma3:12b-it-qat
A honeybee is perched amongst a cluster of delicate pink flowers.
llama3.2-vision:11b
The image depicts a close-up of a pink flower with yellow stamens,
showcasing its intricate details and vibrant colors.
Project Gate
I like the Los Alamos Project Main Gate picture as a test because it has some easy-to-read text that can provide historical context. All three models extracted the text off the sign, though Llama only gave the first three words in all caps. Interestingly, Qwen and Gemma identified this as a black and white photo, while Llama identified specific features as being black or white.
qwen2.5vl:7b
A black and white photo shows a small building with a sign
reading "Los Alamos Project Main Gate" and a car parked nearby
under a vast sky.
gemma3:12b-it-qat
This black and white photograph shows a guardhouse and sign
reading "Los Alamos Project Main Gate" with a car parked nearby
against a backdrop of mountains and desert landscape.
llama3.2-vision:11b
The image features a white sign with black text, which reads
"LOS ALAMOS PROJECT" and is situated in front of a white building
with a large window.
Curious as to whether the models could detect anachronisms, I asked some follow-up questions involving dates. The models consistently reported that the photo was likely from the 1950s or 1960s, citing the history of Los Alamos, the style of the buildings, and the use of black and white photography. Qwen noted that "the design of the car, with its rounded edges and distinct shape, aligns with vehicles from that time period," so I asked the below question about the car. I think it's interesting that when it focused on the car more closely, it identified it as a minivan that is out of place.
What year was the car in this picture made?
The image you've provided is a black and white photograph of a
building with a sign that reads "Los Alamos Project Main Gate."
This is a historical image, and the car in the picture appears
to be a minivan, which was not a common vehicle in the early
years of the Los Alamos Project. The Los Alamos National
Laboratory was established in 1943, and the earliest minivans
were introduced in the 1980s. Therefore, the car in the picture
is likely from the 1980s or later, as it does not match the
design of vehicles from the 1940s or 1950s, which were the
time periods when the Los Alamos Project was active.
Jellyfish
The jellyfish picture seemed like a good test because Velella velella isn't all that common, but the beach should provide enough context for a model to estimate that it's probably some kind of jellyfish. Gemma did great. Qwen misidentified the jellyfish as glass. Llama didn't even realize this was the beach.
qwen2.5vl:7b
Two pieces of glass lie on a sandy beach, reflecting light and
surrounded by small grains of sand.
gemma3:12b-it-qat
The picture shows several translucent blue jellyfish washed up on
a sandy beach.
llama3.2-vision:11b
This image depicts a close-up of a small, dark gray stone with a
rough texture and a hole in the center, featuring a lighter
gray interior.
Thoughts
Nontechnical people have seen or used enough AI tools by now that most would shrug their shoulders and say "so what?" about these open models. The thing that is significant to me is that you can download them for free and use a low-end GPU to perform image understanding tasks that would have been difficult to get done just a year ago. The commercial sites are much better and faster, but I don't want to send them my images or help the companies build up their empires more than I already have.
I was surprised at how much of a difference I saw in these small LLMs. Last Fall I started using Llama 3.2 because it was the next big step after CLIP. Llama's been my go-to model for image work because the text end has always felt solid and you can scale up to a massive number of parameters when you have the hardware. In the above tests though, I see that Llama was noticeably worse than Qwen and Gemma. It also took wandering walks that were off topic, which is a pain when you're trying to feed results into a large database.
I should emphasize that there's unfairness in these tests because the Qwen I used is only 7B parameters, while Gemma is 12B and Llama is 11B. The giant models for Llama and Gemma likely give better results, but again, I want to use the smallest model that I can. Looking forward, it'll be interesting to see how well Qwen3-VL does when it's released. When I plug the project gate picture into Qwen's online chat, the 32B parameter models automatically spot the minivan and estimate the picture was from the 1990s (update: this also worked in the 32B 2.5VL). Hopefully Qwen3 vision will have a smaller model that can do the same.
I hit a weird problem with Google's Gemini 2.0 this morning: I asked it to transcribe some images of hand-written notes into a CSV file and it swapped some of the values between lines. As someone who does a lot of data engineering, I find this especially dangerous because the output was convincing enough that I almost didn't catch the errors in the actual numbers. This makes me wonder how much bad data is going to flow into important records in the near future.
Transcribing Power Bills
This morning my wife and I were talking about how our power bills have gone up a bit in the last few years. When I asked if we had this in a spreadsheet somewhere, she said no, but pointed me at 5 pages of handwritten notes she had made over the years that concisely summarized all the data I'd need to make some plots. It occurred to me that this was exactly the kind of thing that AI should be able to do, so I pulled up Gemini, typed the below prompt, and proceeded to take pictures of the notes.
I would like to convert 5 pictures of notes about my pg&e
gas bills into a csv file with the following columns: date,
dollar amount, electric amount, gas amount, PG&E code, Human
notes, and conversion notes. The human notes should be any
comments that were written in the entry. The conversion
notes should be any problems you came across while translating
a particular entry.
I then sent a picture of each page, with a statement about how this was the nth page. The response from the first picture was very reassuring: it printed out a CSV file where everything looked like it was in the right place. The next few pictures appended more data to the CSV. Gemini said something went wrong on the last two pictures, so I switched to my Chromebook, started a new conversation, and repeated the process until I got to the end. It even gave me a link for downloading the CSV file so I didn't have to copy-paste it. I moved over to Google Sheets, imported the CSV, and started asking Gemini to remind me how to do a pivot in Sheets so I could look at the years in a monthly timeline.
Transcription Problem
The Sheets spreadsheet plots had some issues because of missing cell entries, so I went back and started manually inserting values. This is fine, as I'm happy to use any tool that automates 90% of the brute force work. However, as I went through the table, I noticed two problems. First, the gas numbers had some bad values because of my wife's handwriting (e.g., "g .55" became 8.55). This Brazil-like mistake is still fine, as the errors would be caught with some range checking. The bigger problem was that Gemini swapped a few of the data values between lines:
A few mistakes
This problem is significant because it's hard to detect. It's also very puzzling because the image has a lot of guides to help keep things aligned: the original text is structured, each line has blue lines to guide the reader, and there isn't anything obvious to me that would trip it up. In fact if I just send the one page, it transcribes the data just fine. That makes me wonder if it's something to do with appending knowledge from one prompt to the next.
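For what it's worth, the range checking I mentioned above is easy enough to bolt on after the fact. A sketch is below; the column names follow my prompt and the thresholds are made up. It would catch the handwriting misreads, but not the swapped values.

# Sketch of a post-transcription sanity check: flag rows whose numbers fall
# outside plausible ranges.  Column names follow the prompt; thresholds are
# made up.  This catches misread digits, not values swapped between rows.
import csv

LIMITS = {
    "dollar amount":   (0, 1000),   # hypothetical monthly bill range
    "electric amount": (0, 3000),   # kWh
    "gas amount":      (0, 300),    # therms
}

with open("pge_bills.csv", newline="") as f:
    for row in csv.DictReader(f):
        for column, (low, high) in LIMITS.items():
            try:
                value = float(row[column])
            except (KeyError, ValueError):
                print(f"{row.get('date', '?')}: unreadable {column!r}: {row.get(column)}")
                continue
            if not low <= value <= high:
                print(f"{row.get('date', '?')}: suspicious {column} = {value}")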
Dealing with Djinnis
This kind of problem is what leads to the endless prompt-engineering cycle that makes AI so difficult to use these days for tasks that require accuracy. Should I just send 5 separate requests to transcribe the images and merge the results myself? Should I tell it valid ranges for the fields? Should I take a video and scan all the pages in one shot?
Prompt engineering feels like you're dealing with a Djinni that's granted you three wishes. There's always a loophole that wrecks your commands, so you spend more and more time perfecting additional clauses to your prompt to try to convince it to do the thing you want. In the end I'm not sure you can ever be certain that it's not stabbing you in the back in some unexpected way.
If everything is working right, you should be seeing this website rendered with a new backend that I've had on the back burner for a few years now. It's still missing some features, but I decided to switch over to it so I can move on to other projects. It's written in C++ this time, but still uses Emacs Org mode files for the content.
Evolution
Over the last 20+ years or so I've implemented my website in a few different ways. Back in the 1990s I wrote a few web pages in HTML to talk about the things I was doing in college. HTML was tedious, so I wrote some Perl scripts to generate static web pages from plain text files. Writing in plain text really helped speed up the writing, but after a few dozen posts it became a chore to regenerate and upload the content whenever there was a change. The next step was to port the Perl script over to a CGI script that could generate the pages on demand. While people scoff at it now, Perl was a pretty decent language for generating HTML because it was straightforward to hack together a bunch of regexes to convert chunks of text into HTML templates. It was hard to re-read the code and make sense of it, but since it was an interpreted language you could just iterate on it until things looked right.
The next big step for me was to reformat all the text content into an Emacs Org-mode file. Org mode made it easy to put each post as its own section in a single-file outline. Merging everything into one file meant I could insert extra posts into the timeline without having to renumber everything. Plus, the Emacs spell checker has been great, when I remember to run it. I updated the Perl cgi script to read from the org file and dump it to HTML. Looking back through my posts, I see I made that change 10 years ago.
Needing to Change
The Perl version of this website was ok, but a few things bugged me. First, the HTML looked a little old and didn't do the CSS magic to make the site readable on smaller devices. It was pretty awful on a cell phone because it would shrink a wide desktop display of content down to a narrow device. Also, Google said at some point that pages that were missing a proper viewport tag (like mine) would be pushed down in their search results. Second, I noticed that generating my webpages with Perl took longer than I wanted. It could have just been the cheap webhosting plan I'd been using, but I suspected that all those complicated regexes I'd written were adding up. Finally, I wanted to do more things with the website, but I don't write anything in Perl anymore.
So why C++? Most of my coding these days is in C++, so I had a good idea of how to split the code up into useful classes and implement some of the harder things like string formatting. The main hardship of going with C++ is that you have to recompile the code every time you make a major change. It took a while before I found the right balance in the code. Initially I went all-in on jamming everything except the org-mode content into C++. I wasted a lot of time working out clever ways to have CMake generate custom header files with constants for the different sites I wanted to run (i.e., devel at home and production on the site). I later came to my senses and switched to a system that has hooks to read in a template file at runtime, so I could change the look of the website without having to recompile everything.
Changes
I don't think the site looks too different, but there are some changes:
- Quicker: I haven't done any speed testing, but the site feels a lot peppier now.
- Mobile Friendly: Things should look ok on a cell phone now. The new CSS should just scroll vertically and be readable.
- Simple Header: I'm still doing a horizontal scroller, but I've shrunk the navigation controls to a command bar.
- Separation of Content: Someone suggested that I split the site content so that it was clear what I do for work and what I do in my personal time. The new software made it easy to run each one as its own blog, so I created a "Publications" section and a "Hobbies" section.
- Longer URLs: Sadly, I had to break the way I was linking to posts. One problem was that since there wasn't a date in the previous version of the URLs, I couldn't have the same title more than once. Changing URLs was a sin in the old web, but seriously, how many people linked here anyways?
There's still a lot missing (e.g., index and tag pages), but it's good enough that I've switched things over. If you can't find a page you're looking for, flip around some and use the date to find the new post.