Following up on yesterday's Datasette post, here are some example plots that I made using the tags my parser generates when converting the data. In each plot, I added a clause like "WHERE IsLivermore='1'" to the SQL query in the Datasette command.
Passenger Airlines
The main planes I see are passenger airplanes. They're boring and get in the way of more interesting flights, so it's useful to tag as many of the big airlines as I can. Below is a plot for just the passenger airlines. The interesting part is that you can see how they follow regular corridors in the Bay Area. If you click on the picture, you'll see that planes going west funnel into one of four lanes after they cross the Sierra Nevada. It's like Missile Command.
Passenger Airlines
Little Guys
In contrast, if you pick out all the independent planes that are just tagged with a tail fin starting with N, you see the little guys wandering all over the Bay Area. Sometimes they make loops or circle a lake.
N Tail Fins
Pass Through
The data parser I wrote also has some geospatial filters that tag any plane that crosses through a specific region. I have a few set up to tag flights around Livermore, including one that looks for low-flying flights over the city core. Usually these flights are connected to takeoffs and landings at the local airport, but sometimes they're just small planes that don't care. I have another filter that tags flights over Tahoe/Reno. I'm always impressed that my SDR sees that far out. Sometimes the planes aren't even traveling to the Bay Area; they're just passing through.
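For anyone curious, the tagging idea is simple. Below is a rough Pandas sketch of a Livermore-style filter, not my actual parser code; the bounding box, altitude ceiling, and column names are made up for illustration.

import pandas as pd

# Hypothetical bounding box around the Livermore city core and a low-altitude ceiling.
LAT_MIN, LAT_MAX = 37.66, 37.71
LON_MIN, LON_MAX = -121.80, -121.74
MAX_ALT_FT = 5000

def is_low_over_core(track: pd.DataFrame) -> bool:
    """True if any point in the track is inside the box and below the ceiling."""
    inside = (
        track["Lat"].between(LAT_MIN, LAT_MAX)
        & track["Lon"].between(LON_MIN, LON_MAX)
        & (track["Altitude"] < MAX_ALT_FT)
    )
    return bool(inside.any())

# tracks is assumed to be a dict of {icao: DataFrame of position reports}
# tags = {icao: is_low_over_core(df) for icao, df in tracks.items()}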
Livermore and Tahoe
A year ago I went on a side quest to refactor how I'm collecting and analyzing airplane data at home. Like most of my home projects, I spent a lot of time learning something new, but lost momentum before I could close things out. As part of a New Year's resolution to finish loose ends, I'm writing a quick summary of what I learned and putting a snapshot of the code on GitHub so I can revisit it later.
Viewing Airplane Tracks in Datasette
Airplane Logger
I've been running a PiAware server at home for several years that collects airplane positions from an SDR receiver I have in the garage. When I originally set this up, I wrote a tool called d1090-companion in Go to help record positions, generate daily track files, and let me visualize the results through a webserver. The tool has been running for years now, happily generating daily gzipped TSVs that I almost never look at because the maps API key I set up expired. I've always had thoughts about rewriting the companion, but my Go knowledge is out of date and I don't want to refresh it. Ultimately, I'd like something that lets me visualize the data better, filtering out the boring stuff and highlighting unique planes.
Trying Out Datasette
While thinking about alternatives I came across Simon Willison's Datasette, which is a tool for making a mix tape of data that you can inspect with a browser. Willison's original idea was that a data provider should pack up their data into a SQLite file, put it on a read-only website, and then run a common web frontend that lets analysts run SQL queries on the data to slice it the way they want and plot the results. There are a number of plugins available for visualizing the data, including a map plotter that looked like it would be great for my airplane data. I also found some other useful plugins, like ones for reading Parquet and GeoJSON files.
I initially started with the cluster-map plugin, which lets you show individual lon/lat points on a map (with built-in clustering as you zoom out). The examples were great, but I didn't see an easy way to render tracks. I switched to the GeoJSON plugins, which take a GeoJSON file and convert it to a SQLite DB that a map tool can plot. It took some tinkering, but I eventually confirmed that if I could export my tracks to GeoJSON, the tools would plot them OK. I wrote some Pandas code to parse my TSVs, clean them up, and convert the tracks to GeoJSON.
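To give a flavor of that conversion, here is a condensed sketch of the TSV-to-GeoJSON step. The real code does more cleanup; the column names, the IsMilitary prefix rule, and the downsample rate shown here are illustrative.

import glob
import json
import pandas as pd

features = []
for path in glob.glob("tracks/*.tsv.gz"):
    df = pd.read_csv(path, sep="\t", compression="gzip")
    df = df.dropna(subset=["Lat", "Lon"])
    for icao, track in df.groupby("ICAO"):
        track = track.sort_values("Time").iloc[::5]     # downsample to keep the map responsive
        coords = track[["Lon", "Lat"]].values.tolist()  # GeoJSON wants [lon, lat] ordering
        if len(coords) < 2:
            continue
        features.append({
            "type": "Feature",
            "geometry": {"type": "LineString", "coordinates": coords},
            "properties": {
                "ICAO": icao,
                "IsMilitary": int(str(icao).upper().startswith("AE")),  # crude prefix tag
            },
        })

with open("tracks.geojson", "w") as f:
    json.dump({"type": "FeatureCollection", "features": features}, f)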
Good and Bad
There were a lot of things I liked about this approach. First, it meant I could just generate data files and hand them off to something else to render. Second, the SQL interface for Datasette is pretty powerful: once the page is up, I can put in queries to filter the tracks down (e.g., all ICAOs that start with AE). This feature motivated me to add columns at generation time (e.g., IsMilitary) to make downstream searches easier. Finally, there were just enough plugins that I could do useful things without too much frontend knowledge.
I stalled out on the project, though, because there were a few things that weren't great for my use case. First, the Datasette version I used ran a REST API on the server to interface with the data. My PiAware is already loaded down with work, so I'm hesitant to put another API on it that isn't heavily used. Datasette-Lite looked like a promising alternative: it appears to embed the file processing in WebAssembly, so the browser fetches the data and runs the queries itself. Second, while the GeoJSON plotters were great, the plots were a little clunky and I had trouble figuring out the right way to render all the views I wanted. Datasette can probably do what I want, it just takes more effort than I want to put into it right now. Finally, there are still some problems with the sheer amount of data in the whole system. I probably need to downsample the tracks to get things to be more responsive.
Code
I've put a snapshot of the code at snippets/250221_planes_datasette so I can revisit it someday. Looking through it again, I realize that another reason I stalled out was that I got obsessive about trying to identify all the big airline codes so I could filter them out. The private charter planes seem to be getting sneakier about how they label themselves so that they blend in with everyone else. There's plenty of room for hanging yarn with Datasette though, so at some point I'll need to come back to this.
The idea that the CS field is about to be flooded with know-nothing vibe coders worries me. Last year when I tried out Codeium in VS Code, I thought it was neat but nothing to worry about, since I had to make a lot of corrections to get the code to work properly. There have been a lot of stories in the news lately about Cursor and Claude Code, though, that made me wonder if things have gotten substantially better since then. The thing that pushed me over the edge this week was that I started thinking about how long it's taking me to refactor some data products using Pandas. It's fun tinkering with Pandas, but almost always I'm writing throwaway code that nobody will ever see. It takes a long time to do simple things and I still catch myself making dumb mistakes. It all seems like the kind of stuff an AI tool could do better.
On Friday I caved and downloaded a copy of Cursor for Linux on my Chromebook. YouTube is flooded with smiling videos on how to install and use Cursor. My summary: they forked VS Code and modded it so there's an integrated prompt session on the side. You tell it what you want it to do next with your project, the prompt and code go to their servers, get wrapped in Cursor's own prompts, and then get sent out to high-end AI engines like Claude and Gemini Pro. You pay them $20/month to use the service with throttling, or $200/month (sheesh) to get more prompts. The interface is very good: it does all the project creation stuff you want VS Code to do (e.g., create a venv, install packages, and add them to requirements), plus it checkpoints your state so you can undo things when they go wrong.
Bouncy Ball Game
I didn't have a great idea about what to build when trying out Cursor, so I fell back on the 1980s arcade game approach. I started by asking it to draw some bouncing vector point balls similar to the ones you see in 1990s Amiga demos. It created a project, installed pygame, and wrote some basic code that ran without any nudging. The balls were vector dots, but they were 2D, so I asked it to switch to 3D balls that rotated and spun. This actually worked, though it drew lines between the points that I then had it remove with another prompt. I went on to add some pinball-style bumpers, which took some more prompts to refine. From there I added a line at the bottom with one hole, then two, to make it more of a game. With additional prompts I had it add a score counter, use more and smaller balls, and flash a game-over menu when the last ball went into a hole.
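For reference, the core trick behind those spinning point balls fits in a few dozen lines of pygame. This is not the code Cursor generated, just a minimal sketch of rotating a 3D point cloud and projecting it onto the screen.

import math
import pygame

pygame.init()
screen = pygame.display.set_mode((640, 480))
clock = pygame.time.Clock()

# Build a cloud of points on a unit sphere using latitude/longitude rings.
points = []
for lat in range(-60, 61, 30):
    for lon in range(0, 360, 30):
        la, lo = math.radians(lat), math.radians(lon)
        points.append((math.cos(la) * math.cos(lo),
                       math.cos(la) * math.sin(lo),
                       math.sin(la)))

angle = 0.0
running = True
while running:
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False
    screen.fill((0, 0, 0))
    angle += 0.02
    for x, y, z in points:
        # Rotate around the Y axis, then the X axis.
        x1 = x * math.cos(angle) + z * math.sin(angle)
        z1 = -x * math.sin(angle) + z * math.cos(angle)
        y1 = y * math.cos(angle * 0.7) - z1 * math.sin(angle * 0.7)
        z2 = y * math.sin(angle * 0.7) + z1 * math.cos(angle * 0.7)
        # Simple perspective projection onto screen coordinates.
        scale = 200 / (z2 + 3)
        pygame.draw.circle(screen, (255, 255, 255),
                           (int(320 + x1 * scale), int(240 + y1 * scale)), 2)
    pygame.display.flip()
    clock.tick(60)
pygame.quit()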
Admittedly, this should be an easy problem for AI because there are a ton of examples out there and pygame is well equipped to do a lot of the heavy lifting for you. I did get one Python error in the process, but copying the message into the chat prompted the AI to look at the code and realize there was a use-before-initialize error it needed to fix. The code and comments seem pretty reasonable. The only annoyance was that there were some subtle collision problems in the game that didn't seem easy to fix (either through chat or by debugging the code). I hear that quirks like this often pop up with vibe code and that it's hard for anyone to go back and figure out how to fix them. My worry here is that the future will be filled with apps that look like they're working, but have random mistakes in them that cause significant problems when you do nonstandard things.
Platformer
I showed Cursor to my older son and asked him to write his own game so he knows what other people his age will be using. He chose to write a simple Mario-style platformer. The initial prompt yielded an Atari 2600-style game where a blue rectangle could wander around and jump from one green bar to another. It worked, but he quickly realized that some platforms were unreachable while others didn't stop you from falling through to the ground. He issued prompts to fix the problems, then went about adding coins, a scoring system, and a level-clearing message that included fireworks. He then asked it to switch the blob character to a Mario. Interestingly, the AI created a separate program that used pygame's shape primitives to draw a Mario character and save it to a PNG the game program could load and use. When he asked it to animate the character when it walked, it modified the image generator to make multiple images. At first it didn't look like the new images did anything different, but when we zoomed in on the pictures we realized it had drawn a single blue dot for a foot and toggled its position as the character walked.
My son play-tested the game and found that it mostly worked. There were a few frustrating bugs where if you hit a specific platform in a particular way you'd go through it. It reminded me of how the original Joust had a bug with the pterodactyl that people loved. Maybe the quirks of a game are what make it special. Still, it's tough figuring out where all the bugs might be in a program like this, let alone fixing them all.
PiAware Flight Receiver
My son was willing to put another 5 minutes into prompting things, so I had him ask Cursor to build some Python code to connect to a socket on my PiAware station. We gave it a specific IP/port and told it that the port dumps out plane update information in a standard CSV format. When we told it to print out info every time it saw a new plane, it spewed out a lot of data that didn't look right. I realized it was getting misaligned because it wasn't handling end-of-line boundaries properly. Once we explained the problem, it fixed the code and properly printed out new plane IDs every few seconds.
While ADS-B traffic is pretty well documented, I was impressed that the AI wrote code to open a socket, parse the CSV lines, and print out the useful information. Writing this kind of code by hand is fun, but there are enough tricky spots that by the time I finish it I don't want to do anything else with it.
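The resulting script was along these lines. This is a reconstruction rather than Cursor's exact code; the IP address is a placeholder, and port 30003 is just the usual dump1090 default for the SBS/BaseStation CSV feed.

import socket

HOST, PORT = "192.168.1.50", 30003  # hypothetical PiAware address and the common SBS port

seen = set()
with socket.create_connection((HOST, PORT)) as sock:
    buffer = b""
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        buffer += chunk
        # Only process complete lines; keep any trailing partial line in the buffer.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            fields = line.decode("ascii", errors="ignore").strip().split(",")
            if len(fields) > 4:
                icao = fields[4]  # the ICAO hex code is the fifth field in SBS-1 messages
                if icao and icao not in seen:
                    seen.add(icao)
                    print("New plane:", icao)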
Thoughts
So, yes, just like other people are saying, Cursor is appealing because you can steer it to write code that would otherwise be tedious to write. It feels like cheating, though, and requires you to keep a close eye on what it is actually building. The more people use Cursor, Claude, and the like, the better I think these tools will get. I do wonder how much developers will be willing to pay for these services. I think a lot of new developers will get hooked on the AI tools and then flounder when the AI companies jack up their rates. I'm still going through the two-week trial process, but I'm thinking I wouldn't mind paying $20/month to hack on personal projects faster, especially if it means canceling one of the streaming services I have but never use.
The other week while working in the back yard I noticed there were a surprising number of crows watching me from the neighboring yard. There were enough of them that I thought I should count them, but didn't because I knew they'd fly away before I'd finished. I took a picture instead and vowed that I'd ask the vision LLMs to tell me how many crows they could see. This task is interesting to me because it involves both object recognition and simple counting. I suspected that the models would easily recognize there were birds in the picture, given that the previous post found visual LLMs could identify people, places, and even inflatable mascots. However, I doubted that the models would be able to give me an accurate count, because they often have problems with math-related tasks.
How many crows do you see? 25?
I've gone over the picture and put red circles on all the crows I see (no, there aren't two by the pole; zoom in on the bigger image and you'll see the second shape is just the pole's old wiring). My crow count is 25.
Traditional YOLO Struggled
Counting items in a picture is something that Computer Vision (CV) has done well for many years, so I started out by running YOLO (You Only Look Once) as a baseline. I didn't want to spend much time working this out, so I asked Gemini 2.5 Flash to generate some Python code that uses a YOLO library to count the number of crows in an image. Interestingly, Gemini was smart enough to know that the stock YOLOv8 model's training data (COCO) doesn't include a specific "crow" class, so it changed the search term to "bird". The initial code it generated ran fine, but didn't find any birds in my picture. I bumped up the model from nano to medium to extra large (x). The largest model only found one bird with the default settings, so I lowered the confidence threshold from 0.5 to 0.1. The change enabled YOLO to spot 17 of my 25 birds, though some of the lower-confidence boxes looked questionable.
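The script Gemini produced was roughly this shape (a sketch from memory, not the exact code it wrote):

from ultralytics import YOLO

model = YOLO("yolov8x.pt")              # the extra-large COCO-trained model
results = model("crows.jpg", conf=0.1)  # lowered confidence threshold

birds = sum(1 for box in results[0].boxes
            if model.names[int(box.cls)] == "bird")  # COCO has no "crow" class
print(f"Found {birds} birds")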
YOLO Birds at 0.10 Confidence
Previously I've had success with YOLO-World when looking for uncommon objects because it is much more flexible with search terms. However, when I asked Gemini to change the code over to YOLO-World, the extra-large model at 0.10 confidence only identified 10 birds. Looking at the code, the only significant difference was that the search terms had been expanded to include both bird and crow. Switching to crow alone yielded only 4 hits. YOLO-World took longer to run than plain YOLO (e.g., 8 s instead of 4 s), so it's disappointing that it produced worse results. To be fair though, a lot of the birds just look like black bundles of something floating around the sky.
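The YOLO-World version differed mainly in setting the search classes explicitly (again a sketch; the weight file name may differ on your install):

from ultralytics import YOLOWorld

model = YOLOWorld("yolov8x-worldv2.pt")
model.set_classes(["bird", "crow"])
results = model("crows.jpg", conf=0.1)
print(f"Found {len(results[0].boxes)} detections")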
Gemini 2.5 Answer
Before getting into how well local vision LLMs could perform, I went to Gemini and asked 2.5 Flash how many crows were in the picture. It initially answered "20-25" birds, but when I asked for a specific number it increased the count to 30-35 crows. When I then asked it to count individual crows, it said 34. That sounded reasonable, so I asked it to put a bounding box around each one. As the image below shows, it added the boxes, but it also inserted a bunch of new crows into the image. This answer is pretty awful.
Crow Count, Now with More Crows
Switching to Gemini 2.5 Pro changed the number to 23 crows. However, asking it to place boxes around the crows again resulted in new crows being added to the picture, and each box contained multiple crows.
Local Models
Next, I used Ollama to do some local testing of Qwen2.5-VL, Gemma3, and Llama3.2-vision on my test image. I tried a few different prompts to see if I could trick the models into giving better answers:
A. How many crows are in this picture?
B. Very carefully count the number of crows in this picture.
C. Very carefully count how many birds are in this picture.
Model                 A.   B.   C.
qwen2.5vl:7b          14   20   20
gemma3:12b-it-qat     14   19   17
llama3.2-vision:11b    3    2    0
Qwen and Gemma did OK in these tests, especially when I asked them to put a little more effort into the problem. Theorizing that the libraries might be downsizing the images and making the small crows even harder to recognize, I cropped the original test image and tried the prompts again (note: the cropped image is at the top of the page; the original image was zoomed out like the boxed image above). As seen below, the cropped image resulted in higher counts for both Qwen and Gemma (the crop itself is sketched after the table).
A. How many crows are in this picture?
B. Very carefully count the number of crows in this picture.
C. Very carefully count how many birds are in this picture.
Model                 A.   B.   C.
qwen2.5vl:7b          20   25   25
qwen2.5vl:32b         24   30   24
gemma3:12b-it-qat     25   32   30
llama3.2-vision:11b    0    0    2
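The crop step itself is trivial; here's a minimal Pillow sketch (the box coordinates are placeholders):

from PIL import Image

img = Image.open("crows.jpg")
img.crop((400, 200, 1400, 900)).save("crows_cropped.jpg")  # (left, top, right, bottom) in pixels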
The frustrating part about working with counting questions is that you usually don't get any explanation of how the model came up with its number. I did notice that when I switched to the larger 32B Qwen model, the LLM would sometimes give me positional information (e.g., "1 crow near the top left corner", "2 crows near the center top", etc.). However, the number of crows in that list was never the same as the number it reported back as its final answer (e.g., 21 vs. 30). When I compared the answers with what I see in the picture, the locations were definitely swizzled (e.g., it should be 2 in the top left and none in the center).
Thoughts
So, not great but not terrible. I'm impressed that the home versions of Qwen and Gemma were able to come up with answers that were in the right ballpark. It is troubling, though, that the models generated believable-looking evidence for answers that were just wrong. I'm even more bothered by the visual results where crows were added to better match the answer.
One saving feature for Qwen, which I'll have to write about later, is that it is designed to produce bounding boxes (or grounding) for detected objects. I've verified that these boxes are usually correct, or at least include multiple instances of the desired item. I feel like this should be an essential capability of any visual LLM. We don't need any hallucinations, uh, around here.
A few months ago I got more serious about doing AI things at home and switched from using online services (Gemini and OpenAI) to locally-hosted models. While the change was partially motivated by a desire to wean myself off the big AI companies, I've also been wanting to do something with the unused GPU that my son left behind when he went off to college. I initially went down the HuggingFace path of manually downloading models and using the Transformers library to code up everything. More recently I've switched to using Ollama, which automates the whole process and provides a nice OpenAI-compatible API that I can talk to with code or hook up to OpenWebUI. My son's 12GB RTX 3080 barely qualifies as an entry-level GPU for AI work, but it's good enough that 3-5B parameter models are fairly responsive.
Lately I've been looking into how well vision models like Qwen2.5-VL and Gemma3 can summarize images. My motivation for this is that eventually I'd like to scan my home photos and have a vision LLM write a one-line text summary of each picture so I could then search for things via grep. Since I won't be getting a larger GPU anytime soon, I've been experimenting with smaller models like qwen2.5vl:7b, gemma3:12b-it-qat, and llama3.2-vision:11b. While Qwen is the only one that really works well with 12GB of VRAM, the others are small enough that my son's 8-core CPU can process prompts in 10-30 seconds. In this post I wanted to capture some of my observations on quality when using these smaller models.
Simple Prompting
After some trial and error, I settled on the prompt below. There's a battle story in every clause of that prompt. I initially started with something similar to "describe this picture", but quickly found that all the models generated lengthy paragraphs about the image, usually with irrelevant details, extra formatting, and lead-in sentences (e.g., "Here is a description of the image:"). The "no formatting" clause removed most of the Markdown and took away some of the writing primitives for organizing content into bullet lists. Next, adding "single sentence" usually constrained how much the models wrote. Finally, adding "concisely" helped some of the models choose their details more carefully. While I noticed that "concisely" sometimes caused a model to omit key details it had reported in previous responses, I appreciate having the answer boiled down to as short a form as possible.
Write a single sentence with no formatting that concisely
describes this picture.
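Wired into the eventual scan-and-grep workflow, the loop would look something like this. It's a sketch using the ollama Python client; the photo directory and index file name are placeholders.

import glob
import ollama

PROMPT = ("Write a single sentence with no formatting that concisely "
          "describes this picture.")

with open("photo_index.txt", "w") as out:
    for path in sorted(glob.glob("photos/*.jpg")):
        reply = ollama.chat(
            model="qwen2.5vl:7b",
            messages=[{"role": "user", "content": PROMPT, "images": [path]}],
        )
        summary = reply["message"]["content"].strip().replace("\n", " ")
        out.write(f"{path}\t{summary}\n")

# Later: grep -i jellyfish photo_index.txt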
Front Page Images
For testing, I used a few of the images that I've previously posted on the front page of this website. I wanted to pick images that the models had not seen before, as I've seen cases where a bad vision model can extract just enough information that a good LLM can build out a long and convincing text response. Below is a collage of the pictures I used. For each image I ran the above prompt multiple times to get a general idea of how well the models were working, and then extracted the responses for all the models in the last run.
Rex the Inflatable Cowboy
The first image I tried out was Rex, an inflatable cowboy that they use for the Livermore football team. It's a good test image because it's a common thing you'd see at sporting events, but it's not quite inflated in a way that makes it clear what's happening. Below are the answers the models gave me. Qwen and Gemma both did well, though Gemma thought this was a tennis court. In runs before I added "concisely", Qwen properly identified that the cowboy was named Rex. I'm disappointed in Llama's answer. It didn't mention a cowboy or an inflatable, and it thought Rex was holding a fishing rod.
qwen2.5vl:7b
A large inflatable character resembling a cowboy is lying on the
ground at a sports field during dusk.
gemma3:12b-it-qat
A large, deflated inflatable cowboy mascot labeled "Rex" lies on
a tennis court during sunset, surrounded by people.
llama3.2-vision:11b
The image shows a cartoon-style illustration of a man with a large
head and small body, wearing a brown hat and holding a fishing rod.
Fuzzy Bee
I expected the bee-on-a-flower picture to be easy for the models to figure out because it's a basic nature shot. While Qwen and Gemma did fine, I was disappointed that Llama did not identify the bee.
qwen2.5vl:7b
A bee hovers near delicate pink flowers on a branch.
gemma3:12b-it-qat
A honeybee is perched amongst a cluster of delicate pink flowers.
llama3.2-vision:11b
The image depicts a close-up of a pink flower with yellow stamens,
showcasing its intricate details and vibrant colors.
Project Gate
I like the Los Alamos Project Main Gate picture as a test because it has some easy-to-read text that can provide historical context. All three models extracted the text off the sign, though Llama only gave the first three words in all caps. Interestingly, Qwen and Gemma identified this as a black and white photo, while Llama identified specific features as being black or white.
qwen2.5vl:7b
A black and white photo shows a small building with a sign
reading "Los Alamos Project Main Gate" and a car parked nearby
under a vast sky.
gemma3:12b-it-qat
This black and white photograph shows a guardhouse and sign
reading "Los Alamos Project Main Gate" with a car parked nearby
against a backdrop of mountains and desert landscape.
llama3.2-vision:11b
The image features a white sign with black text, which reads
"LOS ALAMOS PROJECT" and is situated in front of a white building
with a large window.
Curious as to whether the models could detect anachronisms, I asked some follow-up questions involving dates. The models consistently reported that the photo was likely from the 1950s or 1960s, citing the history of Los Alamos, the style of the buildings, and the use of black and white photography. Qwen noted that "the design of the car, with its rounded edges and distinct shape, aligns with vehicles from that time period," so I asked the question below about the car. I think it's interesting that once it focused on the car, it identified it as a minivan that was out of place.
What year was the car in this picture made?
The image you've provided is a black and white photograph of a
building with a sign that reads "Los Alamos Project Main Gate."
This is a historical image, and the car in the picture appears
to be a minivan, which was not a common vehicle in the early
years of the Los Alamos Project. The Los Alamos National
Laboratory was established in 1943, and the earliest minivans
were introduced in the 1980s. Therefore, the car in the picture
is likely from the 1980s or later, as it does not match the
design of vehicles from the 1940s or 1950s, which were the
time periods when the Los Alamos Project was active.
Jellyfish
The jellyfish picture seemed like a good test because Velella velella isn't all that common, but the beach should provide enough context for a model to estimate that it's probably some kind of jellyfish. Gemma did great. Qwen misidentified the jellyfish as glass. Llama didn't even realize this was a beach.
qwen2.5vl:7b
Two pieces of glass lie on a sandy beach, reflecting light and
surrounded by small grains of sand.
gemma3:12b-it-qat
The picture shows several translucent blue jellyfish washed up on
a sandy beach.
llama3.2-vision:11b
This image depicts a close-up of a small, dark gray stone with a
rough texture and a hole in the center, featuring a lighter
gray interior.
Thoughts
Nontechnical people have seen or used enough AI tools by now that most would shrug their shoulders and say "so what?" about these open models. The thing that's significant to me is that you can download them for free and use a low-end GPU to perform image-understanding tasks that would have been difficult to get done just a year ago. The commercial sites are much better and faster, but I don't want to send them my images or help the companies build up their empires more than I already have.
I was surprised at how much of a difference I saw in these small LLMs. Last fall I started using Llama 3.2 because it was the next big step after CLIP. Llama's been my go-to model for image work because the text end has always felt solid and you can scale up to a massive number of parameters when you have the hardware. In the tests above, though, Llama was noticeably worse than Qwen and Gemma. It also took wandering walks that were off topic, which is a pain when you're trying to feed results into a large database.
I should emphasize that there's some unfairness in these tests because the Qwen model I used is only 7B parameters, while Gemma is 12B and Llama is 11B. The giant models for Llama and Gemma likely give better results, but again, I want to use the smallest model that I can. Looking forward, it'll be interesting to see how well Qwen3-VL does when it's released. When I plug the project gate picture into Qwen's online chat, the 32B-parameter models automatically spot the minivan and estimate the picture is from the 1990s (update: this also worked in the 32B 2.5VL). Hopefully Qwen3 vision will have a smaller model that can do the same.