Craigslist is an interesting source of text data. In addition to providing a continuous stream of user postings from around the country in organized categories, the website stubbornly favors a plain-old-web format that's easy to retrieve and parse. I believe craigslist gets a lot of traffic from different kinds of scrapers. In addition to all the search engine crawlers, you hear stories about how individuals run scripts to continuously watch their local boards so they can be the first to snatch up free items. Craigslist blocks people that aggressively crawl the site, but otherwise let you wander around if you put in some rational delays.
Back in September I wrote some utilities to go off and scrape job postings from craigslist, because I thought it would be interesting to see what kind of people Bay Area companies wanted. After working out how to grab the data in an unobtrusive way, I updated my script to grab tech job postings from different cities around the country. I run the script about once a week, which over the last 9 months has given me about 32k postings, totaling 470MB in text data. This post just focuses on the scraping. I'll get to the analysis later.
Craigslist puts each post as a separate web page, and uses a city/topic/post directory structure to keep things organized. While the post part of the url is unique and non-sequential, they provide an easy-to-parse index page for each topic that will give you all the urls for the posts in reverse chronological order. All one has to do is pick a city and a topic, walk through the index, and retrieve the individual posts. I put some delay in after every page I fetched to be polite. I also randomized the city list on each run to even out the data if the grabs were taking too long and needed to be cut off (though always getting ATL would have been fine for me). To help with statistics, I had the script store basic information about runs in a local sqlite database. The database helps avoid downloading the same post twice, and gives me a place to store the dates of when I first and last saw a particular post.
Grabs Per Day
Below is a breakdown of how many posts I grabbed for each city when I ran the scraper. Since the script only grabs posts that it hasn't seen before, the per day grabs go up and down based on how frequently I ran the script (eg, when I missed a week or two, there was more data available to grab). For this time period, the cities seem to be fairly proportional. The big job cities seem to be San Francisco, Seattle, New York, and Boston (not unexpected). C'mon Atlanta. It's like you're not even trying.
Number of Active Days per Post
Another interesting statistic for me was how long job postings remain active on craigslist. I used the "first seen" and "last seen" dates stored in my meta data to estimate the amount of time I post stays alive. The numbers are off due to the initial posts I pulled (ie, I looked at the grab date instead of the post date) and the most recent posts (ie, which have not expired yet). As the below (logscale!) plot shows, most posts stick around for about a month. However, there are a few the last as long as 80 days.
It isn't much but I put the code for this on github: