This was one of the first projects I started for this book. But I put it off and didn't revisit it until after I had rewritten and revised the content and structure of the book.
I haven't gotten around to finishing the analysis or drawing any useful conclusions. Early on, it was clear that this data has a lot of quality issues. While few solid statistical conclusions can be drawn from the scraped data, it at least raises the questions of whether this is a useful metric at all and what California can do to improve the quality of the data. It's unclear whether there is any intent to keep collecting the data, as the current set ends at the year 2009. Given that the state's transparency website was recently shuttered and California was never a bastion of open records laws to begin with, don't hold your breath.
At the very least, it's a useful example of web-crawling and data-analysis for those who are just learning programming.
This is a multi-part project intended for beginners that will walk through the concept, exploration, and programming stages of a web-data gathering process. You will learn how to scrape data from a less-than-intuitive government website, compile it into a (SQLite) database, and create visualizations from the data.
The dataset comes from the California Office of Statewide Health Planning and Development (OSHPD) and contains the median costs-per-stay of scheduled (i.e. non-ER) surgeries, according to reports submitted by all California hospitals.
The Concepts
This project is intended for those who've read through the first parts of this book but don't quite understand how everything fits together. I will go through the steps slowly, but won't take as much time to explain each line as I did in the Fundamentals section of the book.
If you're new to data journalism, I will also try to cover some of the caveats, concerns, and opportunities that arise from analyzing public data sets.
Most of all, I hope to convince you of how damned useful it is to know enough code to quickly slice and arrange data as you please to answer (or begin asking) important questions.
Here are the technical concepts that this project will exercise:
I will move slowly through the code. But I'm working under the expectation that you're familiar with the concepts and aren't afraid to use Google or Stack Overflow to help you with what I don't take time to clarify.
The Context
Compared to other states in the Union, California puts a large lode of data about its health system online. A particularly interesting source is the dry-sounding Office of Statewide Health Planning and Development. My ProPublica colleague Charles Ornstein says this about the OSHPD in a recent how-to webinar:
Despite its bureaucratic and jargony name, it is a terrific resource for facts and figures about health care (particularly hospital care) in the state. A look through its website (www.oshpd.ca.gov) can help you discover which hospitals in your community are most profitable, how much they charge for specific procedures, which perform the most C-sections and which are at the greatest risk of collapsing in powerful earthquakes. OSHPD keeps track of data from every region of California--the most urban and most rural.
This project will examine the OSHPD's Common Surgeries and Charges Comparison database. It contains reports from all California hospitals on how many times they performed a given scheduled procedure (i.e. one arranged at least 24 hours in advance), their median charge for that procedure, and the median number of days patients spent in the hospital for it.
The records are broken out by year and by hospital. The OSHPD also sorts out the records by county and city.
Several things keep this data from being as interesting as it could be. As the OSHPD's instructions say, the median charge amount "does not reflect how much the hospital actually received" and excludes "charges for physician services." Also, some hospitals have a big 0 in this field because they do not report the charge amount to OSHPD.
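To make the zero-charge caveat concrete, here is a minimal sketch of how the records might sit in a SQLite table and how the unreported charges could be filtered out. The table name, column names, and sample rows are all made up for illustration; they are not the OSHPD's actual field names or figures.

```python
import sqlite3

# Hypothetical schema: these names are illustrative assumptions,
# not the OSHPD's actual field names.
SCHEMA = """
CREATE TABLE surgery_charges (
    year             INTEGER,
    county           TEXT,
    city             TEXT,
    hospital         TEXT,
    procedure_name   TEXT,
    discharges       INTEGER,  -- times the scheduled procedure was performed
    median_charge    INTEGER,  -- listed charge in dollars; 0 means not reported
    median_stay_days REAL
)
"""

# Made-up placeholder rows, not real OSHPD figures.
SAMPLE_ROWS = [
    (2009, "County A", "City A", "Hospital X", "C-section", 120, 18000, 3.0),
    (2009, "County A", "City A", "Hospital Y", "C-section", 95, 0, 3.0),
    (2009, "County B", "City B", "Hospital Z", "C-section", 40, 42000, 4.0),
]


def build_db(rows):
    """Create an in-memory database and load the given rows into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO surgery_charges VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows
    )
    return conn


def reported_charges(conn, procedure_name):
    """Return (hospital, median_charge) pairs, skipping hospitals that
    left the charge field at 0, ordered from lowest to highest charge."""
    query = """
        SELECT hospital, median_charge
        FROM surgery_charges
        WHERE procedure_name = ? AND median_charge > 0
        ORDER BY median_charge
    """
    return conn.execute(query, (procedure_name,)).fetchall()


conn = build_db(SAMPLE_ROWS)
print(reported_charges(conn, "C-section"))
```

Treating a 0 as "not reported" rather than as a real charge matters: leaving those rows in would drag down any average or median computed across hospitals.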
That said, it's still interesting to see how listed charges can vary wildly for procedures as common as C-sections. We'll see how programming can allow for quick comparisons between hospitals, counties, and any other kinds of categories.
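As a rough sketch of what such a comparison could look like, the function below groups hospital records by any field (county, city, year) and takes the median of the listed charges in each group. The record layout and the sample values are assumptions made up for this example, not real OSHPD data.

```python
from collections import defaultdict
from statistics import median

# Made-up placeholder records, not real OSHPD figures.
# The dictionary keys are hypothetical field names.
SAMPLE_RECORDS = [
    {"county": "County A", "hospital": "Hospital W", "procedure": "C-section", "median_charge": 18000},
    {"county": "County A", "hospital": "Hospital X", "procedure": "C-section", "median_charge": 22000},
    {"county": "County A", "hospital": "Hospital Y", "procedure": "C-section", "median_charge": 0},
    {"county": "County B", "hospital": "Hospital Z", "procedure": "C-section", "median_charge": 42000},
]


def median_charge_by(records, key, procedure):
    """Group records for one procedure by `key` and return the median
    of the hospitals' listed charges in each group.

    Records with a charge of 0 are skipped, since 0 means the hospital
    did not report a charge.
    """
    groups = defaultdict(list)
    for rec in records:
        if rec["procedure"] == procedure and rec["median_charge"] > 0:
            groups[rec[key]].append(rec["median_charge"])
    return {name: median(charges) for name, charges in groups.items()}


for county, charge in median_charge_by(SAMPLE_RECORDS, "county", "C-section").items():
    print(county, charge)
```

Note that this is a median of each hospital's reported median, so it ignores how many procedures each hospital performed. It is good enough for spotting wide gaps, but not for statistical claims.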
Even if this dataset can't be used to draw easy conclusions (e.g. Hospital Y is really overcharging for X procedure), the preliminary answers it provides are a starting point for further examining the complexities and variance in health care billing.
The Steps
This is a relatively easy project to do because the data is straightforward. But I've broken it up into subchapters to make it easier to follow along: