
Colorado Interactive Map


May 10, 2017

Nate Anderson


So it's been a while since I've updated my blog. My initial intent was to grow this into something fairly sustainable, but then I got a job at a software startup in downtown Denver. Then I moved next to Coors Field, and being the stats nerd I am, I've been attending way too many Rockies games these days. Needless to say, the early stages of my job have been more on the clerical side of things - cleaning up datasets and stabilizing existing BI reports - and I've been wanting to get back into some of the real hard analysis I did in graduate school.

 

Background

 

Anyway, one data point that really stood out to me in the months after the election was a blurb in the Washington Post. The opinion piece was attempting to debunk the popular sentiment that Democratic voters are lazy, when in reality a significant majority of employed people in the country vote Democratic. The data came from a Brookings Institution study (see the full study here) showing that the counties Hillary Clinton won accounted for 64% of all economic activity in the country, while the counties Donald Trump won accounted for only 36%. Naturally, I was skeptical.

This is not a political science research project, but that study came back to mind as I stumbled across some quality data on age, income, and ZIP codes in my home state of Colorado that I thought I would analyze - for fun. It does seem to bear on the point in the WaPo article: if you compare these maps to election maps side by side, they seem to agree with the article's initial point, even if they don't definitively reject the null hypothesis. More rigorous analysis would require another API, a join of some sort, and a new R script that regresses the response variables at a level of granularity I don't have access to at the moment. It won't be that hard to find - it's just a question of how much I want to nerd out with my time :D
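
For the curious, here's a rough sketch of what that follow-up could look like in R. Everything in it is a placeholder - the file names (county_income.csv, county_votes.csv) and columns (fips, median_income, dem_share) are made up for illustration, not real datasets:

    # Hypothetical follow-up: join income data to election results by county
    # FIPS code and regress vote share on median income.
    library(dplyr)

    income <- read.csv("county_income.csv")   # placeholder: fips, median_income
    votes  <- read.csv("county_votes.csv")    # placeholder: fips, dem_share

    joined <- inner_join(income, votes, by = "fips")

    # Does median income predict Democratic vote share?
    fit <- lm(dem_share ~ median_income, data = joined)
    summary(fit)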

 

More data stuff

 

One great thing about scripting in Tableau is that you can set up things like simple exponential regressions, correlation matrices, chord diagrams, and even basic cosine-distance clusters in a way that lets you rapidly subset and re-subset data based on your findings and visualizations. If clear outliers are skewing your model, or there are large clusters of points in your Euclidean plane, you can easily remove those subsets - or analyze them separately - using something as basic as a slider. Needless to say, in these cases there are clear trends - and while correlation doesn't necessarily mean causation, you can usually rule out variables as potential causes by running the model over and over again.
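
You can do the same slider-style trimming in plain R, too. A minimal sketch, assuming a hypothetical data frame zips with pop and income columns:

    # Fit a model, flag high-influence points, and refit without them.
    fit_all <- lm(income ~ pop, data = zips)

    # Cook's distance flags observations that pull the fit disproportionately;
    # 4/n is a common rule-of-thumb cutoff.
    influential <- cooks.distance(fit_all) > 4 / nrow(zips)

    fit_trimmed <- lm(income ~ pop, data = zips[!influential, ])
    summary(fit_trimmed)   # compare coefficients before and after trimming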

 

 

This is why Big Data has such huge implications in the world of medicine: not only is it easier to eliminate causes and interactions in pathology (or, in the case of preventive medicine, to find new interactions), you can also examine an enormous number of potential interactions between categorical variables, since computers today can analyze billions of rows of data in minutes, or even seconds.


But I'm getting way ahead of myself with a simple data set of ZIP codes and incomes in Colorado. It's just pretty interesting what these data sets can lead to if you can get the right type of join.

 

One example of Tableau's dynamism is in my model below. There is a clear cluster of data points at the lower end of population, i.e. a lot of ZIP codes with populations of less than 5,000 whose incomes are clearly not correlated with population. While the model is still statistically significant (p = .001), it isn't practically consequential under the default quadratic regression function, f(p) = -6.1p² + 0.67p + 38,040. Keep in mind this negative correlation reflects the entire model; you can eliminate many rural ZIP codes without eliminating much of the population as a whole, which can make the correlation positive. Another example of getting data to say whatever you want! haha
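
If you'd rather reproduce that exercise in R, a minimal sketch (zips is the same hypothetical data frame as above):

    # Quadratic fit over all ZIP codes, then refit after dropping the
    # low-population cluster to watch the correlation change sign.
    fit_quad <- lm(income ~ poly(pop, 2, raw = TRUE), data = zips)
    summary(fit_quad)          # can be significant yet practically tiny

    big_zips <- subset(zips, pop >= 5000)
    fit_big  <- lm(income ~ pop, data = big_zips)
    coef(fit_big)["pop"]       # slope may turn positive once small ZIPs are out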

 

But sometimes you have to manipulate the data to get a true reflection of reality. With Tableau, you can set a dynamic parameter that filters out the number of variables contributing to the model (because populations are aggregated by ZIP) while still keeping most of your data (hence the "percent of population included in dataset" point). Additionally, by playing with the parameters, you might find - simply by looking at the visualizations - particular inflection points (local minima/maxima, in calculus terminology) suggesting that whatever response variable you're looking at has strong "pack" associations. One clear pattern in the model is that rural ZIP codes have a wider range of incomes, while more urban ZIP codes are more consistent. With that in mind, we can call the model heteroscedastic. When a model shows strong heteroscedasticity, it's appropriate to divide it into two - or more - models. I will follow up this post with a deeper dive into the covariance and the changes in heteroscedasticity across the clusters in the model.
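
For anyone who wants a formal check rather than eyeballing the scatter, the Breusch-Pagan test from R's lmtest package does exactly this. The 5,000-person split below is a placeholder for wherever the clusters actually fall:

    # Breusch-Pagan test: rejects the null of constant error variance
    # when the residual spread changes with the predictors.
    library(lmtest)

    fit <- lm(income ~ pop, data = zips)   # zips is hypothetical, as above
    bptest(fit)                            # small p-value => heteroscedastic

    # If the test rejects, split by cluster and model each piece separately.
    fit_rural <- lm(income ~ pop, data = subset(zips, pop <  5000))
    fit_urban <- lm(income ~ pop, data = subset(zips, pop >= 5000))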

 

For example, if America is becoming more divided, as people are saying, then naturally the response variables will diverge more sharply within clusters built on various social barometers. A simple regression model won't catch these trends unless you can easily visualize and subset them. (A Google image search for 'binary heteroscedasticity' will show what a model like this looks like.)
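
If you'd rather not image-search it, here's a quick simulation (all numbers made up) of the pattern: one overall trend line, two clusters with wildly different spreads:

    # Two clusters share the same mean trend but have very different variances:
    # one regression line fits both, yet the residual spread is clearly bimodal.
    set.seed(42)
    x <- runif(500, 0, 100)
    cluster <- rbinom(500, 1, 0.5)
    y <- 2 * x + rnorm(500, sd = ifelse(cluster == 1, 5, 40))

    plot(x, y, col = cluster + 1, pch = 16,
         main = "Binary heteroscedasticity (simulated)")
    abline(lm(y ~ x))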

 

Another, perhaps more accurate, way of doing this (in this particular case) would be to granulate the data by individuals, if possible. However, this would completely negate the benefit of correlating any sort of categorical variable tied to economic/urban/rural areas with any behavioral or demographic trend. You might be able to find a polynomial correlation between income and age and/or voting patterns... but it would have nothing to do with the population density of the ZIP code your subjects live in.

 

Initial findings in this set


This was data that I found and wanted to analyze. The set jogged my memory about the heteroscedasticity, interactivity, and covariance discussed above, though perhaps it isn't an ideal example to demonstrate the point. Still, it's a cool look, and this initial script can be expanded upon if there's any desire to explore further demographic or behavioral categories. Please let me know how you want this data to be split!

 

I made this thing interactive, and you’re encouraged to play with the model yourself and list your findings in the comments section below. Here is what I gathered from the model:

 

  • When factoring out small ZIP codes, there is a slight positive correlation between economic production per capita and population density (this is intuitive)

  • The age function tends to be polynomial, with the middle ages producing the most economic activity and the older and younger ends producing less (also intuitive)

  • Some cool outliers: 80249 seems to be well above the regression line in income, but also one of the youngest ZIP codes… must be a pretty hip area. (EDIT – I looked it up… it’s Green Valley Ranch – what the hell is that about haha)

  • On the flip side, 81005 seems to be the opposite, well below the predicted income given its population density and age… the folks in 81005 evidently can't get ahead in life. (81005 is located in Pueblo, CO.)

  • My district in LoDo - 80205 - is young, but it's much lower than I thought it would be in income. I think this has a lot to do with the fact that the north side of this ZIP code includes most of Denver's impoverished Five Points neighborhood (which is rapidly gentrifying, I might add).

  • The wealthiest ZIP code in Colorado is 80016, which is listed as being in Aurora. Boundary lines have changed since 2014... this ZIP code is actually in Centennial/Lone Tree now.

 

A couple things to point out:

  • A huge limitation of the data is that people often don't live in the same ZIP code they work in. If we could assign nominal GDP by working ZIP code, I suspect Colorado's largest city - there are about 900,000 households with Denver ZIP codes - would go from slightly below the national median to slightly above, and I suspect some of the wealthy suburbs like Littleton and Parker would go down slightly.

  • Even when the data is unfiltered, it only contains about 93% of the total population. This is because I built a filter into the .tde file along the lines of IF [Median Income] = 0 THEN 'Exclude' ELSE 'Include' END. This is something you have to watch out for: NULL values being reported as zero. 'NULL' and 'zero' are not the same thing in data analysis.

  • Filters apply across all sheets. To see a new view, go back to the original dashboard (hit revert on the bottom pane; for some reason the HTML isn't fitting the whole view, and interactivity requires some scrolling - I'm working on this). The reason I made multiple tabs is that I had initially wanted to make this a single dashboard, but the visualizations were too small.

 

Finally, I don't normally get a lot of comments here… I average about 40-80 non-bounced page views, but I really would like to know what sort of things you would like to add to this model, and any other findings you come across.


Enjoy!
