The State of Mobile Broadband in New York State
For my final project for Data Without Borders and Understanding Networks, I decided to analyze broadband speed data and per capita personal income data and see if I could find a relationship between the two. Initially, I had planned to analyze data for the entire country but as it turns out, that’s a lot of data. The mobile broadband data set for the nation comes out to about 11GB, which is impractically large to use with R or most mapping solutions. I decided to limit my scope to mobile broadband networks in the state of New York, just to make things easier.
For the broadband data, I went to the FCC, which provides broadband speed data broken out by state and type (e.g. fixed vs. wireless). I had hoped to use actual observed speeds, but the columns that were supposed to contain that data were blank, so I had to settle for maximum advertised speeds (which, as you probably know, tend to be quite optimistic). For the income data, I used the New York State census results from 2010. Despite some inconsistencies in format, both data sets contained a 5-digit FIPS code, which identifies a geographic area (strictly speaking, a 5-digit FIPS code identifies a county; census tracts have longer codes). Using R, I merged the two sets on the FIPS code.
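The merge step might look something like the sketch below. The two small data frames are toy stand-ins, and every column name other than FIPS is hypothetical. It also assumes the format inconsistency is of the common kind where one file stores the code as a number and the other as text, which may not be what actually happened here.

```r
# Toy stand-ins for the two data sets; all column names except FIPS are hypothetical.
fcc <- data.frame(
  FIPS = c(36001, 36001, 36061),          # code stored as a number
  maxaddown = c(4, 8, 10),
  stringsAsFactors = FALSE
)
income <- data.frame(
  FIPS = c("36001", "36061", "36119"),    # code stored as text
  income = c(45000, 110000, 70000),
  stringsAsFactors = FALSE
)

# Normalize both keys to 5-character, zero-padded strings before merging.
fcc$FIPS <- sprintf("%05d", as.integer(fcc$FIPS))
income$FIPS <- sprintf("%05d", as.integer(income$FIPS))

# Inner join: keeps only the FIPS codes that appear in both tables.
merged <- merge(fcc, income, by = "FIPS")
print(merged)
```

Because merge() does an inner join by default, any FIPS code that appears in only one of the tables is silently dropped, so it is worth comparing row counts before and after.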
Before I went any further, I decided to do a bit of analysis. I went ahead and created a plot of income vs. wireless download speed:

As you can see, the data separated pretty cleanly into “buckets”. That’s because advertised broadband speeds generally tend to be whole numbers: most carriers will advertise 8 Mbps down as opposed to, say, 8.43 Mbps. I tried adding jitter to the points, but that didn’t make the graph any easier to read. So I went ahead and plotted a linear regression line on the graph:
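A plot along these lines can be put together in base R with jitter() and abline(). The sketch below uses simulated data rather than the real FCC and income tables, so the numbers are made up; only the names newmerged, wirelessdl and income come from the lm() call in this post. It also passes the data frame through the data argument instead of using the $ form.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated stand-in data: whole-number speeds loosely tied to income.
n <- 500
income <- runif(n, 20000, 120000)
wirelessdl <- pmin(pmax(round(4 + 8e-05 * income + rnorm(n, sd = 2)), 0), 12)
newmerged <- data.frame(income = income, wirelessdl = wirelessdl)

# Fit speed as a linear function of income.
fit <- lm(wirelessdl ~ income, data = newmerged)

# Draw to a null device so the script also runs without a display;
# drop the pdf(NULL) and dev.off() lines to see the plot interactively.
pdf(NULL)
# jitter() adds small vertical noise so points in the same bucket overlap less.
plot(newmerged$income, jitter(newmerged$wirelessdl, amount = 0.3),
     pch = 16, col = rgb(0, 0, 0, 0.3),
     xlab = "Per capita income", ylab = "Max advertised download speed (Mbps)")
abline(fit, col = "red", lwd = 2)  # overlay the fitted regression line
invisible(dev.off())
```

Semi-transparent points (the fourth argument to rgb()) are another way to cope with overplotting when there are hundreds of thousands of rows.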

As you can see, the linear regression line shows a positive correlation between income and broadband speed. But is the relationship legit? Here are the summary stats that R gave me for the regression:
Call:
lm(formula = newmerged$wirelessdl ~ newmerged$income)

Residuals:
    Min      1Q  Median      3Q     Max
-6.0631 -1.7524  0.1637  1.8632  4.4579

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      4.019e+00  2.389e-02   168.2   <2e-16 ***
newmerged$income 8.262e-05  6.350e-07   130.1   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.992 on 217558 degrees of freedom
Multiple R-squared:  0.07219,  Adjusted R-squared:  0.07219
F-statistic: 1.693e+04 on 1 and 217558 DF,  p-value: < 2.2e-16
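Based on my extremely rudimentary understanding of statistics, that really low p-value (less than 2.2e-16) suggests that the relationship is very unlikely to be a fluke. That said, with more than 200,000 rows even a weak relationship will come out as highly significant, and the R-squared of about 0.07 means income accounts for only around 7% of the variation in advertised speed. So the correlation looks real, but it’s a weak one.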
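The figures in the summary can be cross-checked, and the slope put into more tangible units, with a bit of arithmetic. The sketch below only reuses numbers printed in the regression output above. Reading the slope as Mbps per dollar is an assumption about the units of the two columns.

```r
# Values copied from the regression summary in this post.
slope <- 8.262e-05
slope_se <- 6.350e-07
r_squared <- 0.07219

t_slope <- slope / slope_se   # estimate / standard error; reported as 130.1
f_stat <- t_slope^2           # with a single predictor, F equals t squared
r <- sqrt(r_squared)          # correlation; positive because the slope is positive
per_10k <- slope * 10000      # change in speed per extra $10,000 of income

cat(sprintf("t = %.1f, F = %.0f, r = %.2f\n", t_slope, f_stat, r))
cat(sprintf("Speed change per $10,000 of income: %.2f\n", per_10k))
```

In other words, under the assumed units, an extra $10,000 of per capita income goes with roughly 0.83 Mbps more advertised download speed.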
After establishing the connection between income and speed, I decided to visualize the data using CartoDB. I fed my data into CartoDB and had CartoDB resolve geocodes (latitude and longitude) from the city names in the income data. I’m guessing that they simply make a call to the Google Maps API for each point, in order to get the geodata.

Finally, I plotted all of the points on the map and used CartoCSS to style the map, which took a lot longer than I had anticipated. Not only are there two variables for each row in the table (income and download speed), but many locations on the map also have multiple data points, as most areas are served by multiple wireless providers. I tried to tweak the styles so that the relationship between the variables would be clear, which was pretty challenging. Here is the end result.
After finishing the map, I realized that it might have made more sense to just display the highest download speed for each point rather than all of the various speeds, just to make the map a bit more readable. I decided to go back to R and limit the data to just the fastest download speed for each FIPS code. In order to do this, I used a function from the “plyr” package called “ddply”:
library(plyr)  # provides ddply
fcc4 <- ddply(fcc3, ~FIPS, function(x) x[which.max(x$fcc2.maxaddown), ])  # fastest row per FIPS
Currently, R is chugging along, trying to apply that function to my 340,144 rows of data. After that’s done, I’ll have to feed the new table to CartoDB and have it re-geocode all of the data, which could also take a while. I’ll update this post if I’m able to finish the second map in time—it would be nice to see them side-by-side to see which one is more readable.
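If ddply turns out to be too slow on a table this size, a vectorized base R approach is usually much quicker, because it avoids calling a function once per group. The sketch below is one possible alternative, not what I actually ran. The data frame is a toy stand-in for fcc3, and the provider column is hypothetical. FIPS and fcc2.maxaddown are the column names from the ddply call.

```r
# Toy stand-in for fcc3; the provider column is hypothetical.
fcc3 <- data.frame(
  FIPS = c("36001", "36001", "36061", "36061", "36061"),
  provider = c("A", "B", "A", "B", "C"),
  fcc2.maxaddown = c(4, 8, 10, 6, 10),
  stringsAsFactors = FALSE
)

# Sort so that the fastest row of each FIPS group comes first...
sorted <- fcc3[order(fcc3$FIPS, -fcc3$fcc2.maxaddown), ]
# ...then keep only the first row seen for each FIPS code.
fcc4 <- sorted[!duplicated(sorted$FIPS), ]
print(fcc4)
```

When two providers tie for the fastest speed, this keeps whichever row came first in the original table, which matches what which.max does. Rows with a missing speed are handled differently, though: a FIPS code whose speeds are all NA keeps one NA row here, whereas the which.max version drops it.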



