Monday, September 1, 2014

Science: variations on a theme

As two researchers at Carnegie Mellon said so eloquently in a recent paper, "One of the major pursuits of science is to explain complex phenomenon in simple terms."

Over the past month I've seen this "simple" theme play out in a number of ways. Even in entirely different situations the end goal is always the same: take a highly complicated slew of information and mold it into something usable. A prime example of this presents with high dimensional data sets, but I've also heard it discussed at length in the context of clinical applicability.

John P. Cunningham and Byron M. Yu's paper gives a great explanation of when dimensionality reduction can cut through an enormous cloud of data and pull out the meaningful features. Their research is in neuroscience, so the need to condense input without losing the connections between the neurons is key. In the paper, Dimensionality reduction for large-scale neural recordings (Nature Neuroscience, August 2014), the authors discuss why and how dimensionality reduction can pull out underlying explanatory variables, often referred to as latent variables. The process allows for sanity-checking visualizations as well as data-driven hypothesis development.

Interestingly, I also saw a very similar rationale exercised by the Agency for Healthcare Research and Quality's (AHRQ) in developing a condensed socioeconomic index score. Given a huge number of potential demographic variables from which to choose, the AHRQ turned to Principal Components Analysis (PCA) to dimensionally reduce the selection and then weight the resulting variables. Very different application, but very similar desired outcome.

I mentioned clinical applicability because it has very strong parallels to dimensionality reduction: take a very dynamic picture and distill down to a simple result. For example, given the dose-response relationship and pharmacokinetics of drug X, how much should be administered? Some situations call for categorical or dichotomized dependent variables, while others involve more sophisticated modeling that comes down to p-values. It's an interesting dialogue that doesn't have one right answer. Nevertheless, it's an important theme to keep in mind when thinking about how we ask questions.

Tuesday, July 29, 2014

Proof: It's summer.

It's really hot in Tennessee.  I made this graph using data from the National Oceanic and Atmospheric Association (click here) to reassure myself that an end was in sight. The summer temperatures don't look quite as high as I expected, but these are averages throughout the day. I couldn't find very extensive data on daily highs and lows. 
Notice how much variation there is in January weather! January has a mean temperature of 33.09*F with a standard deviation of 13*. July 1st, on the other hand, has an average temperature of 71.52* and a standard deviation of 4.8*. Yes, quantile plots were normal.

Monday, July 28, 2014

Stay in school.

Here are some choropleths I made that show the percent of people in Tennessee with only a high school degree or at least bachelor's degree. I used census data from the American Fact Finder and ggplot's map_data() to get the coordinates for each county. Dekalb County didn't have any census data. 

This plot (below) is slightly more sophisticated in that it uses shape files from the census bureau at the tract level. It shows median income per census tract.

Friday, June 27, 2014

Shiny is always better.

After dabbling in Objective-C for a few days, I realized that app programming is pretty different from the database managing I do during the day. Thus, I was really excited learn about a thing called Shiny. Shiny is an app-development package in R. Woah! It seemed like a great way to accustom myself to app development without having to simultaneously learn a new language.

I decided to actually learn a few things and take the Developing Data Products course offered by Coursera. Shiny apps are a major component of the course, including the course project. Here's a link to mine:  The Shiny servers have been having some issues keeping up with the deluge of Coursera users, so it if doesn't work then try back later.

Of course, this was an exercise in form over content. I've included the code in the comments if you're curious how to make a slider bar that changes the size of an image. More information on how to host your apps can be found at

Monday, April 14, 2014

Randomly Generated Friends

Here's a little network plot I made to illustrate a few of the concepts I talked about in the previous post.  The entire network is made of three randomly-generated smaller networks that I connected with a few more randomly generated links. My code is included as a comment... but if you don't look that far, at least know that the igraph package has a lot of really neat options for network plotting.

The network as a whole has a modularity of 0.567. The smaller clusters have 0 modularity because everyone knows everyone else; they're very small world and have average path lengths that are exceedingly close to 1.

Finally, the figure illustrates an assortativity of 0.29. Positive assortativity (which can fall between -1 and 1) indicates that nodes tend to be connected to nodes with similar numbers of connections. That seems reasonable here because most nodes are connected to others in their group, but some aren't.

Happy networking!

Saturday, March 8, 2014

The Vikings: Social Networking since 1200 AD

Within reason, dropping academic vocabulary bombs can be a downright satisfying experience. Thanks to a fascinating paper from the December edition of the Royal Statistical Society, I am pleased to provide you with a tidy little nugget to detonate: Network Theory.

Padraig Mac Carron, a PhD student from Coventry University in London, and professor/colleague Ralph Kenna, had the intriguing idea to use statistical analysis to quantitatively study the characteristics of the social networks built into the Icelandic Sagas. They surmised that realistic social structure would be a strong indication that the sagas were at least loosely based on real people or real groups of people. While there isn't any reliable way to validate their work, the results of their investigation point to the sagas being realistic if not partially factual.

A few definitions are helpful to understand what they did:

Nodes: connection points, or in the case of social networks, people

Links: connections between points, or relationships

Degree: the number of relationships a node has

Degree distribution: the probability distribution that describes how many degrees a node is likely to have

Mean path length: the mean distance to connect any two nodes in a network

Clustering coefficient: the fraction of friend-of-a-friend relationships that are realized for a node

Small world: used to describe a network that has a short mean path length and a large clustering coefficient. 

Assortative: nodes or people with similar numbers of friends are also connected to each other. Interestingly, assortativity is quantified by the correlation coefficient, r, that falls between -1 and +1. Usually we see r as a way to describe how tightly correlated points are on, say, the XY plane. 

Disassortative: some nodes have many more links than other nodes

Modularity: if you didn't know anything about the network a priori, how many internal communities would it seem to contain? If the network is only one network, the modularity would be 0. If the community could be split up into many smaller, distantly linked communities, the modularity would approach 1. 

What, then, did the authors do with all of these ideas... and Vikings? They did what any level-headed statistician would do: compare the sagas to each other and to something known. In this case, they used modern society as the frame of reference.  Relative to each other, the sagas had distinct features that reflected the genre of the tales. When clumped together into meta-networks (not a word the authors employed), the 5 largest sagas showed strong overlap. The stories were conceivably written in reference to each other. When all 18 sagas were grouped into one giant network, the authors found that the meta-network was structurally similar to modern society.  The mean path length was an achievable 5.5, and the network was small world and assortative. Even more, the huge community had modularity of 0.7 and 9 communities, meaning that the overlap in the sagas would make it impossible to split them back into 18 discrete bunches.

But Nancy, what does it mean?

It means that you and I both now have a general understanding of Network Theory. And, it means that Viking sagas are more realistic than you ever knew. I found a cool packages in R to make network plots, so that will be my next post!

Sunday, January 26, 2014

Statistics in a box!

From the plotly website. *not my work*
I came across an interesting site recently that I thought was worth mentioning. The site is called plotly, and can be found here. It's a free online analysis and visualization application that flaunts easy usability and beautiful graphics.  While I wasn't entirely convinced that it was fantastically intuitive, the demo plots are beautiful.  I think I'd like it better if I used the R option... but then might as well just stay in R.

My main point, however, is that I think plotly does a great job at demonstrating how stunning visualization can be. Just browse through the website-- it's mesmerizing. Without getting too sappy, I love the idea that a display's aesthetic can be as cool as the information itself. Actually, it seems natural that that be the case.

I do, however, have one particular qualm with the site. It seems almost risky to make analysis and visualization so accessible and aesthetic-driven. Particularly if the interface is designed for non-statisticians, what are the risks of processing data according to what "looks right" as opposed to what is definitely (probably) right? Food for thought if anything...


Sunday, December 15, 2013


The irony of intensive problem solving is that the best answers seem the most obvious in retrospect. I'll use my week as a case study, employing a much smaller version of my task as an illustrative example. Essentially, I was given a matrix of categorical variables and a corresponding matrix of "populations" that are in each category per row. The task was place each population in a column designated for the given category. I was working in a 220 x 51, so my example is scaled down to 8 x 2.

The data started out looking like this:
2 23  0 NA  1 44  3 17  
4 62  3 91  0 NA  0 NA  

I needed it to look like this:
44      23       17    0
0       0        91    62

Take a look at the bolded numbers above. In the original matrix, the "1" category had a population of 44. The final result placed the 44 in the first entry of the row. See?

Of course, the first step was to properly sort the categories such that the population stuck with the right number. Thus, I pasted the category and population columns together, sorted, and then re-split the columns to restore the original dimensions. Next, I poured the data back into two matrices: one category matrix, and one population matrix. They looked like this:

Category matrix:
1   2   3   0 
3   4   0   0  

Population matrix:
44   23   17  0 
91   62   0   0 

Super. The problem, however, remained that the populations needed to be in the corresponding category column. Instead, all the values were squished to the left side of the matrix. The solution? Use each category's number as an index for where each population should go. I decided to use an intermediate step with 1's and 0's as placeholders so that I could monitor the process more easily without confusing any data. 

Intermediate step:
1  1  1  0
0  0  1  1

achieved by this R code:
poof = matrix(rep(0,prod(dim(population_matrix) 
#make an empty matrix to be filled with category-indexed population values

for (i in 1:dim(poof)[1]){
  poof[i,categories_matrix[i,]] = 1

#Put a 1 in the column indicated by the category number

I then made both matrices into vectors for simplicity's sake.
popvec = as.vector(t(population_matrix))
indvec= as.vector(t(poof))

Finally! The code below reads, "For each nth 1 in the categories vector, replace it with the nth nonzero population in the population vector."
indvec[which(indvec == 1)] = (popvec[popvec != 0]

Rebuild the matrix from the vector, and voilĂ . Donezo.

Sunday, December 1, 2013

The Cochran-Armitage trend test

Histograms of the data both before (left) and after (right) binning.
An esteemed and fearless superior at work recently asked me if I knew anything about the Cochran-Armitage test for trend. I didn't, so here's a blog post to substantiate my response. 

In a nutshell, the Cochran-Armitage test decides whether increasing doses of an independent variable are associated with a trend in a binary dependent variable outcome. In other words, does adding more and more of something result in a higher probability of a certain result (with two possibilities)? As simple and crucial as this question may seem, I was a bit surprised at how elusive the Cochran-Armitage test was online. I ultimately
found it most informative to compare the test to a few other better-known techniques. Vanderbilt University's Biostatistics Department has some great practice data sets, so I snagged one that included data on heart disease (yes or no) and patients' ages (more or less continuous). See the end of this post for references and due credit.

Here's the question: Does heart disease have a dose-response relationship with age? The heart disease data was already conveniently coded with patient ages and 1's or 0's for disease or no disease, respectively. The ages, however, ranged from 17 to 82. Thus, I "binned" the age data into 7 groups to give myself a more manageable number of "doses".  

Below are a few plots showing different ways to model the data. The R code that describes the models is included beside each image. Rather than making you read through all the code, however, I'll tell you this: the p-values associated with each method are resoundingly the same. Each model resulted in a p-value of 2.2 e-16. My conclusion? The Cochran-Armitage test is nothing other than a modified regression model. If there's a trend, the p-value will be small. 

Linear Regression

lm(formula = acath$sigdz ~ acath$age)


Significant Coronary Disease by Cardiac Cath 
    Min      1Q  Median      3Q     Max 
-0.9939 -0.5446  0.2308  0.3490  0.6564 


             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.0480370  0.0413568   1.162    0.246    
acath$age   0.0118231  0.0007772  15.212   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4569 on 3502 degrees of freedom

Multiple R-squared:  0.06198, Adjusted R-squared:  0.06171 
F-statistic: 231.4 on 1 and 3502 DF,  p-value: < 2.2e-16 

Linear Regression
lm(formula = acath$sigdz ~ age_bins)


Significant Coronary Disease by Cardiac Cath 
    Min      1Q  Median      3Q     Max 
-0.9083 -0.5833  0.2000  0.3084  0.6334 


             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.1499592  0.0367136   4.085 4.51e-05 ***
age_bins    0.0108335  0.0007533  14.382  < 2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4584 on 3502 degrees of freedom

Multiple R-squared:  0.05577, Adjusted R-squared:  0.0555 
F-statistic: 206.8 on 1 and 3502 DF,  p-value: < 2.2e-16

Logistic Regression
glm(formula = acath$sigdz ~ acath$age)

Deviance Residuals: 

    Min       1Q   Median       3Q      Max  
-0.9939  -0.5446   0.2308   0.3490   0.6564  


             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.0480370  0.0413568   1.162    0.246    
acath$age   0.0118231  0.0007772  15.212   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.208746)

    Null deviance: 779.33  on 3503  degrees of freedom

Residual deviance: 731.03  on 3502  degrees of freedom
AIC: 4458.4

Number of Fisher Scoring iterations: 2

Analysis of Variance Table

Response: acath$sigdz

                      Df Sum Sq Mean Sq F value    Pr(>F)    
as.factor(age_bins)    6  45.66  7.6092  36.269 < 2.2e-16 ***
Residuals           3497 733.68  0.2098                      
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Asymptotic General Independence Test
Here it is! This is what R considers Cochran-Armitage testing. Alas, the output is similar as ANOVA, but the guts of the model are left to our imagination.

data:  sigdz by age

chi-squared = 217.1192, df = 1, p-value < 2.2e-16

Source of Data:
I used the data set called "Cardiac Catheterization Diagnostic Data from Vanderbilt University Biostatistics Department website. The same data set can be found at

Friday, November 29, 2013

Advocating for the Devil

Statistics gets a bad rap for being intentionally misleading. Women feel that they are unfairly underrepresented in technical fields. I'll withhold my own opinions for the moment and offer a digestible article that covers both bases. This article on Science Daily gives a nice summary of an analysis conducted by the American Institute of Physics Statistical Research Center that uses basic stats to show that the bias may not be as clear-cut as we think.

Happy holidays! 

American Institute of Physics (AIP). "All-male physics departments are not proof of bias against hiring women, analysis finds." ScienceDaily, 19 Jul. 2013. Web. 29 Nov. 2013.