Friday, June 27, 2014

This one is cooler.

Instead of spending more time making inspiring apps like the one in my previous post, I decided to actually learn a few things and take the Developing Data Products course offered by Coursera. Shiny apps are a major component of the course, including the course project. Here's a link to mine! The Shiny servers have been having some issues keeping up with the deluge of Coursera users, so if it doesn't work, try back later.


Of course, this was an exercise in form over content; I'm intentionally not including the code because it counts as coursework.

Saturday, May 17, 2014

This is post worthy.

I spent the morning on what is quite possibly the strangest piece of work I've ever produced. That's saying something. 

After dabbling in Objective-C for a few days, I realized that app programming is pretty different from the database managing I do during the day. Thus, I was really excited to learn about a thing called Shiny. Shiny is an app-development package in R, the language I use nonstop for everything. Super! It seemed like a great way to accustom myself to app development without having to simultaneously learn a new language.

Thank you to the folks on StackOverflow for helping me make a crucial tweak to my code. I've included the code in the comments if you're curious how to make a slider bar that changes the size of an image. More information on how to host your apps can be found at
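For anyone who doesn't want to dig through the comments, here is the general shape of such an app as a minimal sketch (the image file name and pixel range are placeholders, not my actual values): a slider whose current value sets the width of an image.

```r
library(shiny)

ui <- fluidPage(
  sliderInput("size", "Image width (px):", min = 100, max = 600, value = 300),
  uiOutput("pic")
)

server <- function(input, output) {
  # re-render the image tag whenever the slider moves
  output$pic <- renderUI({
    img(src = "myimage.png", width = input$size)
  })
}

# shinyApp(ui, server)  # run locally; a Shiny server can host it
```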

Monday, April 14, 2014

Randomly Generated Friends

Here's a little network plot I made to illustrate a few of the concepts I talked about in the previous post. The entire network is made of three randomly generated smaller networks that I connected with a few more randomly generated links. My code is included as a comment... but if you don't look that far, at least know that the igraph package has a lot of really neat options for network plotting.

The network as a whole has a modularity of 0.567. The smaller clusters have 0 modularity because everyone knows everyone else; they're very small world and have average path lengths that are exceedingly close to 1.

Finally, the figure illustrates an assortativity of 0.29. Positive assortativity (which can fall between -1 and 1) indicates that nodes tend to be connected to nodes with similar numbers of connections. That seems reasonable here because most nodes are connected to others in their group, but some aren't.
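If you don't make it down to the comment with my code, here's a rough sketch of how a graph like this can be built and measured with igraph (the cluster sizes and bridge edges below are arbitrary stand-ins, not my exact code):

```r
library(igraph)

set.seed(1)
# three fully connected cliques (vertices 1-5, 6-11, 12-15)...
g <- disjoint_union(make_full_graph(5), make_full_graph(6), make_full_graph(4))
# ...joined by a few bridging friendships between the groups
g <- add_edges(g, c(1, 6,  7, 12,  13, 2))

modularity(cluster_fast_greedy(g))  # network-wide modularity
assortativity_degree(g)             # degree assortativity, -1 to 1
plot(g, vertex.color = rep(c("tomato", "gold", "skyblue"), times = c(5, 6, 4)))
```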

Happy networking!

Saturday, March 8, 2014

The Vikings: Social Networking since 1200 AD

Within reason, dropping academic vocabulary bombs can be a downright satisfying experience. Thanks to a fascinating paper from the December issue of the Royal Statistical Society's magazine, I am pleased to provide you with a tidy little nugget to detonate: Network Theory.

Padraig Mac Carron, a PhD student at Coventry University, and his professor and colleague Ralph Kenna had the intriguing idea to use statistical analysis to quantitatively study the characteristics of the social networks built into the Icelandic sagas. They surmised that realistic social structure would be a strong indication that the sagas were at least loosely based on real people or real groups of people. While there isn't any reliable way to validate their work, the results of their investigation point to the sagas being realistic, if not partially factual.

A few definitions are helpful to understand what they did:

Nodes: connection points, or in the case of social networks, people

Links: connections between points, or relationships

Degree: the number of relationships a node has

Degree distribution: the probability distribution describing how likely a node is to have a given degree

Mean path length: the mean distance to connect any two nodes in a network

Clustering coefficient: the fraction of friend-of-a-friend relationships that are realized for a node

Small world: used to describe a network that has a short mean path length and a large clustering coefficient. 

Assortative: nodes (people) with similar numbers of friends tend to be connected to each other. Interestingly, assortativity is quantified by a correlation coefficient, r, which falls between -1 and +1. Usually we see r as a way to describe how tightly correlated points are on, say, the XY plane.

Disassortative: the opposite; nodes with many links tend to be connected to nodes with few links, so hubs connect to the periphery rather than to each other

Modularity: if you didn't know anything about the network a priori, how many internal communities would it seem to contain? A network that forms one single community has a modularity of 0; a network that splits cleanly into many small, distantly linked communities has a modularity approaching 1.
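Most of these quantities are one-liners in R's igraph package. A toy illustration (the graph here is made up for the example, not taken from the paper):

```r
library(igraph)

# a tiny made-up social network of six people
g <- make_graph(~ A-B, A-C, B-C, C-D, D-E, E-F, D-F)

degree(g)                 # links per node
mean_distance(g)          # mean path length
transitivity(g)           # global clustering coefficient
assortativity_degree(g)   # assortative (+) vs. disassortative (-)
```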

What, then, did the authors do with all of these ideas... and Vikings? They did what any level-headed statistician would do: compare the sagas to each other and to something known. In this case, they used modern society as the frame of reference. Relative to each other, the sagas had distinct features that reflected the genre of the tales. When clumped together into meta-networks (not a word the authors employed), the 5 largest sagas showed strong overlap; the stories were conceivably written in reference to each other. When all 18 sagas were grouped into one giant network, the authors found that the meta-network was structurally similar to modern society. The mean path length was an achievable 5.5, and the network was small world and assortative. Even more, the huge community had a modularity of 0.7 spread over 9 communities, meaning that the overlap in the sagas would make it impossible to split them back into 18 discrete bunches.

But Nancy, what does it mean?

It means that you and I both now have a general understanding of Network Theory. And it means that Viking sagas are more realistic than you ever knew. I found a cool package in R for making network plots, so that will be my next post!

Sunday, January 26, 2014

Statistics in a box!

From the plotly website. *not my work*
I came across an interesting site recently that I thought was worth mentioning. The site is called plotly, and can be found here. It's a free online analysis and visualization application that flaunts easy usability and beautiful graphics. While I wasn't entirely convinced that it's fantastically intuitive, the demo plots are beautiful. I think I'd like it better if I used the R option... but then I might as well just stay in R.

My main point, however, is that I think plotly does a great job at demonstrating how stunning visualization can be. Just browse through the website-- it's mesmerizing. Without getting too sappy, I love the idea that a display's aesthetic can be as cool as the information itself. Actually, it seems natural that that be the case.

I do, however, have one particular qualm with the site. It seems almost risky to make analysis and visualization so accessible and aesthetic-driven. Particularly if the interface is designed for non-statisticians, what are the risks of processing data according to what "looks right" as opposed to what is definitely (probably) right? Food for thought if anything...


Sunday, December 15, 2013


The bittersweetness of intensive problem solving is that the best answers seem the most obvious in retrospect. I'll use my week as a case study, employing a much smaller version of my task as an illustrative example. Essentially, I was given a matrix of categorical variables and a corresponding matrix of "populations" that are in each category per row. The task was to place each population in the column designated for its category. I was working with a 220 x 51 matrix, so my example is scaled down to 8 x 2.

The data started out looking like this:
2 23  0 NA  1 44  3 17  
4 62  3 91  0 NA  0 NA  

I needed it to look like this:
44      23       17    0
0       0        91    62

Take a look at the pair "1 44" in the first row of the original matrix: the "1" category had a population of 44, so the final result placed the 44 in the first entry of the row. See?

Of course, the first step was to properly sort the categories such that the population stuck with the right number. Thus, I pasted the category and population columns together, sorted, and then re-split the columns to restore the original dimensions. Next, I poured the data back into two matrices: one category matrix, and one population matrix. They looked like this:

Category matrix:
1   2   3   0 
3   4   0   0  

Population matrix:
44   23   17  0 
91   62   0   0 
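That sort step can be sketched in base R for a single row (toy values; my actual code did the same pairing via the paste/split trick):

```r
row_cats <- c(2, 0, 1, 3)        # categories, 0 = empty slot
row_pops <- c(23, NA, 44, 17)    # populations riding along with them

# order nonzero categories ascending and push the empty slots to the end
ord <- order(row_cats == 0, row_cats)
sorted_cats <- row_cats[ord]            # 1 2 3 0
sorted_pops <- row_pops[ord]            # 44 23 17 NA
sorted_pops[is.na(sorted_pops)] <- 0    # 44 23 17 0
```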

Super duper. The problem, however, remained that the populations needed to be in the corresponding category column. Instead, all the values were squished to the left side of the matrix. The solution? Use each category's number as an index for where each population should go. So elegant! I decided to use an intermediate step with 1's and 0's as placeholders so that I could monitor the process more easily without confusing any data. 

Intermediate step:
1  1  1  0
0  0  1  1

achieved by this R code:
poof = matrix(0, nrow = nrow(population_matrix), ncol = ncol(population_matrix))
# make an empty matrix to be filled with category-indexed population values

for (i in 1:nrow(poof)){
  poof[i, categories_matrix[i, ]] = 1
  # put a 1 in each column indicated by a nonzero category number
  # (R silently ignores the zero indices)
}

I then made both matrices into vectors for simplicity's sake.
popvec = as.vector(t(population_matrix))
indvec = as.vector(t(poof))

Finally! The code below reads, "For each nth 1 in the categories vector, replace it with the nth nonzero population in the population vector."
indvec[which(indvec == 1)] = popvec[popvec != 0]

Rebuild the matrix from the vector, and voilà. Donezo.
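For the record, the whole toy example condenses into a few lines. This sketch skips the 1-and-0 placeholder step and assigns each population directly to its category-indexed column:

```r
categories_matrix <- matrix(c(1, 2, 3, 0,
                              3, 4, 0, 0), nrow = 2, byrow = TRUE)
population_matrix <- matrix(c(44, 23, 17, 0,
                              91, 62, 0, 0), nrow = 2, byrow = TRUE)

result <- matrix(0, nrow(categories_matrix), ncol(categories_matrix))
for (i in 1:nrow(result)) {
  cats <- categories_matrix[i, ]
  # nonzero categories index the destination columns directly
  result[i, cats[cats != 0]] <- population_matrix[i, population_matrix[i, ] != 0]
}
result
# 44 23 17  0
#  0  0 91 62
```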


Sunday, December 1, 2013

The Cochran-Armitage trend test

Histograms of the data both before (left) and after (right) binning.
An esteemed and fearless superior at work recently asked me if I knew anything about the Cochran-Armitage test for trend. I didn't, so here's a blog post to substantiate my response. 

In a nutshell, the Cochran-Armitage test determines whether increasing doses of an independent variable are associated with a trend in a binary dependent variable. In other words, does adding more and more of something result in a higher probability of a certain outcome (with two possibilities)? As simple and crucial as this question may seem, I was a bit surprised at how elusive the Cochran-Armitage test was online. I ultimately found it most informative to compare the test to a few other better-known techniques. Vanderbilt University's Biostatistics Department has some great practice data sets, so I snagged one that included data on heart disease (yes or no) and patients' ages (more or less continuous). See the end of this post for references and due credit.

Here's the question: Does heart disease have a dose-response relationship with age? The heart disease data was already conveniently coded with patient ages and 1's or 0's for disease or no disease, respectively. The ages, however, ranged from 17 to 82. Thus, I "binned" the age data into 7 groups to give myself a more manageable number of "doses".  
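The binning itself is a one-liner with base R's cut(); shown here with a handful of made-up ages rather than the real acath column:

```r
ages <- c(17, 23, 31, 40, 48, 55, 63, 71, 82)  # illustrative, not the real data
age_bins <- as.numeric(cut(ages, breaks = 7))  # 7 equal-width "dose" groups
age_bins
```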

Below are a few plots showing different ways to model the data. The R code that describes the models is included beside each image. Rather than making you read through all the code, however, I'll tell you this: the p-values associated with each method are resoundingly the same. Each model resulted in a p-value below 2.2e-16. My conclusion? The Cochran-Armitage test is nothing other than a modified regression model. If there's a trend, the p-value will be small.

Linear Regression

lm(formula = acath$sigdz ~ acath$age)


Significant Coronary Disease by Cardiac Cath

Residuals:
    Min      1Q  Median      3Q     Max 
-0.9939 -0.5446  0.2308  0.3490  0.6564 


             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.0480370  0.0413568   1.162    0.246    
acath$age   0.0118231  0.0007772  15.212   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4569 on 3502 degrees of freedom

Multiple R-squared:  0.06198, Adjusted R-squared:  0.06171 
F-statistic: 231.4 on 1 and 3502 DF,  p-value: < 2.2e-16 

Linear Regression
lm(formula = acath$sigdz ~ age_bins)


Significant Coronary Disease by Cardiac Cath

Residuals:
    Min      1Q  Median      3Q     Max 
-0.9083 -0.5833  0.2000  0.3084  0.6334 


             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.1499592  0.0367136   4.085 4.51e-05 ***
age_bins    0.0108335  0.0007533  14.382  < 2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4584 on 3502 degrees of freedom

Multiple R-squared:  0.05577, Adjusted R-squared:  0.0555 
F-statistic: 206.8 on 1 and 3502 DF,  p-value: < 2.2e-16

Logistic Regression
glm(formula = acath$sigdz ~ acath$age)
(Note: without family = binomial, glm defaults to the gaussian family, which is why the estimates below match the linear model above.)

Deviance Residuals: 

    Min       1Q   Median       3Q      Max  
-0.9939  -0.5446   0.2308   0.3490   0.6564  


             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.0480370  0.0413568   1.162    0.246    
acath$age   0.0118231  0.0007772  15.212   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.208746)

    Null deviance: 779.33  on 3503  degrees of freedom

Residual deviance: 731.03  on 3502  degrees of freedom
AIC: 4458.4

Number of Fisher Scoring iterations: 2

ANOVA

Analysis of Variance Table

Response: acath$sigdz

                      Df Sum Sq Mean Sq F value    Pr(>F)    
as.factor(age_bins)    6  45.66  7.6092  36.269 < 2.2e-16 ***
Residuals           3497 733.68  0.2098                      
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Asymptotic General Independence Test
Here it is! This is what R considers Cochran-Armitage testing. Alas, the output is similar to the ANOVA's, but the guts of the model are left to our imagination.

data:  sigdz by age

chi-squared = 217.1192, df = 1, p-value < 2.2e-16
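Incidentally, base R also ships a more direct implementation: stats::prop.trend.test runs the Cochran-Armitage trend test on binned counts. A quick sketch with invented counts (not the Vanderbilt data):

```r
# hypothetical counts per age bin: x diseased patients out of n total
x <- c(5, 12, 30, 55, 80, 95, 60)
n <- c(50, 60, 90, 110, 120, 125, 70)
prop.trend.test(x, n)  # default scores are 1..7, matching the binned "doses"
```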

Source of Data:
I used the data set called "Cardiac Catheterization Diagnostic Data" from the Vanderbilt University Biostatistics Department website. The same data set can be found at