Monday, April 14, 2014

Randomly Generated Friends

Here's a little network plot I made to illustrate a few of the concepts I talked about in the previous post. The entire network is made of three randomly generated smaller networks that I connected with a few more randomly generated links. My code is included as a comment... but if you don't look that far, at least know that the igraph package has a lot of really neat options for network plotting.
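In case you don't dig through the comments, here's a minimal sketch of the idea (my own code, not the version in the comments; the group sizes, seed, and bridging links are arbitrary choices):

library(igraph)
set.seed(42)   # arbitrary seed so the plot is reproducible

# Three fully connected "friend groups"
g = disjoint_union(make_full_graph(5), make_full_graph(6), make_full_graph(7))

# A few random links between the groups (vertices 1-5, 6-11, and 12-18)
g = add_edges(g, c(sample(1:5, 1),  sample(6:11, 1),
                   sample(6:11, 1), sample(12:18, 1),
                   sample(1:5, 1),  sample(12:18, 1)))

# The quantities discussed below
comms = cluster_fast_greedy(g)
modularity(comms)           # modularity of the whole network
assortativity_degree(g)     # assortativity, between -1 and 1
mean_distance(g)            # average path length

plot(g, vertex.color = membership(comms))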

The network as a whole has a modularity of 0.567. The smaller clusters each have a modularity of 0 because everyone knows everyone else: each one is a single, fully connected community. They're also very small-world, with average path lengths exceedingly close to 1.

Finally, the figure illustrates an assortativity of 0.29. Positive assortativity (which can fall between -1 and 1) indicates that nodes tend to be connected to nodes with similar numbers of connections. That seems reasonable here because most nodes are connected to others in their group, but some aren't.

Happy networking!

Saturday, March 8, 2014

The Vikings: Social Networking since 1200 AD

Within reason, dropping academic vocabulary bombs can be a downright satisfying experience. Thanks to a fascinating paper in the December issue of the Royal Statistical Society's magazine, Significance, I am pleased to provide you with a tidy little nugget to detonate: Network Theory.

Padraig Mac Carron, a PhD student at Coventry University, and his professor and colleague Ralph Kenna had the intriguing idea of using statistical analysis to quantitatively study the characteristics of the social networks built into the Icelandic sagas. They surmised that realistic social structure would be a strong indication that the sagas were at least loosely based on real people or real groups of people. While there isn't any reliable way to validate their work, the results of their investigation point to the sagas being realistic, if not partially factual.

A few definitions are helpful to understand what they did (a short R sketch after the list shows how most of these can be computed):

Nodes: connection points, or in the case of social networks, people

Links: connections between points, or relationships

Degree: the number of relationships a node has

Degree distribution: the probability distribution describing how many links a node is likely to have

Mean path length: the mean distance to connect any two nodes in a network

Clustering coefficient: the fraction of friend-of-a-friend relationships that are realized for a node

Small world: used to describe a network that has a short mean path length and a large clustering coefficient. 

Assortative: nodes or people with similar numbers of friends are also connected to each other. Interestingly, assortativity is quantified by the correlation coefficient, r, that falls between -1 and +1. Usually we see r as a way to describe how tightly correlated points are on, say, the XY plane. 

Disassortative: nodes with many links tend to be connected to nodes with few links; a handful of hubs hold links to many sparsely connected nodes

Modularity: if you didn't know anything about the network a priori, how readily would it seem to split into internal communities? If the network behaves as one single community, the modularity is 0. If it can be split into many smaller, sparsely interlinked communities, the modularity approaches 1.
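To make these definitions concrete, here's the short sketch promised above (mine, not the authors'): it builds a toy random network with R's igraph package and computes most of the quantities just defined.

library(igraph)
g = sample_gnp(30, p = 0.15)         # a toy random network of 30 "people"

degree(g)                            # number of links per node
degree_distribution(g)               # degree distribution
mean_distance(g)                     # mean path length
transitivity(g, type = "average")    # clustering coefficient
assortativity_degree(g)              # assortativity, the r between -1 and +1
modularity(cluster_fast_greedy(g))   # modularity of the detected communities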

What, then, did the authors do with all of these ideas... and Vikings? They did what any level-headed statistician would do: compare the sagas to each other and to something known. In this case, they used modern society as the frame of reference. Relative to each other, the sagas had distinct features that reflected the genre of the tales. When clumped together into meta-networks (not a word the authors employed), the 5 largest sagas showed strong overlap; the stories were conceivably written in reference to each other. When all 18 sagas were grouped into one giant network, the authors found that the meta-network was structurally similar to modern society. The mean path length was an achievable 5.5, and the network was small world and assortative. What's more, the combined network had a modularity of 0.7 and 9 detected communities, meaning that the overlap between sagas makes it impossible to split them cleanly back into 18 discrete bunches.

But Nancy, what does it mean?

It means that you and I both now have a general understanding of Network Theory. And it means that Viking sagas are more realistic than you ever knew. I found a cool package in R to make network plots, so that will be my next post!

Sunday, January 26, 2014

Statistics in a box!

[Figure: an example plot from the plotly website. *not my work*]
I came across an interesting site recently that I thought was worth mentioning. The site is called plotly, and can be found here. It's a free online analysis and visualization application that touts easy usability and beautiful graphics. While I wasn't entirely convinced that it was fantastically intuitive, the demo plots are beautiful. I think I'd like it better if I used the R option... but then I might as well just stay in R.

My main point, however, is that I think plotly does a great job of demonstrating how stunning visualization can be. Just browse through the website-- it's mesmerizing. Without getting too sappy, I love the idea that a display's aesthetic can be as cool as the information itself. Actually, it seems natural that that should be the case.

I do, however, have one particular qualm with the site. It seems almost risky to make analysis and visualization so accessible and aesthetic-driven. Particularly if the interface is designed for non-statisticians, what are the risks of processing data according to what "looks right" as opposed to what is definitely (probably) right? Food for thought if anything...

Enjoy!




Sunday, December 15, 2013

Success!

The bittersweetness of intensive problem solving is that the best answers seem the most obvious in retrospect. I'll use my week as a case study, employing a much smaller version of my task as an illustrative example. Essentially, I was given a matrix of categorical variables and a corresponding matrix of "populations" giving the population in each category, row by row. The task was to place each population in the column designated for its category. I was working with a 220 x 51 matrix, so my example is scaled down to two rows of four category-population pairs.

The data started out looking like this:
2 23  0 NA  1 44  3 17  
4 62  3 91  0 NA  0 NA  

I needed it to look like this:
44      23       17    0
0       0        91    62

Take a look at the "1 44" pair in the original matrix and the 44 in the result: the "1" category had a population of 44, so the final result placed the 44 in the first entry of the row. See?

Of course, the first step was to properly sort the categories such that the population stuck with the right number. Thus, I pasted the category and population columns together, sorted, and then re-split the columns to restore the original dimensions. Next, I poured the data back into two matrices: one category matrix, and one population matrix. They looked like this:

Category matrix:
1   2   3   0 
3   4   0   0  

Population matrix:
44   23   17  0 
91   62   0   0 
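For what it's worth, here's a hedged sketch of that sort for a single row, using order() rather than the paste-and-split trick; sort_row is just a name I made up, and the handling of zero categories and NA populations is my assumption:

sort_row = function(cats, pops) {
  pops[is.na(pops)] = 0            # treat missing populations as 0
  ord = order(cats == 0, cats)     # nonzero categories ascending, zeros pushed to the end
  list(categories = cats[ord], populations = pops[ord])
}

sort_row(c(2, 0, 1, 3), c(23, NA, 44, 17))
# $categories: 1 2 3 0    $populations: 44 23 17 0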

Super duper. The problem, however, remained that the populations needed to be in the corresponding category column. Instead, all the values were squished to the left side of the matrix. The solution? Use each category's number as an index for where each population should go. So elegant! I decided to use an intermediate step with 1's and 0's as placeholders so that I could monitor the process more easily without confusing any data. 

Intermediate step:
1  1  1  0
0  0  1  1

achieved by this R code:
poof = matrix(rep(0, prod(dim(population_matrix))), nrow = nrow(population_matrix))
#make an empty matrix to be filled with category-indexed population values

for (i in 1:dim(poof)[1]){
  poof[i, categories_matrix[i,]] = 1
}

#Put a 1 in the column indicated by each category number (a category of 0 is silently ignored as an index, which is exactly what we want here)

I then made both matrices into vectors for simplicity's sake.
popvec = as.vector(t(population_matrix))   # flatten row-wise (transpose first so rows stay together)
indvec = as.vector(t(poof))                # the same flattening for the 0/1 placeholder matrix

Finally! The code below reads, "For each nth 1 in the categories vector, replace it with the nth nonzero population in the population vector."
indvec[which(indvec == 1)] = popvec[popvec != 0]

Rebuild the matrix from the vector, and voilà. Donezo.
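For completeness, here's one way that rebuild could look (final_matrix is just a name I've made up; byrow = TRUE undoes the row-wise flattening from the t() calls above):

final_matrix = matrix(indvec, nrow = nrow(population_matrix), byrow = TRUE)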

Thrilling!

Sunday, December 1, 2013

The Cochran-Armitage trend test

[Figure: histograms of the data both before (left) and after (right) binning]
An esteemed and fearless superior at work recently asked me if I knew anything about the Cochran-Armitage test for trend. I didn't, so here's a blog post to substantiate my response. 

In a nutshell, the Cochran-Armitage test asks whether increasing doses of an independent variable are associated with a trend in a binary outcome. In other words, does adding more and more of something raise (or lower) the probability of a particular yes/no result? As simple and crucial as this question may seem, I was a bit surprised at how elusive the Cochran-Armitage test was online. I ultimately found it most informative to compare the test to a few other better-known techniques. Vanderbilt University's Biostatistics Department has some great practice data sets, so I snagged one that included data on heart disease (yes or no) and patients' ages (more or less continuous). See the end of this post for references and due credit.

Here's the question: Does heart disease have a dose-response relationship with age? The heart disease data was already conveniently coded with patient ages and 1's or 0's for disease or no disease, respectively. The ages, however, ranged from 17 to 82. Thus, I "binned" the age data into 7 groups to give myself a more manageable number of "doses".  
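The post doesn't record the exact breakpoints, so here's one plausible way the binning could have been done (the equal-width breaks are an assumption on my part):

age_bins = cut(acath$age, breaks = 7, labels = FALSE)   # 7 roughly equal-width age groups, coded 1 through 7
table(age_bins)                                         # how many patients land in each "dose"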

Below are a few plots showing different ways to model the data. The R code that describes the models is included beside each image. Rather than making you read through all the code, however, I'll tell you this: the p-values associated with each method are resoundingly the same. Each model resulted in a p-value of less than 2.2e-16. My conclusion? The Cochran-Armitage test is nothing other than a modified regression model. If there's a trend, the p-value will be small.


Linear Regression

[Figure: Significant Coronary Disease by Cardiac Cath]

lm(formula = acath$sigdz ~ acath$age)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.9939 -0.5446  0.2308  0.3490  0.6564 

Coefficients:

             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.0480370  0.0413568   1.162    0.246    
acath$age   0.0118231  0.0007772  15.212   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4569 on 3502 degrees of freedom

Multiple R-squared:  0.06198, Adjusted R-squared:  0.06171 
F-statistic: 231.4 on 1 and 3502 DF,  p-value: < 2.2e-16 



Linear Regression

[Figure: Significant Coronary Disease by Cardiac Cath]

lm(formula = acath$sigdz ~ age_bins)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.9083 -0.5833  0.2000  0.3084  0.6334 

Coefficients:

             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.1499592  0.0367136   4.085 4.51e-05 ***
age_bins    0.0108335  0.0007533  14.382  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4584 on 3502 degrees of freedom

Multiple R-squared:  0.05577, Adjusted R-squared:  0.0555 
F-statistic: 206.8 on 1 and 3502 DF,  p-value: < 2.2e-16


Logistic Regression
glm(formula = acath$sigdz ~ acath$age)
(Note: called this way, glm() uses its default gaussian family -- see the dispersion line below -- so this fit actually reproduces the linear regression; adding family = binomial would give a true logistic regression.)

Deviance Residuals: 

    Min       1Q   Median       3Q      Max  
-0.9939  -0.5446   0.2308   0.3490   0.6564  

Coefficients:

             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.0480370  0.0413568   1.162    0.246    
acath$age   0.0118231  0.0007772  15.212   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.208746)


    Null deviance: 779.33  on 3503  degrees of freedom

Residual deviance: 731.03  on 3502  degrees of freedom
AIC: 4458.4

Number of Fisher Scoring iterations: 2


Analysis of Variance Table

Response: acath$sigdz

                      Df Sum Sq Mean Sq F value    Pr(>F)    
as.factor(age_bins)    6  45.66  7.6092  36.269 < 2.2e-16 ***
Residuals           3497 733.68  0.2098                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Asymptotic General Independence Test
Here it is! This is what R considers Cochran-Armitage testing. Alas, the output is similar to the ANOVA's, but the guts of the model are left to our imagination.

data:  sigdz by age

chi-squared = 217.1192, df = 1, p-value < 2.2e-16
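The output above looks like it came from the coin package's independence_test, but base R also offers a Cochran-Armitage-style trend test through prop.trend.test. A minimal sketch on the binned data (not necessarily how the output above was produced):

events = tapply(acath$sigdz, age_bins, sum)      # diseased patients per age bin
totals = tapply(acath$sigdz, age_bins, length)   # total patients per age bin
prop.trend.test(events, totals)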


Source of Data:
I used the data set called "Cardiac Catheterization Diagnostic Data" from the Vanderbilt University Biostatistics Department website. The same data set can be found at http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets.

Friday, November 29, 2013

Advocating for the Devil

Statistics gets a bad rap for being intentionally misleading. Women feel that they are unfairly underrepresented in technical fields. I'll withhold my own opinions for the moment and offer a digestible article that covers both bases. This article on Science Daily gives a nice summary of an analysis conducted by the American Institute of Physics Statistical Research Center that uses basic stats to show that the bias may not be as clear-cut as we think.

Happy holidays! 


American Institute of Physics (AIP). "All-male physics departments are not proof of bias against hiring women, analysis finds." ScienceDaily, 19 Jul. 2013. Web. 29 Nov. 2013.

Sunday, November 10, 2013

It gets dicey.


Suppose that, hypothetically speaking, you find yourself playing dice against a few faithful buddies. Because this is all hypothetical, let's say that it's also late on a Saturday night and you don't have RStudio handy-- there's no app for that. The game gets heated, and it becomes increasingly difficult to convince yourself that you are not in fact endowed with some superior skill in the game. I hereby present a few calculations to make it easier to reconcile yourself with reality.

[Figure: a 3D plot I made in R. It's sweet.]
First, let's lay down the rules of the game. Start by rolling three dice until you have one of the following combinations (a little scoring function in R follows the list):

  • Any single number and a pair of another number, e.g. (3, 1, 1). Your "score" is the value of the single number. In this example your score is 3.
  • Any triple, like (4,4,4). Your score would be 4.
  • The specific combination (1,2,3). This is the lowest possible score, which we'll arbitrarily assign the value of 0.
  • The specific combination (4,5,6). This is the highest possible score, which we'll call 7.
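To make the rules concrete, here's the little scoring function promised above (my own sketch, not from the original post):

score_roll = function(roll) {
  roll = sort(roll)
  if (all(roll == c(1, 2, 3))) return(0)          # lowest possible score
  if (all(roll == c(4, 5, 6))) return(7)          # highest possible score
  if (length(unique(roll)) == 1) return(roll[1])  # a triple scores its value
  counts = table(roll)
  if (any(counts == 2)) return(as.numeric(names(counts)[counts == 1]))  # pair plus single: score the single
  NA                                              # no scoring combination; roll again
}

score_roll(c(3, 1, 1))   # 3
score_roll(c(4, 4, 4))   # 4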


Now for the sobering part. Let's look at the probabilities of the possible outcomes on a few different rolls.

  • Double: \[ {6 \choose 1} {3 \choose 2} \cdot \frac{1}{6} \cdot \frac{1}{6} \cdot \frac{5}{6} = \frac{90}{216} \approx 41.67\% \]
  • Double 6's: \[ {3 \choose 2} \cdot \frac{1}{6} \cdot \frac{1}{6} \cdot \frac{5}{6} = \frac{15}{216} \approx 6.944\% \]
  • Triple: \[ \frac{6}{6} \cdot \frac{1}{6} \cdot \frac{1}{6} = \frac{6}{216} \approx 2.78\% \]
  • Triple 6's: \[ \frac{1}{6} \cdot \frac{1}{6} \cdot \frac{1}{6} = \frac{1}{216} \approx 0.463\% \]
  • 1,2,3: \[ \frac{3}{6} \cdot \frac{2}{6} \cdot \frac{1}{6} = \frac{6}{216} \approx 2.78\% \]
  • 4,5,6: \[ \frac{3}{6} \cdot \frac{2}{6} \cdot \frac{1}{6} = \frac{6}{216} \approx 2.78\% \]

See something strange? It might seem unexpected that you're more likely to get the sacred 4,5,6 than roll a 6,6,6. Think of it this way: in order to roll all 6's, there's only one choice for each die. Each die therefore has a probability of 1/6 of getting it right. Rolling a 4,5,6, however, means that the first die could be any of 3 options. The second die in the set could be any of the 2 remaining options. Finally, the third die must be the missing link and has a probability of 1/6 of fulfilling the set. The same goes for 1,2,3.
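If you'd rather trust simulation than algebra, here's a quick Monte Carlo check (my own sketch) of a few of the percentages above:

set.seed(1)
rolls = matrix(sample(1:6, 3 * 1e5, replace = TRUE), ncol = 3)   # 100,000 three-dice rolls

mean(apply(rolls, 1, function(r) all(sort(r) == c(4, 5, 6))))    # ~ 6/216, or 2.78%
mean(apply(rolls, 1, function(r) all(r == 6)))                   # ~ 1/216, or 0.463%
mean(apply(rolls, 1, function(r) length(unique(r)) == 2))        # ~ 90/216, or 41.67%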

I haven't calculated the probability of trumping specific rolls because the number of possibilities is so large and depends on the number of players involved. Hopefully the above percentages are sufficient proof that when it comes down to the math, we're the ones being played by the game.