## Saturday, March 8, 2014

### The Vikings: Social Networking since 1200 AD

Within reason, dropping academic vocabulary bombs can be a downright satisfying experience. Thanks to a fascinating paper in the December issue of Significance, the Royal Statistical Society's magazine, I am pleased to provide you with a tidy little nugget to detonate: Network Theory.

Padraig Mac Carron, a PhD student at Coventry University, and his professor and colleague Ralph Kenna had the intriguing idea to use statistical analysis to quantitatively study the characteristics of the social networks built into the Icelandic sagas. They surmised that realistic social structure would be a strong indication that the sagas were at least loosely based on real people or real groups of people. While there isn't any reliable way to validate their work, the results of their investigation point to the sagas being realistic, if not partially factual.

A few definitions are helpful to understand what they did:

**Nodes:** connection points, or in the case of social networks, people.

**Links:** connections between nodes, or relationships.

**Degree:** the number of relationships a node has.

**Degree distribution:** the probability distribution describing what degree a node is likely to have.

**Mean path length:** the mean distance (in links) needed to connect any two nodes in a network.

**Clustering coefficient:** the fraction of possible friend-of-a-friend relationships around a node that are actually realized.

**Small world:** describes a network that has a short mean path length and a large clustering coefficient.

**Assortative:** nodes, or people, with similar numbers of friends tend to be connected to each other. Interestingly, assortativity is quantified by a correlation coefficient, r, that falls between -1 and +1. Usually we see r as a way to describe how tightly correlated points are on, say, the XY plane.

**Disassortative:** the opposite: nodes with many links tend to connect to nodes with few links.

**Modularity:** if you didn't know anything about the network a priori, how many internal communities would it seem to contain? If the network behaves as a single community, the modularity is 0. If it can be split into many smaller, distantly linked communities, the modularity approaches 1.
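
To make a few of these definitions concrete, here's a minimal sketch that computes mean path length, clustering coefficient, and assortativity for a toy network. The authors worked in other tools entirely; this is just plain Python over a made-up adjacency list, so treat every name in it as mine, not theirs.

```python
from collections import deque
from itertools import combinations

def mean_path_length(adj):
    """Average shortest-path distance over all connected node pairs (BFS)."""
    total = pairs = 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

def clustering_coefficient(adj):
    """Mean fraction of realized friend-of-a-friend links around each node."""
    per_node = []
    for u in adj:
        k = len(adj[u])
        if k < 2:
            per_node.append(0.0)
            continue
        closed = sum(1 for a, b in combinations(adj[u], 2) if b in adj[a])
        per_node.append(closed / (k * (k - 1) / 2))
    return sum(per_node) / len(per_node)

def assortativity(adj):
    """Pearson correlation between the degrees at either end of each link."""
    xs, ys = [], []
    for u in adj:
        for v in adj[u]:
            xs.append(len(adj[u]))
            ys.append(len(adj[v]))
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# A made-up 5-person network: one tight friend group joined to a chain
network = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e"}, "e": {"d"},
}
```

Numbers like these, computed over entire sagas, are what let the authors call a network small world, assortative, or neither.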

What, then, did the authors do with all of these ideas... and Vikings? They did what any level-headed statistician would do: compare the sagas to each other and to something known. In this case, they used modern society as the frame of reference. Relative to each other, the sagas had distinct features that reflected the genres of the tales. When clumped together into meta-networks (not a word the authors employed), the 5 largest sagas showed strong overlap; the stories were conceivably written in reference to each other. When all 18 sagas were grouped into one giant network, the authors found that the meta-network was structurally similar to modern society. The mean path length was an achievable 5.5, and the network was small world and assortative. Even more, the huge community had a modularity of 0.7 with 9 apparent communities, meaning that the overlap in the sagas makes it impossible to split them back into 18 discrete bunches.

But Nancy, what does it mean?

It means that you and I both now have a general understanding of Network Theory. And, it means that Viking sagas are more realistic than you ever knew. I found a cool package in R for making network plots, so that will be my next post!

Cheers!

## Sunday, January 26, 2014

### Statistics in a box!

*Image from the plotly website; not my work.*
I came across an interesting site recently that I thought was worth mentioning. The site is called plotly: a free online analysis and visualization application that flaunts easy usability and beautiful graphics. While I wasn't entirely convinced that it's fantastically intuitive, the demo plots are beautiful. I think I'd like it better if I used the R option... but then I might as well just stay in R.

My main point, however, is that I think plotly does a great job of demonstrating how stunning visualization can be. Just browse through the website-- it's mesmerizing. Without getting too sappy, I love the idea that a display's aesthetic can be as cool as the information itself. Actually, it seems natural that that should be the case.

I do, however, have one particular qualm with the site. It seems almost risky to make analysis and visualization so accessible and aesthetic-driven. Particularly if the interface is designed for non-statisticians, what are the risks of processing data according to what "looks right" as opposed to what is definitely (probably) right? Food for thought if anything...

Enjoy!

## Sunday, December 15, 2013

### Success!

The bittersweetness of intensive problem solving is that the best answers seem the most obvious in retrospect. I'll use my week as a case study, employing a much smaller version of my task as an illustrative example. Essentially, I was given a matrix of categorical variables and a corresponding matrix of "populations" that are in each category per row. The task was to place each population in a column designated for its category. I was working with a 220 x 51 matrix, so my example is scaled down to 8 x 2.

The data started out looking like this:

```
2 23   0 NA   1 44   3 17
4 62   3 91   0 NA   0 NA
```

I needed it to look like this:

```
44   23   17    0
 0    0   91   62
```

Look at the pair (1, 44) in the first row of the original data. The "1" category had a population of 44, so the final result places that 44 in the first entry of the row. See?

Of course, the first step was to properly sort the categories such that the population stuck with the right number. Thus, I pasted the category and population columns together, sorted, and then re-split the columns to restore the original dimensions. Next, I poured the data back into two matrices: one category matrix, and one population matrix. They looked like this:

Category matrix:

```
1   2   3   0
3   4   0   0
```

Population matrix:

```
44   23   17   0
91   62    0   0
```

Super duper. The problem, however, remained that the populations needed to be in the corresponding category column. Instead, all the values were squished to the left side of the matrix. The solution? Use each category's number as an index for where each population should go. So elegant! I decided to use an intermediate step with 1's and 0's as placeholders so that I could monitor the process more easily without confusing any data.

Intermediate step:

```
1  1  1  0
0  0  1  1
```

achieved by this R code:

```r
# make an empty matrix to be filled with category-indexed population values
poof = matrix(rep(0, prod(dim(population_matrix))),
              nrow = dim(population_matrix)[1])

# put a 1 in the column indicated by each category number
# (R silently ignores the 0 indices, so empty categories are skipped)
for (i in 1:dim(poof)[1]) {
  poof[i, categories_matrix[i, ]] = 1
}
```

I then made both matrices into vectors for simplicity's sake:

```r
popvec = as.vector(t(population_matrix))
indvec = as.vector(t(poof))
```

Finally! The code below reads, "For each nth 1 in the indicator vector, replace it with the nth nonzero population in the population vector."

```r
indvec[which(indvec == 1)] = popvec[popvec != 0]
```

Rebuild the matrix from the vector, and voilà. Donezo.
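
For anyone following along outside of R, the whole trick (use the category number as the column index) fits in a few lines of Python. The data below is the toy two-row example from above; this is a sketch of the idea, not a replica of the original 220 x 51 code:

```python
# Each row pairs a category number with its population; 0/None marks an
# empty slot. Scatter each population into the column named by its category.
rows = [
    [(2, 23), (0, None), (1, 44), (3, 17)],
    [(4, 62), (3, 91), (0, None), (0, None)],
]
n_cols = 4

result = []
for row in rows:
    out = [0] * n_cols
    for category, population in row:
        if category > 0:              # category 0 means "no entry"
            out[category - 1] = population
    result.append(out)
# result == [[44, 23, 17, 0], [0, 0, 91, 62]]
```

No sorting or placeholder pass needed here, because Python lets us write into an arbitrary column directly; the R version's indicator matrix accomplishes the same scatter in vectorized form.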

Thrilling!

## Sunday, December 1, 2013

### The Cochran-Armitage trend test

*Histograms of the data both before (left) and after (right) binning.*
An esteemed and fearless superior at work recently asked me if I knew anything about the Cochran-Armitage test for trend. I didn't, so here's a blog post to substantiate my response.

In a nutshell, the Cochran-Armitage test decides whether increasing doses of an independent variable are associated with a trend in a binary dependent outcome. In other words, does adding more and more of something result in a higher probability of a certain result (with two possibilities)? As simple and crucial as this question may seem, I was a bit surprised at how elusive the Cochran-Armitage test was online. I ultimately found it most informative to compare the test to a few other better-known techniques. Vanderbilt University's Biostatistics Department has some great practice data sets, so I snagged one that includes data on heart disease (yes or no) and patients' ages (more or less continuous). See the end of this post for references and due credit.

Here's the question: Does heart disease have a dose-response relationship with age? The heart disease data was already conveniently coded with patient ages and 1's or 0's for disease or no disease, respectively. The ages, however, ranged from 17 to 82. Thus, I "binned" the age data into 7 groups to give myself a more manageable number of "doses".

Below are a few plots showing different ways to model the data. The R output that describes each model is included below. Rather than making you read through all the code, however, I'll tell you this: the p-values associated with each method are resoundingly the same. Each model resulted in a p-value below 2.2e-16. My conclusion? The Cochran-Armitage test is nothing other than a modified regression model. If there's a trend, the p-value will be small.

Linear Regression (continuous age)

```
lm(formula = acath$sigdz ~ acath$age)

Residuals:
    Min      1Q  Median      3Q     Max
-0.9939 -0.5446  0.2308  0.3490  0.6564

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0480370  0.0413568   1.162    0.246
acath$age   0.0118231  0.0007772  15.212   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4569 on 3502 degrees of freedom
Multiple R-squared:  0.06198, Adjusted R-squared:  0.06171
F-statistic: 231.4 on 1 and 3502 DF,  p-value: < 2.2e-16
```

Linear Regression (binned ages)

```
lm(formula = acath$sigdz ~ age_bins)

Residuals:
    Min      1Q  Median      3Q     Max
-0.9083 -0.5833  0.2000  0.3084  0.6334

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.1499592  0.0367136   4.085 4.51e-05 ***
age_bins    0.0108335  0.0007533  14.382  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4584 on 3502 degrees of freedom
Multiple R-squared:  0.05577, Adjusted R-squared:  0.0555
F-statistic: 206.8 on 1 and 3502 DF,  p-value: < 2.2e-16
```

Logistic Regression

```
glm(formula = acath$sigdz ~ acath$age)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.9939  -0.5446   0.2308   0.3490   0.6564

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0480370  0.0413568   1.162    0.246
acath$age   0.0118231  0.0007772  15.212   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.208746)

    Null deviance: 779.33  on 3503  degrees of freedom
Residual deviance: 731.03  on 3502  degrees of freedom
AIC: 4458.4

Number of Fisher Scoring iterations: 2
```

Analysis of Variance

```
Analysis of Variance Table

Response: acath$sigdz
                      Df Sum Sq Mean Sq F value    Pr(>F)
as.factor(age_bins)    6  45.66  7.6092  36.269 < 2.2e-16 ***
Residuals           3497 733.68  0.2098
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

Asymptotic General Independence Test
Here it is! This is what R considers Cochran-Armitage testing. Alas, the output is similar to that of the ANOVA, but the guts of the model are left to our imagination.

```
data:  sigdz by age
chi-squared = 217.1192, df = 1, p-value < 2.2e-16
```
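
For the curious, the guts aren't too bad. Below is my reading of the standard asymptotic Cochran-Armitage trend statistic, sketched in Python with made-up dose groups. Treat it as an illustration of the textbook formula, not a replica of what R computes internally:

```python
from math import sqrt

def cochran_armitage_z(cases, totals, scores):
    """Asymptotic Cochran-Armitage trend z-statistic for a 2 x k table.

    cases[i]  - number of positive outcomes at dose level i
    totals[i] - number of subjects at dose level i
    scores[i] - numeric score for dose level i (e.g. 1..k)
    """
    n = sum(totals)
    p = sum(cases) / n                     # overall event rate
    s_bar = sum(s * t for s, t in zip(scores, totals)) / n
    # numerator: score-weighted excess of observed cases over expectation
    num = sum(s * (c - t * p) for s, c, t in zip(scores, cases, totals))
    var = p * (1 - p) * sum(t * (s - s_bar) ** 2
                            for s, t in zip(scores, totals))
    return num / sqrt(var)

# Made-up example: event counts rising with dose give a large positive z
z = cochran_armitage_z(cases=[1, 5, 9], totals=[10, 10, 10], scores=[1, 2, 3])
```

Square z and you have a chi-squared statistic on 1 degree of freedom, which is the shape of the output R reported above.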

Source of Data:
I used the data set called "Cardiac Catheterization Diagnostic Data" from the Vanderbilt University Biostatistics Department website. The same data set can be found at http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets.

## Friday, November 29, 2013

Statistics gets a bad rap for being intentionally misleading. Meanwhile, women feel that they are unfairly underrepresented in technical fields. I'll withhold my own opinions for the moment and offer a digestible article that covers both bases. This article on Science Daily gives a nice summary of an analysis conducted by the American Institute of Physics Statistical Research Center that uses basic stats to show that the bias may not be as clear-cut as we think.

Happy holidays!

American Institute of Physics (AIP). "All-male physics departments are not proof of bias against hiring women, analysis finds." ScienceDaily, 19 Jul. 2013. Web. 29 Nov. 2013.

## Sunday, November 10, 2013

### It gets dicey.

Suppose that, hypothetically speaking, you find yourself playing dice against a few faithful buddies. Because this is all hypothetical, let's say that it's also late on a Saturday night and you don't have RStudio handy-- there's no app for that. The game gets heated, and it becomes increasingly difficult to convince yourself that you are not in fact endowed with some superior skill in the game. I hereby present a few calculations to make it easier to reconcile yourself with reality.

*I made this 3D plot in R. It's sweet.*
First, let's lay down the rules of the game. Start by rolling three dice until you have one of the following combinations:

• Any single number and a pair of another number, e.g. (3, 1, 1). Your "score" is the value of the single number; in this example your score is 3.
• Any triple, like (4, 4, 4). Your score would be 4.
• The specific combination (1, 2, 3). This is the lowest possible score, which we'll arbitrarily assign the value of 0.
• The specific combination (4, 5, 6). This is the highest possible score, which we'll call 7.

Now for the sobering part. Let's look at the probabilities of the possible outcomes on a few different rolls.

| Outcome | Probability | Final Fraction | Percent |
|---|---|---|---|
| Double | ${6 \choose 1} {3 \choose 2} \frac{1}{6} \frac{1}{6} \frac{5}{6}$ | $\frac{90}{216}$ | 41.67% |
| Double 6's | ${3 \choose 2} \frac{1}{6} \frac{1}{6} \frac{5}{6}$ | $\frac{15}{216}$ | 6.944% |
| Triple | $\frac{6}{6} \frac{1}{6} \frac{1}{6}$ | $\frac{6}{216}$ | 2.78% |
| Triple 6's | $\frac{1}{6} \frac{1}{6} \frac{1}{6}$ | $\frac{1}{216}$ | 0.463% |
| 1,2,3 | $\frac{3}{6} \frac{2}{6} \frac{1}{6}$ | $\frac{6}{216}$ | 2.78% |
| 4,5,6 | $\frac{3}{6} \frac{2}{6} \frac{1}{6}$ | $\frac{6}{216}$ | 2.78% |

See something strange? It might seem unexpected that you're more likely to get the sacred 4,5,6 than roll a 6,6,6. Think of it this way: in order to roll all 6's, there's only one choice for each die. Each die therefore has a probability of 1/6 of getting it right. Rolling a 4,5,6, however, means that the first die could be any of 3 options. The second die in the set could be any of the 2 remaining options. Finally, the third die must be the missing link and has a probability of 1/6 of fulfilling the set. The same goes for 1,2,3.
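
If the counting arguments feel slippery, you can brute-force the whole table by enumerating all 216 equally likely rolls. A quick Python sketch (the category names are mine):

```python
from itertools import product
from collections import Counter

def classify(roll):
    """Sort a 3-die roll and name its outcome category."""
    a, b, c = sorted(roll)
    if (a, b, c) == (1, 2, 3):
        return "1,2,3"
    if (a, b, c) == (4, 5, 6):
        return "4,5,6"
    if a == b == c:
        return "triple"
    if a == b or b == c:
        return "double"
    return "no score"

# Tally every possible ordered roll of three fair dice
counts = Counter(classify(r) for r in product(range(1, 7), repeat=3))
# e.g. counts["double"] / 216 reproduces the 90/216 = 41.67% in the table
```

The "triple" and "double" rows here include the 6's cases, just as the table's first and third rows do.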

I haven't calculated the probability of trumping specific rolls because the number of possibilities is huge and depends on the number of players involved. Hopefully the above percentages are sufficient proof that when it comes down to the math, we're the ones being played by the game.

## Saturday, November 9, 2013

### 3+3 Dose Escalation Method... or not.

Yesterday I had yet another great opportunity to learn directly from an expert. Dr Tatsuki Koyama presented a Clinical Research Center talk on the benefits and failings of 3+3 dose escalation for phase I clinical trials.  The talk was fascinating, and introduced me to a new layer of complexity to the statistician-clinician research interface. Oh how the saga continues!

Phase I clinical trials are conducted on a small group of participants to determine the acceptable dose range for a research drug. By some estimates, 98% of clinical trials use the 3+3 method in phase I before moving on to later phases. What, then, is 3+3 dose escalation? According to "Dose Escalation Methods in Phase I Cancer Clinical Trials" in the Journal of the National Cancer Institute (Le Tourneau, Lee, and Siu; May 2009) and Dr Koyama, the method proceeds as follows:

Three patients receive a drug at a starting dose determined by an animal model. Now...
Option 1: If none of the patients experiences unacceptable indications of toxicity (e.g. death or liver failure), increase the dose to the next predetermined level.
Option 2: If one patient out of the three experiences toxicity, test three more patients. If 2 out of 6 reach dose-limiting toxicity, stop here; this determines the dose for the phase II trial.
Option 3: If more than one patient experiences toxicity, drop down a level. This may mean going to "level 0".
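
The three options above are mechanical enough to simulate. Here's a deliberately simplified Python sketch of that escalation logic; the toxicity curve, cohort cap, and tie-breaking details are my own assumptions, not part of the talk or the paper:

```python
import random

def three_plus_three(true_tox, rng=None, max_cohorts=20):
    """Simulate a simplified 3+3 escalation.

    true_tox[i] is the (unknown in real life) probability of dose-limiting
    toxicity at dose level i. Returns the recommended dose level, or None
    if even the lowest level proves too toxic.
    """
    rng = rng or random.Random(0)
    dose = 0
    for _ in range(max_cohorts):
        dlts = sum(rng.random() < true_tox[dose] for _ in range(3))
        if dlts >= 2:                      # Option 3: too toxic, back off
            if dose == 0:
                return None                # nowhere left to drop
            dose -= 1
            continue
        if dlts == 1:                      # Option 2: expand to six patients
            dlts += sum(rng.random() < true_tox[dose] for _ in range(3))
            if dlts >= 2:
                return dose                # stop here
        if dose == len(true_tox) - 1:
            return dose                    # no higher level to escalate to
        dose += 1                          # Option 1: escalate
    return dose                            # safety cap on total cohorts
```

Running a simulator like this many times against an assumed toxicity curve is exactly the kind of exercise that exposes how often 3+3 can settle on a suboptimal dose.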

There are some obvious advantages to the 3+3 method. It's simple. It's straightforward to follow. In an ideal world, it will use very few patients which is logistically, ethically, and economically less complicated.

The problems arise when it's time to analyze the results. Because dose changes depend only on the preceding (very small) cohort's outcome, there is a high risk of unintentionally settling on a less-than-ideal dosage. The model is also unequipped to deal with side effects that take hold after a long time. From a stats standpoint, it's hard to imagine a convincing way to put a confidence interval around the result, meaning that we're left to blindly accept that 3+3 has scissor-rock-papered us to the optimal dose.

So what now? How can we as statisticians and scientists reconcile the fact that clinical research is dominated by a subpar method that just so happens to be FDA approved? The answer may be in a technique called the continual reassessment method, or CRM. Bayesians rejoice, because CRM is simply a Bayesian model that consists of a prior dose-toxicity curve that is updated with data from consecutive patients' outcomes. The prior model is decided by preclinical data. Dosages start at the expected maximum tolerated dose, and each sequential dose depends on the updated model. In practice, Bayesian models behave very much like we think: our actions reflect the sum of our previous experiences, but at the beginning we act off of our prior knowledge. You didn't ask, but I find this a refreshingly obvious way of functioning. Even more, CRM allows us to calculate a credible interval around the resulting dose. Intervals, like p-values, are both satisfying and reassuring to the general public.
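
To make the Bayesian mechanics tangible, here's a toy, grid-based sketch of a one-parameter CRM in Python. The skeleton (prior dose-toxicity guesses), target rate, and normal prior are all invented for illustration; a real trial would use validated software and preclinically informed priors.

```python
import math

skeleton = [0.05, 0.12, 0.25, 0.40, 0.55]   # invented prior toxicity guesses
target = 0.25                               # target dose-limiting toxicity rate

# One-parameter power model: p_tox(dose) = skeleton[dose] ** exp(a),
# with a standard normal prior on a, handled numerically on a grid.
grid = [i / 100 for i in range(-300, 301)]
prior = [math.exp(-a * a / 2) for a in grid]

def posterior_tox(outcomes):
    """Posterior-mean toxicity per dose, given (dose_level, had_dlt) pairs."""
    weights = []
    for a, w in zip(grid, prior):
        like = 1.0
        for dose, tox in outcomes:
            p = skeleton[dose] ** math.exp(a)
            like *= p if tox else 1.0 - p
        weights.append(w * like)
    z = sum(weights)
    return [
        sum(w * skeleton[d] ** math.exp(a) for a, w in zip(grid, weights)) / z
        for d in range(len(skeleton))
    ]

def next_dose(outcomes):
    """Recommend the dose whose estimated toxicity is closest to target."""
    est = posterior_tox(outcomes)
    return min(range(len(est)), key=lambda d: abs(est[d] - target))
```

Each new cohort's outcomes get appended to `outcomes` and the recommendation updates; that's the "continual reassessment" in the name.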

To be fair, there are legitimate problems with CRM. Clinical research has to be accessible to doctors, drug companies, and the FDA. Thus, the concern that Bayesian models are not transparent to clinicians is a valid issue. Additionally, patient safety depends on a good prior model. Otherwise, the patient(s) may receive deleteriously high or low levels of the study drug. Safety measures and cut-off points need to be firmly in place before the trial begins, and it may make sense to make other modifications-- like expanding the cohort size to increase clarity at certain dosages-- along the way.

Alas, the real challenge is rooted in the implementation of CRM over 3+3 in a clinical setting. Theory versus practice, enter stage left.