## Friday, November 29, 2013

Statistics gets a bad rap for being intentionally misleading. Women feel that they are unfairly underrepresented in technical fields. I'll withhold my own opinions for the moment and offer a digestible article that covers both sides. This article on ScienceDaily gives a nice summary of an analysis conducted by the American Institute of Physics Statistical Research Center that uses basic stats to show that the bias may not be as clear-cut as we think.

Happy holidays!

American Institute of Physics (AIP). "All-male physics departments are not proof of bias against hiring women, analysis finds." ScienceDaily, 19 Jul. 2013. Web. 29 Nov. 2013.

## Sunday, November 10, 2013

### It gets dicey.

Suppose that, hypothetically speaking, you find yourself playing dice against a few faithful buddies. Because this is all hypothetical, let's say that it's also late on a Saturday night and you don't have RStudio handy-- there's no app for that. The game gets heated, and it becomes increasingly difficult to convince yourself that you are not in fact endowed with some superior skill in the game. I hereby present a few calculations to make it easier to reconcile yourself with reality.

I made this 3D plot in R. It's sweet.
First, let's lay down the rules of the game. Start by rolling three dice until you have one of the following combinations:

• Any single number plus a pair of another number, e.g., (3, 1, 1). Your "score" is the value of the single number. In this example your score is 3.
• Any triple, like (4, 4, 4). Your score would be 4.
• The specific combination (1, 2, 3). This is the lowest possible score, which we'll arbitrarily assign the value 0.
• The specific combination (4, 5, 6). This is the highest possible score, which we'll call 7.

Now for the sobering part. Let's look at the probabilities of the possible outcomes on a few different rolls.

| Outcome | Probability | Final Fraction | Percent |
|---|---|---|---|
| Double | ${6 \choose 1} {3 \choose 2} \frac{1}{6} \frac{1}{6} \frac{5}{6}$ | $\frac{90}{216}$ | 41.67% |
| Double 6's | ${3 \choose 2} \frac{1}{6} \frac{1}{6} \frac{5}{6}$ | $\frac{15}{216}$ | 6.944% |
| Triple | $\frac{6}{6} \frac{1}{6} \frac{1}{6}$ | $\frac{6}{216}$ | 2.78% |
| Triple 6's | $\frac{1}{6} \frac{1}{6} \frac{1}{6}$ | $\frac{1}{216}$ | 0.463% |
| 1,2,3 | $\frac{3}{6} \frac{2}{6} \frac{1}{6}$ | $\frac{6}{216}$ | 2.78% |
| 4,5,6 | $\frac{3}{6} \frac{2}{6} \frac{1}{6}$ | $\frac{6}{216}$ | 2.78% |

See something strange? It might seem unexpected that you're more likely to get the sacred 4,5,6 than roll a 6,6,6. Think of it this way: in order to roll all 6's, there's only one choice for each die. Each die therefore has a probability of 1/6 of getting it right. Rolling a 4,5,6, however, means that the first die could be any of 3 options. The second die in the set could be any of the 2 remaining options. Finally, the third die must be the missing link and has a probability of 1/6 of fulfilling the set. The same goes for 1,2,3.
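You don't have to trust the binomial bookkeeping: with only 216 equally likely rolls of three dice, the table can be checked by brute force. Here's a quick sketch (in Python rather than the blog's usual R) that enumerates every roll. Note that, as in the table, "Double" and "Triple" include their all-sixes special cases:

```python
# Enumerate all 6^3 = 216 equally likely rolls and tally each outcome.
from itertools import product

counts = {"double": 0, "double 6's": 0, "triple": 0,
          "triple 6's": 0, "1,2,3": 0, "4,5,6": 0}

for dice in product(range(1, 7), repeat=3):
    s = sorted(dice)
    if s[0] == s[1] == s[2]:                 # all three dice match
        counts["triple"] += 1
        if s[0] == 6:
            counts["triple 6's"] += 1
    elif s[0] == s[1] or s[1] == s[2]:       # exactly two match
        counts["double"] += 1
        if s[1] == 6:                        # s[1] is always the pair value
            counts["double 6's"] += 1
    elif s == [1, 2, 3]:
        counts["1,2,3"] += 1
    elif s == [4, 5, 6]:
        counts["4,5,6"] += 1

for outcome, n in counts.items():
    print(f"{outcome}: {n}/216 = {100 * n / 216:.3f}%")
```

The tallies land exactly on the fractions in the table: 90/216 doubles, 6/216 triples, and the 4,5,6 outcome (6/216) really is six times as likely as triple 6's (1/216).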

I haven't calculated the probability of trumping specific rolls because the number of possibilities is so large and depends on the number of players involved. Hopefully the above percentages are sufficient proof that when it comes down to the math, we're the ones being played by the game.

## Saturday, November 9, 2013

### 3+3 Dose Escalation Method... or not.

Yesterday I had yet another great opportunity to learn directly from an expert. Dr. Tatsuki Koyama presented a Clinical Research Center talk on the benefits and failings of 3+3 dose escalation for phase I clinical trials. The talk was fascinating, and it introduced me to a new layer of complexity in the statistician-clinician research interface. Oh, how the saga continues!

Phase I clinical trials are conducted on a small group of participants to determine the acceptable dose range for a research drug. By some estimates, 98% of clinical trials use the 3+3 method in phase I before moving on to later phases. What, then, is 3+3 dose escalation? According to "Dose Escalation Methods in Phase I Cancer Clinical Trials" in the Journal of the National Cancer Institute (May 2009; Le Tourneau, Lee, and Siu) and Dr. Koyama, the method proceeds as follows:

Three patients receive a drug at a starting dose determined by an animal model. Now...
Option 1: If none of the patients experience unacceptable indications of toxicity (e.g., death or liver failure), increase the dose to the next predetermined level.
Option 2: If one patient out of the three experiences toxicity, test three more patients. If 2 out of 6 reach dose-limiting toxicity, stop here; this will be the dose for the phase II trial.
Option 3: If more than one patient experiences toxicity, drop down a level. This may mean going to "level 0".
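The options above can be turned into a simulation. This is a rough Python sketch under stated assumptions: the escalate-on-1-of-6 rule and the stop-when-you-hit-the-top behavior are conventions I've filled in (the talk's description doesn't cover those cases), and all toxicity probabilities are made up:

```python
import random
from collections import Counter

def run_3_plus_3(true_tox, rng, max_cohorts=20):
    """Simulate the 3+3 rules described above. true_tox[i] is the (in real
    life unknown) probability of dose-limiting toxicity at level i. Returns
    the chosen level (0-indexed), or None if even the lowest level is too
    toxic. Assumptions: 1 toxicity out of 6 escalates (standard convention),
    and a clean cohort at the top level just returns that level."""
    level = 0
    for _ in range(max_cohorts):        # guard against endless cycling
        tox = sum(rng.random() < true_tox[level] for _ in range(3))
        if tox == 1:                    # Option 2: enroll three more patients
            tox += sum(rng.random() < true_tox[level] for _ in range(3))
            if tox == 2:
                return level            # this becomes the phase II dose
        if tox <= 1:                    # Option 1: escalate if we can
            if level + 1 < len(true_tox):
                level += 1
            else:
                return level            # already at the highest level
        else:                           # Option 3: drop down a level
            level -= 1
            if level < 0:
                return None             # below "level 0": no safe dose
    return level

# Run many hypothetical trials against one made-up truth and see how
# scattered the chosen dose is -- a hint of the method's instability.
rng = random.Random(2013)
results = Counter(run_3_plus_3([0.05, 0.15, 0.30, 0.55], rng)
                  for _ in range(1000))
print(results)
```

Even with the true dose-toxicity curve held fixed, repeated simulated trials settle on different levels, which previews the criticism below.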

There are some obvious advantages to the 3+3 method. It's simple. It's straightforward to follow. In an ideal world it will use very few patients, which is logistically, ethically, and economically less complicated.

The problems arise when it's time to analyze the results. Because dose changes depend only on the preceding (very small) cohort's outcome, there is a high risk of unintentionally settling on less-than-ideal dosage. The model is also unequipped to deal with side effects that take hold after a long time. From a stats standpoint, it's hard to imagine a convincing way to put a confidence interval around the result, meaning that we're left to blindly accept that 3+3 has scissor-rock-papered us to the optimal dose.

So what now? How can we as statisticians and scientists reconcile the fact that clinical research is dominated by a subpar method that just so happens to be FDA approved? The answer may lie in a technique called the continual reassessment method, or CRM. Bayesians rejoice, because CRM is simply a Bayesian model that consists of a prior dose-toxicity curve that is updated with data from consecutive patients' outcomes. The prior model is decided by preclinical data. Dosages start at the expected maximum tolerated dose, and each sequential dose depends on the updated model. In practice, Bayesian models behave very much like we do: our actions reflect the sum of our previous experiences, but at the beginning we act off of our prior knowledge. You didn't ask, but I find this a refreshingly obvious way of functioning. Even more, CRM allows us to calculate an interval around the resulting dose, and such measures of uncertainty are both satisfying and reassuring to the general public.
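To make the idea concrete, here is a minimal CRM-style sketch in Python (rather than R). Every number below, including the skeleton, the target rate, and the outcomes, is a hypothetical illustration, not a real trial value. It uses the common one-parameter "power" model, p(dose) = skeleton[dose] ** exp(a), with a standard normal prior on a handled by a simple grid approximation:

```python
import math

skeleton = [0.05, 0.10, 0.20, 0.35, 0.50]  # prior toxicity guess per level
target = 0.25                               # acceptable toxicity rate

# Grid over the model parameter a, with an unnormalized N(0, 1) prior.
grid = [i / 50.0 for i in range(-150, 151)]
prior = [math.exp(-a * a / 2) for a in grid]

def posterior(prior, data):
    """Multiply the prior by the likelihood of (dose_index, toxic?) outcomes."""
    post = list(prior)
    for j, a in enumerate(grid):
        for dose, toxic in data:
            p = skeleton[dose] ** math.exp(a)
            post[j] *= p if toxic else (1 - p)
    z = sum(post)
    return [w / z for w in post]

def recommend(post):
    """Dose whose posterior mean toxicity is closest to the target."""
    means = [sum(w * skeleton[d] ** math.exp(a) for w, a in zip(post, grid))
             for d in range(len(skeleton))]
    return min(range(len(skeleton)), key=lambda d: abs(means[d] - target))

# One toxicity among three patients at level 3 (index 2):
post = posterior(prior, [(2, False), (2, False), (2, True)])
print("recommended next level (1-indexed):", recommend(post) + 1)
```

Each new patient's outcome gets folded into `posterior`, and the next dose comes from `recommend` -- exactly the "prior curve updated by consecutive outcomes" loop described above, with the posterior also supplying the interval around the final dose.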

To be fair, there are legitimate problems with CRM. Clinical research has to be accessible to doctors, drug companies, and the FDA. Thus, the concern that Bayesian models are not transparent to clinicians is a valid issue. Additionally, patient safety depends on a good prior model. Otherwise, the patient(s) may receive deleteriously high or low levels of the study drug. Safety measures and cut-off points need to be firmly in place before the trial begins, and it may make sense to make other modifications-- like expanding the cohort size to increase clarity at certain dosages-- along the way.

Alas, the real challenge is rooted in the implementation of CRM over 3+3 in a clinical setting. Theory versus practice, enter stage left.

## Friday, November 8, 2013

### Topological Data Analysis

On Thursday I had the exciting chance to see a lecture by a professor from the University of Chicago, Dr. Shmuel Weinberger. The talk, hosted by the Vanderbilt University Mathematics Department, was about topological data analysis for high dimensional data. It was really interesting to hear a mathematician's perspective on modeling real data with visualizable models. More than anything, I left with a bit more insight on the math-stats landscape. Pun intended.

Just for fun, here's a picture of a Klein bottle. Like a Möbius strip, it is an unorientable surface that has no definable inside or outside.

## Friday, October 25, 2013

### Support Vector Machines

Here are some awesome plots of support vector machines in 3D! The first two show two classes; the last two show three classes. The contour between the categories is the result of support vector machinery... notice that some points are misclassified.
Two classes in three dimensions.

Three dimensions with three classes.

This is the previous plot from the back.

## Monday, October 14, 2013

### Yes, I support vector machines

Today I'm going to try to clear the fog around support vector machines... because I'm excited about them. In the fewest words: SVMs are a way to classify data by putting walls between different categories of training data such that the walls divide the categories in both the training data and the new data as accurately as possible.

Before getting to actual SVMs, it's necessary to take a few steps back to understand the foundations. I'll start with a completely separable case in 2 dimensions. Imagine playing capture the flag in gym class: two teams (purple and green) on either side of a line down the middle of the gym floor. If someone from the purple team crosses into the green team's territory she risks getting captured and embarrassed-- it's safest to stay as far away from the line as possible. Disregarding the point of the game for a moment, we have now illustrated the idea of a maximal margin classifier. The best division between classes is as far as it can be from both. Technically, the maximal margin classifier is a hyperplane, or a dividing wall that is one dimension smaller than the data. The gym floor is two dimensions, so the hyperplane dividing line is one dimension. If everyone were standing on a line, the hyperplane would be a point. This idea extends up into high dimensions, too.

Terrific, but where does this hyperplane come from, and how does it "know" to separate the data? The hyperplane is essentially a "line" generalized to the dimension of the data, which is usually called "p" by convention. If a line is y = b1x + b0, then a hyperplane in p-dimensional space is b0 + b1x1 + b2x2 + ... + bpxp = 0, where the b's maximize the margin. The details aren't as crucial here as the take-home point: a maximum margin hyperplane (or classifier) splits the classes by finding the hyperplane that is furthest from everyone.
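As a toy illustration of "furthest from everyone", here's a brute-force Python sketch. The two "teams" of points are made up, and the grid search over angles and offsets stands in for the quadratic programming a real SVM solver would use; it finds the separating line whose worst-case distance to either team is biggest:

```python
import math

# Hypothetical "gym floor": two separable teams as 2D points.
purple = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.2)]
green  = [(4.0, 3.0), (4.5, 2.2), (5.0, 3.5)]

def margin(w, b):
    """Worst-case signed distance to the line w.x + b = 0, taking purple
    as the positive side. Positive only when the line truly separates."""
    d_p = min(w[0] * x + w[1] * y + b for x, y in purple)
    d_g = min(-(w[0] * x + w[1] * y + b) for x, y in green)
    return min(d_p, d_g)

# Brute-force search over unit normals (angles) and offsets.
best_w, best_b, best_m = (1.0, 0.0), 0.0, -math.inf
for i in range(360):
    th = math.pi * i / 180.0
    w = (math.cos(th), math.sin(th))
    for j in range(-600, 601):
        b = j / 100.0
        m = margin(w, b)
        if m > best_m:
            best_w, best_b, best_m = w, b, m

print(f"best margin found: {best_m:.2f}")
```

The winning line sits squarely between the two teams, and `best_m` is the half-width of the safe corridor -- the margin that the maximal margin classifier maximizes.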

However, the point of the game is to get flags from the other side. Thus the balancing act between crossing the line and staying safe begins. This is like a nonseparable case where it's impossible or undesirable to build a separating hyperplane that perfectly divides categories. It isn't hard to imagine when it'd be impossible to separate classes, but why might it be undesirable? Answer: overfitting. If we draw a line that separates the training data perfectly but has a tiny margin then it's likely that our test data will disobey the line. Sometimes it ultimately makes more sense to build the wall while taking the training data with a grain of salt... because it might not be the spitting image of our future test data.

Back to the PE metaphor. Remember those kids who would always run to the other side just so that they could get tagged and do nothing for the remainder of the hour? Let's let those kids represent the points that don't fall neatly into classes. And let's use the whole concept of kamikaze misclassification as a way to frame support vector classifiers. If a few kids, or support vectors, didn't flee or stand dangerously close to the line, then there'd be no game. Support vector classifiers find a maximum margin hyperplane while allowing some wiggle room for misclassification in the training data. The statistician has control over this wiggle room, either by discretion or by more rigorous means of calibration (that I'll hold off on for now).

Great, we're getting there. Now let's change the game a little and pretend that the line down the gym floor could move and bend to keep the kids on the correct "side" even if they were scattered around. If a quadratic shape works best, so be it. Curved like a cubic line? Sure. This choice of line shape is called a kernel. A support vector classifier that has a non-linear kernel is a support vector machine. Remember the maximum margin classifier b0 + b1x1 + b2x2 + ... + bpxp = 0? An SVM operates very similarly. Once we decide to use a different kernel, the math works out such that the b's are now functions of only the support vectors (which are either on or within the margin). Classification can proceed as always with new test data.
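To see why "the math works out", here's a small Python check of the kernel trick for a degree-2 polynomial kernel: the kernel value equals a plain dot product in a hand-built 6-dimensional feature space, so the machinery never has to visit that bigger space explicitly. The feature map `phi` below is the standard expansion for 2D inputs:

```python
import math
import random

def poly_kernel(x, z):
    """Degree-2 polynomial kernel on 2D points: (x.z + 1)^2."""
    return (x[0] * z[0] + x[1] * z[1] + 1) ** 2

def phi(x):
    """Explicit 6-dimensional feature map matching poly_kernel."""
    r2 = math.sqrt(2)
    return (1.0, r2 * x[0], r2 * x[1], x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1])

# The kernel shortcut and the explicit dot product agree on random points.
rng = random.Random(0)
for _ in range(5):
    x = (rng.uniform(-2, 2), rng.uniform(-2, 2))
    z = (rng.uniform(-2, 2), rng.uniform(-2, 2))
    lhs = poly_kernel(x, z)
    rhs = sum(a * b for a, b in zip(phi(x), phi(z)))
    assert abs(lhs - rhs) < 1e-9
print("kernel identity holds")
```

A curved boundary in the gym (2D) is just a flat hyperplane in the 6D space that `phi` maps into, and the kernel lets the SVM compute there for the price of one multiplication and a square.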

Phew.

ps: R has some interesting packages for building SVM models. I'm working on some cool plots for future posts.

## Sunday, October 13, 2013

### They got it right.

Hurray! I would like to recommend an overall awesome machine learning book.

The book is called An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani (Springer Texts in Statistics, 2013). It's the little sister to The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman (Springer Series in Statistics, 2009) and is beautifully written, beautifully illustrated, and contains excellent R code resources for practice. I bought it as a companion to the latter and have really enjoyed being able to access the topics without having to dig through superdense math. Five stars. Or six, really. For better or worse I'm reading it from back to front, and my next post is inspired by the chapter on support vector machines, or SVMs.