Wednesday, September 19, 2007

Bayesian Statistics

As i've moved from one domain to another during my computer software career, there are certain ideas that seem to follow along. The calculation of eigenvalues and eigenvectors pops up in many different places, as does singular value decomposition. Markov chains have shown up in everything from statistical mechanics to language analysis. I've used K-means clustering in drug discovery and for music personalization.

Another idea that seems to be universally applied these days is Bayesian statistics. When i first encountered the idea a decade or so ago, it was a fairly obscure concept known primarily to statistics geeks, but Paul Graham gave the Bayesian approach a boost when he suggested using it for spam filtering.

Conceptually, the idea of Bayesian statistics is fairly easy to comprehend: evidence (or belief) should factor into probabilities. An oft-cited example is the probability that the sun will rise tomorrow. Classical statistics doesn't say much about that probability, but the Bayesian approach contends that it's highly likely, because the sun has risen reliably for quite some time and so there's reason to believe it will rise again tomorrow. In practice, i don't find this example to be very useful. Another common example that i like better is this: suppose you have to determine the probability that somebody has cancer. With no further evidence of the person's habits, you could guess the probability by simply determining the percentage of people who have cancer among the whole population. But if you know that the person is a smoker, you might expect this probability to be higher.

So Bayesian probabilities are essentially conditional probabilities (the probability of event A given event B, usually written P(A|B)). The so-called Bayes' formula expresses this conditional probability like this:

P(A|B) = P(B|A) * P(A) / P(B)

This says the probability of event A given that event B has happened is equal to the probability of event B given that A has happened, times the probability of event A, divided by the marginal probability of event B. Which is completely non-intuitive to me.

Another way of thinking about this is that the probability of event A given event B differs from the probability of event A without the condition of event B by the factor P(B|A)/P(B). So for example, the probability that somebody has cancer given that they're a smoker, P(cancer|smoker), differs from the probability that they have cancer, P(cancer), without the evidence of smoking by the factor P(smoker|cancer)/P(smoker). If the latter factor is >> 1, then the idea that smokers are more likely to have cancer is supported.

This is still not particularly intuitive to me, but let's look at it a bit. Take the extreme case where P(smoker) = 1, that is, everybody smokes. Then P(smoker|cancer) is also 1, the factor is exactly 1, and knowing that somebody is a smoker tells us nothing. On the other hand, suppose you were to poll a population of cancer patients and find that they were all smokers, so the numerator of the factor is 1; if P(smoker) is less than 1, the factor is greater than 1, and being a smoker raises the probability of cancer. In Bayesian terms the probability P(cancer|smoker), which is called the posterior probability, is greater than P(cancer), which is called the prior probability.
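
To make that factor concrete, here's a minimal sketch in Python of the same update; the rates are made-up numbers purely for illustration, not real medical statistics.

    # Hypothetical rates, invented for illustration only
    p_cancer = 0.01                # prior: P(cancer) in the general population
    p_smoker = 0.20                # marginal: P(smoker) in the general population
    p_smoker_given_cancer = 0.60   # likelihood: P(smoker|cancer)

    # Bayes' formula: P(cancer|smoker) = P(smoker|cancer) * P(cancer) / P(smoker)
    p_cancer_given_smoker = p_smoker_given_cancer * p_cancer / p_smoker

    factor = p_smoker_given_cancer / p_smoker
    print(round(p_cancer_given_smoker, 3))   # 0.03, three times the prior of 0.01
    print(round(factor, 1))                  # 3.0

With these numbers the posterior is three times the prior, which is exactly the factor P(smoker|cancer)/P(smoker).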

At a glance, it doesn't seem like the above equation does much for you, since the thing you're trying to find, P(A|B), has to be calculated in terms of an equally complex term, P(B|A). The trick is that P(B|A) is often easier to determine from known data. Put in terms of the scientific method, it's often easier to determine the probability of the evidence given the truth of the hypothesis than to determine the probability of the hypothesis given the evidence.

An example from recommendation technology would be something like this: suppose that you have a large collection of ratings for both albums and artists, with values of 1 to 4. You want to determine the probability that a given user would give an artist S a 4 rating, P(S=4). Without any other information, you could estimate this probability by looking at the percentage of 4 ratings for that artist among the entire set of ratings.
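
For instance, here's a rough sketch in Python of that estimate, with a handful of invented ratings standing in for the real data:

    from collections import Counter

    # Hypothetical 1-4 ratings for artist S, invented for illustration
    artist_ratings = [4, 3, 4, 2, 4, 1, 3, 4, 2, 4]

    counts = Counter(artist_ratings)
    prior = {r: counts[r] / len(artist_ratings) for r in (1, 2, 3, 4)}

    print(prior[4])   # P(S=4) estimated as the fraction of 4 ratings, here 0.5

Nothing Bayesian has happened yet; this is just the prior.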

However, suppose that the user has not rated the artist but has rated one of the artist's albums. The value of that album rating can be considered evidence of the user's opinion about the artist. If the rating for the album is M, then we can look at P(S=4|M). Bayes' formula gives us

P(S=4|M) = P(M|S=4) * P(S=4) / P(M)

Here, the P(M) term is more complicated than it looks. I glossed over the fact above that the denominator of this formula is a marginal probability. To understand the origin and concept of a marginal probability, it helps to know the relationship between conditional and joint probabilities. A joint probability is the probability of two events occurring together, usually written as P(A,B). Joint probability can be written in terms of conditional probability

P(A,B) = P(B|A) * P(A)

where P(A) is a marginal probability, meaning the probability of event A occurring regardless of event B. It's called marginal because if you consider a table listing the probabilities of all the possible joint events (assuming A and B take on discrete values) and sum up the joint probabilities across one row (the total probability for a given value of A regardless of the value of B), you'd write the sum in the margin of the table. Get it? The marginal probability P(M) can be written like this:

P(M) = P(S=1,M) + P(S=2,M) + P(S=3,M) + P(S=4,M)

or

P(M|S=1)*P(S=1) + P(M|S=2)*P(S=2) + P(M|S=3)*P(S=3) + P(M|S=4)*P(S=4)

The formula now is composed of terms that we can derive from the ratings data. Given an actual value for the rating M for a particular user, we could then predict the most likely artist rating for that user.
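
Here's a small Python sketch of that whole calculation; the prior and likelihood tables below are invented placeholders for quantities you'd actually estimate from the ratings data.

    # Prior P(S=r): fraction of each rating r among the artist's ratings (invented values)
    prior = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}

    # Likelihood P(M=m|S=r): among users who rated the artist r, the fraction who
    # gave the album rating m (also invented; rows are r, columns are m)
    likelihood = {
        1: {1: 0.6, 2: 0.2, 3: 0.1, 4: 0.1},
        2: {1: 0.2, 2: 0.5, 3: 0.2, 4: 0.1},
        3: {1: 0.1, 2: 0.2, 3: 0.5, 4: 0.2},
        4: {1: 0.1, 2: 0.1, 3: 0.2, 4: 0.6},
    }

    def posterior(m):
        """Return P(S=r|M=m) for r = 1..4 via Bayes' formula."""
        # Marginal P(M=m), summed over all artist ratings r
        p_m = sum(likelihood[r][m] * prior[r] for r in prior)
        return {r: likelihood[r][m] * prior[r] / p_m for r in prior}

    post = posterior(4)                    # the user gave the album a 4
    prediction = max(post, key=post.get)   # most likely artist rating
    print(post)
    print(prediction)                      # 4

With the numbers above, observing an album rating of 4 pulls the predicted artist rating to 4 with a posterior probability of about 0.73, compared to a prior of 0.4.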

This is a contrived example, and i wouldn't really recommend it as a method for doing music recommendations, but i hope it does illustrate how this concept can be applied in various domains. There are far more interesting applications of Bayesian statistics in machine learning, the physical sciences, and even search and rescue.
