A function that maps data to labels. Informally, it's a mathematical model (or computer program, if you wish) that attempts to identify objects based on measurements or other data. It may return the most likely identification, or the probability distribution for all possible identifications, or a simple "no idea".

You can think of Peterson's Field Guide to Birds as a kind of classifier, in that it maps characteristics like size, color, body shape, and locale to species. These data aren't precise, and they're usually not part of the formal definition of a species. But even though "seldom seen in deserts" isn't part of the definition of geese, it's certainly useful in identifying what's likely to be a goose and what isn't. The characteristics used by a classifier are known as predictor variables.

A classifier is built using the statistics of these data. Several factors may be considered:

  • How much classifying information is there in each predictor variable? A variable such as "Does it have feathers?" doesn't help much, since every bird answers "yes".
  • Are there overlapping or correlated variables? In other words, if we know the value of one variable do we know the value of other variables? If we know that a bird has large, nasty claws, we probably don't have to ask whether it also has a large, nasty beak.
  • Are certain variables significant only in combination with each other? Height and weight aren't significant by themselves in classifying people as skinny or fat, but they are in combination, as the sketch below shows. (Sorry, I couldn't think of any bird examples.)
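
To see the height-and-weight point in code: neither measurement alone separates the two classes, but the familiar weight-over-height-squared combination does. The cutoff here is invented for illustration.

```python
def classify_build(height_m, weight_kg, cutoff=25.0):
    """Neither variable alone separates the classes, but the familiar
    weight / height**2 combination does. The cutoff is invented."""
    return "fat" if weight_kg / height_m ** 2 > cutoff else "skinny"

# Two people with the same weight land in different classes:
print(classify_build(1.9, 80.0))  # -> skinny (80 / 1.9**2 ~ 22.2)
print(classify_build(1.6, 80.0))  # -> fat    (80 / 1.6**2 ~ 31.2)
```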
Once we know which variables are significant, we have to determine how to combine them. Again, there are many choices; sketches of several follow the list:
  • Convert the variables to numbers, multiply each by a weight, take the sum, then give labels to the ranges of results. This is known as regression, and if we squash the sum through the logistic function so that it lies between 0 and 1 and can be read as a probability, we have logistic regression.
  • Make a decision tree. At the root, we use a variable to split the space of identities based on certain values of that variable, and at each split we apply other variables, and so on, until the leaves of the tree are unique identities. This is what field guides often do. (Formally, this is a classification tree.)
  • Train a neural network to accept the data and return classifications. Of course, we would have no clue as to which variables it was actually using or how it was combining them, but we'd have a decent classifier.
  • Compare new data with objects previously classified by an expert and choose the identity of the object that appears to be most similar. This is known as a nearest neighbor algorithm. ("This one sure looks like it...hey, the tag here says it's a goose".)
  • Calculate the conditional probability of the values of each variable given the identity, calculate the prior probability of each identity, then use Bayes' theorem to calculate the probability of the identity given the data. This is called a Bayesian classifier. If we make the strong assumption that the variables are independent (that is, the value of one variable doesn't tell you anything about the values of another variable), then we have a naive Bayesian classifier.
  • Create some wild-ass function that maps data to identities. This can get you published if you can pass the peer review.
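
To make the weighted-sum bullet concrete, here's a minimal logistic regression sketch in Python. The feature names and weights are invented for illustration; a real logistic regression would learn the weights from pre-identified training data.

```python
import math

# Hypothetical weights and features, invented for illustration; a real
# logistic regression would learn the weights from training data.
WEIGHTS = {"wingspan_cm": 0.04, "neck_length_cm": 0.10}
BIAS = -7.0

def probability_of_goose(measurements):
    """Weighted sum of the predictor variables, squashed into (0, 1)
    by the logistic function so it can be read as a probability."""
    score = BIAS + sum(WEIGHTS[name] * value
                       for name, value in measurements.items())
    return 1.0 / (1.0 + math.exp(-score))

p = probability_of_goose({"wingspan_cm": 160, "neck_length_cm": 40})
print(f"P(goose) = {p:.2f}")  # label it "goose" if, say, p > 0.5
```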
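A classification tree is just nested conditionals, exactly like following a field guide. The splits and thresholds below are made up, not taken from any real guide; tree-building algorithms choose them from training data.

```python
def classify(bird):
    """A toy classification tree; splits and thresholds are invented
    for illustration, not taken from any real field guide."""
    if bird["wingspan_cm"] > 120:          # root split on a size variable
        if bird["neck_length_cm"] > 30:    # then a body-shape variable
            return "goose"
        return "gull"
    if bird["seen_in_desert"]:             # then a locale variable
        return "roadrunner"
    return "sparrow"

print(classify({"wingspan_cm": 150, "neck_length_cm": 40,
                "seen_in_desert": False}))  # -> goose
```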
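Here's the shape of the neural-network option, sketched with NumPy. The weights are random, purely to show the structure of the computation; training would set them, and, as noted above, even then they wouldn't tell you much about which variables matter.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random weights purely to show the structure; training (for example,
# by backpropagation on pre-identified examples) would set these.
W1 = rng.normal(size=(3, 4))   # 3 predictor variables -> 4 hidden units
W2 = rng.normal(size=(4, 2))   # 4 hidden units -> 2 labels

def classify(x):
    hidden = np.tanh(x @ W1)                      # nonlinear mixing
    scores = hidden @ W2
    return np.exp(scores) / np.exp(scores).sum()  # softmax over labels

# Probabilities for ("goose", "not goose") -- meaningless until trained:
print(classify(np.array([160.0, 40.0, 0.0])))
```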
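Nearest neighbor is short enough to write in full. The expert-tagged specimens are invented; in practice they'd be real measurements labeled by someone who knows geese.

```python
import math

# Hypothetical expert-tagged specimens: (wingspan_cm, neck_length_cm).
TAGGED = [
    ((160.0, 40.0), "goose"),
    ((25.0, 3.0), "sparrow"),
    ((100.0, 10.0), "gull"),
]

def classify(measurements):
    """Return the label of the tagged specimen closest in Euclidean
    distance -- "this one sure looks like that goose over there"."""
    def distance(specimen):
        point, _label = specimen
        return math.dist(point, measurements)
    _point, label = min(TAGGED, key=distance)
    return label

print(classify((150.0, 35.0)))  # -> goose
```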
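And a minimal naive Bayesian classifier over yes/no variables. The priors and conditional probabilities here are invented; normally you'd estimate them by counting occurrences in the training set.

```python
# Hypothetical probabilities, invented for illustration; normally
# estimated by counting in pre-identified training data.
PRIOR = {"goose": 0.1, "not_goose": 0.9}
# P(variable is true | identity), one entry per predictor variable.
LIKELIHOOD = {
    "goose":     {"long_neck": 0.9, "seen_in_desert": 0.01},
    "not_goose": {"long_neck": 0.2, "seen_in_desert": 0.30},
}

def classify(observed):
    """Bayes' theorem with the naive independence assumption:
    multiply the per-variable likelihoods, then normalize."""
    scores = {}
    for identity in PRIOR:
        score = PRIOR[identity]
        for variable, value in observed.items():
            p = LIKELIHOOD[identity][variable]
            score *= p if value else (1.0 - p)
        scores[identity] = score
    total = sum(scores.values())
    return {identity: s / total for identity, s in scores.items()}

print(classify({"long_neck": True, "seen_in_desert": False}))
```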
Although the specific process of creating (or, as it's sometimes called, inducing) each type of classifier is too long to describe here, there is a general strategy for creating classifiers:
  1. Create a set of pre-identified data and divide it randomly into a training set and a test set.
  2. Create a classifier using the training set.
  3. Apply the classifier to the test set.
  4. Compare the results of the classifier with the actual identities from the test set. In general, there are two kinds of mistakes (counted in the sketch at the end of this section). Assuming that we're really interested in identifying geese:
    • False positive – the classifier claims that it's found a goose when in fact it's not really a goose
    • False negative – the classifier claims it's not a goose when in fact it is a goose
The importance of false positives and false negatives is apparent when you think of high-stakes classifications, such as disease diagnosis. A false positive classification of appendicitis may mean someone has an appendix removed needlessly. A false negative may mean someone dies.
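
To tie the four steps together, here's a sketch of the whole train-and-test strategy in Python, counting both kinds of mistakes for the label we care about. The data and the stand-in classifier builder are invented; only the shape of the split/train/test/compare procedure matters.

```python
import random

def evaluate(tagged_data, build_classifier, test_fraction=0.3):
    """Steps 1-4: split the pre-identified data, build a classifier on
    the training set, run it on the test set, and count both kinds of
    mistakes for the label we care about ("goose")."""
    data = tagged_data[:]
    random.shuffle(data)                        # step 1: random split
    cut = int(len(data) * test_fraction)
    test_set, training_set = data[:cut], data[cut:]

    classify = build_classifier(training_set)   # step 2

    false_pos = false_neg = 0
    for measurements, actual in test_set:       # steps 3 and 4
        predicted = classify(measurements)
        if predicted == "goose" and actual != "goose":
            false_pos += 1                      # called a non-goose a goose
        if predicted != "goose" and actual == "goose":
            false_neg += 1                      # missed a real goose
    return false_pos, false_neg

# Hypothetical usage with a trivial "predict the most common label" builder:
def build_majority(training_set):
    labels = [label for _m, label in training_set]
    top = max(set(labels), key=labels.count)
    return lambda _measurements: top

data = [((i, i), "goose" if i % 3 == 0 else "sparrow") for i in range(30)]
print(evaluate(data, build_majority))
```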