The branch of artificial intelligence or pattern recognition that deals with getting computers to process and interpret images; in other words, given an image or sequence of images, describe the objects or the setting that produced them. Typical applications are mobile robotics, industrial robots, manufacturing (quality control), and defense.

Some people don't classify computer vision as a field of artificial intelligence because there's really no intelligence involved in vision: you can't introspect how it works, and everyone who can see is an expert at it (although no one can explain why). (This argument is seldom used to disqualify language understanding, however. Perhaps that's because computer vision is usually formulated as a mathematical problem, whereas language understanding is usually tied to knowledge representation and ontology.)

Other people miss the point and equate computer vision with computer graphics. It's actually the inverse problem. Computer graphics deals with generating images from models of scenes, whereas computer vision deals with generating models of scenes from images. It's a much harder problem, because unless you severely constrain the domain there are usually an infinite number of solutions. (Although most of the solutions will be nonsensical, try telling a computer what's nonsensical.) For example, imagine a drawing of a cube. You can say qualitatively that it's a cube, but the cube could lie anywhere along your line of sight if you don't know its size or the focal length of the imaging apparatus.
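The cube ambiguity can be made concrete with a toy pinhole-camera model (a sketch with made-up numbers, not any particular vision system): the projected size of an object depends only on the ratio of its real size to its distance, so infinitely many different cubes produce exactly the same image.

```python
def projected_size(real_size, distance, focal_length):
    """Apparent size of an object's image under a simple pinhole projection."""
    return focal_length * real_size / distance

f = 50.0  # focal length, arbitrary units

# A small cube nearby...
near = projected_size(real_size=1.0, distance=10.0, focal_length=f)
# ...and a cube ten times larger, ten times farther away...
far = projected_size(real_size=10.0, distance=100.0, focal_length=f)

# ...project to the same image size, so the image alone can't tell them apart.
print(near, far)  # 5.0 5.0
```

Graphics runs this function forward; vision has to invert it, and the inversion is not unique.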

This is why people put constraints on the problem. The scenes are usually limited, such as restricting them to a known set of objects. The position and optics of the camera are almost always known. Transparent and deformable objects are avoided. The light source is also known and controlled to avoid shadows. The few situations where such environments occur are typically industrial, such as factories.

The "old school" of computer vision thinks in terms of math and geometry: "given an array of pixels, solve for the set of objects in the scene along with their positions and orientations". More recently, researchers have concentrated on active vision, which asks "What does it take to navigate or perform similar tasks?" Animals can avoid walking into holes without knowing the exact distance between themselves and the hole, and people can pick up objects without (most likely) solving simultaneous equations in their heads. Active vision considers streams of images instead of single images analyzed in isolation. It also considers the movement of the camera itself as an input for interpreting the scene. For example, as you move forward, are certain objects getting bigger faster than other objects? If so, they're likely to be closest to you.
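That last idea — judging proximity from how fast things appear to grow — can be sketched in a few lines. This is an illustrative toy with made-up measurements, not an algorithm from the literature: given the apparent size of each object across a stream of frames, the one whose image expands fastest relative to its starting size is likely the nearest, and no absolute distance is ever computed.

```python
def relative_expansion(sizes):
    """Fractional growth in apparent size between the first and last frame."""
    return (sizes[-1] - sizes[0]) / sizes[0]

# Apparent widths (in pixels) of two objects over five frames as the
# camera moves forward.  Both start at the same apparent size.
distant_tree = [20, 21, 22, 23, 24]   # far away: expands slowly
nearby_post = [20, 25, 33, 50, 100]   # close: expands rapidly ("looms")

closest = max([distant_tree, nearby_post], key=relative_expansion)
print(closest is nearby_post)  # True: the fast-looming object is nearest
```

The point of the sketch is what it omits: there is no camera model, no depth map, and no system of equations — just a comparison of expansion rates, which is all the navigation task requires.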