data mining - Everything2.com

Reposted due to a node title typo

Data mining is the art of extracting (hopefully valuable or useful) information or knowledge from extremely large amounts of accumulated data (usually in the form of data warehouses) without necessarily having any prior knowledge of the kind of information you're looking for.

That last part is what makes data mining difficult, exciting, and different from the kind of data analysis that people are used to. Normally, you already have an idea of what you're looking for, and you test this hypothesis by conducting experiments designed to make the effect clearly visible and by gathering only the data needed to prove (or disprove) the hypothesis.

Data mining, however, is done by collecting all available data - large amounts of unordered records, often with a high dimensionality - and then looking for interesting patterns: correlations between certain parameters, periodic cycles, etc. Of course, the main problem is that it is hard to find something without knowing what it is. Some things (such as correlations) are always interesting, so there are some standard tests for them. But other, more complex results often require luck and intuition to find.
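As a rough illustration of such a standard test, here is a minimal sketch that scans every pair of columns for a strong Pearson correlation. The column names, the toy values, and the 0.8 threshold are all made up for this example; a real data set would need far more care (missing values, non-linear relationships, spurious matches among many pairs).

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / math.sqrt(var_x * var_y)

def strong_correlations(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every column pair with |r| >= threshold."""
    found = []
    for (a, xs), (b, ys) in combinations(columns.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= threshold:
            found.append((a, b, round(r, 3)))
    return found

# Hypothetical toy data, invented purely for illustration.
columns = {
    "temperature": [1, 2, 3, 4, 5],
    "pressure": [2, 4, 6, 8, 10],
    "humidity": [5, 1, 4, 2, 3],
}
pairs = strong_correlations(columns)
```

With these toy columns only the temperature/pressure pair is flagged, since one is an exact multiple of the other. Note that the number of pairs grows quadratically with the number of columns, which is one reason high dimensionality hurts.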

The biggest practical obstacle is the high dimensionality of the data, which makes it impossible to visualize the entire data set and let the most capable known pattern analyzer - the human brain - do the work. Therefore, many methods in data mining aim at somehow reducing the dimensionality of the data while losing as little information as possible.
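One common way to reduce dimensionality is principal component analysis (PCA). It is only one of many methods, and the writeup above does not single it out. The sketch below, in plain Python, finds the single direction of greatest variance by power iteration and replaces each three-dimensional record with one number. The toy records are invented and deliberately lie on a straight line, so nothing is lost here; real data would lose some information.

```python
import math

def center(rows):
    """Subtract each column's mean; return (centered_rows, means)."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return [[r[j] - means[j] for j in range(dims)] for r in rows], means

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, dims = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1)
             for j in range(dims)] for i in range(dims)]

def dominant_direction(cov, iterations=100):
    """Power iteration: approximates the eigenvector with the largest eigenvalue.

    Assumes the start vector is not orthogonal to that eigenvector.
    """
    dims = len(cov)
    v = [1.0 / math.sqrt(dims)] * dims
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # all-zero covariance: the data has no variance at all
        v = [x / norm for x in w]
    return v

# Hypothetical records: three measurements that all move together.
rows = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0], [4.0, 8.0, 12.0]]
centered, means = center(rows)
direction = dominant_direction(covariance(centered))

# One score per record instead of three values.
scores = [sum(a * b for a, b in zip(row, direction)) for row in centered]

# Map the scores back to the original space to check what was lost.
rebuilt = [[m + s * d for m, d in zip(means, direction)] for s in scores]
```

Here `rebuilt` matches `rows` up to rounding, because the toy data really is one-dimensional. The gap between the two is the information lost by the reduction, so comparing them is a simple check on how much a projection costs.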

To give a practical example, a company that produces sheet metal may notice that the hardness of their product fluctuates quite a lot. They want it to be hard, of course, but they don't know which combination of the many parameters (temperature, pressure, the exact composition of the metal alloy, the presence of certain catalysts, the rates at which all of these change, etc.) produces the optimal result. Careful data mining of sensor readings during the production process may reveal that the sheet metal is hardest when it contains 15% copper that is added slowly after all the other ingredients have been heated to no more than 1600 degrees Celsius at a pressure of at least 200 bar (I made these values up, of course).
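To make the sheet metal example slightly more concrete, here is a very naive sketch: group the production records by a combination of parameter settings and see which combination gives the highest average hardness. The field names, the bands, and every number below are just as made up as the values above, and real mining would have to cope with continuous readings, noise, and far more parameters.

```python
from collections import defaultdict

def best_setting(records, parameters, target):
    """Return (settings, mean_target) for the parameter combination
    whose records have the highest average value of `target`."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[p] for p in parameters)
        groups[key].append(rec[target])
    averages = {key: sum(vals) / len(vals) for key, vals in groups.items()}
    best = max(averages, key=averages.get)
    return dict(zip(parameters, best)), averages[best]

# Hypothetical, already-discretized sensor records (all values invented).
records = [
    {"copper_pct": 15, "temp_band": "<=1600C", "hardness": 9.0},
    {"copper_pct": 15, "temp_band": "<=1600C", "hardness": 8.0},
    {"copper_pct": 15, "temp_band": ">1600C", "hardness": 6.0},
    {"copper_pct": 15, "temp_band": ">1600C", "hardness": 5.0},
    {"copper_pct": 10, "temp_band": "<=1600C", "hardness": 6.0},
    {"copper_pct": 10, "temp_band": "<=1600C", "hardness": 7.0},
    {"copper_pct": 10, "temp_band": ">1600C", "hardness": 4.0},
    {"copper_pct": 10, "temp_band": ">1600C", "hardness": 5.0},
]
settings, mean_hardness = best_setting(
    records, ["copper_pct", "temp_band"], "hardness"
)
```

The catch is the one described earlier: the number of possible combinations explodes as parameters are added, so exhaustive grouping like this only works once the interesting parameters have already been narrowed down.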
