is the act, sometimes art, of converting, correcting, or filtering a dataset in preparation for final processing. Data needs to be massaged when its original form doesn't lend itself to easy or correct analysis.
Every dataset has its own peculiarities. Programs used to scrutinize the data have their own idiosyncrasies. Conflicting date formats, illegal characters, duplicate data -- the list of possible problems is virtually infinite, especially if you have no control over the input to the dataset.
Sometimes solutions are simple. You may have a set of temperatures from major cities around the world, some in Fahrenheit, some in Celsius (Centigrade). A simple arithmetic conversion puts all your numbers on the same scale, ready for comparison.
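As a rough illustration of the temperature case, the conversion might look like the Python sketch below. The function name `to_celsius`, the `"C"`/`"F"` unit codes, and the sample readings are all assumptions made up for this example; real data would need whatever unit labels its sources actually use.

```python
def to_celsius(value, unit):
    """Return a temperature in degrees Celsius.

    `unit` is assumed to be "C" or "F" (hypothetical codes for this sketch).
    """
    unit = unit.strip().upper()
    if unit == "C":
        return float(value)
    if unit == "F":
        # Standard Fahrenheit-to-Celsius formula: C = (F - 32) * 5/9
        return (float(value) - 32.0) * 5.0 / 9.0
    raise ValueError(f"unknown temperature unit: {unit!r}")


# Made-up sample readings: (city, value, unit)
readings = [("Cairo", 95.0, "F"), ("Oslo", 4.0, "C"), ("Lima", 68.0, "F")]

# Put every reading on the Celsius scale so the values can be compared.
normalized = {city: round(to_celsius(value, unit), 1)
              for city, value, unit in readings}
```

Rejecting an unrecognized unit, rather than guessing, keeps a mislabeled reading from quietly skewing the comparison.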
Or you may be given millions of names with addresses and some personal information, collected from various sources, and asked to provide the names and addresses of a specific subset; let's say all married couples where one or both spouses are over age 64. You'll quickly find that data from Source A has a name format of <firstname middle_initial lastname>, Source B uses <lastname firstname middle_initial>, and Source C lists just <firstname lastname>. Before you can even begin to extract the data you want, all the fields will have to be examined and their data massaged so that you can make valid comparisons.
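A minimal Python sketch of the name case follows, assuming each name is a single whitespace-separated string and each part is one token. The `Name` record, the `normalize_name` function, and the sample name are hypothetical; real records (multi-word surnames, suffixes, missing fields) would need considerably more care.

```python
from typing import NamedTuple


class Name(NamedTuple):
    first: str
    middle_initial: str  # empty string when the source has none
    last: str


def normalize_name(raw, source):
    """Map a raw name string from Source A, B, or C onto one common layout."""
    parts = raw.split()
    if source == "A":    # <firstname middle_initial lastname>
        if len(parts) != 3:
            raise ValueError(f"Source A expects 3 parts: {raw!r}")
        first, middle, last = parts
    elif source == "B":  # <lastname firstname middle_initial>
        if len(parts) != 3:
            raise ValueError(f"Source B expects 3 parts: {raw!r}")
        last, first, middle = parts
    elif source == "C":  # <firstname lastname>
        if len(parts) != 2:
            raise ValueError(f"Source C expects 2 parts: {raw!r}")
        first, last = parts
        middle = ""
    else:
        raise ValueError(f"unknown source: {source!r}")
    # Drop a trailing period so "Q." and "Q" compare equal.
    return Name(first, middle.rstrip(".").upper(), last)


# The same (made-up) person as each source might record them.
from_a = normalize_name("Jane Q. Public", "A")
from_b = normalize_name("Public Jane Q", "B")
from_c = normalize_name("Jane Public", "C")
```

Once every record has the same layout, `from_a` and `from_b` compare equal directly, while a Source C record can only be matched on first and last name because it carries no middle initial.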
When the information being dealt with is fairly simple in itself, massaging the data may be time-consuming, but it shouldn't be very difficult. But when the data becomes more complex (say, the results of a space-based spectral image processing system), identifying good and bad data is a job only for specialists in the particular field.