slowly changing dimension (idea)
Return to slowly changing dimension (idea)
| One of the few poetically named concepts I have come across while working with huge blobs of data.
The concept of slowly changing dimensions, or SCD, was introduced by Ralph Kimball in his work on data warehousing.
Understanding what a slowly changing dimension is and how it works may require a bit of background, so here's an example, rather lengthy I am afraid, of
the kind of reasoning that leads to the concept's definition. +---------------------+-------------------+--------------------+--------------+------------------+--------------------+---------+ | timestamp |credit card number |owner name |owner zip code|transaction amount| item name |item type| +---------------------+-------------------+--------------------+--------------+------------------+--------------------+---------+ | 1971-01-09 20:32:00 | 45123123123123 | Mickey Mouse | 101 64 | 42.50 | egg nog blender | bar tool| | 1971-01-09 20:32:00 | 33444222333233 | August Mittagessen | 621 01 | 99.00 | tio pepe dry sherry| booze | ...model 1 Such a table allows for interesting purchase patterns reports to be created. For example we could use the zip code field together with the item type to, say, target a geographic area where a class of products is selling poorly with tailored advertisement. Another possible way to use the table could be to track the purchase amounts by geographic area over time, maybe in order to follow the purchase patten development in a shop's neighborhood. There are a few problems with this schema though. A pressing one is that transaction tables tend to contain a huge number of records. Actually their size tends to dwarf anything else in a typical database. A way to minimize the issue could be to replace fields whose values tend to repeat themselves with shorter placeholders. In our example we can see that a probably limited number of customers, using a likewise limited number of credit cards, chooses to purchase from a limited collection of items a possibly _unlimited_ number of times. A solution in this case would be to create a new table for the card-related fields, and another for the item-related fields. The results would be something like this: card table +---------+-------------------+--------------------+---------------+ |card_fk |credit card number |owner name |owner zip code | +---------+-------------------+--------------------+---------------+ | 1 |45123123123123 |Mickey Mouse | 101 64 | | 2 |33444222333233 |August Mittagessen | 621 01 | ... item table +--------+--------------------+----------+ |item_fk |item name |item type | +--------+--------------------+----------+ | 1 |egg nog blender |bar tool | | 2 |tio pepe dry sherry |booze | ... transaction table +---------------------+--------+--------+------------------+ | timestamp |card_fk |item_fk |transaction amount| +---------------------+-----------------+------------------+ | 1971-01-09 20:32:00 | 1 | 1 | 42.50 | | 1971-01-09 20:32:00 | 2 | 2 | 99.00 | ...model 2 Another representation of the same model may shed some light on the reason why such a schema is called a 'star':
+------------------+ +-------------+
|card | |item |
+------------------+ +-------------+
|card_fk |-|----+ +----|-|item_fk |
+------------------+ | | +-------------+
|credit card number| /|\ /|\ |item name |
|owner name | +--------------+ |item type |
|owner zip code | |transaction | +-------------+
+------------------+ +--------------+
|customer_fk |
|item_fk |
+--------------+
|timestamp |
|amount |
+--------------+
As the number of dimensions connected to our transaction table increases, the model start looking like a star of sorts, or at least so I am told. But I digress.
In 'model 2' above I replaced the card and item related fields in the transaction table with placeholders,
numbers that uniquely identify records in the card and item tables respectively.
Such placeholders are called 'foreign keys', hence the field names.
Such groupings of related category fields are called data 'dimensions'.
Updating dimension tables does have its quirks, though. Let's say that we were notified that a card owner moved to a new address and we wanted to update the zip code of her record.
We could simply update the record, which means that all of the customer's transactions, past and future, would be connected to the new zip number. This is called a slowly changing dimension of type 1 (finally! :^) ).
An alternative way to handle the situation is to create a new card owner record, a copy of the original excluding the zip code, which is given the new value.
The new record gets a new, unique foreign key, and new transaction records are connected to the card table via the new foreign key. Lo! Our query is not skewed any longer. A third way to handle our customer moving around is to create an alternative view of her record. For example we could design the card table something like this way: card table +---------+-------------------+--------------------+-----------------+------------------+ |card_fk |credit card number |owner name |current zip code |previous zip code | +---------+-------------------+--------------------+-----------------+------------------+ | 1 |45123123123123 |Mickey Mouse | 101 64 | 101 64 | | 2 |33444222333233 |August Mittagessen | 621 01 | 696 69 | ...This is a slowly changing dimension of type 3. By doing this we may ask our database questions like 'where do we find people that buy ungodly quantities of booze, where do they live _right now_?'. Knowing the time for the latest zip code update would also allow us to see how the answer to this question has changed over time.
There is of course more to SCDs, but this explanation is meant to be an introduction to the subject for people that are not actually involved in database design, and I will leave the more excruciatingly boring stuff out. | Existing: Non-Existing: |