Chapter 5 : Order Management
Order management consists of several critical business processes, including
order, shipment, and invoice processing. These processes spawn important
business metrics, such as sales volume and invoice revenue, that are key performance
indicators for any organization that sells products or services to
others.
It is natural and common, especially for customer-oriented dimensions, for a dimension
to simultaneously support multiple independent hierarchies. The hierarchies
may have different numbers of levels. Drilling up and drilling down within each of
these hierarchies must be supported in a data warehouse.
The one-to-one or many-to-one relationship may turn out to be a many-tomany
relationship. As we discussed earlier, if the many-to-many relationship
is an exceptional condition, then we may still be tempted to combine
the sales rep attributes into the ship-to dimension, knowing that we’d need
to treat these rare many-to-many occurrences by issuing another surrogate
ship-to key.
If the relationship between sales rep and customer ship-to varies over time
or under the influence of a fourth dimension such as product, then the
combined dimension is in reality some kind of fact table itself! In this
case, we’d likely create separate dimensions for the sales rep and the customer
ship-to.
If the sales rep and customer ship-to dimensions participate independently
in other business process fact tables, we’d likely keep the dimensions
separate. Creating a single customer ship-to dimension with sales rep
attributes exclusively around orders data may make some of the other
Role Playing Dimension
Role-playing in a data warehouse occurs when a single dimension simultaneously
appears several times in the same fact table. The underlying dimension may exist as
a single physical table, but each of the roles should be presented to the data access
tools in a separately labeled view.
Deal Dimension
The deal dimension is similar to the promotion dimension from Chapter 2. The
deal dimension describes the incentives that have been offered to the customer
that theoretically affect the customers’ desire to purchase products. This
dimension is also sometimes referred to as the contract. As shown in Figure 5.4,
the deal dimension describes the full combination of terms, allowances, and
incentives that pertain to the particular order line item.
Degenerate Dimension for Order Number
The deal dimension is similar to the promotion dimension from Chapter 2. The
deal dimension describes the incentives that have been offered to the customer
that theoretically affect the customers’ desire to purchase products. This
dimension is also sometimes referred to as the contract. As shown in Figure 5.4,
the deal dimension describes the full combination of terms, allowances, and
incentives that pertain to the particular order line item.
Degenerate Dimension for Order Number
Each line item row in the orders fact table includes the order number as a
degenerate dimension, as we introduced in Chapter 2. Unlike a transactional
parent-child database, the order number in our dimensional models is not tied
to an order header table. We have stripped all the interesting details from the
order header into separate dimensions such as the order date, customer ship-to,
and other interesting fields. The order number is still useful because it allows us
to group the separate line items on the order. It enables us to answer such questions
as the average number of line items on an order. In addition, the order
number is used occasionally to link the data warehouse back to the operational
world. Since the order number is left sitting by itself in the fact table without
joining to a dimension table, it is referred to as a degenerate dimension.
Degenerate dimensions typically are reserved for operational transaction identifiers.
They should not be used as an excuse to stick a cryptic code in the fact table without
joining to a descriptive decode in a dimension table.
Junk Dimensions
When we’re confronted with a complex operational data source, we typically
perform triage to quickly identify fields that are obviously related to dimensions,
such as date stamps or attributes. We then identify the numeric measurements
in the source data. At this point, we are often left with a number of
miscellaneous indicators and flags, each of which takes on a small range of discrete
values. The designer is faced with several rather unappealing options,
including:
Leave the flags and indicators unchanged in the fact table row. This could
cause the fact table row to swell alarmingly. It would be a shame to create a
nice tight dimensional design with five dimensions and five facts and then
leave a handful of uncompressed textual indicator columns in the row
Make each flag and indicator into its own separate dimension. Doing so
could cause our 5-dimension design to balloon into a 25-dimension design.
Strip out all the flags and indicators from the design. Of course, we ask the
obligatory question about removing these miscellaneous flags because they
seem rather insignificant, but this notion is often vetoed quickly because
someone might need them. It is worthwhile to examine this question carefully.
If the indicators are incomprehensible, noisy, inconsistently populated,
or only of operational significance, they should be left out
An appropriate approach for tackling these flags and indicators is to study
them carefully and then pack them into one or more junk dimensions.
A junk dimension is a convenient grouping of typically low-cardinality flags and indicators.
By creating an abstract dimension, we remove the flags from the fact table
while placing them into a useful dimensional framework.
We’ve illustrated sample rows from an order indicator dimension in Figure 5.5. A
subtle issue regarding junk dimensions is whether you create rows for all the
combinations beforehand or create junk dimension rows for the combinations as
you actually encounter them in the data. The answer depends on how many possible
combinations you expect and what the maximum number could be. Generally,
when the number of theoretical combinations is very high and you don’t
think you will encounter them all, you should build a junk dimension row at
extract time whenever you encounter a new combination of flags or indicators
It is quite common in parent-child transaction databases to encounter facts of differing
granularity. On an order, for example, there may be a shipping charge that
applies to the entire order that isn’t available at the individual product-level line
item in the operational system. The designer’s first response should be to try to
force all the facts down to the lowest level. We strive to flatten the parent-child
relationship so that all the rows are at the child level, including facts that are captured
operationally at the higher parent level, as illustrated in Figure 5.7. This procedure
is broadly referred to as allocating. Allocating the parent order facts to the
child line-item level is critical if we want the ability to slice and dice and roll up all
order facts by all dimensions, including product, which is a common requirement
We shouldn’t mix fact granularities (for example, order and order line facts) within a
single fact table. Instead, we need to either allocate the higher-level facts to a more
detailed level or create two separate fact tables to handle the differently grained
facts. Allocation is the preferred approach. Optimally, a finance or business team
(not the data warehouse team) spearheads the allocation effort.
Invoice Transaction
In the shipment invoice fact table we can see all the company’s products, all
the customers, all the contracts and deals, all the off-invoice discounts and
allowances, all the revenue generated by customers purchasing products, all
the variable and fixed costs associated with manufacturing and delivering
products (if available), all the money left over after delivery of product (contribution),
and customer satisfaction metrics such as on-time shipment.
For any company that ships products to customers or bills customers for services rendered,
the optimal place to start a data warehouse typically is with invoices. We often
refer to the data resulting from invoicing as the most powerful database because it
combines the company’s customers, products, and components of profitability.
Extended gross invoice amount. This is also know as extended list price
because it is the quantity shipped multiplied by the list unit price.
Extended allowance amount. This is the amount subtracted from the
invoice-line gross amount for deal-related allowances. The allowances are
described in the adjoined deal dimension. The allowance amount is often
called an off-invoice allowance
Extended discount amount. This is the amount subtracted on the invoice for
volume or payment-term discounts. The explanation of which discounts
are taken is also found in the deal dimension row that points to this fact
table row
All allowances and discounts in this fact table are represented at the line
item level. As we discussed earlier, some allowances and discounts may be
calculated operationally at the invoice level, not the line-item level. An
effort should be made to allocate them down to the line item. An invoice
P&L statement that does not include the product dimension poses a serious
limitation on our ability to present meaningful P&L slices of the business.
Extended net invoice amount. This is the amount the customer is expected to
pay for this line item before tax. It is equal to the gross invoice amount less
the allowances and discounts.
However, there are times when users are more interested in analyzing the
entire order fulfillment pipeline. They want to better understand product
velocity, or how quickly products move through the pipeline. The accumulating
snapshot fact table provides us with this perspective of the business, as
illustrated in Figure 5.10. It allows us to see an updated status and ultimately
the final disposition of each order
The accumulating snapshot complements our alternative perspectives of the
pipeline. If we’re interested in understanding the amount of product flowing
through the pipeline, such as the quantity ordered, produced, or shipped, we rely
on transaction schemas that monitor each of the pipeline’s major spigots. Periodic
snapshots give us insight into the amount of product sitting in the pipeline, such
as the backorder or finished goods inventories, or the amount of product flowing
through a spigot during a predefined time period. The accumulating snapshot
helps us better understand the current state of an order, as well as product movement
velocities to identify pipeline bottlenecks and inefficiencies.
We notice immediately that the accumulating snapshot looks different from
the other fact tables we’ve designed thus far. The reuse of conformed dimensions
is to be expected, but the number of date and fact columns is larger than
we’ve seen in the past. We capture a large number of dates and facts as the
order progresses through the pipeline. Each date represents a major milestone
of the fulfillment pipeline. We handle each of these dates as dimension roles by
creating either physically distinct tables or logically distinct views. It is critical
that a surrogate key is used for these date dimensions rather than a literal SQL
date stamp because many of the fact table date fields will be “Unknown” or
“To be determined” when we first load the row. Obviously, we don’t need to
declare all the date fields in the fact table’s primary key.
The fundamental difference between accumulating snapshots and other fact
tables is the notion that we revisit and update existing fact table rows as more
information becomes available. The grain of an accumulating snapshot fact
table is one row per the lowest level of detail captured as the pipeline is
entered. In our example, the grain would equal one row per order line item.
However, unlike the order transaction fact table we designed earlier with the
same granularity, the fact table row in the accumulating snapshot is modified
while the order moves through the pipeline as more information is collected
from every stage of the lifecycle.
Accumulating snapshots typically have multiple dates in the fact table representing the
major milestones of the process. However, just because a fact table has several dates
doesn’t dictate that it is an accumulating snapshot. The primary differentiator of an accumulating
snapshot is that we typically revisit the fact rows as activity takes place
Packaging all the facts and conversion factors together in the same fact table row
provides the safest guarantee that these factors will be used correctly. The converted
facts are presented in a view(s) to the users.
Beyond the Rear-View Mirror
Much of what we’ve discussed in this chapter focuses on effective ways to
analyze historical product movement performance. People sometimes refer to
these as rear-view mirror metrics because they allow us to look backward and
see where we’ve been.
Transaction Grain Real-Time Partition
If the static data warehouse fact table has a transaction grain, then it contains
exactly one record for each individual transaction in the source system from
the beginning of recorded history. If no activity occurs in a time period, there
are no transaction records. Conversely, there can be a blizzard of closely
related transaction records if the activity level is high. The real-time partition
has exactly the same dimensional structure as its underlying static fact table. It
only contains the transactions that have occurred since midnight, when we
loaded the regular data warehouse tables. The real-time partition may be completely
unindexed both because we need to maintain a continuously open
window for loading and because there is no time series (since we only keep
today’s data in this table). Finally, we avoid building aggregates on this table
because we want a minimalist administrative scenario during the day.
We attach the real-time partition to our existing applications by drilling across
from the static fact table to the real-time partition. Time-series aggregations
(for example, all sales for the current month) will need to send identical
queries to the two fact tables and add them together.
Forget indexes, except
for a B-Tree index on the fact table primary key to facilitate the most efficient
loading. Forget aggregations too. Our real-time partition can remain biased
toward very fast loading performance but at the same time provide speedy
query performance.
Periodic Snapshot Real-Time Partition
If the static data warehouse fact table has a periodic grain (say, monthly), then
the real-time partition can be viewed as the current hot-rolling month. Suppose
that we are working for a big retail bank with 15 million accounts. The static fact
table has the grain of account by month. A36-month time series would result in
540 million fact table records. Again, this table would be indexed extensively
and supported by aggregates to provide good query performance. The real-time
partition, on the other hand, is just an image of the current developing month,
updated continuously as the month progresses. Semiadditive balances and fully
additive facts are adjusted as frequently as they are reported. In a retail bank, the
core fact table spanning all account types is likely to be quite narrow, with perhaps
4 dimensions and 4 facts, resulting in a real-time partition of 480 MB. The
real-time partition again can be pinned in memory.
Accumulating Snapshot Real-Time Partition
Accumulating snapshots are used for short-lived processes such as orders and
shipments. A record is created for each line item on the order or shipment. In
the main fact table this record is updated repeatedly as activity occurs. We create
the record for a line item when the order is first placed, and then we update
it whenever the item is shipped, delivered to the final destination, paid for, or
maybe returned. Accumulating snapshot fact tables have a characteristic set of
date foreign keys corresponding to each of these steps
Queries against an accumulating snapshot with a real-time partition need to
fetch the appropriate line items from both the main fact table and the partition
and can either drill across the two tables by performing a sort merge (outer
join) on the identical row headers or perform a union of the rows from the two
tables, presenting the static view augmented with occasional supplemental
rows in the report representing today’s hot activity.
In this section we have made a case for satisfying the new real-time requirement
with specially constructed but nevertheless familiar extensions to our
existing fact tables. If you drop all the indexes (except for a basic B-Tree index
for updating) and aggregations on these special new tables and pin them in
memory, you should be able to get the combined update and query performance
needed.
No comments:
Post a Comment