Chapter 13 : Health Care
There are two main types of patient treatment records. The treatment billing
record corresponds to a line item on a patient bill from a provider’s office, a
clinic, a hospital, or a laboratory. The treatment medical record, on the other
hand, is more comprehensive and includes not only the treatments that result
in charges but also all the laboratory tests, findings, and provider’s notes during
the course of treatment. The issues involved in these two kinds of records
are quite different, and we will look at them in separate sections.
Our large health care consortium must be able to share treatment billing
records smoothly from organization to organization. Billing records from all
the different kinds of providers must have a complete set of common dimensions
in order to be processed by the insurance companies and medical bill
payers. As individuals move from location to location, employer to employer,
and insurance company to government health care program, a coherent picture
of that individual’s history needs to be creatable at any time. And finally,
on the scrimmage line of health care delivery, the medical records of a patient
need to be available on short notice for legitimate medical use by any of the
primary providers.
We will choose the accumulating snapshot grain for our health care bill. Asingle
row in our fact table will represent a single line item on a health care bill.
Furthermore, this single row will represent the accumulated history of that
line item from the moment of creation of the row to the current day. When anything
about the line item changes, we revisit the unique accumulating row and
modify the row appropriately. From the point of view of the billing organization,
we’ll assume that the standard scenario of a bill includes:
Treatment date
Primary insurance billing date
Secondary insurance billing date
Responsible party billing date
Last primary insurance payment date
Last secondary insurance payment date
Last responsible party payment date
We choose these dates to be an adequate description of a normal bill. An accumulating
snapshot does not attempt to describe unusual situations fully. If the
business users occasionally need to see all the details of a particularly messy
bill payment situation, then a companion transaction grained fact table would
be needed. The purpose of the accumulating snapshot grain is to place every
health care bill into a uniform framework so that the business objectives we
described earlier can be satisfied easily.
Now that we have a clear idea of what an individual fact table row represents
(for example, the accumulated history of a line item on a health care bill), we
can complete the list of dimensions by carefully listing everything we know to
be true in the context of this row. In our hypothetical billing organization, we
know the responsible party, employer, patient, provider, provider organization,
treatment performed, treatment location, diagnosis, primary insurance
organization, secondary insurance organization, and master bill ID number.
These become our dimensions, as shown in Figure 13.2.
The interesting facts that we choose to accumulate over the history of the line
item on the health care bill include the billed amount, primary insurance paid
amount, secondary insurance paid amount, responsible party paid amount,
total paid amount (calculated), amount sent to collections, amount written off,
amount remaining to be paid (calculated), number of treatment units (depending
on treatment type), treatment duration, number of days from billing to first
primary insurance payment, number of days from billing to first secondary
insurance payment, and number of days from billing to first responsible party
payment.
Roles Played By the Date Dimension
Accumulating snapshot fact tables always involve multiple date stamps. Our
example, which is typical, has seven foreign keys pointing to the date dimension.
This is a good place to reiterate several important points:
The foreign keys in the fact table cannot be actual date stamps because
they have to handle the “Not Applicable” case. The foreign keys should
be simple integers serving as surrogate keys.
The surrogate keys assigned in the date dimension should be assigned
consecutively in order of date. This is the only dimension where the
surrogate keys have any relationship to the underlying semantics of the
dimension. We do this so that physical partitioning of a fact table can be
accomplished by using one of the date-based foreign keys. In our example
we recommend that the treatment date key be used as the basis for physically
partitioning the fact table.
Surrogate keys corresponding to special conditions such as “Not Applicable,”
“Corrupted,” or “Hasn’t Happened Yet” should be assigned to the
top end of the numeric range so that these rows are physically partitioned
together in the hot partition with the most recent data. We do this if these
rows are ones that are expected to change.
We do not join the seven date-based foreign keys to a single instance of
the date dimension table. Such a join would demand that all seven dates
were the same date. Instead, we create seven views on the single underlying
date dimension table, and we join the fact table separately to these
seven views, just as if they were seven independent date dimension
tables. This allows the seven dates to be independent. We refer to these
seven views as roles played by the date dimension table.
The seven view definitions using the date dimension table should cosmetically
relabel the column names of each view to be distinguishable so that
query tools directly accessing the views will present the column names
through the user interface in a way that is understandable to the end user.
Although the role-playing behavior of the date dimension is very characteristic
of accumulating snapshot fact tables, other dimensions often play roles in
similar ways, such as the payer dimension in Figure 13.2. Later in this chapter
we will see how the physician dimension needs to have several roles in complex
surgical procedures depending on whether the physician is the primary
responsible physician, working in a consulting capacity, or working in an
assisting capacity.
If there were always a maximum of three diagnoses, for instance, we might be
tempted to create three diagnosis dimensions, almost as if they were roles.
However, diagnoses don’t behave like roles. Unfortunately, there are often
more than three diagnoses, especially for elderly patients who are hospitalized.
Real medical bill-paying organizations sometimes encounter patients
with more than 50 diagnoses! Also, the diagnoses don’t fit into well-defined
roles other than possibly admitting diagnosis and discharging diagnosis. The
role-playing dimensions we talked about in the preceding section are categorized
much more naturally and disjointly. Finally, the multiple-slots style of
design makes for very inefficient applications because the query doesn’t know
a priori which dimensional slot to constrain for a particular diagnosis.
We handle the open-ended nature of multiple diagnoses with the design
shown in Figure 13.3. We replace the diagnosis foreign key in the fact table
with a diagnosis group key. This diagnosis group key is connected by a manyto-
many join to a diagnosis group bridge table, which contains a separate row
for each diagnosis in a particular group.
If a patient has three diagnoses, then that patient is assigned a diagnosis group
with three diagnoses. We assign a numerical weighting factor to each diagnosis
in the group such that the sum of all the weighting factors in the group is exactly
1.00. We can then use the weighting factors to allocate any of the numeric additive
facts across individual diagnoses. In this way we can add up all billed
amounts by diagnosis, and the grand total will be the correct grand total billed
amount. This kind of report would be called a correctly weighted report.
Finally, if the many-to-many join in Figure 13.3 causes problems for your modeling
tool that insists on proper foreign-key-to-primary-key relationships, the
equivalent design of Figure 13.4 can be used. In this case an extra table whose
primary key is diagnosis group is inserted between the fact table and the bridge
table. Now both the fact table and the bridge table have conventional many-toone
joins in all directions. There is no new information in this extra table.
Fact Dimension for Sparse Facts
We handle the extreme variability of the medical record entry with a special
dimension we call a fact dimension. In Figure 13.8 the entry type is a fact dimension
that describes what the row means or, in other words, what the fact
represents. The entry type dimension also determines which of the four kinds
of fact fields (amount, flag, comment, and JPEG file name) are valid for the
specific entry and how to interpret each field. For example, the generic amount
column is used for every numeric entry. The unit of measure for a given
numeric entry is found in the attached entry type dimension row, along with
any additivity restrictions. If the entry is a flag (for example, Yes/No or High/
Medium/Low), the types of flag values are found in the entry type dimension.
If the entry is a free-text comment or a multimedia object such as JPEG graphic
image or photograph, the entry type dimension alerts the requesting application
to look in these fact table fields.
This approach is elegant because it is superbly flexible. We can add new measurement
types just by adding new rows in the fact dimension, not by altering
the structure of the fact table. We also eliminate the nulls in the classic positional
fact table design because a row exists only if the measurement exists.
However, there are some significant tradeoffs. Using a fact dimension may
generate lots of new fact table rows. If an event resulted in 10 numeric measurements,
we now have 10 rows in the fact table rather than a single row in
the classic design. For extremely sparse situations, such as clinical/laboratory
or manufacturing test environments, this is a reasonable compromise. However,
as the density of the facts grows, we end up spewing out too many fact
rows. At this point we no longer have sparse facts and should return to the
classic fact table approach.
Late-Arriving Fact Rows
Using our patient treatment scenario, suppose that we receive today a treatment
row that is several months old. In most operational data warehouses we
are willing to insert this late-arriving row into its correct historical position,
even though our summaries for the prior month will now change. However,
we must choose the old contemporary dimension rows that apply to this
treatment carefully. If we have been date stamping the dimension rows in our
type 2 SCDs, then our processing involves the following steps:
1. For each dimension, find the corresponding dimension row whose date
stamp is the latest date stamp less than or equal to the date of the treatment.
2. Using the surrogate keys found in the each of the dimension rows from
step 1, replace the natural keys of the late-arriving fact row with the
surrogate keys.
3. Insert the late-arriving fact row into the correct physical partition of the
database containing the other fact rows from the time of the late-arriving
treatment.
Late-Arriving Dimension Rows
A late-arriving dimension row presents an entirely different set of issues that,
in some ways, are more complex than a late-arriving fact row. Suppose that
John Doe’s patient dimension row contains a marital flag attribute that always
contained the value “Single.” We have a number of patient rows for John Doe
because this is a slowly changing dimension and other attributes such as
John’s address and employment status have changed over the past year or two.
Today we are notified that John Doe was married on July 15, 1999 and has been
married ever since. To add this new information to the data warehouse
requires the following steps:
1. Insert a fresh row, with a new surrogate key, for John Doe into the patient
dimension with the marital status attribute set to “Married” and the effective
date set to “July 15, 1999.”
2. Scan forward in the patient dimension table from July 15, 1999, finding
any other rows for John Doe, and destructively overwrite the marital
status field to “Married.”
3. Find all fact rows involving John Doe from July 15, 1999 to the first next
change for him in the dimension after July 15, 1999 and destructively
change the patient foreign key in those fact rows to the new surrogate key
created in step 1.
This is a fairly messy change, but you should be able to automate these steps
in a good programmable ETL environment. We have some subtle issues in this
case, too. First, we need to check to see if some other change took place for
John Doe on July 15, 1999. If so, then we only need to perform step 2. We don’t
need a new dimension row in this special case.
Second, since we are using a pair of date stamps in each product dimension
row, we need to find the closest previous to July 15 patient row for John Doe
and change its end date to July 15, 1999, and we also need to find the closest
subsequent to July 15 patient row for John Doe and set the end date for the July
15, 1999 entry to the begin date of that next row. Got it?
There are two main types of patient treatment records. The treatment billing
record corresponds to a line item on a patient bill from a provider’s office, a
clinic, a hospital, or a laboratory. The treatment medical record, on the other
hand, is more comprehensive and includes not only the treatments that result
in charges but also all the laboratory tests, findings, and provider’s notes during
the course of treatment. The issues involved in these two kinds of records
are quite different, and we will look at them in separate sections.
Our large health care consortium must be able to share treatment billing
records smoothly from organization to organization. Billing records from all
the different kinds of providers must have a complete set of common dimensions
in order to be processed by the insurance companies and medical bill
payers. As individuals move from location to location, employer to employer,
and insurance company to government health care program, a coherent picture
of that individual’s history needs to be creatable at any time. And finally,
on the scrimmage line of health care delivery, the medical records of a patient
need to be available on short notice for legitimate medical use by any of the
primary providers.
We will choose the accumulating snapshot grain for our health care bill. Asingle
row in our fact table will represent a single line item on a health care bill.
Furthermore, this single row will represent the accumulated history of that
line item from the moment of creation of the row to the current day. When anything
about the line item changes, we revisit the unique accumulating row and
modify the row appropriately. From the point of view of the billing organization,
we’ll assume that the standard scenario of a bill includes:
Treatment date
Primary insurance billing date
Secondary insurance billing date
Responsible party billing date
Last primary insurance payment date
Last secondary insurance payment date
Last responsible party payment date
We choose these dates to be an adequate description of a normal bill. An accumulating
snapshot does not attempt to describe unusual situations fully. If the
business users occasionally need to see all the details of a particularly messy
bill payment situation, then a companion transaction grained fact table would
be needed. The purpose of the accumulating snapshot grain is to place every
health care bill into a uniform framework so that the business objectives we
described earlier can be satisfied easily.
Now that we have a clear idea of what an individual fact table row represents
(for example, the accumulated history of a line item on a health care bill), we
can complete the list of dimensions by carefully listing everything we know to
be true in the context of this row. In our hypothetical billing organization, we
know the responsible party, employer, patient, provider, provider organization,
treatment performed, treatment location, diagnosis, primary insurance
organization, secondary insurance organization, and master bill ID number.
These become our dimensions, as shown in Figure 13.2.
The interesting facts that we choose to accumulate over the history of the line
item on the health care bill include the billed amount, primary insurance paid
amount, secondary insurance paid amount, responsible party paid amount,
total paid amount (calculated), amount sent to collections, amount written off,
amount remaining to be paid (calculated), number of treatment units (depending
on treatment type), treatment duration, number of days from billing to first
primary insurance payment, number of days from billing to first secondary
insurance payment, and number of days from billing to first responsible party
payment.
Roles Played By the Date Dimension
Accumulating snapshot fact tables always involve multiple date stamps. Our
example, which is typical, has seven foreign keys pointing to the date dimension.
This is a good place to reiterate several important points:
The foreign keys in the fact table cannot be actual date stamps because
they have to handle the “Not Applicable” case. The foreign keys should
be simple integers serving as surrogate keys.
The surrogate keys assigned in the date dimension should be assigned
consecutively in order of date. This is the only dimension where the
surrogate keys have any relationship to the underlying semantics of the
dimension. We do this so that physical partitioning of a fact table can be
accomplished by using one of the date-based foreign keys. In our example
we recommend that the treatment date key be used as the basis for physically
partitioning the fact table.
Surrogate keys corresponding to special conditions such as “Not Applicable,”
“Corrupted,” or “Hasn’t Happened Yet” should be assigned to the
top end of the numeric range so that these rows are physically partitioned
together in the hot partition with the most recent data. We do this if these
rows are ones that are expected to change.
We do not join the seven date-based foreign keys to a single instance of
the date dimension table. Such a join would demand that all seven dates
were the same date. Instead, we create seven views on the single underlying
date dimension table, and we join the fact table separately to these
seven views, just as if they were seven independent date dimension
tables. This allows the seven dates to be independent. We refer to these
seven views as roles played by the date dimension table.
The seven view definitions using the date dimension table should cosmetically
relabel the column names of each view to be distinguishable so that
query tools directly accessing the views will present the column names
through the user interface in a way that is understandable to the end user.
Although the role-playing behavior of the date dimension is very characteristic
of accumulating snapshot fact tables, other dimensions often play roles in
similar ways, such as the payer dimension in Figure 13.2. Later in this chapter
we will see how the physician dimension needs to have several roles in complex
surgical procedures depending on whether the physician is the primary
responsible physician, working in a consulting capacity, or working in an
assisting capacity.
If there were always a maximum of three diagnoses, for instance, we might be
tempted to create three diagnosis dimensions, almost as if they were roles.
However, diagnoses don’t behave like roles. Unfortunately, there are often
more than three diagnoses, especially for elderly patients who are hospitalized.
Real medical bill-paying organizations sometimes encounter patients
with more than 50 diagnoses! Also, the diagnoses don’t fit into well-defined
roles other than possibly admitting diagnosis and discharging diagnosis. The
role-playing dimensions we talked about in the preceding section are categorized
much more naturally and disjointly. Finally, the multiple-slots style of
design makes for very inefficient applications because the query doesn’t know
a priori which dimensional slot to constrain for a particular diagnosis.
We handle the open-ended nature of multiple diagnoses with the design
shown in Figure 13.3. We replace the diagnosis foreign key in the fact table
with a diagnosis group key. This diagnosis group key is connected by a manyto-
many join to a diagnosis group bridge table, which contains a separate row
for each diagnosis in a particular group.
If a patient has three diagnoses, then that patient is assigned a diagnosis group
with three diagnoses. We assign a numerical weighting factor to each diagnosis
in the group such that the sum of all the weighting factors in the group is exactly
1.00. We can then use the weighting factors to allocate any of the numeric additive
facts across individual diagnoses. In this way we can add up all billed
amounts by diagnosis, and the grand total will be the correct grand total billed
amount. This kind of report would be called a correctly weighted report.
Finally, if the many-to-many join in Figure 13.3 causes problems for your modeling
tool that insists on proper foreign-key-to-primary-key relationships, the
equivalent design of Figure 13.4 can be used. In this case an extra table whose
primary key is diagnosis group is inserted between the fact table and the bridge
table. Now both the fact table and the bridge table have conventional many-toone
joins in all directions. There is no new information in this extra table.
Fact Dimension for Sparse Facts
We handle the extreme variability of the medical record entry with a special
dimension we call a fact dimension. In Figure 13.8 the entry type is a fact dimension
that describes what the row means or, in other words, what the fact
represents. The entry type dimension also determines which of the four kinds
of fact fields (amount, flag, comment, and JPEG file name) are valid for the
specific entry and how to interpret each field. For example, the generic amount
column is used for every numeric entry. The unit of measure for a given
numeric entry is found in the attached entry type dimension row, along with
any additivity restrictions. If the entry is a flag (for example, Yes/No or High/
Medium/Low), the types of flag values are found in the entry type dimension.
If the entry is a free-text comment or a multimedia object such as JPEG graphic
image or photograph, the entry type dimension alerts the requesting application
to look in these fact table fields.
This approach is elegant because it is superbly flexible. We can add new measurement
types just by adding new rows in the fact dimension, not by altering
the structure of the fact table. We also eliminate the nulls in the classic positional
fact table design because a row exists only if the measurement exists.
However, there are some significant tradeoffs. Using a fact dimension may
generate lots of new fact table rows. If an event resulted in 10 numeric measurements,
we now have 10 rows in the fact table rather than a single row in
the classic design. For extremely sparse situations, such as clinical/laboratory
or manufacturing test environments, this is a reasonable compromise. However,
as the density of the facts grows, we end up spewing out too many fact
rows. At this point we no longer have sparse facts and should return to the
classic fact table approach.
Late-Arriving Fact Rows
Using our patient treatment scenario, suppose that we receive today a treatment
row that is several months old. In most operational data warehouses we
are willing to insert this late-arriving row into its correct historical position,
even though our summaries for the prior month will now change. However,
we must choose the old contemporary dimension rows that apply to this
treatment carefully. If we have been date stamping the dimension rows in our
type 2 SCDs, then our processing involves the following steps:
1. For each dimension, find the corresponding dimension row whose date
stamp is the latest date stamp less than or equal to the date of the treatment.
2. Using the surrogate keys found in the each of the dimension rows from
step 1, replace the natural keys of the late-arriving fact row with the
surrogate keys.
3. Insert the late-arriving fact row into the correct physical partition of the
database containing the other fact rows from the time of the late-arriving
treatment.
Late-Arriving Dimension Rows
A late-arriving dimension row presents an entirely different set of issues that,
in some ways, are more complex than a late-arriving fact row. Suppose that
John Doe’s patient dimension row contains a marital flag attribute that always
contained the value “Single.” We have a number of patient rows for John Doe
because this is a slowly changing dimension and other attributes such as
John’s address and employment status have changed over the past year or two.
Today we are notified that John Doe was married on July 15, 1999 and has been
married ever since. To add this new information to the data warehouse
requires the following steps:
1. Insert a fresh row, with a new surrogate key, for John Doe into the patient
dimension with the marital status attribute set to “Married” and the effective
date set to “July 15, 1999.”
2. Scan forward in the patient dimension table from July 15, 1999, finding
any other rows for John Doe, and destructively overwrite the marital
status field to “Married.”
3. Find all fact rows involving John Doe from July 15, 1999 to the first next
change for him in the dimension after July 15, 1999 and destructively
change the patient foreign key in those fact rows to the new surrogate key
created in step 1.
This is a fairly messy change, but you should be able to automate these steps
in a good programmable ETL environment. We have some subtle issues in this
case, too. First, we need to check to see if some other change took place for
John Doe on July 15, 1999. If so, then we only need to perform step 2. We don’t
need a new dimension row in this special case.
Second, since we are using a pair of date stamps in each product dimension
row, we need to find the closest previous to July 15 patient row for John Doe
and change its end date to July 15, 1999, and we also need to find the closest
subsequent to July 15 patient row for John Doe and set the end date for the July
15, 1999 entry to the begin date of that next row. Got it?
No comments:
Post a Comment