Pay-as-You-Go Information Integration

We propose to incrementally elaborate a dataspace through the stages of characterization, customization and checking. After describing each of these stages, we present a specific example from the domain of medication standards.

Recently, dataspace management has been proposed as an approach that takes a holistic view of all information in an enterprise, be it structured, semi-structured or unstructured, and whether or not it is supported by an explicit information system [1, 2]. While its goal is to provide services over the entirety of the data, dataspace management takes the pragmatic view that initial limits on time and effort may only permit simple services at first, such as cataloging and keyword search. Additional services or capabilities can come later, little by little, as resources permit. Such a pay-as-you-go approach seeks a steady return on investment (ROI), rather than a long period of implementation before [...]

2. CHARACTERIZATION AND CUSTOMIZATION

We have been investigating one path to pay-as-you-go elaboration of a dataspace, involving tools to aid in incremental characterization and customization of a dataspace. We assume that the data sources in a dataspace may be incompletely, or incorrectly, documented.

The first stage is to run routines that determine a class of characteristics or traits of one or more data sources. These can be fairly simple, as in current data profiling systems, such as compiling value distributions for fields and determining keys, or more complex, such as detecting domain-specific structure in generic representations.

The second stage presents customizations or enhancements that are enabled by specific discovered characteristics. Consider, for example, a UMLS-style generic relationship structure with related(Concept1 Rel_Name Concept2) as schema. Suppose characterization discovers that in every tuple <c1, has-form, c2>, concept c1 is always a clinical drug and c2 is always a dose form (e.g., tablet, capsule, syrup). Then one possible customization is to factor out tuples of this form from related into a specialized table hasForm(ClinicalDrug, DoseForm), either in materialized [...]

The third stage is to generate a check condition that can monitor that the relevant characteristic still holds as data sources evolve, and hence that the customization is still valid. For example, if a new version of the related source contains an additional tuple <c1, has-form, c2>, where c1 is a branded drug, then we must reconsider the hasForm customization.

We imagine a framework for dataspace elaboration where new modules can be plugged in, each with characterization, customization and checking components. Call such an extension a C-C-C module.

3. A MEDICATION DATASPACE

We give a more detailed example from a dataspace we are currently exploring. The RxSafe project (led by Samaritan North Lincoln Hospital and Oregon Health & Science University) is developing a consolidated medication-list facility for rural elders in Lincoln County, Oregon. We are investigating various standards proposed for e-prescribing to add value to RxSafe, such as noting equivalences of generic and brand names, or grouping medications by drug class. Two particular sources in our dataspace are RxNorm and NDF-RT. RxNorm [3] is an effort from NLM to produce a standard nomenclature for medications and their components. NDF-RT (National Drug File – Reference Terminology) [4] contains drug-class information (among other things) developed by the US Department of Veterans Affairs.

We would like to connect information from these two sources to link brand names with drug classes. Figure 1 shows an excerpt from a table we derived from RxNorm, relating Semantic Clinical Drug (SCD) with brand name. (SCD is essentially the most complete generic name for a drug, giving ingredients, their strengths and a dose form.) There are 8,431 SCDs in this table. Figure 2 is an excerpt from a table derived from NDF-RT relating SCDs with drug class and class type. (Class types are more general categories over drug classes.) This table has 6,661 entries. These two tables are themselves the results of (manual) characterization and customization of the RxNorm and NDF-RT sources.

Figure 1: SCD-to-brand-name connection derived from RxNorm

Figure 2: SCD-to-drug-class connection derived from NDF-RT

It would seem that a join of these two tables would connect brand names and drug classes for us. The problem is that the connection is incomplete. About 54% of the SCDs in the RxNorm table do not appear in the NDF-RT table. Going in the other direction, 42% of the SCDs in the NDF-RT table are missing from the RxNorm table. (Most of them are in RxNorm, just not connected with any brand name.) However, the situation is probably not as severe as it sounds. Figure 3 illustrates what we think is happening: there are variations in strength and dose form.

Figure 3: Example of missing connections between [...]

A human could probably figure out the connection between brand names and drug classes on a case-by-case basis. But is there possibly a C-C-C module that might overcome this connection problem? We think so. We could start with the collection of all SCDs in RxNorm that do have drug-class information in NDF-RT. We could then run a characterization routine to see if this collection obeys any non-trivial stratification: an equivalence relation on a domain where equivalent items have the same image under a relationship. In our example, there may be a stratification based on equality of ingredient lists, ignoring strength and dose form. (Note that we might require a prior customization to pick apart an SCD in order to express this equivalence.)

If there are such stratifications, we could choose one to help connect “unconnected” brand names with drug class. That is, consider a brand name b1 that is connected in RxNorm to an SCD s1, but s1 is not assigned a drug class in NDF-RT. If s1 is equivalent to another SCD s2 that does have a drug class c2, then we can impute c2 as the class for b1. (Note that, by the nature of a stratification, any SCD equivalent to s1 will yield the same class c2.) The check condition for this customization is that the particular stratification [...]

We are currently developing specific C-C-C modules, as well as considering high-level ways to define them and a framework for incrementally investigating and enhancing a dataspace.

This work is supported by NSF grant IIS-0534762 and [...]

6. REFERENCES

[1] A. Y. Halevy, M. J. Franklin, D. Maier. Principles of dataspace systems. Proc. of the Twenty-Fifth ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, [...]

[2] M. J. Franklin, A. Y. Halevy, D. Maier. From databases to dataspaces: A new abstraction for information management. SIGMOD Record 34(4), December 2005.

[3] S. Liu, W. Ma, R. Moore, V. Ganesan, S. Nelson. RxNorm: Prescription for electronic drug information exchange. IEEE IT Professional 7(5), September 2005.

[4] S. H. Brown, et al. VA National Drug File Reference Terminology: A cross-institutional content-coverage study. Proceedings from the Medinfo 2004 World Congress on Medical Informatics, San Francisco, August 2004.
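As an illustration of the three stages described for the hasForm example, the following is a minimal sketch, not the authors' implementation. It assumes the related source is available as in-memory triples and that each concept carries a known kind. The concept identifiers (cd1, bd1, ing1), the kind labels, the has-ingredient relationship name and all function names are hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class Related(NamedTuple):
    """One tuple of the generic related(Concept1 Rel_Name Concept2) structure."""
    concept1: str
    rel_name: str
    concept2: str


def violations(rows: List[Related], kinds: Dict[str, str]) -> List[Related]:
    """Check condition: has-form tuples that do not link a clinical drug to a dose form."""
    return [
        r for r in rows
        if r.rel_name == "has-form"
        and not (kinds.get(r.concept1) == "clinical_drug"
                 and kinds.get(r.concept2) == "dose_form")
    ]


def characterize(rows: List[Related], kinds: Dict[str, str]) -> bool:
    """Stage 1: does the characteristic hold for every has-form tuple?"""
    return not violations(rows, kinds)


def customize(rows: List[Related]) -> Tuple[List[Tuple[str, str]], List[Related]]:
    """Stage 2: factor has-form tuples out into hasForm(ClinicalDrug, DoseForm)."""
    has_form = [(r.concept1, r.concept2) for r in rows if r.rel_name == "has-form"]
    rest = [r for r in rows if r.rel_name != "has-form"]
    return has_form, rest


# Toy data; every identifier below is made up for illustration.
KINDS = {
    "cd1": "clinical_drug", "cd2": "clinical_drug", "bd1": "branded_drug",
    "tablet": "dose_form", "capsule": "dose_form", "ing1": "ingredient",
}
v1 = [
    Related("cd1", "has-form", "tablet"),
    Related("cd2", "has-form", "capsule"),
    Related("cd1", "has-ingredient", "ing1"),
]
# A later version of the source adds a has-form tuple for a branded drug.
v2 = v1 + [Related("bd1", "has-form", "tablet")]

holds_v1 = characterize(v1, KINDS)   # characteristic holds on version 1
has_form, rest = customize(v1)       # so the customization is enabled
broken = violations(v2, KINDS)       # stage 3: re-run the check on version 2
```

Here the check on the second version returns the offending tuple, which is the signal that the hasForm customization needs to be reconsidered.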

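The stratification idea from the medication example can be sketched in the same spirit. This is one possible reading, under stated assumptions: SCDs have already been picked apart into ingredients, strength and dose form by a prior customization, and a candidate equivalence is given as a key function. The names SCD, find_stratification and impute_classes, along with the toy ingredients, brands and classes, are hypothetical and do not come from RxNorm or NDF-RT.

```python
from typing import Callable, Dict, FrozenSet, Hashable, List, NamedTuple, Optional, Set, Tuple


class SCD(NamedTuple):
    """A Semantic Clinical Drug after a (hypothetical) prior customization split it up."""
    ingredients: FrozenSet[str]
    strength: str
    dose_form: str


def find_stratification(
    scd_class: Dict[SCD, str], key: Callable[[SCD], Hashable]
) -> Optional[Dict[Hashable, str]]:
    """Return {stratum: drug class} if SCDs with equal keys always share a class.

    Returns None if two equivalent SCDs disagree on class, or if every stratum
    is a singleton (the trivial case, which cannot help with imputation).
    """
    strata: Dict[Hashable, str] = {}
    for scd, cls in scd_class.items():
        if strata.setdefault(key(scd), cls) != cls:
            return None
    if len(strata) == len(scd_class):
        return None
    return strata


def impute_classes(
    brand_scd: List[Tuple[str, SCD]],
    scd_class: Dict[SCD, str],
    key: Callable[[SCD], Hashable],
    strata: Dict[Hashable, str],
) -> Dict[str, Set[str]]:
    """Connect brand names to drug classes, imputing via the stratification when needed."""
    result: Dict[str, Set[str]] = {}
    for brand, scd in brand_scd:
        cls = scd_class.get(scd) or strata.get(key(scd))
        if cls is not None:
            result.setdefault(brand, set()).add(cls)
    return result


def by_ingredients(scd: SCD) -> FrozenSet[str]:
    """Candidate equivalence: same ingredient list, ignoring strength and dose form."""
    return scd.ingredients


# Toy data; all names are placeholders.
a10_tab = SCD(frozenset({"ingA"}), "10 MG", "tablet")
a20_tab = SCD(frozenset({"ingA"}), "20 MG", "tablet")
a10_cap = SCD(frozenset({"ingA"}), "10 MG", "capsule")
b5_tab = SCD(frozenset({"ingB"}), "5 MG", "tablet")

scd_class = {a10_tab: "classX", a20_tab: "classX", b5_tab: "classY"}
brand_scd = [("BrandOne", a10_cap), ("BrandTwo", b5_tab)]  # a10_cap has no class

strata = find_stratification(scd_class, by_ingredients)
brand_class = impute_classes(brand_scd, scd_class, by_ingredients, strata or {})

# Check condition: re-run the characterization on a later version of the data.
later = dict(scd_class)
later[SCD(frozenset({"ingA"}), "20 MG", "capsule")] = "classZ"
still_holds = find_stratification(later, by_ingredients) is not None
```

In this sketch the check condition is simply the characterization routine re-run on newer data; if it stops returning a stratification, the imputed classes should no longer be trusted.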

