MILPRIT : Mining Interval Logic Patterns with Regular Expression Constraints Sandra de Amo1, Arnaud Giacometti2, Magno Soares Santana1
1Faculdade de Computac¸˜ao – Universidade Federal de Uberlˆandia (UFU)
[email protected], [email protected], [email protected]Abstract. Most methods for temporal pattern mining assume that time is pon- tual, that is, represented by points in a straight line beginning at some initial instant. In this paper, we consider a new temporal pattern, where time is ex- plicitly represented by intervals. We present the algorithm MILPRIT for mining these kind of temporal patterns, which uses variants of the classical level-wise search algorithms. We also present some preliminary experimental results and discuss further applications. Keywords. Sequential Pattern, Interval Logic, Contraint-based Mining. 1. Introduction We present a new sequential pattern where time is measured in terms of intervals instead of points. These patterns, which we call temporal interval patterns, aims at capturing how events taking place in time intervals relate to each other. For instance, (1) in a medical application, we could be interested in discovering if patients who take some medicine X during a certain period of time, and who presented the symptom Y before taking the medicine, will present the symptom Z during or after taking the medicine X, (2) in an agricultural application, we could be interested in discovering if the use of some organic fertilizer during a period of time has an effect on the way a plant grows during and after the fertilizer application.
In [5], Allen’s Propositional Interval Logic ([3]) has been used for the first time,
to treat the problem of discovering association rules over time series. However, to ourknowledge the use of Interval Logic in temporal data mining has been restricted to propo-sitional temporal patterns, that is, patterns not involving first order predicates. The needfor more expressive kind of temporal patterns arises for instance, when modelling Unix-users behaviour ([6]), as is pointed out in [7]. Our temporal pattern is defined as a set ofatomic first order formulae where time is explicitly represented by an interval variable,together with a set of interval relationships (before, during, overlaps, start, finish, meet)described in terms of Allen’s First Order Interval Logic ([3]). An example of such tempo-ral pattern is {M ed(x, penicillin,i1), Symp(x, dizziness,i2), during(i2, i1)}, meaning
that “patients who take penicillin during the time interval i1 will feel dizzy during thisperiod, but eventually the dizziness will stop before they stop taking the medicine”. Some related work. The algorithm MILPRIT we introduce in this paper is designed to mine first order temporal patterns, where time is represented by intervals, and regular expression contraints are used to restrict the search space. In [7], the algorithm SeqLog, based on Inductive Logic Programming, was introduced for discovering first order se- quential patterns where time is represented by points in a straight line (and not by inter- vals like in our case). Recently, many work on contrained based temporal mining have adopted regular expressions as a mechanism to specify user constraints ([4, 8, 2]) in order to not overwhelming him(her) with uninteresting patterns. In [8], the algorithm (Spirit- Log) designed to mine first order temporal patterns was introduced. Spirit-Log extends SeqLog (where time is pontual), by pushing regular expression constraints in the mining process. 2. Problem Formalization In this section, we introduce the main concepts related to the problem of discovering temporal interval patterns. Temporal i-Patterns.
Let us suppose three disjoints sets of symbols P (predicates), V
(data variables) and C (data constants). Let us also assume the set of temporal predicates T= {before, meets, overlaps, during, finishes, starts } and two other disjoint sets of symbolsVt (temporal variables) and Ct (temporal constants). A data atom is an expression of typep(s1, ., sn, e) or t1 = t2, where each si are data variables or data constants, e is a temporalvariable, t1 is a data variable and t2 is a data constant or a data variable. A temporal atomis an expression of type r(e, f ) or e = f , where r is one of the temporal predicates in Tand e, f are temporal variables.
The following figure illustrates the temporal predicates:
The temporal variables are evaluated over intervals [a, b], where a and b are dates
with a < b. For instance, the formula before(e,f) is true if the variables e and f are evalu-ated as [10/12/2004, 15/12/2005] and [17/12/2004, 28/12/2005] respectively. The for-mula during(e,f) is true if the variables e and f are evaluated as [10/12/2004, 15/12/2005]and [5/12/2004, 28/12/2005] respectively.
A temporal i-pattern (temporal interval pattern) is a triple (K,D,T ) where (1)
D is a set of data atoms (2) T is a set of temporal atoms and (3) K is a special data atom K(x1, ., xn) whose variables x1, ., xn appear in D. The data atom K is called the reference query. Example 1 Let us consider the temporal i-pattern (Patient(x), D, T ) where D {Med(x, y, e), Symp(x, z, f )} and T = {bef ore(e, f )}. Intuitively, this temporal pat-tern translates the fact that a patient x takes the medicine y during a period of time e and
during a further period of time f (after he has stopped taking the medicine y) presentsa disease symptom z. Note that we are always interested in analyzing the behaviour ofpatients registered in the relation Patient, even though relations Med and Symp may con-tain data related to other patients (not only those registered in Patient). For this reason,Patient(x) is called the reference query.
An instantiation of a temporal variable e is a mapping v associating an interval
v(e) = [a, b] to e, where a,b are integers (mapping dates). A ground temporal atom is thetemporal atom obtained by instantiating its temporal variables.
In what follows, we suppose that the set of temporal atoms T is consistent, i.e.,
the temporal variables can be instantiated in a consistent way. For instance, the set oftemporal atoms {before(e, f ), before(f, g), before(g, e)} is not consistent, because thefirst two atoms imply that e comes before g and so, there is no way to instantiate thetemporal variables e, f and g in order to make the three temporal formulae simultaneouslytrue. We also suppose that the set T is complete, i.e., for each pair of temporal variablese, f appearing in T it is possible to infer a temporal relationship between e and f . Forinstance, the set {before(e,f),meet(f,g)} is complete because the relationship between eand g is completely determined (even though it does not appear explicitly in the set):the only possibility is before(e,g). On the other hand, {during(e,f), overlaps(f,g)} is notcomplete because the relationship between f and g is not deterministically implied bythe two relationships during(e,f), overlaps(f,g): it could be during(e, g), bef ore(e, g) oroverlaps(e, g).
As we will see next, given a consistent and complete set of temporal atoms, its
temporal variables can be ordered in a natural way. Due to space limitations, we will givean intuitive idea of this order relation, ommiting its rigorous formalization.
First of all, we can define a total order ≤ over the temporal domain I of intervals
[a, b], a, b ∈ N in the following way: Definition 1 (A total order over the intervals) Let [a, b] and [c, d] be two intervals in I. We say that [a, b] ≤ [c, d] if and only if one and only one of the following conditions is verified: (1) a = c and b = d, (2) b < d or (3) b = d and a < c.
This order over the interval structure I naturally induces an ordering over the
temporal variables of a set of temporal atoms T . The following example illustrates theidea: let T = {before(e,f), during(f,g). Then the induced ordering over the set of temporalvariables {e, f, g} is e <T f <T g. Indeed, for any instantiation Ie = [ae, be], If =
[af , bf ], Ig = [ag, bg] such that the predicates in T are simultaneously true, the endingpoints be, bf , bg necessarily satisfy be < bf < bg. Then Ie ≤ Ib ≤ Ic.
The induced order over the temporal variables appearing in a temporal i-pattern
M allows to view M as a sequence of data atoms. Indeed, let M = (K, D, T ), where D= {p1(x1, ., x1 , ek)}. We know that T is complete and consistent,
and so, its temporal variables are ordered by the natural order <T discussed above. Letus suppose, without loss of generality, that e1 <T e2 <T . . . <T ek. So, the set D can beviewed as the sequence < p1, ., pk >. The Dataset.
Let R = {K, R1, ., Rn} be a temporal database schema, i.e., a database
schema such that each relational schema Ri has arity ki ≥ 2 and sort (data,.,data,time) and K has arity m and sort (data,.,data). A temporal dataset over R is a set of temporal relations {k, r1, ., rn} where each ri is a set of tuples (a1, ., an , e) over R
set of tuples over K. A temporal relation r is called coalesced if for all (a1, ., ak, e)and (a1, ., ak, f ) ∈ r, we have e ∩ f = ∅, i.e., intervals associated to a same tuple aremaximal. In what follows, we assume that all temporal relations are coalesced. Definition 2 Let DB be a temporal database over the schema R = {K, R1, ., Rn} and M = (K(x1, ., xm),{D1, .Ds}, {T1, ., Tl }) be a temporal i-pattern over R. The support of M
with respect to DB, denoted by sup(M ), is defined by sup(M ) =
QM is the relational calculus query ∃y1∃y2.∃yk(D1 ∧ D2 ∧ . . . ∧ Ds ∧ T1 ∧ . . . ∧ Tl), andy1, ., yk are the data variables not appearing in the reference query K(x1, ., xm). Theexpression u |= QM means that u is an answer to the query QM when executed over DB. Given 0 ≤ α ≤ 1, we say that the temporal i-pattern M is frequent with respect to DBand α if sup(M ) ≥ α. Example 2 Let us consider the temporal database schema R = {Patient(NamePat), Med(NamePat, NameMed, T), Symp(NamePat, NameSymp, T)}. Let us consider the temporal i-pattern M = (Patient(x), {M ed(x, y, e), Symp(x, z, f )}, {bef ore(e, f )}). Let DB be the following temporal database P atient = {(Paul),(Charles),(John)}, Med = {(Paul,penicillin,[1,3]), (Charles,tetracycline, [4,7]), (Mary,penicillin,[8,10])} and Symp = {(Paul,dizziness,[4,6]), (Charles,dizziness, [5,6]), (John,fever,[8,10])}.
tional query QM can be specified in the relational algebra formalism by the expressionΠNamePat(P atientSymp). The answer of QM is {(Paul)}. Note that Charles
is not an answer. Indeed, Charles has taken tetracycline and was dizzy during (and notafter) the time he was taking this medicine. John is not an answer neither, because he hasnot taken any medicine at all. So, the support of M w.r.t. DB is 1 . Notice that Mary is
not taken into account because she does not appear in the relation Patient (that is, she isnot an answer to the reference query Patient(x)). 3. Temporal Patterns with Restrictions In an application where the user has an idea of some special characteristics of the patterns he(she) is searching, he(she) can be overwhelmed with uninteresting patterns. In this section, we propose to use regular expressions in order to reduce the search space for temporal patterns and better satisfy user requirements. These regular expressions must be satisfied by the temporal i-patterns, viewed as sequences as discussed before.
A mode over a temporal database R = {K, R1, ., Rk} is an expression of the
form R(u1, ., us, #), where each ui is a data variable or the (new) symbol ∗, # is a new symbol, and R is one of the predicates Ri ∈ R. We say that a data atom A is in accord with the mode R(u1, ., us, #) if A = R(y1, ., ys, e) and for each i = 1, ., s we have: (1) If ui is a variable then yi = ui; (2) If ui = ∗, then yi is a variable or a constant ; (3) e is a temporal variable. For instance, M ed(x, ∗, #) is a mode and the
atom M ed(x, penicillin, e) is in accord with M ed(x, ∗, #). On the other hand, the atomMed(y, penicillin, e) is not in accord with the mode Med(x, z, #).
Let Σ be a set of modes over R = {K, R1, ., Rn}. In what follows, we will
consider regular languages (sets of strings) over the alphabet Σ in order to reduce the search space size. These languages are specified by regular expressions. Given a regular expression L over the alphabet Σ, we denote by M(L) the set of temporal i-patterns (K, < p1, ., pn >, T ) over R such that there exists a string w1, w2, ., wn ∈ L, such that pi is in accord with wi for each i = 1, ., n.
The search space associated to L is the set of temporal i-patterns M =
(K(x1, ., xm),D,T ) ∈ M(L) such that D =< p1(t1, ., t1 , e
satisfies the following property: Let < w1, ., wn > be the string of modes correspond-ing to the sequence < p1, . . . , pn >. The data variables appearing in M which replacesa symbol ∗ in some wi are not in {x1, ., xm} and have only one occurrence in D. Wedenote by QL the search space specified by L. Example 3 Let us consider the database schema R= {Patient, Med, Symp } of our running example.
Let L = M ed(x, ∗, #) M ed(x, ∗, #)∗ Symp(x, ∗, #).
pattern M1 = (P atient(x), {M ed(x, a, e), M ed(x, b, f ), Symp(x, diarrhea, g)},{bef ore(e, f ), overlaps(f, g)}) belongs to QL. On the other hand, the pattern M2 =(P atient(x), {M ed(x, y, e), Symp(x, y, f ) }, {bef ore(e, f ) }) does not belong to QL. Intuitively, the regular expression L captures the temporal i-patterns corresponding to apatient x taking certain medicines during successive periods of time and eventually pre-senting some symptom.
Our mining task is: Given a temporal database DB, a minimal support threshold
α, 0 ≤ α ≤ 1, and a regular expression L, find all temporal i-patterns which belong to thesearch space associated to L and are frequent with respect to DB. 4. The algorithm MILPRIT In this section, we present the Algorithm MILPRIT (Mining Interval Logic Patterns wiht Regular expressIons conTraints) which generalizes the idea of the SPIRIT algorithm in- troduced in [4] in the context of classical sequence patterns. In a high level, it follows the general Apriori strategy of Agrawal/Srikant [1], working in passes, and each pass producing patterns more specificic than those produced in the previous pass.
The classical Apriori strategy uses a Pruning Phase, where all patterns more spe-
cific than a pattern p not belonging to the frequent patterns produced so far (at the previouspass) are pruned. This pruning strategy relies on the Antimonotonic Property: “frequentpatterns cannot be more specific than not frequent patterns”. In our setting, at the end ofpass k, the algorithm discovers the frequent patterns Fk which satisfy a regular expressionL. The constraint L is pushed into the generation phase at pass k + 1 in order to produceonly patterns satisfying it. So, the number of generated patterns is in direct proportionto the restrictiveness of L. In the pruning phase, however, all patterns which are morespecific than a pattern satisfying L and not belonging to Fk should be pruned. We notice
that in this case, the size of the set of pruned patterns is in inverse proportion to the re-strictiveness of the constraint L. Such tradeoffs are due to the fact that regular expressioncontraints are not anti-monotone, that is, if a pattern satisfies L it may have some sub-pattern not satisfying L. In order to find a suitable trade-off, we use a relaxation of theconstraint L, that is, a less restrictive one, the prefix constraint associated to L. We callP ref (QL) the set of prefixes of words w ∈ QL. The general structure of the algorithmMILPRIT is depicted below:
Procedure MILPRIT(DB,α,R) DB is a temporal dataset, R is a regular expression over the alphabet of modes Σ over the temporal database schema R = {K, R1, ., Rn} underlying DB. Let M1 be the set of data atoms Ri(x1, ., xn ). begin 1. F := F1 = {M ∈ M1 : M is frequent and M ∈ P ref (QL)} 2. k := 2 3. repeat 4. Ck := Gen(Fk−1,L) (Generation Phase) 5. P := {M ∈ Ck : ∃N ∈ P ref (QL), such that N ∈ F and M is more specific than N } 6. Ck := Ck − P (Pruning Phase) 7. Fk := {M ∈ Ck : support(M) ≤ α} (Validation Phase) 8. F := F ∪ Fk 9. k := k + 1 10. until Fk = ∅ 11. F := {M ∈ F : M ∈ QL } The Generation Phase.
MILPRIT works in passes, each pass k producing patterns of
“level” k. The “level” of a temporal i-pattern M is measured by means of the refinementvector associated to M :
Definition 3 The refinement vector associated to M = (K, D, T ) is v(M ) = (n, c) where n is the size of D and c is the number of constants appearing in D. The refinement level of M , denoted by l(M ), is defined as n + c. For instance, let M be the temporal i-pattern (P atient(x), {M ed(x, penicillin, e), Symp(x, z, f )}, {bef ore(e, f )}). Then v(M ) = (2, 1) and l(M) = 3.
The procedure Gen(Fk−1,L) is designed to generate temporal i-patterns whose
refinement level is k, by specializing the temporal i-patterns of Fk−1 (whose refinementlevel is k − 1) in such a way that the patterns produced constitute a prefix of patternssatisfying L. This is accomplished by two specialization operators defined below. Definition 4 (Extension) Let L be a regular expression and M ∈ P ref (QL), M = (K, D, T ). The extension operator ρE executed over M is defined as follows: (a) if v(M) = (n, c) with c > 0, then ρE(M) = ∅; (b) if v(M) = (n, 0) then ρE(M) is the set of temporal i-patterns M = (K, D , T ) ∈ QL such that D = D ∪ {pn+1(x1, . . . , xl, en+1)} where pn+1 ∈ P red, xi are data variables, en+1 is a temporal variable and T = T ∪ {r1(e1, en+1), . . . , rn(en, en+1)} where ri are temporal predicates, and T is complete and consistent. Example 4 Let L be the regular expression E = M ed(x, ∗, #)∗Symp(x, ∗, #). The i-pattern M {Med(x, y, e1)}, ∅) belongs to Pref (QL).
Let us consider the following i-patterns:
M1 = (P atient(x), {Med(x, y1, e1),
Med(x, y2, e2)},
{overlaps(e1, e2)}),
Med(x, y2, e2),Symp(x, z, e3)}, {overlaps(e1, e2), overlaps(e2, e3)}). ρE(M), but M2 is not in ρE(M1), because there is no explicit nor implicit relationshipbetween the temporal variables e2 and e3. Definition 5 (Instantiation) Let L be a regular expression and M ∈ P ref (QL), M = (K(x1, ., xn), D, T ). The instantiation operator executed over M produces the set ρI(M) of temporal i-patterns M = (K, E , T ) obtained by replacing some variable yi (yi = xj, for all j = 1, ., n) appearing in some data atom D = p(., yi, .) of D by a constant c in such a way that: (a) the mode ws = p(u1, . . . , ui, . . . , ul, #) corresponding to the data atom D in the specification of M according to a string w1.ws.wk ∈ P ref (L) contains a “∗” in the position ui; (b) there is no ∗ after the position i in ws such that the data atom p(., yi, .) of D contains a constant in that position; (c) there is no constant in the data atoms after p(., yi, .) in the sequence D. Example 5 Let L be the regular expression E = M ed(x, ∗, #)∗Symp(x, ∗, #). Then M2 and M3 ∈ ρI(M1), M4 ∈ ρI(M1), M4 ∈ ρI(M2), M4 ∈ ρI(M3), where: M1 = (P atient(x), {Med(x, y1, e1), Med(x, y2, e2)}, {overlaps(e1, e2)})M2 = (P atient(x), {Med(x, penicillin, e1), Med(x, y2, e2)}, {overlaps(e1, e2)})M3 = (P atient(x), {Med(x, y1, e1), Med(x, prozac, e2)}, {overlaps(e1, e2)})
M4 = (P atient(x), {Med(x, penicillin, e1), Med(x, prozac, e2)}, {overlaps(e1, e2)})
The specialization operator ρ defined as ρ(M ) = ρI(M ) ∪ ρE(M ) satisfies some
nice properties: (a) it is complete, in the sense that every i-pattern M ∈ P ref (QL) canbe obtained by applying ρ successively to some i-pattern in the refinement level 1; (b)by appropriately restricting the form of the regular expression L 1, ρ is optimal, in thesense that a pattern cannot be obtained by specializing successively two non-equivalenti-patterns2. 5. Preliminary Experimental Results and Discussion MILPRIT has been implemented in C++, under Linux, and executed on a Pentium IV, 3.0 GHz, 1Gb RAM and 80Gb HD SEAGATE ST336607SW. We have developped a synthetic data generator allowing to adjust several parameters in order to test the perfor- mance and scalability of MILPRIT. We have performed some preliminary experiments over synthetic data. We have generated a dataset with two tables R1(A, B, Begin, End), R2(A, C, Begin, End), each table with 3000 tuples, a reference table K(A) with 500 tuples. This dataset has been generated in such a way it includes at least 20 temporal i-patterns, satisfying the regular expression R2(x, ∗, #)*R1(x, ∗, #)* with support 15%.
1i.e., by requiring that if two distinct modes appear in L, associated to the same predicate p, the positions containing
the symbol * are the same in boths modes.
2Two temporal i-patterns M and M are equivalent if M is obtained from M by renaming its variables.
MILPRIT has been executed over this dataset, with minimum support threshold 10%, andthe execution time was 28 minutes. It stopped at level 10, producing all the temporali-patterns introduced in the dataset.
For the time being, MILPRIT executes correctly over small datasets, containing
an average of 50 temporal i-patterns. A lot of work has to be done yet. First, we have tooptimize MILPRIT in order to evaluate its performance and scalability over large volumeof data. We are investigating buffer management techniques in order to deal with candi-date sets which don’t fit in main memory. Another issue we are investigating is how tooptimize the support counting to validate “long” candidates. Secondly, we have to testMILPRIT over real data. We intend to use the AMDI database for this purpose. AM DIis part of a huge project which aims at building a system supporting an indexed atlas ofdigital mammograms. This system will be available online and intend to be used by ra-diologists as well as radiology students, in order to help them in breast cancer diagnosis. The atlas will store, for each patient, a series of digital mammograms, taken during itslifetime. Besides image data, some important temporal data will be stored, related to thepatient habits and life style: in which periods of time did she smoke, or take contraceptivedrugs, for instance. MILPRIT will be used in order to find temporal i-patterns relating theevolution of a breast cancer tumor and the patient’s habits and life style. References
[1] R. Agrawal, R. Srikant. Mining Sequential Patterns: Generalizations and Performance Improvements. Proc.
of the Fifth Int. Conference on Extending Database Technology, Avignon, France, March 1996.
[2] H. Albert-Lorincza and J-F. Boulicaut. Mining frequent sequential patterns under regular expressions:
a highly adaptive strategy for pushing constraints. In International Conference on Data MiningSDM’03, pages 316–320, 2003.
[3] James F. Allen and George Ferguson. Actions and Events in Interval Temporal Logic. Technical Report
[4] Minos N. Garofalakis, Rajeev Rastogi, and Kyuseok Shim. SPIRIT: Sequential Pattern Mining with Regular
Expression Constraints. In The VLDB Journal, pages 223–234, 1999.
[5] Frank H¨oppner. Discovery of Temporal Patterns: Learning Rules about the Qualitative Behaviour of Time
Series. In PKDD 2001, pages 192–203, 2001.
[6] Nico Jacobs and Hendrik Blockeel. From Shell Logs to Shell Scripts. In ILP’2001, volume 2157 of LectureNotes in Computer Science, pages 80–90. Springer, 2001.
[7] Sau Dan Lee and Luc De Raedt. Constraint Based Mining of First Order Sequences in Seqlog. In Workshopon Knowledge Discovery in Inductive Databases (KDID´02), 2002.
[8] Cyrille Masson and Franc¸ois Jacquenet. Mining Frequent Logical Sequences with Spirit-Log. In ILP’2002,
volume 2583 of Lecture Notes in Computer Science, pages 166–181. Springer, 2003.
SPECIAL REPORT-THE DRS SYSTEMOctober 1998 Low back pain can have many causes. It is exceedingly frequent, and is experienced at some time by up to 80% of the population. The differential diagnosis of low back pain is broad and includes systemic diseases (e.g. metastatic cancer), primary spine disease (e.g. disk herniation, degenerative arthritis), and regional diseases (e.g. aortic dissection)
J. Am. Chem. Soc. 2000, 122, 12898-12900 Validation of a Model for the Complex of HIV-1 Reverse Transcriptase with Sustiva through Computation of Resistance Profiles Robert C. Rizzo, De-Ping Wang, Julian Tirado-Rives, andWilliam L. Jorgensen* Department of Chemistry, Yale Uni V ersity New Ha V en, Connecticut 06520-8107 Re V ised Manuscript Recei V ed October 24, 2000 All retrovi