I have a very large data set, where matching has to be performed according to certain rules. The data set is millions of records, so I'm thinking a hash objects solution might work here, but I don't know how to do that.
I have daily transactions with trans=Type1 or Type2, each with an amount (variable name: TranValueAdj) and date associated with it. I want to match the Type1 record to the first Type2 that has the same amount and ID, such that the date of the Type1 transaction is less than the date of the Type2 transaction. Once a Type1 transaction is matched with a Type2 transaction, neither that Type1 or Type2 transaction can be matched again. When there are multiple transactions with the same ID and amount, this becomes tricky. The variable SourceType is not used in the matching, but needs to be in the output data set.
Below is an example of the actual data, and following that is the desired outcome.
Thanks!
data have;
infile datalines dsd truncover;
input id:$3. dateTran:DATE9. Trans:$5. SourceType:$2. TranValueADj:32.;
format dateTran DATE9.;
datalines4;
001,03JAN2020,Type1,M1,1776
001,03JAN2020,Type1,A2,1824.12
001,03JAN2020,Type1,M1,1000
001,17JAN2020,Type1,A2,1830.88
001,27JAN2020,Type1,M1,375.27
001,27JAN2020,Type1,M1,470.04
001,27JAN2020,Type1,M1,4712.3
001,31JAN2020,Type1,A2,1869.88
002,02JAN2020,Type1,B3,1848
002,03JAN2020,Type1,A2,3351
002,03JAN2020,Type1,A2,871
002,03JAN2020,Type1,A2,5135
002,03JAN2020,Type2,,1000
002,03JAN2020,Type1,A2,1070
002,03JAN2020,Type1,A2,8357
002,06JAN2020,Type1,B3,49425
002,06JAN2020,Type1,B3,61744.1
002,06JAN2020,Type1,B3,11273
002,07JAN2020,Type2,,2198
002,07JAN2020,Type1,B3,36006.95
002,08JAN2020,Type1,B3,29329.5
002,10JAN2020,Type1,B3,695
002,10JAN2020,Type2,,703
002,10JAN2020,Type1,B3,29912.3
002,13JAN2020,Type2,,250
002,13JAN2020,Type1,B3,24951
002,14JAN2020,Type2,,250
002,16JAN2020,Type1,B3,9759.5
002,17JAN2020,Type1,A2,652
002,17JAN2020,Type1,B3,4538
002,21JAN2020,Type1,B3,845
002,21JAN2020,Type1,B3,83436.27
002,21JAN2020,Type1,B3,75
002,21JAN2020,Type1,B3,8814
002,22JAN2020,Type2,,250
002,24JAN2020,Type1,B3,2270
002,24JAN2020,Type1,B3,2258
002,28JAN2020,Type1,B3,9135
003,03JAN2020,Type1,B3,100000
003,03JAN2020,Type1,A2,2061
003,07JAN2020,Type2,,100000
003,13JAN2020,Type1,A2,775.1
003,14JAN2020,Type1,B3,100000
003,16JAN2020,Type2,,100000
003,23JAN2020,Type1,B3,25000
003,24JAN2020,Type1,A2,121.16
003,27JAN2020,Type1,A2,146.67
003,31JAN2020,Type1,A2,1246.7
004,06JAN2020,Type1,A1,300
004,09JAN2020,Type2,,300
004,15JAN2020,Type1,A2,1204.5
004,31JAN2020,Type1,B1,0
004,31JAN2020,Type1,A2,3079.09
005,13JAN2020,Type1,B3,1000
005,21JAN2020,Type1,B3,1000
005,23JAN2020,Type2,,1000
005,28JAN2020,Type1,B3,1000
005,30JAN2020,Type2,,1000
006,02JAN2020,Type1,M1,250
006,03JAN2020,Type1,A2,23.72
006,03JAN2020,Type2,,200
006,06JAN2020,Type1,M1,400
006,09JAN2020,Type2,,400
006,16JAN2020,Type1,M1,150
006,21JAN2020,Type1,M1,190
006,22JAN2020,Type1,M1,200
006,23JAN2020,Type1,M1,200
006,24JAN2020,Type1,M1,184.75
006,27JAN2020,Type1,M1,120
006,28JAN2020,Type2,,200
006,29JAN2020,Type1,M1,550
007,02JAN2020,Type1,B1,0
007,13JAN2020,Type1,MF,180
007,13JAN2020,Type1,M1,175
007,15JAN2020,Type1,M1,240
007,16JAN2020,Type1,M1,440
007,16JAN2020,Type2,,175
007,17JAN2020,Type1,M1,345
007,21JAN2020,Type2,,240
007,22JAN2020,Type2,,440
007,23JAN2020,Type2,,345
;;;;
data want;
infile datalines dsd truncover;
input id:$4. dateTran:DATE9. Trans:$5. SourceType:$2. TranValueADj:32. MatchedDate:DATE9. MatchedAmt:$6.;
format dateTran DATE9. MatchedDate DATE9.;
label id="id" dateTran="dateTran" Trans="Trans" SourceType="SourceType" TranValueADj="TranValueADj" MatchedDate="MatchedDate" MatchedAmt="MatchedAmt";
datalines4;
A003,03JAN2020,Type1,B3,100000,07JAN2020,100000
A003,03JAN2020,Type1,A2,2061,,.
A003,13JAN2020,Type1,A2,775.1,,.
A003,14JAN2020,Type1,B3,100000,16JAN2020,100000
A003,23JAN2020,Type1,B3,25000,,.
A003,24JAN2020,Type1,A2,121.16,,.
A003,27JAN2020,Type1,A2,146.67,,.
A003,31JAN2020,Type1,A2,1246.7,,.
A001,03JAN2020,Type1,M1,1776,,.
A001,03JAN2020,Type1,A2,1824.12,,.
A001,03JAN2020,Type1,M1,1000,,.
A001,17JAN2020,Type1,A2,1830.88,,.
A001,27JAN2020,Type1,M1,375.27,,.
A001,27JAN2020,Type1,M1,470.04,,.
A001,27JAN2020,Type1,M1,4712.3,,.
A001,31JAN2020,Type1,A2,1869.88,,.
A002,02JAN2020,Type1,B3,1848,,.
A002,03JAN2020,Type1,A2,3351,,.
A002,03JAN2020,Type1,A2,871,,.
A002,03JAN2020,Type1,A2,5135,,.
A002,03JAN2020,Type2,,1000,,.
A002,03JAN2020,Type1,A2,1070,,.
A002,03JAN2020,Type1,A2,8357,,.
A002,06JAN2020,Type1,B3,49425,,.
A002,06JAN2020,Type1,B3,61744.1,,.
A002,06JAN2020,Type1,B3,11273,,.
A002,07JAN2020,Type2,,2198,,.
A002,07JAN2020,Type1,B3,36006.95,,.
A002,08JAN2020,Type1,B3,29329.5,,.
A002,10JAN2020,Type1,B3,695,,.
A002,10JAN2020,Type2,,703,,.
A002,10JAN2020,Type1,B3,29912.3,,.
A002,13JAN2020,Type2,,250,,.
A002,13JAN2020,Type1,B3,24951,,.
A002,14JAN2020,Type2,,250,,.
A002,16JAN2020,Type1,B3,9759.5,,.
A002,17JAN2020,Type1,A2,652,,.
A002,17JAN2020,Type1,B3,4538,,.
A002,21JAN2020,Type1,B3,845,,.
A002,21JAN2020,Type1,B3,83436.27,,.
A002,21JAN2020,Type1,B3,75,,.
A002,21JAN2020,Type1,B3,8814,,.
A002,22JAN2020,Type2,,250,,.
A002,24JAN2020,Type1,B3,2270,,.
A002,24JAN2020,Type1,B3,2258,,.
A002,28JAN2020,Type1,B3,9135,,.
A004,06JAN2020,Type1,A1,300,09JAN2020,300
A004,15JAN2020,Type1,A2,1204.5,,.
A004,31JAN2020,Type1,B1,0,,.
A004,31JAN2020,Type1,A2,3079.09,,.
A005,13JAN2020,Type1,B3,1000,23JAN2020,1000
A005,21JAN2020,Type1,B3,1000,,.
A005,28JAN2020,Type1,B3,1000,30JAN2020,1000
A006,02JAN2020,Type1,M1,250,,.
A006,03JAN2020,Type1,A2,23.72,,.
A006,03JAN2020,Type2,,200,,.
A006,06JAN2020,Type1,M1,400,09JAN2020,400
A006,16JAN2020,Type1,M1,150,,.
A006,21JAN2020,Type1,M1,190,,.
A006,22JAN2020,Type1,M1,200,28JAN2020,200
A006,23JAN2020,Type1,M1,200,,.
A006,24JAN2020,Type1,M1,184.75,,.
A006,27JAN2020,Type1,M1,120,,.
A006,29JAN2020,Type1,M1,550,,.
A007,02JAN2020,Type1,B1,0,,.
A007,13JAN2020,Type1,MF,180,,.
A007,13JAN2020,Type1,M1,175,16JAN2020,175
A007,15JAN2020,Type1,M1,240,21JAN2020,240
A007,16JAN2020,Type1,M1,440,22JAN2020,440
A007,17JAN2020,Type1,M1,345,23JAN2020,345
;;;;
For ID A005, you have 5 values of 1,000
Type 1 Jan 13 is matched to Type2 on Jan 23
Type 1 Jan 21 is not matched
Type1 of Jan 28 is matched to Type2 on Jan 30
Why is the Type1 of Jan 21 not matched to the earliest remaining Type2 - i.e. Jan 30? Which would leave the Jan 28 Type 1 unmatched. That's what my program below does.
Should we assume that once a given type 2 is matched, no earlier unmatched type 1 is eligible for some later type 2?
data have ;
infile datalines dsd truncover;
length id $4;
input id:$3. dateTran:DATE9. Trans:$5. SourceType:$2. TranValueADj:32.;
id=cats('A',id);
format dateTran DATE9.;
datalines4;
001,03JAN2020,Type1,M1,1776
001,03JAN2020,Type1,A2,1824.12
001,03JAN2020,Type1,M1,1000
001,17JAN2020,Type1,A2,1830.88
001,27JAN2020,Type1,M1,375.27
001,27JAN2020,Type1,M1,470.04
001,27JAN2020,Type1,M1,4712.3
001,31JAN2020,Type1,A2,1869.88
002,02JAN2020,Type1,B3,1848
002,03JAN2020,Type1,A2,3351
002,03JAN2020,Type1,A2,871
002,03JAN2020,Type1,A2,5135
002,03JAN2020,Type2,,1000
002,03JAN2020,Type1,A2,1070
002,03JAN2020,Type1,A2,8357
002,06JAN2020,Type1,B3,49425
002,06JAN2020,Type1,B3,61744.1
002,06JAN2020,Type1,B3,11273
002,07JAN2020,Type2,,2198
002,07JAN2020,Type1,B3,36006.95
002,08JAN2020,Type1,B3,29329.5
002,10JAN2020,Type1,B3,695
002,10JAN2020,Type2,,703
002,10JAN2020,Type1,B3,29912.3
002,13JAN2020,Type2,,250
002,13JAN2020,Type1,B3,24951
002,14JAN2020,Type2,,250
002,16JAN2020,Type1,B3,9759.5
002,17JAN2020,Type1,A2,652
002,17JAN2020,Type1,B3,4538
002,21JAN2020,Type1,B3,845
002,21JAN2020,Type1,B3,83436.27
002,21JAN2020,Type1,B3,75
002,21JAN2020,Type1,B3,8814
002,22JAN2020,Type2,,250
002,24JAN2020,Type1,B3,2270
002,24JAN2020,Type1,B3,2258
002,28JAN2020,Type1,B3,9135
003,03JAN2020,Type1,B3,100000
003,03JAN2020,Type1,A2,2061
003,07JAN2020,Type2,,100000
003,13JAN2020,Type1,A2,775.1
003,14JAN2020,Type1,B3,100000
003,16JAN2020,Type2,,100000
003,23JAN2020,Type1,B3,25000
003,24JAN2020,Type1,A2,121.16
003,27JAN2020,Type1,A2,146.67
003,31JAN2020,Type1,A2,1246.7
004,06JAN2020,Type1,A1,300
004,09JAN2020,Type2,,300
004,15JAN2020,Type1,A2,1204.5
004,31JAN2020,Type1,B1,0
004,31JAN2020,Type1,A2,3079.09
005,13JAN2020,Type1,B3,1000
005,21JAN2020,Type1,B3,1000
005,23JAN2020,Type2,,1000
005,28JAN2020,Type1,B3,1000
005,30JAN2020,Type2,,1000
006,02JAN2020,Type1,M1,250
006,03JAN2020,Type1,A2,23.72
006,03JAN2020,Type2,,200
006,06JAN2020,Type1,M1,400
006,09JAN2020,Type2,,400
006,16JAN2020,Type1,M1,150
006,21JAN2020,Type1,M1,190
006,22JAN2020,Type1,M1,200
006,23JAN2020,Type1,M1,200
006,24JAN2020,Type1,M1,184.75
006,27JAN2020,Type1,M1,120
006,28JAN2020,Type2,,200
006,29JAN2020,Type1,M1,550
007,02JAN2020,Type1,B1,0
007,13JAN2020,Type1,MF,180
007,13JAN2020,Type1,M1,175
007,15JAN2020,Type1,M1,240
007,16JAN2020,Type1,M1,440
007,16JAN2020,Type2,,175
007,17JAN2020,Type1,M1,345
007,21JAN2020,Type2,,240
007,22JAN2020,Type2,,440
007,23JAN2020,Type2,,345
;;;;
data want (drop=_:);
if 0 then set have;
if _n_=1 then do;
if 0 then set have (rename=(datetran=matcheddate TranValueADJ=matchedamt));
declare hash t2 (multidata:'Y');
t2.definekey('matchedamt');
t2.definedata('matchedamt','matcheddate');
t2.definedone();
end;
do until (last.id); /* Populate hash object of type 2 trans */
set have;
by id;
if trans='Type2' then t2.add(key:TranValueADJ,data:TranValueADJ,data:datetran);
end;
do until (last.id);
set have;
by id;
call missing(of matched:, _rc);
if trans='Type1' then do; /* Look for a matching Type2 */
_rc=t2.find(key:TranValueADJ);
if _rc=0 then t2.removedup(key:TranValueADJ); /*If found, remove from hash*/
output;
end;
else if trans='Type2' then do; /*See if this type2 not removed yet */
_rc=t2.find(key:TranValueADJ);
if _rc=0 and matcheddate=datetran then do; /*If found, output and remove from hash*/
t2.removedup();
call missing(of matched:);
output;
end;
end;
else output; /* Newly added statement, to accommodate non type1/type2 obs*/
end;
run;
For each id, this program populates a hash object of the type2 transactions. That's the first do until (last.id). In the second do until, each incoming type1 is checked against for a qualifying match in the hash object. If a match is found, that transaction is removed from the hash object. For each incoming type2, if it is not exactly matched in the hash object, if must have already matched a preceding type1, and therefore won't be output. But if it is still there, it is output, and it is also removed from the hash. At the end of each id, the hash object will therefore be empty.
Just read @PaigeMiller 's comment about non type1/type2 records not being transcribed to WANT. I've added an "else output;" statement, which I think should address that oversight.
Another note: When the hash object T2 is indexed by the MatchedAmt and there are ties (allowed by the MULTIDATA:Y argument), the tied dataitems will be stored in the order they are added to the hash object. Since the data are sorted by date, that means the tied values are stored in chronological order. In turn, since the FIND method behavior is to retrieve the first available dataitem for a given key value, it will always retrieve the earliest date for that value (see Page 65 of Dorfman and Henderson's "Data Management Solutions Using SAS Hash Table Operations: A Business Case Study")
Good morning Sir @PaigeMiller Thank you for the question. Just a couple of clarifications please, though may sound silly/lame as I am perhaps not reading between the lines-
1. WANT IDs have prefix letters unlike HAVE ID's. Any reason why that we should pay attention to?
2. The order of the WANT dataset by ID is not the same as HAVE. Is there a specific order you are after?
@novinosrin wrote:
Good morning Sir @PaigeMiller Thank you for the question. Just a couple of clarifications please, though may sound silly/lame as I am perhaps not reading between the lines-
1. WANT IDs have prefix letters unlike HAVE ID's. Any reason why that we should pay attention to?
2. The order of the WANT dataset by ID is not the same as HAVE. Is there a specific order you are after?
1. Ignore the prefix letters
2. WANT order not important as long as the matching is correct
@PaigeMiller can a type2 transaction occur with the same amount on the same day?
And if such two occur, should only one of them be prohibited from being matched again?
@PeterClemmensen wrote:
@PaigeMiller can a type2 transaction occur with the same amount on the same day?
And if such two occur, should only one of them be prohibited from being matched again?
Yes to both.
For ID A005, you have 5 values of 1,000
Type 1 Jan 13 is matched to Type2 on Jan 23
Type 1 Jan 21 is not matched
Type1 of Jan 28 is matched to Type2 on Jan 30
Why is the Type1 of Jan 21 not matched to the earliest remaining Type2 - i.e. Jan 30? Which would leave the Jan 28 Type 1 unmatched. That's what my program below does.
Should we assume that once a given type 2 is matched, no earlier unmatched type 1 is eligible for some later type 2?
data have ;
infile datalines dsd truncover;
length id $4;
input id:$3. dateTran:DATE9. Trans:$5. SourceType:$2. TranValueADj:32.;
id=cats('A',id);
format dateTran DATE9.;
datalines4;
001,03JAN2020,Type1,M1,1776
001,03JAN2020,Type1,A2,1824.12
001,03JAN2020,Type1,M1,1000
001,17JAN2020,Type1,A2,1830.88
001,27JAN2020,Type1,M1,375.27
001,27JAN2020,Type1,M1,470.04
001,27JAN2020,Type1,M1,4712.3
001,31JAN2020,Type1,A2,1869.88
002,02JAN2020,Type1,B3,1848
002,03JAN2020,Type1,A2,3351
002,03JAN2020,Type1,A2,871
002,03JAN2020,Type1,A2,5135
002,03JAN2020,Type2,,1000
002,03JAN2020,Type1,A2,1070
002,03JAN2020,Type1,A2,8357
002,06JAN2020,Type1,B3,49425
002,06JAN2020,Type1,B3,61744.1
002,06JAN2020,Type1,B3,11273
002,07JAN2020,Type2,,2198
002,07JAN2020,Type1,B3,36006.95
002,08JAN2020,Type1,B3,29329.5
002,10JAN2020,Type1,B3,695
002,10JAN2020,Type2,,703
002,10JAN2020,Type1,B3,29912.3
002,13JAN2020,Type2,,250
002,13JAN2020,Type1,B3,24951
002,14JAN2020,Type2,,250
002,16JAN2020,Type1,B3,9759.5
002,17JAN2020,Type1,A2,652
002,17JAN2020,Type1,B3,4538
002,21JAN2020,Type1,B3,845
002,21JAN2020,Type1,B3,83436.27
002,21JAN2020,Type1,B3,75
002,21JAN2020,Type1,B3,8814
002,22JAN2020,Type2,,250
002,24JAN2020,Type1,B3,2270
002,24JAN2020,Type1,B3,2258
002,28JAN2020,Type1,B3,9135
003,03JAN2020,Type1,B3,100000
003,03JAN2020,Type1,A2,2061
003,07JAN2020,Type2,,100000
003,13JAN2020,Type1,A2,775.1
003,14JAN2020,Type1,B3,100000
003,16JAN2020,Type2,,100000
003,23JAN2020,Type1,B3,25000
003,24JAN2020,Type1,A2,121.16
003,27JAN2020,Type1,A2,146.67
003,31JAN2020,Type1,A2,1246.7
004,06JAN2020,Type1,A1,300
004,09JAN2020,Type2,,300
004,15JAN2020,Type1,A2,1204.5
004,31JAN2020,Type1,B1,0
004,31JAN2020,Type1,A2,3079.09
005,13JAN2020,Type1,B3,1000
005,21JAN2020,Type1,B3,1000
005,23JAN2020,Type2,,1000
005,28JAN2020,Type1,B3,1000
005,30JAN2020,Type2,,1000
006,02JAN2020,Type1,M1,250
006,03JAN2020,Type1,A2,23.72
006,03JAN2020,Type2,,200
006,06JAN2020,Type1,M1,400
006,09JAN2020,Type2,,400
006,16JAN2020,Type1,M1,150
006,21JAN2020,Type1,M1,190
006,22JAN2020,Type1,M1,200
006,23JAN2020,Type1,M1,200
006,24JAN2020,Type1,M1,184.75
006,27JAN2020,Type1,M1,120
006,28JAN2020,Type2,,200
006,29JAN2020,Type1,M1,550
007,02JAN2020,Type1,B1,0
007,13JAN2020,Type1,MF,180
007,13JAN2020,Type1,M1,175
007,15JAN2020,Type1,M1,240
007,16JAN2020,Type1,M1,440
007,16JAN2020,Type2,,175
007,17JAN2020,Type1,M1,345
007,21JAN2020,Type2,,240
007,22JAN2020,Type2,,440
007,23JAN2020,Type2,,345
;;;;
data want (drop=_:);
if 0 then set have;
if _n_=1 then do;
if 0 then set have (rename=(datetran=matcheddate TranValueADJ=matchedamt));
declare hash t2 (multidata:'Y');
t2.definekey('matchedamt');
t2.definedata('matchedamt','matcheddate');
t2.definedone();
end;
do until (last.id); /* Populate hash object of type 2 trans */
set have;
by id;
if trans='Type2' then t2.add(key:TranValueADJ,data:TranValueADJ,data:datetran);
end;
do until (last.id);
set have;
by id;
call missing(of matched:, _rc);
if trans='Type1' then do; /* Look for a matching Type2 */
_rc=t2.find(key:TranValueADJ);
if _rc=0 then t2.removedup(key:TranValueADJ); /*If found, remove from hash*/
output;
end;
else if trans='Type2' then do; /*See if this type2 not removed yet */
_rc=t2.find(key:TranValueADJ);
if _rc=0 and matcheddate=datetran then do; /*If found, output and remove from hash*/
t2.removedup();
call missing(of matched:);
output;
end;
end;
else output; /* Newly added statement, to accommodate non type1/type2 obs*/
end;
run;
For each id, this program populates a hash object of the type2 transactions. That's the first do until (last.id). In the second do until, each incoming type1 is checked against for a qualifying match in the hash object. If a match is found, that transaction is removed from the hash object. For each incoming type2, if it is not exactly matched in the hash object, if must have already matched a preceding type1, and therefore won't be output. But if it is still there, it is output, and it is also removed from the hash. At the end of each id, the hash object will therefore be empty.
Just read @PaigeMiller 's comment about non type1/type2 records not being transcribed to WANT. I've added an "else output;" statement, which I think should address that oversight.
Another note: When the hash object T2 is indexed by the MatchedAmt and there are ties (allowed by the MULTIDATA:Y argument), the tied dataitems will be stored in the order they are added to the hash object. Since the data are sorted by date, that means the tied values are stored in chronological order. In turn, since the FIND method behavior is to retrieve the first available dataitem for a given key value, it will always retrieve the earliest date for that value (see Page 65 of Dorfman and Henderson's "Data Management Solutions Using SAS Hash Table Operations: A Business Case Study")
Hello, everyone, you are correct about the mistake. The JAN30 Type 2 transaction should be match to the JAN21 Type 1 transaction, and there is no match for the JAN28 Type 1 transaction.
To build on the problem. If you only wanted to match the Type 1 and 2 by id if they were <=15 days apart (Type 2 still occurring after 1). How would you modify this program? here are 2 more id's with multiple instances of type 1 with same amount and only 1 type 2. I tried adding another loop with a t2.next() but it wasn't dropping the first find.
have:
id dateTran Trans SourceType TranValueADj
A009 3-Jan-20 Type1 A2 520.84
A009 15-Jan-20 Type1 A2 2008.8
A009 17-Jan-20 Type1 A2 520.84
A009 28-Jan-20 Type1 B2 1041.68
A009 31-Jan-20 Type2 520.84
A009 31-Jan-20 Type1 A2 520.84
A011 2-Jan-20 Type1 A2 1023.3
A011 14-Jan-20 Type1 M1 1023.3
A011 15-Jan-20 Type1 A2 1703.1
A011 17-Jan-20 Type2 1023.3
want -added the number of days based on excel function, i realize sas may be slightly different:
id dateTran Trans SourceType TranValueADj MatchedDate MatchedAmt daysafter
A009 3-Jan-20 Type1 A2 520.84
A009 15-Jan-20 Type1 A2 2008.8
A009 17-Jan-20 Type1 A2 520.84 31-Jan-20 520.84 14.00
A009 28-Jan-20 Type1 B2 1041.68
A011 31-Jan-20 Type1 A2 520.84
A011 2-Jan-20 Type1 A2 1023.3 17-Jan-20 1023.3 15.00
A011 14-Jan-20 Type1 M1 1023.3
A011 15-Jan-20 Type1 A2 1703.1
You're saying that a type2 record can sometimes be too far in the future to match a type 1 record, i.e. you don't want to match "stale" type 1 records.
To do that, only minimal changes would be needed.
That's it - change
if trans='Type1' then do; /* Look for a matching Type2 */
_rc=t2.find(key:TranValueADJ);
if _rc=0 then t2.removedup(key:TranValueADJ); /*If found, remove from hash*/
output;
end;
to
if trans='Type1' then do; /* Look for a matching Type2 */
_rc=t2.find(key:TranValueADJ);
if _rc=0 then do; /*Found matched amount*/
if datetran+15<matcheddate then call missing(of matched:); /*Too far future*/
else t2.removedup(key:TranValueADJ); /*Remove if matched and close enough*/
end;
output;
end;
@mkeintz Thank you i knew it was something in that loop.. I will give this a try.
I think this does the job ?
data have;
infile datalines dsd truncover;
input id:$3. dateTran:DATE9. Trans:$5. SourceType:$2. TranValueADj:32.;
format dateTran DATE9.;
datalines4;
001,03JAN2020,Type1,M1,1776
001,03JAN2020,Type1,A2,1824.12
001,03JAN2020,Type1,M1,1000
001,17JAN2020,Type1,A2,1830.88
001,27JAN2020,Type1,M1,375.27
001,27JAN2020,Type1,M1,470.04
001,27JAN2020,Type1,M1,4712.3
001,31JAN2020,Type1,A2,1869.88
002,02JAN2020,Type1,B3,1848
002,03JAN2020,Type1,A2,3351
002,03JAN2020,Type1,A2,871
002,03JAN2020,Type1,A2,5135
002,03JAN2020,Type2,,1000
002,03JAN2020,Type1,A2,1070
002,03JAN2020,Type1,A2,8357
002,06JAN2020,Type1,B3,49425
002,06JAN2020,Type1,B3,61744.1
002,06JAN2020,Type1,B3,11273
002,07JAN2020,Type2,,2198
002,07JAN2020,Type1,B3,36006.95
002,08JAN2020,Type1,B3,29329.5
002,10JAN2020,Type1,B3,695
002,10JAN2020,Type2,,703
002,10JAN2020,Type1,B3,29912.3
002,13JAN2020,Type2,,250
002,13JAN2020,Type1,B3,24951
002,14JAN2020,Type2,,250
002,16JAN2020,Type1,B3,9759.5
002,17JAN2020,Type1,A2,652
002,17JAN2020,Type1,B3,4538
002,21JAN2020,Type1,B3,845
002,21JAN2020,Type1,B3,83436.27
002,21JAN2020,Type1,B3,75
002,21JAN2020,Type1,B3,8814
002,22JAN2020,Type2,,250
002,24JAN2020,Type1,B3,2270
002,24JAN2020,Type1,B3,2258
002,28JAN2020,Type1,B3,9135
003,03JAN2020,Type1,B3,100000
003,03JAN2020,Type1,A2,2061
003,07JAN2020,Type2,,100000
003,13JAN2020,Type1,A2,775.1
003,14JAN2020,Type1,B3,100000
003,16JAN2020,Type2,,100000
003,23JAN2020,Type1,B3,25000
003,24JAN2020,Type1,A2,121.16
003,27JAN2020,Type1,A2,146.67
003,31JAN2020,Type1,A2,1246.7
004,06JAN2020,Type1,A1,300
004,09JAN2020,Type2,,300
004,15JAN2020,Type1,A2,1204.5
004,31JAN2020,Type1,B1,0
004,31JAN2020,Type1,A2,3079.09
005,13JAN2020,Type1,B3,1000
005,21JAN2020,Type1,B3,1000
005,23JAN2020,Type2,,1000
005,28JAN2020,Type1,B3,1000
005,30JAN2020,Type2,,1000
006,02JAN2020,Type1,M1,250
006,03JAN2020,Type1,A2,23.72
006,03JAN2020,Type2,,200
006,06JAN2020,Type1,M1,400
006,09JAN2020,Type2,,400
006,16JAN2020,Type1,M1,150
006,21JAN2020,Type1,M1,190
006,22JAN2020,Type1,M1,200
006,23JAN2020,Type1,M1,200
006,24JAN2020,Type1,M1,184.75
006,27JAN2020,Type1,M1,120
006,28JAN2020,Type2,,200
006,29JAN2020,Type1,M1,550
007,02JAN2020,Type1,B1,0
007,13JAN2020,Type1,MF,180
007,13JAN2020,Type1,M1,175
007,15JAN2020,Type1,M1,240
007,16JAN2020,Type1,M1,440
007,16JAN2020,Type2,,175
007,17JAN2020,Type1,M1,345
007,21JAN2020,Type2,,240
007,22JAN2020,Type2,,440
007,23JAN2020,Type2,,345
;;;;
data temp / view=temp;
set have;
n = _N_;
run;
data want (drop=n);
if _N_ = 1 then do;
dcl hash h(dataset : "temp(where = (Trans='Type2')
rename = (dateTran=MatchedDate TranValueADj=MatchedAmt))",
multidata : "Y");
h.definekey("id", "MatchedAmt");
h.definedata("MatchedDate", "MatchedAmt", "n");
h.definedone();
dcl hash u();
u.definekey("n");
u.definedone();
end;
set temp;
if u.check() = 0 then delete;
if Trans = "Type1" then do;
do while (h.do_over(key : id, key : TranValueADj) = 0);
if dateTran < MatchedDate then do;
u.ref();
h.removedup();
leave;
end;
else call missing(MatchedDate, MatchedAmt, n);
end;
end;
format MatchedDate date9.;
run;
Thanks, @PeterClemmensen . Looks good on the sample data. We're going to try it on millions of records, keep your fingers crossed!
Fingers crossed 🙂
The @PeterClemmensen solution run of over 50,000,000 records seems correct so far, but more checking is warranted.
The @novinosrin solution is now running.
data have;
infile datalines dsd truncover;
input id:$3. dateTran:DATE9. Trans:$5. SourceType:$2. TranValueADj:32.;
format dateTran DATE9.;
datalines4;
001,03JAN2020,Type1,M1,1776
001,03JAN2020,Type1,A2,1824.12
001,03JAN2020,Type1,M1,1000
001,17JAN2020,Type1,A2,1830.88
001,27JAN2020,Type1,M1,375.27
001,27JAN2020,Type1,M1,470.04
001,27JAN2020,Type1,M1,4712.3
001,31JAN2020,Type1,A2,1869.88
002,02JAN2020,Type1,B3,1848
002,03JAN2020,Type1,A2,3351
002,03JAN2020,Type1,A2,871
002,03JAN2020,Type1,A2,5135
002,03JAN2020,Type2,,1000
002,03JAN2020,Type1,A2,1070
002,03JAN2020,Type1,A2,8357
002,06JAN2020,Type1,B3,49425
002,06JAN2020,Type1,B3,61744.1
002,06JAN2020,Type1,B3,11273
002,07JAN2020,Type2,,2198
002,07JAN2020,Type1,B3,36006.95
002,08JAN2020,Type1,B3,29329.5
002,10JAN2020,Type1,B3,695
002,10JAN2020,Type2,,703
002,10JAN2020,Type1,B3,29912.3
002,13JAN2020,Type2,,250
002,13JAN2020,Type1,B3,24951
002,14JAN2020,Type2,,250
002,16JAN2020,Type1,B3,9759.5
002,17JAN2020,Type1,A2,652
002,17JAN2020,Type1,B3,4538
002,21JAN2020,Type1,B3,845
002,21JAN2020,Type1,B3,83436.27
002,21JAN2020,Type1,B3,75
002,21JAN2020,Type1,B3,8814
002,22JAN2020,Type2,,250
002,24JAN2020,Type1,B3,2270
002,24JAN2020,Type1,B3,2258
002,28JAN2020,Type1,B3,9135
003,03JAN2020,Type1,B3,100000
003,03JAN2020,Type1,A2,2061
003,07JAN2020,Type2,,100000
003,13JAN2020,Type1,A2,775.1
003,14JAN2020,Type1,B3,100000
003,16JAN2020,Type2,,100000
003,23JAN2020,Type1,B3,25000
003,24JAN2020,Type1,A2,121.16
003,27JAN2020,Type1,A2,146.67
003,31JAN2020,Type1,A2,1246.7
004,06JAN2020,Type1,A1,300
004,09JAN2020,Type2,,300
004,15JAN2020,Type1,A2,1204.5
004,31JAN2020,Type1,B1,0
004,31JAN2020,Type1,A2,3079.09
005,13JAN2020,Type1,B3,1000
005,21JAN2020,Type1,B3,1000
005,23JAN2020,Type2,,1000
005,28JAN2020,Type1,B3,1000
005,30JAN2020,Type2,,1000
006,02JAN2020,Type1,M1,250
006,03JAN2020,Type1,A2,23.72
006,03JAN2020,Type2,,200
006,06JAN2020,Type1,M1,400
006,09JAN2020,Type2,,400
006,16JAN2020,Type1,M1,150
006,21JAN2020,Type1,M1,190
006,22JAN2020,Type1,M1,200
006,23JAN2020,Type1,M1,200
006,24JAN2020,Type1,M1,184.75
006,27JAN2020,Type1,M1,120
006,28JAN2020,Type2,,200
006,29JAN2020,Type1,M1,550
007,02JAN2020,Type1,B1,0
007,13JAN2020,Type1,MF,180
007,13JAN2020,Type1,M1,175
007,15JAN2020,Type1,M1,240
007,16JAN2020,Type1,M1,440
007,16JAN2020,Type2,,175
007,17JAN2020,Type1,M1,345
007,21JAN2020,Type2,,240
007,22JAN2020,Type2,,440
007,23JAN2020,Type2,,345
;;;;
data want;
if _n_=1 then do;
dcl hash h(multidata:'y');
h.definekey('TranValueADj');
h.definedata('_dateTran');
h.definedone();
dcl hash h1(multidata:'y',hashexp:20);
h1.definekey('_dateTran','Matchedamt');
h1.definedata('MatchedDate','Matchedamt');
h1.definedone();
end;
array t(-138061:2936547) _temporary_;
call missing(of t(*));
do _n_=1 by 1 until(last.id);
set have;
by id;
if trans='Type1' then do;
_dateTran=dateTran;
h.add();
end;
else if h.find()=0 then do;
MatchedDate=dateTran;
MatchedAmt=TranValueADj;
t(MatchedDate)=MatchedDate;
h1.add();
_rc=h.removedup();
end;
end;
do _n_=1 to _n_;
set have;
call missing(MatchedDate,MatchedAmt);
if t(dateTran) and trans='Type2' then continue;
if h1.num_items then do;
_iorc_=h1.find(key:datetran,key:TranValueADj);
output;
end;
else output;
end;
h.clear();
h1.clear();
format MatchedDate date9.;
drop _:;
run;
proc print noobs;run;
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.