BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
PaigeMiller
Diamond | Level 26

I have a very large data set, where matching has to be performed according to certain rules. The data set is millions of records, so I'm thinking a hash objects solution might work here, but I don't know how to do that.

 

I have daily transactions with trans=Type1 or Type2, each with an amount (variable name: TranValueAdj) and date associated with it. I want to match the Type1 record to the first Type2 that has the same amount and ID, such that the date of the Type1 transaction is less than the date of the Type2 transaction. Once a Type1 transaction is matched with a Type2 transaction, neither that Type1 or Type2 transaction can be matched again. When there are multiple transactions with the same ID and amount, this becomes tricky. The variable SourceType is not used in the matching, but needs to be in the output data set.

 

Below is an example of the actual data, and following that is the desired outcome.

 

Thanks!

 

data have;
  infile datalines dsd truncover;
  input id:$3. dateTran:DATE9. Trans:$5. SourceType:$2. TranValueADj:32.;
  format dateTran DATE9.;
datalines4;
001,03JAN2020,Type1,M1,1776
001,03JAN2020,Type1,A2,1824.12
001,03JAN2020,Type1,M1,1000
001,17JAN2020,Type1,A2,1830.88
001,27JAN2020,Type1,M1,375.27
001,27JAN2020,Type1,M1,470.04
001,27JAN2020,Type1,M1,4712.3
001,31JAN2020,Type1,A2,1869.88
002,02JAN2020,Type1,B3,1848
002,03JAN2020,Type1,A2,3351
002,03JAN2020,Type1,A2,871
002,03JAN2020,Type1,A2,5135
002,03JAN2020,Type2,,1000
002,03JAN2020,Type1,A2,1070
002,03JAN2020,Type1,A2,8357
002,06JAN2020,Type1,B3,49425
002,06JAN2020,Type1,B3,61744.1
002,06JAN2020,Type1,B3,11273
002,07JAN2020,Type2,,2198
002,07JAN2020,Type1,B3,36006.95
002,08JAN2020,Type1,B3,29329.5
002,10JAN2020,Type1,B3,695
002,10JAN2020,Type2,,703
002,10JAN2020,Type1,B3,29912.3
002,13JAN2020,Type2,,250
002,13JAN2020,Type1,B3,24951
002,14JAN2020,Type2,,250
002,16JAN2020,Type1,B3,9759.5
002,17JAN2020,Type1,A2,652
002,17JAN2020,Type1,B3,4538
002,21JAN2020,Type1,B3,845
002,21JAN2020,Type1,B3,83436.27
002,21JAN2020,Type1,B3,75
002,21JAN2020,Type1,B3,8814
002,22JAN2020,Type2,,250
002,24JAN2020,Type1,B3,2270
002,24JAN2020,Type1,B3,2258
002,28JAN2020,Type1,B3,9135
003,03JAN2020,Type1,B3,100000
003,03JAN2020,Type1,A2,2061
003,07JAN2020,Type2,,100000
003,13JAN2020,Type1,A2,775.1
003,14JAN2020,Type1,B3,100000
003,16JAN2020,Type2,,100000
003,23JAN2020,Type1,B3,25000
003,24JAN2020,Type1,A2,121.16
003,27JAN2020,Type1,A2,146.67
003,31JAN2020,Type1,A2,1246.7
004,06JAN2020,Type1,A1,300
004,09JAN2020,Type2,,300
004,15JAN2020,Type1,A2,1204.5
004,31JAN2020,Type1,B1,0
004,31JAN2020,Type1,A2,3079.09
005,13JAN2020,Type1,B3,1000
005,21JAN2020,Type1,B3,1000
005,23JAN2020,Type2,,1000
005,28JAN2020,Type1,B3,1000
005,30JAN2020,Type2,,1000
006,02JAN2020,Type1,M1,250
006,03JAN2020,Type1,A2,23.72
006,03JAN2020,Type2,,200
006,06JAN2020,Type1,M1,400
006,09JAN2020,Type2,,400
006,16JAN2020,Type1,M1,150
006,21JAN2020,Type1,M1,190
006,22JAN2020,Type1,M1,200
006,23JAN2020,Type1,M1,200
006,24JAN2020,Type1,M1,184.75
006,27JAN2020,Type1,M1,120
006,28JAN2020,Type2,,200
006,29JAN2020,Type1,M1,550
007,02JAN2020,Type1,B1,0
007,13JAN2020,Type1,MF,180
007,13JAN2020,Type1,M1,175
007,15JAN2020,Type1,M1,240
007,16JAN2020,Type1,M1,440
007,16JAN2020,Type2,,175
007,17JAN2020,Type1,M1,345
007,21JAN2020,Type2,,240
007,22JAN2020,Type2,,440
007,23JAN2020,Type2,,345
;;;;
data want;
  infile datalines dsd truncover;
  input id:$4. dateTran:DATE9. Trans:$5. SourceType:$2. TranValueADj:32. MatchedDate:DATE9. MatchedAmt:$6.;
  format dateTran DATE9. MatchedDate DATE9.;
  label id="id" dateTran="dateTran" Trans="Trans" SourceType="SourceType" TranValueADj="TranValueADj" MatchedDate="MatchedDate" MatchedAmt="MatchedAmt";
datalines4;
A003,03JAN2020,Type1,B3,100000,07JAN2020,100000
A003,03JAN2020,Type1,A2,2061,,.
A003,13JAN2020,Type1,A2,775.1,,.
A003,14JAN2020,Type1,B3,100000,16JAN2020,100000
A003,23JAN2020,Type1,B3,25000,,.
A003,24JAN2020,Type1,A2,121.16,,.
A003,27JAN2020,Type1,A2,146.67,,.
A003,31JAN2020,Type1,A2,1246.7,,.
A001,03JAN2020,Type1,M1,1776,,.
A001,03JAN2020,Type1,A2,1824.12,,.
A001,03JAN2020,Type1,M1,1000,,.
A001,17JAN2020,Type1,A2,1830.88,,.
A001,27JAN2020,Type1,M1,375.27,,.
A001,27JAN2020,Type1,M1,470.04,,.
A001,27JAN2020,Type1,M1,4712.3,,.
A001,31JAN2020,Type1,A2,1869.88,,.
A002,02JAN2020,Type1,B3,1848,,.
A002,03JAN2020,Type1,A2,3351,,.
A002,03JAN2020,Type1,A2,871,,.
A002,03JAN2020,Type1,A2,5135,,.
A002,03JAN2020,Type2,,1000,,.
A002,03JAN2020,Type1,A2,1070,,.
A002,03JAN2020,Type1,A2,8357,,.
A002,06JAN2020,Type1,B3,49425,,.
A002,06JAN2020,Type1,B3,61744.1,,.
A002,06JAN2020,Type1,B3,11273,,.
A002,07JAN2020,Type2,,2198,,.
A002,07JAN2020,Type1,B3,36006.95,,.
A002,08JAN2020,Type1,B3,29329.5,,.
A002,10JAN2020,Type1,B3,695,,.
A002,10JAN2020,Type2,,703,,.
A002,10JAN2020,Type1,B3,29912.3,,.
A002,13JAN2020,Type2,,250,,.
A002,13JAN2020,Type1,B3,24951,,.
A002,14JAN2020,Type2,,250,,.
A002,16JAN2020,Type1,B3,9759.5,,.
A002,17JAN2020,Type1,A2,652,,.
A002,17JAN2020,Type1,B3,4538,,.
A002,21JAN2020,Type1,B3,845,,.
A002,21JAN2020,Type1,B3,83436.27,,.
A002,21JAN2020,Type1,B3,75,,.
A002,21JAN2020,Type1,B3,8814,,.
A002,22JAN2020,Type2,,250,,.
A002,24JAN2020,Type1,B3,2270,,.
A002,24JAN2020,Type1,B3,2258,,.
A002,28JAN2020,Type1,B3,9135,,.
A004,06JAN2020,Type1,A1,300,09JAN2020,300
A004,15JAN2020,Type1,A2,1204.5,,.
A004,31JAN2020,Type1,B1,0,,.
A004,31JAN2020,Type1,A2,3079.09,,.
A005,13JAN2020,Type1,B3,1000,23JAN2020,1000
A005,21JAN2020,Type1,B3,1000,,.
A005,28JAN2020,Type1,B3,1000,30JAN2020,1000
A006,02JAN2020,Type1,M1,250,,.
A006,03JAN2020,Type1,A2,23.72,,.
A006,03JAN2020,Type2,,200,,.
A006,06JAN2020,Type1,M1,400,09JAN2020,400
A006,16JAN2020,Type1,M1,150,,.
A006,21JAN2020,Type1,M1,190,,.
A006,22JAN2020,Type1,M1,200,28JAN2020,200
A006,23JAN2020,Type1,M1,200,,.
A006,24JAN2020,Type1,M1,184.75,,.
A006,27JAN2020,Type1,M1,120,,.
A006,29JAN2020,Type1,M1,550,,.
A007,02JAN2020,Type1,B1,0,,.
A007,13JAN2020,Type1,MF,180,,.
A007,13JAN2020,Type1,M1,175,16JAN2020,175
A007,15JAN2020,Type1,M1,240,21JAN2020,240
A007,16JAN2020,Type1,M1,440,22JAN2020,440
A007,17JAN2020,Type1,M1,345,23JAN2020,345
;;;;
--
Paige Miller
1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

For ID A005, you have 5 values of 1,000

  Type 1 Jan 13 is matched to Type2 on Jan 23
  Type 1 Jan 21 is not matched

  Type1 of Jan 28 is matched to Type2 on Jan 30

 

Why is the Type1 of Jan 21 not matched to the earliest remaining Type2 - i.e. Jan 30?  Which would leave the Jan 28 Type 1 unmatched.  That's what my program below does.

 

Should we assume that once a given type 2 is matched, no earlier unmatched type 1 is eligible for some later type 2?

 

data have ;
  infile datalines dsd truncover;
  length id $4;
  input id:$3. dateTran:DATE9. Trans:$5. SourceType:$2. TranValueADj:32.;
  id=cats('A',id);
  format dateTran DATE9.;
datalines4;
001,03JAN2020,Type1,M1,1776
001,03JAN2020,Type1,A2,1824.12
001,03JAN2020,Type1,M1,1000
001,17JAN2020,Type1,A2,1830.88
001,27JAN2020,Type1,M1,375.27
001,27JAN2020,Type1,M1,470.04
001,27JAN2020,Type1,M1,4712.3
001,31JAN2020,Type1,A2,1869.88
002,02JAN2020,Type1,B3,1848
002,03JAN2020,Type1,A2,3351
002,03JAN2020,Type1,A2,871
002,03JAN2020,Type1,A2,5135
002,03JAN2020,Type2,,1000
002,03JAN2020,Type1,A2,1070
002,03JAN2020,Type1,A2,8357
002,06JAN2020,Type1,B3,49425
002,06JAN2020,Type1,B3,61744.1
002,06JAN2020,Type1,B3,11273
002,07JAN2020,Type2,,2198
002,07JAN2020,Type1,B3,36006.95
002,08JAN2020,Type1,B3,29329.5
002,10JAN2020,Type1,B3,695
002,10JAN2020,Type2,,703
002,10JAN2020,Type1,B3,29912.3
002,13JAN2020,Type2,,250
002,13JAN2020,Type1,B3,24951
002,14JAN2020,Type2,,250
002,16JAN2020,Type1,B3,9759.5
002,17JAN2020,Type1,A2,652
002,17JAN2020,Type1,B3,4538
002,21JAN2020,Type1,B3,845
002,21JAN2020,Type1,B3,83436.27
002,21JAN2020,Type1,B3,75
002,21JAN2020,Type1,B3,8814
002,22JAN2020,Type2,,250
002,24JAN2020,Type1,B3,2270
002,24JAN2020,Type1,B3,2258
002,28JAN2020,Type1,B3,9135
003,03JAN2020,Type1,B3,100000
003,03JAN2020,Type1,A2,2061
003,07JAN2020,Type2,,100000
003,13JAN2020,Type1,A2,775.1
003,14JAN2020,Type1,B3,100000
003,16JAN2020,Type2,,100000
003,23JAN2020,Type1,B3,25000
003,24JAN2020,Type1,A2,121.16
003,27JAN2020,Type1,A2,146.67
003,31JAN2020,Type1,A2,1246.7
004,06JAN2020,Type1,A1,300
004,09JAN2020,Type2,,300
004,15JAN2020,Type1,A2,1204.5
004,31JAN2020,Type1,B1,0
004,31JAN2020,Type1,A2,3079.09
005,13JAN2020,Type1,B3,1000
005,21JAN2020,Type1,B3,1000
005,23JAN2020,Type2,,1000
005,28JAN2020,Type1,B3,1000
005,30JAN2020,Type2,,1000
006,02JAN2020,Type1,M1,250
006,03JAN2020,Type1,A2,23.72
006,03JAN2020,Type2,,200
006,06JAN2020,Type1,M1,400
006,09JAN2020,Type2,,400
006,16JAN2020,Type1,M1,150
006,21JAN2020,Type1,M1,190
006,22JAN2020,Type1,M1,200
006,23JAN2020,Type1,M1,200
006,24JAN2020,Type1,M1,184.75
006,27JAN2020,Type1,M1,120
006,28JAN2020,Type2,,200
006,29JAN2020,Type1,M1,550
007,02JAN2020,Type1,B1,0
007,13JAN2020,Type1,MF,180
007,13JAN2020,Type1,M1,175
007,15JAN2020,Type1,M1,240
007,16JAN2020,Type1,M1,440
007,16JAN2020,Type2,,175
007,17JAN2020,Type1,M1,345
007,21JAN2020,Type2,,240
007,22JAN2020,Type2,,440
007,23JAN2020,Type2,,345
;;;;

data want (drop=_:);
  if 0 then set have; 
  if _n_=1 then do;
    if 0 then set have (rename=(datetran=matcheddate TranValueADJ=matchedamt));
    declare hash t2 (multidata:'Y');
	  t2.definekey('matchedamt');
	  t2.definedata('matchedamt','matcheddate');
	  t2.definedone();
  end;

  do until (last.id);   /* Populate hash object of type 2 trans */
    set have;
	by id;
	if trans='Type2' then t2.add(key:TranValueADJ,data:TranValueADJ,data:datetran);
  end;

  do until (last.id);
    set have;
	by id;
	call missing(of matched:, _rc);
    if trans='Type1' then do;                        /* Look for a matching  Type2 */
      _rc=t2.find(key:TranValueADJ);
      if _rc=0 then t2.removedup(key:TranValueADJ);  /*If found, remove from hash*/
      output;
	end;
    else if trans='Type2' then do;                   /*See if this type2 not removed yet */
      _rc=t2.find(key:TranValueADJ);
      if _rc=0 and matcheddate=datetran then do;     /*If found, output and remove from hash*/
		t2.removedup();
        call missing(of matched:);
        output;
	  end;
	end;
    else output;     /* Newly added statement, to accommodate non type1/type2 obs*/
  end;

run;

For each id, this program populates a hash object of the type2 transactions.  That's the first do until (last.id).  In the second do until, each incoming type1 is checked against for a qualifying match in the hash object.  If a match is found, that transaction is removed from the hash object.   For each incoming type2, if it is not exactly matched in the hash object, if must have already matched a preceding type1, and therefore won't be output.  But if it is still there, it is output, and it is also removed from the hash.  At the end of each id, the hash object will therefore be empty.

 

Just read @PaigeMiller 's comment about non type1/type2 records not being transcribed to WANT.  I've added an "else output;" statement, which I think should address that oversight.

 

Another note:  When the hash object T2 is indexed by the MatchedAmt and there are ties (allowed by the MULTIDATA:Y argument), the tied dataitems will be stored in the order they are added to the hash object.  Since the data are sorted by date, that means the tied values are stored in chronological order.  In turn, since the FIND method behavior is to retrieve the first available dataitem for a given key value, it will always retrieve the earliest date for that value (see Page 65 of Dorfman and Henderson's "Data Management Solutions Using SAS Hash Table Operations: A Business Case Study")

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

21 REPLIES 21
novinosrin
Tourmaline | Level 20

Good morning Sir @PaigeMiller  Thank you for the question. Just a couple of clarifications please, though may sound silly/lame as I am perhaps not reading between the lines-

 

1. WANT IDs have prefix letters unlike HAVE ID's. Any reason why that we should pay attention to?

2. The order of the WANT dataset by ID is not the same as HAVE. Is there a specific order you are after?

PaigeMiller
Diamond | Level 26

@novinosrin wrote:

Good morning Sir @PaigeMiller  Thank you for the question. Just a couple of clarifications please, though may sound silly/lame as I am perhaps not reading between the lines-

 

1. WANT IDs have prefix letters unlike HAVE ID's. Any reason why that we should pay attention to?

2. The order of the WANT dataset by ID is not the same as HAVE. Is there a specific order you are after?


1. Ignore the prefix letters

2. WANT order not important as long as the matching is correct

--
Paige Miller
PeterClemmensen
Tourmaline | Level 20

@PaigeMiller can a type2 transaction occur with the same amount on the same day?

 

And if such two occur, should only one of them be prohibited from being matched again?

PaigeMiller
Diamond | Level 26

@PeterClemmensen wrote:

@PaigeMiller can a type2 transaction occur with the same amount on the same day?

 

And if such two occur, should only one of them be prohibited from being matched again?


Yes to both.

--
Paige Miller
mkeintz
PROC Star

For ID A005, you have 5 values of 1,000

  Type 1 Jan 13 is matched to Type2 on Jan 23
  Type 1 Jan 21 is not matched

  Type1 of Jan 28 is matched to Type2 on Jan 30

 

Why is the Type1 of Jan 21 not matched to the earliest remaining Type2 - i.e. Jan 30?  Which would leave the Jan 28 Type 1 unmatched.  That's what my program below does.

 

Should we assume that once a given type 2 is matched, no earlier unmatched type 1 is eligible for some later type 2?

 

data have ;
  infile datalines dsd truncover;
  length id $4;
  input id:$3. dateTran:DATE9. Trans:$5. SourceType:$2. TranValueADj:32.;
  id=cats('A',id);
  format dateTran DATE9.;
datalines4;
001,03JAN2020,Type1,M1,1776
001,03JAN2020,Type1,A2,1824.12
001,03JAN2020,Type1,M1,1000
001,17JAN2020,Type1,A2,1830.88
001,27JAN2020,Type1,M1,375.27
001,27JAN2020,Type1,M1,470.04
001,27JAN2020,Type1,M1,4712.3
001,31JAN2020,Type1,A2,1869.88
002,02JAN2020,Type1,B3,1848
002,03JAN2020,Type1,A2,3351
002,03JAN2020,Type1,A2,871
002,03JAN2020,Type1,A2,5135
002,03JAN2020,Type2,,1000
002,03JAN2020,Type1,A2,1070
002,03JAN2020,Type1,A2,8357
002,06JAN2020,Type1,B3,49425
002,06JAN2020,Type1,B3,61744.1
002,06JAN2020,Type1,B3,11273
002,07JAN2020,Type2,,2198
002,07JAN2020,Type1,B3,36006.95
002,08JAN2020,Type1,B3,29329.5
002,10JAN2020,Type1,B3,695
002,10JAN2020,Type2,,703
002,10JAN2020,Type1,B3,29912.3
002,13JAN2020,Type2,,250
002,13JAN2020,Type1,B3,24951
002,14JAN2020,Type2,,250
002,16JAN2020,Type1,B3,9759.5
002,17JAN2020,Type1,A2,652
002,17JAN2020,Type1,B3,4538
002,21JAN2020,Type1,B3,845
002,21JAN2020,Type1,B3,83436.27
002,21JAN2020,Type1,B3,75
002,21JAN2020,Type1,B3,8814
002,22JAN2020,Type2,,250
002,24JAN2020,Type1,B3,2270
002,24JAN2020,Type1,B3,2258
002,28JAN2020,Type1,B3,9135
003,03JAN2020,Type1,B3,100000
003,03JAN2020,Type1,A2,2061
003,07JAN2020,Type2,,100000
003,13JAN2020,Type1,A2,775.1
003,14JAN2020,Type1,B3,100000
003,16JAN2020,Type2,,100000
003,23JAN2020,Type1,B3,25000
003,24JAN2020,Type1,A2,121.16
003,27JAN2020,Type1,A2,146.67
003,31JAN2020,Type1,A2,1246.7
004,06JAN2020,Type1,A1,300
004,09JAN2020,Type2,,300
004,15JAN2020,Type1,A2,1204.5
004,31JAN2020,Type1,B1,0
004,31JAN2020,Type1,A2,3079.09
005,13JAN2020,Type1,B3,1000
005,21JAN2020,Type1,B3,1000
005,23JAN2020,Type2,,1000
005,28JAN2020,Type1,B3,1000
005,30JAN2020,Type2,,1000
006,02JAN2020,Type1,M1,250
006,03JAN2020,Type1,A2,23.72
006,03JAN2020,Type2,,200
006,06JAN2020,Type1,M1,400
006,09JAN2020,Type2,,400
006,16JAN2020,Type1,M1,150
006,21JAN2020,Type1,M1,190
006,22JAN2020,Type1,M1,200
006,23JAN2020,Type1,M1,200
006,24JAN2020,Type1,M1,184.75
006,27JAN2020,Type1,M1,120
006,28JAN2020,Type2,,200
006,29JAN2020,Type1,M1,550
007,02JAN2020,Type1,B1,0
007,13JAN2020,Type1,MF,180
007,13JAN2020,Type1,M1,175
007,15JAN2020,Type1,M1,240
007,16JAN2020,Type1,M1,440
007,16JAN2020,Type2,,175
007,17JAN2020,Type1,M1,345
007,21JAN2020,Type2,,240
007,22JAN2020,Type2,,440
007,23JAN2020,Type2,,345
;;;;

data want (drop=_:);
  if 0 then set have; 
  if _n_=1 then do;
    if 0 then set have (rename=(datetran=matcheddate TranValueADJ=matchedamt));
    declare hash t2 (multidata:'Y');
	  t2.definekey('matchedamt');
	  t2.definedata('matchedamt','matcheddate');
	  t2.definedone();
  end;

  do until (last.id);   /* Populate hash object of type 2 trans */
    set have;
	by id;
	if trans='Type2' then t2.add(key:TranValueADJ,data:TranValueADJ,data:datetran);
  end;

  do until (last.id);
    set have;
	by id;
	call missing(of matched:, _rc);
    if trans='Type1' then do;                        /* Look for a matching  Type2 */
      _rc=t2.find(key:TranValueADJ);
      if _rc=0 then t2.removedup(key:TranValueADJ);  /*If found, remove from hash*/
      output;
	end;
    else if trans='Type2' then do;                   /*See if this type2 not removed yet */
      _rc=t2.find(key:TranValueADJ);
      if _rc=0 and matcheddate=datetran then do;     /*If found, output and remove from hash*/
		t2.removedup();
        call missing(of matched:);
        output;
	  end;
	end;
    else output;     /* Newly added statement, to accommodate non type1/type2 obs*/
  end;

run;

For each id, this program populates a hash object of the type2 transactions.  That's the first do until (last.id).  In the second do until, each incoming type1 is checked against for a qualifying match in the hash object.  If a match is found, that transaction is removed from the hash object.   For each incoming type2, if it is not exactly matched in the hash object, if must have already matched a preceding type1, and therefore won't be output.  But if it is still there, it is output, and it is also removed from the hash.  At the end of each id, the hash object will therefore be empty.

 

Just read @PaigeMiller 's comment about non type1/type2 records not being transcribed to WANT.  I've added an "else output;" statement, which I think should address that oversight.

 

Another note:  When the hash object T2 is indexed by the MatchedAmt and there are ties (allowed by the MULTIDATA:Y argument), the tied dataitems will be stored in the order they are added to the hash object.  Since the data are sorted by date, that means the tied values are stored in chronological order.  In turn, since the FIND method behavior is to retrieve the first available dataitem for a given key value, it will always retrieve the earliest date for that value (see Page 65 of Dorfman and Henderson's "Data Management Solutions Using SAS Hash Table Operations: A Business Case Study")

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
PaigeMiller
Diamond | Level 26

Hello, everyone, you are correct about the mistake. The JAN30 Type 2 transaction should be match to the JAN21 Type 1 transaction, and there is no match for the JAN28 Type 1 transaction.

--
Paige Miller
kah1
Fluorite | Level 6

To build on the problem.  If you only wanted to  match  the Type 1 and 2  by id if they were <=15 days apart (Type 2 still occurring after 1).  How would you modify this program?  here are 2 more id's with multiple instances of type 1 with same amount and only 1 type 2.    I tried adding another loop with a  t2.next()  but it wasn't dropping the first find.

 

have:

id dateTran Trans SourceType TranValueADj
 A009 3-Jan-20 Type1 A2 520.84
A009 15-Jan-20 Type1 A2 2008.8
A009 17-Jan-20 Type1 A2 520.84
A009 28-Jan-20 Type1 B2 1041.68
A009 31-Jan-20 Type2 520.84
A009 31-Jan-20 Type1 A2 520.84
A011 2-Jan-20 Type1 A2 1023.3
A011 14-Jan-20 Type1 M1 1023.3
A011 15-Jan-20 Type1 A2 1703.1
A011 17-Jan-20 Type2 1023.3

 

 

want -added the number of days based on excel function, i realize sas may be slightly different:

 

id dateTran Trans SourceType TranValueADj MatchedDate MatchedAmt daysafter
A009 3-Jan-20 Type1 A2 520.84
A009 15-Jan-20 Type1 A2 2008.8
A009 17-Jan-20 Type1 A2 520.84 31-Jan-20 520.84 14.00
A009 28-Jan-20 Type1 B2 1041.68
A011 31-Jan-20 Type1 A2 520.84
A011 2-Jan-20 Type1 A2 1023.3 17-Jan-20 1023.3 15.00
A011 14-Jan-20 Type1 M1 1023.3
A011 15-Jan-20 Type1 A2 1703.1

 

mkeintz
PROC Star

You're saying that a type2 record can sometimes be too far in the future to match a type 1 record, i.e. you don't want to match "stale" type 1 records.

 

To do that, only minimal changes would be needed.

 

  1. You would still do the find method when encountering Type1.  I.e. keep the sequence:
        if trans='Type1' then do; 
          _rc=t2.find(key:TranValueADJ);
      
  2. But if the dates are more than 15 days apart, there would be these changes:
    1. Don't do the removedup, because that type 2 record has to be available for any subsequent type 1's.
    2. Call missing for matchedamt and matacheddate before the output statement, because the record-in-hand is a stale type1 record, and its output record should not appear as matched.

That's it - change 

    if trans='Type1' then do;                        /* Look for a matching  Type2 */
      _rc=t2.find(key:TranValueADJ);
      if _rc=0 then t2.removedup(key:TranValueADJ);  /*If found, remove from hash*/
      output;
	end;

to

    if trans='Type1' then do;                        /* Look for a matching  Type2 */
      _rc=t2.find(key:TranValueADJ);
	  if _rc=0 then do;                              /*Found matched amount*/
        if datetran+15<matcheddate then call missing(of matched:); /*Too far future*/
		else t2.removedup(key:TranValueADJ);         /*Remove if matched and close enough*/
	  end;
      output;
	end;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
kah1
Fluorite | Level 6

@mkeintz  Thank you  i knew it was something in that  loop.. I will give this a try. 

PeterClemmensen
Tourmaline | Level 20

I think this does the job ?

 

data have;
  infile datalines dsd truncover;
  input id:$3. dateTran:DATE9. Trans:$5. SourceType:$2. TranValueADj:32.;
  format dateTran DATE9.;
datalines4;
001,03JAN2020,Type1,M1,1776
001,03JAN2020,Type1,A2,1824.12
001,03JAN2020,Type1,M1,1000
001,17JAN2020,Type1,A2,1830.88
001,27JAN2020,Type1,M1,375.27
001,27JAN2020,Type1,M1,470.04
001,27JAN2020,Type1,M1,4712.3
001,31JAN2020,Type1,A2,1869.88
002,02JAN2020,Type1,B3,1848
002,03JAN2020,Type1,A2,3351
002,03JAN2020,Type1,A2,871
002,03JAN2020,Type1,A2,5135
002,03JAN2020,Type2,,1000
002,03JAN2020,Type1,A2,1070
002,03JAN2020,Type1,A2,8357
002,06JAN2020,Type1,B3,49425
002,06JAN2020,Type1,B3,61744.1
002,06JAN2020,Type1,B3,11273
002,07JAN2020,Type2,,2198
002,07JAN2020,Type1,B3,36006.95
002,08JAN2020,Type1,B3,29329.5
002,10JAN2020,Type1,B3,695
002,10JAN2020,Type2,,703
002,10JAN2020,Type1,B3,29912.3
002,13JAN2020,Type2,,250
002,13JAN2020,Type1,B3,24951
002,14JAN2020,Type2,,250
002,16JAN2020,Type1,B3,9759.5
002,17JAN2020,Type1,A2,652
002,17JAN2020,Type1,B3,4538
002,21JAN2020,Type1,B3,845
002,21JAN2020,Type1,B3,83436.27
002,21JAN2020,Type1,B3,75
002,21JAN2020,Type1,B3,8814
002,22JAN2020,Type2,,250
002,24JAN2020,Type1,B3,2270
002,24JAN2020,Type1,B3,2258
002,28JAN2020,Type1,B3,9135
003,03JAN2020,Type1,B3,100000
003,03JAN2020,Type1,A2,2061
003,07JAN2020,Type2,,100000
003,13JAN2020,Type1,A2,775.1
003,14JAN2020,Type1,B3,100000
003,16JAN2020,Type2,,100000
003,23JAN2020,Type1,B3,25000
003,24JAN2020,Type1,A2,121.16
003,27JAN2020,Type1,A2,146.67
003,31JAN2020,Type1,A2,1246.7
004,06JAN2020,Type1,A1,300
004,09JAN2020,Type2,,300
004,15JAN2020,Type1,A2,1204.5
004,31JAN2020,Type1,B1,0
004,31JAN2020,Type1,A2,3079.09
005,13JAN2020,Type1,B3,1000
005,21JAN2020,Type1,B3,1000
005,23JAN2020,Type2,,1000
005,28JAN2020,Type1,B3,1000
005,30JAN2020,Type2,,1000
006,02JAN2020,Type1,M1,250
006,03JAN2020,Type1,A2,23.72
006,03JAN2020,Type2,,200
006,06JAN2020,Type1,M1,400
006,09JAN2020,Type2,,400
006,16JAN2020,Type1,M1,150
006,21JAN2020,Type1,M1,190
006,22JAN2020,Type1,M1,200
006,23JAN2020,Type1,M1,200
006,24JAN2020,Type1,M1,184.75
006,27JAN2020,Type1,M1,120
006,28JAN2020,Type2,,200
006,29JAN2020,Type1,M1,550
007,02JAN2020,Type1,B1,0
007,13JAN2020,Type1,MF,180
007,13JAN2020,Type1,M1,175
007,15JAN2020,Type1,M1,240
007,16JAN2020,Type1,M1,440
007,16JAN2020,Type2,,175
007,17JAN2020,Type1,M1,345
007,21JAN2020,Type2,,240
007,22JAN2020,Type2,,440
007,23JAN2020,Type2,,345
;;;;

data temp / view=temp;
   set have;
   n = _N_;
run;

data want (drop=n);
   if _N_ = 1 then do;
      dcl hash h(dataset : "temp(where = (Trans='Type2') 
                            rename = (dateTran=MatchedDate TranValueADj=MatchedAmt))",
                 multidata : "Y");
      h.definekey("id", "MatchedAmt");
      h.definedata("MatchedDate", "MatchedAmt", "n");
      h.definedone();

      dcl hash u();
      u.definekey("n");
      u.definedone();
   end;

   set temp;
   if u.check() = 0 then delete;

   if Trans = "Type1" then do;
      do while (h.do_over(key : id, key : TranValueADj) = 0);
         if dateTran < MatchedDate then do;
            u.ref();
            h.removedup();
            leave;
         end;
         else call missing(MatchedDate, MatchedAmt, n);
      end;
   end;

   format MatchedDate date9.;
run;

 

 

PaigeMiller
Diamond | Level 26

Thanks, @PeterClemmensen . Looks good on the sample data. We're going to try it on millions of records, keep your fingers crossed!

--
Paige Miller
PeterClemmensen
Tourmaline | Level 20

Fingers crossed 🙂

PaigeMiller
Diamond | Level 26

The @PeterClemmensen solution run of over 50,000,000 records seems correct so far, but more checking is warranted.

 

The @novinosrin solution is now running.

--
Paige Miller
novinosrin
Tourmaline | Level 20
data have;
  infile datalines dsd truncover;
  input id:$3. dateTran:DATE9. Trans:$5. SourceType:$2. TranValueADj:32.;
  format dateTran DATE9.;
datalines4;
001,03JAN2020,Type1,M1,1776
001,03JAN2020,Type1,A2,1824.12
001,03JAN2020,Type1,M1,1000
001,17JAN2020,Type1,A2,1830.88
001,27JAN2020,Type1,M1,375.27
001,27JAN2020,Type1,M1,470.04
001,27JAN2020,Type1,M1,4712.3
001,31JAN2020,Type1,A2,1869.88
002,02JAN2020,Type1,B3,1848
002,03JAN2020,Type1,A2,3351
002,03JAN2020,Type1,A2,871
002,03JAN2020,Type1,A2,5135
002,03JAN2020,Type2,,1000
002,03JAN2020,Type1,A2,1070
002,03JAN2020,Type1,A2,8357
002,06JAN2020,Type1,B3,49425
002,06JAN2020,Type1,B3,61744.1
002,06JAN2020,Type1,B3,11273
002,07JAN2020,Type2,,2198
002,07JAN2020,Type1,B3,36006.95
002,08JAN2020,Type1,B3,29329.5
002,10JAN2020,Type1,B3,695
002,10JAN2020,Type2,,703
002,10JAN2020,Type1,B3,29912.3
002,13JAN2020,Type2,,250
002,13JAN2020,Type1,B3,24951
002,14JAN2020,Type2,,250
002,16JAN2020,Type1,B3,9759.5
002,17JAN2020,Type1,A2,652
002,17JAN2020,Type1,B3,4538
002,21JAN2020,Type1,B3,845
002,21JAN2020,Type1,B3,83436.27
002,21JAN2020,Type1,B3,75
002,21JAN2020,Type1,B3,8814
002,22JAN2020,Type2,,250
002,24JAN2020,Type1,B3,2270
002,24JAN2020,Type1,B3,2258
002,28JAN2020,Type1,B3,9135
003,03JAN2020,Type1,B3,100000
003,03JAN2020,Type1,A2,2061
003,07JAN2020,Type2,,100000
003,13JAN2020,Type1,A2,775.1
003,14JAN2020,Type1,B3,100000
003,16JAN2020,Type2,,100000
003,23JAN2020,Type1,B3,25000
003,24JAN2020,Type1,A2,121.16
003,27JAN2020,Type1,A2,146.67
003,31JAN2020,Type1,A2,1246.7
004,06JAN2020,Type1,A1,300
004,09JAN2020,Type2,,300
004,15JAN2020,Type1,A2,1204.5
004,31JAN2020,Type1,B1,0
004,31JAN2020,Type1,A2,3079.09
005,13JAN2020,Type1,B3,1000
005,21JAN2020,Type1,B3,1000
005,23JAN2020,Type2,,1000
005,28JAN2020,Type1,B3,1000
005,30JAN2020,Type2,,1000
006,02JAN2020,Type1,M1,250
006,03JAN2020,Type1,A2,23.72
006,03JAN2020,Type2,,200
006,06JAN2020,Type1,M1,400
006,09JAN2020,Type2,,400
006,16JAN2020,Type1,M1,150
006,21JAN2020,Type1,M1,190
006,22JAN2020,Type1,M1,200
006,23JAN2020,Type1,M1,200
006,24JAN2020,Type1,M1,184.75
006,27JAN2020,Type1,M1,120
006,28JAN2020,Type2,,200
006,29JAN2020,Type1,M1,550
007,02JAN2020,Type1,B1,0
007,13JAN2020,Type1,MF,180
007,13JAN2020,Type1,M1,175
007,15JAN2020,Type1,M1,240
007,16JAN2020,Type1,M1,440
007,16JAN2020,Type2,,175
007,17JAN2020,Type1,M1,345
007,21JAN2020,Type2,,240
007,22JAN2020,Type2,,440
007,23JAN2020,Type2,,345
;;;;


data want;
 if _n_=1 then do;
   dcl hash h(multidata:'y');
   h.definekey('TranValueADj');
   h.definedata('_dateTran');
   h.definedone();
   dcl hash h1(multidata:'y',hashexp:20);
   h1.definekey('_dateTran','Matchedamt');
   h1.definedata('MatchedDate','Matchedamt');
   h1.definedone();
 end;
 array t(-138061:2936547) _temporary_;
 call missing(of t(*));
 do _n_=1 by 1 until(last.id);
  set have;
  by id;
  if trans='Type1' then do;
   _dateTran=dateTran;
   h.add();
  end;
  else if h.find()=0 then do;
   MatchedDate=dateTran;
   MatchedAmt=TranValueADj;
   t(MatchedDate)=MatchedDate;
   h1.add();
   _rc=h.removedup();
  end;
 end;
 do _n_=1 to _n_;
  set have;
  call missing(MatchedDate,MatchedAmt);
  if t(dateTran) and trans='Type2' then continue;
  if h1.num_items then do;
    _iorc_=h1.find(key:datetran,key:TranValueADj);
	output;
  end;
  else output;
 end; 
 h.clear();
 h1.clear();
 format  MatchedDate date9.;
 drop _:;
run;

proc print noobs;run;

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 21 replies
  • 1282 views
  • 4 likes
  • 6 in conversation