Re: data merge using date range

goldenone · Posted 03-13-2018 09:25 AM

data  have;
infile datalines dsd truncover;
input ID $13. Transacion_date  date9. Amount 8. Replicate $3. month $3;
datalines4;
1212121,10/21/2016,5000,1,10
1212121,10/21/2016,45000,1,10
1515151,22/06/2017,50000,1,6
1212185,23/05/2017,60000,1,5
21212121,15/06/2015,70000,1,6
258523,15/08/2017,80000,1,8
1212121,10/21/2016,5000,2,10
1212121,10/21/2016,45000,2,10
1515151,22/06/2017,50000,2,6
1212185,23/05/2017,60000,2,5
21212121,15/06/2015,70000,2,6
258523,15/08/2017,80000,2,8;

run;

Hello,

I am trying to merge a dataset with itself in order to handle difference in dates. I have a way to do it in proc sql. However, I want to do it in a datastep to allow for the indexing that I put on the dataset to be used. I read https://communities.sas.com/t5/General-SAS-Programming/data-step-to-merge-for-a-date-range/td-p/1910... and I attempted to follow it. Long story short, am not getting results I expected.

Here is what I attempted

data  perm.havenew;
merge  perm.have (in=a  drop=  cash_wd  cash_wdtri  dtotal   dtotaltri)    perm.have (in=b   rename=(pdate=b_pdate));
by  replicate id;
if  (pdate - b_pdate)  ge  0 and  (pdate - b_pdate)  le 6  ;
run;

and here is the warning message I received.

OTE: MERGE statement has more than one data set with repeats of BY values.

4 The SAS System 10:10 Monday, March 12, 2018

NOTE: There were 167298200 observations read from the data set perm.have.

NOTE: The data set perm.have has 167298200 observations and 9 variables.

NOTE: Compressing data set perm.CASHWD2 decreased size by 26.97 percent.

Compressed is 194248 pages; un-compressed would require 265975 pages.

NOTE: DATA statement used (Total process time):

real time 3:25.11

cpu time 3:20.16

proc sql; create table have as
select   a.replicate, a.aml_party_id, a.pdate, a.mnth  
from perm.cashwd1 a
left join perm.cashwd1 b
on a.id = b.id  and a.replicate = b.replicate
and a.pdate between b.pdate and b.pdate + 6 
group by a.replicate, a.aml_party_id, a.pdate, a.mnth; quit;

Please note that the sql code does more but it also takes far more time. here is some dummy data.

Also, here is the number of records, I should get

CASHWDTMP created, with 512251800 rows and 8 columns

ballardw · Posted 03-13-2018 10:21 AM

Some example data and the expected result of that data would help as well a description of what is wrong with the current result. "not getting results I expected" is pretty vague as to what is going wrong.

Note that often data step merge is not the best for many-to-many merge operations

goldenone · Posted 03-13-2018 12:35 PM

Hi, I have edited the post to include dummy data as well as number of records that I expect to get from my real data. Can you explain if a data merge if as you say, often data step merge is not the best for many-to-many merge operations``, how I can get this done in a timely fashion which is what using a data step and indexing was supposed to accomplish

ballardw · Posted 03-13-2018 03:31 PM

@goldenone wrote:

Hi, I have edited the post to include dummy data as well as number of records that I expect to get from my real data. Can you explain if a data merge if as you say, often data step merge is not the best for many-to-many merge operations``, how I can get this done in a timely fashion which is what using a data step and indexing was supposed to accomplish

Easiest is to show.

Below are two data sets and two different merge by a variable that has repeated values in each set.

Look at the two input data sets and think about how you might expect them to merge on variable X.

Then run the code for the two example output sets. Do either of these match your expectations? Pay close attention to the number of records and the values of variable Y.

data one;
   input x y z;
datalines;
1 2 3
1 3 4
1 4 5
2 2 3
3 2 1
;
run;
data two;
   input x y b;
datalines;
1 4 6
1 3 8
1 4 9
1 6 4
2 2 3
3 2 1
;

data example1;
   merge one two;
   by x;
run;
data example2;
   merge two one ;
   by x;
run;

Can you answer why the variable Y has the value of 4 for the first record of example1? Why there are two records with Z=5 and different B values?

And understanding that merge can get much trickier if there are two or more By variables with repeated values.

Since you are attempting data step to use indexes from the documentation:

Indexes enable PROC SQL to execute the following classes of queries more efficiently:

comparisons against a column that is indexed

an IN subquery where the column in the inner subquery is indexed

correlated subqueries, where the column being compared with the correlated reference is indexed

join-queries, where the join-expression is an equals comparison and all the columns in the join-expression are indexed in one of the tables being joined.

Many users here don't want to download Excel files because of virus potential, others have such things blocked by security software. Also if you give us Excel we have to create a SAS data set and due to the non-existent constraints on Excel data cells the result we end up with may not have variables of the same type (numeric or character) and even values.

Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the {i} icon or attached as text to show exactly what you have and that we can test code against.

DaveBirch · Posted 03-15-2018 05:57 AM

Hi @goldenone,

Although @ballardw gave sound advice that "often data step merge is not the best for many-to-many merge operations" the problem isn't that there is different numbers of repeating values on each side of the merge. Indeed, being a self-merge (or self-join) you have exactly the same number of rows on each side! I'm quite confident that perm.havenew would be exactly the same if you removed the by statement. Ponder that for a moment 🙂

Data steps usually process data sequentially, so a merge statement will not lead to comparing different rows in a self-merge.

Try something like this (untested code):

data perm.havenew;
retain _RowNum_ 0;
do until last.id;
set perm.have (in=a drop= &droplist) nobs=NumObs;
by replicate id /* pdate */;
_RowNum_++1;
if first.id then StartIdObs = _RowNum_;

b_pdate = pdate; ** to force the while condition (below) to true **;
do _i_ = StartIdObs to NumObs
while ((pdate - b_pdate) ge 0);
set perm.have (in=b rename=(pdate=b_pdate));
if (pdate - b_pdate) ge 0 and (pdate - b_pdate) le 6 then output;
** or maybe cash_wd cash_wdtri dtotal dtotaltri **
** are accumulated before an obs is output? **;
end; /* of do _i_ */
end; /* of until last.id */
run;

P.S. it would help if you tried stating your requirements in a way that was independent of the solution you had first thought of.

Hope this helps.

DaveBirch · Posted 03-15-2018 06:38 AM

P.S. I'm assuming that perm.have is ordered or indexed by replicate id pdate.

data merge using date range