Hello there,
I am try to collapse two observation by group and condition
When the Id is the same, and Type for both observation is repeat ,and the date different is within 1 month .
Then collapse two observation and only keep the early date.
id | amount | date | type |
1 | 5 | 2017-01-01 | new |
1 | 5 | 2018-01-01 | repeat |
1 | 1 | 2018-01-17 | repeat |
2 | 1 | 2017-01-01 | new |
2 | 1 | 2018-01-01 | repeat |
2 | 1 | 2019-01-01 | repeat |
3 | 2 | 2001-01-05 | repeat |
3 | 2 | 2020-01-03 | new |
3 | 2 | 2020-04-01 | new |
Thanks
Hi @phantomofshell Can you check the results for ID=2 in your latest post and let me know where the discrepancy is. If my understanding is correct, the below should come close. The only concern I have is sets of 2 or more within type(new,repeat)
data have;
input id amount date :yymmdd10. type $;
format date yymmdd10.;
cards;
1 5 2017-01-01 new
1 5 2017-01-07 repeat
1 1 2018-01-17 repeat
2 1 2017-01-01 new
2 1 2017-01-02 other
2 1 2018-01-01 repeat
2 1 2018-01-07 repeat
3 2 2001-01-05 repeat
3 2 2020-01-03 new
3 2 2020-04-01 new
;
data want ;
if _n_=1 then do;
dcl hash H () ;
h.definekey ("date") ;
h.definedata ("prev","curr","amount") ;
h.definedone () ;
end;
do _n_=1 by 1 until(last.id);
set have end=z;
by id ;
curr=date;
_s=lag(amount);
if type not in ('new','repeat') then prev=.;
else if prev then do;
if intck('mon',prev,curr,'c')<1 then do;
amount=sum(amount,_s);
h.add(key:prev,data:prev,data:curr,data:amount);
h.add(key:curr,data:prev,data:curr,data:amount);
end;
prev=date;
end;
else prev=date;
end;
do _n_=1 to _n_;
set have;
if h.find()=0 then do;
if date=curr then continue;
output;
end;
else output;
end;
h.clear();
drop curr prev _s;
run;
HI @phantomofshell Please post the sample as plain text to make it convenient for community members to copay/paste and work on the solution. Pics make it impossible.
id | amount | date | type |
1 | 5 | 2017-01-01 | new |
1 | 5 | 2018-01-01 | repeat |
1 | 1 | 2018-01-17 | repeat |
2 | 1 | 2017-01-01 | new |
2 | 1 | 2018-01-01 | repeat |
2 | 1 | 2019-01-01 | repeat |
3 | 2 | 2001-01-05 | repeat |
3 | 2 | 2020-01-03 | new |
3 | 2 | 2020-04-01 | new |
new to the community , sorry
Hi @phantomofshell Welcome to SAS communities. To make it easy, I am herewith providing a traditional rather 2 step solution for reading ease and convenience.
Logic:
1. Determine the repeat groups that are consecutive and check for the sets of 2
2. Should the date difference within a set of 2 is less than a month, group that set as 1
3. Iterate the same logic to identify different sets of grps that satisfy condition2 and increment the grp number accordindly
4. The result sets the base for a simple Summary stats to sum the group and take the earliest number within that group. This is what the Proc SQL does
data have;
input id amount date :yymmdd10. type $;
format date yymmdd10.;
cards;
1 5 2017-01-01 new
1 5 2018-01-01 repeat
1 1 2018-01-17 repeat
2 1 2017-01-01 new
2 1 2018-01-01 repeat
2 1 2019-01-01 repeat
3 2 2001-01-05 repeat
3 2 2020-01-03 new
3 2 2020-04-01 new
;
data temp;
set have;
by id type notsorted;
retain grp prev_date;
if first.id then grp=0;
if first.type and type='repeat' then do;
grp=1;
prev_date=date;
end;
else if type='repeat' and intck('mon',prev_date,date,'c')>1 then do;
grp+1;
prev_date=date;
end;
else if type ne 'repeat' then grp=0;
drop prev_date;
run;
proc sql;
create table want(drop=sum) as
select id,sum( case when type='repeat' then amount else . end) as sum,
ifn(type='repeat',calculated sum,amount) as amount,date,type
from temp
group by id,grp
having min(date)=date and grp or not grp
order by id,date;
quit;
See if the above helps and let us know.
Also, Your sample leads to some assumptions that can be best helped if you can clarify:
1. Within a ID group, can exist a scenario like below?
1 5 2018-01-01 repeat
1 1 2018-01-17 repeat
1 4 2018-01-19 repeat
1 6 2018-01-22 repeat
So if yes, does this make it 2 sets of two obs or all 4 belongs to one set making it a sum(5,1,4,6)=16 with date 2018-01-01 ?
2. Two obs combining to one as stated in the subject line can make the situation highly ambiguous as sometimes what if 2&3 can make an independent set leaving 1 and 4. How would you want to handle this.
I would request you to review the requirement diligently
1 5 2017-01-01 new
1 5 2017-01-015 repeat
1 1 2018-01-17 repeat
2 1 2017-01-01 new
2 1 2018-01-01 repeat
2 1 2019-01-01 repeat
3 2 2001-01-05 repeat
3 2 2020-01-03 new
3 2 2020-04-01 new
Thanks for your help novionosrin, what if I would like to change the condition say :for the same id if previous type within 1 month= new and current type = repeat, then I collapse the two as what we did before
What should the output look like, would the collapsed summed record take thew date of NEW or REPEAT?
Or is the below assumption is what it is:
data have;
input id amount date :yymmdd10. type $;
format date yymmdd10.;
cards;
1 5 2017-01-01 new
1 5 2017-01-015 repeat
1 1 2018-01-17 repeat
2 1 2017-01-01 new
2 1 2018-01-01 repeat
2 1 2019-01-01 repeat
3 2 2001-01-05 repeat
3 2 2020-01-03 new
3 2 2020-04-01 new
;
data temp;
set have;
by id ;
retain grp prev_date;
if first.id then grp=0;
if type='new' then do;
grp+1;
prev_date=date;
end;
if lag(type) ne 'new' and type='repeat' or
type='repeat' and lag(type)='new' and intck('mon',prev_date,date,'c')
then grp=0;
drop prev_date;
run;
proc sql;
create table want(drop=sum) as
select id,sum(amount) as sum,
ifn(type='new',calculated sum,amount) as amount,date,type
from temp
group by id,grp
having min(date)=date and grp or not grp
order by id,date;
quit;
This is the Original Set
1 5 2017-01-01 new
1 5 2017-01-07 repeat
1 1 2018-01-17 repeat
2 1 2017-01-01 new
2 1 2017-01-02 other
2 1 2018-01-01 repeat
2 1 2018-01-07 repeat
3 2 2001-01-05 repeat
3 2 2020-01-03 new
3 2 2020-04-01 new
This is the result
1 10 2017-01-01 new
1 1 2018-01-17 repeat
2 1 2017-01-01 new
2 1 2018-01-01 repeat
2 1 2017-01-02 other
2 2 2018-01-01 repeat
3 2 2001-01-05 repeat
3 2 2020-01-03 new
3 2 2020-04-01 new
Sorry about the ambiguous description. Please be aware of there's other Type out there, so I added one other here.
The logic is simple, for one ID , if there are observations for new or repeat ,and these observations have time stamp within one month.
Then I consider those observation as one, so I add the amount and take the date and type of the first observation (timely most oldest)
Hi @phantomofshell Can you check the results for ID=2 in your latest post and let me know where the discrepancy is. If my understanding is correct, the below should come close. The only concern I have is sets of 2 or more within type(new,repeat)
data have;
input id amount date :yymmdd10. type $;
format date yymmdd10.;
cards;
1 5 2017-01-01 new
1 5 2017-01-07 repeat
1 1 2018-01-17 repeat
2 1 2017-01-01 new
2 1 2017-01-02 other
2 1 2018-01-01 repeat
2 1 2018-01-07 repeat
3 2 2001-01-05 repeat
3 2 2020-01-03 new
3 2 2020-04-01 new
;
data want ;
if _n_=1 then do;
dcl hash H () ;
h.definekey ("date") ;
h.definedata ("prev","curr","amount") ;
h.definedone () ;
end;
do _n_=1 by 1 until(last.id);
set have end=z;
by id ;
curr=date;
_s=lag(amount);
if type not in ('new','repeat') then prev=.;
else if prev then do;
if intck('mon',prev,curr,'c')<1 then do;
amount=sum(amount,_s);
h.add(key:prev,data:prev,data:curr,data:amount);
h.add(key:curr,data:prev,data:curr,data:amount);
end;
prev=date;
end;
else prev=date;
end;
do _n_=1 to _n_;
set have;
if h.find()=0 then do;
if date=curr then continue;
output;
end;
else output;
end;
h.clear();
drop curr prev _s;
run;
Thanks so much for your help! It works
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.