Hello Community. Bringing this back up after 2 years.
I have the following data that I need to use to measure the periods of membership overlap (if any) for the same individual but in multiple stores:
Obs_Num | Store_ID | Member_ID | Member_From | Member_To | Member_Type |
1 | AA | 123 | 01/01/2020 | 12/31/2021 | Full |
2 | BB | 123 | 06/01/2021 | 12/31/2999 | Full |
3 | AA | 456 | 01/01/2020 | 12/31/2021 | Full |
4 | BB | 456 | 09/01/2021 | 10/31/2021 | Part |
5 | AA | 789 | 01/01/2020 | 12/31/2021 | Full |
6 | BB | 911 | 06/01/2021 | 10/01/2021 | Part |
7 | AA | 411 | 01/01/2020 | 12/31/2022 | Full |
8 | BB | 411 | 09/01/2021 | 12/31/2099 | Part |
9 | CC | 411 | 12/01/2021 | 12/31/2022 | Full |
Few notes (the colors are to clarify the overlapping):
- Observations 1&2 are for the same ID in 2 different stores
- Observations 5& 6 are for distinct ID and for only one stores (no overlap)
- Observations 7-8-9 are for the same ID in 3 different stores
- The new dataset (below) should have 5 new columns to measure overlap from start UNTILL the end of 2021.
- The new column "Type_Overlap" is tricky in that it has to start with that store's membership type first. For example, Obs 7/8/9 are similar in 4 of the new 5 columns EXCEPT the type_overlap.
The new columns to be added to the data should be as follows:
Obs_Num | Store_ID | Member_ID | Member_From | Member_To | Member_Type | Mnths_Overlaps_till2021 | Num_Stores_Overlap | Type_Overlap | Overlap_From | Overlap_To |
1 | AA | 123 | 01/01/2020 | 12/31/2021 | Full | 7 | 2 | Full/Full | 06/01/2021 | 12/31/2021 |
2 | BB | 123 | 06/01/2021 | 12/31/2999 | Full | 7 | 2 | Full/Full | 06/01/2021 | 12/31/2021 |
3 | AA | 456 | 01/01/2020 | 12/31/2021 | Full | 2 | 2 | Full/Part | 09/01/2021 | 10/31/2021 |
4 | BB | 456 | 09/01/2021 | 10/31/2021 | Part | 2 | 2 | Part/Full | 09/01/2021 | 10/31/2021 |
5 | AA | 789 | 01/01/2020 | 12/31/2021 | Full | 0 | 0 | Full | ||
6 | BB | 911 | 06/01/2021 | 10/01/2021 | Part | 0 | 0 | Part | ||
7 | AA | 411 | 01/01/2020 | 12/31/2022 | Full | 4 | 3 | Full/Part/Full | 09/01/2021 | 12/31/2021 |
8 | BB | 411 | 09/01/2021 | 12/31/2099 | Part | 4 | 3 | Part/Full/Full | 09/01/2021 | 12/31/2021 |
9 | CC | 411 | 12/01/2021 | 12/31/2022 | Full | 4 | 3 | Full/Full/Part | 09/01/2021 | 12/31/2021 |
Is there a way to do this in SAS in a meaningful code? My data is a lot much larger than the above, but I hope I captured all the scenarios.
Assuming that no store_id occurs more than once per member_id (so that there won't be a false overlap inferred), then a DATA step passing through each ID twice will do.
It uses a 2-way array indexed by YEAR and MONTH to track overlaps.
/*Combine and Condense Duplicate Records*/
%let HLB=2019; /* History lower bound */
%let HUB=2021; /* History upper bound */
data have;
infile cards expandtabs truncover;
input Obs_Num Store_ID $ Member_ID $ Member_From :mmddyy10. Member_To :mmddyy10. Member_Type $;
format Member_From Member_To mmddyy10.;
cards;
1 AA 123 01/01/2020 12/31/2021 Full
2 BB 123 06/01/2021 12/31/2999 Full
3 AA 456 01/01/2020 12/31/2021 Full
4 BB 456 09/01/2021 10/31/2021 Part
5 AA 789 01/01/2020 12/31/2021 Full
6 BB 911 06/01/2021 10/01/2021 Part
7 AA 411 01/01/2020 12/31/2022 Full
8 BB 411 09/01/2021 12/31/2099 Part
9 CC 411 12/01/2021 12/31/2022 Full
;
data want (drop=_:);
/** Do a first pass through a single ID **/
set have;
by member_id notsorted;
array mfreq{&HLB:&HUB,12} _temporary_ /*Frequency by year and month */;
retain mnths_overlaps_till2021 num_stores_overlap type_overlap overlap_from overlap_to;
length type_overlap $20;
format overlap_from overlap_to mmddyy10. ;
if first.member_id then do;
call missing(of mfreq{*},of mnths_overlaps_till2021 -- overlap_to);
mnths_overlaps_till2021 = 0;
end;
num_stores_overlap + not(first.member_id=1 and last.member_id=1);
type_overlap=catx('/',type_overlap,member_type);
_mdate=member_from;
do while (_mdate< min(member_to,"31dec&HUB"d));
_y=year(_mdate);
_m=month(_mdate);
mfreq{_y,_m}+1;
if mfreq{_y,_m}=2 then do;
overlap_from=min(overlap_from,mdy(_m,1,_y));
overlap_to=max(overlap_to,intnx('month',mdy(_m,1,_y),0,'end'));
mnths_overlaps_till2021 = intck('month',overlap_from,overlap_to)+1;
end;
_mdate=intnx('month',_mdate,1);
end;
/* Do a second pass through the same ID, and output */
if last.member_id then do _i=1 to coalesce(dif(_n_),_n_);
set have;
output;
end;
run;
Also, this assumes that all overlaps are nested (e.g. for 3 stores A,B, and C: A/B overlap for 4 months, and A/C and B/C overlap 1 month within that 4 month A/B overlap). And you want the longest span of those nested overlaps.
What if a member has periods at the same store that do not over lap at all what would the output look like:
5 AA 789 01/01/2020 12/31/2021 Full 6 AA 789 01/01/2023 12/31/2023 Full
Or ta different stores
5 AA 789 01/01/2020 12/31/2021 Full 6 BB 789 01/01/2023 12/31/2023 Full
Same store with gap and overlap by different store
5 AA 789 01/01/2020 12/31/2021 Full 6 AA 789 01/01/2023 12/31/2024 Part 7 BB 789 06/01/2021 07/31/2023 Full
Multiple stores with gaps but no overlap as well for the same ID.
Should sequential non-overlapping dates be treated differently:
5 AA 789 01/01/2020 12/31/2021 Full 6 AA 789 01/01/2022 12/31/2022 Full 7 AA 789 01/01/2023 12/31/2023 Full
You will likely get better responses if you provide some example data in the form of a working data step.
I am also interested to a minor degree in exactly how the output data set is to be used.
Do you have any idea what the maximum number of values might be involved with this Type_overlap? If you data has enough time involved that has the potential of having dozens if not more values in one variable an it is seldom a good idea to ever have more than one value in a single variable.
Your overlap period only considers months through 31dec2021. I get that. But why is the overlap count for the three observations for ID 411 set to 4? The date ranges are
Yes, the first two obs overlap for 4 months, but the last one only overlaps for 1, yet you assign a vale of 4 to all three obs.
What if the dates were
How many overlap months would those 3 ranges have?
What is your rule for measuring overlap?
Also
data have;
infile cards expandtabs truncover;
input Obs_Num Store_ID $ Member_ID $ Member_From :mmddyy10. Member_To :mmddyy10. Member_Type $;
format Member_From Member_To mmddyy10.;
cards;
1 AA 123 01/01/2020 12/31/2021 Full
2 BB 123 06/01/2021 12/31/2999 Full
3 AA 456 01/01/2020 12/31/2021 Full
4 BB 456 09/01/2021 10/31/2021 Part
5 AA 789 01/01/2020 12/31/2021 Full
6 BB 911 06/01/2021 10/01/2021 Part
7 AA 411 01/01/2020 12/31/2022 Full
8 BB 411 09/01/2021 12/31/2099 Part
9 CC 411 12/01/2021 12/31/2022 Full
;
data have2;
set have;
if Member_To>'31dec2021'd then Member_To='31dec2021'd;
run;
data temp;
set have2;
do date= Member_From to Member_To;
output;
end;
format date mmddyy10.;
drop Member_From Member_To;
run;
proc sort data=temp out=temp2 nodupkey;
by Member_ID date Obs_Num Store_ID;
run;
data temp3;
do until(last.date);
set temp2;
by Member_ID date ;
length _Obs_Num _Store_ID _Member_Type $ 200;
_Obs_Num=catx('|',_Obs_Num,Obs_Num);
_Store_ID=catx('|',_Store_ID,Store_ID);
_Member_Type=catx('|',_Member_Type,Member_Type);
end;
if findc(_Obs_Num,'|') then output;
drop Obs_Num Store_ID Member_Type;
run;
data temp4;
set temp3;
by Member_ID date ;
if first.Member_ID or dif(date) ne 1 then group+1;
run;
proc sql;
create table want1 as
select group,Member_ID,min(date) as Overlap_From format=mmddyy10.,max(date) as Overlap_To format=mmddyy10.,
intck('month',calculated Overlap_From,calculated Overlap_To,'c')+1 as Mnths_Overlaps_till2021,
max(countw(_Store_ID,'|')) as Num_Stores_Overlap
from temp4
group by group,Member_ID;
create table want2 as
select a.*,group,Overlap_From,Overlap_To,
coalesce(Mnths_Overlaps_till2021,0) as Mnths_Overlaps_till2021,
coalesce(Num_Stores_Overlap,0) as Num_Stores_Overlap
from have as a left join want1 as b
on a.Member_ID=b.Member_ID
order by Obs_Num;
create table temp5 as
select distinct group,Member_ID,_Obs_Num,_Member_Type
from temp4
group by group,Member_ID
having countw(_Obs_Num,'|')=max(countw(_Obs_Num,'|'));
create table want3 as
select a.*,_Obs_Num,coalescec(_Member_Type,Member_Type) as _Member_Type
from want2 as a left join temp5 as b
on a.group=b.group and a.Member_ID=b.Member_ID
order by Obs_Num;
quit;
data want4;
set want3;
length Type_Overlap $ 200;
if missing(_Obs_Num) or input(scan(_Obs_Num,1,'|'),best.)=Obs_Num then Type_Overlap=_Member_Type;
else do;
p=findw(_Obs_Num,strip(Obs_Num),'|','eit');
Type_Overlap=scan(_Member_Type,p,'|');
do i=1 to countw(_Member_Type,'|');
if i ne p then Type_Overlap=catx('|',Type_Overlap,scan(_Member_Type,i,'|'));
end;
end;
drop p i _Obs_Num _Member_Type;
run;
Assuming that no store_id occurs more than once per member_id (so that there won't be a false overlap inferred), then a DATA step passing through each ID twice will do.
It uses a 2-way array indexed by YEAR and MONTH to track overlaps.
/*Combine and Condense Duplicate Records*/
%let HLB=2019; /* History lower bound */
%let HUB=2021; /* History upper bound */
data have;
infile cards expandtabs truncover;
input Obs_Num Store_ID $ Member_ID $ Member_From :mmddyy10. Member_To :mmddyy10. Member_Type $;
format Member_From Member_To mmddyy10.;
cards;
1 AA 123 01/01/2020 12/31/2021 Full
2 BB 123 06/01/2021 12/31/2999 Full
3 AA 456 01/01/2020 12/31/2021 Full
4 BB 456 09/01/2021 10/31/2021 Part
5 AA 789 01/01/2020 12/31/2021 Full
6 BB 911 06/01/2021 10/01/2021 Part
7 AA 411 01/01/2020 12/31/2022 Full
8 BB 411 09/01/2021 12/31/2099 Part
9 CC 411 12/01/2021 12/31/2022 Full
;
data want (drop=_:);
/** Do a first pass through a single ID **/
set have;
by member_id notsorted;
array mfreq{&HLB:&HUB,12} _temporary_ /*Frequency by year and month */;
retain mnths_overlaps_till2021 num_stores_overlap type_overlap overlap_from overlap_to;
length type_overlap $20;
format overlap_from overlap_to mmddyy10. ;
if first.member_id then do;
call missing(of mfreq{*},of mnths_overlaps_till2021 -- overlap_to);
mnths_overlaps_till2021 = 0;
end;
num_stores_overlap + not(first.member_id=1 and last.member_id=1);
type_overlap=catx('/',type_overlap,member_type);
_mdate=member_from;
do while (_mdate< min(member_to,"31dec&HUB"d));
_y=year(_mdate);
_m=month(_mdate);
mfreq{_y,_m}+1;
if mfreq{_y,_m}=2 then do;
overlap_from=min(overlap_from,mdy(_m,1,_y));
overlap_to=max(overlap_to,intnx('month',mdy(_m,1,_y),0,'end'));
mnths_overlaps_till2021 = intck('month',overlap_from,overlap_to)+1;
end;
_mdate=intnx('month',_mdate,1);
end;
/* Do a second pass through the same ID, and output */
if last.member_id then do _i=1 to coalesce(dif(_n_),_n_);
set have;
output;
end;
run;
Also, this assumes that all overlaps are nested (e.g. for 3 stores A,B, and C: A/B overlap for 4 months, and A/C and B/C overlap 1 month within that 4 month A/B overlap). And you want the longest span of those nested overlaps.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.