Hello Community. Bringing this back up after 2 years.
I have the following data that I need to use to measure the periods of membership overlap (if any) for the same individual but in multiple stores:
Obs_Num | Store_ID | Member_ID | Member_From | Member_To | Member_Type |
1 | AA | 123 | 01/01/2020 | 12/31/2021 | Full |
2 | BB | 123 | 06/01/2021 | 12/31/2999 | Full |
3 | AA | 456 | 01/01/2020 | 12/31/2021 | Full |
4 | BB | 456 | 09/01/2021 | 10/31/2021 | Part |
5 | AA | 789 | 01/01/2020 | 12/31/2021 | Full |
6 | BB | 911 | 06/01/2021 | 10/01/2021 | Part |
7 | AA | 411 | 01/01/2020 | 12/31/2022 | Full |
8 | BB | 411 | 09/01/2021 | 12/31/2099 | Part |
9 | CC | 411 | 12/01/2021 | 12/31/2022 | Full |
Few notes (the colors are to clarify the overlapping):
- Observations 1&2 are for the same ID in 2 different stores
- Observations 5& 6 are for distinct ID and for only one stores (no overlap)
- Observations 7-8-9 are for the same ID in 3 different stores
- The new dataset (below) should have 5 new columns to measure overlap from start UNTILL the end of 2021.
- The new column "Type_Overlap" is tricky in that it has to start with that store's membership type first. For example, Obs 7/8/9 are similar in 4 of the new 5 columns EXCEPT the type_overlap.
The new columns to be added to the data should be as follows:
Obs_Num | Store_ID | Member_ID | Member_From | Member_To | Member_Type | Mnths_Overlaps_till2021 | Num_Stores_Overlap | Type_Overlap | Overlap_From | Overlap_To |
1 | AA | 123 | 01/01/2020 | 12/31/2021 | Full | 7 | 2 | Full/Full | 06/01/2021 | 12/31/2021 |
2 | BB | 123 | 06/01/2021 | 12/31/2999 | Full | 7 | 2 | Full/Full | 06/01/2021 | 12/31/2021 |
3 | AA | 456 | 01/01/2020 | 12/31/2021 | Full | 2 | 2 | Full/Part | 09/01/2021 | 10/31/2021 |
4 | BB | 456 | 09/01/2021 | 10/31/2021 | Part | 2 | 2 | Part/Full | 09/01/2021 | 10/31/2021 |
5 | AA | 789 | 01/01/2020 | 12/31/2021 | Full | 0 | 0 | Full | ||
6 | BB | 911 | 06/01/2021 | 10/01/2021 | Part | 0 | 0 | Part | ||
7 | AA | 411 | 01/01/2020 | 12/31/2022 | Full | 4 | 3 | Full/Part/Full | 09/01/2021 | 12/31/2021 |
8 | BB | 411 | 09/01/2021 | 12/31/2099 | Part | 4 | 3 | Part/Full/Full | 09/01/2021 | 12/31/2021 |
9 | CC | 411 | 12/01/2021 | 12/31/2022 | Full | 4 | 3 | Full/Full/Part | 09/01/2021 | 12/31/2021 |
Is there a way to do this in SAS in a meaningful code? My data is a lot much larger than the above, but I hope I captured all the scenarios.
Assuming that no store_id occurs more than once per member_id (so that there won't be a false overlap inferred), then a DATA step passing through each ID twice will do.
It uses a 2-way array indexed by YEAR and MONTH to track overlaps.
/*Combine and Condense Duplicate Records*/
%let HLB=2019; /* History lower bound */
%let HUB=2021; /* History upper bound */
data have;
infile cards expandtabs truncover;
input Obs_Num Store_ID $ Member_ID $ Member_From :mmddyy10. Member_To :mmddyy10. Member_Type $;
format Member_From Member_To mmddyy10.;
cards;
1 AA 123 01/01/2020 12/31/2021 Full
2 BB 123 06/01/2021 12/31/2999 Full
3 AA 456 01/01/2020 12/31/2021 Full
4 BB 456 09/01/2021 10/31/2021 Part
5 AA 789 01/01/2020 12/31/2021 Full
6 BB 911 06/01/2021 10/01/2021 Part
7 AA 411 01/01/2020 12/31/2022 Full
8 BB 411 09/01/2021 12/31/2099 Part
9 CC 411 12/01/2021 12/31/2022 Full
;
data want (drop=_:);
/** Do a first pass through a single ID **/
set have;
by member_id notsorted;
array mfreq{&HLB:&HUB,12} _temporary_ /*Frequency by year and month */;
retain mnths_overlaps_till2021 num_stores_overlap type_overlap overlap_from overlap_to;
length type_overlap $20;
format overlap_from overlap_to mmddyy10. ;
if first.member_id then do;
call missing(of mfreq{*},of mnths_overlaps_till2021 -- overlap_to);
mnths_overlaps_till2021 = 0;
end;
num_stores_overlap + not(first.member_id=1 and last.member_id=1);
type_overlap=catx('/',type_overlap,member_type);
_mdate=member_from;
do while (_mdate< min(member_to,"31dec&HUB"d));
_y=year(_mdate);
_m=month(_mdate);
mfreq{_y,_m}+1;
if mfreq{_y,_m}=2 then do;
overlap_from=min(overlap_from,mdy(_m,1,_y));
overlap_to=max(overlap_to,intnx('month',mdy(_m,1,_y),0,'end'));
mnths_overlaps_till2021 = intck('month',overlap_from,overlap_to)+1;
end;
_mdate=intnx('month',_mdate,1);
end;
/* Do a second pass through the same ID, and output */
if last.member_id then do _i=1 to coalesce(dif(_n_),_n_);
set have;
output;
end;
run;
Also, this assumes that all overlaps are nested (e.g. for 3 stores A,B, and C: A/B overlap for 4 months, and A/C and B/C overlap 1 month within that 4 month A/B overlap). And you want the longest span of those nested overlaps.
What if a member has periods at the same store that do not over lap at all what would the output look like:
5 AA 789 01/01/2020 12/31/2021 Full 6 AA 789 01/01/2023 12/31/2023 Full
Or ta different stores
5 AA 789 01/01/2020 12/31/2021 Full 6 BB 789 01/01/2023 12/31/2023 Full
Same store with gap and overlap by different store
5 AA 789 01/01/2020 12/31/2021 Full 6 AA 789 01/01/2023 12/31/2024 Part 7 BB 789 06/01/2021 07/31/2023 Full
Multiple stores with gaps but no overlap as well for the same ID.
Should sequential non-overlapping dates be treated differently:
5 AA 789 01/01/2020 12/31/2021 Full 6 AA 789 01/01/2022 12/31/2022 Full 7 AA 789 01/01/2023 12/31/2023 Full
You will likely get better responses if you provide some example data in the form of a working data step.
I am also interested to a minor degree in exactly how the output data set is to be used.
Do you have any idea what the maximum number of values might be involved with this Type_overlap? If you data has enough time involved that has the potential of having dozens if not more values in one variable an it is seldom a good idea to ever have more than one value in a single variable.
Your overlap period only considers months through 31dec2021. I get that. But why is the overlap count for the three observations for ID 411 set to 4? The date ranges are
Yes, the first two obs overlap for 4 months, but the last one only overlaps for 1, yet you assign a vale of 4 to all three obs.
What if the dates were
How many overlap months would those 3 ranges have?
What is your rule for measuring overlap?
Also
data have;
infile cards expandtabs truncover;
input Obs_Num Store_ID $ Member_ID $ Member_From :mmddyy10. Member_To :mmddyy10. Member_Type $;
format Member_From Member_To mmddyy10.;
cards;
1 AA 123 01/01/2020 12/31/2021 Full
2 BB 123 06/01/2021 12/31/2999 Full
3 AA 456 01/01/2020 12/31/2021 Full
4 BB 456 09/01/2021 10/31/2021 Part
5 AA 789 01/01/2020 12/31/2021 Full
6 BB 911 06/01/2021 10/01/2021 Part
7 AA 411 01/01/2020 12/31/2022 Full
8 BB 411 09/01/2021 12/31/2099 Part
9 CC 411 12/01/2021 12/31/2022 Full
;
data have2;
set have;
if Member_To>'31dec2021'd then Member_To='31dec2021'd;
run;
data temp;
set have2;
do date= Member_From to Member_To;
output;
end;
format date mmddyy10.;
drop Member_From Member_To;
run;
proc sort data=temp out=temp2 nodupkey;
by Member_ID date Obs_Num Store_ID;
run;
data temp3;
do until(last.date);
set temp2;
by Member_ID date ;
length _Obs_Num _Store_ID _Member_Type $ 200;
_Obs_Num=catx('|',_Obs_Num,Obs_Num);
_Store_ID=catx('|',_Store_ID,Store_ID);
_Member_Type=catx('|',_Member_Type,Member_Type);
end;
if findc(_Obs_Num,'|') then output;
drop Obs_Num Store_ID Member_Type;
run;
data temp4;
set temp3;
by Member_ID date ;
if first.Member_ID or dif(date) ne 1 then group+1;
run;
proc sql;
create table want1 as
select group,Member_ID,min(date) as Overlap_From format=mmddyy10.,max(date) as Overlap_To format=mmddyy10.,
intck('month',calculated Overlap_From,calculated Overlap_To,'c')+1 as Mnths_Overlaps_till2021,
max(countw(_Store_ID,'|')) as Num_Stores_Overlap
from temp4
group by group,Member_ID;
create table want2 as
select a.*,group,Overlap_From,Overlap_To,
coalesce(Mnths_Overlaps_till2021,0) as Mnths_Overlaps_till2021,
coalesce(Num_Stores_Overlap,0) as Num_Stores_Overlap
from have as a left join want1 as b
on a.Member_ID=b.Member_ID
order by Obs_Num;
create table temp5 as
select distinct group,Member_ID,_Obs_Num,_Member_Type
from temp4
group by group,Member_ID
having countw(_Obs_Num,'|')=max(countw(_Obs_Num,'|'));
create table want3 as
select a.*,_Obs_Num,coalescec(_Member_Type,Member_Type) as _Member_Type
from want2 as a left join temp5 as b
on a.group=b.group and a.Member_ID=b.Member_ID
order by Obs_Num;
quit;
data want4;
set want3;
length Type_Overlap $ 200;
if missing(_Obs_Num) or input(scan(_Obs_Num,1,'|'),best.)=Obs_Num then Type_Overlap=_Member_Type;
else do;
p=findw(_Obs_Num,strip(Obs_Num),'|','eit');
Type_Overlap=scan(_Member_Type,p,'|');
do i=1 to countw(_Member_Type,'|');
if i ne p then Type_Overlap=catx('|',Type_Overlap,scan(_Member_Type,i,'|'));
end;
end;
drop p i _Obs_Num _Member_Type;
run;
Assuming that no store_id occurs more than once per member_id (so that there won't be a false overlap inferred), then a DATA step passing through each ID twice will do.
It uses a 2-way array indexed by YEAR and MONTH to track overlaps.
/*Combine and Condense Duplicate Records*/
%let HLB=2019; /* History lower bound */
%let HUB=2021; /* History upper bound */
data have;
infile cards expandtabs truncover;
input Obs_Num Store_ID $ Member_ID $ Member_From :mmddyy10. Member_To :mmddyy10. Member_Type $;
format Member_From Member_To mmddyy10.;
cards;
1 AA 123 01/01/2020 12/31/2021 Full
2 BB 123 06/01/2021 12/31/2999 Full
3 AA 456 01/01/2020 12/31/2021 Full
4 BB 456 09/01/2021 10/31/2021 Part
5 AA 789 01/01/2020 12/31/2021 Full
6 BB 911 06/01/2021 10/01/2021 Part
7 AA 411 01/01/2020 12/31/2022 Full
8 BB 411 09/01/2021 12/31/2099 Part
9 CC 411 12/01/2021 12/31/2022 Full
;
data want (drop=_:);
/** Do a first pass through a single ID **/
set have;
by member_id notsorted;
array mfreq{&HLB:&HUB,12} _temporary_ /*Frequency by year and month */;
retain mnths_overlaps_till2021 num_stores_overlap type_overlap overlap_from overlap_to;
length type_overlap $20;
format overlap_from overlap_to mmddyy10. ;
if first.member_id then do;
call missing(of mfreq{*},of mnths_overlaps_till2021 -- overlap_to);
mnths_overlaps_till2021 = 0;
end;
num_stores_overlap + not(first.member_id=1 and last.member_id=1);
type_overlap=catx('/',type_overlap,member_type);
_mdate=member_from;
do while (_mdate< min(member_to,"31dec&HUB"d));
_y=year(_mdate);
_m=month(_mdate);
mfreq{_y,_m}+1;
if mfreq{_y,_m}=2 then do;
overlap_from=min(overlap_from,mdy(_m,1,_y));
overlap_to=max(overlap_to,intnx('month',mdy(_m,1,_y),0,'end'));
mnths_overlaps_till2021 = intck('month',overlap_from,overlap_to)+1;
end;
_mdate=intnx('month',_mdate,1);
end;
/* Do a second pass through the same ID, and output */
if last.member_id then do _i=1 to coalesce(dif(_n_),_n_);
set have;
output;
end;
run;
Also, this assumes that all overlaps are nested (e.g. for 3 stores A,B, and C: A/B overlap for 4 months, and A/C and B/C overlap 1 month within that 4 month A/B overlap). And you want the longest span of those nested overlaps.
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.