Solved: Combine and Condense Duplicate Records

altijani · Posted 08-16-2024 04:20 PM

Hello Community. Bringing this back up after 2 years.

I have the following data that I need to use to measure the periods of membership overlap (if any) for the same individual but in multiple stores:

Obs_Num	Store_ID	Member_ID	Member_From	Member_To	Member_Type
1	AA	123	01/01/2020	12/31/2021	Full
2	BB	123	06/01/2021	12/31/2999	Full
3	AA	456	01/01/2020	12/31/2021	Full
4	BB	456	09/01/2021	10/31/2021	Part
5	AA	789	01/01/2020	12/31/2021	Full
6	BB	911	06/01/2021	10/01/2021	Part
7	AA	411	01/01/2020	12/31/2022	Full
8	BB	411	09/01/2021	12/31/2099	Part
9	CC	411	12/01/2021	12/31/2022	Full

Few notes (the colors are to clarify the overlapping):

- Observations 1&2 are for the same ID in 2 different stores

- Observations 5& 6 are for distinct ID and for only one stores (no overlap)

- Observations 7-8-9 are for the same ID in 3 different stores

- The new dataset (below) should have 5 new columns to measure overlap from start UNTILL the end of 2021.

- The new column "Type_Overlap" is tricky in that it has to start with that store's membership type first. For example, Obs 7/8/9 are similar in 4 of the new 5 columns EXCEPT the type_overlap.

The new columns to be added to the data should be as follows:

Obs_Num	Store_ID	Member_ID	Member_From	Member_To	Member_Type	Mnths_Overlaps_till2021	Num_Stores_Overlap	Type_Overlap	Overlap_From	Overlap_To
1	AA	123	01/01/2020	12/31/2021	Full	7	2	Full/Full	06/01/2021	12/31/2021
2	BB	123	06/01/2021	12/31/2999	Full	7	2	Full/Full	06/01/2021	12/31/2021
3	AA	456	01/01/2020	12/31/2021	Full	2	2	Full/Part	09/01/2021	10/31/2021
4	BB	456	09/01/2021	10/31/2021	Part	2	2	Part/Full	09/01/2021	10/31/2021
5	AA	789	01/01/2020	12/31/2021	Full	0	0	Full
6	BB	911	06/01/2021	10/01/2021	Part	0	0	Part
7	AA	411	01/01/2020	12/31/2022	Full	4	3	Full/Part/Full	09/01/2021	12/31/2021
8	BB	411	09/01/2021	12/31/2099	Part	4	3	Part/Full/Full	09/01/2021	12/31/2021
9	CC	411	12/01/2021	12/31/2022	Full	4	3	Full/Full/Part	09/01/2021	12/31/2021

Is there a way to do this in SAS in a meaningful code? My data is a lot much larger than the above, but I hope I captured all the scenarios.

mkeintz · Posted 08-17-2024 10:31 PM

Assuming that no store_id occurs more than once per member_id (so that there won't be a false overlap inferred), then a DATA step passing through each ID twice will do.

It uses a 2-way array indexed by YEAR and MONTH to track overlaps.

/*Combine and Condense Duplicate Records*/
%let HLB=2019;  /* History lower bound */
%let HUB=2021;  /* History upper bound */

data have;
infile cards expandtabs truncover;
input Obs_Num	Store_ID $	Member_ID $	Member_From :mmddyy10.	Member_To  :mmddyy10.	Member_Type $;
format Member_From Member_To mmddyy10.;
cards;
1	AA	123	01/01/2020	12/31/2021	Full
2	BB	123	06/01/2021	12/31/2999	Full
3	AA	456	01/01/2020	12/31/2021	Full
4	BB	456	09/01/2021	10/31/2021	Part
5	AA	789	01/01/2020	12/31/2021	Full
6	BB	911	06/01/2021	10/01/2021	Part
7	AA	411	01/01/2020	12/31/2022	Full
8	BB	411	09/01/2021	12/31/2099	Part
9	CC	411	12/01/2021	12/31/2022	Full
;

data want (drop=_:);
  /** Do a first pass through a single ID **/
  set have;
  by member_id notsorted;

  array mfreq{&HLB:&HUB,12} _temporary_  /*Frequency by year and month */;
  retain mnths_overlaps_till2021 num_stores_overlap type_overlap overlap_from overlap_to;
  length type_overlap $20;
  format overlap_from overlap_to mmddyy10. ;

  if first.member_id then do;
    call missing(of mfreq{*},of mnths_overlaps_till2021 -- overlap_to);
    mnths_overlaps_till2021 = 0;
  end;

  num_stores_overlap + not(first.member_id=1 and last.member_id=1);

  type_overlap=catx('/',type_overlap,member_type);

  _mdate=member_from;
  do while (_mdate< min(member_to,"31dec&HUB"d));
    _y=year(_mdate);
    _m=month(_mdate);
    mfreq{_y,_m}+1;
    
    if mfreq{_y,_m}=2 then do;
      overlap_from=min(overlap_from,mdy(_m,1,_y));
      overlap_to=max(overlap_to,intnx('month',mdy(_m,1,_y),0,'end'));
      mnths_overlaps_till2021 = intck('month',overlap_from,overlap_to)+1;
    end;
    _mdate=intnx('month',_mdate,1);
  end;

  /* Do a second pass through the same ID, and output */
  if last.member_id then do _i=1 to coalesce(dif(_n_),_n_);
    set have;
    output;
  end;
run;

Also, this assumes that all overlaps are nested (e.g. for 3 stores A,B, and C: A/B overlap for 4 months, and A/C and B/C overlap 1 month within that 4 month A/B overlap). And you want the longest span of those nested overlaps.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

ballardw · Posted 08-16-2024 05:09 PM

What if a member has periods at the same store that do not over lap at all what would the output look like:

5	AA	789	01/01/2020	12/31/2021	Full
6	AA	789	01/01/2023	12/31/2023	Full

Or ta different stores

5	AA	789	01/01/2020	12/31/2021	Full
6	BB	789	01/01/2023	12/31/2023	Full

Same store with gap and overlap by different store

5	AA	789	01/01/2020	12/31/2021	Full
6	AA	789	01/01/2023	12/31/2024	Part
7	BB	789	06/01/2021	07/31/2023	Full

Multiple stores with gaps but no overlap as well for the same ID.

Should sequential non-overlapping dates be treated differently:

5	AA	789	01/01/2020	12/31/2021	Full
6	AA	789	01/01/2022	12/31/2022	Full
7	AA	789	01/01/2023	12/31/2023	Full

You will likely get better responses if you provide some example data in the form of a working data step.

I am also interested to a minor degree in exactly how the output data set is to be used.

Do you have any idea what the maximum number of values might be involved with this Type_overlap? If you data has enough time involved that has the potential of having dozens if not more values in one variable an it is seldom a good idea to ever have more than one value in a single variable.

mkeintz · Posted 08-16-2024 10:38 PM

Your overlap period only considers months through 31dec2021. I get that. But why is the overlap count for the three observations for ID 411 set to 4? The date ranges are

01jan2020:31dec2022
01sep2021:31dec2099
and
01dec2021:31dec2022

Yes, the first two obs overlap for 4 months, but the last one only overlaps for 1, yet you assign a vale of 4 to all three obs.

What if the dates were

01jan2020:31dec2021
01jan2020:30jun2020
and
01jan2021:30apr2021

How many overlap months would those 3 ranges have?

What is your rule for measuring overlap?

Also

Do you always have 1 obs for each member_id/store_id combination?
Are the data always sorted by ~~member_id/member_from~~ member_from within each member_id?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ksharp · Posted 08-17-2024 05:49 AM

data have;
infile cards expandtabs truncover;
input Obs_Num	Store_ID $	Member_ID $	Member_From :mmddyy10.	Member_To  :mmddyy10.	Member_Type $;
format Member_From Member_To mmddyy10.;
cards;
1	AA	123	01/01/2020	12/31/2021	Full
2	BB	123	06/01/2021	12/31/2999	Full
3	AA	456	01/01/2020	12/31/2021	Full
4	BB	456	09/01/2021	10/31/2021	Part
5	AA	789	01/01/2020	12/31/2021	Full
6	BB	911	06/01/2021	10/01/2021	Part
7	AA	411	01/01/2020	12/31/2022	Full
8	BB	411	09/01/2021	12/31/2099	Part
9	CC	411	12/01/2021	12/31/2022	Full
;
data have2;
 set have;
 if Member_To>'31dec2021'd then Member_To='31dec2021'd;
run;
data temp;
 set have2;
 do date= Member_From to Member_To;
   output;
 end;
 format date mmddyy10.;
 drop  Member_From Member_To;
run;
proc sort data=temp out=temp2 nodupkey;
by Member_ID date  Obs_Num	Store_ID;
run;
data temp3;
do until(last.date);
 set temp2;
 by Member_ID date ; 
 length _Obs_Num _Store_ID _Member_Type $ 200;
 _Obs_Num=catx('|',_Obs_Num,Obs_Num);
 _Store_ID=catx('|',_Store_ID,Store_ID);
 _Member_Type=catx('|',_Member_Type,Member_Type);
end;
if findc(_Obs_Num,'|') then output;
 drop  Obs_Num	Store_ID Member_Type;
run;
data temp4;
 set temp3;
 by Member_ID date ; 
 if first.Member_ID or dif(date) ne 1 then group+1;
run;
proc sql;
create table want1 as
select group,Member_ID,min(date) as Overlap_From format=mmddyy10.,max(date) as Overlap_To format=mmddyy10.,
intck('month',calculated Overlap_From,calculated Overlap_To,'c')+1 as Mnths_Overlaps_till2021,
max(countw(_Store_ID,'|')) as Num_Stores_Overlap
 from temp4
  group by group,Member_ID;

create table want2 as
select a.*,group,Overlap_From,Overlap_To,
coalesce(Mnths_Overlaps_till2021,0) as Mnths_Overlaps_till2021,
coalesce(Num_Stores_Overlap,0) as Num_Stores_Overlap
 from have as a left join want1 as b
  on a.Member_ID=b.Member_ID
   order by Obs_Num;

create table temp5 as
select distinct group,Member_ID,_Obs_Num,_Member_Type
 from temp4
  group by group,Member_ID
   having countw(_Obs_Num,'|')=max(countw(_Obs_Num,'|'));

create table want3 as
select a.*,_Obs_Num,coalescec(_Member_Type,Member_Type) as _Member_Type
 from want2 as a left join temp5 as b
  on a.group=b.group and a.Member_ID=b.Member_ID
   order by Obs_Num;
quit;

data want4;
 set want3;
 length Type_Overlap $ 200;
 if missing(_Obs_Num) or input(scan(_Obs_Num,1,'|'),best.)=Obs_Num then Type_Overlap=_Member_Type;
  else do;
    p=findw(_Obs_Num,strip(Obs_Num),'|','eit');
    Type_Overlap=scan(_Member_Type,p,'|');
	do i=1 to countw(_Member_Type,'|');
      if i ne p then Type_Overlap=catx('|',Type_Overlap,scan(_Member_Type,i,'|'));
	end;
  end;
drop p i  _Obs_Num _Member_Type;
run;

mkeintz · Posted 08-17-2024 10:31 PM

Assuming that no store_id occurs more than once per member_id (so that there won't be a false overlap inferred), then a DATA step passing through each ID twice will do.

It uses a 2-way array indexed by YEAR and MONTH to track overlaps.

/*Combine and Condense Duplicate Records*/
%let HLB=2019;  /* History lower bound */
%let HUB=2021;  /* History upper bound */

data have;
infile cards expandtabs truncover;
input Obs_Num	Store_ID $	Member_ID $	Member_From :mmddyy10.	Member_To  :mmddyy10.	Member_Type $;
format Member_From Member_To mmddyy10.;
cards;
1	AA	123	01/01/2020	12/31/2021	Full
2	BB	123	06/01/2021	12/31/2999	Full
3	AA	456	01/01/2020	12/31/2021	Full
4	BB	456	09/01/2021	10/31/2021	Part
5	AA	789	01/01/2020	12/31/2021	Full
6	BB	911	06/01/2021	10/01/2021	Part
7	AA	411	01/01/2020	12/31/2022	Full
8	BB	411	09/01/2021	12/31/2099	Part
9	CC	411	12/01/2021	12/31/2022	Full
;

data want (drop=_:);
  /** Do a first pass through a single ID **/
  set have;
  by member_id notsorted;

  array mfreq{&HLB:&HUB,12} _temporary_  /*Frequency by year and month */;
  retain mnths_overlaps_till2021 num_stores_overlap type_overlap overlap_from overlap_to;
  length type_overlap $20;
  format overlap_from overlap_to mmddyy10. ;

  if first.member_id then do;
    call missing(of mfreq{*},of mnths_overlaps_till2021 -- overlap_to);
    mnths_overlaps_till2021 = 0;
  end;

  num_stores_overlap + not(first.member_id=1 and last.member_id=1);

  type_overlap=catx('/',type_overlap,member_type);

  _mdate=member_from;
  do while (_mdate< min(member_to,"31dec&HUB"d));
    _y=year(_mdate);
    _m=month(_mdate);
    mfreq{_y,_m}+1;
    
    if mfreq{_y,_m}=2 then do;
      overlap_from=min(overlap_from,mdy(_m,1,_y));
      overlap_to=max(overlap_to,intnx('month',mdy(_m,1,_y),0,'end'));
      mnths_overlaps_till2021 = intck('month',overlap_from,overlap_to)+1;
    end;
    _mdate=intnx('month',_mdate,1);
  end;

  /* Do a second pass through the same ID, and output */
  if last.member_id then do _i=1 to coalesce(dif(_n_),_n_);
    set have;
    output;
  end;
run;

Also, this assumes that all overlaps are nested (e.g. for 3 stores A,B, and C: A/B overlap for 4 months, and A/C and B/C overlap 1 month within that 4 month A/B overlap). And you want the longest span of those nested overlaps.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Combine and Condense Duplicate Records

Re: Combine and Condense Duplicate Records

Re: Combine and Condense Duplicate Records

Re: Combine and Condense Duplicate Records

Re: Combine and Condense Duplicate Records

Re: Combine and Condense Duplicate Records

Combine and Condense Duplicate Records

Re: Combine and Condense Duplicate Records

Re: Combine and Condense Duplicate Records

Re: Combine and Condense Duplicate Records

Re: Combine and Condense Duplicate Records

Re: Combine and Condense Duplicate Records

SAS Innovate 2025: Call for Content

Click image to register for webinar

Classroom Training Available!