The following is part of my data: If you can't see this because of format issue, please see the attached data.
FAMILY_ID | FROMDATE |
000130758 | 01Jan2016 |
000130758 | 11Feb2016 |
000130758 | 01Mar2016 |
000535847 | 21Dec2015 |
000535847 | 22Dec2015 |
001808772 | 01May2016 |
001808772 | 02May2016 |
001808772 | 03May2016 |
009084285 | 30Jan2016 |
009084285 | 01Mar2016 |
035064420 | 23Aug2015 |
035064420 | 24Sep2015 |
035064420 | 23Oct2015 |
035064420 | 24Nov2015 |
567835520 | 24Dec2015 |
keep the fromdate If each distinct family_ID has only one fromdate. If each distinct family_ID has multiple fromdate, then compare the later fromdate and earlier fromdate. If the difference between later fromdate and earlier fromdate is > 30 then keep both date, then keep both fromdate; if the difference between later fromdate and earlier fromdate is <= 30, then keep only the earlier fromdate. For example, for family_ID 000130758, I keep all of three fromdate; for family_ID 000535847, I only keep the earlier fromdate 21Dec2015 and delete the fromdate 22Dec2015.
I appreciate any solutions and/or suggestions!
data one;
infile datalines;
input FAMILY_ID :9. FROMDATE date9. ;
Format FROMDATE date9. ;
datalines;
000130758 01Jan2016
000130758 11Feb2016
000130758 01Mar2016
000535847 21Dec2015
000535847 22Dec2015
001808772 01May2016
001808772 02May2016
001808772 03May2016
009084285 30Jan2016
009084285 01Mar2016
035064420 23Aug2015
035064420 24Sep2015
035064420 23Oct2015
035064420 24Nov2015
567835520 24Dec2015
;
run;
proc sort data=one;
by family_id fromdate;
run;
data two /*(drop=lag_family_id lag_fromdate lead_fromdate lagdate days)*/;
merge one one(firstobs=2 keep=fromdate rename=(fromdate=lead_fromdate));
lag_family_id=lag(family_id);
lag_fromdate=lag(fromdate);
if family_id=lag_family_id then lagdate=lag_fromdate;
if lagdate=. then lagdate=lead_fromdate;
days=ABS(INTCK('days',lagdate,fromdate));
run;
data three(drop=days);
set two (drop=lag_family_id lag_fromdate lead_fromdate lagdate);
by family_id;
if first.family_id or days>30 then output;
run;
Assumind data is sorted by family ID fromdate:
data want;
set have;
by family_id;
if first.family_id then output;
else do;
lagdate = lag(fromdate); /* previous observarion date */
if fromdate - lagdate > 30 then output;
and;
run;
You said you keep all the three for family_id 130758. Here family_id 130758 has 3 records. Is your criteria when it has more than 2 is to check the date diff with only first and last record.
000130758 01Jan2016
000130758 11Feb2016
000130758 01Mar2016
when you compare first two records then date diff is >30 so keep both. when you compare 2 and 3rd record then date diff is <30 then should keep only 2nd record. Finally it should have only two records.
Correct me if I'm wrong.
Please check this output
data want1 ;
record=_n_+1;
set have end=last;
if not last then set have(keep=fromdate rename=(fromdate=fromdate2)) point=record;
else call missing(fromdate2);
run;
data want2 ;
set want1;
by FAMILY_ID;
lagdate=lag(fromdate);
if last.FAMILY_ID then fromdate2=.;
if first.family_id then lagdate=.;
fromdiff=ifn((fromdate2-fromdate)>30,(fromdate2-fromdate),.);
lagdiff=ifn((fromdate-lagdate)>30,(fromdate-lagdate),.);
result=coalesce(lagdiff,fromdiff);
format lagdate date9.;
if result ne .;
drop fromdate2 lagdate fromdiff lagdiff result;
run;
You want to look both forward and backward to find records within 30 days of a predecessor or successor. This is an appropriate context to use the LAG functions (to see if current record is near its predecessor) and also a MERGE with FIrSTOBS=2 parameter (to see if current record is near to successor).
If the family_id is a character var, then change "dif(family_id)^=0" to "family_id^=lag(family_id)". and an analogous change for "dif(family_id)=0".
data want (drop=nxt:);
merge have
have (firstobs=2 keep=family_id fromdate rename=(family_id=nxtfid fromdate=nxtfd));
if (dif(family_id)^=0 and nxtfid^=family_id) /* Single record*/
or
(nxtfid=family_id and (nxtfd-fromdate)<=30) /* Early member of 30-day pair*/
or
(dif(family_id)=0 and dif(fromdate)<=30) /* Late member of 30-day pair*/
;
run;
I guess your condition dosen't have a proper reason, beacuase consider this group
035064420 | 23Aug2015 |
035064420 | 24Sep2015 |
035064420 | 23Oct2015 |
035064420 | 24Nov2015 |
When you condider row 1 and 2 the difference is >30 so you need both of them. when considering row 2 and 3 difference is <30 then you only need row 2 and not the row 3. Now here comes the problem, when you consider row 3 and row 4 the difference is >30 so youy need both the rows 3 and 4. So according to your condition the row 3 is satsifing and even not satsifing. confusing
So I keep row1, row2 and row4. It is confusing. but this is what my boss wants.
data one;
infile datalines;
input FAMILY_ID :9. FROMDATE date9. ;
Format FROMDATE date9. ;
datalines;
000130758 01Jan2016
000130758 11Feb2016
000130758 01Mar2016
000535847 21Dec2015
000535847 22Dec2015
001808772 01May2016
001808772 02May2016
001808772 03May2016
009084285 30Jan2016
009084285 01Mar2016
035064420 23Aug2015
035064420 24Sep2015
035064420 23Oct2015
035064420 24Nov2015
567835520 24Dec2015
;
run;
proc sort data=one;
by family_id fromdate;
run;
data two (drop=lag_family_id lag_fromdate lead_fromdate lagdate days);
merge one one(firstobs=2 keep=fromdate rename=(fromdate=lead_fromdate));
lag_family_id=lag(family_id);
lag_fromdate=lag(fromdate);
if family_id=lag_family_id then lagdate=lag_fromdate;
if lagdate=. then lagdate=lead_fromdate;
days=ABS(INTCK('days',lagdate,fromdate));
if days>30 or days=. then output;
run;
data one;
infile datalines;
input FAMILY_ID :9. FROMDATE date9. ;
Format FROMDATE date9. ;
datalines;
000130758 01Jan2016
000130758 11Feb2016
000130758 01Mar2016
000535847 21Dec2015
000535847 22Dec2015
001808772 01May2016
001808772 02May2016
001808772 03May2016
009084285 30Jan2016
009084285 01Mar2016
035064420 23Aug2015
035064420 24Sep2015
035064420 23Oct2015
035064420 24Nov2015
567835520 24Dec2015
;
run;
proc sort data=one;
by family_id fromdate;
run;
data two /*(drop=lag_family_id lag_fromdate lead_fromdate lagdate days)*/;
merge one one(firstobs=2 keep=fromdate rename=(fromdate=lead_fromdate));
lag_family_id=lag(family_id);
lag_fromdate=lag(fromdate);
if family_id=lag_family_id then lagdate=lag_fromdate;
if lagdate=. then lagdate=lead_fromdate;
days=ABS(INTCK('days',lagdate,fromdate));
run;
data three(drop=days);
set two (drop=lag_family_id lag_fromdate lead_fromdate lagdate);
by family_id;
if first.family_id or days>30 then output;
run;
I updated my previous code as to get your expectated output
data want1 ;
record=_n_+1;
set have end=last;
if not last then set have(keep=fromdate rename=(fromdate=fromdate2)) point=record;
else call missing(fromdate2);
run;
data want2 ;
set want1;
by FAMILY_ID;
lagdate=lag(fromdate);
if last.FAMILY_ID then fromdate2=.;
if first.family_id then lagdate=.;
fromdiff=fromdate2-fromdate;
lagdiff=fromdate-lagdate;
result=coalesce(lagdiff,fromdiff);
format lagdate date9.;
if first.family_id or result >30;
drop fromdate2 lagdate fromdiff lagdiff result;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.