The following is part of my data: If you can't see this because of format issue, please see the attached data.
FAMILY_ID | FROMDATE |
000130758 | 01Jan2016 |
000130758 | 11Feb2016 |
000130758 | 01Mar2016 |
000535847 | 21Dec2015 |
000535847 | 22Dec2015 |
001808772 | 01May2016 |
001808772 | 02May2016 |
001808772 | 03May2016 |
009084285 | 30Jan2016 |
009084285 | 01Mar2016 |
035064420 | 23Aug2015 |
035064420 | 24Sep2015 |
035064420 | 23Oct2015 |
035064420 | 24Nov2015 |
567835520 | 24Dec2015 |
keep the fromdate If each distinct family_ID has only one fromdate. If each distinct family_ID has multiple fromdate, then compare the later fromdate and earlier fromdate. If the difference between later fromdate and earlier fromdate is > 30 then keep both date, then keep both fromdate; if the difference between later fromdate and earlier fromdate is <= 30, then keep only the earlier fromdate. For example, for family_ID 000130758, I keep all of three fromdate; for family_ID 000535847, I only keep the earlier fromdate 21Dec2015 and delete the fromdate 22Dec2015.
I appreciate any solutions and/or suggestions!
data one;
infile datalines;
input FAMILY_ID :9. FROMDATE date9. ;
Format FROMDATE date9. ;
datalines;
000130758 01Jan2016
000130758 11Feb2016
000130758 01Mar2016
000535847 21Dec2015
000535847 22Dec2015
001808772 01May2016
001808772 02May2016
001808772 03May2016
009084285 30Jan2016
009084285 01Mar2016
035064420 23Aug2015
035064420 24Sep2015
035064420 23Oct2015
035064420 24Nov2015
567835520 24Dec2015
;
run;
proc sort data=one;
by family_id fromdate;
run;
data two /*(drop=lag_family_id lag_fromdate lead_fromdate lagdate days)*/;
merge one one(firstobs=2 keep=fromdate rename=(fromdate=lead_fromdate));
lag_family_id=lag(family_id);
lag_fromdate=lag(fromdate);
if family_id=lag_family_id then lagdate=lag_fromdate;
if lagdate=. then lagdate=lead_fromdate;
days=ABS(INTCK('days',lagdate,fromdate));
run;
data three(drop=days);
set two (drop=lag_family_id lag_fromdate lead_fromdate lagdate);
by family_id;
if first.family_id or days>30 then output;
run;
Assumind data is sorted by family ID fromdate:
data want;
set have;
by family_id;
if first.family_id then output;
else do;
lagdate = lag(fromdate); /* previous observarion date */
if fromdate - lagdate > 30 then output;
and;
run;
You said you keep all the three for family_id 130758. Here family_id 130758 has 3 records. Is your criteria when it has more than 2 is to check the date diff with only first and last record.
000130758 01Jan2016
000130758 11Feb2016
000130758 01Mar2016
when you compare first two records then date diff is >30 so keep both. when you compare 2 and 3rd record then date diff is <30 then should keep only 2nd record. Finally it should have only two records.
Correct me if I'm wrong.
Please check this output
data want1 ;
record=_n_+1;
set have end=last;
if not last then set have(keep=fromdate rename=(fromdate=fromdate2)) point=record;
else call missing(fromdate2);
run;
data want2 ;
set want1;
by FAMILY_ID;
lagdate=lag(fromdate);
if last.FAMILY_ID then fromdate2=.;
if first.family_id then lagdate=.;
fromdiff=ifn((fromdate2-fromdate)>30,(fromdate2-fromdate),.);
lagdiff=ifn((fromdate-lagdate)>30,(fromdate-lagdate),.);
result=coalesce(lagdiff,fromdiff);
format lagdate date9.;
if result ne .;
drop fromdate2 lagdate fromdiff lagdiff result;
run;
You want to look both forward and backward to find records within 30 days of a predecessor or successor. This is an appropriate context to use the LAG functions (to see if current record is near its predecessor) and also a MERGE with FIrSTOBS=2 parameter (to see if current record is near to successor).
If the family_id is a character var, then change "dif(family_id)^=0" to "family_id^=lag(family_id)". and an analogous change for "dif(family_id)=0".
data want (drop=nxt:);
merge have
have (firstobs=2 keep=family_id fromdate rename=(family_id=nxtfid fromdate=nxtfd));
if (dif(family_id)^=0 and nxtfid^=family_id) /* Single record*/
or
(nxtfid=family_id and (nxtfd-fromdate)<=30) /* Early member of 30-day pair*/
or
(dif(family_id)=0 and dif(fromdate)<=30) /* Late member of 30-day pair*/
;
run;
I guess your condition dosen't have a proper reason, beacuase consider this group
035064420 | 23Aug2015 |
035064420 | 24Sep2015 |
035064420 | 23Oct2015 |
035064420 | 24Nov2015 |
When you condider row 1 and 2 the difference is >30 so you need both of them. when considering row 2 and 3 difference is <30 then you only need row 2 and not the row 3. Now here comes the problem, when you consider row 3 and row 4 the difference is >30 so youy need both the rows 3 and 4. So according to your condition the row 3 is satsifing and even not satsifing. confusing
So I keep row1, row2 and row4. It is confusing. but this is what my boss wants.
data one;
infile datalines;
input FAMILY_ID :9. FROMDATE date9. ;
Format FROMDATE date9. ;
datalines;
000130758 01Jan2016
000130758 11Feb2016
000130758 01Mar2016
000535847 21Dec2015
000535847 22Dec2015
001808772 01May2016
001808772 02May2016
001808772 03May2016
009084285 30Jan2016
009084285 01Mar2016
035064420 23Aug2015
035064420 24Sep2015
035064420 23Oct2015
035064420 24Nov2015
567835520 24Dec2015
;
run;
proc sort data=one;
by family_id fromdate;
run;
data two (drop=lag_family_id lag_fromdate lead_fromdate lagdate days);
merge one one(firstobs=2 keep=fromdate rename=(fromdate=lead_fromdate));
lag_family_id=lag(family_id);
lag_fromdate=lag(fromdate);
if family_id=lag_family_id then lagdate=lag_fromdate;
if lagdate=. then lagdate=lead_fromdate;
days=ABS(INTCK('days',lagdate,fromdate));
if days>30 or days=. then output;
run;
data one;
infile datalines;
input FAMILY_ID :9. FROMDATE date9. ;
Format FROMDATE date9. ;
datalines;
000130758 01Jan2016
000130758 11Feb2016
000130758 01Mar2016
000535847 21Dec2015
000535847 22Dec2015
001808772 01May2016
001808772 02May2016
001808772 03May2016
009084285 30Jan2016
009084285 01Mar2016
035064420 23Aug2015
035064420 24Sep2015
035064420 23Oct2015
035064420 24Nov2015
567835520 24Dec2015
;
run;
proc sort data=one;
by family_id fromdate;
run;
data two /*(drop=lag_family_id lag_fromdate lead_fromdate lagdate days)*/;
merge one one(firstobs=2 keep=fromdate rename=(fromdate=lead_fromdate));
lag_family_id=lag(family_id);
lag_fromdate=lag(fromdate);
if family_id=lag_family_id then lagdate=lag_fromdate;
if lagdate=. then lagdate=lead_fromdate;
days=ABS(INTCK('days',lagdate,fromdate));
run;
data three(drop=days);
set two (drop=lag_family_id lag_fromdate lead_fromdate lagdate);
by family_id;
if first.family_id or days>30 then output;
run;
I updated my previous code as to get your expectated output
data want1 ;
record=_n_+1;
set have end=last;
if not last then set have(keep=fromdate rename=(fromdate=fromdate2)) point=record;
else call missing(fromdate2);
run;
data want2 ;
set want1;
by FAMILY_ID;
lagdate=lag(fromdate);
if last.FAMILY_ID then fromdate2=.;
if first.family_id then lagdate=.;
fromdiff=fromdate2-fromdate;
lagdiff=fromdate-lagdate;
result=coalesce(lagdiff,fromdiff);
format lagdate date9.;
if first.family_id or result >30;
drop fromdate2 lagdate fromdiff lagdiff result;
run;
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.