BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
zhuxiaoyan1
Quartz | Level 8

The following is part of my data: If you can't see this because of format issue, please see the attached data.

FAMILY_IDFROMDATE
00013075801Jan2016
00013075811Feb2016
00013075801Mar2016
00053584721Dec2015
00053584722Dec2015
00180877201May2016
00180877202May2016
00180877203May2016
00908428530Jan2016
00908428501Mar2016
03506442023Aug2015
03506442024Sep2015
03506442023Oct2015
03506442024Nov2015
56783552024Dec2015

keep the fromdate If each distinct family_ID has only one fromdate. If each distinct family_ID has multiple fromdate, then compare the later fromdate and earlier fromdate. If the difference between later fromdate and earlier fromdate is > 30 then keep both date, then keep both fromdate; if the difference between later fromdate and earlier fromdate is <= 30, then keep only the earlier fromdate. For example, for family_ID  000130758, I keep all of three fromdate; for family_ID 000535847, I only keep the earlier fromdate 21Dec2015 and delete the fromdate 22Dec2015.

I appreciate any solutions and/or suggestions!

 
1 ACCEPTED SOLUTION

Accepted Solutions
SuryaKiran
Meteorite | Level 14

data one;

infile datalines;

input FAMILY_ID :9. FROMDATE date9. ;

Format FROMDATE date9. ;

datalines;

000130758 01Jan2016

000130758 11Feb2016

000130758 01Mar2016

000535847 21Dec2015

000535847 22Dec2015

001808772 01May2016

001808772 02May2016

001808772 03May2016

009084285 30Jan2016

009084285 01Mar2016

035064420 23Aug2015

035064420 24Sep2015

035064420 23Oct2015

035064420 24Nov2015

567835520 24Dec2015

;

run;

proc sort data=one;

by family_id fromdate;

run;

data two /*(drop=lag_family_id lag_fromdate lead_fromdate lagdate days)*/;

merge one one(firstobs=2 keep=fromdate rename=(fromdate=lead_fromdate));

lag_family_id=lag(family_id);

lag_fromdate=lag(fromdate);

if family_id=lag_family_id then lagdate=lag_fromdate;

if lagdate=. then lagdate=lead_fromdate;

days=ABS(INTCK('days',lagdate,fromdate));

run;

data three(drop=days);

set two (drop=lag_family_id lag_fromdate lead_fromdate lagdate);

by family_id;

if first.family_id or days>30 then output;

run;

Thanks,
Suryakiran

View solution in original post

14 REPLIES 14
Shmuel
Garnet | Level 18

Assumind data is sorted by family ID  fromdate:

 

data want;

 set have;

       by family_id;

           if first.family_id then output;

           else do;

                  lagdate = lag(fromdate);  /* previous observarion date */

                 if fromdate - lagdate > 30 then output;

         and;

run;

 

SuryaKiran
Meteorite | Level 14

You said you keep all the three for family_id 130758. Here family_id 130758 has 3 records. Is your criteria when it has more than 2 is to check the date diff with only first and last record.


000130758 01Jan2016
000130758 11Feb2016
000130758 01Mar2016

 

when you compare first two records then date diff is >30 so keep both. when you compare 2 and 3rd record then date diff is <30 then should keep only 2nd record. Finally it should have only two records.

 

Correct me if I'm wrong.

Thanks,
Suryakiran
zhuxiaoyan1
Quartz | Level 8
Yes, you are right. It should have only 2 records for family_ID 00130758. I said it wrong. I'm very sorry about this.

Xiaoyan Zhu
Jagadishkatam
Amethyst | Level 16

Please check this output

data want1 ;
record=_n_+1;
set have end=last;
if not last then set have(keep=fromdate rename=(fromdate=fromdate2)) point=record;
else call missing(fromdate2);
run;

data want2 ;
set want1;
by FAMILY_ID;
lagdate=lag(fromdate);
if last.FAMILY_ID then fromdate2=.;
if first.family_id then lagdate=.;
fromdiff=ifn((fromdate2-fromdate)>30,(fromdate2-fromdate),.);
lagdiff=ifn((fromdate-lagdate)>30,(fromdate-lagdate),.);
result=coalesce(lagdiff,fromdiff);
format lagdate date9.;
if result ne .;
drop fromdate2 lagdate fromdiff lagdiff result;
run;
Thanks,
Jag
mkeintz
PROC Star

You want to look both forward and backward to find records within 30 days of a predecessor or successor.   This is an appropriate context to use the LAG functions  (to see if current record is near its predecessor) and also a MERGE with FIrSTOBS=2 parameter (to  see if current record is near to successor).

 

If the family_id is a character var, then change "dif(family_id)^=0" to "family_id^=lag(family_id)".  and an analogous change for "dif(family_id)=0".

 

data want (drop=nxt:);

merge have

      have (firstobs=2 keep=family_id fromdate rename=(family_id=nxtfid fromdate=nxtfd));

 

if (dif(family_id)^=0 and nxtfid^=family_id) /* Single record*/

  or

   (nxtfid=family_id and (nxtfd-fromdate)<=30) /* Early member of 30-day pair*/

  or

   (dif(family_id)=0 and dif(fromdate)<=30) /* Late member of 30-day pair*/

  ;

run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
zhuxiaoyan1
Quartz | Level 8
You have received a secure message from "Zhu, Xiaoyan" entitled, "Re: by group processing and loop".

You can view the message (before 01/04/2017) at the following web address:
https://mg.usablecs.com/enduser/msg.html?x=d-07507df3c11041317d318b077abb618f9cee1286b2d1eef009b5d59...


-----------------------------------------------------------------------------------------
Delivered with MailGate SC (TM)
http://www.axway.com
MailGate SC is a trademark of Axway.
SuryaKiran
Meteorite | Level 14

I guess your condition dosen't have a proper reason, beacuase consider this group

03506442023Aug2015
03506442024Sep2015
03506442023Oct2015
03506442024Nov2015

When you condider row 1 and 2 the difference is >30 so you need both of them. when considering row 2 and 3 difference is <30 then you only need row 2 and not the row 3. Now here comes the problem, when you consider row 3 and row 4 the difference is >30 so youy need both the rows 3 and 4. So according to your condition the row 3 is satsifing and even not satsifing. confusingSmiley Surprised

 

Thanks,
Suryakiran
zhuxiaoyan1
Quartz | Level 8

So I keep row1, row2 and row4. It is confusing. but this is what my boss wants.Smiley Frustrated

SuryaKiran
Meteorite | Level 14

data one;

infile datalines;

input FAMILY_ID :9. FROMDATE date9. ;

Format FROMDATE date9. ;

datalines;

000130758 01Jan2016

000130758 11Feb2016

000130758 01Mar2016

000535847 21Dec2015

000535847 22Dec2015

001808772 01May2016

001808772 02May2016

001808772 03May2016

009084285 30Jan2016

009084285 01Mar2016

035064420 23Aug2015

035064420 24Sep2015

035064420 23Oct2015

035064420 24Nov2015

567835520 24Dec2015

;

run;

proc sort data=one;

by family_id fromdate;

run;

data two (drop=lag_family_id lag_fromdate lead_fromdate lagdate days);

merge one one(firstobs=2 keep=fromdate rename=(fromdate=lead_fromdate));

lag_family_id=lag(family_id);

lag_fromdate=lag(fromdate);

if family_id=lag_family_id then lagdate=lag_fromdate;

if lagdate=. then lagdate=lead_fromdate;

days=ABS(INTCK('days',lagdate,fromdate));

if days>30 or days=. then output;

run;

Thanks,
Suryakiran
zhuxiaoyan1
Quartz | Level 8
I also need these two rowes:
000535847 21Dec2015
001808772 01May2016
SuryaKiran
Meteorite | Level 14

data one;

infile datalines;

input FAMILY_ID :9. FROMDATE date9. ;

Format FROMDATE date9. ;

datalines;

000130758 01Jan2016

000130758 11Feb2016

000130758 01Mar2016

000535847 21Dec2015

000535847 22Dec2015

001808772 01May2016

001808772 02May2016

001808772 03May2016

009084285 30Jan2016

009084285 01Mar2016

035064420 23Aug2015

035064420 24Sep2015

035064420 23Oct2015

035064420 24Nov2015

567835520 24Dec2015

;

run;

proc sort data=one;

by family_id fromdate;

run;

data two /*(drop=lag_family_id lag_fromdate lead_fromdate lagdate days)*/;

merge one one(firstobs=2 keep=fromdate rename=(fromdate=lead_fromdate));

lag_family_id=lag(family_id);

lag_fromdate=lag(fromdate);

if family_id=lag_family_id then lagdate=lag_fromdate;

if lagdate=. then lagdate=lead_fromdate;

days=ABS(INTCK('days',lagdate,fromdate));

run;

data three(drop=days);

set two (drop=lag_family_id lag_fromdate lead_fromdate lagdate);

by family_id;

if first.family_id or days>30 then output;

run;

Thanks,
Suryakiran
zhuxiaoyan1
Quartz | Level 8
Thank you so much!

Xiaoyan
Jagadishkatam
Amethyst | Level 16

I updated my previous code as to get your expectated output

 

data want1 ;
record=_n_+1;
set have end=last;
if not last then set have(keep=fromdate rename=(fromdate=fromdate2)) point=record;
else call missing(fromdate2);
run;

data want2 ;
set want1;
by FAMILY_ID;
lagdate=lag(fromdate);
if last.FAMILY_ID then fromdate2=.;
if first.family_id then lagdate=.;
fromdiff=fromdate2-fromdate;
lagdiff=fromdate-lagdate;
result=coalesce(lagdiff,fromdiff);
format lagdate date9.;
if first.family_id or result >30;
drop fromdate2 lagdate fromdiff lagdiff result;
run;

 

Thanks,
Jag
zhuxiaoyan1
Quartz | Level 8
It works perfectly! Thank you very much!

Xiaoyan

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 14 replies
  • 2014 views
  • 0 likes
  • 5 in conversation