Solved: Re: select row in multiple rows by several conditions

echoli · Posted 06-15-2017 04:52 PM

Hi All,

I have a data below. I want to select rows meet criterias:

1. if the ID is unique (just one record) then retain it.

2. if the ID has multiple records, after proc sort data, choose the first record that result >= 30.

3. if the ID has multiple records, after proc sort data, if all the results for this ID <30 then retain the largest value.

I can do 1 and 3 criteria, but no idea about 2. Any one can help?

data have;

input ID Date mmddyy10. Result;

datalines;

14022 6/22/2010 6

14032 6/28/2011 5

14032 2/7/2012 8

14102 4/13/2010 6

14102 6/24/2009 16

14129 11/8/2011 9

14129 12/29/2009 24

14209 9/29/2011 10

14209 9/24/2013 12

14238 8/18/2011 4

14273 5/31/2011 18

14399 1/21/2014 4

14399 4/10/2012 10

14424 11/6/2012 10

14593 11/8/2013 20

14593 5/31/2011 22

14593 1/7/2014 26

14593 9/13/2011 46

14601 8/29/2013 2

14601 2/29/2012 3

14601 1/30/2014 4

14601 11/15/2011 5

14601 7/21/2009 6

14601 8/24/2009 13

14621 11/10/2009 4

14621 1/24/2012 7

14726 11/20/2012 11

14726 10/14/2014 13

14726 3/27/2012 20

14868 10/11/2012 7

14868 5/28/2009 10

15597 7/3/2012 5

15597 8/24/2010 5

17246 6/10/2014 8

17246 10/9/2012 16

17638 6/11/2013 3

17638 1/3/2013 3

17672 5/17/2011 3

17950 5/14/2013 4

17950 3/13/2012 4

17950 4/20/2010 6

17950 5/3/2010 7

17955 12/11/2012 8

17955 10/1/2013 12

17955 4/19/2011 13

17955 11/11/2014 26

;

Thanks

Astounding · Posted 06-15-2017 05:51 PM

Assuming your data is in sorted order then (by ID and date):

data want (drop=highest_result highest_date already_output);

set have;

by id;

if first.id then do;

highest_result = result;

highest_date = date;

already_output='N';

end;

retain highest_result highest_date already_output;

if result >= 30 and already_output='N' then do;

output;

already_output='Y';

end;

else if result > highest_result then do;

highest_date = date;

highest_result = result;

end;

if last.id and already_output='N' then do;

result = highest_result;

date = highest_date;

output;

end;

run;

It's untested code, so it might need a little tweaking, but it looks right.

View solution in original post

Astounding · Posted 06-15-2017 05:25 PM

In your sample data, there is no date related variable. What are you sorting by to define "first"? Couldn't a value of 50 come before a value of 40?

echoli · Posted 06-15-2017 05:34 PM

Sorry about that. I've edit the message.

Thanks,

kiranv_ · Posted 06-15-2017 05:46 PM

something like this in SQL by doing union all for each scenario mentioned.

proc sql;
 create table want as
 /*scenario 1*/
 select id, result from have
 group by 1
 having count(id) =1
 
union all
/*scenario 2*/
  select id, min(result) as result from
( select id, result   from have
 group by 1
 having count(id) gt 1
and min(result) >= 30)a
group by id
 union all
 /*scenario 3*/
  select id, max(result) as result from
( select id, result from have
 group by 1
 having count(id) gt 1
and min(result) < 30)a
group by id
order by id;
quit;
;

on date column as min or max value

proc sql;
create table want as
/*scenario 1*/
select id, result, date format=date9. from have
group by 1
having count(id) =1

union all
/*scenario 2*/
select id, result, min(date) as date from
( select id, result , date from have
group by 1
having count(id) gt 1
and min(result) >= 30)a
group by id
union all
/*scenario 3*/
select id,result, max(date) as date from
( select id, result, date from have
group by 1
having count(id) gt 1
and min(result) < 30)a
group by id
order by id;
quit;

Astounding · Posted 06-15-2017 05:51 PM

Assuming your data is in sorted order then (by ID and date):

data want (drop=highest_result highest_date already_output);

set have;

by id;

if first.id then do;

highest_result = result;

highest_date = date;

already_output='N';

end;

retain highest_result highest_date already_output;

if result >= 30 and already_output='N' then do;

output;

already_output='Y';

end;

else if result > highest_result then do;

highest_date = date;

highest_result = result;

end;

if last.id and already_output='N' then do;

result = highest_result;

date = highest_date;

output;

end;

run;

It's untested code, so it might need a little tweaking, but it looks right.

echoli · Posted 06-16-2017 11:53 AM

Thanks so much, Astounding. That's exactly what I want.

echoli · Posted 06-16-2017 12:46 PM

Hi Astounding,

May I ask a question? in the code you answer me, what is the role for retain statement? we've already create highest_result, highest_date, already_output variables, in my opinion, i can run outcome out without retain statement, but the outcome is wrong. So, what is the role of retain statement in the code?

Thanks,

L

Astounding · Posted 06-16-2017 01:10 PM

When you create a new variable in a DATA step, it begins with a missing value. Your later statements may assign a value, but it begins with a missing value on each observation.

RETAIN changes that. It instructs SAS to hold on to whatever value was left over at the end of the previous observation, instead of re-setting the value to missing for each observation.

echoli · Posted 06-16-2017 03:44 PM

Got it! Thanks Astounding!

Can I have one more question: what is the role for already_output = 'Y' in the code?

Thanks,

L

Astounding · Posted 06-16-2017 03:54 PM

ALREADY_OUTPUT gets set for each ID.

Two of the stipulations were:

Output just a single observation per ID
Output the first observation with a value >= 30

So ALREADY_OUTPUT tracks whether an observation was already output for that ID. If another one comes along that has a value of >= 30, the program can now determine that it should skip the second (or 3rd or 4th, etc.) one because the DATA step has already output an observation for that ID.

echoli · Posted 06-16-2017 05:01 PM

Got it!!! Thanks for your help, Astounding!!

ChrisNZ · Posted 06-15-2017 06:34 PM

Like this?

proc sort data=HAVE; 
  by ID RESULT;
data WANT; 
  retain IDSAVED;
  set HAVE;
  by ID ;
  if first.ID then IDSAVED=0;
  if IDSAVED then return;
  if last.ID or RESULT >= 30;
  IDSAVED=1;
run;

High-Performance SAS Coding - Third Edition

Registration is open

SAS Training: Just a Click Away