Hi All,
I have a data below. I want to select rows meet criterias:
1. if the ID is unique (just one record) then retain it.
2. if the ID has multiple records, after proc sort data, choose the first record that result >= 30.
3. if the ID has multiple records, after proc sort data, if all the results for this ID <30 then retain the largest value.
I can do 1 and 3 criteria, but no idea about 2. Any one can help?
data have;
input ID Date mmddyy10. Result;
datalines;
14022 6/22/2010 6
14032 6/28/2011 5
14032 2/7/2012 8
14102 4/13/2010 6
14102 6/24/2009 16
14129 11/8/2011 9
14129 12/29/2009 24
14209 9/29/2011 10
14209 9/24/2013 12
14238 8/18/2011 4
14273 5/31/2011 18
14399 1/21/2014 4
14399 4/10/2012 10
14424 11/6/2012 10
14593 11/8/2013 20
14593 5/31/2011 22
14593 1/7/2014 26
14593 9/13/2011 46
14601 8/29/2013 2
14601 2/29/2012 3
14601 1/30/2014 4
14601 11/15/2011 5
14601 7/21/2009 6
14601 8/24/2009 13
14621 11/10/2009 4
14621 1/24/2012 7
14726 11/20/2012 11
14726 10/14/2014 13
14726 3/27/2012 20
14868 10/11/2012 7
14868 5/28/2009 10
15597 7/3/2012 5
15597 8/24/2010 5
17246 6/10/2014 8
17246 10/9/2012 16
17638 6/11/2013 3
17638 1/3/2013 3
17672 5/17/2011 3
17950 5/14/2013 4
17950 3/13/2012 4
17950 4/20/2010 6
17950 5/3/2010 7
17955 12/11/2012 8
17955 10/1/2013 12
17955 4/19/2011 13
17955 11/11/2014 26
;
Thanks
Assuming your data is in sorted order then (by ID and date):
data want (drop=highest_result highest_date already_output);
set have;
by id;
if first.id then do;
highest_result = result;
highest_date = date;
already_output='N';
end;
retain highest_result highest_date already_output;
if result >= 30 and already_output='N' then do;
output;
already_output='Y';
end;
else if result > highest_result then do;
highest_date = date;
highest_result = result;
end;
if last.id and already_output='N' then do;
result = highest_result;
date = highest_date;
output;
end;
run;
It's untested code, so it might need a little tweaking, but it looks right.
In your sample data, there is no date related variable. What are you sorting by to define "first"? Couldn't a value of 50 come before a value of 40?
Sorry about that. I've edit the message.
Thanks,
something like this in SQL by doing union all for each scenario mentioned.
proc sql; create table want as /*scenario 1*/ select id, result from have group by 1 having count(id) =1 union all /*scenario 2*/ select id, min(result) as result from ( select id, result from have group by 1 having count(id) gt 1 and min(result) >= 30)a group by id union all /*scenario 3*/ select id, max(result) as result from ( select id, result from have group by 1 having count(id) gt 1 and min(result) < 30)a group by id order by id; quit; ;
on date column as min or max value
proc sql;
 create table want as
 /*scenario 1*/
 select id, result, date format=date9. from have
 group by 1
 having count(id) =1
 
union all
/*scenario 2*/
 select id, result, min(date) as date from
( select id, result , date from have
 group by 1
 having count(id) gt 1
and min(result) >= 30)a
group by id
 union all
 /*scenario 3*/
 select id,result, max(date) as date from
( select id, result, date from have
 group by 1
 having count(id) gt 1
and min(result) < 30)a
group by id
order by id;
quit;
Assuming your data is in sorted order then (by ID and date):
data want (drop=highest_result highest_date already_output);
set have;
by id;
if first.id then do;
highest_result = result;
highest_date = date;
already_output='N';
end;
retain highest_result highest_date already_output;
if result >= 30 and already_output='N' then do;
output;
already_output='Y';
end;
else if result > highest_result then do;
highest_date = date;
highest_result = result;
end;
if last.id and already_output='N' then do;
result = highest_result;
date = highest_date;
output;
end;
run;
It's untested code, so it might need a little tweaking, but it looks right.
Thanks so much, Astounding. That's exactly what I want.
Hi Astounding,
May I ask a question? in the code you answer me, what is the role for retain statement? we've already create highest_result, highest_date, already_output variables, in my opinion, i can run outcome out without retain statement, but the outcome is wrong. So, what is the role of retain statement in the code?
Thanks,
L
When you create a new variable in a DATA step, it begins with a missing value. Your later statements may assign a value, but it begins with a missing value on each observation.
RETAIN changes that. It instructs SAS to hold on to whatever value was left over at the end of the previous observation, instead of re-setting the value to missing for each observation.
Got it! Thanks Astounding!
Can I have one more question: what is the role for already_output = 'Y' in the code?
Thanks,
L
ALREADY_OUTPUT gets set for each ID.
Two of the stipulations were:
So ALREADY_OUTPUT tracks whether an observation was already output for that ID. If another one comes along that has a value of >= 30, the program can now determine that it should skip the second (or 3rd or 4th, etc.) one because the DATA step has already output an observation for that ID.
Got it!!! Thanks for your help, Astounding!!
Like this?
proc sort data=HAVE; 
  by ID RESULT;
data WANT; 
  retain IDSAVED;
  set HAVE;
  by ID ;
  if first.ID then IDSAVED=0;
  if IDSAVED then return;
  if last.ID or RESULT >= 30;
  IDSAVED=1;
run;
 
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
