BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
echoli
Obsidian | Level 7

Hi All,

 

I have a data below. I want to select rows meet criterias:

      1. if the ID is unique (just one record) then retain it.

      2. if the ID has multiple records, after proc sort data, choose the first record that result >= 30. 

      3. if the ID has multiple records, after proc sort data, if all the results for this ID <30 then retain the largest value.

I can do 1 and 3 criteria, but no idea about 2. Any one can help?

 

data have;

    input ID   Date mmddyy10.  Result;

datalines;

 

14022  6/22/2010        6

14032  6/28/2011        5

14032  2/7/2012          8

14102  4/13/2010        6

14102  6/24/2009        16

14129  11/8/2011        9

14129  12/29/2009      24

14209  9/29/2011        10

14209  9/24/2013        12

14238  8/18/2011        4

14273  5/31/2011        18

14399  1/21/2014        4

14399  4/10/2012        10

14424  11/6/2012        10

14593  11/8/2013        20

14593  5/31/2011        22

14593  1/7/2014          26

14593  9/13/2011        46

14601  8/29/2013        2

14601  2/29/2012        3

14601  1/30/2014        4

14601  11/15/2011      5

14601  7/21/2009        6

14601  8/24/2009        13

14621  11/10/2009      4

14621  1/24/2012        7

14726  11/20/2012      11

14726  10/14/2014      13

14726  3/27/2012        20

14868  10/11/2012      7

14868  5/28/2009        10

15597  7/3/2012          5

15597  8/24/2010        5

17246  6/10/2014        8

17246  10/9/2012        16

17638  6/11/2013        3

17638  1/3/2013          3

17672  5/17/2011        3

17950  5/14/2013        4

17950  3/13/2012        4

17950  4/20/2010        6

17950  5/3/2010          7

17955  12/11/2012      8

17955  10/1/2013        12

17955  4/19/2011        13

17955  11/11/2014      26

 

 ;

 

Thanks

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

Assuming your data is in sorted order then (by ID and date):

 

data want (drop=highest_result highest_date already_output);

set have;

by id;

if first.id then do;

   highest_result = result;

   highest_date = date;

   already_output='N';

end;

retain highest_result highest_date already_output;

if result >= 30 and already_output='N' then do;

   output;

   already_output='Y';

end;

else if result > highest_result then do;

   highest_date = date;

   highest_result = result;

end;

if last.id and already_output='N' then do;

   result = highest_result;

   date = highest_date;

   output;

end;

run;

 

It's untested code, so it might need a little tweaking, but it looks right.

View solution in original post

11 REPLIES 11
Astounding
PROC Star

In your sample data, there is no date related variable.  What are  you sorting by to define "first"?  Couldn't a value of 50 come before a value of 40?

echoli
Obsidian | Level 7

Sorry about that. I've edit the message.

 

Thanks,

kiranv_
Rhodochrosite | Level 12

something like this in SQL by doing union all for each scenario mentioned. 

proc sql;
 create table want as
 /*scenario 1*/
 select id, result from have
 group by 1
 having count(id) =1
 
union all
/*scenario 2*/
  select id, min(result) as result from
( select id, result   from have
 group by 1
 having count(id) gt 1
and min(result) >= 30)a
group by id
 union all
 /*scenario 3*/
  select id, max(result) as result from
( select id, result from have
 group by 1
 having count(id) gt 1
and min(result) < 30)a
group by id
order by id;
quit;
;

  on date column as min or max value

 

proc sql;
create table want as
/*scenario 1*/
select id, result, date format=date9. from have
group by 1
having count(id) =1

union all
/*scenario 2*/
select id, result, min(date) as date from
( select id, result , date from have
group by 1
having count(id) gt 1
and min(result) >= 30)a
group by id
union all
/*scenario 3*/
select id,result, max(date) as date from
( select id, result, date from have
group by 1
having count(id) gt 1
and min(result) < 30)a
group by id
order by id;
quit;

Astounding
PROC Star

Assuming your data is in sorted order then (by ID and date):

 

data want (drop=highest_result highest_date already_output);

set have;

by id;

if first.id then do;

   highest_result = result;

   highest_date = date;

   already_output='N';

end;

retain highest_result highest_date already_output;

if result >= 30 and already_output='N' then do;

   output;

   already_output='Y';

end;

else if result > highest_result then do;

   highest_date = date;

   highest_result = result;

end;

if last.id and already_output='N' then do;

   result = highest_result;

   date = highest_date;

   output;

end;

run;

 

It's untested code, so it might need a little tweaking, but it looks right.

echoli
Obsidian | Level 7

Thanks so much, Astounding. That's exactly what I want.

 

 

echoli
Obsidian | Level 7

Hi Astounding,

 

May I ask a question? in the code you answer me, what is the role for retain statement? we've already create highest_result, highest_date, already_output variables, in my opinion, i can run outcome out without retain statement, but the outcome is wrong. So, what is the role of retain statement in the code?

 

Thanks,

L

Astounding
PROC Star

When you create a new variable in a DATA step, it begins with a missing value.  Your later statements may assign a value, but it begins with a missing value on each observation.

 

RETAIN changes that.  It instructs SAS to hold on to whatever value was left over at the end of the previous observation, instead of re-setting the value to missing for each observation.

echoli
Obsidian | Level 7

Got it! Thanks Astounding!

Can I have one more question: what is the role for    already_output = 'Y'    in the code? 

 

Thanks,

L

Astounding
PROC Star

ALREADY_OUTPUT gets set for each ID. 

 

Two of the stipulations were:

 

  • Output just a single observation per ID
  • Output the first observation with a value >= 30

So ALREADY_OUTPUT tracks whether an observation was already output for that ID.  If another one comes along that has a value of >= 30, the program can now determine that it should skip the second (or 3rd or 4th, etc.) one because the DATA step has already output an observation for that ID.

echoli
Obsidian | Level 7

Got it!!! Thanks for your help, Astounding!!

 

 

ChrisNZ
Tourmaline | Level 20

Like this?

 

proc sort data=HAVE; 
  by ID RESULT;
data WANT; 
  retain IDSAVED;
  set HAVE;
  by ID ;
  if first.ID then IDSAVED=0;
  if IDSAVED then return;
  if last.ID or RESULT >= 30;
  IDSAVED=1;
run;
 

 

 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 9120 views
  • 0 likes
  • 4 in conversation