DATA Step, Macro, Functions and more

select row in multiple rows by several conditions

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 76
Accepted Solution

select row in multiple rows by several conditions

[ Edited ]

Hi All,

 

I have a data below. I want to select rows meet criterias:

      1. if the ID is unique (just one record) then retain it.

      2. if the ID has multiple records, after proc sort data, choose the first record that result >= 30. 

      3. if the ID has multiple records, after proc sort data, if all the results for this ID <30 then retain the largest value.

I can do 1 and 3 criteria, but no idea about 2. Any one can help?

 

data have;

    input ID   Date mmddyy10.  Result;

datalines;

 

14022  6/22/2010        6

14032  6/28/2011        5

14032  2/7/2012          8

14102  4/13/2010        6

14102  6/24/2009        16

14129  11/8/2011        9

14129  12/29/2009      24

14209  9/29/2011        10

14209  9/24/2013        12

14238  8/18/2011        4

14273  5/31/2011        18

14399  1/21/2014        4

14399  4/10/2012        10

14424  11/6/2012        10

14593  11/8/2013        20

14593  5/31/2011        22

14593  1/7/2014          26

14593  9/13/2011        46

14601  8/29/2013        2

14601  2/29/2012        3

14601  1/30/2014        4

14601  11/15/2011      5

14601  7/21/2009        6

14601  8/24/2009        13

14621  11/10/2009      4

14621  1/24/2012        7

14726  11/20/2012      11

14726  10/14/2014      13

14726  3/27/2012        20

14868  10/11/2012      7

14868  5/28/2009        10

15597  7/3/2012          5

15597  8/24/2010        5

17246  6/10/2014        8

17246  10/9/2012        16

17638  6/11/2013        3

17638  1/3/2013          3

17672  5/17/2011        3

17950  5/14/2013        4

17950  3/13/2012        4

17950  4/20/2010        6

17950  5/3/2010          7

17955  12/11/2012      8

17955  10/1/2013        12

17955  4/19/2011        13

17955  11/11/2014      26

 

 ;

 

Thanks


Accepted Solutions
Solution
‎06-16-2017 11:53 AM
Super User
Posts: 5,072

Re: select row in multiple rows by several conditions

[ Edited ]

Assuming your data is in sorted order then (by ID and date):

 

data want (drop=highest_result highest_date already_output);

set have;

by id;

if first.id then do;

   highest_result = result;

   highest_date = date;

   already_output='N';

end;

retain highest_result highest_date already_output;

if result >= 30 and already_output='N' then do;

   output;

   already_output='Y';

end;

else if result > highest_result then do;

   highest_date = date;

   highest_result = result;

end;

if last.id and already_output='N' then do;

   result = highest_result;

   date = highest_date;

   output;

end;

run;

 

It's untested code, so it might need a little tweaking, but it looks right.

View solution in original post


All Replies
Super User
Posts: 5,072

Re: select row in multiple rows by several conditions

In your sample data, there is no date related variable.  What are  you sorting by to define "first"?  Couldn't a value of 50 come before a value of 40?

Frequent Contributor
Posts: 76

Re: select row in multiple rows by several conditions

Sorry about that. I've edit the message.

 

Thanks,

PROC Star
Posts: 251

Re: select row in multiple rows by several conditions

[ Edited ]

something like this in SQL by doing union all for each scenario mentioned. 

proc sql;
 create table want as
 /*scenario 1*/
 select id, result from have
 group by 1
 having count(id) =1
 
union all
/*scenario 2*/
  select id, min(result) as result from
( select id, result   from have
 group by 1
 having count(id) gt 1
and min(result) >= 30)a
group by id
 union all
 /*scenario 3*/
  select id, max(result) as result from
( select id, result from have
 group by 1
 having count(id) gt 1
and min(result) < 30)a
group by id
order by id;
quit;
;

  on date column as min or max value

 

proc sql;
create table want as
/*scenario 1*/
select id, result, date format=date9. from have
group by 1
having count(id) =1

union all
/*scenario 2*/
select id, result, min(date) as date from
( select id, result , date from have
group by 1
having count(id) gt 1
and min(result) >= 30)a
group by id
union all
/*scenario 3*/
select id,result, max(date) as date from
( select id, result, date from have
group by 1
having count(id) gt 1
and min(result) < 30)a
group by id
order by id;
quit;

Solution
‎06-16-2017 11:53 AM
Super User
Posts: 5,072

Re: select row in multiple rows by several conditions

[ Edited ]

Assuming your data is in sorted order then (by ID and date):

 

data want (drop=highest_result highest_date already_output);

set have;

by id;

if first.id then do;

   highest_result = result;

   highest_date = date;

   already_output='N';

end;

retain highest_result highest_date already_output;

if result >= 30 and already_output='N' then do;

   output;

   already_output='Y';

end;

else if result > highest_result then do;

   highest_date = date;

   highest_result = result;

end;

if last.id and already_output='N' then do;

   result = highest_result;

   date = highest_date;

   output;

end;

run;

 

It's untested code, so it might need a little tweaking, but it looks right.

Frequent Contributor
Posts: 76

Re: select row in multiple rows by several conditions

Thanks so much, Astounding. That's exactly what I want.

 

 

Frequent Contributor
Posts: 76

Re: select row in multiple rows by several conditions

Hi Astounding,

 

May I ask a question? in the code you answer me, what is the role for retain statement? we've already create highest_result, highest_date, already_output variables, in my opinion, i can run outcome out without retain statement, but the outcome is wrong. So, what is the role of retain statement in the code?

 

Thanks,

L

Super User
Posts: 5,072

Re: select row in multiple rows by several conditions

When you create a new variable in a DATA step, it begins with a missing value.  Your later statements may assign a value, but it begins with a missing value on each observation.

 

RETAIN changes that.  It instructs SAS to hold on to whatever value was left over at the end of the previous observation, instead of re-setting the value to missing for each observation.

Frequent Contributor
Posts: 76

Re: select row in multiple rows by several conditions

Got it! Thanks Astounding!

Can I have one more question: what is the role for    already_output = 'Y'    in the code? 

 

Thanks,

L

Super User
Posts: 5,072

Re: select row in multiple rows by several conditions

ALREADY_OUTPUT gets set for each ID. 

 

Two of the stipulations were:

 

  • Output just a single observation per ID
  • Output the first observation with a value >= 30

So ALREADY_OUTPUT tracks whether an observation was already output for that ID.  If another one comes along that has a value of >= 30, the program can now determine that it should skip the second (or 3rd or 4th, etc.) one because the DATA step has already output an observation for that ID.

Frequent Contributor
Posts: 76

Re: select row in multiple rows by several conditions

Got it!!! Thanks for your help, Astounding!!

 

 

PROC Star
Posts: 1,558

Re: select row in multiple rows by several conditions

[ Edited ]

Like this?

 

proc sort data=HAVE; 
  by ID RESULT;
data WANT; 
  retain IDSAVED;
  set HAVE;
  by ID ;
  if first.ID then IDSAVED=0;
  if IDSAVED then return;
  if last.ID or RESULT >= 30;
  IDSAVED=1;
run;
 

 

 

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 11 replies
  • 221 views
  • 0 likes
  • 4 in conversation