possibly a proc sql question? (previously: "multistep if/then logic")

Accepted Solution Solved
Reply
Contributor
Posts: 25
Accepted Solution

possibly a proc sql question? (previously: "multistep if/then logic")

Hi, I have been working on reorganizing a SAS dataset that has many columns and rows.  I've made a dummy "dataset 1" here as an example.  Dummy "dataset 1":

patient          fruit          [many other columns]

patientA          apple          ...

patientA          apple          ...

patientA          mango          ...

patientB          apple          ...

patientC          cherry          ...

patientC          grape          ....

 

I have figured out how to create a new dataset containing only the entries that have fruit = apple.  I.e., "dataset 2":

patient          fruit          [many other columns]

patientA          apple          ...

patientA          apple          ...

patientB          apple          ....

 

I'm wondering if anyone could please teach me how to create a "dataset 3" from "dataset 1," that contains all of the entries for patients who have ever had fruit = apple more than once.  In other words, I'm interested in examining a "dataset 3" that looks like this:

patient          fruit          [many other columns]

patientA          apple          ...

patientA          apple          ...

patientA          mango          ...

 

Thanks for your help and patience.  I also apologize in advance to anyone who also saw this question earlier today.  I posted this as a follow-up question / reply to a solved question & wonder if that is limiting who can view the question, so I'm posting it as a new post here.


Accepted Solutions
Solution
‎02-29-2016 04:28 PM
Valued Guide
Posts: 856

Re: possibly a proc sql question? (previously: "multistep if/then logic")

You'll always get a quicker result if you type in your 'have' table like I have it.  Here you go:

data have;
input patient$          fruit$;
cards;
patientA          apple      
patientA          apple        
patientA          mango          
patientB          apple        
patientC          cherry        
patientC          grape
;run;

data want;
do until (last.patient);
set have;
by patient;
if fruit = 'apple' then count +1;
if first.patient then count = 1;
end;
do until (last.patient);
set have;
by patient;
if count > 1 then output;
end;
drop count;
run;

View solution in original post


All Replies
Solution
‎02-29-2016 04:28 PM
Valued Guide
Posts: 856

Re: possibly a proc sql question? (previously: "multistep if/then logic")

You'll always get a quicker result if you type in your 'have' table like I have it.  Here you go:

data have;
input patient$          fruit$;
cards;
patientA          apple      
patientA          apple        
patientA          mango          
patientB          apple        
patientC          cherry        
patientC          grape
;run;

data want;
do until (last.patient);
set have;
by patient;
if fruit = 'apple' then count +1;
if first.patient then count = 1;
end;
do until (last.patient);
set have;
by patient;
if count > 1 then output;
end;
drop count;
run;

Contributor
Posts: 25

Re: possibly a proc sql question? (previously: "multistep if/then logic")

Hi.  Thank you, this is working.  It's also very helpful that this preserves my sorting.  So after this step I thought I could do a proc sql to get "dataset4," but this this doesn't seem to be working.  I think I might not understand how to use proc sql.  Here's what I'm trying to do:

 

I have "dataset1" that looks like this (I've added in a patient D to this dummy dataset now, to illustrate my problem):

patient          fruit          [many other columns]

patientA          apple          ...

patientA          apple          ...

patientA          mango          ...

patientB          apple          ...

patientC          cherry          ...

patientC          grape          ....

patientD          grape          ....

patientD          apple          ...

patientD          apple          ...

patientD          apple          ...

 

Using your solution, I got "dataset3," which looks like this:

patient          fruit          [many other columns]

patientA          apple          ...

patientA          apple          ...

patientA          mango          ...

patientD          grape          ....

patientD          apple          ...

patientD          apple          ...

patientD          apple          ...

 

I had planned to next use the following lines:

 

proc sql ;
create table dataset4 as
select *
from dataset3
where patient in(
select patient from dataset1 where fruit ^= 'mango' and fruit ^= 'peach')
;
quit;

 

To get a "dataset4" that looks like this:

patient          fruit          [many other columns]

patientD          grape          ....

patientD          apple          ...

patientD          apple          ...

patientD          apple          ...

 

In other words, I want to remove from "dataset3" all patients for whom the "fruit" entry is at any point mango or peach.  Thoughts?

Trusted Advisor
Posts: 1,114

Re: possibly a proc sql question? (previously: "multistep if/then logic")

[ Edited ]

Hi @beginner,

please find below a modified version of your PROC SQL code, which will do what you want.

proc sql;
create table dataset4 as
select *
from dataset3
where patient ^in(
select distinct patient from dataset3 where fruit = 'mango' or fruit = 'peach')
;
quit;

Edit: Explanations:

The subquery of your PROC SQL step looks at each observation of DATASET1 separately and selects the value of variable PATIENT from it whenever FRUIT is neither 'mango' nor 'peach'. But this can easily bring in patients who do have 'mango' or 'peach' -- in other observations (example: patientA). A single non-('mango' or 'peach') observation is sufficient to make this happen.

 

Therefore, I changed the logic to exclude (via ^in, i.e. not in) patients who do have 'mango' or 'peach' in any observation. Just to improve performance I restricted the search (of the subquery) to DATASET3, because this is our "universe" in the main query, and added the DISTINCT keyword in order to keep the result of the subquery as small as possible.

 

Alternatively, one could apply the DOW loop technique (as in Steelers_In_DC's solution) again instead of PROC SQL. If DATASET3 is only used as the basis for the creation of DATASET4, it would be even easier to take the step from DATASET1 to DATASET4 directly. Simply add the exclusion criterion to the existing solution, as shown below:

 

data dataset4;
do until (last.patient);
  set dataset1;
  by patient;
  if fruit = 'apple' then count=sum(count,1);
  else if fruit in ('mango', 'peach') then excl=1;
end;
do until (last.patient);
  set dataset1;
  by patient;
  if count > 1 & ^excl then output;
end;
drop count excl;
run;

As a minor tweak (just for "optimization", not necessary) I changed the way how variable COUNT is incremented: The original "count+1" (a sum statement) causes an implicit RETAIN for variable COUNT. As a consequence, @Steelers_In_DC had to reset COUNT to 1 at the beginning of each BY group in order to count the "apples" for each patient separately. We need COUNT to be retained only within the BY groups and this is guaranteed already by the DOW loop technique (do until(last.xxx); set ...), because each BY group is processed within one iteration of the data step.

 

However, we need another feature of the sum statement: the implicit initialization of the variable (here: COUNT) to zero. The SUM function shares this feature with the sum statement. So, I used that together with an ordinary assignment statement and deleted "if first.patient then count = 1;". (An assignment statement of the form count=count+1 would not work without additional code, because COUNT would start with a missing value at the beginning of each BY group and "missing + 1 is missing.")

 

 

Likewise, @PGStats's PROC SQL approach could be amended easily to exclude the 'mango'-or-'peach' patients by using the following HAVING clause:

having sum(fruit="apple") >= 2 & sum(fruit in ("mango" "peach"))=0;

 

Contributor
Posts: 25

Re: possibly a proc sql question? (previously: "multistep if/then logic")

Thank you, @!

Respected Advisor
Posts: 4,609

Re: possibly a proc sql question? (previously: "multistep if/then logic")

A rather simple SQL query:

 


proc sql;
create table likesApples as
select * 
from have
group by patient
having sum(fruit="apple") >= 2;
quit;
PG
Trusted Advisor
Posts: 1,114

Re: possibly a proc sql question? (previously: "multistep if/then logic")

Unlike the data step (DOW loop) solution, PROC SQL does not preserve the original sort order of observations. But, if this was an issue, given the "many" columns in HAVE, it should be possible to build a suitable ORDER BY clause.

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 6 replies
  • 453 views
  • 0 likes
  • 4 in conversation