BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
beginner
Calcite | Level 5

Hi, I have been working on reorganizing a SAS dataset that has many columns and rows.  I've made a dummy "dataset 1" here as an example.  Dummy "dataset 1":

patient          fruit          [many other columns]

patientA          apple          ...

patientA          apple          ...

patientA          mango          ...

patientB          apple          ...

patientC          cherry          ...

patientC          grape          ....

 

I have figured out how to create a new dataset containing only the entries that have fruit = apple.  I.e., "dataset 2":

patient          fruit          [many other columns]

patientA          apple          ...

patientA          apple          ...

patientB          apple          ....

 

I'm wondering if anyone could please teach me how to create a "dataset 3" from "dataset 1," that contains all of the entries for patients who have ever had fruit = apple more than once.  In other words, I'm interested in examining a "dataset 3" that looks like this:

patient          fruit          [many other columns]

patientA          apple          ...

patientA          apple          ...

patientA          mango          ...

 

Thanks for your help and patience.  I also apologize in advance to anyone who also saw this question earlier today.  I posted this as a follow-up question / reply to a solved question & wonder if that is limiting who can view the question, so I'm posting it as a new post here.

1 ACCEPTED SOLUTION

Accepted Solutions
Steelers_In_DC
Barite | Level 11

You'll always get a quicker result if you type in your 'have' table like I have it.  Here you go:

data have;
input patient$          fruit$;
cards;
patientA          apple      
patientA          apple        
patientA          mango          
patientB          apple        
patientC          cherry        
patientC          grape
;run;

data want;
do until (last.patient);
set have;
by patient;
if fruit = 'apple' then count +1;
if first.patient then count = 1;
end;
do until (last.patient);
set have;
by patient;
if count > 1 then output;
end;
drop count;
run;

View solution in original post

6 REPLIES 6
Steelers_In_DC
Barite | Level 11

You'll always get a quicker result if you type in your 'have' table like I have it.  Here you go:

data have;
input patient$          fruit$;
cards;
patientA          apple      
patientA          apple        
patientA          mango          
patientB          apple        
patientC          cherry        
patientC          grape
;run;

data want;
do until (last.patient);
set have;
by patient;
if fruit = 'apple' then count +1;
if first.patient then count = 1;
end;
do until (last.patient);
set have;
by patient;
if count > 1 then output;
end;
drop count;
run;

beginner
Calcite | Level 5

Hi.  Thank you, this is working.  It's also very helpful that this preserves my sorting.  So after this step I thought I could do a proc sql to get "dataset4," but this this doesn't seem to be working.  I think I might not understand how to use proc sql.  Here's what I'm trying to do:

 

I have "dataset1" that looks like this (I've added in a patient D to this dummy dataset now, to illustrate my problem):

patient          fruit          [many other columns]

patientA          apple          ...

patientA          apple          ...

patientA          mango          ...

patientB          apple          ...

patientC          cherry          ...

patientC          grape          ....

patientD          grape          ....

patientD          apple          ...

patientD          apple          ...

patientD          apple          ...

 

Using your solution, I got "dataset3," which looks like this:

patient          fruit          [many other columns]

patientA          apple          ...

patientA          apple          ...

patientA          mango          ...

patientD          grape          ....

patientD          apple          ...

patientD          apple          ...

patientD          apple          ...

 

I had planned to next use the following lines:

 

proc sql ;
create table dataset4 as
select *
from dataset3
where patient in(
select patient from dataset1 where fruit ^= 'mango' and fruit ^= 'peach')
;
quit;

 

To get a "dataset4" that looks like this:

patient          fruit          [many other columns]

patientD          grape          ....

patientD          apple          ...

patientD          apple          ...

patientD          apple          ...

 

In other words, I want to remove from "dataset3" all patients for whom the "fruit" entry is at any point mango or peach.  Thoughts?

FreelanceReinh
Jade | Level 19

Hi @beginner,

please find below a modified version of your PROC SQL code, which will do what you want.

proc sql;
create table dataset4 as
select *
from dataset3
where patient ^in(
select distinct patient from dataset3 where fruit = 'mango' or fruit = 'peach')
;
quit;

Edit: Explanations:

The subquery of your PROC SQL step looks at each observation of DATASET1 separately and selects the value of variable PATIENT from it whenever FRUIT is neither 'mango' nor 'peach'. But this can easily bring in patients who do have 'mango' or 'peach' -- in other observations (example: patientA). A single non-('mango' or 'peach') observation is sufficient to make this happen.

 

Therefore, I changed the logic to exclude (via ^in, i.e. not in) patients who do have 'mango' or 'peach' in any observation. Just to improve performance I restricted the search (of the subquery) to DATASET3, because this is our "universe" in the main query, and added the DISTINCT keyword in order to keep the result of the subquery as small as possible.

 

Alternatively, one could apply the DOW loop technique (as in Steelers_In_DC's solution) again instead of PROC SQL. If DATASET3 is only used as the basis for the creation of DATASET4, it would be even easier to take the step from DATASET1 to DATASET4 directly. Simply add the exclusion criterion to the existing solution, as shown below:

 

data dataset4;
do until (last.patient);
  set dataset1;
  by patient;
  if fruit = 'apple' then count=sum(count,1);
  else if fruit in ('mango', 'peach') then excl=1;
end;
do until (last.patient);
  set dataset1;
  by patient;
  if count > 1 & ^excl then output;
end;
drop count excl;
run;

As a minor tweak (just for "optimization", not necessary) I changed the way how variable COUNT is incremented: The original "count+1" (a sum statement) causes an implicit RETAIN for variable COUNT. As a consequence, @Steelers_In_DC had to reset COUNT to 1 at the beginning of each BY group in order to count the "apples" for each patient separately. We need COUNT to be retained only within the BY groups and this is guaranteed already by the DOW loop technique (do until(last.xxx); set ...), because each BY group is processed within one iteration of the data step.

 

However, we need another feature of the sum statement: the implicit initialization of the variable (here: COUNT) to zero. The SUM function shares this feature with the sum statement. So, I used that together with an ordinary assignment statement and deleted "if first.patient then count = 1;". (An assignment statement of the form count=count+1 would not work without additional code, because COUNT would start with a missing value at the beginning of each BY group and "missing + 1 is missing.")

 

 

Likewise, @PGStats's PROC SQL approach could be amended easily to exclude the 'mango'-or-'peach' patients by using the following HAVING clause:

having sum(fruit="apple") >= 2 & sum(fruit in ("mango" "peach"))=0;

 

PGStats
Opal | Level 21

A rather simple SQL query:

 


proc sql;
create table likesApples as
select * 
from have
group by patient
having sum(fruit="apple") >= 2;
quit;
PG
FreelanceReinh
Jade | Level 19

Unlike the data step (DOW loop) solution, PROC SQL does not preserve the original sort order of observations. But, if this was an issue, given the "many" columns in HAVE, it should be possible to build a suitable ORDER BY clause.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 816 views
  • 0 likes
  • 4 in conversation