Hi, I have been working on reorganizing a SAS dataset that has many columns and rows. I've made a dummy "dataset 1" here as an example. Dummy "dataset 1":
patient fruit [many other columns]
patientA apple ...
patientA apple ...
patientA mango ...
patientB apple ...
patientC cherry ...
patientC grape ....
I have figured out how to create a new dataset containing only the entries that have fruit = apple. I.e., "dataset 2":
patient fruit [many other columns]
patientA apple ...
patientA apple ...
patientB apple ....
I'm wondering if anyone could please teach me how to create a "dataset 3" from "dataset 1," that contains all of the entries for patients who have ever had fruit = apple more than once. In other words, I'm interested in examining a "dataset 3" that looks like this:
patient fruit [many other columns]
patientA apple ...
patientA apple ...
patientA mango ...
Thanks for your help and patience. I also apologize in advance to anyone who also saw this question earlier today. I posted this as a follow-up question / reply to a solved question & wonder if that is limiting who can view the question, so I'm posting it as a new post here.
You'll always get a quicker result if you type in your 'have' table like I have it. Here you go:
data have;
input patient$ fruit$;
cards;
patientA apple
patientA apple
patientA mango
patientB apple
patientC cherry
patientC grape
;run;
data want;
do until (last.patient);
set have;
by patient;
if fruit = 'apple' then count +1;
if first.patient then count = 1;
end;
do until (last.patient);
set have;
by patient;
if count > 1 then output;
end;
drop count;
run;
You'll always get a quicker result if you type in your 'have' table like I have it. Here you go:
data have;
input patient$ fruit$;
cards;
patientA apple
patientA apple
patientA mango
patientB apple
patientC cherry
patientC grape
;run;
data want;
do until (last.patient);
set have;
by patient;
if fruit = 'apple' then count +1;
if first.patient then count = 1;
end;
do until (last.patient);
set have;
by patient;
if count > 1 then output;
end;
drop count;
run;
Hi. Thank you, this is working. It's also very helpful that this preserves my sorting. So after this step I thought I could do a proc sql to get "dataset4," but this this doesn't seem to be working. I think I might not understand how to use proc sql. Here's what I'm trying to do:
I have "dataset1" that looks like this (I've added in a patient D to this dummy dataset now, to illustrate my problem):
patient fruit [many other columns]
patientA apple ...
patientA apple ...
patientA mango ...
patientB apple ...
patientC cherry ...
patientC grape ....
patientD grape ....
patientD apple ...
patientD apple ...
patientD apple ...
Using your solution, I got "dataset3," which looks like this:
patient fruit [many other columns]
patientA apple ...
patientA apple ...
patientA mango ...
patientD grape ....
patientD apple ...
patientD apple ...
patientD apple ...
I had planned to next use the following lines:
proc sql ;
create table dataset4 as
select *
from dataset3
where patient in(
select patient from dataset1 where fruit ^= 'mango' and fruit ^= 'peach')
;
quit;
To get a "dataset4" that looks like this:
patient fruit [many other columns]
patientD grape ....
patientD apple ...
patientD apple ...
patientD apple ...
In other words, I want to remove from "dataset3" all patients for whom the "fruit" entry is at any point mango or peach. Thoughts?
Hi @beginner,
please find below a modified version of your PROC SQL code, which will do what you want.
proc sql;
create table dataset4 as
select *
from dataset3
where patient ^in(
select distinct patient from dataset3 where fruit = 'mango' or fruit = 'peach')
;
quit;
Edit: Explanations:
The subquery of your PROC SQL step looks at each observation of DATASET1 separately and selects the value of variable PATIENT from it whenever FRUIT is neither 'mango' nor 'peach'. But this can easily bring in patients who do have 'mango' or 'peach' -- in other observations (example: patientA). A single non-('mango' or 'peach') observation is sufficient to make this happen.
Therefore, I changed the logic to exclude (via ^in, i.e. not in) patients who do have 'mango' or 'peach' in any observation. Just to improve performance I restricted the search (of the subquery) to DATASET3, because this is our "universe" in the main query, and added the DISTINCT keyword in order to keep the result of the subquery as small as possible.
Alternatively, one could apply the DOW loop technique (as in Steelers_In_DC's solution) again instead of PROC SQL. If DATASET3 is only used as the basis for the creation of DATASET4, it would be even easier to take the step from DATASET1 to DATASET4 directly. Simply add the exclusion criterion to the existing solution, as shown below:
data dataset4;
do until (last.patient);
set dataset1;
by patient;
if fruit = 'apple' then count=sum(count,1);
else if fruit in ('mango', 'peach') then excl=1;
end;
do until (last.patient);
set dataset1;
by patient;
if count > 1 & ^excl then output;
end;
drop count excl;
run;
As a minor tweak (just for "optimization", not necessary) I changed the way how variable COUNT is incremented: The original "count+1" (a sum statement) causes an implicit RETAIN for variable COUNT. As a consequence, @Steelers_In_DC had to reset COUNT to 1 at the beginning of each BY group in order to count the "apples" for each patient separately. We need COUNT to be retained only within the BY groups and this is guaranteed already by the DOW loop technique (do until(last.xxx); set ...), because each BY group is processed within one iteration of the data step.
However, we need another feature of the sum statement: the implicit initialization of the variable (here: COUNT) to zero. The SUM function shares this feature with the sum statement. So, I used that together with an ordinary assignment statement and deleted "if first.patient then count = 1;". (An assignment statement of the form count=count+1 would not work without additional code, because COUNT would start with a missing value at the beginning of each BY group and "missing + 1 is missing.")
Likewise, @PGStats's PROC SQL approach could be amended easily to exclude the 'mango'-or-'peach' patients by using the following HAVING clause:
having sum(fruit="apple") >= 2 & sum(fruit in ("mango" "peach"))=0;
Thank you, @FreelanceReinhard!
A rather simple SQL query:
proc sql;
create table likesApples as
select *
from have
group by patient
having sum(fruit="apple") >= 2;
quit;
Unlike the data step (DOW loop) solution, PROC SQL does not preserve the original sort order of observations. But, if this was an issue, given the "many" columns in HAVE, it should be possible to build a suitable ORDER BY clause.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.