Hi
I have a large dataset with patients and their consumed fruits. And I would like to create a new dataset with only the lines where a patient has consumed any fruit twice or more.
This is my starting point:
data exercise;
input patient fruit$;
cards;
1 Apple
1 Pear
1 Apple
1 Berry
1 Berry
2 Apple
2 Pear
3 Pear
3 Pear
3 Berry
4 Pear
4 Berry
4 Berry
4 Berry
;
run;
And this is how it should look after:
1 Apple
1 Apple
1 Berry
1 Berry
3 Pear
3 Pear
4 Berry
4 Berry
4 Berry
Thank you for your help!
Take a look at First. Last. and BY group processing
data exercise;
input patient fruit$;
cards;
1 Apple
1 Pear
1 Apple
1 Berry
1 Berry
2 Apple
2 Pear
3 Pear
3 Pear
3 Berry
4 Pear
4 Berry
4 Berry
4 Berry
;
run;
proc sort data=exercise out=srtd_exercise ;
by patient fruit ;
run ;
data want ;
set srtd_exercise ;
by patient fruit ;
put _all_ ;
if first.fruit=1 and last.fruit=1 then
delete ;
run ;
Take a look at First. Last. and BY group processing
data exercise;
input patient fruit$;
cards;
1 Apple
1 Pear
1 Apple
1 Berry
1 Berry
2 Apple
2 Pear
3 Pear
3 Pear
3 Berry
4 Pear
4 Berry
4 Berry
4 Berry
;
run;
proc sort data=exercise out=srtd_exercise ;
by patient fruit ;
run ;
data want ;
set srtd_exercise ;
by patient fruit ;
put _all_ ;
if first.fruit=1 and last.fruit=1 then
delete ;
run ;
Try this.
proc sql;
create table want as
select
a.*
from
exercise a
join
(select patient, fruit, count(*) as cnt
from exercise
group by 1,2
having count(*) > 1
) b
on a.patient = b.patient
and a.fruit = b.fruit
;
quit;
Thank you both!
Both of your solution work great 😊
data exercise; input patient fruit$; cards; 1 Apple 1 Pear 1 Apple 1 Berry 1 Berry 2 Apple 2 Pear 3 Pear 3 Pear 3 Berry 4 Pear 4 Berry 4 Berry 4 Berry ; run; proc sort data=exercise out=want nouniquekey; by patient fruit; run;
If you have a limited number of values for the variable FRUIT, then consider using a SET statement with one argument per fruit (see the "WHERE=" parameters).
That is, the SET statement will read a collection of subsets (one per fruit). When used with a BY statement, each subset will be processed as a consecutive sequence of obsrvatinos (i.e. all the "Apple", then all the "Berr", etc.) which provides an easy way to detect the presence of more than one obs for a given PATIENT/FRUIT combination:
data exercise;
input patient fruit$;
cards;
1 Apple
1 Pear
1 Apple
1 Berry
1 Berry
2 Apple
2 Pear
3 Pear
3 Pear
3 Berry
4 Pear
4 Berry
4 Berry
4 Berry
run;
data want;
set exercise (where=(fruit='Apple'))
exercise (where=(fruit='Pear'))
exercise (where=(fruit='Berry')) ;
by patient fruit;
if not (first.fruit=1 and last.fruit=1);
run;
If first.fruit=1 and last.fruit=1 simultaneously then the record-in-hand is the only one for that patient/fruit, and is not wanted.
But if you have too many distinct FRUIT values, then another approach is needed - one where you can dynamically get the frequency of each fruit.
For what it's worth-
data exercise ;
input patient fruit$ ;
cards;
1 Apple
1 Pear
1 Apple
1 Berry
1 Berry
2 Apple
2 Pear
3 Pear
3 Pear
3 Berry
4 Pear
4 Berry
4 Berry
4 Berry
;
run ;
proc sql ;
create table want as
select a.*
from exercise a
inner join
(select patient, fruit, count(*) as c from exercise group by patient, fruit) b
on a.patient=b.patient and a.fruit=b.fruit and c > 1 ;
quit ;
data exercise ;
input patient fruit$ ;
cards;
1 Apple
1 Pear
1 Apple
1 Berry
1 Berry
2 Apple
2 Pear
3 Pear
3 Pear
3 Berry
4 Pear
4 Berry
4 Berry
4 Berry
;
run ;
data want ;
if _n_ = 1 then do ;
dcl hash h ( ) ;
h.definekey ('fruit') ;
h.definedata ('_iorc_') ;
h.definedone ( ) ;
end ;
do _n_ = 1 by 1 until (last.patient) ;
set exercise ;
by patient ;
if h.find( ) = 0 then _iorc_ = sum (_iorc_ , 1) ;
else _iorc_ = 1 ;
h.replace( ) ;
end ;
do _n_ = 1 to _n_ ;
set exercise ;
h.find( ) ;
if _iorc_ > 1 then output ;
end ;
h.clear( ) ;
run ;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.