BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Nina4
Obsidian | Level 7

Hi

I have a large dataset with patients and their consumed fruits. And I would like to create a new dataset with only the lines where a patient has consumed any fruit twice or more.

This is my starting point: 

data exercise;
input patient fruit$;
cards;
1 Apple
1 Pear
1 Apple
1 Berry
1 Berry
2 Apple
2 Pear
3 Pear
3 Pear
3 Berry
4 Pear
4 Berry
4 Berry
4 Berry
;
run;

And this is how it should look after: 

 

1 Apple
1 Apple
1 Berry
1 Berry
3 Pear
3 Pear
4 Berry
4 Berry
4 Berry

 

Thank you for your help!

1 ACCEPTED SOLUTION

Accepted Solutions
AMSAS
SAS Super FREQ

Take a look at First. Last. and BY group processing

 

 

Spoiler
data exercise;
input patient fruit$;
cards;
1 Apple
1 Pear
1 Apple
1 Berry
1 Berry
2 Apple
2 Pear
3 Pear
3 Pear
3 Berry
4 Pear
4 Berry
4 Berry
4 Berry
;
run;

proc sort data=exercise out=srtd_exercise ;
	by patient fruit ;
run ;

data want ;
	set srtd_exercise ;
	by patient fruit ;
	put _all_ ;
	if first.fruit=1 and last.fruit=1 then
		delete ;
run ;

 

 

View solution in original post

9 REPLIES 9
AMSAS
SAS Super FREQ

Take a look at First. Last. and BY group processing

 

 

Spoiler
data exercise;
input patient fruit$;
cards;
1 Apple
1 Pear
1 Apple
1 Berry
1 Berry
2 Apple
2 Pear
3 Pear
3 Pear
3 Berry
4 Pear
4 Berry
4 Berry
4 Berry
;
run;

proc sort data=exercise out=srtd_exercise ;
	by patient fruit ;
run ;

data want ;
	set srtd_exercise ;
	by patient fruit ;
	put _all_ ;
	if first.fruit=1 and last.fruit=1 then
		delete ;
run ;

 

 

average_joe
Obsidian | Level 7

Try this.

 

proc sql;
create table want as
select
    a.*
from
    exercise  a
    join
    (select patient, fruit, count(*) as cnt
     from exercise
     group by 1,2
     having count(*) > 1
    )  b
        on a.patient = b.patient
           and a.fruit = b.fruit
;
quit;

Nina4
Obsidian | Level 7

Thank you both! 
Both of your solution work great 😊

 

Ksharp
Super User
data exercise;
input patient fruit$;
cards;
1 Apple
1 Pear
1 Apple
1 Berry
1 Berry
2 Apple
2 Pear
3 Pear
3 Pear
3 Berry
4 Pear
4 Berry
4 Berry
4 Berry
;
run;

proc sort data=exercise out=want nouniquekey;
by patient fruit;
run;
average_joe
Obsidian | Level 7
Very nice. Didn't know that option existed until now.
mkeintz
PROC Star

If you have a limited number of values for the variable FRUIT, then consider using a SET statement with one argument per fruit (see the "WHERE=" parameters).

 

That is, the SET statement will read a collection of subsets (one per fruit).  When used with a BY statement, each subset will be processed as a consecutive sequence of obsrvatinos (i.e. all the "Apple", then all the "Berr", etc.) which provides an easy way to detect the presence of more than one obs for a given PATIENT/FRUIT combination:

 

data exercise;
input patient fruit$;
cards;
1 Apple
1 Pear
1 Apple
1 Berry
1 Berry
2 Apple
2 Pear
3 Pear
3 Pear
3 Berry
4 Pear
4 Berry
4 Berry
4 Berry
run;


data want;
  set exercise (where=(fruit='Apple'))
      exercise (where=(fruit='Pear'))
      exercise (where=(fruit='Berry')) ;
  by patient fruit;
  if not (first.fruit=1 and last.fruit=1);
run;

If first.fruit=1 and last.fruit=1 simultaneously then the record-in-hand is the only one for that patient/fruit, and is not wanted.

 

But if you have too many distinct FRUIT values, then another approach is needed - one where you can dynamically get the frequency of each fruit.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Nina4
Obsidian | Level 7
This approach is not feasible for me, as I have thousend of distinct fruit variables.
Still thank you for your option!
novinosrin
Tourmaline | Level 20

For what it's worth-

 


data exercise ;
	input patient fruit$ ;
	cards;
1 Apple
1 Pear
1 Apple
1 Berry
1 Berry
2 Apple
2 Pear
3 Pear
3 Pear
3 Berry
4 Pear
4 Berry
4 Berry
4 Berry
;
run ;

proc sql ;
	create table want as
		select a.*
			from exercise a 
				inner join 
					(select patient, fruit, count(*) as c from exercise group by patient, fruit) b
						on a.patient=b.patient and a.fruit=b.fruit and c > 1 ;
quit ;
novinosrin
Tourmaline | Level 20


data exercise ;
	input patient fruit$ ;
	cards;
1 Apple
1 Pear
1 Apple
1 Berry
1 Berry
2 Apple
2 Pear
3 Pear
3 Pear
3 Berry
4 Pear
4 Berry
4 Berry
4 Berry
;
run ;


data want ;
  if _n_ = 1 then do ;
    dcl hash h ( ) ;
	h.definekey ('fruit') ;
    h.definedata ('_iorc_') ;
	h.definedone ( ) ;
  end ;
  do _n_ = 1 by 1 until (last.patient) ;
    set exercise ;
	by patient ;
	if h.find( ) = 0 then _iorc_ = sum (_iorc_ , 1) ;
	else _iorc_ = 1 ;
    h.replace( ) ;
  end ;
  do _n_ = 1 to _n_ ;
    set exercise ;
	h.find( ) ;
	if _iorc_ > 1 then output ;
  end ;
  h.clear( ) ;
run ;

sas-innovate-wordmark-2025-midnight.png

Register Today!

Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.


Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 1871 views
  • 4 likes
  • 6 in conversation