BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Nina4
Obsidian | Level 7

Hi

I have a large dataset with patients and their consumed fruits. And I would like to create a new dataset with only the lines where a patient has consumed any fruit twice or more.

This is my starting point: 

data exercise;
input patient fruit$;
cards;
1 Apple
1 Pear
1 Apple
1 Berry
1 Berry
2 Apple
2 Pear
3 Pear
3 Pear
3 Berry
4 Pear
4 Berry
4 Berry
4 Berry
;
run;

And this is how it should look after: 

 

1 Apple
1 Apple
1 Berry
1 Berry
3 Pear
3 Pear
4 Berry
4 Berry
4 Berry

 

Thank you for your help!

1 ACCEPTED SOLUTION

Accepted Solutions
AMSAS
SAS Super FREQ

Take a look at First. Last. and BY group processing

 

 

Spoiler
data exercise;
input patient fruit$;
cards;
1 Apple
1 Pear
1 Apple
1 Berry
1 Berry
2 Apple
2 Pear
3 Pear
3 Pear
3 Berry
4 Pear
4 Berry
4 Berry
4 Berry
;
run;

proc sort data=exercise out=srtd_exercise ;
	by patient fruit ;
run ;

data want ;
	set srtd_exercise ;
	by patient fruit ;
	put _all_ ;
	if first.fruit=1 and last.fruit=1 then
		delete ;
run ;

 

 

View solution in original post

9 REPLIES 9
AMSAS
SAS Super FREQ

Take a look at First. Last. and BY group processing

 

 

Spoiler
data exercise;
input patient fruit$;
cards;
1 Apple
1 Pear
1 Apple
1 Berry
1 Berry
2 Apple
2 Pear
3 Pear
3 Pear
3 Berry
4 Pear
4 Berry
4 Berry
4 Berry
;
run;

proc sort data=exercise out=srtd_exercise ;
	by patient fruit ;
run ;

data want ;
	set srtd_exercise ;
	by patient fruit ;
	put _all_ ;
	if first.fruit=1 and last.fruit=1 then
		delete ;
run ;

 

 

average_joe
Obsidian | Level 7

Try this.

 

proc sql;
create table want as
select
    a.*
from
    exercise  a
    join
    (select patient, fruit, count(*) as cnt
     from exercise
     group by 1,2
     having count(*) > 1
    )  b
        on a.patient = b.patient
           and a.fruit = b.fruit
;
quit;

Nina4
Obsidian | Level 7

Thank you both! 
Both of your solution work great 😊

 

Ksharp
Super User
data exercise;
input patient fruit$;
cards;
1 Apple
1 Pear
1 Apple
1 Berry
1 Berry
2 Apple
2 Pear
3 Pear
3 Pear
3 Berry
4 Pear
4 Berry
4 Berry
4 Berry
;
run;

proc sort data=exercise out=want nouniquekey;
by patient fruit;
run;
average_joe
Obsidian | Level 7
Very nice. Didn't know that option existed until now.
mkeintz
PROC Star

If you have a limited number of values for the variable FRUIT, then consider using a SET statement with one argument per fruit (see the "WHERE=" parameters).

 

That is, the SET statement will read a collection of subsets (one per fruit).  When used with a BY statement, each subset will be processed as a consecutive sequence of obsrvatinos (i.e. all the "Apple", then all the "Berr", etc.) which provides an easy way to detect the presence of more than one obs for a given PATIENT/FRUIT combination:

 

data exercise;
input patient fruit$;
cards;
1 Apple
1 Pear
1 Apple
1 Berry
1 Berry
2 Apple
2 Pear
3 Pear
3 Pear
3 Berry
4 Pear
4 Berry
4 Berry
4 Berry
run;


data want;
  set exercise (where=(fruit='Apple'))
      exercise (where=(fruit='Pear'))
      exercise (where=(fruit='Berry')) ;
  by patient fruit;
  if not (first.fruit=1 and last.fruit=1);
run;

If first.fruit=1 and last.fruit=1 simultaneously then the record-in-hand is the only one for that patient/fruit, and is not wanted.

 

But if you have too many distinct FRUIT values, then another approach is needed - one where you can dynamically get the frequency of each fruit.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Nina4
Obsidian | Level 7
This approach is not feasible for me, as I have thousend of distinct fruit variables.
Still thank you for your option!
novinosrin
Tourmaline | Level 20

For what it's worth-

 


data exercise ;
	input patient fruit$ ;
	cards;
1 Apple
1 Pear
1 Apple
1 Berry
1 Berry
2 Apple
2 Pear
3 Pear
3 Pear
3 Berry
4 Pear
4 Berry
4 Berry
4 Berry
;
run ;

proc sql ;
	create table want as
		select a.*
			from exercise a 
				inner join 
					(select patient, fruit, count(*) as c from exercise group by patient, fruit) b
						on a.patient=b.patient and a.fruit=b.fruit and c > 1 ;
quit ;
novinosrin
Tourmaline | Level 20


data exercise ;
	input patient fruit$ ;
	cards;
1 Apple
1 Pear
1 Apple
1 Berry
1 Berry
2 Apple
2 Pear
3 Pear
3 Pear
3 Berry
4 Pear
4 Berry
4 Berry
4 Berry
;
run ;


data want ;
  if _n_ = 1 then do ;
    dcl hash h ( ) ;
	h.definekey ('fruit') ;
    h.definedata ('_iorc_') ;
	h.definedone ( ) ;
  end ;
  do _n_ = 1 by 1 until (last.patient) ;
    set exercise ;
	by patient ;
	if h.find( ) = 0 then _iorc_ = sum (_iorc_ , 1) ;
	else _iorc_ = 1 ;
    h.replace( ) ;
  end ;
  do _n_ = 1 to _n_ ;
    set exercise ;
	h.find( ) ;
	if _iorc_ > 1 then output ;
  end ;
  h.clear( ) ;
run ;

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 1375 views
  • 4 likes
  • 6 in conversation