Solved: Re: Keep lines where the values are the same

Nina4 · Posted 04-29-2022 08:57 AM

Hi

I have a large dataset with patients and their consumed fruits. And I would like to create a new dataset with only the lines where a patient has consumed any fruit twice or more.

This is my starting point:

data exercise;
input patient fruit$;
cards;
1 Apple
1 Pear
1 Apple
1 Berry
1 Berry
2 Apple
2 Pear
3 Pear
3 Pear
3 Berry
4 Pear
4 Berry
4 Berry
4 Berry
;
run;

And this is how it should look after:

1 Apple
1 Apple
1 Berry
1 Berry
3 Pear
3 Pear
4 Berry
4 Berry
4 Berry

Thank you for your help!

AMSAS · Posted 04-29-2022 09:13 AM

Take a look at First. Last. and BY group processing

Spoiler

data exercise;
input patient fruit$;
cards;
1 Apple
1 Pear
1 Apple
1 Berry
1 Berry
2 Apple
2 Pear
3 Pear
3 Pear
3 Berry
4 Pear
4 Berry
4 Berry
4 Berry
;
run;

proc sort data=exercise out=srtd_exercise ;
	by patient fruit ;
run ;

data want ;
	set srtd_exercise ;
	by patient fruit ;
	put _all_ ;
	if first.fruit=1 and last.fruit=1 then
		delete ;
run ;

data exercise; input patient fruit$; cards; 1 Apple 1 Pear 1 Apple 1 Berry 1 Berry 2 Apple 2 Pear 3 Pear 3 Pear 3 Berry 4 Pear 4 Berry 4 Berry 4 Berry ; run; proc sort data=exercise out=srtd_exercise ; by patient fruit ; run ; data want ; set srtd_exercise ; by patient fruit ; put _all_ ; if first.fruit=1 and last.fruit=1 then delete ; run ;

View solution in original post

AMSAS · Posted 04-29-2022 09:13 AM

Take a look at First. Last. and BY group processing

Spoiler

data exercise;
input patient fruit$;
cards;
1 Apple
1 Pear
1 Apple
1 Berry
1 Berry
2 Apple
2 Pear
3 Pear
3 Pear
3 Berry
4 Pear
4 Berry
4 Berry
4 Berry
;
run;

proc sort data=exercise out=srtd_exercise ;
	by patient fruit ;
run ;

data want ;
	set srtd_exercise ;
	by patient fruit ;
	put _all_ ;
	if first.fruit=1 and last.fruit=1 then
		delete ;
run ;

data exercise; input patient fruit$; cards; 1 Apple 1 Pear 1 Apple 1 Berry 1 Berry 2 Apple 2 Pear 3 Pear 3 Pear 3 Berry 4 Pear 4 Berry 4 Berry 4 Berry ; run; proc sort data=exercise out=srtd_exercise ; by patient fruit ; run ; data want ; set srtd_exercise ; by patient fruit ; put _all_ ; if first.fruit=1 and last.fruit=1 then delete ; run ;

average_joe · Posted 04-29-2022 09:20 AM

Try this.

proc sql;
create table want as
select
    a.*
from
    exercise  a
    join
    (select patient, fruit, count(*) as cnt
     from exercise
     group by 1,2
     having count(*) > 1
    )  b
        on a.patient = b.patient
           and a.fruit = b.fruit
;
quit;

Nina4 · Posted 04-29-2022 09:27 AM

Thank you both!
Both of your solution work great 😊

Ksharp · Posted 04-29-2022 09:55 AM

data exercise;
input patient fruit$;
cards;
1 Apple
1 Pear
1 Apple
1 Berry
1 Berry
2 Apple
2 Pear
3 Pear
3 Pear
3 Berry
4 Pear
4 Berry
4 Berry
4 Berry
;
run;

proc sort data=exercise out=want nouniquekey;
by patient fruit;
run;

average_joe · Posted 04-29-2022 09:59 AM

Very nice. Didn't know that option existed until now.

mkeintz · Posted 04-29-2022 10:21 AM

If you have a limited number of values for the variable FRUIT, then consider using a SET statement with one argument per fruit (see the "WHERE=" parameters).

That is, the SET statement will read a collection of subsets (one per fruit). When used with a BY statement, each subset will be processed as a consecutive sequence of obsrvatinos (i.e. all the "Apple", then all the "Berr", etc.) which provides an easy way to detect the presence of more than one obs for a given PATIENT/FRUIT combination:

data exercise;
input patient fruit$;
cards;
1 Apple
1 Pear
1 Apple
1 Berry
1 Berry
2 Apple
2 Pear
3 Pear
3 Pear
3 Berry
4 Pear
4 Berry
4 Berry
4 Berry
run;


data want;
  set exercise (where=(fruit='Apple'))
      exercise (where=(fruit='Pear'))
      exercise (where=(fruit='Berry')) ;
  by patient fruit;
  if not (first.fruit=1 and last.fruit=1);
run;

If first.fruit=1 and last.fruit=1 simultaneously then the record-in-hand is the only one for that patient/fruit, and is not wanted.

But if you have too many distinct FRUIT values, then another approach is needed - one where you can dynamically get the frequency of each fruit.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Nina4 · Posted 04-29-2022 10:25 AM

This approach is not feasible for me, as I have thousend of distinct fruit variables.
Still thank you for your option!

novinosrin · Posted 04-29-2022 12:03 PM

For what it's worth-


data exercise ;
	input patient fruit$ ;
	cards;
1 Apple
1 Pear
1 Apple
1 Berry
1 Berry
2 Apple
2 Pear
3 Pear
3 Pear
3 Berry
4 Pear
4 Berry
4 Berry
4 Berry
;
run ;

proc sql ;
	create table want as
		select a.*
			from exercise a 
				inner join 
					(select patient, fruit, count(*) as c from exercise group by patient, fruit) b
						on a.patient=b.patient and a.fruit=b.fruit and c > 1 ;
quit ;

novinosrin · Posted 04-29-2022 02:57 PM



data exercise ;
	input patient fruit$ ;
	cards;
1 Apple
1 Pear
1 Apple
1 Berry
1 Berry
2 Apple
2 Pear
3 Pear
3 Pear
3 Berry
4 Pear
4 Berry
4 Berry
4 Berry
;
run ;


data want ;
  if _n_ = 1 then do ;
    dcl hash h ( ) ;
	h.definekey ('fruit') ;
    h.definedata ('_iorc_') ;
	h.definedone ( ) ;
  end ;
  do _n_ = 1 by 1 until (last.patient) ;
    set exercise ;
	by patient ;
	if h.find( ) = 0 then _iorc_ = sum (_iorc_ , 1) ;
	else _iorc_ = 1 ;
    h.replace( ) ;
  end ;
  do _n_ = 1 to _n_ ;
    set exercise ;
	h.find( ) ;
	if _iorc_ > 1 then output ;
  end ;
  h.clear( ) ;
run ;

SAS Innovate 2025: Call for Content

Classroom Training Available!