Solved: Re: How to select the more comprehensive row by a key column?

aaronh · Posted 09-26-2019 02:51 PM

Hello experts,

Suppose I have a table like this:

data have;
input fruit$ attrib$;
datalines;
apple
apple sweet
lemon sour
banana
banana
orange citrus
orange citrus
;

So in this dataset, a fruit can sometimes have attribute that is complete (like orange), empty (like banana), or somewhere in between (like apple).

The challenge here is that I want to condense the rows into 4 rows, such that if the attribute for fruit X is provided in any row, I will keep that attribute for fruit X. If no attribute is provided for fruit X, I will leave it as is. So the resulting dataset will look like:

apple sweet

lemon sour

banana

orange citrus

Any ideas will be appreciated.

Thank you,

novinosrin · Posted 09-26-2019 03:06 PM




data have;
input fruit$ attrib$;
datalines;
apple .
apple sweet
lemon sour
banana .
banana .
orange citrus
orange citrus
;

proc sql;
create table want as
select distinct a.*
from have a left join have b
on find(cats(a.fruit,a.attrib), strip(b.fruit))>0
group by a.fruit
having max(length(cats(a.fruit,a.attrib))) =length(cats(a.fruit,a.attrib));
quit;

View solution in original post

tomrvincent · Posted 09-26-2019 03:00 PM

do a distinct by fruit and another distinct by fruit and attrib. Join the 2 together.

Reeza · Posted 09-26-2019 03:04 PM

Use Sort.

proc sort data=have;
by fruit descending attrib;
run;

proc sort data=have out=want nodupkey;
by fruit;
run;

If you have fruits with different attributes you may run into some issues.

@aaronh wrote:

Hello experts,

Suppose I have a table like this:
data have;
input fruit$ attrib$;
datalines;
apple
apple sweet
lemon sour
banana
banana
orange citrus
orange citrus
;
So in this dataset, a fruit can sometimes have attribute that is complete (like orange), empty (like banana), or somewhere in between (like apple).

The challenge here is that I want to condense the rows into 4 rows, such that if the attribute for fruit X is provided in any row, I will keep that attribute for fruit X. If no attribute is provided for fruit X, I will leave it as is. So the resulting dataset will look like:

apple sweet

lemon sour

banana

orange citrus

Any ideas will be appreciated.

Thank you,

aaronh · Posted 09-26-2019 03:12 PM

Thanks Reeza! I was thinking of maybe using a sort first and then using a by group in a data step to choose the first/last in the by group. Which is somewhat similar I guess to what you have

novinosrin · Posted 09-26-2019 03:06 PM




data have;
input fruit$ attrib$;
datalines;
apple .
apple sweet
lemon sour
banana .
banana .
orange citrus
orange citrus
;

proc sql;
create table want as
select distinct a.*
from have a left join have b
on find(cats(a.fruit,a.attrib), strip(b.fruit))>0
group by a.fruit
having max(length(cats(a.fruit,a.attrib))) =length(cats(a.fruit,a.attrib));
quit;

PGStats · Posted 09-26-2019 05:08 PM

Using a hash:

data _null_;
set have end=done;
if _n_ = 1 then do;
	declare hash h();
	rc = h.definekey("fruit");
	rc = h.definedata("fruit", "attrib");
	rc = h.definedone();
	end;
if h.check() ne 0 then h.add();
else if not missing(attrib) then h.replace();
if done then h.output(dataset : "want");
run;

It will keep the last attribute encountered for a given fruit.

PG

hashman · Posted 09-27-2019 12:05 AM

@aaronh:

Hm. If you real data are patterned the same way you've presented in your data sample, i.e. grouped by [fruit,attrib] with the missing values for attrib coming first, the only thing you need to do is:

data have ;                
  input fruit $ attrib $ ; 
  cards ;                  
apple  .                   
apple  sweet               
lemon  sour                
banana .                   
banana .                   
orange citrus              
orange citrus              
;                          
run ;                      
                           
data want ;                
  set have ;               
  by fruit notsorted ;     
  if last.fruit ;          
run ;

If the real input data aren't grouped or sorted and you don't want to sort, just use what @PGStats or @novinosrin have offered.

Kind regards

Paul D.

aaronh · Posted 09-27-2019 09:07 AM

Thank you all very much for your insightful solutions!

@hashman Yes, my real data is actually a bit more complex because I am trying to select the more comprehensive set of addresses (street address 1 and street address 2) for a by-group that consists of a few variables. It seems like using by-group in a data step would be the easiest way to comprehend or trouble-shoot.

@PGStats and @novinosrin : thank you both for the solutions! At first glance, it seems like a simple logic to a human mind, but to implement it in SAS, it can be much more complicated.

Once again, I really appreciate the inputs from you all!

tomrvincent · Posted 09-27-2019 03:38 PM

PROC SQL;
	CREATE TABLE WORK.foo AS 
		SELECT f.fruit, 
			a.attrib
		FROM (SELECT DISTINCT fruit FROM USER.HAVE) f
			FULL JOIN (SELECT DISTINCT fruit, attrib FROM USER.HAVE WHERE attrib NOT IS MISSING) a 
				ON f.fruit = a.fruit;
QUIT;

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away