Solved: Re: Proc sql: replace value with another value within group

sasecn · Posted 05-25-2021 02:33 PM

I have a data with ids and names. Within one ID, there may have more than one values in Name. My goal is to create another variable called new_name. New_name will have the same value of name if all names are the same within one id, e.g. id1 in the example; will have a value of "multi_name" if ther are more than one values in Name, e.g. id 2 and 3 in the example; will have the value of name that is not equal to "no_name" wihtin the ID if there are two values within ID (one of them is "no_name") e.g. id 4 in the example.

So id3 should have new_name=multi_name, but id4 should have new_name=g. The attached code can produce the sample data "have", and "want" is the one i am looking for. The proc sql is not finished, i don't know how to code it for id4.

Thanks for any help!

data have;
input id $ name $;
datalines;
1 a
1 a
2 b
2 c
3 e
3 f
3 no_name
4 g
4 no_name
;
run;

data want;
input id $1. name $8. new_name $12.;
datalines;
1 a       a
1 a       a
2 b       multi_name
2 c       multi_name
3 e       multi_name
3 f       multi_name
3 no_name multi_name
4 g       g
4 no_name g
;
run;

proc sql;
create table want_1 as
select *,
case 
when count(distinct name) = 1 then name
when count(distinct name) > 1 then "multi_name"
/* how to code new_name = g for id 4 ? */
end as new_name
from have
group by id
;
quit;

novinosrin · Posted 05-25-2021 03:12 PM

Hi @sasecn Here is another variant similar to Tom. This actually nests IFC/IFN functions within a CASE WHEN expression taught by Guru Paul D @hashman a while ago-

data have;
input id $ name $;
datalines;
1 a
1 a
2 b
2 c
3 e
3 f
3 no_name
4 g
4 no_name
;
run;

proc sql;
 create table want as
 select *, 
 case 
  when count(distinct name)=2 and max(name='no_name') then max(ifc(name='no_name',' ',name))
  when count(distinct name)>1  then 'multi_name'
  else name 
 end as new_name length=12
 from have
 group by id;
quit;

proc print noobs;run;

id	name	new_name
1	a	a
1	a	a
2	b	multi_name
2	c	multi_name
3	no_name	multi_name
3	f	multi_name
3	e	multi_name
4	no_name	g
4	g	g

View solution in original post

jimbarbour · Posted 05-25-2021 02:53 PM

I'm not quite sure if I understand what you're trying to do.

Can you mock up what your final output should look like? If you post what you want your results to look like, then maybe I can help out.

Jim

sasecn · Posted 05-25-2021 03:04 PM

The data "want" is the output I am looking for. It should be based on data "have" with a new created variable called new_name.

Thanks,

Tom · Posted 05-25-2021 03:07 PM

Easy enough in PROC SQL since SAS will automatically remerge aggregate statistics like COUNT() and MIN() back onto all observations in a group.

First let's replace those "no_name" values with actual empty strings so that SAS will treat them as missing values. Let's add a unique ID so it will be easier to compare the values.

data have;
  row+1;
  input id $ name $;
datalines;
1 a
1 a
2 b
2 c
3 e
3 f
3 .
4 g
4 .
;

data expect;
  row+1;
  input id $ name $ new_name :$12.;
datalines;
1 a       a
1 a       a
2 b       multi_name
2 c       multi_name
3 e       multi_name
3 f       multi_name
3 .       multi_name
4 g       g
4 .       g
;

Now just group by ID and depending on whether the number of non-missing values is more than 1 decide whether to use that value (note that min() or max() of one value is the same value).

proc sql ;
create table want as
  select a.*
       , case when (count(distinct name) > 1) then 'multi_name' 
         else min(name) end as new_name length=12
  from have a
  group by a.id 
  order by row
;
quit;

proc print data=want; run;
proc compare data=want compare=expect;
run;

Results:

SAS 9.4 on WINDOWS

Obs    row    id    name    new_name

 1      1     1      a      a
 2      2     1      a      a
 3      3     2      b      multi_name
 4      4     2      c      multi_name
 5      5     3      e      multi_name
 6      6     3      f      multi_name
 7      7     3             multi_name
 8      8     4      g      g
 9      9     4             g

SAS 9.4 on WINDOWS                                                                                     

The COMPARE Procedure
Comparison of WORK.WANT with WORK.EXPECT
(Method=EXACT)

Data Set Summary

Dataset               Created          Modified  NVar    NObs

WORK.WANT    25MAY21:15:06:21  25MAY21:15:06:21     4       9
WORK.EXPECT  25MAY21:15:06:21  25MAY21:15:06:21     4       9


Variables Summary

Number of Variables in Common: 4.


Observation Summary

Observation      Base  Compare

First Obs           1        1
Last  Obs           9        9

Number of Observations in Common: 9.
Total Number of Observations Read from WORK.WANT: 9.
Total Number of Observations Read from WORK.EXPECT: 9.

Number of Observations with Some Compared Variables Unequal: 0.
Number of Observations with All Compared Variables Equal: 9.

NOTE: No unequal values were found. All values compared are exactly equal.

novinosrin · Posted 05-25-2021 03:12 PM

Hi @sasecn Here is another variant similar to Tom. This actually nests IFC/IFN functions within a CASE WHEN expression taught by Guru Paul D @hashman a while ago-

data have;
input id $ name $;
datalines;
1 a
1 a
2 b
2 c
3 e
3 f
3 no_name
4 g
4 no_name
;
run;

proc sql;
 create table want as
 select *, 
 case 
  when count(distinct name)=2 and max(name='no_name') then max(ifc(name='no_name',' ',name))
  when count(distinct name)>1  then 'multi_name'
  else name 
 end as new_name length=12
 from have
 group by id;
quit;

proc print noobs;run;

id	name	new_name
1	a	a
1	a	a
2	b	multi_name
2	c	multi_name
3	no_name	multi_name
3	f	multi_name
3	e	multi_name
4	no_name	g
4	g	g

sasecn · Posted 05-25-2021 03:39 PM

Thanks, it works. Can you explain the condition "and max(name='no_name')" ? I understand the first part. Does the whole condition mean: if there are two outcomes of Name, one is some value of name, the other one is 'no_name', then take the max value (ie. not the emapty one)?

novinosrin · Posted 05-25-2021 04:19 PM

Hi again @sasecn the logic is fairly simple once we understand the "waterfall construct"-

1. Start with checking for 2 distinct values with one valid name and the other no_name viz. max(name='no_name') results in a boolean 1's or 0's true for no_name or false

2. 1st When fails, the 2nd when expression checks for any count of distinct names >1 and assigns the easy multiname

3. When 1 & 2 fails, there can only be one more options i.e. all names being same.

Hope that helps

sasecn · Posted 05-25-2021 04:47 PM

Thank you! Really helpful!

sasecn · Posted 05-25-2021 03:58 PM

Thanks, this code works well. I just don't understand the "else min(name)" part. For example, how it can take the value of "g" in id4 case? Does the SAS just ingore the row with missing value?

Tom · Posted 05-25-2021 04:26 PM

MIN() of character strings ignores blank values the same as it ignores missing value for numeric variables.

COUNT() also ignores the blank values.

If your data really has "no_name" as the value of NAME instead of blanks then use CASE (or IFC()) to remove those.

count(distinct case when name='no_name' then ' ' else name end)
...
min(case when name='no_name' then ' ' else name end)

sasecn · Posted 05-25-2021 04:48 PM

Great, thanks for the answer1

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away