Hi again,
I am trying to fill in the missing records to Y if any record in the series is Y by subjid and count where coll is 1. In my dataset I have collapsed records, a new record generated is ex = 1.
My interest is if any record in variable X = Y by count (Which is series) then the missing value should be Y. similarly for variable Y and Z.
I have highlighted the output I want in bold.
Data have | subjid | X | Y | Z | Coll | count | ex | |
101 | Y | N | Y | 1 | 1 | |||
101 | 1 | 1 | 1 | |||||
101 | N | N | Y | 1 | 1 | |||
101 | Y | Y | N | 1 | 2 | |||
101 | 1 | 2 | ||||||
101 | Y | N | N | 1 | 2 | |||
102 | Y | N | N | |||||
102 | N | N | N | 1 | 1 | |||
102 | 1 | 1 | 1 | |||||
102 | Y | N | N | 1 | 1 | |||
102 | N | N | N | 1 | 1 | |||
Data want | subjid | X | Y | Z | Coll | count | ex | |
101 | Y | N | Y | 1 | 1 | |||
101 | Y | Y | Y | 1 | 1 | 1 | ||
101 | N | N | Y | 1 | 1 | |||
101 | Y | Y | N | 1 | 2 | |||
101 | Y | Y | N | 1 | 2 | 1 | ||
101 | Y | N | N | 1 | 2 | |||
102 | Y | N | N | |||||
102 | N | N | N | 1 | 1 | |||
102 | Y | N | N | 1 | 1 | 1 | ||
102 | Y | N | N | 1 | 1 | |||
102 | N | N | N | 1 | 1 | |||
Thanks in advance.
Hi,
I do not see why in obs=2 of your dataset want you get Y='Y' since there are only N in your group
The solution could be something like this:
(untested)
PROC SORT data=have; by subjid count descending X descending Y descending Z;RUN;
DATA want;
LENGTH x1 y1 z1 $1;
SET have;
BY subjid count;
RETAIN x1 y1 z1 '';
ARRAY val (3) x y z;
ARRAY lag (3) x1 y1 z1;
DO i=1 TO dim(val);
IF first.count then lag(i)='N';
IF val(i) eq 'Y' then lag(i)='Y';
IF coll eq 1 AND missing(val(i)) then val(i)=lag(i);
END;
RUN;
- Cheers -
Hello @nans_p ,
It took me a long time to figure out what you want exactly. Your explanation is not super-clear (probably due to having a variable Y and a value 'Y'). You are lucky it's a weekend day, otherwise I wouldn't have done the effort. 😉
Just like @Oligolas, one of the records in the dataset WANT is not as you specified.
I deliberately pulled it all apart in many different steps so that you can easily assess if this is what you want. If you confirm this is what you want then we can come up with more concise code (like the code provided by @Oligolas ). More concise code could come in handy (for performance reasons) if you have tens or hundreds of thousands of observations to deal with.
Here's my code:
/* one series is one combination of subjid and count? */
data have;
input subjid X $ Y $ Z $ Coll count ex;
informat X Y Z $1.;
cards;
101 Y N Y 1 1 .
101 . . . 1 1 1
101 N N Y 1 1 .
101 Y Y N 1 2 .
101 . . . 1 2 .
101 Y N N 1 2 .
102 Y N N . . .
102 N N N 1 1 .
102 . . . 1 1 1
102 Y N N 1 1 .
102 N N N 1 1 .
;
run;
%MACRO LOOP_XYZ;
%DO i=1 %TO 3;
%IF &i.=1 %THEN %LET LETTER=x;
%ELSE %IF &i.=2 %THEN %LET LETTER=y;
%ELSE %IF &i.=3 %THEN %LET LETTER=z;
%ELSE;
data missing_&LETTER._shouldbe_Y(keep=subjid count);
set have;
by subjid count;
where &LETTER.='Y';
where also Coll=1;
run;
PROC SORT data=missing_&LETTER._shouldbe_Y NODUPKEY;
by subjid count;
run;
%END;
%MEND LOOP_XYZ;
options mprint nosymbolgen;
%LOOP_XYZ
data want;
merge have
missing_x_shouldbe_Y(in=in2)
missing_y_shouldbe_Y(in=in3)
missing_z_shouldbe_Y(in=in4);
by subjid count;
if X='' and Y='' and Z='' then ex=1;
if in2 and x='' then x='Y'; else x='N';
if in3 and y='' then y='Y'; else y='N';
if in4 and z='' then z='Y'; else z='N';
run;
/* end of program */
Cheers,
Koen
Hi Koen,
Thanks for looking at my query. I am sorry I couldn't be clear to explain.
What I am looking for is if any of the values in the variable X , Y , Z is 'Y' ( 'Y' basically stands for Yes & N stands for 'No' ), by subjid and count (count is what I call series) then the missing value should be set to 'Y' (Yes).
Thanks.
It sounds like what you are trying to say is you want to replace all of the missing values with the same value for the BY group independent of which variable has the missing value?
This would be much easier if you coded the variables as 0/1 instead of 'N'/'Y'. Then you caould use the two different MAX() function (SQL aggregate and SAS maximum functions) to find the maximum value over all three variable.
select ...,coalesce(X,max(max(x,y,z))) as X....
Hi,
since I'm resorting the records by placing the filled ones ahead, I do consider all records in a group. The assumption I take is take x,y and z will be missing simultaneously.
- Cheers -
So you want to replace missing values of X with the MAX() value of X for the group? Simple enough with PROC SQL.
First we need to clean up your example results so they match the rules. Let's add a ROW variable so we can recreate the order.
data have ;
infile cards dsd dlm='|' truncover;
row+1;
input subjid $ (X Y Z) (:$1.) Coll count ex ;
cards;
101|Y|N|Y|1|1|
101| | | |1|1|1
101|N|N|Y|1|1|
101|Y|Y|N|1|2|
101| | | |1|2|
101|Y|N|N|1|2|
102|Y|N|N| | |
102|N|N|N|1|1|
102| | | |1|1|1
102|Y|N|N|1|1|
102|N|N|N|1|1|
;
data expect ;
infile cards dsd dlm='|' truncover;
row+1;
input subjid $ (X Y Z) (:$1.) Coll count ex ;
cards;
101|Y|N|Y|1|1|
101|Y|N|Y|1|1|1 | Change Y to 'N'
101|N|N|Y|1|1|
101|Y|Y|N|1|2|
101|Y|Y|N|1|2| | Remvoed EX=1
101|Y|N|N|1|2|
102|Y|N|N| | |
102|N|N|N|1|1|
102|Y|N|N|1|1|1
102|Y|N|N|1|1|
102|N|N|N|1|1|
;
Now use PROC SQL and then use PROC COMPARE to test the results
proc sql ;
create table want as
select row
, subjid
, count
, Coll
, ex
, coalesce(X,max(X)) as X
, coalesce(Y,max(Y)) as Y
, coalesce(Z,max(Z)) as Z
from have
group by subjid,count
order by row
;
quit;
proc print data=want;
run;
proc compare data=want compare=expect;
id subjid count row;
run;
Obs row subjid count Coll ex X Y Z 1 1 101 1 1 . Y N Y 2 2 101 1 1 1 Y N Y 3 3 101 1 1 . N N Y 4 4 101 2 1 . Y Y N 5 5 101 2 1 . Y Y N 6 6 101 2 1 . Y N N 7 7 102 . . . Y N N 8 8 102 1 1 . N N N 9 9 102 1 1 1 Y N N 10 10 102 1 1 . Y N N 11 11 102 1 1 . N N N SAS 9.4 on WINDOWS 10:28 Wednesday, April 7, 2021 113 The COMPARE Procedure Comparison of WORK.WANT with WORK.EXPECT (Method=EXACT) Data Set Summary Dataset Created Modified NVar NObs WORK.WANT 10APR21:14:52:58 10APR21:14:52:58 8 11 WORK.EXPECT 10APR21:14:52:58 10APR21:14:52:58 8 11 Variables Summary Number of Variables in Common: 8. Number of ID Variables: 3. Observation Summary Observation Base Compare ID First Obs 1 1 subjid=101 count=1 row=1 Last Obs 11 11 subjid=102 count=1 row=11 Number of Observations in Common: 11. Total Number of Observations Read from WORK.WANT: 11. Total Number of Observations Read from WORK.EXPECT: 11. Number of Observations with Some Compared Variables Unequal: 0. Number of Observations with All Compared Variables Equal: 11. NOTE: No unequal values were found. All values compared are exactly equal.
Here a data step approach using a hash lookup table.
data want;
if _n_=1 then
do;
dcl hash h1(dataset:'have(where=(whichc("Y",x,y,z)>0))');
h1.defineKey('subjid','count');
h1.defineData('subjid','count');
h1.defineDone();
end;
set have;
if h1.check()=0 then
do;
if missing(x) then x='Y';
if missing(y) then y='Y';
if missing(z) then z='Y';
end;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.