Set missing record to 'Y' if any record in the series has Y

nans_p · Posted 04-10-2021 12:48 PM

Hi again,

I am trying to fill in the missing records to Y if any record in the series is Y by subjid and count where coll is 1. In my dataset I have collapsed records, a new record generated is ex = 1.

My interest is if any record in variable X = Y by count (Which is series) then the missing value should be Y. similarly for variable Y and Z.

I have highlighted the output I want in bold.

Data have	subjid	X	Y	Z	Coll	count	ex
	101	Y	N	Y	1	1
	101				1	1	1
	101	N	N	Y	1	1
	101	Y	Y	N	1	2
	101				1	2
	101	Y	N	N	1	2
	102	Y	N	N
	102	N	N	N	1	1
	102				1	1	1
	102	Y	N	N	1	1
	102	N	N	N	1	1

Data want	subjid	X	Y	Z	Coll	count	ex
	101	Y	N	Y	1	1
	101	Y	Y	Y	1	1	1
	101	N	N	Y	1	1
	101	Y	Y	N	1	2
	101	Y	Y	N	1	2	1
	101	Y	N	N	1	2
	102	Y	N	N
	102	N	N	N	1	1
	102	Y	N	N	1	1	1
	102	Y	N	N	1	1
	102	N	N	N	1	1

Thanks in advance.

Oligolas · Posted 04-10-2021 02:10 PM

Hi,

I do not see why in obs=2 of your dataset want you get Y='Y' since there are only N in your group

The solution could be something like this:

(untested)

PROC SORT data=have; by subjid count descending X descending Y descending Z;RUN;
DATA want;
LENGTH x1 y1 z1 $1;
SET have;
BY subjid count;
RETAIN x1 y1 z1 '';
ARRAY val (3) x y z;
ARRAY lag (3) x1 y1 z1;
DO i=1 TO dim(val);
IF first.count then lag(i)='N';
IF val(i) eq 'Y' then lag(i)='Y';
IF coll eq 1 AND missing(val(i)) then val(i)=lag(i);
END;
RUN;

________________________
- Cheers -

sbxkoenk · Posted 04-10-2021 02:34 PM

Hello @nans_p ,

It took me a long time to figure out what you want exactly. Your explanation is not super-clear (probably due to having a variable Y and a value 'Y'). You are lucky it's a weekend day, otherwise I wouldn't have done the effort. 😉

Just like @Oligolas, one of the records in the dataset WANT is not as you specified.

I deliberately pulled it all apart in many different steps so that you can easily assess if this is what you want. If you confirm this is what you want then we can come up with more concise code (like the code provided by @Oligolas ). More concise code could come in handy (for performance reasons) if you have tens or hundreds of thousands of observations to deal with.

Here's my code:

/* one series is one combination of subjid and count? */
data have;
 input subjid X $ Y $ Z $ Coll count ex;
 informat X Y Z $1.;
 cards;
101	Y	N	Y	1	1	.
101	.   .   . 	1	1	1
101	N	N	Y	1	1	.
101	Y	Y	N	1	2	.
101	.   .   . 	1	2	.
101	Y	N	N	1	2	.
102	Y	N	N   .   .	.
102	N	N	N	1	1	.
102	.   .   . 	1	1	1
102	Y	N	N	1	1	.
102	N	N	N	1	1	.
;
run;

%MACRO LOOP_XYZ;
%DO i=1 %TO 3;
%IF       &i.=1 %THEN %LET LETTER=x;
%ELSE %IF &i.=2 %THEN %LET LETTER=y;
%ELSE %IF &i.=3 %THEN %LET LETTER=z;
%ELSE;
data missing_&LETTER._shouldbe_Y(keep=subjid count);
 set have;
 by subjid count;
 where &LETTER.='Y';
 where also Coll=1;
run;
PROC SORT data=missing_&LETTER._shouldbe_Y NODUPKEY;
 by subjid count;
run;
%END;
%MEND  LOOP_XYZ;
options mprint nosymbolgen;
%LOOP_XYZ

data want;
 merge have 
       missing_x_shouldbe_Y(in=in2)
	   missing_y_shouldbe_Y(in=in3)
	   missing_z_shouldbe_Y(in=in4);
 by subjid count;
 if X='' and Y='' and Z='' then ex=1;
 if in2 and x='' then x='Y'; else x='N';
 if in3 and y='' then y='Y'; else y='N';
 if in4 and z='' then z='Y'; else z='N';
run;
/* end of program */

Cheers,

Koen

sbxkoenk · Posted 04-10-2021 02:41 PM

The difference between my code and the code of @Oligolas (I just looked at it) is that @Oligolascode only looks at the past records while I look at all records of a series (subjid + count) but maybe that is not necessary at all?

Please clarify!

Koen

nans_p · Posted 04-10-2021 02:59 PM

Hi Koen,

Thanks for looking at my query. I am sorry I couldn't be clear to explain.

What I am looking for is if any of the values in the variable X , Y , Z is 'Y' ( 'Y' basically stands for Yes & N stands for 'No' ), by subjid and count (count is what I call series) then the missing value should be set to 'Y' (Yes).

Thanks.

Tom · Posted 04-10-2021 04:09 PM

It sounds like what you are trying to say is you want to replace all of the missing values with the same value for the BY group independent of which variable has the missing value?

This would be much easier if you coded the variables as 0/1 instead of 'N'/'Y'. Then you caould use the two different MAX() function (SQL aggregate and SAS maximum functions) to find the maximum value over all three variable.

select ...,coalesce(X,max(max(x,y,z))) as X....

Oligolas · Posted 04-12-2021 05:35 AM

Hi,

since I'm resorting the records by placing the filled ones ahead, I do consider all records in a group. The assumption I take is take x,y and z will be missing simultaneously.

________________________
- Cheers -

Tom · Posted 04-10-2021 02:55 PM

So you want to replace missing values of X with the MAX() value of X for the group? Simple enough with PROC SQL.

First we need to clean up your example results so they match the rules. Let's add a ROW variable so we can recreate the order.

data have ;
  infile cards dsd dlm='|' truncover;
  row+1;
  input subjid $ (X Y Z) (:$1.) Coll count ex ;
cards;
101|Y|N|Y|1|1| 
101| | | |1|1|1
101|N|N|Y|1|1| 
101|Y|Y|N|1|2| 
101| | | |1|2| 
101|Y|N|N|1|2| 
102|Y|N|N| | | 
102|N|N|N|1|1| 
102| | | |1|1|1
102|Y|N|N|1|1| 
102|N|N|N|1|1| 
;

data expect ;
  infile cards dsd dlm='|' truncover;
  row+1;
  input subjid $ (X Y Z) (:$1.) Coll count ex ;
cards;
101|Y|N|Y|1|1|  
101|Y|N|Y|1|1|1 | Change Y to 'N' 
101|N|N|Y|1|1|  
101|Y|Y|N|1|2|  
101|Y|Y|N|1|2|  |  Remvoed EX=1 
101|Y|N|N|1|2|  
102|Y|N|N| | |  
102|N|N|N|1|1|  
102|Y|N|N|1|1|1 
102|Y|N|N|1|1|  
102|N|N|N|1|1|  
;

Now use PROC SQL and then use PROC COMPARE to test the results

proc sql ;
 create table want as 
 select row
      , subjid 
      , count
      , Coll
      , ex
      , coalesce(X,max(X)) as X
      , coalesce(Y,max(Y)) as Y
      , coalesce(Z,max(Z)) as Z
 from have
 group by subjid,count
 order by row
 ;
quit;

proc print data=want;
run;

proc compare data=want compare=expect;
  id subjid count row;
run;

Obs    row    subjid    count    Coll    ex    X    Y    Z

  1      1     101        1        1      .    Y    N    Y
  2      2     101        1        1      1    Y    N    Y
  3      3     101        1        1      .    N    N    Y
  4      4     101        2        1      .    Y    Y    N
  5      5     101        2        1      .    Y    Y    N
  6      6     101        2        1      .    Y    N    N
  7      7     102        .        .      .    Y    N    N
  8      8     102        1        1      .    N    N    N
  9      9     102        1        1      1    Y    N    N
 10     10     102        1        1      .    Y    N    N
 11     11     102        1        1      .    N    N    N

SAS 9.4 on WINDOWS                                                                                  10:28 Wednesday, April 7, 2021 113

The COMPARE Procedure
Comparison of WORK.WANT with WORK.EXPECT
(Method=EXACT)

Data Set Summary

Dataset               Created          Modified  NVar    NObs

WORK.WANT    10APR21:14:52:58  10APR21:14:52:58     8      11
WORK.EXPECT  10APR21:14:52:58  10APR21:14:52:58     8      11


Variables Summary

Number of Variables in Common: 8.
Number of ID Variables: 3.


Observation Summary

Observation      Base  Compare  ID

First Obs           1        1  subjid=101 count=1 row=1
Last  Obs          11       11  subjid=102 count=1 row=11

Number of Observations in Common: 11.
Total Number of Observations Read from WORK.WANT: 11.
Total Number of Observations Read from WORK.EXPECT: 11.

Number of Observations with Some Compared Variables Unequal: 0.
Number of Observations with All Compared Variables Equal: 11.

NOTE: No unequal values were found. All values compared are exactly equal.

Patrick · Posted 04-10-2021 09:08 PM

Here a data step approach using a hash lookup table.

data want;
  if _n_=1 then
    do;
      dcl hash h1(dataset:'have(where=(whichc("Y",x,y,z)>0))');
      h1.defineKey('subjid','count');
      h1.defineData('subjid','count');
      h1.defineDone();
    end;
  set have;
  if h1.check()=0 then 
    do;
      if missing(x) then x='Y';
      if missing(y) then y='Y';
      if missing(z) then z='Y';
    end;
run;

Set missing record to 'Y' if any record in the series has Y

Re: Set missing record to 'Y' if any record in the series has Y

Re: Set missing record to 'Y' if any record in the series has Y

Re: Set missing record to 'Y' if any record in the series has Y

Re: Set missing record to 'Y' if any record in the series has Y

Re: Set missing record to 'Y' if any record in the series has Y

Re: Set missing record to 'Y' if any record in the series has Y

Re: Set missing record to 'Y' if any record in the series has Y

Re: Set missing record to 'Y' if any record in the series has Y

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!