Solved: Re: How do I impute missing data with the mode of the row it is in?

SAS49 · Posted 04-26-2022 10:23 AM

Hi all,

I am very new and have no idea where to start on this question. I have a data set a sample of which I have placed below. It has two columns with chromosomes and genetic positions, and then each following column belongs to a different ID. For all columns after the first two, the desired values are 0, 1, or 2. However, I have missing data. How do I impute the missing values with the most common value in that row (i.e., for the first row, the missing value would be changed to 0, and for the second row, it would be changed to two, and so on). Thanks for any guidance anyone can provide. I am working in SAS 9.4.

Chrom	Pos	ID1	ID2	ID3	ID4	ID5	ID6	ID7	ID8
1	8456	0	0	0	0	0	.	1	1
1	9948	0	0	2	.	1	2	2	2
1	10058	1	1	1	1	1	1	.	1
2	6535	0	.	0	0	0	0	1	1
2	8967	1	1	0	1	1	1	0	.
2	10234	1	1	1	0	0	1	1	1
3	1023	0	0	1	2	0	0	0	0
3	9978	1	2	0	.	2	2	2	2
3	15032	0	0	0	1	1	2	1	1

PaigeMiller · Posted 04-26-2022 10:34 AM

If you are going to use built-in SAS functions, you could use PROC MEANS/SUMMARY or PROC UNIVARIATE to get the mode from a column of data. Or you can write your own function to get the mode from a row (since I don't think SAS provides a way to do this from a row).

I would do it via PROC SUMMARY, after changing the data to long format via PROC TRANSPOSE.

UNTESTED CODE

proc transpose data=have out=have_t;
    by pos notsorted;
    var id1-id8;
run;
proc summary data=have_t nway;
    class pos;
    var col1;
    output out=_mode_ mode=mode_value;
run;

From there you can merge the mode into the original data set and replace missings with the mode

proc sort data=have;
    by pos;
run;
data want;
    merge have _mode_;
    by pos;
    array id id1-id8;
    do i=1 to dim(id);
        if missing(id(i)) then id(i)=mode_value;
    end;
    drop i;
run;

If you want tested code, please provide the original data as SAS data step code, which you can type in yourself, or via these instructions.

--
Paige Miller

View solution in original post

PaigeMiller · Posted 04-26-2022 10:34 AM

If you are going to use built-in SAS functions, you could use PROC MEANS/SUMMARY or PROC UNIVARIATE to get the mode from a column of data. Or you can write your own function to get the mode from a row (since I don't think SAS provides a way to do this from a row).

I would do it via PROC SUMMARY, after changing the data to long format via PROC TRANSPOSE.

UNTESTED CODE

proc transpose data=have out=have_t;
    by pos notsorted;
    var id1-id8;
run;
proc summary data=have_t nway;
    class pos;
    var col1;
    output out=_mode_ mode=mode_value;
run;

From there you can merge the mode into the original data set and replace missings with the mode

proc sort data=have;
    by pos;
run;
data want;
    merge have _mode_;
    by pos;
    array id id1-id8;
    do i=1 to dim(id);
        if missing(id(i)) then id(i)=mode_value;
    end;
    drop i;
run;

If you want tested code, please provide the original data as SAS data step code, which you can type in yourself, or via these instructions.

--
Paige Miller

average_joe · Posted 04-26-2022 12:29 PM

While @PaigeMiller 's solution is direct and easy to understand, here's another option that uses a single data step.

data have;
infile cards dlm='09'x;
input Chrom	Pos	ID1	ID2	ID3	ID4	ID5	ID6	ID7	ID8;
cards;
1	8456	0	0	0	0	0	.	1	1
1	9948	0	0	2	.	1	2	2	2
1	10058	1	1	1	1	1	1	.	1
2	6535	0	.	0	0	0	0	1	1
2	8967	1	1	0	1	1	1	0	.
2	10234	1	1	1	0	0	1	1	1
3	1023	0	0	1	2	0	0	0	0
3	9978	1	2	0	.	2	2	2	2
3	15032	0	0	0	1	1	2	1	1
;
run;


data want;
set have;

array freqs(0:2) _temporary_;  * to capture freq of each value 0-2;
array ids(8) id1-id8;

* calc freqs of id values;
do _i = 1 to dim(ids);
    if not missing(ids(_i)) then freqs(ids(_i)) + 1; 
end;
_mode = whichn(max(of freqs(*)), of freqs(*)) - 1; * mode is the index with highest freq;
put _mode= freqs(0)=  freqs(1)=  freqs(2)=;

* set missing value(s) to the mode;
do _i = 1 to dim(ids);
    if missing(ids(_i)) then ids(_i) = _mode; 
end;

call missing(of freqs(*));
drop _:;
run;

PaigeMiller · Posted 04-26-2022 12:45 PM

Nice job, @average_joe

My reluctance to write my own code to compute the mode is that SAS has already done so, and SAS has debugged it carefully and it is proven in dozens (or maybe millions) of real world applications. So my general advice, especially to less experienced programmers like @SAS49, is to use those built-in SAS functions as much as possible. We have seen, here in the SAS communities, people write their own version of code for simple statistics, and get it wrong. I have even seen people insist on writing their own calculations of the mean, and get it wrong.

This applies less to more experienced programmers (such as @average_joe ) and to anyone who knows how to use the WHICHN function. But even though I would say this advice doesn't apply to me, I still try to follow this advice as much as I can. And so you won't find me writing my own data step code for mode or average or kurtosis or any similar statistics. Plus, writing my own code for these statistics takes a lot more time than using the built-in functions, and it requires me to spend time to convince myself that it is providing the right answers in the use cases that I have, and to debug and fix it if needed.

--
Paige Miller

Ksharp · Posted 04-27-2022 09:21 AM

joe's code is not in general .If there are some value other than 0 1 2, or ID1 ID2 ...... all are character variables ?

If you have SAS/IML .

data have;
infile cards expandtabs;
input Chrom	Pos	ID1	ID2	ID3	ID4	ID5	ID6	ID7	ID8;
cards;
1	8456	0	0	0	0	0	.	1	1
1	9948	0	0	2	.	1	2	2	2
1	10058	1	1	1	1	1	1	.	1
2	6535	0	.	0	0	0	0	1	1
2	8967	1	1	0	1	1	1	0	.
2	10234	1	1	1	0	0	1	1	1
3	1023	0	0	1	2	0	0	0	0
3	9978	1	2	0	.	2	2	2	2
3	15032	0	0	0	1	1	2	1	1
;
run;
proc iml;
use have(keep=ID:);
read all var _all_ into x[c=vname]; 
close;
do i=1 to nrow(x);
 row=x[i,];
 call tabulate(level,freq,row); 
 if countmiss(row) then do;
  idx=loc(row=.); 
  row[idx]=level[freq[<:>]];
   x[i,]=row;
 end;
end;
create temp from x[c=vname];
append from x;
close;
quit;
data want;
 merge have(drop=ID:) temp;
run;

How do I impute missing data with the mode of the row it is in?

Re: How do I impute missing data with the mode of the row it is in?

Re: How do I impute missing data with the mode of the row it is in?

Re: How do I impute missing data with the mode of the row it is in?

Re: How do I impute missing data with the mode of the row it is in?

Re: How do I impute missing data with the mode of the row it is in?

Click image to register for webinar

Classroom Training Available!