Solved: Re: Imputing missing values of character variables in a large data set

thesasuser · Posted 05-28-2018 10:05 AM

Imputing missing numeric values is well discussed and solutions are well defined.
Is there a way we can replace a missing value of character variable with the values of observation with largest frequency?

Kurt_Bremser · Posted 05-28-2018 10:57 AM

See a rather brute force example:

/* create example data */
data class;
set sashelp.class end=done;
output;
if done
then do;
  call missing(of _all_);
  output;
end;
run;

/* update  with largest freq */
proc sql;
create table upd (drop=count) as
select sex, count(*) as count from class
where sex ne ' '
group by sex
order by count descending;
create table want as
select name, coalesce(a.sex,b.sex) as sex, age, height, weight
from class a, upd (obs=1) b;
drop table upd;
quit;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

Kurt_Bremser · Posted 05-28-2018 10:57 AM

See a rather brute force example:

/* create example data */
data class;
set sashelp.class end=done;
output;
if done
then do;
  call missing(of _all_);
  output;
end;
run;

/* update  with largest freq */
proc sql;
create table upd (drop=count) as
select sex, count(*) as count from class
where sex ne ' '
group by sex
order by count descending;
create table want as
select name, coalesce(a.sex,b.sex) as sex, age, height, weight
from class a, upd (obs=1) b;
drop table upd;
quit;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

PeterClemmensen · Posted 05-28-2018 10:57 AM

Yes there are ways. Show us your data to get a code answer

The DATA to DATA Step Macro
Blog: SASnrd

SuryaKiran · Posted 05-28-2018 10:57 AM

What do you mean by values of observation with largest frequency, Can you be more specific.

If you want all the missing characters to be replaced by some other values then you can use arrays as following:

data want;
set sashelp.class;
array char _Character_;
do over char;
if missing(char) then char="Some value";
end;
run;

Thanks,
Suryakiran

DanielLangley · Posted 05-28-2018 10:59 AM

Here's a solution. I would imagine that there are plenty of better solutions. For example if there is an equal amount of the most common it will just pick one of them.

DATA have;
INPUT ID Imp $;
*---+----1----+----2----+----3----+----4----+----5;
CARDS;
1 A
2 A
3 .
4 D
5 D
6 D
;

Proc sql outobs=1;
create table mostcommon as
select t1.imp, count(t1.imp) AS Count
from work.have as t1
where t1.imp is not missing
group by t1.imp
order by CALCULATED count desc
;
quit;

Proc sql;
create table want as
select have.ID, COALESCEC(have.Imp,mc.Imp) as Imp
from work.have as have
left join work.mostcommon as mc on 1=1
;
quit;

SuryaKiran · Posted 05-28-2018 11:23 AM

DATA have;
infile datalines missover;
INPUT ID Imp $;
CARDS;
1 A
2 A
3
4 D
5 D
6 D
;

proc sql noprint;
select Imp INTO:Rep
from (select Imp,COUNT(Imp) as count
		from have
			Group by Imp)
Having count=max(Count)
;
quit;


data want;
set Have;
array char _Character_;
do over char;
if missing(char) then char="&Rep";
end;
run;

NOTE: Character missing values are represented by blank, not period.

Thanks,
Suryakiran

DanielLangley · Posted 05-28-2018 01:35 PM

Thank you.

I didn't know about

infile datalines missover;

s_lassen · Posted 05-29-2018 04:46 AM

First, you have to find the most common values, e.g.:

proc freq data=sashelp.class noprint order=freq;

tables age /out=age;

tables height/ out=height;

tables sex/ out=sex;

run;

Then, the easiest may be to put the values in macro variables:

data _null_;

merge age sex height; /* with ORDER=FREQ, the first obs is the most common value */

call symputx('age',age);

call symputx('height',height);

call symputx('sex',sex);

stop;

run;

Given this test data:

data test;

set sashelp.class;

if _N_>8 then

call missing(height,age);

if name =:'L' then

sex=' ';

run;

We can now fill in the blanks:

data want;

set test;

age=coalesce(age,&age);

height=coalesce(height,&height);

sex=coalescec(sex,"&sex");

run;

thesasuser · Posted 05-31-2018 07:22 AM

Thank you all for the response.

All solutions were good . But as I understand it I can accept one as the solution.

So selecting the first one.

Thanks again

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away