Solved
Contributor
Posts: 43

# Imputing missing values of character variables in a large data set

Imputing missing numeric values is well discussed and solutions are well defined.
Is there a way we can replace a missing value of character variable with the values of observation with largest frequency?

Accepted Solutions
Solution
‎05-31-2018 07:21 AM
Super User
Posts: 10,280

## Re: Imputing missing values of character variables in a large data set

See a rather brute force example:

``````/* create example data */
data class;
set sashelp.class end=done;
output;
if done
then do;
call missing(of _all_);
output;
end;
run;

/* update  with largest freq */
proc sql;
create table upd (drop=count) as
select sex, count(*) as count from class
where sex ne ' '
group by sex
order by count descending;
create table want as
select name, coalesce(a.sex,b.sex) as sex, age, height, weight
from class a, upd (obs=1) b;
drop table upd;
quit;``````
---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code

All Replies
Solution
‎05-31-2018 07:21 AM
Super User
Posts: 10,280

## Re: Imputing missing values of character variables in a large data set

See a rather brute force example:

``````/* create example data */
data class;
set sashelp.class end=done;
output;
if done
then do;
call missing(of _all_);
output;
end;
run;

/* update  with largest freq */
proc sql;
create table upd (drop=count) as
select sex, count(*) as count from class
where sex ne ' '
group by sex
order by count descending;
create table want as
select name, coalesce(a.sex,b.sex) as sex, age, height, weight
from class a, upd (obs=1) b;
drop table upd;
quit;``````
---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code
PROC Star
Posts: 1,283

## Re: Imputing missing values of character variables in a large data set

Yes there are ways. Show us your data to get a code answer

Valued Guide
Posts: 597

## Re: Imputing missing values of character variables in a large data set

What do you mean by values of observation with largest frequency, Can you be more specific.

If you want all the missing characters to be replaced by some other values then you can use arrays as following:

``````data want;
set sashelp.class;
array char _Character_;
do over char;
if missing(char) then char="Some value";
end;
run;``````
Thanks,
Suryakiran
Occasional Contributor
Posts: 10

## Re: Imputing missing values of character variables in a large data set

Here's a solution. I would imagine that there are plenty of better solutions. For example if there is an equal amount of the most common it will just pick one of them.

``````DATA have;
INPUT ID Imp \$;
*---+----1----+----2----+----3----+----4----+----5;
CARDS;
1 A
2 A
3 .
4 D
5 D
6 D
;

Proc sql outobs=1;
create table mostcommon as
select t1.imp, count(t1.imp) AS Count
from work.have as t1
where t1.imp is not missing
group by t1.imp
order by CALCULATED count desc
;
quit;

Proc sql;
create table want as
select have.ID, COALESCEC(have.Imp,mc.Imp) as Imp
from work.have as have
left join work.mostcommon as mc on 1=1
;
quit;``````

Valued Guide
Posts: 597

## Re: Imputing missing values of character variables in a large data set

``````DATA have;
infile datalines missover;
INPUT ID Imp \$;
CARDS;
1 A
2 A
3
4 D
5 D
6 D
;

proc sql noprint;
select Imp INTO:Rep
from (select Imp,COUNT(Imp) as count
from have
Group by Imp)
Having count=max(Count)
;
quit;

data want;
set Have;
array char _Character_;
do over char;
if missing(char) then char="&Rep";
end;
run;``````

NOTE: Character missing values are represented by blank, not period.

Thanks,
Suryakiran
Occasional Contributor
Posts: 10

## Re: Imputing missing values of character variables in a large data set

Thank you.

``infile datalines missover;``

PROC Star
Posts: 266

## Re: Imputing missing values of character variables in a large data set

First, you have to find the most common values, e.g.:

proc freq data=sashelp.class noprint order=freq;

tables age /out=age;

tables height/ out=height;

tables sex/ out=sex;

run;

Then, the easiest may be to put the values in macro variables:

data _null_;

merge age sex height; /* with ORDER=FREQ, the first obs is the most common value */

call symputx('age',age);

call symputx('height',height);

call symputx('sex',sex);

stop;

run;

Given this test data:

data test;

set sashelp.class;

if _N_>8 then

call missing(height,age);

if name =:'L' then

sex=' ';

run;

We can now fill in the blanks:

data want;

set test;

age=coalesce(age,&age);

height=coalesce(height,&height);

sex=coalescec(sex,"&sex");

run;

Contributor
Posts: 43

## Re: Imputing missing values of character variables in a large data set

Thank you all for the response.

All solutions were good . But as I understand it I can accept one as the solution.

So selecting the first one.

Thanks again

☑ This topic is solved.