DATA Step, Macro, Functions and more

Imputing missing values of character variables in a large data set

Accepted Solution Solved
Reply
Contributor
Posts: 43
Accepted Solution

Imputing missing values of character variables in a large data set

Imputing missing numeric values is well discussed and solutions are well defined.
Is there a way we can replace a missing value of character variable with the values of observation with largest frequency?


Accepted Solutions
Solution
‎05-31-2018 07:21 AM
Super User
Posts: 10,280

Re: Imputing missing values of character variables in a large data set

Posted in reply to thesasuser

See a rather brute force example:

/* create example data */
data class;
set sashelp.class end=done;
output;
if done
then do;
  call missing(of _all_);
  output;
end;
run;

/* update  with largest freq */
proc sql;
create table upd (drop=count) as
select sex, count(*) as count from class
where sex ne ' '
group by sex
order by count descending;
create table want as
select name, coalesce(a.sex,b.sex) as sex, age, height, weight
from class a, upd (obs=1) b;
drop table upd;
quit;
---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code

View solution in original post


All Replies
Solution
‎05-31-2018 07:21 AM
Super User
Posts: 10,280

Re: Imputing missing values of character variables in a large data set

Posted in reply to thesasuser

See a rather brute force example:

/* create example data */
data class;
set sashelp.class end=done;
output;
if done
then do;
  call missing(of _all_);
  output;
end;
run;

/* update  with largest freq */
proc sql;
create table upd (drop=count) as
select sex, count(*) as count from class
where sex ne ' '
group by sex
order by count descending;
create table want as
select name, coalesce(a.sex,b.sex) as sex, age, height, weight
from class a, upd (obs=1) b;
drop table upd;
quit;
---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code
PROC Star
Posts: 1,283

Re: Imputing missing values of character variables in a large data set

Posted in reply to thesasuser

Yes there are ways. Show us your data to get a code answer

 

Valued Guide
Posts: 597

Re: Imputing missing values of character variables in a large data set

Posted in reply to thesasuser

What do you mean by values of observation with largest frequency, Can you be more specific. 

 

If you want all the missing characters to be replaced by some other values then you can use arrays as following:

data want;
set sashelp.class;
array char _Character_;
do over char;
if missing(char) then char="Some value";
end;
run;
Thanks,
Suryakiran
Occasional Contributor
Posts: 10

Re: Imputing missing values of character variables in a large data set

Posted in reply to thesasuser

Here's a solution. I would imagine that there are plenty of better solutions. For example if there is an equal amount of the most common it will just pick one of them.

 

DATA have;
INPUT ID Imp $;
*---+----1----+----2----+----3----+----4----+----5;
CARDS;
1 A
2 A
3 .
4 D
5 D
6 D
;

Proc sql outobs=1;
create table mostcommon as
select t1.imp, count(t1.imp) AS Count
from work.have as t1
where t1.imp is not missing
group by t1.imp
order by CALCULATED count desc
;
quit;

Proc sql;
create table want as
select have.ID, COALESCEC(have.Imp,mc.Imp) as Imp
from work.have as have
left join work.mostcommon as mc on 1=1
;
quit;

 

Valued Guide
Posts: 597

Re: Imputing missing values of character variables in a large data set

Posted in reply to DanielLangley
DATA have;
infile datalines missover;
INPUT ID Imp $;
CARDS;
1 A
2 A
3
4 D
5 D
6 D
;

proc sql noprint;
select Imp INTO:Rep
from (select Imp,COUNT(Imp) as count
		from have
			Group by Imp)
Having count=max(Count)
;
quit;


data want;
set Have;
array char _Character_;
do over char;
if missing(char) then char="&Rep";
end;
run;

NOTE: Character missing values are represented by blank, not period.

 

 

Thanks,
Suryakiran
Occasional Contributor
Posts: 10

Re: Imputing missing values of character variables in a large data set

Posted in reply to SuryaKiran

Thank you.

 

I didn't know about

 

infile datalines missover;

 

PROC Star
Posts: 266

Re: Imputing missing values of character variables in a large data set

Posted in reply to thesasuser

First, you have to find the most common values, e.g.:

proc freq data=sashelp.class noprint order=freq;

  tables age /out=age;

  tables height/ out=height;

  tables sex/ out=sex;

run;

Then, the easiest may be to put the values in macro variables:

data _null_;

  merge age sex height; /* with ORDER=FREQ, the first obs is the most common value */

  call symputx('age',age);

  call symputx('height',height);

  call symputx('sex',sex);

  stop;

run;

Given this test data:

data test;

  set sashelp.class;

  if _N_>8 then

    call missing(height,age);

  if name =:'L' then

    sex=' ';

run;

We can now fill in the blanks:

data want;

  set test;

  age=coalesce(age,&age);

  height=coalesce(height,&height);

  sex=coalescec(sex,"&sex");

run;

 

 

Contributor
Posts: 43

Re: Imputing missing values of character variables in a large data set

Thank you all for the response.

All solutions were good . But as I understand it I can accept one as the solution.

So selecting the first one.

Thanks again

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 249 views
  • 2 likes
  • 6 in conversation