DATA Step, Macro, Functions and more

Matching variables in a dataset to impute missing data

Reply
Occasional Contributor
Posts: 8

Matching variables in a dataset to impute missing data

Hi,

 

 I'm working with a dataset that looks similar to the following:

 

Construction_date     Material

1991                           Cement

1972                           Wood

1991                           Cement

1991                          

2002                           Wood

1983                           Wood

1991                           Wood

 

I would like to impute missing material data through the following condition:

 

if there are more than one identical construction_date and the material from at least two of those dates match, then make it so that all other identical missing construction_date have the same material.

 

From the little I know, I would first have to sort my construction_date and material variables, and then I get lost from that point on.

 

proc sort data=data;
by construction_date material;
run;

 

Note that I am a very new to coding language, and this is a large dataset.

 

Would appreciate any and all help!

Respected Advisor
Posts: 4,644

Re: Matching variables in a dataset to impute missing data

What's the imputed material here?

 

Construction_date     Material

1991                           Cement

1972                           Wood

1991                           Cement

1991                          ?

2002                           Wood

1983                           Wood

1991                           Wood

1991                           Wood

PG
Trusted Advisor
Posts: 1,372

Re: Matching variables in a dataset to impute missing data

 

your data contains 4 records with constructionb_date = 1991:

   2 of then with material = cement, 1 with wood and one missing value.

 

what output and how many records do you expect for this date 1991 ? 

 

Occasional Contributor
Posts: 8

Re: Matching variables in a dataset to impute missing data

My dataset contains over 10,000 dates, and so there will be a variety of date repititions. 1991 will be one of the many repeating dates, and I'm hoping to fill in the missing material variable with the most common material for that construction_date.

 

Does that answer your question?

Trusted Advisor
Posts: 1,372

Re: Matching variables in a dataset to impute missing data

what will be your preferences if you have N times material A and same number of times material B -

what would you put to replace missing values?

 

the situation needs analyze before defining the algoritm to bo done by SAS

Occasional Contributor
Posts: 8

Re: Matching variables in a dataset to impute missing data

Good question! In that case I would just leave the missing as missing.

Trusted Advisor
Posts: 1,372

Re: Matching variables in a dataset to impute missing data

in such case you need:

1) count frequency of each material per construction_date

2) select those dates with one most common material

3) update your data where material is missing using 2ns step output

 

to count frequencies you may use severeal methods:

either count by SQL, group by construction_date and material

or use proc FREQ with OUTPUT statement.

or use data step and retain counters

 

Would you like to struggle it by yourself? 

 

Occasional Contributor
Posts: 8

Re: Matching variables in a dataset to impute missing data

I would love to have some immediate help! 

 

Thank you!

Occasional Contributor
Posts: 8

Re: Matching variables in a dataset to impute missing data

I do have at least a thousand dates though, and was hoping for a more automated process if possible?

Trusted Advisor
Posts: 1,372

Re: Matching variables in a dataset to impute missing data

run the attached program and follow the steps, as I designed previously.

 

try to understand each step what it is doing, how and why.

 

I hope you will be able to addapt it to your needs.

Attachment
Ask a Question
Discussion stats
  • 9 replies
  • 418 views
  • 1 like
  • 3 in conversation