triley
Obsidian | Level 7

I'm curious about the best ways to handle or model missing values when every value is missing for certain states. I am trying to predict purchase volume by customer, and I have customers in all 50 states. For a few variables I only have data in about half of the states, but I also want to use other independent variables that are available in all states.

My two initial thoughts were to 1) include a "state_has_data" indicator and use that in the model, or 2) create a model that estimates the volume for the states that do have data, and use that prediction where values are missing.

Are there better ways of handling this?

 

I've included an example below of something similar to what I'm trying to accomplish, using the sashelp.cars data set. In the example, "x1" where Origin='Asia' is the equivalent of a state with missing values for an independent variable. I've also added a regression procedure at the end for context. I am still exploring other regression procedures as well (e.g. PROC COUNTREG, PROC GLM, etc.). Also, I only have SAS Enterprise Guide, without Enterprise Miner or other modeling add-ons.

proc sql;
    create table have as
    select
        make
        ,Origin
        ,avg(MSRP) as x1          /* missing for every Asia make */
        ,avg(Horsepower) as x2
        ,avg(MPG_Highway) as x3
        ,count(1) as y            /* number of models per make */
    from
        (select
            make
            ,Origin
            /* blank out MSRP for Asia to mimic states with no data */
            ,case when Origin = 'Asia' then . else MSRP end as MSRP
            ,Horsepower
            ,MPG_Highway
        from sashelp.cars
        )
    group by make
        ,Origin
    order by Origin
    ;
quit;


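/* PROC REG drops any row with a missing model variable,
   so the Asia makes (missing x1) are excluded from this fit */
proc reg data=have;
    model y = x1-x3;
run;
quit;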
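
For reference, the first idea from the question (a "state_has_data" style indicator) could be sketched against the `have` table roughly as follows. The names `have_flagged`, `has_x1` and `x1_filled` are made up for illustration, and filling with 0 is only a placeholder value, not a recommendation. With this coding, rows without x1 get their own intercept shift through `has_x1` while still sharing the x2 and x3 slopes with the other rows.

```sas
/* Sketch only: assumes the HAVE table built above.
   has_x1 and x1_filled are illustrative names. */
data have_flagged;
    set have;
    has_x1    = not missing(x1);   /* 1 when x1 is observed, 0 otherwise */
    x1_filled = coalesce(x1, 0);   /* placeholder 0 so the row is not dropped */
run;

proc reg data=have_flagged;
    /* has_x1 absorbs the level difference for rows where x1 was filled in */
    model y = has_x1 x1_filled x2 x3;
run;
quit;
```

Because every row now has non-missing predictors, the Asia makes stay in the fit instead of being dropped.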

 

StatDave
SAS Super FREQ

Multiple imputation, as can be done in PROC MI, is one way of dealing with this sort of situation.
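
A minimal sketch of that workflow on the `have` table from the question might look like the following. The number of imputations, the seed, and the data set names `mi_out` and `est` are arbitrary choices for illustration. Note that x1 is missing for an entire group here, so the imputed values rest on the assumption that the relationship between x1 and the other variables in the observed groups carries over to the unobserved one.

```sas
/* Sketch only: nimpute, seed, and output names are arbitrary */
proc mi data=have nimpute=5 seed=12345 out=mi_out;
    var x1 x2 x3 y;            /* y is included so imputed x1 values reflect it */
run;

/* Fit the same regression once per imputed data set */
proc reg data=mi_out outest=est covout noprint;
    model y = x1 x2 x3;
    by _Imputation_;
run;
quit;

/* Pool the estimates across imputations */
proc mianalyze data=est;
    modeleffects Intercept x1 x2 x3;
run;
```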

triley
Obsidian | Level 7

Thanks. I am trying to see if there is anything besides basic imputation (I should have been clearer in the original question). I would almost like to treat the prediction separately based on whether the state has the data, i.e. have one model for states with the extra variables and another for the states without.
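
As a rough sketch of that two-model idea on the `have` table from the original post, one could fit each subset separately and stack the predictions. The names `pred_with`, `pred_without`, `pred_all` and `y_hat` are hypothetical, and this is just one way to set it up.

```sas
/* Sketch only: output data set and variable names are illustrative */

/* Model for groups where x1 is available */
proc reg data=have;
    where not missing(x1);
    model y = x1 x2 x3;
    output out=pred_with p=y_hat;
run;
quit;

/* Model for groups where x1 is entirely missing */
proc reg data=have;
    where missing(x1);
    model y = x2 x3;
    output out=pred_without p=y_hat;
run;
quit;

/* Stack the two sets of predictions into one table */
data pred_all;
    set pred_with pred_without;
run;
```

The trade-off is that each model only sees its own subset, so the x2 and x3 effects are estimated twice from fewer rows each.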
