I want to substitute missing values for price with a random selection from the non-missing same product prices within the same date range. For example:
proc mi doesn’t seem to have such an option.
data one; input subject date product price;
datalines;
1 1 1 100
1 1 2 .
2 1 1 .
2 1 2 120
3 1 1 110
3 1 2 .
4 1 1 .
4 1 2 130
5 1 1 90
5 1 2 .
6 1 1 .
6 1 2 80
7 2 1 .
7 2 2 140
8 2 1 120
8 2 2 .
9 2 1 .
9 2 2 100
10 2 1 .
10 2 2 70
11 2 1 80
11 2 2 .
12 2 1 150
12 2 2 .
;
My actual dataset has 1.3 million observations with 25 products and 6 date ranges. Each subject observed only one price with approx. 20 missing prices for the other products. Not all subjects were exposed to all 25 products. The number of products each subject saw wasn’t fixed, some saw more, some saw less. The actual data has about 65,000 observed prices with the rest missing.
There might be easier solutions but this works:
So I'm first calculating number of product / date. Then I'm generating lower and upper ranges for values then I'm generating random number for each observation and then joining observations with missing price with this random number.
data one;
input subject date product price;
datalines;
1 1 1 100
1 1 2 .
2 1 1 .
2 1 2 120
3 1 1 110
3 1 2 .
4 1 1 .
4 1 2 130
5 1 1 90
5 1 2 .
6 1 1 .
6 1 2 80
7 2 1 .
7 2 2 140
8 2 1 120
8 2 2 .
9 2 1 .
9 2 2 100
10 2 1 .
10 2 2 70
11 2 1 80
11 2 2 .
12 2 1 150
12 2 2 .
;
run;
data product_date;
set one(keep=date product price where=(price ne .));
proc sort;
by date product price;
run;
proc sql;
create table P_D_CNTS as
select date,
product,
count(price) as count
from product_date
group by date, product;
quit;
proc sql;
create table PD2 as
select pd.date,
pd.product,
pd.price,
cnts.count
from product_date pd,
P_D_CNTS cnts
where pd.date = cnts.date
and pd.product = cnts.product;
quit;
data PD3;
set Pd2;
by date product price;
retain lower upper 0;
if first.product then do;
lower = 0;
upper = 1/count;
end;
else do;
lower = lower + 1/count;
upper = upper + 1/count;
end;
run;
data two;
set one;
random=rand('uniform');
run;
proc sql;
create table want as
select two.subject,
two.date,
two.product,
two.price
from two
where price ne .
union
select two.subject,
two.date,
two.product,
PD3.price
from two,
PD3
where two.date = PD3.date
and two.product = pd3.product
and two.random between pd3.lower and pd3.upper
and two.price eq .;
quit;
I can only think of rather "complicated" approaches using hash tables to get what you're asking for. Do you have SAS/ETS licensed. If yes then doing things a bit differently using PROC EXPAND is eventually much simpler to code - and it uses an accepted approach of how to create time series without missings.
There might be easier solutions but this works:
So I'm first calculating number of product / date. Then I'm generating lower and upper ranges for values then I'm generating random number for each observation and then joining observations with missing price with this random number.
data one;
input subject date product price;
datalines;
1 1 1 100
1 1 2 .
2 1 1 .
2 1 2 120
3 1 1 110
3 1 2 .
4 1 1 .
4 1 2 130
5 1 1 90
5 1 2 .
6 1 1 .
6 1 2 80
7 2 1 .
7 2 2 140
8 2 1 120
8 2 2 .
9 2 1 .
9 2 2 100
10 2 1 .
10 2 2 70
11 2 1 80
11 2 2 .
12 2 1 150
12 2 2 .
;
run;
data product_date;
set one(keep=date product price where=(price ne .));
proc sort;
by date product price;
run;
proc sql;
create table P_D_CNTS as
select date,
product,
count(price) as count
from product_date
group by date, product;
quit;
proc sql;
create table PD2 as
select pd.date,
pd.product,
pd.price,
cnts.count
from product_date pd,
P_D_CNTS cnts
where pd.date = cnts.date
and pd.product = cnts.product;
quit;
data PD3;
set Pd2;
by date product price;
retain lower upper 0;
if first.product then do;
lower = 0;
upper = 1/count;
end;
else do;
lower = lower + 1/count;
upper = upper + 1/count;
end;
run;
data two;
set one;
random=rand('uniform');
run;
proc sql;
create table want as
select two.subject,
two.date,
two.product,
two.price
from two
where price ne .
union
select two.subject,
two.date,
two.product,
PD3.price
from two,
PD3
where two.date = PD3.date
and two.product = pd3.product
and two.random between pd3.lower and pd3.upper
and two.price eq .;
quit;
You don't mention how the random values to select from are decided on. If you have that information, then put that into an array, and then just do a random number for the array, example:
data one;
input subject date product price;
datalines;
1 1 1 100
1 1 2 .
2 1 1 .
2 1 2 120
3 1 1 110
3 1 2 .
4 1 1 .
4 1 2 130
5 1 1 90
;
run;
data want (drop=rand_sel1-rand_sel3);
set one;
call streaminit(123); /* seed */
array rand_sel{3} 8. (120,130,80);
if price=. then price=rand_sel{ceil(rand("Uniform")*4)};
run;
You should take a look the book (simulate data in sas) written by @Rick . and You will know how to simulate it .
data have;
input subject date product price;
datalines;
1 1 1 100
1 1 2 .
2 1 1 .
2 1 2 120
3 1 1 110
3 1 2 .
4 1 1 .
4 1 2 130
5 1 1 90
5 1 2 .
6 1 1 .
6 1 2 80
7 2 1 .
7 2 2 140
8 2 1 120
8 2 2 .
9 2 1 .
9 2 2 100
10 2 1 .
10 2 2 70
11 2 1 80
11 2 2 .
12 2 1 150
12 2 2 .
;
run;
data has_one has_two;
set have;
if product=1 and not missing(price) then output has_one;
if product=2 and not missing(price) then output has_two;
keep date product price;
run;
data has_one;
set has_one;
by date;
if first.date then n=0;
n+1;
run;
data has_two;
set has_two;
by date;
if first.date then n=0;
n+1;
run;
proc sql;
create table temp as
select *,sum(product=1 and not missing(price)) as n_one_not_missing,
sum(product=2 and not missing(price)) as n_two_not_missing
from have
group by date
order by subject,date,product;
quit;
data want;
if _n_ eq 1 then do;
if 0 then set has_one;
declare hash ha1(dataset:'has_one');
ha1.definekey('date','product','n');
ha1.definedata('price');
ha1.definedone();
if 0 then set has_two;
declare hash ha2(dataset:'has_two');
ha2.definekey('date','product','n');
ha2.definedata('price');
ha2.definedone();
end;
call missing(of _all_);
set temp;
call streaminit(1234);
if product=1 and missing(price) then do;
n=ceil(n_one_not_missing*rand('uniform'));
ha1.find();
end;
if product=2 and missing(price) then do;
n=ceil(n_two_not_missing*rand('uniform'));
ha2.find();
end;
drop n_one_not_missing n_two_not_missing n;
run;
Xia Keshan
Maybe you can create a list of possible observations and then draw randomly numbers from this list; could be similar to:
data one;
input subject date product price;
datalines;
1 1 1 100
1 1 2 .
2 1 1 .
2 1 2 120
3 1 1 110
3 1 2 .
4 1 1 .
4 1 2 130
5 1 1 90
5 1 2 .
6 1 1 .
6 1 2 80
7 2 1 .
7 2 2 140
8 2 1 120
8 2 2 .
9 2 1 .
9 2 2 100
10 2 1 .
10 2 2 70
11 2 1 80
11 2 2 .
12 2 1 150
12 2 2 .
;
run;
Proc Sort Data=one;
By date product;
Run;
Proc SQL;
Create Table Lookup As Select Distinct date, product, price From one Where price is not missing Order By date, product; * .. does not recognize weights;
Quit;
Data Lookup; * count max number - only if there aren't always 3 values;
Retain Nr;
Set Lookup;
By date product;
If First.date or First.product Then Nr=0;
Nr+1;
If Last.date or Last.product Then Flag=1; Else Flag=0;
Run;
Data Want (Drop=Nr Flag);
Merge one (in=inone) Lookup (Where=(Flag=1) Keep=date product Nr Flag);
By date product;
If Missing (price) Then Lookup=Int(Ranuni(1)*Nr)+1; * random key ;
If inone;
Run;
Proc Sort Data=Want;
By date product Lookup;
Run;
Data Want (Drop=Lookup price_lookup Flag);
Merge Want (in=inone) Lookup (Rename=(Nr=Lookup price=price_Lookup));
By date product Lookup;
If Missing (price) Then price=price_Lookup;
If inone;
Run;
Proc Sort Data=Want;
By date product subject;
Run;
You must take care of the two limit cases:
Otherwise, it is fairly simple, using SurveySelect to draw random prices
proc sql;
create table two as
select *, count(price) > 0 and nmiss(price) > 0 as imputable
from one
group by date, product;
create table missCounts as
select date, product, nmiss(price) as SampleSize
from two
where imputable
group by date, product;
quit;
proc surveyselect data=two(where=(imputable and price is not missing))
out=rndPrices(keep=date product price)
sampSize=missCounts method=urs outhits
seed=74645446;
strata date product;
run;
data three;
set two;
if missing(price) and imputable and not skip
then set rndPrices end=skip;
drop imputable;
run;
proc sort data=three; by subject date product; run;
proc print data=three noobs; run;
PG
A single data step program gets what you want. First sort the data set. Assume that there will be no more than 10 non-missing values within the date-product combination. If necessary, you can set to 1000 or even larger. A count keeps a tag on the number of cells filled in the array k[ ] and the count is used for selecting the random sample. In the first do until loop, the non-missing values are filled into the array. In the second do until loop, each price of the record is checked and if found missing, a random sample is selected from the elements of the array, and the price is imputed.
proc sort data = have;
by date product;
run;
data want;
array k[10] _temporary_;
count = 0;
do until(last.product);
set have ;
by date product;
if price then do; count + 1; k[count] = price; end;
end;
do until(last.product);
set have;
by date product;
if price = . then do;
i = ceil(ranuni(123) * count);
price = k;
end;
output;
end;
drop i count;
run;
Neat!
I suggest you replace
array k[10] _temporary_;
with
array k[0:10] _temporary_;
to guard against cases where count is zero.
PG
PG
Thank you for your kind word.
In case the Price with zero value is valid, my program misses it since I am checking for Price > 0. If 0 is valid value then
I must check for non-missing value to store the price in the array. In that case, array index starting from 1 is OK.
Here is the program catching the non-missing Price value:
data want;
array k[10] _temporary_;
count = 0;
do until(last.product);
set have ;
by date product;
if not missing(price) then do; count + 1; k[count] = price; end;
end;
do until(last.product);
set have;
by date product;
if price = . then do;
i = ceil(ranuni(123) * count);
price = k;
end;
output;
end;
drop i count;
run;
Regards,
DataSP
But I must insist, you need the [0:10] index range. Try your code with this dataset:
data have;
input subject date product price;
datalines;
10 2 1 .
10 2 2 70
11 2 1 80
11 2 2 .
12 2 1 0
12 2 2 .
13 3 1 .
14 3 2 .
15 3 2 99
16 3 3 1
;
PG
PG:
Your observation to replace the lower boundary of the array to 0 is correct.
The program works if the array starts with 0-index for your example.
On practical consideration, such a situation may not arise.
Look at the subgroup:
subject = 13, date = 3 and product = 1.
It is a lone observation with a missing price value. In this case,
my COUNT which counts the non-missing price-values
in the subset, is zero. This results in imputing a missing value by another
missing value -- helped by 0-index.
So, either I change to 0-start or add a condition to check that there is
at least one nonmissing price value in a subgroup before doing the
imputation -- in that case there is no need to change the array index.
For the benefit of the Original Poster, I place the revised program here:
data want;
array k[10] _temporary_;
count = 0;
do until(last.product);
set have ;
by date product;
if not missing(price) then do; count + 1; k[count] = price; end;
end;
do until(last.product);
set have;
by date product;
if price = . and count then do;
i = ceil(ranuni(123) * count);
price = k;
end;
output;
end;
drop i count;
run;
-------------
DataSP
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.