Solved: replacing missing values with random selection from non-missing values...

MichaelWku · Posted 08-03-2015 04:54 PM

I want to substitute missing values for price with a random selection from the non-missing same product prices within the same date range. For example:

replace the missing price for product 2 in data line 2 with a random selection from 120, 130, or 80.
replace the missing price for product 1 in data line 3 with a random selection from 100, 110, or 90.

proc mi doesn’t seem to have such an option.

data one; input subject date product price;

datalines;

1 1 1 100

1 1 2 .

2 1 1 .

2 1 2 120

3 1 1 110

3 1 2 .

4 1 1 .

4 1 2 130

5 1 1 90

5 1 2 .

6 1 1 .

6 1 2 80

7 2 1 .

7 2 2 140

8 2 1 120

8 2 2 .

9 2 1 .

9 2 2 100

10 2 1 .

10 2 2 70

11 2 1 80

11 2 2 .

12 2 1 150

12 2 2 .

;

My actual dataset has 1.3 million observations with 25 products and 6 date ranges. Each subject observed only one price with approx. 20 missing prices for the other products. Not all subjects were exposed to all 25 products. The number of products each subject saw wasn’t fixed, some saw more, some saw less. The actual data has about 65,000 observed prices with the rest missing.

AskoLötjönen · Posted 08-04-2015 03:29 AM

There might be easier solutions but this works:

So I'm first calculating number of product / date. Then I'm generating lower and upper ranges for values then I'm generating random number for each observation and then joining observations with missing price with this random number.

data one;

input subject date product price;

datalines;

1 1 1 100

1 1 2 .

2 1 1 .

2 1 2 120

3 1 1 110

3 1 2 .

4 1 1 .

4 1 2 130

5 1 1 90

5 1 2 .

6 1 1 .

6 1 2 80

7 2 1 .

7 2 2 140

8 2 1 120

8 2 2 .

9 2 1 .

9 2 2 100

10 2 1 .

10 2 2 70

11 2 1 80

11 2 2 .

12 2 1 150

12 2 2 .

;

run;

data product_date;

set one(keep=date product price where=(price ne .));

proc sort;

by date product price;

run;

proc sql;

create table P_D_CNTS as

select date,

product,

count(price) as count

from product_date

group by date, product;

quit;

proc sql;

create table PD2 as

select pd.date,

pd.product,

pd.price,

cnts.count

from product_date pd,

P_D_CNTS cnts

where pd.date = cnts.date

and pd.product = cnts.product;

quit;

data PD3;

set Pd2;

by date product price;

retain lower upper 0;

if first.product then do;

lower = 0;

upper = 1/count;

end;

else do;

lower = lower + 1/count;

upper = upper + 1/count;

end;

run;

data two;

set one;

random=rand('uniform');

run;

proc sql;

create table want as

select two.subject,

two.date,

two.product,

two.price

from two

where price ne .

union

select two.subject,

two.date,

two.product,

PD3.price

from two,

PD3

where two.date = PD3.date

and two.product = pd3.product

and two.random between pd3.lower and pd3.upper

and two.price eq .;

quit;

View solution in original post

Patrick · Posted 08-03-2015 07:58 PM

I can only think of rather "complicated" approaches using hash tables to get what you're asking for. Do you have SAS/ETS licensed. If yes then doing things a bit differently using PROC EXPAND is eventually much simpler to code - and it uses an accepted approach of how to create time series without missings.

AskoLötjönen · Posted 08-04-2015 03:29 AM

There might be easier solutions but this works:

So I'm first calculating number of product / date. Then I'm generating lower and upper ranges for values then I'm generating random number for each observation and then joining observations with missing price with this random number.

data one;

input subject date product price;

datalines;

1 1 1 100

1 1 2 .

2 1 1 .

2 1 2 120

3 1 1 110

3 1 2 .

4 1 1 .

4 1 2 130

5 1 1 90

5 1 2 .

6 1 1 .

6 1 2 80

7 2 1 .

7 2 2 140

8 2 1 120

8 2 2 .

9 2 1 .

9 2 2 100

10 2 1 .

10 2 2 70

11 2 1 80

11 2 2 .

12 2 1 150

12 2 2 .

;

run;

data product_date;

set one(keep=date product price where=(price ne .));

proc sort;

by date product price;

run;

proc sql;

create table P_D_CNTS as

select date,

product,

count(price) as count

from product_date

group by date, product;

quit;

proc sql;

create table PD2 as

select pd.date,

pd.product,

pd.price,

cnts.count

from product_date pd,

P_D_CNTS cnts

where pd.date = cnts.date

and pd.product = cnts.product;

quit;

data PD3;

set Pd2;

by date product price;

retain lower upper 0;

if first.product then do;

lower = 0;

upper = 1/count;

end;

else do;

lower = lower + 1/count;

upper = upper + 1/count;

end;

run;

data two;

set one;

random=rand('uniform');

run;

proc sql;

create table want as

select two.subject,

two.date,

two.product,

two.price

from two

where price ne .

union

select two.subject,

two.date,

two.product,

PD3.price

from two,

PD3

where two.date = PD3.date

and two.product = pd3.product

and two.random between pd3.lower and pd3.upper

and two.price eq .;

quit;

RW9 · Posted 08-04-2015 05:01 AM

You don't mention how the random values to select from are decided on. If you have that information, then put that into an array, and then just do a random number for the array, example:

data one;

input subject date product price;

datalines;

1 1 1 100

1 1 2 .

2 1 1 .

2 1 2 120

3 1 1 110

3 1 2 .

4 1 1 .

4 1 2 130

5 1 1 90

;

run;

data want (drop=rand_sel1-rand_sel3);

set one;

call streaminit(123); /* seed */

array rand_sel{3} 8. (120,130,80);

if price=. then price=rand_sel{ceil(rand("Uniform")*4)};

run;

Ksharp · Posted 08-04-2015 09:27 AM

You should take a look the book (simulate data in sas) written by @Rick . and You will know how to simulate it .

Code: Program

data have; 
input subject date product price;
datalines;
1 1 1 100
1 1 2 .
2 1 1 .
2 1 2 120
3 1 1 110
3 1 2 .
4 1 1 .
4 1 2 130
5 1 1 90
5 1 2 .
6 1 1 .
6 1 2 80
7 2 1 .
7 2 2 140
8 2 1 120
8 2 2 .
9 2 1 .
9 2 2 100
10 2 1 .
10 2 2 70
11 2 1 80
11 2 2 .
12 2 1 150
12 2 2 .
;
run;
data has_one has_two;
 set have;
 if product=1 and not missing(price) then output has_one;
 if product=2 and not missing(price) then output has_two;
 keep date product price;
run;
data has_one;
 set has_one;
 by date;
 if first.date then n=0;
 n+1;
run;
data has_two;
 set has_two;
 by date;
 if first.date then n=0;
 n+1;
run;
proc sql;
 create table temp as
  select *,sum(product=1 and not missing(price)) as n_one_not_missing,
   sum(product=2 and not missing(price)) as n_two_not_missing
   from have
   group by date 
   order by subject,date,product;
quit;
data want;
 if _n_ eq 1 then do;
  if 0 then set has_one;
  declare hash ha1(dataset:'has_one');
  ha1.definekey('date','product','n');
  ha1.definedata('price');
  ha1.definedone();
  
  if 0 then set has_two;
  declare hash ha2(dataset:'has_two');
  ha2.definekey('date','product','n');
  ha2.definedata('price');
  ha2.definedone();
 end; 
call missing(of _all_);
set temp;
call streaminit(1234);
if product=1 and missing(price) then do;
   n=ceil(n_one_not_missing*rand('uniform'));
   ha1.find();
   end;
   
if product=2 and missing(price) then do;
   n=ceil(n_two_not_missing*rand('uniform'));
   ha2.find();
   end;
drop n_one_not_missing n_two_not_missing n;
run;

Xia Keshan

user24feb · Posted 08-04-2015 10:18 AM

Maybe you can create a list of possible observations and then draw randomly numbers from this list; could be similar to:

data one;
input subject date product price;
datalines;
1     1     1     100
1     1     2     .
2     1     1     .
2     1     2     120
3     1     1     110
3     1     2     .
4     1     1     .
4     1     2     130
5     1     1     90
5     1     2     .
6     1     1     .
6     1     2     80
7     2     1     .
7     2     2     140
8     2     1     120
8     2     2     .
9     2     1     .
9     2     2     100
10    2     1     .
10    2     2     70
11    2     1     80
11    2     2     .
12    2     1     150
12    2     2     .
;
run;

Proc Sort Data=one;
By date product;
Run;

Proc SQL;
Create Table Lookup As Select Distinct date, product, price From one Where price is not missing Order By date, product; * .. does not recognize weights;
Quit;

Data Lookup; * count max number - only if there aren't always 3 values;
Retain Nr;
Set Lookup;
By date product;
If First.date or First.product Then Nr=0;
Nr+1;
If Last.date or Last.product Then Flag=1; Else Flag=0;
Run;

Data Want (Drop=Nr Flag);
Merge one (in=inone) Lookup (Where=(Flag=1) Keep=date product Nr Flag);
By date product;
If Missing (price) Then Lookup=Int(Ranuni(1)*Nr)+1; * random key ;
If inone;
Run;

Proc Sort Data=Want;
By date product Lookup;
Run;

Data Want (Drop=Lookup price_lookup Flag);
Merge Want (in=inone) Lookup (Rename=(Nr=Lookup price=price_Lookup));
By date product Lookup;
If Missing (price) Then price=price_Lookup;
If inone;
Run;

Proc Sort Data=Want;
By date product subject;
Run;

PGStats · Posted 08-04-2015 02:28 PM

You must take care of the two limit cases:

When no price information is available for a given date and product
When there is no missing price information for a given date and product.

Otherwise, it is fairly simple, using SurveySelect to draw random prices

proc sql;

create table two as

select *, count(price) > 0 and nmiss(price) > 0 as imputable

from one

group by date, product;

create table missCounts as

select date, product, nmiss(price) as SampleSize

from two

where imputable

group by date, product;

quit;

proc surveyselect data=two(where=(imputable and price is not missing))

out=rndPrices(keep=date product price)

sampSize=missCounts method=urs outhits

seed=74645446;

strata date product;

run;

data three;

set two;

if missing(price) and imputable and not skip

then set rndPrices end=skip;

drop imputable;

run;

proc sort data=three; by subject date product; run;

proc print data=three noobs; run;

PG

KachiM · Posted 08-04-2015 04:52 PM

A single data step program gets what you want. First sort the data set. Assume that there will be no more than 10 non-missing values within the date-product combination. If necessary, you can set to 1000 or even larger. A count keeps a tag on the number of cells filled in the array k[ ] and the count is used for selecting the random sample. In the first do until loop, the non-missing values are filled into the array. In the second do until loop, each price of the record is checked and if found missing, a random sample is selected from the elements of the array, and the price is imputed.

proc sort data = have;

by date product;

run;

data want;

array k[10] _temporary_;

count = 0;

do until(last.product);

set have ;

by date product;

if price then do; count + 1; k[count] = price; end;

end;

do until(last.product);

set have;

by date product;

if price = . then do;

i = ceil(ranuni(123) * count);

price = k;

end;

output;

end;

drop i count;

run;

PGStats · Posted 08-04-2015 05:50 PM

Neat!

I suggest you replace

array k[10] _temporary_;

with

array k[0:10] _temporary_;

to guard against cases where count is zero.

PG

KachiM · Posted 08-04-2015 07:58 PM

PG

Thank you for your kind word.

In case the Price with zero value is valid, my program misses it since I am checking for Price > 0. If 0 is valid value then

I must check for non-missing value to store the price in the array. In that case, array index starting from 1 is OK.

Here is the program catching the non-missing Price value:

data want;

array k[10] _temporary_;

count = 0;

do until(last.product);

set have ;

by date product;

if not missing(price) then do; count + 1; k[count] = price; end;

end;

do until(last.product);

set have;

by date product;

if price = . then do;

i = ceil(ranuni(123) * count);

price = k;

end;

output;

end;

drop i count;

run;

Regards,

DataSP

PGStats · Posted 08-04-2015 10:36 PM

But I must insist, you need the [0:10] index range. Try your code with this dataset:

data have;

input subject date product price;

datalines;

10 2 1 .

10 2 2 70

11 2 1 80

11 2 2 .

12 2 1 0

12 2 2 .

13 3 1 .

14 3 2 .

15 3 2 99

16 3 3 1

;

PG

KachiM · Posted 08-05-2015 10:37 AM

PG:

Your observation to replace the lower boundary of the array to 0 is correct.

The program works if the array starts with 0-index for your example.

On practical consideration, such a situation may not arise.

Look at the subgroup:

subject = 13, date = 3 and product = 1.

It is a lone observation with a missing price value. In this case,

my COUNT which counts the non-missing price-values

in the subset, is zero. This results in imputing a missing value by another

missing value -- helped by 0-index.

So, either I change to 0-start or add a condition to check that there is

at least one nonmissing price value in a subgroup before doing the

imputation -- in that case there is no need to change the array index.

For the benefit of the Original Poster, I place the revised program here:

data want;

array k[10] _temporary_;

count = 0;

do until(last.product);

set have ;

by date product;

if not missing(price) then do; count + 1; k[count] = price; end;

end;

do until(last.product);

set have;

by date product;

if price = . and count then do;

i = ceil(ranuni(123) * count);

price = k;

end;

output;

end;

drop i count;

run;

-------------

DataSP

replacing missing values with random selection from non-missing values.

Re: replacing missing values with random selection from non-missing values.

Re: replacing missing values with random selection from non-missing values.

Re: replacing missing values with random selection from non-missing values.

Re: replacing missing values with random selection from non-missing values.

Re: replacing missing values with random selection from non-missing values.

Code: Program

Re: replacing missing values with random selection from non-missing values.

Re: replacing missing values with random selection from non-missing values.

Re: replacing missing values with random selection from non-missing values.

Re: replacing missing values with random selection from non-missing values.

Re: replacing missing values with random selection from non-missing values.

Re: replacing missing values with random selection from non-missing values.

Re: replacing missing values with random selection from non-missing values.

Code: Program

Registration is open