SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

replacing missing values with random selection from non-missing values.

Accepted Solution Solved
Reply
New Contributor
Posts: 4
Accepted Solution

replacing missing values with random selection from non-missing values.

I want to substitute missing values for price with a random selection from the non-missing same product prices within the same date range. For example:

  • replace the missing price for product 2 in data line 2 with a random selection from 120, 130, or 80.
  • replace the missing price for product 1 in data line 3 with a random selection from 100, 110, or 90.

proc mi doesn’t seem to have such an option.

data one; input subject date product price;

datalines;

1     1     1     100

1     1     2     .

2     1     1     .

2     1     2     120

3     1     1     110

3     1     2     .

4     1     1     .

4     1     2     130

5     1     1     90

5     1     2     .

6     1     1     .

6     1     2     80

7     2     1     .

7     2     2     140

8     2     1     120

8     2     2     .

9     2     1     .

9     2     2     100

10    2     1     .

10    2     2     70

11    2     1     80

11    2     2     .

12    2     1     150

12    2     2     .

;


My actual dataset has 1.3 million observations with 25 products and 6 date ranges. Each subject observed only one price with approx. 20 missing prices for the other products.  Not all subjects were exposed to all 25 products.  The number of products each subject saw wasn’t fixed, some saw more, some saw less.  The actual data has about 65,000 observed prices with the rest missing.


Accepted Solutions
Solution
‎08-04-2015 03:29 AM
Contributor
Posts: 44

Re: replacing missing values with random selection from non-missing values.

There might be easier solutions but this works:

So I'm first calculating number of product / date. Then I'm generating lower and upper ranges for values then I'm generating random number for each observation and then joining observations with missing price with this random number.

data one;

input subject date product price;

datalines;

1     1     1     100

1     1     2     .

2     1     1     .

2     1     2     120

3     1     1     110

3     1     2     .

4     1     1     .

4     1     2     130

5     1     1     90

5     1     2     .

6     1     1     .

6     1     2     80

7     2     1     .

7     2     2     140

8     2     1     120

8     2     2     .

9     2     1     .

9     2     2     100

10    2     1     .

10    2     2     70

11    2     1     80

11    2     2     .

12    2     1     150

12    2     2     .

;

run;

data product_date;

  set one(keep=date product  price where=(price ne .));

proc sort;

  by date product price;

run;

proc sql;

   create table P_D_CNTS as

   select date,

          product,

          count(price) as count

   from   product_date

   group by date, product;

quit;

proc sql;

create table PD2 as

select pd.date,

        pd.product,

        pd.price,

        cnts.count

from   product_date pd,

        P_D_CNTS cnts

where  pd.date = cnts.date

    and pd.product = cnts.product;

quit;

data PD3;

   set Pd2;

by date product price;

retain lower upper 0;

if first.product then do;

   lower = 0;

   upper = 1/count;

end;

else do;

   lower = lower + 1/count;

   upper = upper + 1/count;

end;

run;

data two;

  set one;

  random=rand('uniform');

run;

proc sql;

  create table want as

  select two.subject,

         two.date,

         two.product,

         two.price

  from   two

  where  price ne .

  union

  select two.subject,

         two.date,

         two.product,

         PD3.price

  from   two,

         PD3

  where  two.date = PD3.date

     and two.product = pd3.product

     and two.random between pd3.lower and pd3.upper

     and two.price eq .;

quit;

View solution in original post


All Replies
Respected Advisor
Posts: 3,895

Re: replacing missing values with random selection from non-missing values.

I can only think of rather "complicated" approaches using hash tables to get what you're asking for. Do you have SAS/ETS licensed. If yes then doing things a bit differently using PROC EXPAND is eventually much simpler to code - and it uses an accepted approach of how to create time series without missings.

Solution
‎08-04-2015 03:29 AM
Contributor
Posts: 44

Re: replacing missing values with random selection from non-missing values.

There might be easier solutions but this works:

So I'm first calculating number of product / date. Then I'm generating lower and upper ranges for values then I'm generating random number for each observation and then joining observations with missing price with this random number.

data one;

input subject date product price;

datalines;

1     1     1     100

1     1     2     .

2     1     1     .

2     1     2     120

3     1     1     110

3     1     2     .

4     1     1     .

4     1     2     130

5     1     1     90

5     1     2     .

6     1     1     .

6     1     2     80

7     2     1     .

7     2     2     140

8     2     1     120

8     2     2     .

9     2     1     .

9     2     2     100

10    2     1     .

10    2     2     70

11    2     1     80

11    2     2     .

12    2     1     150

12    2     2     .

;

run;

data product_date;

  set one(keep=date product  price where=(price ne .));

proc sort;

  by date product price;

run;

proc sql;

   create table P_D_CNTS as

   select date,

          product,

          count(price) as count

   from   product_date

   group by date, product;

quit;

proc sql;

create table PD2 as

select pd.date,

        pd.product,

        pd.price,

        cnts.count

from   product_date pd,

        P_D_CNTS cnts

where  pd.date = cnts.date

    and pd.product = cnts.product;

quit;

data PD3;

   set Pd2;

by date product price;

retain lower upper 0;

if first.product then do;

   lower = 0;

   upper = 1/count;

end;

else do;

   lower = lower + 1/count;

   upper = upper + 1/count;

end;

run;

data two;

  set one;

  random=rand('uniform');

run;

proc sql;

  create table want as

  select two.subject,

         two.date,

         two.product,

         two.price

  from   two

  where  price ne .

  union

  select two.subject,

         two.date,

         two.product,

         PD3.price

  from   two,

         PD3

  where  two.date = PD3.date

     and two.product = pd3.product

     and two.random between pd3.lower and pd3.upper

     and two.price eq .;

quit;

Super User
Super User
Posts: 7,407

Re: replacing missing values with random selection from non-missing values.

You don't mention how the random values to select from are decided on.  If you have that information, then put that into an array, and then just do a random number for the array, example:

data one;

  input subject date product price;

datalines;

1     1     1     100

1     1     2     .

2     1     1     .

2     1     2     120

3     1     1     110

3     1     2     .

4     1     1     .

4     1     2     130

5     1     1     90

;

run;

data want (drop=rand_sel1-rand_sel3);

  set one;

  call streaminit(123); /* seed */

  array rand_sel{3} 8. (120,130,80);

  if price=. then price=rand_sel{ceil(rand("Uniform")*4)}; 

run;

Super User
Posts: 9,682

Re: replacing missing values with random selection from non-missing values.

You should take a look the book (simulate data in sas) written by @Rick . and You will know how to simulate it .

Code: Program

data have; 
input subject date product price;
datalines;
1 1 1 100
1 1 2 .
2 1 1 .
2 1 2 120
3 1 1 110
3 1 2 .
4 1 1 .
4 1 2 130
5 1 1 90
5 1 2 .
6 1 1 .
6 1 2 80
7 2 1 .
7 2 2 140
8 2 1 120
8 2 2 .
9 2 1 .
9 2 2 100
10 2 1 .
10 2 2 70
11 2 1 80
11 2 2 .
12 2 1 150
12 2 2 .
;
run;
data has_one has_two;
set have;
if product=1 and not missing(price) then output has_one;
if product=2 and not missing(price) then output has_two;
keep date product price;
run;
data has_one;
set has_one;
by date;
if first.date then n=0;
n+1;
run;
data has_two;
set has_two;
by date;
if first.date then n=0;
n+1;
run;
proc sql;
create table temp as
  select *,sum(product=1 and not missing(price)) as n_one_not_missing,
   sum(product=2 and not missing(price)) as n_two_not_missing
   from have
   group by date
   order by subject,date,product;
quit;
data want;
if _n_ eq 1 then do;
  if 0 then set has_one;
  declare hash ha1(dataset:'has_one');
  ha1.definekey('date','product','n');
  ha1.definedata('price');
  ha1.definedone();
 
  if 0 then set has_two;
  declare hash ha2(dataset:'has_two');
  ha2.definekey('date','product','n');
  ha2.definedata('price');
  ha2.definedone();
end;
call missing(of _all_);
set temp;
call streaminit(1234);
if product=1 and missing(price) then do;
   n=ceil(n_one_not_missing*rand('uniform'));
   ha1.find();
   end;
  
if product=2 and missing(price) then do;
   n=ceil(n_two_not_missing*rand('uniform'));
   ha2.find();
   end;
drop n_one_not_missing n_two_not_missing n;
run;

Xia Keshan

Super Contributor
Posts: 336

Re: replacing missing values with random selection from non-missing values.

Maybe you can create a list of possible observations and then draw randomly numbers from this list; could be similar to:

data one;
  input subject date product price;
  datalines;
1     1     1     100
1     1     2     .
2     1     1     .
2     1     2     120
3     1     1     110
3     1     2     .
4     1     1     .
4     1     2     130
5     1     1     90
5     1     2     .
6     1     1     .
6     1     2     80
7     2     1     .
7     2     2     140
8     2     1     120
8     2     2     .
9     2     1     .
9     2     2     100
10    2     1     .
10    2     2     70
11    2     1     80
11    2     2     .
12    2     1     150
12    2     2     .
;
run;

Proc Sort Data=one;
  By date product;
Run;

Proc SQL;
  Create Table Lookup As Select Distinct date, product, price From one Where price is not missing Order By date, product; * .. does not recognize weights;
Quit;

Data Lookup; * count max number - only if there aren't always 3 values;
  Retain Nr;
  Set Lookup;
  By date product;
  If First.date or First.product Then Nr=0;
  Nr+1;
  If Last.date or Last.product Then Flag=1; Else Flag=0;
Run;

Data Want (Drop=Nr Flag);
  Merge one (in=inone) Lookup (Where=(Flag=1) Keep=date product Nr Flag);
  By date product;
  If Missing (price) Then Lookup=Int(Ranuni(1)*Nr)+1; * random key ;
  If inone;
Run;

Proc Sort Data=Want;
  By date product Lookup;
Run;

Data Want (Drop=Lookup price_lookup Flag);
  Merge Want (in=inone) Lookup (Rename=(Nr=Lookup price=price_Lookup));
  By date product Lookup;
  If Missing (price) Then price=price_Lookup;
  If inone;
Run;

Proc Sort Data=Want;
  By date product subject;
Run;

Respected Advisor
Posts: 4,651

Re: replacing missing values with random selection from non-missing values.

You must take care of the two limit cases:

  1. When no price information is available for a given date and product
  2. When there is no missing price information for a given date and product.

Otherwise, it is fairly simple, using SurveySelect to draw random prices

proc sql;

create table two as

select *, count(price) > 0 and nmiss(price) > 0 as imputable

from one

group by date, product;

create table missCounts as

select date, product, nmiss(price) as SampleSize

from two

where imputable

group by date, product;

quit;

proc surveyselect data=two(where=(imputable and price is not missing))

    out=rndPrices(keep=date product price)

    sampSize=missCounts method=urs outhits

    seed=74645446;

strata date product;

run;

data three;

set two;

if missing(price) and imputable and not skip

    then set rndPrices end=skip;

drop imputable;

run;

proc sort data=three; by subject date product; run;

proc print data=three noobs; run;

PG

PG
Super Contributor
Posts: 255

Re: replacing missing values with random selection from non-missing values.

A single data step program gets what you want. First sort the data set. Assume that there will be no more than 10 non-missing values within the date-product combination. If necessary, you can set to 1000 or even larger. A count keeps a tag on the number of cells filled in the array k[ ] and the count is used for selecting the random sample. In the first do until loop, the non-missing values are filled into the array. In the second do until loop, each price of the record is checked and if found missing, a random sample is selected from the elements of the array, and the price is imputed.

proc sort data = have;

by date product;

run;

data want;

   array k[10] _temporary_;

   count = 0;

   do until(last.product);

      set have ;

      by date product;

      if price then do; count + 1; k[count] = price; end;

   end;

   do until(last.product);

      set have;

      by date product;

      if price = . then do;

         i = ceil(ranuni(123) * count);

         price = k;

      end;

      output;

   end;

drop i count;

run;

Respected Advisor
Posts: 4,651

Re: replacing missing values with random selection from non-missing values.

Neat!

I suggest you replace

   array k[10] _temporary_;

with

   array k[0:10] _temporary_;


to guard against cases where count is zero.


PG

PG
Super Contributor
Posts: 255

Re: replacing missing values with random selection from non-missing values.

PG

Thank you for your kind word.

In case the Price with zero value is valid, my program misses it since I am checking for Price > 0. If 0 is valid value then

I must check for non-missing value to store the price in the array. In that case, array index starting from 1 is OK.

Here is the program catching the non-missing Price value:

data want;

   array k[10] _temporary_;

   count = 0;

   do until(last.product);

      set have ;

      by date product;

      if not missing(price) then do; count + 1; k[count] = price; end;

   end;

   do until(last.product);

      set have;

      by date product;

      if price = . then do;

         i = ceil(ranuni(123) * count);

         price = k;

      end;

      output;

   end;

drop i count;

run;

Regards,

DataSP

Respected Advisor
Posts: 4,651

Re: replacing missing values with random selection from non-missing values.

But I must insist, you need the [0:10] index range. Try your code with this dataset:

data have;

input subject date product price;

datalines;

10    2     1     .

10    2     2     70

11    2     1     80

11    2     2     .

12    2     1     0

12    2     2     .

13    3     1     .

14    3     2     .

15    3     2     99

16    3     3     1

;

PG

PG
Super Contributor
Posts: 255

Re: replacing missing values with random selection from non-missing values.

PG:

Your observation to replace the lower boundary of the array to 0 is correct.

The program works if the array starts with 0-index for your example.

On practical consideration, such a situation may not arise.

Look at the subgroup:

subject = 13, date = 3 and product = 1.

It is a lone observation with a missing price value. In this case,

my COUNT which counts the non-missing price-values

in the subset, is zero. This results in imputing a missing value by another

missing value -- helped by 0-index.

So, either I change to 0-start or add a condition to check that there is

at least one nonmissing price value in a subgroup before doing the

imputation -- in that case there is no need to change the array index.

For the benefit of the Original Poster, I place the revised program here:

data want;

   array k[10] _temporary_;

   count = 0;

   do until(last.product);

      set have ;

      by date product;

      if not missing(price) then do; count + 1; k[count] = price; end;

   end;

   do until(last.product);

      set have;

      by date product;

      if price = . and count then do;

         i = ceil(ranuni(123) * count);

         price = k;

      end;

      output;

   end;

drop i count;

run;

-------------

DataSP

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 11 replies
  • 718 views
  • 0 likes
  • 8 in conversation