BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
ev_sfn
Obsidian | Level 7

Hello all,

 

So, the cells are grouped and sorted by month

I want for each cell to impute the missing values of the sales.

You can find a sample of the dataset in the attached file.

 

I am trying a way to do it with proc iml because it is a big dataset.

Is there a way to impute this values easy and efficient?

The problem is that the sales can be missing in different months. If the missing month is in the first date entry for each id we need to impute with the MA of the next three sales. 

Any help on this?

1 ACCEPTED SOLUTION

Accepted Solutions
PeterClemmensen
Tourmaline | Level 20

A few points.

 

If you have SAS/ETS, I agree with @PaigeMiller, that is probably the right tool. Yes, you can do this with IML and I'm sure users on the community can develop a flawless solution. However, I don't think it is the right tool. 

 

I am unsure what you mean by "but in that case i have to add 6 columns" What 6 columns. In your posted data, sales is the only column of interest?

 

Consider the approach below. If a sales value is missing, I simply sum the sales values for the next three dates within the cell and divide it by three. I think this is what you want. Let me know if this works for you and feel free to ask 🙂

 

Your sample data as a data set:

 

Spoiler
data have;
input cell date :yymmn6. sales :commax12.4;
infile datalines missover;
format date ddmmyy10.;
datalines;
1434 201811            
1434 201812            
1434 201901 214631,6803
1434 201902 201778,3392
1434 201903 165076,5954
1434 201904 252242,3617
1434 201905 170223,7731
1434 201906 294523,7793
1434 201907 398728,5192
1434 201908 583076,8272
1434 201909 399424,1947
1434 201910 494673,3063
1434 201911 333726,8307
1434 201912 318686,7712
1434 202001 294449,1546
1434 202002 429696,4421
1434 202003 452016,8626
1434 202004 431562,1449
1434 202005 5511516,627
1434 202006 6348228,364
1434 202007 6220044,55 
1434 202008 5842569,907
1434 202009 4890457,64 
1434 202010 5387110,588
1436 201811 2287962,899
1436 201812 1986564,927
1436 201901            
1436 201902 2134790,781
1436 201903 2317516,947
1436 201904 2540812,88 
1436 201905 2448318,949
1436 201906 7223569,947
1436 201907 8161358,457
1436 201908 10257010,78
1436 201909 7186520,174
1436 201910 4857060,553
1436 201911 7023168,827
1436 201912 10884517,67
1436 202001 11752482,8 
1436 202002 11924918,75
1436 202003 15974028,3 
1436 202004 11580794,99
1436 202005 4701277,569
1436 202006 2924980,048
1436 202007 2581084,998
1436 202008 2524992,138
1436 202009 2746781,349
1436 202010 3271639,218
1438 201811            
1438 201812 698798,4124
1438 201901 885871,8636
1438 201902 642623,4632
1438 201903 813226,042 
1438 201904 882585,1058
1438 201905 800612,9525
1438 201906 2032830,203
1438 201907 1935775,492
1438 201908 2530053,384
1438 201909 1821793,344
1438 201910 1871434,765
1438 201911 2501551,663
1438 201912 2768558,073
1438 202001 3152440,899
1438 202002 3153837,414
1438 202003 3196685,727
1438 202004 2961184,902
1438 202005 998758,7773
1438 202006 324890,9795
1438 202007 209524,6794
1438 202008 558644,5453
1438 202009            
1438 202010 1437799,276
1498 201811            
1498 201812 2104448,953
1498 201901            
1498 201902 1291173,037
1498 201903 1000758,412
1498 201904 721018,5833
1498 201905 1084681,96 
1498 201906 2576410,812
1498 201907 3075960,317
1498 201908 3046212,763
1498 201909 3042361,017
1498 201910 3633387,641
1498 201911 3176144,301
1498 201912 3083226,222
1498 202001 3278377,221
1498 202002 3228326,399
1498 202003 3851469,683
1498 202004 2137129,145
1498 202005 2717621,899
1498 202006 2736698,842
1498 202007 2654396,15 
1498 202008 3208936,351
1498 202009 3454502,452
1498 202010 528901,5462
;

The code:

 

data want(keep = cell date sales);
   dcl hash h();
   h.definekey("cell", "date");
   h.definedata("sales");
   h.definedone();

   do until (last.cell);
      set have;
      by cell;
      h.add();
   end;

   do until (last.cell);
      set have;
      by cell;
      s = 0;
      if sales = . then do;
         do i = 1 to 3;
            rc = h.find(key : cell, key : intnx('month', date, i, 'b'));
            s + sales;
         end;
         sales = divide(s, 3);
      end;
      output;
   end;

   h.clear();
run;

 

The result:

 

Spoiler
cell date       sales 
1434 01/11/2018 138803.34 
1434 01/12/2018 193828.87 
1434 01/01/2019 214631.68 
1434 01/02/2019 201778.34 
1434 01/03/2019 165076.60 
1434 01/04/2019 252242.36 
1434 01/05/2019 170223.77 
1434 01/06/2019 294523.78 
1434 01/07/2019 398728.52 
1434 01/08/2019 583076.83 
1434 01/09/2019 399424.19 
1434 01/10/2019 494673.31 
1434 01/11/2019 333726.83 
1434 01/12/2019 318686.77 
1434 01/01/2020 294449.15 
1434 01/02/2020 429696.44 
1434 01/03/2020 452016.86 
1434 01/04/2020 431562.14 
1434 01/05/2020 5511516.63 
1434 01/06/2020 6348228.36 
1434 01/07/2020 6220044.55 
1434 01/08/2020 5842569.91 
1434 01/09/2020 4890457.64 
1434 01/10/2020 5387110.59 
1436 01/11/2018 2287962.90 
1436 01/12/2018 1986564.93 
1436 01/01/2019 2331040.20 
1436 01/02/2019 2134790.78 
1436 01/03/2019 2317516.95 
1436 01/04/2019 2540812.88 
1436 01/05/2019 2448318.95 
1436 01/06/2019 7223569.95 
1436 01/07/2019 8161358.46 
1436 01/08/2019 10257010.78 
1436 01/09/2019 7186520.17 
1436 01/10/2019 4857060.55 
1436 01/11/2019 7023168.83 
1436 01/12/2019 10884517.67 
1436 01/01/2020 11752482.80 
1436 01/02/2020 11924918.75 
1436 01/03/2020 15974028.30 
1436 01/04/2020 11580794.99 
1436 01/05/2020 4701277.57 
1436 01/06/2020 2924980.05 
1436 01/07/2020 2581085.00 
1436 01/08/2020 2524992.14 
1436 01/09/2020 2746781.35 
1436 01/10/2020 3271639.22 
1438 01/11/2018 742431.25 
1438 01/12/2018 698798.41 
1438 01/01/2019 885871.86 
1438 01/02/2019 642623.46 
1438 01/03/2019 813226.04 
1438 01/04/2019 882585.11 
1438 01/05/2019 800612.95 
1438 01/06/2019 2032830.20 
1438 01/07/2019 1935775.49 
1438 01/08/2019 2530053.38 
1438 01/09/2019 1821793.34 
1438 01/10/2019 1871434.77 
1438 01/11/2019 2501551.66 
1438 01/12/2019 2768558.07 
1438 01/01/2020 3152440.90 
1438 01/02/2020 3153837.41 
1438 01/03/2020 3196685.73 
1438 01/04/2020 2961184.90 
1438 01/05/2020 998758.78 
1438 01/06/2020 324890.98 
1438 01/07/2020 209524.68 
1438 01/08/2020 558644.55 
1438 01/09/2020 1437799.28 
1438 01/10/2020 1437799.28 
1498 01/11/2018 1131874.00 
1498 01/12/2018 2104448.95 
1498 01/01/2019 1004316.68 
1498 01/02/2019 1291173.04 
1498 01/03/2019 1000758.41 
1498 01/04/2019 721018.58 
1498 01/05/2019 1084681.96 
1498 01/06/2019 2576410.81 
1498 01/07/2019 3075960.32 
1498 01/08/2019 3046212.76 
1498 01/09/2019 3042361.02 
1498 01/10/2019 3633387.64 
1498 01/11/2019 3176144.30 
1498 01/12/2019 3083226.22 
1498 01/01/2020 3278377.22 
1498 01/02/2020 3228326.40 
1498 01/03/2020 3851469.68 
1498 01/04/2020 2137129.15 
1498 01/05/2020 2717621.90 
1498 01/06/2020 2736698.84 
1498 01/07/2020 2654396.15 
1498 01/08/2020 3208936.35 
1498 01/09/2020 3454502.45 
1498 01/10/2020 528901.55 

 

View solution in original post

14 REPLIES 14
ev_sfn
Obsidian | Level 7

I tried to solve this by using leads and lags but in that case i have to add 6 columns (lag1,lag2,lag3,lead1,lead2,lead3) and then get the avg of the lags or leads if there is no measing in the fields of lags in each row else it will calculate the leads 

I thought it will be more efficient using proc iml and flags in positions 1 or 2 based on the place the sale is missing.

Do you think it would not work with proc iml?

PaigeMiller
Diamond | Level 26

I think you can do this in PROC EXPAND (if you have it in your SAS license)

--
Paige Miller
ev_sfn
Obsidian | Level 7

Thank you for your response.

I tried with proc expand but the way i did it was to create 6 columns with the leads and the lags so I can get the average of the three lags or leads based on the missing in the sales. Maybe is wrong my thinking on this because i dont know if i can fill up the missings with the lag / lead of the 3 previous/next values straight away.

PeterClemmensen
Tourmaline | Level 20

A few points.

 

If you have SAS/ETS, I agree with @PaigeMiller, that is probably the right tool. Yes, you can do this with IML and I'm sure users on the community can develop a flawless solution. However, I don't think it is the right tool. 

 

I am unsure what you mean by "but in that case i have to add 6 columns" What 6 columns. In your posted data, sales is the only column of interest?

 

Consider the approach below. If a sales value is missing, I simply sum the sales values for the next three dates within the cell and divide it by three. I think this is what you want. Let me know if this works for you and feel free to ask 🙂

 

Your sample data as a data set:

 

Spoiler
data have;
input cell date :yymmn6. sales :commax12.4;
infile datalines missover;
format date ddmmyy10.;
datalines;
1434 201811            
1434 201812            
1434 201901 214631,6803
1434 201902 201778,3392
1434 201903 165076,5954
1434 201904 252242,3617
1434 201905 170223,7731
1434 201906 294523,7793
1434 201907 398728,5192
1434 201908 583076,8272
1434 201909 399424,1947
1434 201910 494673,3063
1434 201911 333726,8307
1434 201912 318686,7712
1434 202001 294449,1546
1434 202002 429696,4421
1434 202003 452016,8626
1434 202004 431562,1449
1434 202005 5511516,627
1434 202006 6348228,364
1434 202007 6220044,55 
1434 202008 5842569,907
1434 202009 4890457,64 
1434 202010 5387110,588
1436 201811 2287962,899
1436 201812 1986564,927
1436 201901            
1436 201902 2134790,781
1436 201903 2317516,947
1436 201904 2540812,88 
1436 201905 2448318,949
1436 201906 7223569,947
1436 201907 8161358,457
1436 201908 10257010,78
1436 201909 7186520,174
1436 201910 4857060,553
1436 201911 7023168,827
1436 201912 10884517,67
1436 202001 11752482,8 
1436 202002 11924918,75
1436 202003 15974028,3 
1436 202004 11580794,99
1436 202005 4701277,569
1436 202006 2924980,048
1436 202007 2581084,998
1436 202008 2524992,138
1436 202009 2746781,349
1436 202010 3271639,218
1438 201811            
1438 201812 698798,4124
1438 201901 885871,8636
1438 201902 642623,4632
1438 201903 813226,042 
1438 201904 882585,1058
1438 201905 800612,9525
1438 201906 2032830,203
1438 201907 1935775,492
1438 201908 2530053,384
1438 201909 1821793,344
1438 201910 1871434,765
1438 201911 2501551,663
1438 201912 2768558,073
1438 202001 3152440,899
1438 202002 3153837,414
1438 202003 3196685,727
1438 202004 2961184,902
1438 202005 998758,7773
1438 202006 324890,9795
1438 202007 209524,6794
1438 202008 558644,5453
1438 202009            
1438 202010 1437799,276
1498 201811            
1498 201812 2104448,953
1498 201901            
1498 201902 1291173,037
1498 201903 1000758,412
1498 201904 721018,5833
1498 201905 1084681,96 
1498 201906 2576410,812
1498 201907 3075960,317
1498 201908 3046212,763
1498 201909 3042361,017
1498 201910 3633387,641
1498 201911 3176144,301
1498 201912 3083226,222
1498 202001 3278377,221
1498 202002 3228326,399
1498 202003 3851469,683
1498 202004 2137129,145
1498 202005 2717621,899
1498 202006 2736698,842
1498 202007 2654396,15 
1498 202008 3208936,351
1498 202009 3454502,452
1498 202010 528901,5462
;

The code:

 

data want(keep = cell date sales);
   dcl hash h();
   h.definekey("cell", "date");
   h.definedata("sales");
   h.definedone();

   do until (last.cell);
      set have;
      by cell;
      h.add();
   end;

   do until (last.cell);
      set have;
      by cell;
      s = 0;
      if sales = . then do;
         do i = 1 to 3;
            rc = h.find(key : cell, key : intnx('month', date, i, 'b'));
            s + sales;
         end;
         sales = divide(s, 3);
      end;
      output;
   end;

   h.clear();
run;

 

The result:

 

Spoiler
cell date       sales 
1434 01/11/2018 138803.34 
1434 01/12/2018 193828.87 
1434 01/01/2019 214631.68 
1434 01/02/2019 201778.34 
1434 01/03/2019 165076.60 
1434 01/04/2019 252242.36 
1434 01/05/2019 170223.77 
1434 01/06/2019 294523.78 
1434 01/07/2019 398728.52 
1434 01/08/2019 583076.83 
1434 01/09/2019 399424.19 
1434 01/10/2019 494673.31 
1434 01/11/2019 333726.83 
1434 01/12/2019 318686.77 
1434 01/01/2020 294449.15 
1434 01/02/2020 429696.44 
1434 01/03/2020 452016.86 
1434 01/04/2020 431562.14 
1434 01/05/2020 5511516.63 
1434 01/06/2020 6348228.36 
1434 01/07/2020 6220044.55 
1434 01/08/2020 5842569.91 
1434 01/09/2020 4890457.64 
1434 01/10/2020 5387110.59 
1436 01/11/2018 2287962.90 
1436 01/12/2018 1986564.93 
1436 01/01/2019 2331040.20 
1436 01/02/2019 2134790.78 
1436 01/03/2019 2317516.95 
1436 01/04/2019 2540812.88 
1436 01/05/2019 2448318.95 
1436 01/06/2019 7223569.95 
1436 01/07/2019 8161358.46 
1436 01/08/2019 10257010.78 
1436 01/09/2019 7186520.17 
1436 01/10/2019 4857060.55 
1436 01/11/2019 7023168.83 
1436 01/12/2019 10884517.67 
1436 01/01/2020 11752482.80 
1436 01/02/2020 11924918.75 
1436 01/03/2020 15974028.30 
1436 01/04/2020 11580794.99 
1436 01/05/2020 4701277.57 
1436 01/06/2020 2924980.05 
1436 01/07/2020 2581085.00 
1436 01/08/2020 2524992.14 
1436 01/09/2020 2746781.35 
1436 01/10/2020 3271639.22 
1438 01/11/2018 742431.25 
1438 01/12/2018 698798.41 
1438 01/01/2019 885871.86 
1438 01/02/2019 642623.46 
1438 01/03/2019 813226.04 
1438 01/04/2019 882585.11 
1438 01/05/2019 800612.95 
1438 01/06/2019 2032830.20 
1438 01/07/2019 1935775.49 
1438 01/08/2019 2530053.38 
1438 01/09/2019 1821793.34 
1438 01/10/2019 1871434.77 
1438 01/11/2019 2501551.66 
1438 01/12/2019 2768558.07 
1438 01/01/2020 3152440.90 
1438 01/02/2020 3153837.41 
1438 01/03/2020 3196685.73 
1438 01/04/2020 2961184.90 
1438 01/05/2020 998758.78 
1438 01/06/2020 324890.98 
1438 01/07/2020 209524.68 
1438 01/08/2020 558644.55 
1438 01/09/2020 1437799.28 
1438 01/10/2020 1437799.28 
1498 01/11/2018 1131874.00 
1498 01/12/2018 2104448.95 
1498 01/01/2019 1004316.68 
1498 01/02/2019 1291173.04 
1498 01/03/2019 1000758.41 
1498 01/04/2019 721018.58 
1498 01/05/2019 1084681.96 
1498 01/06/2019 2576410.81 
1498 01/07/2019 3075960.32 
1498 01/08/2019 3046212.76 
1498 01/09/2019 3042361.02 
1498 01/10/2019 3633387.64 
1498 01/11/2019 3176144.30 
1498 01/12/2019 3083226.22 
1498 01/01/2020 3278377.22 
1498 01/02/2020 3228326.40 
1498 01/03/2020 3851469.68 
1498 01/04/2020 2137129.15 
1498 01/05/2020 2717621.90 
1498 01/06/2020 2736698.84 
1498 01/07/2020 2654396.15 
1498 01/08/2020 3208936.35 
1498 01/09/2020 3454502.45 
1498 01/10/2020 528901.55 

 

ev_sfn
Obsidian | Level 7

Thank you! I mean that I thought to get 6 columns by using proc expand to get (lag1,lag2,lag3,lead1,lead2,lead3) and then everytime a missing value of sales found it could fill from the mean of the three values of the lags or leads. So if the row had missing in the 3 lags it would fill up with leads otherwise it would move on to the next missing. And then I would drop all this columns. Then I would made them again to impute the 2nd missing value of each id. I thought this would consume too many resources. With proc expand how could i make it differently?

 

This code works for the missing in the first dates but it doesnt work if the sales missing in the last 2 dates (the last 2 months of each ID).

 

ev_sfn
Obsidian | Level 7
this is not exactly what I am trying to do here. I am trying to impute the sales from 3 previous sales and if I am in the first 3 months(I cant take the 3 previous months because there are no sales) in that case I want to impute with the next 3 months.
PeterClemmensen
Tourmaline | Level 20

I may misunderstand you. But it seems to me like you describe two different logics. 

 

1) If a value is missing look at the three latest dates. If one of them is missing, look one date ahead and take that value into account. If two of the 'lags' are missing, consider two leads and so on..

 

2) If we are on one of the first three dates for a cell, consider the next three dates. If not, consider the three previous dates. 

 

Which one is it? Or is it both / a combination?

 

Please be specific 🙂

ev_sfn
Obsidian | Level 7

I am sorry for the misunderstanding and thank you for your further answer on this.

I am trying to do the 2) .

However, I should consider if I have 2 missing sales in the middle (for example in months 201808,201809) in that case i need to impute the sales from 201808 by calculating the average of the 3 previous sales and then calculate the missing sales of the month 201809 by getting the average of the 3 previous(the sales from months 201808,201807,201806).

 

Also, in case we have the 2 missing sales in the last 3 months (ex. 202008,202010) we need to fill up first the missing sale of 202008 and then impute the missing sale of 202010 from the previous 3 sales(and 1 of them is the sales we imputed for the month 202008).

 Ypu can check the sample to understand what i mean if we have missing sales in the last months, or in the middle or in the beggining.

I hope that's more clear now. 

PeterClemmensen
Tourmaline | Level 20

Still doesn't fully make sense to me. 

 

Let's keep things simple and consider the simplified sample data below. What should the desired result be given your logic? And why?

 

data have;
input cell date :yymmn6. sales;
infile datalines missover;
format date ddmmyy10.;
datalines;
1434 201811   
1434 201812   
1434 201901 1 
1434 201902 2 
1434 201903 3 
1434 201904   
1434 201905 4 
1434 201906 5 
1434 201907   
1434 201908   
1434 201909   
1434 201910 6 
1434 201911 7 
1434 201912 8 
;
ev_sfn
Obsidian | Level 7

Alright! Here we have only one ID. The problem is that for this ID we  have 2 missing records in the sales.

The 2 missing sales could be anywhere!

Please see here some of possible cases of missing sales for months:

1)we may have 2 missing sales in months 201811,201812 ->

in this case we want to get the average of the next 3 months so the program should impute the sales from month :201812 first (from months:201901,201902,201903) ,and then to impute 201811 sales from the average of the three next including sales of (201812,201901,201902)

2)we may have 2 missing sales in months 201811,201902 ->

in this case it should impute the sales of 201902 month first (from the avg of the sales from months:201903,201904,201905) and when it is filled with a value  it should impute the sales of month 201811 (from the avg of the sales of 201811 we imputed,from sales of 201812 and 201901)

3) we may have 2 missing sales in months 201907,201908->

in this case we need to impute first the sales of 201907 (from the avg of the previous 3 months: 201904,201905,201906) and then to impute the missing sales of 201908, because we will use the imputed value from month 201907.So,the sales of month 201908 should imputed from the avg of sales of months(201905,201906,201907-this should be the imputed value )

4) we may have 2 missing sales in months 201911,201912(the last 2 months of an ID) ->

in this case we need to impute the sales of of 201911 from the avg of the previous 3 months( 201908,201909,201910) and when it is imputed we will continue with the imputation of sales of 201912 from the avg of the previous 3 (incl. the new imputed value 201911) .So, it will be the avg of sales from months(201911,201910,201909).

 

I hope it makes sense now.

 

ev_sfn
Obsidian | Level 7

Hello all,

 

I found the solution to the problem.

I would like to thank you all for your responses!

You can find the code in the txt attached.

Maybe this is not the most efficient way but it worked. 

 

Any recommendations to make it more efficient (less code) - are welcome!

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 14 replies
  • 2079 views
  • 6 likes
  • 4 in conversation