DATA Step, Macro, Functions and more

Fill in missing observations (if proc timeseries is not an option)

Accepted Solution Solved
Reply
Super Contributor
Posts: 336
Accepted Solution

Fill in missing observations (if proc timeseries is not an option)

Hello,

I have to create a dataset that shows an individual line for every combination of ID-variables (the character variables in the example below) even if there is a zero value for this record. This is it is, my problem is a bit like filling in missing observations using proc timeseries - but without time variables; searched the community for a while, but I always found something linked to timeseries. My actual data set "have" is large, which is why I would like to optimize the following code.

I mistrust the point statements I am using. My question is, could it be faster?

Data Have;
  Input HPG $ Plant $ Material $ X;
  Datalines;
FA  0004 200 4
FB  0004 209 5
MGG 0001 201 6
MGU 0001 203 8
MGG 0002 203 8
MGG 0004 210 1
;

Proc SQL NoPrint;
  Select Count(Distinct HPG) As N_HPG, Count(Distinct Plant) As N_Plant, Count (Distinct Material) As N_Material Into :N_HPG, :N_Plant, :N_Material From A;
  Create Table HPG As Select Distinct HPG From A;
  Create Table Plant As Select Distinct Plant From A;
  Create Table Material As Select Distinct Material From A;
Quit;
%Put **&N_HPG.**&N_Plant.**&N_Material.**;

Data Want (Drop=rc);
  Declare Hash H ();
  H.Definekey('HPG','Plant','Material');
  H.Definedata('X');
  H.Definedone();
  Do Until (Eof_H);
    Set Have End=Eof_H;
    H.Add();
  End;
  * critical part, beginning - I think;
  Do i=1 To &N_HPG.;
    Set HPG Point=i;
    Do j=1 To &N_Plant.;
   Set Plant Point=j;
   Do k=1 To &N_Material.;
        Set Material Point=k;
  rc=H.Find();
        If rc ne 0 Then X=0;
     Output;
   End;
    End;
  End;
  * critical part, end;
  Stop;
Run;

Thanks&kind regards


Accepted Solutions
Solution
‎02-04-2015 07:16 AM
Super Contributor
Posts: 254

Re: Fill in missing observations (if proc timeseries is not an option)

I assume that the 3 variables give uniqueness to the observation. Otherewise , MULTIDATA option can be used. Because You have large dataset use od hashexp:20 will be useful.

data want;

   if _n_ = 1 then do;

      if 0 then set have;

      declare hash h(hashexp:20);

      h.definekey('HPG','Plant','Material');

      h.definedata('HPG','Plant','Material','X');

      h.definedone();

      declare hash hH(ordered:'a');

      hH.definekey('HPG');

      hH.definedone();

      declare hash hP(ordered:'a');

      hP.definekey('Plant');

      hP.definedone();

      declare hash hM(ordered:'a');

      hM.definekey('Material');

      hM.definedone();

      do until(eof);

         set have end = eof;

         if hH.find() ^= 0 then hH.add();

         if hP.find() ^= 0 then hP.add();

         if hM.find() ^= 0 then hM.add();

         h.add();

      end;

   end;

      declare hiter hiH('hH');

      declare hiter hiP('hP');

      declare hiter hiM('hM');

      do while(hiH.next() = 0);

         do while(hiP.next() = 0);

            do while(hiM.next() = 0);

               if h.find() ^= 0 then X = 0;;

               output;

            end;

         end;

      end;

stop;

run;

proc print data = want;

run;

Obs    HPG    Plant    Material    X

   1    FA     0001       200       0

   2    FA     0001       201       0

   3    FA     0001       203       0

   4    FA     0001       209       0

   5    FA     0001       210       0

   6    FA     0002       200       0

   7    FA     0002       201       0

   8    FA     0002       203       0

   9    FA     0002       209       0

  10    FA     0002       210       0

  11    FA     0004       200       4

  12    FA     0004       201       0

  13    FA     0004       203       0

  14    FA     0004       209       0

  15    FA     0004       210       0

  16    FB     0001       200       0

  17    FB     0001       201       0

  18    FB     0001       203       0

  19    FB     0001       209       0

  20    FB     0001       210       0

  21    FB     0002       200       0

  22    FB     0002       201       0

  23    FB     0002       203       0

  24    FB     0002       209       0

  25    FB     0002       210       0

  26    FB     0004       200       0

  27    FB     0004       201       0

  28    FB     0004       203       0

  29    FB     0004       209       5

  30    FB     0004       210       0

  31    MGG    0001       200       0

  32    MGG    0001       201       6

  33    MGG    0001       203       0

  34    MGG    0001       209       0

  35    MGG    0001       210       0

  36    MGG    0002       200       0

  37    MGG    0002       201       0

  38    MGG    0002       203       8

  39    MGG    0002       209       0

  40    MGG    0002       210       0

  41    MGG    0004       200       0

  42    MGG    0004       201       0

  43    MGG    0004       203       0

  44    MGG    0004       209       0

  45    MGG    0004       210       1

  46    MGU    0001       200       0

  47    MGU    0001       201       0

  48    MGU    0001       203       8

  49    MGU    0001       209       0

  50    MGU    0001       210       0

  51    MGU    0002       200       0

  52    MGU    0002       201       0

  53    MGU    0002       203       0

  54    MGU    0002       209       0

  55    MGU    0002       210       0

  56    MGU    0004       200       0

  57    MGU    0004       201       0

  58    MGU    0004       203       0

  59    MGU    0004       209       0

  60    MGU    0004       210       0

View solution in original post


All Replies
SAS Employee
Posts: 340

Re: Fill in missing observations (if proc timeseries is not an option)

Hi,

I don’t see any problem with the set point= method. It causes performance problems only if data step needs to re-load different pages (blocks) of the input SAS dataset from the disk all the time. You can prevent this if you allocate enough memory. For example:

set HPG(bufno=100) Point=i;

On the other hand your full program reads the data 5 times. 3 times to generate distinct values of the 3 columns. You could use 3 additional hash objects to store only those distinct values, then you could iterate through them. In this case instead of “do i=1 to &N; set point=” you will use “do unti(hashiterator.next());” Only one pass through the data is needed then.

Have you already run your program on your full dataset? You are reading the full dataset into memory (hash object). If data is really big that could cause problems.

Thanks,

Gergely

Super User
Posts: 9,681

Re: Fill in missing observations (if proc timeseries is not an option)

What does your output look like ?

Super Contributor
Posts: 336

Re: Fill in missing observations (if proc timeseries is not an option)

My file-upload seems to be stuck, hope this is o.k. for you. The output the code above creates would be right:

HPG;Plant;Material;X

FA;0001;200;0

FA;0001;201;0

FA;0001;203;0

FA;0001;209;0

FA;0001;210;0

FA;0002;200;0

FA;0002;201;0

FA;0002;203;0

FA;0002;209;0

FA;0002;210;0

FA;0004;200;4

FA;0004;201;0

FA;0004;203;0

FA;0004;209;0

FA;0004;210;0

FB;0001;200;0

FB;0001;201;0

FB;0001;203;0

FB;0001;209;0

FB;0001;210;0

FB;0002;200;0

FB;0002;201;0

FB;0002;203;0

FB;0002;209;0

FB;0002;210;0

FB;0004;200;0

FB;0004;201;0

FB;0004;203;0

FB;0004;209;5

FB;0004;210;0

MGG;0001;200;0

MGG;0001;201;6

MGG;0001;203;0

MGG;0001;209;0

MGG;0001;210;0

MGG;0002;200;0

MGG;0002;201;0

MGG;0002;203;8

MGG;0002;209;0

MGG;0002;210;0

MGG;0004;200;0

MGG;0004;201;0

MGG;0004;203;0

MGG;0004;209;0

MGG;0004;210;1

MGU;0001;200;0

MGU;0001;201;0

MGU;0001;203;8

MGU;0001;209;0

MGU;0001;210;0

MGU;0002;200;0

MGU;0002;201;0

MGU;0002;203;0

MGU;0002;209;0

MGU;0002;210;0

MGU;0004;200;0

MGU;0004;201;0

MGU;0004;203;0

MGU;0004;209;0

MGU;0004;210;0

Solution
‎02-04-2015 07:16 AM
Super Contributor
Posts: 254

Re: Fill in missing observations (if proc timeseries is not an option)

I assume that the 3 variables give uniqueness to the observation. Otherewise , MULTIDATA option can be used. Because You have large dataset use od hashexp:20 will be useful.

data want;

   if _n_ = 1 then do;

      if 0 then set have;

      declare hash h(hashexp:20);

      h.definekey('HPG','Plant','Material');

      h.definedata('HPG','Plant','Material','X');

      h.definedone();

      declare hash hH(ordered:'a');

      hH.definekey('HPG');

      hH.definedone();

      declare hash hP(ordered:'a');

      hP.definekey('Plant');

      hP.definedone();

      declare hash hM(ordered:'a');

      hM.definekey('Material');

      hM.definedone();

      do until(eof);

         set have end = eof;

         if hH.find() ^= 0 then hH.add();

         if hP.find() ^= 0 then hP.add();

         if hM.find() ^= 0 then hM.add();

         h.add();

      end;

   end;

      declare hiter hiH('hH');

      declare hiter hiP('hP');

      declare hiter hiM('hM');

      do while(hiH.next() = 0);

         do while(hiP.next() = 0);

            do while(hiM.next() = 0);

               if h.find() ^= 0 then X = 0;;

               output;

            end;

         end;

      end;

stop;

run;

proc print data = want;

run;

Obs    HPG    Plant    Material    X

   1    FA     0001       200       0

   2    FA     0001       201       0

   3    FA     0001       203       0

   4    FA     0001       209       0

   5    FA     0001       210       0

   6    FA     0002       200       0

   7    FA     0002       201       0

   8    FA     0002       203       0

   9    FA     0002       209       0

  10    FA     0002       210       0

  11    FA     0004       200       4

  12    FA     0004       201       0

  13    FA     0004       203       0

  14    FA     0004       209       0

  15    FA     0004       210       0

  16    FB     0001       200       0

  17    FB     0001       201       0

  18    FB     0001       203       0

  19    FB     0001       209       0

  20    FB     0001       210       0

  21    FB     0002       200       0

  22    FB     0002       201       0

  23    FB     0002       203       0

  24    FB     0002       209       0

  25    FB     0002       210       0

  26    FB     0004       200       0

  27    FB     0004       201       0

  28    FB     0004       203       0

  29    FB     0004       209       5

  30    FB     0004       210       0

  31    MGG    0001       200       0

  32    MGG    0001       201       6

  33    MGG    0001       203       0

  34    MGG    0001       209       0

  35    MGG    0001       210       0

  36    MGG    0002       200       0

  37    MGG    0002       201       0

  38    MGG    0002       203       8

  39    MGG    0002       209       0

  40    MGG    0002       210       0

  41    MGG    0004       200       0

  42    MGG    0004       201       0

  43    MGG    0004       203       0

  44    MGG    0004       209       0

  45    MGG    0004       210       1

  46    MGU    0001       200       0

  47    MGU    0001       201       0

  48    MGU    0001       203       8

  49    MGU    0001       209       0

  50    MGU    0001       210       0

  51    MGU    0002       200       0

  52    MGU    0002       201       0

  53    MGU    0002       203       0

  54    MGU    0002       209       0

  55    MGU    0002       210       0

  56    MGU    0004       200       0

  57    MGU    0004       201       0

  58    MGU    0004       203       0

  59    MGU    0004       209       0

  60    MGU    0004       210       0

Super Contributor
Posts: 336

Re: Fill in missing observations (if proc timeseries is not an option)

Thanks. I couldn't have done the iterator without help. And (delayed) thanks Xia Keshan for "hash aggregating strings"!

Super User
Posts: 9,681

Re: Fill in missing observations (if proc timeseries is not an option)

Ou, This is best for SQL's Cartesian Product , No need for Hash Table such advanced skill.

And I believe Hash Table is not faster than SQL .

Data Have;
  Input HPG $ Plant $ Material $ X;
  Datalines;
FA  0004 200 4
FB  0004 209 5
MGG 0001 201 6
MGU 0001 203 8
MGG 0002 203 8
MGG 0004 210 1
;
run;
proc sql;
create table want as
 select a.*,coalesce(b.x,0) as x
  from 
  (select * from (select distinct HPG from have),(select distinct Plant from have),(select distinct Material from have)) as a
  left join
  have as b
  on a.HPG=b.HPG and a.Plant=b.Plant and a.Material=b.Material ;
quit;



Xia Keshan

Super Contributor
Posts: 254

Re: Fill in missing observations (if proc timeseries is not an option)

On further thought, the 3 Key Variables are coming from the Iterator Objects and hence, there is no need to place them in the data-part of the Hash Table(H).  The following statement,

h.definedata('HPG','Plant','Material','X');

can be replaced with:

h.definedata('X');


For your benefit, I have just commented the previous statement in the code I posted yesterday.

Xia Keshan believes that his SQL solution can be faster than Hash solution.  I do not find evidence to his belief and instead, the  Hash solution runs AT LEAST 3 times faster than SQL besides Hash taking lesser memory. However, I concur with him that some expertise to use Hash is a must.

Here is the proof.

Let us create a simulated data set, HAVE and run both CODES.

options fullstimer;

   

data have;

   do j = 1 to 200;

      i1 = 64 + ceil(ranuni(123) * 26);

      i2 = 64 + ceil(ranuni(123) * 26);

      i3 = 64 + ceil(ranuni(123) * 26);

      HPG = catt(byte(i1), byte(i2), byte(i3));

      Plant = put(ceil(ranuni(123) * 1000), z4.);

      Material = put(ceil(ranuni(123) * 999), z3.);

      X = ceil(ranuni(123) * 100);

      output;

   end;

keep HPG Plant Material X;

run;

371  data want;

372     if _n_ = 1 then do;

373        if 0 then set have;

374        declare hash h(hashexp:20);

375        h.definekey('HPG','Plant','Material');

376        *h.definedata('HPG','Plant','Material','X');

377        h.definedata('X');

378        h.definedone();

379

380

381        declare hash hH(ordered:'a');

382        hH.definekey('HPG');

383        hH.definedone();

384        declare hash hP(ordered:'a');

385        hP.definekey('Plant');

386        hP.definedone();

387        declare hash hM(ordered:'a');

388        hM.definekey('Material');

389        hM.definedone();

390

391

392        do until(eof);

393           set have end = eof;

394           if hH.find() ^= 0 then hH.add();

395           if hP.find() ^= 0 then hP.add();

396           if hM.find() ^= 0 then hM.add();

397           h.add();

398        end;

399     end;

400        declare hiter hiH('hH');

401        declare hiter hiP('hP');

402        declare hiter hiM('hM');

403

404

405        do while(hiH.next() = 0);

406           do while(hiP.next() = 0);

407              do while(hiM.next() = 0);

408                 if h.find() ^= 0 then X = 0;;

409                 output;

410              end;

411           end;

412        end;

413  stop;

414  run;

NOTE: There were 200 observations read from the data set WORK.HAVE.

NOTE: The data set WORK.WANT has 6812784 observations and 4 variables.

NOTE: DATA statement used (Total process time):

      real time           14.63 seconds

      user cpu time       6.05 seconds

      system cpu time     2.18 seconds

      Memory                            9019k

      OS Memory                         17780k

      Timestamp            2/5/2015  2:46:34 PM

415  proc sql;

416  create table want as

417   select a.*,coalesce(b.x,0) as x

418    from

419    (select * from (select distinct HPG from have),(select distinct Plant from have),(select

419! distinct Material from have)) as a

420    left join

421    have as b

422    on a.HPG=b.HPG and a.Plant=b.Plant and a.Material=b.Material ;

NOTE: The execution of this query involves performing one or more Cartesian product joins that can

      not be optimized.

NOTE: Table WORK.WANT created, with 6812784 rows and 4 columns.

423  quit;

NOTE: PROCEDURE SQL used (Total process time):

      real time           43.07 seconds

      user cpu time       16.73 seconds

      system cpu time     10.42 seconds

      Memory                            69088k

      OS Memory                         78196k

      Timestamp            2/5/2015  2:47:36 PM

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 332 views
  • 6 likes
  • 4 in conversation