Hello,
I have to create a dataset that shows an individual line for every combination of ID-variables (the character variables in the example below) even if there is a zero value for this record. This is it is, my problem is a bit like filling in missing observations using proc timeseries - but without time variables; searched the community for a while, but I always found something linked to timeseries. My actual data set "have" is large, which is why I would like to optimize the following code.
I mistrust the point statements I am using. My question is, could it be faster?
Data Have;
Input HPG $ Plant $ Material $ X;
Datalines;
FA 0004 200 4
FB 0004 209 5
MGG 0001 201 6
MGU 0001 203 8
MGG 0002 203 8
MGG 0004 210 1
;
Proc SQL NoPrint;
Select Count(Distinct HPG) As N_HPG, Count(Distinct Plant) As N_Plant, Count (Distinct Material) As N_Material Into :N_HPG, :N_Plant, :N_Material From A;
Create Table HPG As Select Distinct HPG From A;
Create Table Plant As Select Distinct Plant From A;
Create Table Material As Select Distinct Material From A;
Quit;
%Put **&N_HPG.**&N_Plant.**&N_Material.**;
Data Want (Drop=rc);
Declare Hash H ();
H.Definekey('HPG','Plant','Material');
H.Definedata('X');
H.Definedone();
Do Until (Eof_H);
Set Have End=Eof_H;
H.Add();
End;
* critical part, beginning - I think;
Do i=1 To &N_HPG.;
Set HPG Point=i;
Do j=1 To &N_Plant.;
Set Plant Point=j;
Do k=1 To &N_Material.;
Set Material Point=k;
rc=H.Find();
If rc ne 0 Then X=0;
Output;
End;
End;
End;
* critical part, end;
Stop;
Run;
Thanks&kind regards
I assume that the 3 variables give uniqueness to the observation. Otherewise , MULTIDATA option can be used. Because You have large dataset use od hashexp:20 will be useful.
data want;
if _n_ = 1 then do;
if 0 then set have;
declare hash h(hashexp:20);
h.definekey('HPG','Plant','Material');
h.definedata('HPG','Plant','Material','X');
h.definedone();
declare hash hH(ordered:'a');
hH.definekey('HPG');
hH.definedone();
declare hash hP(ordered:'a');
hP.definekey('Plant');
hP.definedone();
declare hash hM(ordered:'a');
hM.definekey('Material');
hM.definedone();
do until(eof);
set have end = eof;
if hH.find() ^= 0 then hH.add();
if hP.find() ^= 0 then hP.add();
if hM.find() ^= 0 then hM.add();
h.add();
end;
end;
declare hiter hiH('hH');
declare hiter hiP('hP');
declare hiter hiM('hM');
do while(hiH.next() = 0);
do while(hiP.next() = 0);
do while(hiM.next() = 0);
if h.find() ^= 0 then X = 0;;
output;
end;
end;
end;
stop;
run;
proc print data = want;
run;
Obs HPG Plant Material X
1 FA 0001 200 0
2 FA 0001 201 0
3 FA 0001 203 0
4 FA 0001 209 0
5 FA 0001 210 0
6 FA 0002 200 0
7 FA 0002 201 0
8 FA 0002 203 0
9 FA 0002 209 0
10 FA 0002 210 0
11 FA 0004 200 4
12 FA 0004 201 0
13 FA 0004 203 0
14 FA 0004 209 0
15 FA 0004 210 0
16 FB 0001 200 0
17 FB 0001 201 0
18 FB 0001 203 0
19 FB 0001 209 0
20 FB 0001 210 0
21 FB 0002 200 0
22 FB 0002 201 0
23 FB 0002 203 0
24 FB 0002 209 0
25 FB 0002 210 0
26 FB 0004 200 0
27 FB 0004 201 0
28 FB 0004 203 0
29 FB 0004 209 5
30 FB 0004 210 0
31 MGG 0001 200 0
32 MGG 0001 201 6
33 MGG 0001 203 0
34 MGG 0001 209 0
35 MGG 0001 210 0
36 MGG 0002 200 0
37 MGG 0002 201 0
38 MGG 0002 203 8
39 MGG 0002 209 0
40 MGG 0002 210 0
41 MGG 0004 200 0
42 MGG 0004 201 0
43 MGG 0004 203 0
44 MGG 0004 209 0
45 MGG 0004 210 1
46 MGU 0001 200 0
47 MGU 0001 201 0
48 MGU 0001 203 8
49 MGU 0001 209 0
50 MGU 0001 210 0
51 MGU 0002 200 0
52 MGU 0002 201 0
53 MGU 0002 203 0
54 MGU 0002 209 0
55 MGU 0002 210 0
56 MGU 0004 200 0
57 MGU 0004 201 0
58 MGU 0004 203 0
59 MGU 0004 209 0
60 MGU 0004 210 0
Hi,
I don’t see any problem with the set point= method. It causes performance problems only if data step needs to re-load different pages (blocks) of the input SAS dataset from the disk all the time. You can prevent this if you allocate enough memory. For example:
set HPG(bufno=100) Point=i;
On the other hand your full program reads the data 5 times. 3 times to generate distinct values of the 3 columns. You could use 3 additional hash objects to store only those distinct values, then you could iterate through them. In this case instead of “do i=1 to &N; set point=” you will use “do unti(hashiterator.next());” Only one pass through the data is needed then.
Have you already run your program on your full dataset? You are reading the full dataset into memory (hash object). If data is really big that could cause problems.
Thanks,
Gergely
What does your output look like ?
My file-upload seems to be stuck, hope this is o.k. for you. The output the code above creates would be right:
HPG;Plant;Material;X
FA;0001;200;0
FA;0001;201;0
FA;0001;203;0
FA;0001;209;0
FA;0001;210;0
FA;0002;200;0
FA;0002;201;0
FA;0002;203;0
FA;0002;209;0
FA;0002;210;0
FA;0004;200;4
FA;0004;201;0
FA;0004;203;0
FA;0004;209;0
FA;0004;210;0
FB;0001;200;0
FB;0001;201;0
FB;0001;203;0
FB;0001;209;0
FB;0001;210;0
FB;0002;200;0
FB;0002;201;0
FB;0002;203;0
FB;0002;209;0
FB;0002;210;0
FB;0004;200;0
FB;0004;201;0
FB;0004;203;0
FB;0004;209;5
FB;0004;210;0
MGG;0001;200;0
MGG;0001;201;6
MGG;0001;203;0
MGG;0001;209;0
MGG;0001;210;0
MGG;0002;200;0
MGG;0002;201;0
MGG;0002;203;8
MGG;0002;209;0
MGG;0002;210;0
MGG;0004;200;0
MGG;0004;201;0
MGG;0004;203;0
MGG;0004;209;0
MGG;0004;210;1
MGU;0001;200;0
MGU;0001;201;0
MGU;0001;203;8
MGU;0001;209;0
MGU;0001;210;0
MGU;0002;200;0
MGU;0002;201;0
MGU;0002;203;0
MGU;0002;209;0
MGU;0002;210;0
MGU;0004;200;0
MGU;0004;201;0
MGU;0004;203;0
MGU;0004;209;0
MGU;0004;210;0
I assume that the 3 variables give uniqueness to the observation. Otherewise , MULTIDATA option can be used. Because You have large dataset use od hashexp:20 will be useful.
data want;
if _n_ = 1 then do;
if 0 then set have;
declare hash h(hashexp:20);
h.definekey('HPG','Plant','Material');
h.definedata('HPG','Plant','Material','X');
h.definedone();
declare hash hH(ordered:'a');
hH.definekey('HPG');
hH.definedone();
declare hash hP(ordered:'a');
hP.definekey('Plant');
hP.definedone();
declare hash hM(ordered:'a');
hM.definekey('Material');
hM.definedone();
do until(eof);
set have end = eof;
if hH.find() ^= 0 then hH.add();
if hP.find() ^= 0 then hP.add();
if hM.find() ^= 0 then hM.add();
h.add();
end;
end;
declare hiter hiH('hH');
declare hiter hiP('hP');
declare hiter hiM('hM');
do while(hiH.next() = 0);
do while(hiP.next() = 0);
do while(hiM.next() = 0);
if h.find() ^= 0 then X = 0;;
output;
end;
end;
end;
stop;
run;
proc print data = want;
run;
Obs HPG Plant Material X
1 FA 0001 200 0
2 FA 0001 201 0
3 FA 0001 203 0
4 FA 0001 209 0
5 FA 0001 210 0
6 FA 0002 200 0
7 FA 0002 201 0
8 FA 0002 203 0
9 FA 0002 209 0
10 FA 0002 210 0
11 FA 0004 200 4
12 FA 0004 201 0
13 FA 0004 203 0
14 FA 0004 209 0
15 FA 0004 210 0
16 FB 0001 200 0
17 FB 0001 201 0
18 FB 0001 203 0
19 FB 0001 209 0
20 FB 0001 210 0
21 FB 0002 200 0
22 FB 0002 201 0
23 FB 0002 203 0
24 FB 0002 209 0
25 FB 0002 210 0
26 FB 0004 200 0
27 FB 0004 201 0
28 FB 0004 203 0
29 FB 0004 209 5
30 FB 0004 210 0
31 MGG 0001 200 0
32 MGG 0001 201 6
33 MGG 0001 203 0
34 MGG 0001 209 0
35 MGG 0001 210 0
36 MGG 0002 200 0
37 MGG 0002 201 0
38 MGG 0002 203 8
39 MGG 0002 209 0
40 MGG 0002 210 0
41 MGG 0004 200 0
42 MGG 0004 201 0
43 MGG 0004 203 0
44 MGG 0004 209 0
45 MGG 0004 210 1
46 MGU 0001 200 0
47 MGU 0001 201 0
48 MGU 0001 203 8
49 MGU 0001 209 0
50 MGU 0001 210 0
51 MGU 0002 200 0
52 MGU 0002 201 0
53 MGU 0002 203 0
54 MGU 0002 209 0
55 MGU 0002 210 0
56 MGU 0004 200 0
57 MGU 0004 201 0
58 MGU 0004 203 0
59 MGU 0004 209 0
60 MGU 0004 210 0
Thanks. I couldn't have done the iterator without help. And (delayed) thanks Xia Keshan for "hash aggregating strings"!
Ou, This is best for SQL's Cartesian Product , No need for Hash Table such advanced skill.
And I believe Hash Table is not faster than SQL .
Data Have; Input HPG $ Plant $ Material $ X; Datalines; FA 0004 200 4 FB 0004 209 5 MGG 0001 201 6 MGU 0001 203 8 MGG 0002 203 8 MGG 0004 210 1 ; run; proc sql; create table want as select a.*,coalesce(b.x,0) as x from (select * from (select distinct HPG from have),(select distinct Plant from have),(select distinct Material from have)) as a left join have as b on a.HPG=b.HPG and a.Plant=b.Plant and a.Material=b.Material ; quit;
Xia Keshan
On further thought, the 3 Key Variables are coming from the Iterator Objects and hence, there is no need to place them in the data-part of the Hash Table(H). The following statement,
h.definedata('HPG','Plant','Material','X');
can be replaced with:
h.definedata('X');
For your benefit, I have just commented the previous statement in the code I posted yesterday.
Xia Keshan believes that his SQL solution can be faster than Hash solution. I do not find evidence to his belief and instead, the Hash solution runs AT LEAST 3 times faster than SQL besides Hash taking lesser memory. However, I concur with him that some expertise to use Hash is a must.
Here is the proof.
Let us create a simulated data set, HAVE and run both CODES.
options fullstimer;
data have;
do j = 1 to 200;
i1 = 64 + ceil(ranuni(123) * 26);
i2 = 64 + ceil(ranuni(123) * 26);
i3 = 64 + ceil(ranuni(123) * 26);
HPG = catt(byte(i1), byte(i2), byte(i3));
Plant = put(ceil(ranuni(123) * 1000), z4.);
Material = put(ceil(ranuni(123) * 999), z3.);
X = ceil(ranuni(123) * 100);
output;
end;
keep HPG Plant Material X;
run;
371 data want;
372 if _n_ = 1 then do;
373 if 0 then set have;
374 declare hash h(hashexp:20);
375 h.definekey('HPG','Plant','Material');
376 *h.definedata('HPG','Plant','Material','X');
377 h.definedata('X');
378 h.definedone();
379
380
381 declare hash hH(ordered:'a');
382 hH.definekey('HPG');
383 hH.definedone();
384 declare hash hP(ordered:'a');
385 hP.definekey('Plant');
386 hP.definedone();
387 declare hash hM(ordered:'a');
388 hM.definekey('Material');
389 hM.definedone();
390
391
392 do until(eof);
393 set have end = eof;
394 if hH.find() ^= 0 then hH.add();
395 if hP.find() ^= 0 then hP.add();
396 if hM.find() ^= 0 then hM.add();
397 h.add();
398 end;
399 end;
400 declare hiter hiH('hH');
401 declare hiter hiP('hP');
402 declare hiter hiM('hM');
403
404
405 do while(hiH.next() = 0);
406 do while(hiP.next() = 0);
407 do while(hiM.next() = 0);
408 if h.find() ^= 0 then X = 0;;
409 output;
410 end;
411 end;
412 end;
413 stop;
414 run;
NOTE: There were 200 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 6812784 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 14.63 seconds
user cpu time 6.05 seconds
system cpu time 2.18 seconds
Memory 9019k
OS Memory 17780k
Timestamp 2/5/2015 2:46:34 PM
415 proc sql;
416 create table want as
417 select a.*,coalesce(b.x,0) as x
418 from
419 (select * from (select distinct HPG from have),(select distinct Plant from have),(select
419! distinct Material from have)) as a
420 left join
421 have as b
422 on a.HPG=b.HPG and a.Plant=b.Plant and a.Material=b.Material ;
NOTE: The execution of this query involves performing one or more Cartesian product joins that can
not be optimized.
NOTE: Table WORK.WANT created, with 6812784 rows and 4 columns.
423 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 43.07 seconds
user cpu time 16.73 seconds
system cpu time 10.42 seconds
Memory 69088k
OS Memory 78196k
Timestamp 2/5/2015 2:47:36 PM
Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.
Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.