Solved: Re: Create Index to use it in by group processing

Zatere · Posted 07-30-2019 07:02 AM

Hi there,

What I am really trying to achieve is to replace the PROC SORT with a more time efficient way.

I came across the idea to use index. That is my original code:

 proc sort data=MyData;
      by CustID BranchNumber BrancType Territory;                    
   run;

   data MyData;
      set MyData;
      by CustID BranchNumber BrancType Territory;                  
      count;	
      if first.Territory then count = 0;

      if not first.Territory then do;
         count+1;
         CustID = strip(CustID)!!'_'!!strip(put(count,z4.));
      end;
   run;

And this is what I have tried to do:

proc datasets library=work;
modify MyData;
index create _mylist=(CustID BranchNumber BrancType Territory);
run;

data MyData;
      set MyData;
      by CustID BranchNumber BrancType Territory;                  
      count;	
      if first.Territory then count = 0;

      if not first.Territory then do;
         count+1;
         CustID = strip(CustID)!!'_'!!strip(put(count,z4.));
      end;
   run;

However it does not work even though there are duplicates CustIDs per Territory.

Are there any ideas, alternatives and/or reasons why it does not work?

Thanks!

Kurt_Bremser · Posted 07-30-2019 07:28 AM

First of all: indexes won't help with performance. The creation of the index already involves a sort, and when you process the whole dataset, you need to read the dataset (in random mode!) and the index, which slows you down quite a lot. Indexes are only good (performancewise) for selecting small subsets of the whole data.

But a small demonstration with one of the SASHELP datasets shows that both methods provide the same result:

data cars;
set sashelp.cars;
run;

proc sort
  data=cars
  out=cars1
;
by origin make;
run;

data want1 (keep=origin make);
set cars1;
by origin make;
retain count;
if first.make then count = 0;
else do;
  count + 1;
  origin = catx('_',origin,count);
end;
run;

proc datasets lib=work nolist;
modify cars;
index create myind=(origin make);
quit;

data want2 (keep=origin make);
set cars;
by origin make;
retain count;
if first.make then count = 0;
else do;
  count + 1;
  origin = catx('_',origin,count);
end;
run;

proc compare
  base=want1
  compare=want2
  criterion=0.00001
  method=relative
  out=comp(label="Compare Data for WANT1 and WANT2")
  outstats=stat(label="Compare Data Summary Statistics for WANT1 and WANT2")
  outdif
  outnoequal
;
var make origin;
with make origin;
run;

Log:

27         data cars;
28         set sashelp.cars;
29         run;

NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: The data set WORK.CARS has 428 observations and 15 variables.
NOTE: DATA statement used (Total process time):
      real time           0.04 seconds
      cpu time            0.00 seconds
      

30         
31         proc sort
32           data=cars
33           out=cars1
34         ;
35         by origin make;
36         run;

NOTE: There were 428 observations read from the data set WORK.CARS.
NOTE: The data set WORK.CARS1 has 428 observations and 15 variables.
NOTE: PROZEDUR SORT used (Total process time):
      real time           0.03 seconds
      cpu time            0.00 seconds
      

37         
38         data want1 (keep=origin make);
39         set cars1;
40         by origin make;
41         retain count;
2                                                          Das SAS System                               07:31 Tuesday, July 30, 2019

42         if first.make then count = 0;
43         else do;
44           count + 1;
45           origin = catx('_',origin,count);
46         end;
47         run;

NOTE: There were 428 observations read from the data set WORK.CARS1.
NOTE: The data set WORK.WANT1 has 428 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.00 seconds
      

48         
49         proc datasets lib=work nolist;
50         modify cars;
51         index create myind=(origin make);
NOTE: Composite index myind has been defined.
52         quit;

NOTE: MODIFY was successful for WORK.CARS.DATA.
NOTE: PROZEDUR DATASETS used (Total process time):
      real time           0.03 seconds
      cpu time            0.01 seconds
      

53         
54         data want2 (keep=origin make);
55         set cars;
56         by origin make;
57         retain count;
58         if first.make then count = 0;
59         else do;
60           count + 1;
61           origin = catx('_',origin,count);
62         end;
63         run;

NOTE: There were 428 observations read from the data set WORK.CARS.
NOTE: The data set WORK.WANT2 has 428 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      cpu time            0.00 seconds
      

64         
65         proc compare
66           base=want1
67           compare=want2
68           criterion=0.00001
69           method=relative
70           out=comp(label="Compare Data for WANT1 and WANT2")
71           outstats=stat(label="Compare Data Summary Statistics for WANT1 and WANT2")
72           outdif
73           outnoequal
74         ;
75         var make origin;
3                                                          Das SAS System                               07:31 Tuesday, July 30, 2019

76         with make origin;
77         run;

NOTE: There were 428 observations read from the data set WORK.WANT1.
NOTE: There were 428 observations read from the data set WORK.WANT2.
NOTE: The data set WORK.COMP has 0 observations and 4 variables.
NOTE: The data set WORK.STAT has 0 observations and 7 variables.
NOTE: The PROCEDURE COMPARE printed page 1.
NOTE: PROZEDUR COMPARE used (Total process time):
      real time           0.07 seconds
      cpu time            0.00 seconds

You can see that the second data step is slightly slower than the first, and that the combined time for index creation + data step is more than the combined time for sort and data step.

And part of the result:

NOTE: No unequal values were found. All values compared are exactly equal.

shows that the resulting datasets are identical.

If that does not align with what you found, please supply sufficient example data (in a data step with datalines) to illustrate your issue.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

Kurt_Bremser · Posted 07-30-2019 07:28 AM

First of all: indexes won't help with performance. The creation of the index already involves a sort, and when you process the whole dataset, you need to read the dataset (in random mode!) and the index, which slows you down quite a lot. Indexes are only good (performancewise) for selecting small subsets of the whole data.

But a small demonstration with one of the SASHELP datasets shows that both methods provide the same result:

data cars;
set sashelp.cars;
run;

proc sort
  data=cars
  out=cars1
;
by origin make;
run;

data want1 (keep=origin make);
set cars1;
by origin make;
retain count;
if first.make then count = 0;
else do;
  count + 1;
  origin = catx('_',origin,count);
end;
run;

proc datasets lib=work nolist;
modify cars;
index create myind=(origin make);
quit;

data want2 (keep=origin make);
set cars;
by origin make;
retain count;
if first.make then count = 0;
else do;
  count + 1;
  origin = catx('_',origin,count);
end;
run;

proc compare
  base=want1
  compare=want2
  criterion=0.00001
  method=relative
  out=comp(label="Compare Data for WANT1 and WANT2")
  outstats=stat(label="Compare Data Summary Statistics for WANT1 and WANT2")
  outdif
  outnoequal
;
var make origin;
with make origin;
run;

Log:

27         data cars;
28         set sashelp.cars;
29         run;

NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: The data set WORK.CARS has 428 observations and 15 variables.
NOTE: DATA statement used (Total process time):
      real time           0.04 seconds
      cpu time            0.00 seconds
      

30         
31         proc sort
32           data=cars
33           out=cars1
34         ;
35         by origin make;
36         run;

NOTE: There were 428 observations read from the data set WORK.CARS.
NOTE: The data set WORK.CARS1 has 428 observations and 15 variables.
NOTE: PROZEDUR SORT used (Total process time):
      real time           0.03 seconds
      cpu time            0.00 seconds
      

37         
38         data want1 (keep=origin make);
39         set cars1;
40         by origin make;
41         retain count;
2                                                          Das SAS System                               07:31 Tuesday, July 30, 2019

42         if first.make then count = 0;
43         else do;
44           count + 1;
45           origin = catx('_',origin,count);
46         end;
47         run;

NOTE: There were 428 observations read from the data set WORK.CARS1.
NOTE: The data set WORK.WANT1 has 428 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.00 seconds
      

48         
49         proc datasets lib=work nolist;
50         modify cars;
51         index create myind=(origin make);
NOTE: Composite index myind has been defined.
52         quit;

NOTE: MODIFY was successful for WORK.CARS.DATA.
NOTE: PROZEDUR DATASETS used (Total process time):
      real time           0.03 seconds
      cpu time            0.01 seconds
      

53         
54         data want2 (keep=origin make);
55         set cars;
56         by origin make;
57         retain count;
58         if first.make then count = 0;
59         else do;
60           count + 1;
61           origin = catx('_',origin,count);
62         end;
63         run;

NOTE: There were 428 observations read from the data set WORK.CARS.
NOTE: The data set WORK.WANT2 has 428 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      cpu time            0.00 seconds
      

64         
65         proc compare
66           base=want1
67           compare=want2
68           criterion=0.00001
69           method=relative
70           out=comp(label="Compare Data for WANT1 and WANT2")
71           outstats=stat(label="Compare Data Summary Statistics for WANT1 and WANT2")
72           outdif
73           outnoequal
74         ;
75         var make origin;
3                                                          Das SAS System                               07:31 Tuesday, July 30, 2019

76         with make origin;
77         run;

NOTE: There were 428 observations read from the data set WORK.WANT1.
NOTE: There were 428 observations read from the data set WORK.WANT2.
NOTE: The data set WORK.COMP has 0 observations and 4 variables.
NOTE: The data set WORK.STAT has 0 observations and 7 variables.
NOTE: The PROCEDURE COMPARE printed page 1.
NOTE: PROZEDUR COMPARE used (Total process time):
      real time           0.07 seconds
      cpu time            0.00 seconds

You can see that the second data step is slightly slower than the first, and that the combined time for index creation + data step is more than the combined time for sort and data step.

And part of the result:

NOTE: No unequal values were found. All values compared are exactly equal.

shows that the resulting datasets are identical.

If that does not align with what you found, please supply sufficient example data (in a data step with datalines) to illustrate your issue.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Zatere · Posted 07-30-2019 09:59 AM

Thanks for providing the code! It works as I wanted it to.

Regarding the time efficiencies, I thought that using index is faster. Please see below the times I am getting:

NOTE: There were 748528 observations read from the data set WORK.SORTDATA.
NOTE: The data set WORK.ORIGDATA has 748528 observations and 213 variables.
NOTE: Compressing data set WORK.ORIGDATA decreased size by 86.35 percent. 
      Compressed is 10219 pages; un-compressed would require 74854 pages.
NOTE: PROCEDURE SORT used (Total process time):
      real time           4:41.85
      cpu time            3:14.87

NOTE: Composite index _mylist has been defined.
17         run;

NOTE: MODIFY was successful for WORK.INDEXDATA.DATA.

NOTE: PROCEDURE DATASETS used (Total process time):
      real time           50.81 seconds
      cpu time            29.16 seconds

And the times for the second data step is always the same. Maybe there are 2-3 seconds differences.

Any thoughts on that? Am I missing anything? Are there any other disadvantages?

Thanks a lot!

Create Index to use it in by group processing

Re: Create Index to use it in by group processing

Re: Create Index to use it in by group processing

Re: Create Index to use it in by group processing

Registration is open

Call for Content EXTENDED

Create Index to use it in by group processing

Re: Create Index to use it in by group processing

Re: Create Index to use it in by group processing

Re: Create Index to use it in by group processing

Registration is open

Call for Content EXTENDED

SAS Training: Just a Click Away