BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Zatere
Quartz | Level 8

Hi there,

 

What I am really trying to achieve is to replace the PROC SORT with a more time efficient way.

 

I came across the idea to use index. That is my original code:

 

 proc sort data=MyData;
      by CustID BranchNumber BrancType Territory;                    
   run;

   data MyData;
      set MyData;
      by CustID BranchNumber BrancType Territory;                  
      count;	
      if first.Territory then count = 0;

      if not first.Territory then do;
         count+1;
         CustID = strip(CustID)!!'_'!!strip(put(count,z4.));
      end;
   run;

 

And this is what I have tried to do:

 

proc datasets library=work;
modify MyData;
index create _mylist=(CustID BranchNumber BrancType Territory);
run;

data MyData;
      set MyData;
      by CustID BranchNumber BrancType Territory;                  
      count;	
      if first.Territory then count = 0;

      if not first.Territory then do;
         count+1;
         CustID = strip(CustID)!!'_'!!strip(put(count,z4.));
      end;
   run;

 

However it does not work even though there are duplicates CustIDs per Territory.

 

Are there any ideas, alternatives and/or reasons why it does not work?

 

Thanks!

 

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

First of all: indexes won't help with performance. The creation of the index already involves a sort, and when you process the whole dataset, you need to read the dataset (in random mode!) and the index, which slows you down quite a lot. Indexes are only good (performancewise) for selecting small subsets of the whole data.

 

But a small demonstration with one of the SASHELP datasets shows that both methods provide the same result:

data cars;
set sashelp.cars;
run;

proc sort
  data=cars
  out=cars1
;
by origin make;
run;

data want1 (keep=origin make);
set cars1;
by origin make;
retain count;
if first.make then count = 0;
else do;
  count + 1;
  origin = catx('_',origin,count);
end;
run;

proc datasets lib=work nolist;
modify cars;
index create myind=(origin make);
quit;

data want2 (keep=origin make);
set cars;
by origin make;
retain count;
if first.make then count = 0;
else do;
  count + 1;
  origin = catx('_',origin,count);
end;
run;

proc compare
  base=want1
  compare=want2
  criterion=0.00001
  method=relative
  out=comp(label="Compare Data for WANT1 and WANT2")
  outstats=stat(label="Compare Data Summary Statistics for WANT1 and WANT2")
  outdif
  outnoequal
;
var make origin;
with make origin;
run;

Log:

27         data cars;
28         set sashelp.cars;
29         run;

NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: The data set WORK.CARS has 428 observations and 15 variables.
NOTE: DATA statement used (Total process time):
      real time           0.04 seconds
      cpu time            0.00 seconds
      

30         
31         proc sort
32           data=cars
33           out=cars1
34         ;
35         by origin make;
36         run;

NOTE: There were 428 observations read from the data set WORK.CARS.
NOTE: The data set WORK.CARS1 has 428 observations and 15 variables.
NOTE: PROZEDUR SORT used (Total process time):
      real time           0.03 seconds
      cpu time            0.00 seconds
      

37         
38         data want1 (keep=origin make);
39         set cars1;
40         by origin make;
41         retain count;
2                                                          Das SAS System                               07:31 Tuesday, July 30, 2019

42         if first.make then count = 0;
43         else do;
44           count + 1;
45           origin = catx('_',origin,count);
46         end;
47         run;

NOTE: There were 428 observations read from the data set WORK.CARS1.
NOTE: The data set WORK.WANT1 has 428 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.00 seconds
      

48         
49         proc datasets lib=work nolist;
50         modify cars;
51         index create myind=(origin make);
NOTE: Composite index myind has been defined.
52         quit;

NOTE: MODIFY was successful for WORK.CARS.DATA.
NOTE: PROZEDUR DATASETS used (Total process time):
      real time           0.03 seconds
      cpu time            0.01 seconds
      

53         
54         data want2 (keep=origin make);
55         set cars;
56         by origin make;
57         retain count;
58         if first.make then count = 0;
59         else do;
60           count + 1;
61           origin = catx('_',origin,count);
62         end;
63         run;

NOTE: There were 428 observations read from the data set WORK.CARS.
NOTE: The data set WORK.WANT2 has 428 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      cpu time            0.00 seconds
      

64         
65         proc compare
66           base=want1
67           compare=want2
68           criterion=0.00001
69           method=relative
70           out=comp(label="Compare Data for WANT1 and WANT2")
71           outstats=stat(label="Compare Data Summary Statistics for WANT1 and WANT2")
72           outdif
73           outnoequal
74         ;
75         var make origin;
3                                                          Das SAS System                               07:31 Tuesday, July 30, 2019

76         with make origin;
77         run;

NOTE: There were 428 observations read from the data set WORK.WANT1.
NOTE: There were 428 observations read from the data set WORK.WANT2.
NOTE: The data set WORK.COMP has 0 observations and 4 variables.
NOTE: The data set WORK.STAT has 0 observations and 7 variables.
NOTE: The PROCEDURE COMPARE printed page 1.
NOTE: PROZEDUR COMPARE used (Total process time):
      real time           0.07 seconds
      cpu time            0.00 seconds
 

You can see that the second data step is slightly slower than the first, and that the combined time for index creation + data step is more than the combined time for sort and data step.

And part of the result:

NOTE: No unequal values were found. All values compared are exactly equal.

shows that the resulting datasets are identical.

 

If that does not align with what you found, please supply sufficient example data (in a data step with datalines) to illustrate your issue.

View solution in original post

2 REPLIES 2
Kurt_Bremser
Super User

First of all: indexes won't help with performance. The creation of the index already involves a sort, and when you process the whole dataset, you need to read the dataset (in random mode!) and the index, which slows you down quite a lot. Indexes are only good (performancewise) for selecting small subsets of the whole data.

 

But a small demonstration with one of the SASHELP datasets shows that both methods provide the same result:

data cars;
set sashelp.cars;
run;

proc sort
  data=cars
  out=cars1
;
by origin make;
run;

data want1 (keep=origin make);
set cars1;
by origin make;
retain count;
if first.make then count = 0;
else do;
  count + 1;
  origin = catx('_',origin,count);
end;
run;

proc datasets lib=work nolist;
modify cars;
index create myind=(origin make);
quit;

data want2 (keep=origin make);
set cars;
by origin make;
retain count;
if first.make then count = 0;
else do;
  count + 1;
  origin = catx('_',origin,count);
end;
run;

proc compare
  base=want1
  compare=want2
  criterion=0.00001
  method=relative
  out=comp(label="Compare Data for WANT1 and WANT2")
  outstats=stat(label="Compare Data Summary Statistics for WANT1 and WANT2")
  outdif
  outnoequal
;
var make origin;
with make origin;
run;

Log:

27         data cars;
28         set sashelp.cars;
29         run;

NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: The data set WORK.CARS has 428 observations and 15 variables.
NOTE: DATA statement used (Total process time):
      real time           0.04 seconds
      cpu time            0.00 seconds
      

30         
31         proc sort
32           data=cars
33           out=cars1
34         ;
35         by origin make;
36         run;

NOTE: There were 428 observations read from the data set WORK.CARS.
NOTE: The data set WORK.CARS1 has 428 observations and 15 variables.
NOTE: PROZEDUR SORT used (Total process time):
      real time           0.03 seconds
      cpu time            0.00 seconds
      

37         
38         data want1 (keep=origin make);
39         set cars1;
40         by origin make;
41         retain count;
2                                                          Das SAS System                               07:31 Tuesday, July 30, 2019

42         if first.make then count = 0;
43         else do;
44           count + 1;
45           origin = catx('_',origin,count);
46         end;
47         run;

NOTE: There were 428 observations read from the data set WORK.CARS1.
NOTE: The data set WORK.WANT1 has 428 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.00 seconds
      

48         
49         proc datasets lib=work nolist;
50         modify cars;
51         index create myind=(origin make);
NOTE: Composite index myind has been defined.
52         quit;

NOTE: MODIFY was successful for WORK.CARS.DATA.
NOTE: PROZEDUR DATASETS used (Total process time):
      real time           0.03 seconds
      cpu time            0.01 seconds
      

53         
54         data want2 (keep=origin make);
55         set cars;
56         by origin make;
57         retain count;
58         if first.make then count = 0;
59         else do;
60           count + 1;
61           origin = catx('_',origin,count);
62         end;
63         run;

NOTE: There were 428 observations read from the data set WORK.CARS.
NOTE: The data set WORK.WANT2 has 428 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      cpu time            0.00 seconds
      

64         
65         proc compare
66           base=want1
67           compare=want2
68           criterion=0.00001
69           method=relative
70           out=comp(label="Compare Data for WANT1 and WANT2")
71           outstats=stat(label="Compare Data Summary Statistics for WANT1 and WANT2")
72           outdif
73           outnoequal
74         ;
75         var make origin;
3                                                          Das SAS System                               07:31 Tuesday, July 30, 2019

76         with make origin;
77         run;

NOTE: There were 428 observations read from the data set WORK.WANT1.
NOTE: There were 428 observations read from the data set WORK.WANT2.
NOTE: The data set WORK.COMP has 0 observations and 4 variables.
NOTE: The data set WORK.STAT has 0 observations and 7 variables.
NOTE: The PROCEDURE COMPARE printed page 1.
NOTE: PROZEDUR COMPARE used (Total process time):
      real time           0.07 seconds
      cpu time            0.00 seconds
 

You can see that the second data step is slightly slower than the first, and that the combined time for index creation + data step is more than the combined time for sort and data step.

And part of the result:

NOTE: No unequal values were found. All values compared are exactly equal.

shows that the resulting datasets are identical.

 

If that does not align with what you found, please supply sufficient example data (in a data step with datalines) to illustrate your issue.

Zatere
Quartz | Level 8

Thanks for providing the code! It works as I wanted it to.

 

Regarding the time efficiencies, I thought that using index is faster. Please see below the times I am getting:

 

NOTE: There were 748528 observations read from the data set WORK.SORTDATA.
NOTE: The data set WORK.ORIGDATA has 748528 observations and 213 variables.
NOTE: Compressing data set WORK.ORIGDATA decreased size by 86.35 percent. 
      Compressed is 10219 pages; un-compressed would require 74854 pages.
NOTE: PROCEDURE SORT used (Total process time):
      real time           4:41.85
      cpu time            3:14.87
NOTE: Composite index _mylist has been defined.
17         run;

NOTE: MODIFY was successful for WORK.INDEXDATA.DATA.

NOTE: PROCEDURE DATASETS used (Total process time):
      real time           50.81 seconds
      cpu time            29.16 seconds
      

 

And the times for the second data step is always the same. Maybe there are 2-3 seconds differences. 

 

Any thoughts on that? Am I missing anything? Are there any other disadvantages? 

 

Thanks a lot!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 2 replies
  • 4155 views
  • 1 like
  • 2 in conversation