Hi there,
What I am really trying to achieve is to replace the PROC SORT with a more time efficient way.
I came across the idea to use index. That is my original code:
proc sort data=MyData;
by CustID BranchNumber BrancType Territory;
run;
data MyData;
set MyData;
by CustID BranchNumber BrancType Territory;
count;
if first.Territory then count = 0;
if not first.Territory then do;
count+1;
CustID = strip(CustID)!!'_'!!strip(put(count,z4.));
end;
run;
And this is what I have tried to do:
proc datasets library=work;
modify MyData;
index create _mylist=(CustID BranchNumber BrancType Territory);
run;
data MyData;
set MyData;
by CustID BranchNumber BrancType Territory;
count;
if first.Territory then count = 0;
if not first.Territory then do;
count+1;
CustID = strip(CustID)!!'_'!!strip(put(count,z4.));
end;
run;
However it does not work even though there are duplicates CustIDs per Territory.
Are there any ideas, alternatives and/or reasons why it does not work?
Thanks!
First of all: indexes won't help with performance. The creation of the index already involves a sort, and when you process the whole dataset, you need to read the dataset (in random mode!) and the index, which slows you down quite a lot. Indexes are only good (performancewise) for selecting small subsets of the whole data.
But a small demonstration with one of the SASHELP datasets shows that both methods provide the same result:
data cars;
set sashelp.cars;
run;
proc sort
data=cars
out=cars1
;
by origin make;
run;
data want1 (keep=origin make);
set cars1;
by origin make;
retain count;
if first.make then count = 0;
else do;
count + 1;
origin = catx('_',origin,count);
end;
run;
proc datasets lib=work nolist;
modify cars;
index create myind=(origin make);
quit;
data want2 (keep=origin make);
set cars;
by origin make;
retain count;
if first.make then count = 0;
else do;
count + 1;
origin = catx('_',origin,count);
end;
run;
proc compare
base=want1
compare=want2
criterion=0.00001
method=relative
out=comp(label="Compare Data for WANT1 and WANT2")
outstats=stat(label="Compare Data Summary Statistics for WANT1 and WANT2")
outdif
outnoequal
;
var make origin;
with make origin;
run;
Log:
27 data cars;
28 set sashelp.cars;
29 run;
NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: The data set WORK.CARS has 428 observations and 15 variables.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.00 seconds
30
31 proc sort
32 data=cars
33 out=cars1
34 ;
35 by origin make;
36 run;
NOTE: There were 428 observations read from the data set WORK.CARS.
NOTE: The data set WORK.CARS1 has 428 observations and 15 variables.
NOTE: PROZEDUR SORT used (Total process time):
real time 0.03 seconds
cpu time 0.00 seconds
37
38 data want1 (keep=origin make);
39 set cars1;
40 by origin make;
41 retain count;
2 Das SAS System 07:31 Tuesday, July 30, 2019
42 if first.make then count = 0;
43 else do;
44 count + 1;
45 origin = catx('_',origin,count);
46 end;
47 run;
NOTE: There were 428 observations read from the data set WORK.CARS1.
NOTE: The data set WORK.WANT1 has 428 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.02 seconds
cpu time 0.00 seconds
48
49 proc datasets lib=work nolist;
50 modify cars;
51 index create myind=(origin make);
NOTE: Composite index myind has been defined.
52 quit;
NOTE: MODIFY was successful for WORK.CARS.DATA.
NOTE: PROZEDUR DATASETS used (Total process time):
real time 0.03 seconds
cpu time 0.01 seconds
53
54 data want2 (keep=origin make);
55 set cars;
56 by origin make;
57 retain count;
58 if first.make then count = 0;
59 else do;
60 count + 1;
61 origin = catx('_',origin,count);
62 end;
63 run;
NOTE: There were 428 observations read from the data set WORK.CARS.
NOTE: The data set WORK.WANT2 has 428 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.00 seconds
64
65 proc compare
66 base=want1
67 compare=want2
68 criterion=0.00001
69 method=relative
70 out=comp(label="Compare Data for WANT1 and WANT2")
71 outstats=stat(label="Compare Data Summary Statistics for WANT1 and WANT2")
72 outdif
73 outnoequal
74 ;
75 var make origin;
3 Das SAS System 07:31 Tuesday, July 30, 2019
76 with make origin;
77 run;
NOTE: There were 428 observations read from the data set WORK.WANT1.
NOTE: There were 428 observations read from the data set WORK.WANT2.
NOTE: The data set WORK.COMP has 0 observations and 4 variables.
NOTE: The data set WORK.STAT has 0 observations and 7 variables.
NOTE: The PROCEDURE COMPARE printed page 1.
NOTE: PROZEDUR COMPARE used (Total process time):
real time 0.07 seconds
cpu time 0.00 seconds
You can see that the second data step is slightly slower than the first, and that the combined time for index creation + data step is more than the combined time for sort and data step.
And part of the result:
NOTE: No unequal values were found. All values compared are exactly equal.
shows that the resulting datasets are identical.
If that does not align with what you found, please supply sufficient example data (in a data step with datalines) to illustrate your issue.
First of all: indexes won't help with performance. The creation of the index already involves a sort, and when you process the whole dataset, you need to read the dataset (in random mode!) and the index, which slows you down quite a lot. Indexes are only good (performancewise) for selecting small subsets of the whole data.
But a small demonstration with one of the SASHELP datasets shows that both methods provide the same result:
data cars;
set sashelp.cars;
run;
proc sort
data=cars
out=cars1
;
by origin make;
run;
data want1 (keep=origin make);
set cars1;
by origin make;
retain count;
if first.make then count = 0;
else do;
count + 1;
origin = catx('_',origin,count);
end;
run;
proc datasets lib=work nolist;
modify cars;
index create myind=(origin make);
quit;
data want2 (keep=origin make);
set cars;
by origin make;
retain count;
if first.make then count = 0;
else do;
count + 1;
origin = catx('_',origin,count);
end;
run;
proc compare
base=want1
compare=want2
criterion=0.00001
method=relative
out=comp(label="Compare Data for WANT1 and WANT2")
outstats=stat(label="Compare Data Summary Statistics for WANT1 and WANT2")
outdif
outnoequal
;
var make origin;
with make origin;
run;
Log:
27 data cars;
28 set sashelp.cars;
29 run;
NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: The data set WORK.CARS has 428 observations and 15 variables.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.00 seconds
30
31 proc sort
32 data=cars
33 out=cars1
34 ;
35 by origin make;
36 run;
NOTE: There were 428 observations read from the data set WORK.CARS.
NOTE: The data set WORK.CARS1 has 428 observations and 15 variables.
NOTE: PROZEDUR SORT used (Total process time):
real time 0.03 seconds
cpu time 0.00 seconds
37
38 data want1 (keep=origin make);
39 set cars1;
40 by origin make;
41 retain count;
2 Das SAS System 07:31 Tuesday, July 30, 2019
42 if first.make then count = 0;
43 else do;
44 count + 1;
45 origin = catx('_',origin,count);
46 end;
47 run;
NOTE: There were 428 observations read from the data set WORK.CARS1.
NOTE: The data set WORK.WANT1 has 428 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.02 seconds
cpu time 0.00 seconds
48
49 proc datasets lib=work nolist;
50 modify cars;
51 index create myind=(origin make);
NOTE: Composite index myind has been defined.
52 quit;
NOTE: MODIFY was successful for WORK.CARS.DATA.
NOTE: PROZEDUR DATASETS used (Total process time):
real time 0.03 seconds
cpu time 0.01 seconds
53
54 data want2 (keep=origin make);
55 set cars;
56 by origin make;
57 retain count;
58 if first.make then count = 0;
59 else do;
60 count + 1;
61 origin = catx('_',origin,count);
62 end;
63 run;
NOTE: There were 428 observations read from the data set WORK.CARS.
NOTE: The data set WORK.WANT2 has 428 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.00 seconds
64
65 proc compare
66 base=want1
67 compare=want2
68 criterion=0.00001
69 method=relative
70 out=comp(label="Compare Data for WANT1 and WANT2")
71 outstats=stat(label="Compare Data Summary Statistics for WANT1 and WANT2")
72 outdif
73 outnoequal
74 ;
75 var make origin;
3 Das SAS System 07:31 Tuesday, July 30, 2019
76 with make origin;
77 run;
NOTE: There were 428 observations read from the data set WORK.WANT1.
NOTE: There were 428 observations read from the data set WORK.WANT2.
NOTE: The data set WORK.COMP has 0 observations and 4 variables.
NOTE: The data set WORK.STAT has 0 observations and 7 variables.
NOTE: The PROCEDURE COMPARE printed page 1.
NOTE: PROZEDUR COMPARE used (Total process time):
real time 0.07 seconds
cpu time 0.00 seconds
You can see that the second data step is slightly slower than the first, and that the combined time for index creation + data step is more than the combined time for sort and data step.
And part of the result:
NOTE: No unequal values were found. All values compared are exactly equal.
shows that the resulting datasets are identical.
If that does not align with what you found, please supply sufficient example data (in a data step with datalines) to illustrate your issue.
Thanks for providing the code! It works as I wanted it to.
Regarding the time efficiencies, I thought that using index is faster. Please see below the times I am getting:
NOTE: There were 748528 observations read from the data set WORK.SORTDATA.
NOTE: The data set WORK.ORIGDATA has 748528 observations and 213 variables.
NOTE: Compressing data set WORK.ORIGDATA decreased size by 86.35 percent.
Compressed is 10219 pages; un-compressed would require 74854 pages.
NOTE: PROCEDURE SORT used (Total process time):
real time 4:41.85
cpu time 3:14.87
NOTE: Composite index _mylist has been defined.
17 run;
NOTE: MODIFY was successful for WORK.INDEXDATA.DATA.
NOTE: PROCEDURE DATASETS used (Total process time):
real time 50.81 seconds
cpu time 29.16 seconds
And the times for the second data step is always the same. Maybe there are 2-3 seconds differences.
Any thoughts on that? Am I missing anything? Are there any other disadvantages?
Thanks a lot!
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
Still thinking about your presentation idea? The submission deadline has been extended to Friday, Nov. 14, at 11:59 p.m. ET.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.