topic Re: sortedby= in SAS Programming

sortedby=

Steelers_In_DC — Fri, 02 Oct 2015 15:13:44 GMT

I'm searching a large dataset for a specific zip code. I have this dataset sorted by prop_zip_code which is 9 digits, there is another field us_zip5 that it is not sorted on. (next month I will sort on both) If I add sortedby in the set statement does this option tell SAS to treat the search as if it is sorted by us_zip5? I cannot find any clear documentation on that, any clarification or other suggestions for speed are appreciated:

data buckingham;
set mdj.monthly_1mm_zips_sort(sortedby=us_zip5);
where us_zip5 = '18912';
run;

Thanks,

Mark

Re: sortedby=

ballardw — Fri, 02 Oct 2015 15:41:12 GMT

I don't believe that Sortedby is going to have any speed effect on an input data set though it may have an impact on a BY statement. The option is used to set the indicator stored in the dataset for sort order and would be useful for an output set. Normally a data step has no guarantee of what the output order might be but if you have made sure the data is in a specific order setting the option on the output data set would speed things that examine sort order.

Re: sortedby=

Steelers_In_DC — Fri, 02 Oct 2015 16:12:58 GMT

Thanks for the info. It took an hour and two minutes to run. I just added a by statement and am running it again. I'll reply again if there is any improvement.

Re: sortedby=

Astounding — Fri, 02 Oct 2015 18:29:25 GMT

Although SORTEDBY will not help, if your data set is in order, you are allowed to use a BY statement:

data buckingham;
set mdj.monthly_1mm_zips_sort;
where us_zip5 = '18912';

by us_zip5;
run;

It doesn't matter whether PROC SORT performed the sorting or not, it only matters that the observations are in order.

This in itself will not speed up the process, but it may set the stage for other experimentation. For example, you could try:

data buckingham;
set mdj.monthly_1mm_zips_sort;
where us_zip5 = '18912';

by us_zip5;

output;

if last.us_zip5 then stop;
run;

Using the WHERE statement saves the software from reading observations into the PDV. But the software still has to search through blocks of data (not read into the PDV) to check for additional "18912" observations. So adding the STOP logic saves the software from having to search through remaining blocks of data. The savings really depend on how far into the data "18912" appears, but you should be able to measure it and it might be considerable.

Let us know what you discover.

Good luck.