So the remote session you are submitting that code to is also connected remotely to something else?
Perhaps it is using SAS/Share?
I access to the WRDS database by SAS/CONNECT, but don't know much about the interface WRDS uses.
It all comes down to Maxim 31.
You want a specific order? Force it.
I couldn't find much on the REMOTE engine, but from your log, looks like the REMOTE engine may not read records sequentially. It could be doing some mutli-threading or similar in the background, I suppose.
That's an intersting "gotcha", since reading a SAS dataset in a DATA step is (I thought) always sequential.
Adding the BY statement must force the REMOTE enginge to read data sequentially (in order to create first. and last. variables).
I'm curious, can you see the difference with something like:
data unsorted;
set crsp.msf ;
where date<"1jan1930"d;
put (_N_ permno date)(=) ;
if _N_>100 then stop;
run;
data sorted;
set crsp.msf;
where date<"1jan1930"d;
put (_N_ permno date)(=) ;
by permno date;
if _N_>100 then stop;
run;
If it really is that the REMOTE engine is free to read a SAS dataset in any (?) order that would be scary to me. It's good that adding a BY statement forces it to read sequentially. But there are plenty of other times where code expects to read a SAS dataset sequentially (e.g. using lag function, or whatever).
We know relational databases don't store records in order therefore you can't expect to read them in a defined sequential order, and changes to PROC SQL code can easily change sort order of output if you don't explicitly order it.
But I always thought your order would be safe just reading and writing a SAS dataset in a DATA step. It's definitely safe with the BASE engine.
Yes. crsp.msf is good. crsp.msf+WHERE+BY is also good. Only crsp.msf+WHERE is problematic.
Here is the code.
rsubmit;
data _null_;
set crsp.msf ;
put (_N_ permno date)(=) ;
if _N_>10 then stop;
run;
data unsorted;
set crsp.msf ;
where date<"1jan1930"d;
put (_N_ permno date)(=) ;
if _N_>10 then stop;
run;
data sorted;
set crsp.msf;
where date<"1jan1930"d;
put (_N_ permno date)(=) ;
by permno date;
if _N_>10 then stop;
run;
endrsubmit;
And here is the output.
1 rsubmit; NOTE: Remote submit to WRDS commencing. 1 data _null_; 2 set crsp.msf ; 3 put (_N_ permno date)(=) ; 4 5 if _N_>10 then stop; 6 run; _N_=1 PERMNO=10000 DATE=19851231 _N_=2 PERMNO=10000 DATE=19860131 _N_=3 PERMNO=10000 DATE=19860228 _N_=4 PERMNO=10000 DATE=19860331 _N_=5 PERMNO=10000 DATE=19860430 _N_=6 PERMNO=10000 DATE=19860530 _N_=7 PERMNO=10000 DATE=19860630 _N_=8 PERMNO=10000 DATE=19860731 _N_=9 PERMNO=10000 DATE=19860829 _N_=10 PERMNO=10000 DATE=19860930 _N_=11 PERMNO=10000 DATE=19861031 NOTE: There were 11 observations read from the data set CRSP.MSF. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.01 seconds 7 8 data unsorted; 9 set crsp.msf ; 10 where date<"1jan1930"d; 11 put (_N_ permno date)(=) ; 12 13 if _N_>10 then stop; 14 run; _N_=1 PERMNO=10006 DATE=19251231 _N_=2 PERMNO=10014 DATE=19251231 _N_=3 PERMNO=10022 DATE=19251231 _N_=4 PERMNO=10030 DATE=19251231 _N_=5 PERMNO=10049 DATE=19251231 _N_=6 PERMNO=10057 DATE=19251231 _N_=7 PERMNO=10065 DATE=19251231 _N_=8 PERMNO=10073 DATE=19251231 _N_=9 PERMNO=10081 DATE=19251231 _N_=10 PERMNO=10102 DATE=19251231 _N_=11 PERMNO=10110 DATE=19251231 NOTE: There were 11 observations read from the data set CRSP.MSF. WHERE date<'01JAN1930'D; NOTE: The data set WORK.UNSORTED has 10 observations and 21 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.00 seconds 15 16 data sorted; 17 set crsp.msf; 18 where date<"1jan1930"d; 19 put (_N_ permno date)(=) ; 20 by permno date; 21 22 if _N_>10 then stop; 23 run; _N_=1 PERMNO=10006 DATE=19251231 _N_=2 PERMNO=10006 DATE=19260130 _N_=3 PERMNO=10006 DATE=19260227 _N_=4 PERMNO=10006 DATE=19260331 _N_=5 PERMNO=10006 DATE=19260430 _N_=6 PERMNO=10006 DATE=19260528 _N_=7 PERMNO=10006 DATE=19260630 _N_=8 PERMNO=10006 DATE=19260731 _N_=9 PERMNO=10006 DATE=19260831 _N_=10 PERMNO=10006 DATE=19260930 _N_=11 PERMNO=10006 DATE=19261030 NOTE: There were 12 observations read from the data set CRSP.MSF. WHERE date<'01JAN1930'D; NOTE: The data set WORK.SORTED has 10 observations and 21 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.01 seconds NOTE: Remote submit to WRDS complete.
If sequential access is necessary, then BY will be important.
The PROC CONTENTS output above displays Sortedby PERMNO DATE. It seems WHERE must be accompanied by BY to preserve the sorting status. Thanks.
Where is applied before BY so that doesn't seem right either....
If a DATA step contains both a WHERE statement and a BY statement, the WHERE statement executes before BY groups are created. Therefore, BY groups reflect groups of observations in the subset of observations that are selected by the WHERE statement, not the actual BY groups of observations in the original input data set.
For a complete discussion of BY-group processing, see BY-Group Processing in the DATA Step in SAS Language Reference: Concepts.
EDIT: I suspect the BY statement adds a sort flag to data set, but will check and confirm.
So nope. I stand by my original answer, you cannot assume that a data set is sorted. The sort flag does not propogate across. There may be an implicit sort, but you're taking a risk there when SAS is not sorting the data. I will add this may behave differently if you're working on DBs because SQL by definition doesn't maintain row orders.
BY does check for the implicit sort and won't error out if it's there so likely that's what you're seeing and if its good enough for you on this system you're good. If you change things you may have to re-think this, I would probably add a check to ensure the data is sorted as needed or at least if it errors out it will explain why.
*implicitly sorted data set;
data have;
input ID Age;
cards;
1 5
2 8
3 13
4 14
5 3
6 34
7 6
8 3
9 3
10 2
;;;;
*check by/where without sort;
data want1 (label="Implicit Sort, BY and Where");
set have;
by id;
where id in (1, 2, 3);
run;
*explicitly sort data set now;
proc sort data=have out=have2 (label="Sorted input data");
by id;
run;
data want2 (label="explicit Sort, Only Where");
set have2;
where id in (1, 2, 3);
run;
data want3 (label="explicit sort, By and Where");
set have2;
by id;
where id in (1, 2,3 );
run;
data want3 (label="Explicit sort, copy only with BY");
set have2;
by id;
run;
data results;
set sashelp.vtable;
where libname='WORK' and (%Upcase(memname) like '%WANT%' or %Upcase(memname) like '%HAVE%') ;
keep libname memname memlabel sort:;
run;
So everything other than the explicit sorting would be risky then. For example, just have2 is safe. I use want1 or want3 as rely on the implicit sorting, but seem to be risky. My example is closer to want2 (have2 is explicitly sorted, but want2 is not necessarily. Likewise, crsp.msf is explicitly sorted, but will not be as soon as used with WHERE).
If a dataset is stored in a SPDS library, there can be side-effects from the way the data is stored in the "buckets" and the bucket metadata is used for optimizing the where.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.