I thought that WHERE subsetting does not affect the order of observations. For example, the following my consists of 500 is with 10 ts each. The WHERE in the second DATA picks t=4,5,6,7 without distorting the order.
data my;
do i=1 to 500;
do t=1 to 10;
output;
end;
end;
run;
data my;
set my;
where 3<t<8;
run;
I realized that the following WRDS example behaves differently when WHERE subsetting. The original data crsp.msf is sorted by PERMNO and DATE, but the resulting msf does not preserve the original order.
%let wrds=wrds.wharton.upenn.edu 4016;
signon wrds username=_prompt_;
rsubmit;
data msf;
set crsp.msf;
where date<"1jan1930"d;
run;
endrsubmit;
And I found that BY restores the original order—PERMNO and DATE. I wonder (1) why the resulting observations are shuffled here, and (2) whether BY after WHERE is necessary to make the observations sequential. Thanks in advance.
So nope. I stand by my original answer, you cannot assume that a data set is sorted. The sort flag does not propogate across. There may be an implicit sort, but you're taking a risk there when SAS is not sorting the data. I will add this may behave differently if you're working on DBs because SQL by definition doesn't maintain row orders.
BY does check for the implicit sort and won't error out if it's there so likely that's what you're seeing and if its good enough for you on this system you're good. If you change things you may have to re-think this, I would probably add a check to ensure the data is sorted as needed or at least if it errors out it will explain why.
*implicitly sorted data set;
data have;
input ID Age;
cards;
1 5
2 8
3 13
4 14
5 3
6 34
7 6
8 3
9 3
10 2
;;;;
*check by/where without sort;
data want1 (label="Implicit Sort, BY and Where");
set have;
by id;
where id in (1, 2, 3);
run;
*explicitly sort data set now;
proc sort data=have out=have2 (label="Sorted input data");
by id;
run;
data want2 (label="explicit Sort, Only Where");
set have2;
where id in (1, 2, 3);
run;
data want3 (label="explicit sort, By and Where");
set have2;
by id;
where id in (1, 2,3 );
run;
data want3 (label="Explicit sort, copy only with BY");
set have2;
by id;
run;
data results;
set sashelp.vtable;
where libname='WORK' and (%Upcase(memname) like '%WANT%' or %Upcase(memname) like '%HAVE%') ;
keep libname memname memlabel sort:;
run;
It seems crsp.msf is sorted by PERMNO and DATE. I wonder whether WHERE affects the sorting status.
When reading from a normal SAS dataset the values are read in order. So the WHERE statement will NOT change the order. But your code will not store anything into the metadata field(s) used to indicate what if any variables were used to sort the data.
Try it yourself.
Add a BY statement to your data step. It will test if the values making it past the WHERE condition do not present themselves to the data step in a sorted order and fail if the are not ordered as indicated by the BY statement If that happens then the metadata that generated the report you posted the photograph of is wrong.
If you just want to subset the data you could use PROC SORT and that would definitely set the order and also store what variables were used to order the data into the metadata.
proc sort out= msf data=crsp.msf;
where date<"1jan1930"d;
by permno date;
run;
The sorting of the original data is correct by PERMNO and DATE. Here I attach the code and the screenshots.
rsubmit;
data msf;
set crsp.msf;
by permno date;/*<-this works correctly*/
run;
data msf1;/*<-the resulting msf1 is not sorted*/
set crsp.msf;
where date<"1jan1930"d;
run;
data msf2;/*<-the resulting msf2 is sorted*/
set crsp.msf;
where date<"1jan1930"d;
by permno date;
run;
endrsubmit;
The code above creates three data sets—MSF, MSF1, and MSF2. The BY in the first DATA has no problem.
The only difference between MSF1 and MSF2 is the BY statement in the third DATA.
3 rsubmit; NOTE: Remote submit to WRDS commencing. 1 data msf; 2 set crsp.msf; 3 by permno date;/*<-this works correctly*/ 4 run; NOTE: There were 4509846 observations read from the data set CRSP.MSF. NOTE: The data set WORK.MSF has 4509846 observations and 21 variables. NOTE: DATA statement used (Total process time): real time 1.61 seconds cpu time 1.52 seconds 5 data msf1;/*<-the resulting msf1 is not sorted*/ 6 set crsp.msf; 7 where date<"1jan1930"d; 8 run; NOTE: There were 29968 observations read from the data set CRSP.MSF. WHERE date<'01JAN1930'D; NOTE: The data set WORK.MSF1 has 29968 observations and 21 variables. NOTE: DATA statement used (Total process time): real time 0.68 seconds cpu time 0.43 seconds 9 data msf2;/*<-the resulting msf2 is sorted*/ 10 set crsp.msf; 11 where date<"1jan1930"d; 12 by permno date; 13 run; NOTE: There were 29968 observations read from the data set CRSP.MSF. WHERE date<'01JAN1930'D; NOTE: The data set WORK.MSF2 has 29968 observations and 21 variables. NOTE: DATA statement used (Total process time): real time 1.00 seconds cpu time 0.44 seconds NOTE: Remote submit to WRDS complete.
The following is the resulting (not original) MSF. The observations are sorted correctly by PERMNO and DATE.
The following is MSF1 without BY. The PERMNO-DATE sorting disappears.
The following is MSF2 with BY. The PERMNO-DATE sorting is correct.
It seems SAS does not access to the original data sequentially if not BY.
So the data is NOT sorted by those variables. But perhaps you have an INDEX on those variables that will allow SAS to access the data in sorted order when you use the BY statement.
See this short example:
data class;
set sashelp.class;
run;
proc datasets nolist;
modify class;
index create sex;
quit;
data class1;
set class;
where age = 14;
by sex;
run;
data class2;
set class;
where age = 14;
run;
Since there is an index file present for the msf dataset, you see the same kind of behaviour. The only thing that puzzles me is your output where it says sorted by PERMNO DATE. Since this is not the output of a proc contents, how did you get that?
I got the information from Details in Properties in Explorer.
The SAS System The CONTENTS Procedure Data Set Name CRSP.MSF Observations 4509846 Member Type DATA Variables 21 Engine V9 Indexes 5 Created 01/31/2019 15:39:40 Observation Length 168 Last Modified 01/31/2019 15:39:48 Deleted Observations 0 Protection Compressed NO Data Set Type Sorted YES Label Monthly Stock - Securities Data Representation SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 Encoding latin1 Western (ISO) Engine/Host Dependent Information Data Set Page Size 65536 Number of Data Set Pages 11595 First Data Page 1 Max Obs per Page 389 Obs in First Data Page 361 Index File Page Size 8192 Number of Index File Pages 40018 Number of Data Set Repairs 0 Filename /wrds/crsp/sasdata/a_stock/msf.sas7bdat Release Created 9.0401M5 Host Created Linux Inode Number 2465424092 Access Permission rw-r----- Owner Name wrdsadmn File Size 725MB File Size (bytes) 759955456 Alphabetic List of Variables and Attributes # Variable Type Len Format Informat Label 18 ALTPRC Num 8 12.5 12.5 Price Alternate 20 ALTPRCDT Num 8 YYMMDDN8. YYMMDD6. Alternate Price Date 14 ASK Num 8 11.5 11.5 Ask 9 ASKHI Num 8 12.5 12.5 Ask or High Price 13 BID Num 8 11.5 11.5 Bid 8 BIDLO Num 8 12.5 12.5 Bid or Low Price 16 CFACPR Num 8 Cumulative Factor to Adjust Prices 17 CFACSHR Num 8 Cumulative Factor to Adjust Shares/Vol 1 CUSIP Char 8 8. 8. CUSIP Header 7 DATE Num 8 YYMMDDN8. Date of Observation 5 HEXCD Num 8 2. 2. Exchange Code Header 6 HSICCD Num 8 8. 8. Standard Industrial Classification Code 4 ISSUNO Num 8 8. 8. Nasdaq Issue Number 3 PERMCO Num 8 8. 8. PERMCO 2 PERMNO Num 8 8. 8. PERMNO 10 PRC Num 8 12.5 12.5 Price or Bid/Ask Average 12 RET Num 8 11.6 11.6 Returns 21 RETX Num 8 11.6 11.6 Returns without Dividends 15 SHROUT Num 8 Shares Outstanding 19 SPREAD Num 8 11.5 11.5 Spread Between Bid and Ask 11 VOL Num 8 10. 10. Volume Alphabetic List of Indexes and Attributes # of Unique # Index Values 1 CUSIP 32985 2 DATE 1117 3 HSICCD 1445 4 PERMCO 29481 5 PERMNO 32985 Sort Information Sortedby PERMNO DATE Validated YES Character Set ASCII Sort Option NODUPKEY
Here I attach the results from PROC CONTENTS too.
It is impossible for me to tell from your photographs of your browser window of the dataset whether the data was sorted or not.
Please run the following test instead.
* SUBSET the data ;
data msf1;
set crsp.msf;
where date<"1jan1930"d;
run;
* TEST if sorted ;
data _null_;
set msf1 ;
by permno date;
run;
The PROC CONTENTS output right above says crsp.msf is sorted (Sortedby PERMNO DATE). It seems (1) though crsp.msf per se is sorted, crsp.msf with WHERE is not, and (2) though crsp.msf with WHERE is not sorted, crsp.msf with both WHERE and BY is sorted. Thanks.
If the data is sorted when read in and you use a BY statement SAS will not generate an error.
If the data is sorted with a prior sort and sort entry in PROC CONTENTS then it will not generate an error.
However, these are two different cases that may need to be considered separately.
Please post the lines from the SAS log for running the two step test that I posted before.
The code generates an error.
rsubmit;
* SUBSET the data ;
data msf1;
set crsp.msf;
where date<"1jan1930"d;
run;
* TEST if sorted ;
data _null_;
set msf1 ;
by permno date;
run;
endrsubmit;
And the output here.
1 rsubmit; NOTE: Remote submit to WRDS commencing. 1 * SUBSET the data ; 2 data msf1; 3 set crsp.msf; 4 where date<"1jan1930"d; 5 run; NOTE: There were 29968 observations read from the data set CRSP.MSF. WHERE date<'01JAN1930'D; NOTE: The data set WORK.MSF1 has 29968 observations and 21 variables. NOTE: DATA statement used (Total process time): real time 0.69 seconds cpu time 0.36 seconds 6 * TEST if sorted ; 7 data _null_; 8 set msf1 ; 9 by permno date; 10 run; ERROR: BY variables are not properly sorted on data set WORK.MSF1. CUSIP=69499890 PERMNO=75471 PERMCO=25928 ISSUNO=0 HEXCD=1 HSICCD=3490 DATE=19251231 BIDLO=. ASKHI=. PRC=-32.00000 VOL=. RET=C BID=31.00000 ASK=33.00000 SHROUT=70 CFACPR=4.810256 CFACSHR=4 ALTPRC=-32.00000 SPREAD=2.00000 ALTPRCDT=19251231 RETX=C FIRST.PERMNO=1 LAST.PERMNO=1 FIRST.DATE=1 LAST.DATE=1 _ERROR_=1 _N_=520 NOTE: The SAS System stopped processing this step because of errors. NOTE: There were 521 observations read from the data set WORK.MSF1. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds NOTE: Remote submit to WRDS complete.
Adding BY eliminates the error as follows.
rsubmit;
* SUBSET the data ;
data msf1;
set crsp.msf;
where date<"1jan1930"d;
by permno date;
run;
* TEST if sorted ;
data _null_;
set msf1 ;
by permno date;
run;
endrsubmit;
And the output here too.
2 rsubmit; NOTE: Remote submit to WRDS commencing. 11 * SUBSET the data ; 12 data msf1; 13 set crsp.msf; 14 where date<"1jan1930"d; 15 by permno date; 16 run; NOTE: There were 29968 observations read from the data set CRSP.MSF. WHERE date<'01JAN1930'D; NOTE: The data set WORK.MSF1 has 29968 observations and 21 variables. NOTE: DATA statement used (Total process time): real time 0.79 seconds cpu time 0.32 seconds 17 * TEST if sorted ; 18 data _null_; 19 set msf1 ; 20 by permno date; 21 run; NOTE: There were 29968 observations read from the data set WORK.MSF1. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds NOTE: Remote submit to WRDS complete.
(1) crsp.msf is sorted. (2) crsp.msf+WHERE is not sorted. (3) crsp.msf+WHERE+BY is sorted.
Do you know how that libname, CRSP, is defined? Is it using the BASE (default) SAS engine? SAS/Share? Some other engine?
Also what version of SAS is that remote session running?
crsp is the REMOTE engine from WRDS, and it's SAS 9.4 TS1M5.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.