I thought that WHERE subsetting does not affect the order of observations. For example, the following my consists of 500 is with 10 ts each. The WHERE in the second DATA picks t=4,5,6,7 without distorting the order.
data my;
do i=1 to 500;
do t=1 to 10;
output;
end;
end;
run;
data my;
set my;
where 3<t<8;
run;I realized that the following WRDS example behaves differently when WHERE subsetting. The original data crsp.msf is sorted by PERMNO and DATE, but the resulting msf does not preserve the original order.
%let wrds=wrds.wharton.upenn.edu 4016;
signon wrds username=_prompt_;
rsubmit;
data msf;
set crsp.msf;
where date<"1jan1930"d;
run;
endrsubmit;And I found that BY restores the original order—PERMNO and DATE. I wonder (1) why the resulting observations are shuffled here, and (2) whether BY after WHERE is necessary to make the observations sequential. Thanks in advance.
So nope. I stand by my original answer, you cannot assume that a data set is sorted. The sort flag does not propogate across. There may be an implicit sort, but you're taking a risk there when SAS is not sorting the data. I will add this may behave differently if you're working on DBs because SQL by definition doesn't maintain row orders.
BY does check for the implicit sort and won't error out if it's there so likely that's what you're seeing and if its good enough for you on this system you're good. If you change things you may have to re-think this, I would probably add a check to ensure the data is sorted as needed or at least if it errors out it will explain why.
*implicitly sorted data set;
data have;
input ID Age;
cards;
1 5
2 8
3 13
4 14
5 3
6 34
7 6
8 3
9 3
10 2
;;;;
*check by/where without sort;
data want1 (label="Implicit Sort, BY and Where");
set have;
by id;
where id in (1, 2, 3);
run;
*explicitly sort data set now;
proc sort data=have out=have2 (label="Sorted input data");
by id;
run;
data want2 (label="explicit Sort, Only Where");
set have2;
where id in (1, 2, 3);
run;
data want3 (label="explicit sort, By and Where");
set have2;
by id;
where id in (1, 2,3 );
run;
data want3 (label="Explicit sort, copy only with BY");
set have2;
by id;
run;
data results;
set sashelp.vtable;
where libname='WORK' and (%Upcase(memname) like '%WANT%' or %Upcase(memname) like '%HAVE%') ;
keep libname memname memlabel sort:;
run;
It seems crsp.msf is sorted by PERMNO and DATE. I wonder whether WHERE affects the sorting status.
When reading from a normal SAS dataset the values are read in order. So the WHERE statement will NOT change the order. But your code will not store anything into the metadata field(s) used to indicate what if any variables were used to sort the data.
Try it yourself.
Add a BY statement to your data step. It will test if the values making it past the WHERE condition do not present themselves to the data step in a sorted order and fail if the are not ordered as indicated by the BY statement If that happens then the metadata that generated the report you posted the photograph of is wrong.
If you just want to subset the data you could use PROC SORT and that would definitely set the order and also store what variables were used to order the data into the metadata.
proc sort out= msf data=crsp.msf;
where date<"1jan1930"d;
by permno date;
run;
The sorting of the original data is correct by PERMNO and DATE. Here I attach the code and the screenshots.
rsubmit;
data msf;
set crsp.msf;
by permno date;/*<-this works correctly*/
run;
data msf1;/*<-the resulting msf1 is not sorted*/
set crsp.msf;
where date<"1jan1930"d;
run;
data msf2;/*<-the resulting msf2 is sorted*/
set crsp.msf;
where date<"1jan1930"d;
by permno date;
run;
endrsubmit;The code above creates three data sets—MSF, MSF1, and MSF2. The BY in the first DATA has no problem.
The only difference between MSF1 and MSF2 is the BY statement in the third DATA.
3 rsubmit;
NOTE: Remote submit to WRDS commencing.
1 data msf;
2 set crsp.msf;
3 by permno date;/*<-this works correctly*/
4 run;
NOTE: There were 4509846 observations read from the data set CRSP.MSF.
NOTE: The data set WORK.MSF has 4509846 observations and 21 variables.
NOTE: DATA statement used (Total process time):
real time 1.61 seconds
cpu time 1.52 seconds
5 data msf1;/*<-the resulting msf1 is not sorted*/
6 set crsp.msf;
7 where date<"1jan1930"d;
8 run;
NOTE: There were 29968 observations read from the data set CRSP.MSF.
WHERE date<'01JAN1930'D;
NOTE: The data set WORK.MSF1 has 29968 observations and 21 variables.
NOTE: DATA statement used (Total process time):
real time 0.68 seconds
cpu time 0.43 seconds
9 data msf2;/*<-the resulting msf2 is sorted*/
10 set crsp.msf;
11 where date<"1jan1930"d;
12 by permno date;
13 run;
NOTE: There were 29968 observations read from the data set CRSP.MSF.
WHERE date<'01JAN1930'D;
NOTE: The data set WORK.MSF2 has 29968 observations and 21 variables.
NOTE: DATA statement used (Total process time):
real time 1.00 seconds
cpu time 0.44 seconds
NOTE: Remote submit to WRDS complete.The following is the resulting (not original) MSF. The observations are sorted correctly by PERMNO and DATE.
The following is MSF1 without BY. The PERMNO-DATE sorting disappears.
The following is MSF2 with BY. The PERMNO-DATE sorting is correct.
It seems SAS does not access to the original data sequentially if not BY.
So the data is NOT sorted by those variables. But perhaps you have an INDEX on those variables that will allow SAS to access the data in sorted order when you use the BY statement.
See this short example:
data class;
set sashelp.class;
run;
proc datasets nolist;
modify class;
index create sex;
quit;
data class1;
set class;
where age = 14;
by sex;
run;
data class2;
set class;
where age = 14;
run;
Since there is an index file present for the msf dataset, you see the same kind of behaviour. The only thing that puzzles me is your output where it says sorted by PERMNO DATE. Since this is not the output of a proc contents, how did you get that?
I got the information from Details in Properties in Explorer.
The SAS System
The CONTENTS Procedure
Data Set Name CRSP.MSF Observations 4509846
Member Type DATA Variables 21
Engine V9 Indexes 5
Created 01/31/2019 15:39:40 Observation Length 168
Last Modified 01/31/2019 15:39:48 Deleted Observations 0
Protection Compressed NO
Data Set Type Sorted YES
Label Monthly Stock - Securities
Data Representation SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
Encoding latin1 Western (ISO)
Engine/Host Dependent Information
Data Set Page Size 65536
Number of Data Set Pages 11595
First Data Page 1
Max Obs per Page 389
Obs in First Data Page 361
Index File Page Size 8192
Number of Index File Pages 40018
Number of Data Set Repairs 0
Filename /wrds/crsp/sasdata/a_stock/msf.sas7bdat
Release Created 9.0401M5
Host Created Linux
Inode Number 2465424092
Access Permission rw-r-----
Owner Name wrdsadmn
File Size 725MB
File Size (bytes) 759955456
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat Label
18 ALTPRC Num 8 12.5 12.5 Price Alternate
20 ALTPRCDT Num 8 YYMMDDN8. YYMMDD6. Alternate Price Date
14 ASK Num 8 11.5 11.5 Ask
9 ASKHI Num 8 12.5 12.5 Ask or High Price
13 BID Num 8 11.5 11.5 Bid
8 BIDLO Num 8 12.5 12.5 Bid or Low Price
16 CFACPR Num 8 Cumulative Factor to Adjust Prices
17 CFACSHR Num 8 Cumulative Factor to Adjust Shares/Vol
1 CUSIP Char 8 8. 8. CUSIP Header
7 DATE Num 8 YYMMDDN8. Date of Observation
5 HEXCD Num 8 2. 2. Exchange Code Header
6 HSICCD Num 8 8. 8. Standard Industrial Classification Code
4 ISSUNO Num 8 8. 8. Nasdaq Issue Number
3 PERMCO Num 8 8. 8. PERMCO
2 PERMNO Num 8 8. 8. PERMNO
10 PRC Num 8 12.5 12.5 Price or Bid/Ask Average
12 RET Num 8 11.6 11.6 Returns
21 RETX Num 8 11.6 11.6 Returns without Dividends
15 SHROUT Num 8 Shares Outstanding
19 SPREAD Num 8 11.5 11.5 Spread Between Bid and Ask
11 VOL Num 8 10. 10. Volume
Alphabetic List of Indexes and Attributes
# of
Unique
# Index Values
1 CUSIP 32985
2 DATE 1117
3 HSICCD 1445
4 PERMCO 29481
5 PERMNO 32985
Sort Information
Sortedby PERMNO DATE
Validated YES
Character Set ASCII
Sort Option NODUPKEYHere I attach the results from PROC CONTENTS too.
It is impossible for me to tell from your photographs of your browser window of the dataset whether the data was sorted or not.
Please run the following test instead.
* SUBSET the data ;
data msf1;
set crsp.msf;
where date<"1jan1930"d;
run;
* TEST if sorted ;
data _null_;
set msf1 ;
by permno date;
run;
The PROC CONTENTS output right above says crsp.msf is sorted (Sortedby PERMNO DATE). It seems (1) though crsp.msf per se is sorted, crsp.msf with WHERE is not, and (2) though crsp.msf with WHERE is not sorted, crsp.msf with both WHERE and BY is sorted. Thanks.
If the data is sorted when read in and you use a BY statement SAS will not generate an error.
If the data is sorted with a prior sort and sort entry in PROC CONTENTS then it will not generate an error.
However, these are two different cases that may need to be considered separately.
Please post the lines from the SAS log for running the two step test that I posted before.
The code generates an error.
rsubmit;
* SUBSET the data ;
data msf1;
set crsp.msf;
where date<"1jan1930"d;
run;
* TEST if sorted ;
data _null_;
set msf1 ;
by permno date;
run;
endrsubmit;And the output here.
1 rsubmit;
NOTE: Remote submit to WRDS commencing.
1 * SUBSET the data ;
2 data msf1;
3 set crsp.msf;
4 where date<"1jan1930"d;
5 run;
NOTE: There were 29968 observations read from the data set CRSP.MSF.
WHERE date<'01JAN1930'D;
NOTE: The data set WORK.MSF1 has 29968 observations and 21 variables.
NOTE: DATA statement used (Total process time):
real time 0.69 seconds
cpu time 0.36 seconds
6 * TEST if sorted ;
7 data _null_;
8 set msf1 ;
9 by permno date;
10 run;
ERROR: BY variables are not properly sorted on data set WORK.MSF1.
CUSIP=69499890 PERMNO=75471 PERMCO=25928 ISSUNO=0 HEXCD=1 HSICCD=3490
DATE=19251231 BIDLO=. ASKHI=. PRC=-32.00000 VOL=. RET=C BID=31.00000
ASK=33.00000 SHROUT=70 CFACPR=4.810256 CFACSHR=4 ALTPRC=-32.00000
SPREAD=2.00000
ALTPRCDT=19251231 RETX=C FIRST.PERMNO=1 LAST.PERMNO=1 FIRST.DATE=1
LAST.DATE=1
_ERROR_=1 _N_=520
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 521 observations read from the data set WORK.MSF1.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
NOTE: Remote submit to WRDS complete.Adding BY eliminates the error as follows.
rsubmit;
* SUBSET the data ;
data msf1;
set crsp.msf;
where date<"1jan1930"d;
by permno date;
run;
* TEST if sorted ;
data _null_;
set msf1 ;
by permno date;
run;
endrsubmit;And the output here too.
2 rsubmit;
NOTE: Remote submit to WRDS commencing.
11 * SUBSET the data ;
12 data msf1;
13 set crsp.msf;
14 where date<"1jan1930"d;
15 by permno date;
16 run;
NOTE: There were 29968 observations read from the data set CRSP.MSF.
WHERE date<'01JAN1930'D;
NOTE: The data set WORK.MSF1 has 29968 observations and 21 variables.
NOTE: DATA statement used (Total process time):
real time 0.79 seconds
cpu time 0.32 seconds
17 * TEST if sorted ;
18 data _null_;
19 set msf1 ;
20 by permno date;
21 run;
NOTE: There were 29968 observations read from the data set WORK.MSF1.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
NOTE: Remote submit to WRDS complete.(1) crsp.msf is sorted. (2) crsp.msf+WHERE is not sorted. (3) crsp.msf+WHERE+BY is sorted.
Do you know how that libname, CRSP, is defined? Is it using the BASE (default) SAS engine? SAS/Share? Some other engine?
Also what version of SAS is that remote session running?
crsp is the REMOTE engine from WRDS, and it's SAS 9.4 TS1M5.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.