BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Junyong
Pyrite | Level 9

I thought that WHERE subsetting does not affect the order of observations. For example, the following my consists of 500 is with 10 ts each. The WHERE in the second DATA picks t=4,5,6,7 without distorting the order.

data my;
do i=1 to 500;
do t=1 to 10;
output;
end;
end;
run;
data my;
set my;
where 3<t<8;
run;

I realized that the following WRDS example behaves differently when WHERE subsetting. The original data crsp.msf is sorted by PERMNO and DATE, but the resulting msf does not preserve the original order.

%let wrds=wrds.wharton.upenn.edu 4016;
signon wrds username=_prompt_;
rsubmit;
data msf;
set crsp.msf;
where date<"1jan1930"d;
run;
endrsubmit;

And I found that BY restores the original order—PERMNO and DATE. I wonder (1) why the resulting observations are shuffled here, and (2) whether BY after WHERE is necessary to make the observations sequential. Thanks in advance.

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

So nope. I stand by my original answer, you cannot assume that a data set is sorted. The sort flag does not propogate across. There may be an implicit sort, but you're taking a risk there when SAS is not sorting the data. I will add this may behave differently if you're working on DBs because SQL by definition doesn't maintain row orders. 

 

BY does check for the implicit sort and won't error out if it's there so likely that's what you're seeing and if its good enough for you on this system you're good. If you change things you may have to re-think this, I would probably add a check to ensure the data is sorted as needed or at least if it errors out it will explain why. 

 

*implicitly sorted data set;
data have;
input ID Age;
cards;
1 5
2 8 
3 13
4 14
5 3
6 34
7 6
8 3
9 3
10 2
;;;;

*check by/where without sort;
data want1 (label="Implicit Sort, BY and Where");
set have;
by id;
where id in (1, 2, 3);
run;

*explicitly sort data set now;
proc sort data=have out=have2 (label="Sorted input data");
by id;
run;

data want2 (label="explicit Sort, Only Where");
set have2;
where id in (1, 2, 3);
run;

data want3 (label="explicit sort, By and Where");
set have2;
by id;
where id in (1, 2,3 );
run;


data want3  (label="Explicit sort, copy only with BY");
set have2;
by id;
run;

data results;
set sashelp.vtable;
where libname='WORK' and (%Upcase(memname) like '%WANT%' or  %Upcase(memname) like '%HAVE%') ;
keep libname memname memlabel sort:;
run;

delete_results.JPG

 

View solution in original post

25 REPLIES 25
Reeza
Super User
You cannot assume order of a data set unless you've explicitly sorted it. BY checks that the data set is sorted as expected and will return an error if it's not.
Junyong
Pyrite | Level 9

1.png

It seems crsp.msf is sorted by PERMNO and DATE. I wonder whether WHERE affects the sorting status.

Tom
Super User Tom
Super User

When reading from a normal SAS dataset the values are read in order. So the WHERE statement will NOT change the order. But your code will not store anything into the metadata field(s) used to indicate what if any variables were used to sort the data.

 

Try it yourself.

 

Add a BY statement to your data step.  It will test if the values making it past the WHERE condition do not present themselves to the data step in a sorted order and fail if the are not ordered as indicated by the BY statement  If that happens then the metadata that generated the report you posted the photograph of is wrong.

 

If you just want to subset the data you could use PROC SORT and that would definitely set the order and also store what variables were used to order the data into the metadata.

proc sort out= msf data=crsp.msf;
  where date<"1jan1930"d;
  by permno date;
run;

 

Junyong
Pyrite | Level 9

The sorting of the original data is correct by PERMNO and DATE. Here I attach the code and the screenshots.

rsubmit;
data msf;
set crsp.msf;
by permno date;/*<-this works correctly*/
run;
data msf1;/*<-the resulting msf1 is not sorted*/
set crsp.msf;
where date<"1jan1930"d;
run;
data msf2;/*<-the resulting msf2 is sorted*/
set crsp.msf;
where date<"1jan1930"d;
by permno date;
run;
endrsubmit;

The code above creates three data sets—MSF, MSF1, and MSF2. The BY in the first DATA has no problem.

The only difference between MSF1 and MSF2 is the BY statement in the third DATA.

3    rsubmit;
NOTE: Remote submit to WRDS commencing.
1    data msf;
2    set crsp.msf;
3    by permno date;/*<-this works correctly*/
4    run;

NOTE: There were 4509846 observations read from the data set CRSP.MSF.
NOTE: The data set WORK.MSF has 4509846 observations and 21 variables.
NOTE: DATA statement used (Total process time):
      real time           1.61 seconds
      cpu time            1.52 seconds


5    data msf1;/*<-the resulting msf1 is not sorted*/
6    set crsp.msf;
7    where date<"1jan1930"d;
8    run;

NOTE: There were 29968 observations read from the data set CRSP.MSF.
      WHERE date<'01JAN1930'D;
NOTE: The data set WORK.MSF1 has 29968 observations and 21 variables.
NOTE: DATA statement used (Total process time):
      real time           0.68 seconds
      cpu time            0.43 seconds


9    data msf2;/*<-the resulting msf2 is sorted*/
10   set crsp.msf;
11   where date<"1jan1930"d;
12   by permno date;
13   run;

NOTE: There were 29968 observations read from the data set CRSP.MSF.
      WHERE date<'01JAN1930'D;
NOTE: The data set WORK.MSF2 has 29968 observations and 21 variables.
NOTE: DATA statement used (Total process time):
      real time           1.00 seconds
      cpu time            0.44 seconds


NOTE: Remote submit to WRDS complete.

The following is the resulting (not original) MSF. The observations are sorted correctly by PERMNO and DATE.

1.png

The following is MSF1 without BY. The PERMNO-DATE sorting disappears.

2.png

The following is MSF2 with BY. The PERMNO-DATE sorting is correct.

3.png

It seems SAS does not access to the original data sequentially if not BY.

Tom
Super User Tom
Super User

So the data is NOT sorted by those variables.  But perhaps you have an INDEX on those variables that will allow SAS to access the data in sorted order when you use the BY statement.

Kurt_Bremser
Super User

See this short example:

data class;
set sashelp.class;
run;

proc datasets nolist;
modify class;
index create sex;
quit;

data class1;
set class;
where age = 14;
by sex;
run;

data class2;
set class;
where age = 14;
run;

Since there is an index file present for the msf dataset, you see the same kind of behaviour. The only thing that puzzles me is your output where it says sorted by PERMNO DATE. Since this is not the output of a proc contents, how did you get that?

Junyong
Pyrite | Level 9

I got the information from Details in Properties in Explorer.

                                                         The SAS System

                                                     The CONTENTS Procedure

          Data Set Name        CRSP.MSF                                                 Observations          4509846
          Member Type          DATA                                                     Variables             21
          Engine               V9                                                       Indexes               5
          Created              01/31/2019 15:39:40                                      Observation Length    168
          Last Modified        01/31/2019 15:39:48                                      Deleted Observations  0
          Protection                                                                    Compressed            NO
          Data Set Type                                                                 Sorted                YES
          Label                Monthly Stock - Securities
          Data Representation  SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
          Encoding             latin1  Western (ISO)


                                               Engine/Host Dependent Information

                              Data Set Page Size          65536
                              Number of Data Set Pages    11595
                              First Data Page             1
                              Max Obs per Page            389
                              Obs in First Data Page      361
                              Index File Page Size        8192
                              Number of Index File Pages  40018
                              Number of Data Set Repairs  0
                              Filename                    /wrds/crsp/sasdata/a_stock/msf.sas7bdat
                              Release Created             9.0401M5
                              Host Created                Linux
                              Inode Number                2465424092
                              Access Permission           rw-r-----
                              Owner Name                  wrdsadmn
                              File Size                   725MB
                              File Size (bytes)           759955456


                                          Alphabetic List of Variables and Attributes

                #    Variable    Type    Len    Format       Informat    Label

               18    ALTPRC      Num       8    12.5         12.5        Price Alternate
               20    ALTPRCDT    Num       8    YYMMDDN8.    YYMMDD6.    Alternate Price Date
               14    ASK         Num       8    11.5         11.5        Ask
                9    ASKHI       Num       8    12.5         12.5        Ask or High Price
               13    BID         Num       8    11.5         11.5        Bid
                8    BIDLO       Num       8    12.5         12.5        Bid or Low Price
               16    CFACPR      Num       8                             Cumulative Factor to Adjust Prices
               17    CFACSHR     Num       8                             Cumulative Factor to Adjust Shares/Vol
                1    CUSIP       Char      8    8.           8.          CUSIP Header
                7    DATE        Num       8    YYMMDDN8.                Date of Observation
                5    HEXCD       Num       8    2.           2.          Exchange Code Header
                6    HSICCD      Num       8    8.           8.          Standard Industrial Classification Code
                4    ISSUNO      Num       8    8.           8.          Nasdaq Issue Number
                3    PERMCO      Num       8    8.           8.          PERMCO
                2    PERMNO      Num       8    8.           8.          PERMNO
               10    PRC         Num       8    12.5         12.5        Price or Bid/Ask Average
               12    RET         Num       8    11.6         11.6        Returns
               21    RETX        Num       8    11.6         11.6        Returns without Dividends
               15    SHROUT      Num       8                             Shares Outstanding
               19    SPREAD      Num       8    11.5         11.5        Spread Between Bid and Ask
               11    VOL         Num       8    10.          10.         Volume


                                           Alphabetic List of Indexes and Attributes

                                                                      # of
                                                                    Unique
                                                     #    Index     Values

                                                     1    CUSIP      32985
                                                     2    DATE        1117
                                                     3    HSICCD      1445
                                                     4    PERMCO     29481
                                                     5    PERMNO     32985


                                                        Sort Information

                                                   Sortedby       PERMNO DATE
                                                   Validated      YES
                                                   Character Set  ASCII
                                                   Sort Option    NODUPKEY

Here I attach the results from PROC CONTENTS too.

Tom
Super User Tom
Super User

It is impossible for me to tell from your photographs of your browser window of the dataset whether the data was sorted or not.

Please run the following test instead.

* SUBSET the data ;
data msf1;
  set crsp.msf;
  where date<"1jan1930"d;
run;
* TEST if sorted ;
data _null_;
  set msf1 ;
  by permno date;
run;
Junyong
Pyrite | Level 9

The PROC CONTENTS output right above says crsp.msf is sorted (Sortedby PERMNO DATE). It seems (1) though crsp.msf per se is sorted, crsp.msf with WHERE is not, and (2) though crsp.msf with WHERE is not sorted, crsp.msf with both WHERE and BY is sorted. Thanks.

Reeza
Super User

If the data is sorted when read in and you use a BY statement SAS will not generate an error. 

If the data is sorted with a prior sort and sort entry in PROC CONTENTS then it will not generate an error. 

 

However, these are two different cases that may need to be considered separately. 

Tom
Super User Tom
Super User

Please post the lines from the SAS log for running the two step test that I posted before.

Junyong
Pyrite | Level 9

The code generates an error.

rsubmit;
* SUBSET the data ;
data msf1;
  set crsp.msf;
  where date<"1jan1930"d;
run;
* TEST if sorted ;
data _null_;
  set msf1 ;
  by permno date;
run;
endrsubmit;

And the output here.

1    rsubmit;
NOTE: Remote submit to WRDS commencing.
1    * SUBSET the data ;
2    data msf1;
3      set crsp.msf;
4      where date<"1jan1930"d;
5    run;

NOTE: There were 29968 observations read from the data set CRSP.MSF.
      WHERE date<'01JAN1930'D;
NOTE: The data set WORK.MSF1 has 29968 observations and 21 variables.
NOTE: DATA statement used (Total process time):
      real time           0.69 seconds
      cpu time            0.36 seconds


6    * TEST if sorted ;
7    data _null_;
8      set msf1 ;
9      by permno date;
10   run;

ERROR: BY variables are not properly sorted on data set WORK.MSF1.
CUSIP=69499890 PERMNO=75471 PERMCO=25928 ISSUNO=0 HEXCD=1 HSICCD=3490
DATE=19251231 BIDLO=. ASKHI=. PRC=-32.00000 VOL=. RET=C BID=31.00000
ASK=33.00000 SHROUT=70 CFACPR=4.810256 CFACSHR=4 ALTPRC=-32.00000
SPREAD=2.00000
ALTPRCDT=19251231 RETX=C FIRST.PERMNO=1 LAST.PERMNO=1 FIRST.DATE=1
LAST.DATE=1
_ERROR_=1 _N_=520
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 521 observations read from the data set WORK.MSF1.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


NOTE: Remote submit to WRDS complete.

Adding BY eliminates the error as follows.

rsubmit;
* SUBSET the data ;
data msf1;
  set crsp.msf;
  where date<"1jan1930"d;
  by permno date;
run;
* TEST if sorted ;
data _null_;
  set msf1 ;
  by permno date;
run;
endrsubmit;

And the output here too.

2    rsubmit;
NOTE: Remote submit to WRDS commencing.
11   * SUBSET the data ;
12   data msf1;
13     set crsp.msf;
14     where date<"1jan1930"d;
15     by permno date;
16   run;

NOTE: There were 29968 observations read from the data set CRSP.MSF.
      WHERE date<'01JAN1930'D;
NOTE: The data set WORK.MSF1 has 29968 observations and 21 variables.
NOTE: DATA statement used (Total process time):
      real time           0.79 seconds
      cpu time            0.32 seconds


17   * TEST if sorted ;
18   data _null_;
19     set msf1 ;
20     by permno date;
21   run;

NOTE: There were 29968 observations read from the data set WORK.MSF1.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


NOTE: Remote submit to WRDS complete.

(1) crsp.msf is sorted. (2) crsp.msf+WHERE is not sorted. (3) crsp.msf+WHERE+BY is sorted.

Tom
Super User Tom
Super User

Do you know how that libname, CRSP,  is defined? Is it using the BASE (default) SAS engine? SAS/Share? Some other engine?

Also what version of SAS is that remote session running?

Junyong
Pyrite | Level 9

crsp is the REMOTE engine from WRDS, and it's SAS 9.4 TS1M5.

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 25 replies
  • 1614 views
  • 3 likes
  • 5 in conversation