BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Tom
Super User Tom
Super User

So the remote session you are submitting that code to is also connected remotely to something else?

Perhaps it is using SAS/Share? 

Junyong
Pyrite | Level 9

I access to the WRDS database by SAS/CONNECT, but don't know much about the interface WRDS uses.

Quentin
Super User

I couldn't find much on the REMOTE engine, but from your log, looks like the REMOTE engine may not read records sequentially.  It could be doing some mutli-threading or similar in the background, I suppose.  

 

That's an intersting "gotcha", since reading a SAS dataset in a DATA step is (I thought) always sequential.

 

Adding the BY statement must force the REMOTE enginge to read data sequentially (in order to  create first. and last. variables).

 

I'm curious, can you see the difference with something like:

 

 

data unsorted;
  set crsp.msf ;
  where date<"1jan1930"d;
  put (_N_ permno date)(=) ;

  if _N_>100 then stop;
run;

data sorted;
  set crsp.msf;
  where date<"1jan1930"d;
  put (_N_ permno date)(=) ;
  by permno date;

  if _N_>100 then stop;
run;

If it really is that the REMOTE engine is free to read a SAS dataset in any (?) order that would be scary to me.  It's good that adding a BY statement forces it to read sequentially.  But there are plenty of other times where code expects to read a SAS dataset sequentially (e.g. using lag function, or whatever).  

 

We know relational databases don't store records in order therefore you can't expect to read them in a defined sequential order, and changes to PROC SQL code can easily change sort order of output if you don't explicitly order it.

 

But I always thought your order would be safe just reading and writing a SAS dataset in a DATA step.  It's definitely safe with the BASE engine.

 

 

BASUG is hosting free webinars Next up: Don Henderson presenting on using hash functions (not hash tables!) to segment data on June 12. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
Junyong
Pyrite | Level 9

Yes. crsp.msf is good. crsp.msf+WHERE+BY is also good. Only crsp.msf+WHERE is problematic.

Here is the code.

rsubmit;
data _null_;
  set crsp.msf ;
  put (_N_ permno date)(=) ;

  if _N_>10 then stop;
run;

data unsorted;
  set crsp.msf ;
  where date<"1jan1930"d;
  put (_N_ permno date)(=) ;

  if _N_>10 then stop;
run;

data sorted;
  set crsp.msf;
  where date<"1jan1930"d;
  put (_N_ permno date)(=) ;
  by permno date;

  if _N_>10 then stop;
run;
endrsubmit;

And here is the output.

1    rsubmit;
NOTE: Remote submit to WRDS commencing.
1    data _null_;
2      set crsp.msf ;
3      put (_N_ permno date)(=) ;
4
5      if _N_>10 then stop;
6    run;

_N_=1 PERMNO=10000 DATE=19851231
_N_=2 PERMNO=10000 DATE=19860131
_N_=3 PERMNO=10000 DATE=19860228
_N_=4 PERMNO=10000 DATE=19860331
_N_=5 PERMNO=10000 DATE=19860430
_N_=6 PERMNO=10000 DATE=19860530
_N_=7 PERMNO=10000 DATE=19860630
_N_=8 PERMNO=10000 DATE=19860731
_N_=9 PERMNO=10000 DATE=19860829
_N_=10 PERMNO=10000 DATE=19860930
_N_=11 PERMNO=10000 DATE=19861031
NOTE: There were 11 observations read from the data set CRSP.MSF.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds


7
8    data unsorted;
9      set crsp.msf ;
10     where date<"1jan1930"d;
11     put (_N_ permno date)(=) ;
12
13     if _N_>10 then stop;
14   run;

_N_=1 PERMNO=10006 DATE=19251231
_N_=2 PERMNO=10014 DATE=19251231
_N_=3 PERMNO=10022 DATE=19251231
_N_=4 PERMNO=10030 DATE=19251231
_N_=5 PERMNO=10049 DATE=19251231
_N_=6 PERMNO=10057 DATE=19251231
_N_=7 PERMNO=10065 DATE=19251231
_N_=8 PERMNO=10073 DATE=19251231
_N_=9 PERMNO=10081 DATE=19251231
_N_=10 PERMNO=10102 DATE=19251231
_N_=11 PERMNO=10110 DATE=19251231
NOTE: There were 11 observations read from the data set CRSP.MSF.
      WHERE date<'01JAN1930'D;
NOTE: The data set WORK.UNSORTED has 10 observations and 21 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds


15
16   data sorted;
17     set crsp.msf;
18     where date<"1jan1930"d;
19     put (_N_ permno date)(=) ;
20     by permno date;
21
22     if _N_>10 then stop;
23   run;

_N_=1 PERMNO=10006 DATE=19251231
_N_=2 PERMNO=10006 DATE=19260130
_N_=3 PERMNO=10006 DATE=19260227
_N_=4 PERMNO=10006 DATE=19260331
_N_=5 PERMNO=10006 DATE=19260430
_N_=6 PERMNO=10006 DATE=19260528
_N_=7 PERMNO=10006 DATE=19260630
_N_=8 PERMNO=10006 DATE=19260731
_N_=9 PERMNO=10006 DATE=19260831
_N_=10 PERMNO=10006 DATE=19260930
_N_=11 PERMNO=10006 DATE=19261030
NOTE: There were 12 observations read from the data set CRSP.MSF.
      WHERE date<'01JAN1930'D;
NOTE: The data set WORK.SORTED has 10 observations and 21 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds


NOTE: Remote submit to WRDS complete.

If sequential access is necessary, then BY will be important.

Reeza
Super User
The general rule of programming is if you didn't sort it you CANNOT assume it's going to be sorted. Especially if pulling from DB.
Junyong
Pyrite | Level 9

The PROC CONTENTS output above displays Sortedby PERMNO DATE. It seems WHERE must be accompanied by BY to preserve the sorting status. Thanks.

Reeza
Super User

Where is applied before BY so that doesn't seem right either.... 

 

WHERE and BY in a DATA Step

If a DATA step contains both a WHERE statement and a BY statement, the WHERE statement executes before BY groups are created. Therefore, BY groups reflect groups of observations in the subset of observations that are selected by the WHERE statement, not the actual BY groups of observations in the original input data set.

For a complete discussion of BY-group processing, see BY-Group Processing in the DATA Step in SAS Language Reference: Concepts.

 

EDIT: I suspect the BY statement adds a sort flag to data set, but will check and confirm. 

Reeza
Super User

So nope. I stand by my original answer, you cannot assume that a data set is sorted. The sort flag does not propogate across. There may be an implicit sort, but you're taking a risk there when SAS is not sorting the data. I will add this may behave differently if you're working on DBs because SQL by definition doesn't maintain row orders. 

 

BY does check for the implicit sort and won't error out if it's there so likely that's what you're seeing and if its good enough for you on this system you're good. If you change things you may have to re-think this, I would probably add a check to ensure the data is sorted as needed or at least if it errors out it will explain why. 

 

*implicitly sorted data set;
data have;
input ID Age;
cards;
1 5
2 8 
3 13
4 14
5 3
6 34
7 6
8 3
9 3
10 2
;;;;

*check by/where without sort;
data want1 (label="Implicit Sort, BY and Where");
set have;
by id;
where id in (1, 2, 3);
run;

*explicitly sort data set now;
proc sort data=have out=have2 (label="Sorted input data");
by id;
run;

data want2 (label="explicit Sort, Only Where");
set have2;
where id in (1, 2, 3);
run;

data want3 (label="explicit sort, By and Where");
set have2;
by id;
where id in (1, 2,3 );
run;


data want3  (label="Explicit sort, copy only with BY");
set have2;
by id;
run;

data results;
set sashelp.vtable;
where libname='WORK' and (%Upcase(memname) like '%WANT%' or  %Upcase(memname) like '%HAVE%') ;
keep libname memname memlabel sort:;
run;

delete_results.JPG

 

Junyong
Pyrite | Level 9

So everything other than the explicit sorting would be risky then. For example, just have2 is safe. I use want1 or want3 as rely on the implicit sorting, but seem to be risky. My example is closer to want2 (have2 is explicitly sorted, but want2 is not necessarily. Likewise, crsp.msf is explicitly sorted, but will not be as soon as used with WHERE).

Kurt_Bremser
Super User

If a dataset is stored in a SPDS library, there can be side-effects from the way the data is stored in the "buckets" and the bucket metadata is used for optimizing the where.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 25 replies
  • 1110 views
  • 3 likes
  • 5 in conversation