how to set the seed in random number generation in SQL

JacobSimonsen · Posted 07-10-2017 10:04 AM

Hello,

Im in doubt of how to set the seed when using the rand() function in proc sql. Im aware that the "call streaminit" routine can be used if one will use the rand function in datasteps. How do I do it in proc sql? Or alternatively in fedsql?

As example, lets say that I want to order the dataset "mydata" in a random order.

data mydata;
do i=1 to 20;
output;
end;
run;
proc sql;
select * from mydata
order by rand('uniform');
quit;

art297 · Posted 07-10-2017 10:18 AM

I don't think you can set the seed, using the rand function in proc sql. It will automatically use the system clock. If you want to use a specific seed, you'd either have to use the ranuni function or run it in a data step where you can set a specific seed.

Art, CEO, AnalystFinder.com

FreelanceReinh · Posted 09-01-2022 05:20 PM

Hello @JacobSimonsen,

I came across this old thread earlier today because I had the same question. Like @art297 I had always believed that there was no way to set the random seed (or the random number generator, for that matter) in this situation and had been wondering why PROC SQL should have this shortcoming.

Now I think I've found a solution. The trick is to execute CALL STREAMINIT in the same PROC SQL step (not necessarily in the same SELECT statement, which is also possible), before the RAND function call. This could be done, for example, in a separate SELECT or INSERT statement by using the %SYSCALL statement in conjunction with the RESOLVE function (to avoid premature resolution of the macro statement before the PROC SQL step is executing).

In your example:

data mydata;
do i=1 to 20;
  output;
end;
run;

%let seed=27182818;

proc sql;
select resolve('%syscall streaminit(seed)')
from sashelp.loc; /* an arbitrary one-observation dataset */

select * from mydata
order by rand('uniform');
quit;

The result is the same as that from

data temp;
call streaminit(27182818);
set mydata;
r=rand('uniform');
run;

proc sql;
select i from temp
order by r;
quit;

(using Windows SAS 9.4M5), except that the "dummy" SELECT statement creates a blank output (which, of course, could be avoided by the NOPRINT option of the PROC SQL statement if the second SELECT statement did not print relevant output).

Here's another example where the blank output is avoided by using an INSERT statement. (Maybe someone has a better idea for a "dummy" PROC SQL statement to accommodate the resolve-%syscall construct.)

%let rng=MT64;
%let seed=3141592;

data dummy;
x=' ';
run;

proc sql;
insert into dummy
set x=resolve('%syscall streaminit(rng, seed)');

select name, rand('normal') as r format=17.14
from sashelp.class;
quit;

Again, the result matches that of a DATA step using

call streaminit('MT64', 3141592);

Quentin · Posted 09-01-2022 06:21 PM

As I understand it, the concept of record order is not really defined for SQL. When a query executes, it's free to read records in any order that makes sense to the SQL optimizer (and will of course return in order if you ask for it with an ORDER BY clause). Therefore I'm not sure order by rand('uniform') would make sense, because theoretically, it's possible that the records could be read in a different order each time the query is run. This is unlikely to happen if the source is a SAS dataset, but if the source is a database table I think it's more likely, and I think the database is allowed to change the order of how records are stored, insert records into the middle of a table, etc. Given FEDSQL's multi-threading ability, I'd think it's probably even more likely to read records in different orders, depending on the number of threads etc.

BASUG is hosting free webinars Next up: Mike Sale presenting Data Warehousing with SAS April 10 at noon ET. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.

FreelanceReinh · Posted 09-02-2022 07:54 AM

This is a very good point. Indeed, the notorious "instability" of record order in PROC SQL can easily defeat the exact reproducibility of results and pseudo-random numbers created in PROC SQL are a case in point: The assignment of such numbers to two or more records changes if the record order is permuted. There might be cases where record order is irrelevant, though. For example, if random items from a set of key values are generated to draw a random sample (with replacement) from some dataset.

I think the question aimed at the apparent limitation of PROC SQL that the usual control of initial seed and RNG via CALL STREAMINIT seemed to be unavailable. As it turned out, the control is available, yes, but it is less useful than in the DATA step.

Rick_SAS · Posted 09-03-2022 07:13 AM

As others have discussed there are data order and reproducibility issues. However, I think there is a simpler way to call STREAMINIT in a PROC SQL step. Since the advent of DS2, some CALL subroutines were made available as function calls. The STREAMINIT function is one example.

The first call to the STREAMINIT function returns the seed. This is usually the SEED argument, but if you use 0 as the argument, you will get back the actual seed value that was generated from the system time. Subsequent calls return 0.

Here is an example of calling the STREAMINIT function in a SELECT statement in PROC SQL. For comparison, I include a DATA step that generates the same random values, as shown by PROC COMPARE.

data mydata;
do cnt=1 to 1000;
  output;
end;
run;

%let seed = 27182818;

data RandDS;
call streaminit(&seed);
set mydata(drop=cnt);
u = rand('uniform');
n = rand('normal');
run;

/* You can call STREAMINIT as a function in a SELECT statement.
   The first time it returns the seed used. The second 
   time it returns 0 */
proc sql;
create table RandSQL as
   select streaminit(&seed), 
          rand('uniform') as u,
          rand('normal') as n 
   from mydata;
quit;

proc compare base=RandDS compare=RandSQL;
run;

FreelanceReinh · Posted 09-03-2022 08:54 AM

Thank you so much, @Rick_SAS, for sharing this valuable information.

So, in the list of the 400+ DS2 functions, which at first glance looks like a subset of the list of the 600+ data step functions, there are about 16 items that are not contained in the longer list. Among these 16 functions only 3 are available in a DATA or PROC SQL step (according to a quick check in SAS 9.4M5). And these three are as diverse as

INTNEST Function     Calculates the number of whole periods of the smaller interval that will fit into the period of the larger interval.
PROBDF Function      Calculates significance probabilities for Dickey-Fuller tests for unit roots in time series.
STREAMINIT Function  Specifies a random-number generator and seed value for generating random numbers.

What a discovery. Thanks again.

PGStats · Posted 09-03-2022 04:49 PM

Great discussion ! Here is a simple example showing the dangers of trusting the seed setting mechanism to get reproducible random sequences in SQL. Note that the only difference between the two queries is the order of the datasets mentioned in the inner join

data mydata1;
do cnt=1 to 10;
  output;
end;
run;

proc sort data=mydata1 out=mydata2; by descending cnt; run;

%let seed = 27182818;

proc sql;
create table RandSQL1(drop=dumb) as
   select mydata1.cnt,
          streaminit(&seed) as dumb, 
          rand('uniform') as u,
          rand('normal') as n 
   from mydata1 inner join mydata2 on mydata1.cnt=mydata2.cnt
   order by cnt;
quit;

proc sql;   
create table RandSQL2(drop=dumb) as
   select mydata1.cnt,
          streaminit(&seed) as dumb, 
          rand('uniform') as u,
          rand('normal') as n 
   from mydata2 inner join mydata1 on mydata1.cnt=mydata2.cnt
   order by cnt;
quit;

proc sql;
select a.cnt, a.u=b.u, a.n=b.n
from randsql1 as a, randsql2 as b
where a.cnt=b.cnt;
quit;

PG

FreelanceReinh · Posted 09-04-2022 08:12 AM

It's true that the sequences of random numbers differ between RandSQL1 and RandSQL2 in this example, but the difference is still "systematic" as the order is just reversed: If the second ORDER BY clause is changed to

order by cnt desc;

the differences regarding variables U and N vanish (at least on my computer) even when MYDATA1 was extended from 10 to 10 million observations.

Unpredictable differences are typically seen in more complex situations such as many-to-many joins.

Example:

data dups1 dups2;
do cnt=1 to 10000;
  do sub=1 to 4;
    if sub<=2 then output dups1;
    else output dups2;
  end;
end;
run;

%let seed = 27182818;
proc sql;
create table RandSQL(drop=dumb) as
   select a.cnt, a.sub as sub1, b.sub as sub2,
          streaminit(&seed) as dumb, 
          rand('uniform') as u,
          rand('normal') as n 
   from dups1 a inner join dups2 b on a.cnt=b.cnt
   order by a.cnt, sub1, sub2;
quit;

data RandDS;
call streaminit(&seed);
do cnt=1 to 10000;
  do sub1=1, 2;
    do sub2=3, 4;
      u=rand('uniform');
      n=rand('normal');
      output;
    end;
  end;
end;
run;

proc compare data=RandSQL c=RandDS;
id cnt sub1 sub2;
run;

On my computer the first 8000+ observations in RandSQL match those from RandDS exactly, but for CNT=2079 (obs. 8313, ..., 8316) the first difference occurs: the random numbers are swapped between SUB2=3 and SUB2=4. The next difference (of the same type) occurs for CNT=2111. In total, 7248 observations (1812 of the 10000 CNT values) are affected.

PGStats · Posted 09-04-2022 03:18 PM

@FreelanceReinh wrote:

It's true that the sequences of random numbers differ between RandSQL1 and RandSQL2 in this example, but the difference is still "systematic" as the order is just reversed: If the second ORDER BY clause is changed to
order by cnt desc;
the differences regarding variables U and N vanish (at least on my computer) even when MYDATA1 was extended from 10 to 10 million observations.

When the join order is changed and the sort order is changed, the original random sequence is recovered, but the random values are no longer associated with the same cnt values.

I get exactly the same results as you describe when I run your many-to-many join example on a SODA (SAS On Demand for Academics) server.

To anybody considering to call the SETINIT function in proc SQL I say DON'T !

PG

how to set the seed in random number generation in SQL

Re: how to set the seed in random number generation in SQL

Re: how to set the seed in random number generation in SQL

Re: how to set the seed in random number generation in SQL

Re: how to set the seed in random number generation in SQL

Re: how to set the seed in random number generation in SQL

Re: how to set the seed in random number generation in SQL

Re: how to set the seed in random number generation in SQL

Re: how to set the seed in random number generation in SQL

Re: how to set the seed in random number generation in SQL

Classroom Training Available!