BookmarkSubscribeRSS Feed
Fluorite | Level 6

## Random combination of variables

Hello,

I have a dataset with 2000+ random first names, another dataset with 1000+ first names, and a third dataset with 100+ cities. I wanted to find a way of creating a random combinations of the three variables so that my end result would be a dataset with 2000+ first name, last name, and cities. I have a feeling I would need a do loop of some sort but im stuck. Thanks!

34 REPLIES 34
Diamond | Level 26

## Re: Random combination of variables

UNTESTED CODE

Let's suppose your data set names are firstname, lastname and city, and there is also a variable in that data set by the same name.

``````proc sql;
create table combined as select *,uniform(0) as rand
from firstname,lastname,city order by rand;
quit;
data final;
set combined(obs=2000);
run;
``````
--
Paige Miller
Fluorite | Level 6

## Re: Random combination of variables

Hi, unfortunately sas wont stop running when i executed the code.

Jade | Level 19

## Re: Random combination of variables

The Cartesian product will be HUGE.

Super User

## Re: Random combination of variables

2000 * 1000 * 100 = 200,000,000  records.

Each additional city adds 2,000,000 records depending on how much larger your 2000+ and 1000+ actually are …

It will take a bit of time.

Jade | Level 19

## Re: Random combination of variables

Maybe this will work.  In the factors statement use the actual obs numbers of your 3 data sets.  Noprint is important on factors statement.  If you want to keep the obs numbers from the OBS data set created by PROC PLAN assign them to new variables in DATA SAMPLE.

``````data firstname; do firstname=1 to 2000; output; end; run;
data lastname;  do lastname=1 to 1000;  output; end; run;
data city;      do city=1 to 100;       output; end; run;

proc plan;
factors f=2000 random l=1 of 1000 random c=1 of 100 random / noprint;
output out=obs;
run;
proc print data=obs(obs=100); run;
data sample;
set obs;
set firstname point=f;
set lastname point=l;
set city point=c;
run;

proc print data=sample(obs=100); run;
``````
Jade | Level 19

## Re: Random combination of variables

@juanvenegas wrote:

Hello,

I have a dataset with 2000+ random first names, another dataset with 1000+ first names, and a third dataset with 100+ cities. I wanted to find a way of creating a random combinations of the three variables so that my end result would be a dataset with 2000+ first name, last name, and cities. I have a feeling I would need a do loop of some sort but im stuck. Thanks!

Hello @juanvenegas,

If you want to assign random last names and cities to each of the 2000+ first names, try this:

``````data want;
call streaminit(27182818);
set firstname;
p1=rand('integer',n1);
set lastname nobs=n1 point=p1;
p2=rand('integer',n2);
set city nobs=n2 point=p2;
run;``````

(I see, @data_null__ had a similar idea.)

Jade | Level 19

## Re: Random combination of variables

@FreelanceReinh yours is much better I did not know about RAND("INTEGER").

Jade | Level 19

## Re: Random combination of variables

@data_null__ wrote:

@FreelanceReinh yours is much better I did not know about RAND("INTEGER").

Thanks. The "integer distribution" is fairly new: see this post. I was happy to get it with my recent update to release 9.4M5.

Jade | Level 19

## Re: Random combination of variables

@FreelanceReinh wrote:

@data_null__ wrote:

@FreelanceReinh yours is much better I did not know about RAND("INTEGER").

Thanks. The "integer distribution" is fairly new: see this post. I was happy to get it with my recent update to release 9.4M5.

Sh.t I'm way behind this times.

Current version: 9.04.01M3P062415

Fluorite | Level 6

## Re: Random combination of variables

@FreelanceReinhard Thanks! This worked! If its not too much trouble, could you go over the logic of the code. Also, if I wanted to randomly sort the first col (the first names) since right now they're in alphabetical order, how may I go about doing so. Thanks again!

Jade | Level 19

## Re: Random combination of variables

@juanvenegas: You're welcome.

Here is an easy way to sort the output dataset randomly:

```data want;
call streaminit(27182818);
set firstname;
p1=rand('integer',n1);
set lastname nobs=n1 point=p1;
p2=rand('integer',n2);
set city nobs=n2 point=p2;
r=ranuni(31416);
run;

proc sort data=want out=want(drop=r);
by r;
run;
```

(I used the old RANUNI function because the newer RAND function would share the underlying random number stream with the two other calls of RAND and thus create different assignments of LASTNAME and CITY for all but the first FIRSTNAME. This would not be wrong, but you couldn't just compare the old, alphabetically sorted and the new dataset WANT.)

In each iteration of the DATA step the first SET statement reads a first name from dataset FIRSTNAME. At compile time already, the NOBS= options of the second and third SET statement have retrieved the number of observations in datasets LASTNAME and CITY, respectively. This is why variables N1 and N2 can be used successfully in the assignment statements populating variables P1 and P2, although these precede the corresponding SET statements.

The RAND function using the "integer" distribution and a second argument k (=N1 or N2, resp.) generates a random integer from the set {1, 2, ..., k} (uniform distribution), based on a random number stream that was initialized at the beginning of the DATA step by means of the CALL STREAMINIT routine (27182818 is an arbitrary seed value). Thus, P1 and P2 are assigned randomly selected observation numbers from dataset LASTNAME and CITY, respectively.

The second and third SET statement use the POINT= option to designate the next observation to read, i.e., they read the observations specified by P1 and P2.

In the extended DATA step above, variable R is assigned a random number from the continuous uniform distribution on the interval [0, 1]. Only this variable and the variables contained in datasets FIRSTNAME, LASTNAME and CITY are written to dataset WANT because variables in the NOBS= and POINT= options of the SET statement are automatically dropped.

The DATA step terminates after all observations from dataset FIRSTNAME have been processed as described above.

The PROC SORT step sorts dataset WANT by R, hence randomly, and sheds variable R (using a DROP= dataset option), which is no longer needed.

Fluorite | Level 6

## Re: Random combination of variables

Thanks for everything!

Jade | Level 19

## Re: Random combination of variables

@FreelanceReinh I expecting you to do this.

``````data want;
call streaminit(27182818);
do _n_ = 1 to n0;
p0=rand('integer',n0)
set firstname nobs=n0 point=p0;
p1=rand('integer',n1);
set lastname nobs=n1 point=p1;
p2=rand('integer',n2);
set city nobs=n2 point=p2;
output;
end;
STOP;
run;
``````
Jade | Level 19

## Re: Random combination of variables

Hi @data_null__, this was in fact the first impulse I had, too (even with the same variable names N0, P0). However, I stopped going this way when I realized that this would not use all first names in dataset FIRSTNAME, but only about 63%. (Okay, datasets LASTNAME and CITY aren't always fully exhausted either, but the expected rates are higher.)

Discussion stats
• 34 replies
• 1722 views
• 2 likes
• 6 in conversation