Re: Random combination of variables

juanvenegas · Posted 10-30-2018 02:45 PM

Hello,

I have a dataset with 2000+ random first names, another dataset with 1000+ first names, and a third dataset with 100+ cities. I wanted to find a way of creating a random combinations of the three variables so that my end result would be a dataset with 2000+ first name, last name, and cities. I have a feeling I would need a do loop of some sort but im stuck. Thanks!

PaigeMiller · Posted 10-30-2018 02:54 PM

UNTESTED CODE

Let's suppose your data set names are firstname, lastname and city, and there is also a variable in that data set by the same name.

proc sql;
     create table combined as select *,uniform(0) as rand
     from firstname,lastname,city order by rand;
quit;
data final;
    set combined(obs=2000);
run;

--
Paige Miller

juanvenegas · Posted 10-30-2018 03:30 PM

Hi, unfortunately sas wont stop running when i executed the code.

data_null__ · Posted 10-30-2018 05:29 PM

The Cartesian product will be HUGE.

ballardw · Posted 10-30-2018 07:18 PM

2000 * 1000 * 100 = 200,000,000 records.

Each additional city adds 2,000,000 records depending on how much larger your 2000+ and 1000+ actually are …

It will take a bit of time.

data_null__ · Posted 10-30-2018 05:43 PM

Maybe this will work. In the factors statement use the actual obs numbers of your 3 data sets. Noprint is important on factors statement. If you want to keep the obs numbers from the OBS data set created by PROC PLAN assign them to new variables in DATA SAMPLE.

data firstname; do firstname=1 to 2000; output; end; run;
data lastname;  do lastname=1 to 1000;  output; end; run;
data city;      do city=1 to 100;       output; end; run;

proc plan;
   factors f=2000 random l=1 of 1000 random c=1 of 100 random / noprint;
   output out=obs;
   run;
proc print data=obs(obs=100); run;
data sample;
   set obs;
   set firstname point=f;
   set lastname point=l;
   set city point=c;
   run;

proc print data=sample(obs=100); run;

FreelanceReinh · Posted 10-30-2018 05:52 PM

@juanvenegas wrote:

Hello,

I have a dataset with 2000+ random first names, another dataset with 1000+ first names, and a third dataset with 100+ cities. I wanted to find a way of creating a random combinations of the three variables so that my end result would be a dataset with 2000+ first name, last name, and cities. I have a feeling I would need a do loop of some sort but im stuck. Thanks!

Hello @juanvenegas,

If you want to assign random last names and cities to each of the 2000+ first names, try this:

data want;
call streaminit(27182818);
set firstname;
p1=rand('integer',n1);
set lastname nobs=n1 point=p1;
p2=rand('integer',n2);
set city nobs=n2 point=p2;
run;

(I see, @data_null__ had a similar idea.)

data_null__ · Posted 10-30-2018 05:56 PM

@FreelanceReinh yours is much better I did not know about RAND("INTEGER").

FreelanceReinh · Posted 10-30-2018 06:03 PM

@data_null__ wrote:

@FreelanceReinh yours is much better I did not know about RAND("INTEGER").

Thanks. The "integer distribution" is fairly new: see this post. I was happy to get it with my recent update to release 9.4M5.

data_null__ · Posted 10-30-2018 06:11 PM

@FreelanceReinh wrote:

@data_null__ wrote:

@FreelanceReinh yours is much better I did not know about RAND("INTEGER").

Thanks. The "integer distribution" is fairly new: see this post. I was happy to get it with my recent update to release 9.4M5.

Sh.t I'm way behind this times.

Current version: 9.04.01M3P062415

juanvenegas · Posted 10-31-2018 08:32 AM

@FreelanceReinhard Thanks! This worked! If its not too much trouble, could you go over the logic of the code. Also, if I wanted to randomly sort the first col (the first names) since right now they're in alphabetical order, how may I go about doing so. Thanks again!

FreelanceReinh · Posted 10-31-2018 09:24 AM

@juanvenegas: You're welcome.

Here is an easy way to sort the output dataset randomly:

data want;
call streaminit(27182818);
set firstname;
p1=rand('integer',n1);
set lastname nobs=n1 point=p1;
p2=rand('integer',n2);
set city nobs=n2 point=p2;
r=ranuni(31416);
run;

proc sort data=want out=want(drop=r);
by r;
run;

(I used the old RANUNI function because the newer RAND function would share the underlying random number stream with the two other calls of RAND and thus create different assignments of LASTNAME and CITY for all but the first FIRSTNAME. This would not be wrong, but you couldn't just compare the old, alphabetically sorted and the new dataset WANT.)

In each iteration of the DATA step the first SET statement reads a first name from dataset FIRSTNAME. At compile time already, the NOBS= options of the second and third SET statement have retrieved the number of observations in datasets LASTNAME and CITY, respectively. This is why variables N1 and N2 can be used successfully in the assignment statements populating variables P1 and P2, although these precede the corresponding SET statements.

The RAND function using the "integer" distribution and a second argument k (=N1 or N2, resp.) generates a random integer from the set {1, 2, ..., k} (uniform distribution), based on a random number stream that was initialized at the beginning of the DATA step by means of the CALL STREAMINIT routine (27182818 is an arbitrary seed value). Thus, P1 and P2 are assigned randomly selected observation numbers from dataset LASTNAME and CITY, respectively.

The second and third SET statement use the POINT= option to designate the next observation to read, i.e., they read the observations specified by P1 and P2.

In the extended DATA step above, variable R is assigned a random number from the continuous uniform distribution on the interval [0, 1]. Only this variable and the variables contained in datasets FIRSTNAME, LASTNAME and CITY are written to dataset WANT because variables in the NOBS= and POINT= options of the SET statement are automatically dropped.

The DATA step terminates after all observations from dataset FIRSTNAME have been processed as described above.

The PROC SORT step sorts dataset WANT by R, hence randomly, and sheds variable R (using a DROP= dataset option), which is no longer needed.

juanvenegas · Posted 10-31-2018 09:35 AM

Thanks for everything!

data_null__ · Posted 10-31-2018 09:40 AM

@FreelanceReinh I expecting you to do this.

data want;
   call streaminit(27182818);
   do _n_ = 1 to n0;
      p0=rand('integer',n0)
      set firstname nobs=n0 point=p0;
      p1=rand('integer',n1);
      set lastname nobs=n1 point=p1;
      p2=rand('integer',n2);
      set city nobs=n2 point=p2;
      output;
      end;
   STOP;
   run;

FreelanceReinh · Posted 10-31-2018 09:53 AM

Hi @data_null__, this was in fact the first impulse I had, too (even with the same variable names N0, P0). However, I stopped going this way when I realized that this would not use all first names in dataset FIRSTNAME, but only about 63%. (Okay, datasets LASTNAME and CITY aren't always fully exhausted either, but the expected rates are higher.)

Classroom Training Available!