BookmarkSubscribeRSS Feed
juanvenegas
Fluorite | Level 6

Hello, 

 

I have a dataset with 2000+ random first names, another dataset with 1000+ first names, and a third dataset with 100+ cities. I wanted to find a way of creating a random combinations of the three variables so that my end result would be a dataset with 2000+ first name, last name, and cities. I have a feeling I would need a do loop of some sort but im stuck. Thanks!

34 REPLIES 34
PaigeMiller
Diamond | Level 26

UNTESTED CODE

 

Let's suppose your data set names are firstname, lastname and city, and there is also a variable in that data set by the same name.

 

proc sql;
     create table combined as select *,uniform(0) as rand
     from firstname,lastname,city order by rand;
quit;
data final;
    set combined(obs=2000);
run;
--
Paige Miller
juanvenegas
Fluorite | Level 6

Hi, unfortunately sas wont stop running when i executed the code. 

data_null__
Jade | Level 19

The Cartesian product will be HUGE.

 

ballardw
Super User

2000 * 1000 * 100 = 200,000,000  records.

Each additional city adds 2,000,000 records depending on how much larger your 2000+ and 1000+ actually are …

It will take a bit of time.

data_null__
Jade | Level 19

Maybe this will work.  In the factors statement use the actual obs numbers of your 3 data sets.  Noprint is important on factors statement.  If you want to keep the obs numbers from the OBS data set created by PROC PLAN assign them to new variables in DATA SAMPLE.

 

data firstname; do firstname=1 to 2000; output; end; run;
data lastname;  do lastname=1 to 1000;  output; end; run;
data city;      do city=1 to 100;       output; end; run;

proc plan;
   factors f=2000 random l=1 of 1000 random c=1 of 100 random / noprint;
   output out=obs;
   run;
proc print data=obs(obs=100); run;
data sample;
   set obs;
   set firstname point=f;
   set lastname point=l;
   set city point=c;
   run;

proc print data=sample(obs=100); run;
FreelanceReinh
Jade | Level 19

@juanvenegas wrote:

Hello, 

 

I have a dataset with 2000+ random first names, another dataset with 1000+ first names, and a third dataset with 100+ cities. I wanted to find a way of creating a random combinations of the three variables so that my end result would be a dataset with 2000+ first name, last name, and cities. I have a feeling I would need a do loop of some sort but im stuck. Thanks!


Hello @juanvenegas,

 

If you want to assign random last names and cities to each of the 2000+ first names, try this:

data want;
call streaminit(27182818);
set firstname;
p1=rand('integer',n1);
set lastname nobs=n1 point=p1;
p2=rand('integer',n2);
set city nobs=n2 point=p2;
run;

(I see, @data_null__ had a similar idea.)

data_null__
Jade | Level 19

@FreelanceReinh yours is much better I did not know about RAND("INTEGER").  

FreelanceReinh
Jade | Level 19

@data_null__ wrote:

@FreelanceReinh yours is much better I did not know about RAND("INTEGER").  


Thanks. The "integer distribution" is fairly new: see this post. I was happy to get it with my recent update to release 9.4M5.

data_null__
Jade | Level 19

@FreelanceReinh wrote:

@data_null__ wrote:

@FreelanceReinh yours is much better I did not know about RAND("INTEGER").  


Thanks. The "integer distribution" is fairly new: see this post. I was happy to get it with my recent update to release 9.4M5.


 

Sh.t I'm way behind this times. 

Current version: 9.04.01M3P062415

juanvenegas
Fluorite | Level 6

@FreelanceReinhard Thanks! This worked! If its not too much trouble, could you go over the logic of the code. Also, if I wanted to randomly sort the first col (the first names) since right now they're in alphabetical order, how may I go about doing so. Thanks again!

FreelanceReinh
Jade | Level 19

@juanvenegas: You're welcome.

 

Here is an easy way to sort the output dataset randomly:

data want;
call streaminit(27182818);
set firstname;
p1=rand('integer',n1);
set lastname nobs=n1 point=p1;
p2=rand('integer',n2);
set city nobs=n2 point=p2;
r=ranuni(31416);
run;

proc sort data=want out=want(drop=r);
by r;
run;

(I used the old RANUNI function because the newer RAND function would share the underlying random number stream with the two other calls of RAND and thus create different assignments of LASTNAME and CITY for all but the first FIRSTNAME. This would not be wrong, but you couldn't just compare the old, alphabetically sorted and the new dataset WANT.)

 

In each iteration of the DATA step the first SET statement reads a first name from dataset FIRSTNAME. At compile time already, the NOBS= options of the second and third SET statement have retrieved the number of observations in datasets LASTNAME and CITY, respectively. This is why variables N1 and N2 can be used successfully in the assignment statements populating variables P1 and P2, although these precede the corresponding SET statements.

 

The RAND function using the "integer" distribution and a second argument k (=N1 or N2, resp.) generates a random integer from the set {1, 2, ..., k} (uniform distribution), based on a random number stream that was initialized at the beginning of the DATA step by means of the CALL STREAMINIT routine (27182818 is an arbitrary seed value). Thus, P1 and P2 are assigned randomly selected observation numbers from dataset LASTNAME and CITY, respectively.

 

The second and third SET statement use the POINT= option to designate the next observation to read, i.e., they read the observations specified by P1 and P2.

 

In the extended DATA step above, variable R is assigned a random number from the continuous uniform distribution on the interval [0, 1]. Only this variable and the variables contained in datasets FIRSTNAME, LASTNAME and CITY are written to dataset WANT because variables in the NOBS= and POINT= options of the SET statement are automatically dropped.

 

The DATA step terminates after all observations from dataset FIRSTNAME have been processed as described above.

 

The PROC SORT step sorts dataset WANT by R, hence randomly, and sheds variable R (using a DROP= dataset option), which is no longer needed.

juanvenegas
Fluorite | Level 6

Thanks for everything!

data_null__
Jade | Level 19

@FreelanceReinh I expecting you to do this. 

 

data want;
   call streaminit(27182818);
   do _n_ = 1 to n0;
      p0=rand('integer',n0)
      set firstname nobs=n0 point=p0;
      p1=rand('integer',n1);
      set lastname nobs=n1 point=p1;
      p2=rand('integer',n2);
      set city nobs=n2 point=p2;
      output;
      end;
   STOP;
   run;
FreelanceReinh
Jade | Level 19

Hi @data_null__, this was in fact the first impulse I had, too (even with the same variable names N0, P0). However, I stopped going this way when I realized that this would not use all first names in dataset FIRSTNAME, but only about 63%. (Okay, datasets LASTNAME and CITY aren't always fully exhausted either, but the expected rates are higher.)

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 34 replies
  • 1439 views
  • 2 likes
  • 6 in conversation