Re: Random combination of variables - Page 2

data_null__ · Posted 10-31-2018 11:48 AM

@FreelanceReinh I had thought a bit about the full sample issue as you mention @juanvenegas did not give much detail about that. If we generate permutations for last name and city in blocks we can insure full use of all first last and city names. Back to PROC PLAN one of my favorite procedures.

data firstname; do fn=1 to 341; firstname='F: '||put(fn,words64.); output; end; run; %let f0=&SYSNOBS;
data lastname;  do ln=1 to 134; lastname ='L: '||put(ln,words64.); output; end; run; %let l0=&SYSNOBS;
data city;      do cn=1 to 15;  city     ='C: '||put(cn,words64.); output; end; run; %let c0=&SYSNOBS;

proc plan seed=1234;
   factors f=&f0 random                                    / noprint;   output out=fobs;   run;
   factors r1=%sysevalf(&f0/&l0,ceil) ordered l=&l0 random / noprint;   output out=lobs;   run;
   factors r2=%sysevalf(&f0/&c0,ceil) ordered c=&c0 random / noprint;   output out=cobs;   run;
   quit;

data sample;
   merge fobs(in=in1) lobs cobs;
   if not in1 then stop;
   p1=f; p2=l; p3=c;
   set firstname point=p1;
   set lastname point=p2;
   set city point=p3;
   run;

proc print data=sample(obs=100); run;

FreelanceReinh · Posted 10-31-2018 12:05 PM

Thanks @data_null__, looks interesting. I'll take a closer look later.

@juanvenegas: As you can see, your question was really inspiring. 🙂

Ksharp · Posted 10-31-2018 10:09 AM

John King,

Why not PROC SURVEYSELECT which is more efficient I think.

proc surveyselect data=firstname out=a method=srs sampsize=200 reps=10 ;
run;

proc surveyselect data=lastname out=b method=srs sampsize=200 reps=10 ;
run;

proc surveyselect data=city out=c method=urs sampsize=200 reps=10 ;
run;

data want;

merge a b c;

run;

FreelanceReinh · Posted 10-31-2018 10:31 AM

Thanks @Ksharp for presenting another elegant solution. Of course, with METHOD=URS again not all first names would be used (in general). Also, the DATA step solution doesn't need hardcoded sample sizes.

Regarding efficiency:

216  data want;
217  call streaminit(27182818);
218  set firstname;
219  p1=rand('integer',n1);
220  set lastname nobs=n1 point=p1;
221  p2=rand('integer',n2);
222  set city nobs=n2 point=p2;
223  run;

NOTE: There were 2000 observations read from the data set WORK.FIRSTNAME.
NOTE: The data set WORK.WANT has 2000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

juanvenegas · Posted 10-31-2018 12:36 PM

These are all great. Last question. I now have a dataset with 2000+ observations with variables such as first_name, last_name, city, etc. If I wanted to create a dataset with over 2,000,000 observations using the dataset I already have, how would I go about that? Would it be efficient to just to stack datasets on top of one another until i reach 2,000,000 and then randomly sort? Thanks!

FreelanceReinh · Posted 10-31-2018 02:19 PM

To keep it simple, I'd probably make only minor modifications to the existing code:

data want;
call streaminit(27182818);
set firstname;
do _n_=1 to 1000;
  p1=rand('integer',n1);
  set lastname nobs=n1 point=p1;
  p2=rand('integer',n2);
  set city nobs=n2 point=p2;
  r=ranuni(31416);
  output;
end;
run;

(The PROC SORT step remains unchanged.)

Thus, you get each of the first names exactly a thousand times. Last names and cities are assigned independently and randomly as before.

With 2000 first names, 1000 last names and 100 cities you're likely to get about 10,000 duplicate records with the above method. If this was an issue (e.g. you don't want any duplicate records), you could sufficiently increase the number of DO-loop iterations and then remove the duplicates, e.g., by sorting NODUPKEY by firstname lastname city. Finally, sort by r as before.

juanvenegas · Posted 10-31-2018 04:28 PM

Thank you for the solution. However, it has 40 mins since i executed the code. Is there another way of creating a dataset with 2 million observations from one that has 2000 observations? Thanks!!

FreelanceReinh · Posted 10-31-2018 04:41 PM

@juanvenegas wrote:

Thank you for the solution. However, it has 40 mins since i executed the code. Is there another way of creating a dataset with 2 million observations from one that has 2000 observations? Thanks!!

Are you saying that the DATA step creating (about) 2 million observations from three datasets containing (about) 2000, 1000 and 100 observations, respectively, took 40 (forty) minutes to complete?

With my test data (using numeric variables rather than character variables, though) it took less than one second:

208  data want;
209  call streaminit(27182818);
210  set firstname;
211  do _n_=1 to 1000;
212    p1=rand('integer',n1);
213    set lastname nobs=n1 point=p1;
214    p2=rand('integer',n2);
215    set city nobs=n2 point=p2;
216    r=ranuni(31416);
217    output;
218  end;
219  run;

NOTE: There were 2000 observations read from the data set WORK.FIRSTNAME.
NOTE: The data set WORK.WANT has 2000000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.69 seconds
      cpu time            0.68 seconds

Does your log look essentially the same apart from the time data?

Edit: Same for the PROC SORT step:

220  proc sort data=want out=want(drop=r);
221  by r;
222  run;

NOTE: There were 2000000 observations read from the data set WORK.WANT.
NOTE: SAS threaded sort was used.
NOTE: The data set WORK.WANT has 2000000 observations and 3 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.52 seconds
      cpu time            1.18 seconds

juanvenegas · Posted 10-31-2018 04:55 PM

So unfortunately, I never received log output because It never finished executing. I had to stop it manually. I'm not sure why it doesn't work.

FreelanceReinh · Posted 10-31-2018 05:07 PM

@juanvenegas wrote:

So unfortunately, I never received log output because It never finished executing. I had to stop it manually. I'm not sure why it doesn't work.

What happens if you reduce the number of DO-loop iterations, e.g., from 1000 to 100 or only 10 and/or similarly the number of observations used from dataset FIRSTNAME, e.g. by using

set firstname(obs=10);

data_null__ · Posted 10-31-2018 05:14 PM

@juanvenegas wrote:

So unfortunately, I never received log output because It never finished executing. I had to stop it manually. I'm not sure why it doesn't work.

You need to show your code.

juanvenegas · Posted 11-01-2018 08:19 AM

filename src temp; 
filename src1 temp;
filename src2 temp; 
filename src3 temp; 

proc http
	method='get'
	url='https://names.mongabay.com/male_names_alpha.htm'
	out=src; 
run; 

proc http
	method='get'
	url='https://names.mongabay.com/female_names.htm'
	out=src1; 
run;

proc http 
	method='get'
	url='https://names.mongabay.com/data/1000.html'
	out=src2; 
run; 

proc http 	
	method='get'
	url='https://www.biggestuscities.com/'
	out=src3;
run; 

data m; 
	infile src length=len lrecl=32767; 
	input line $varying32767. len; 
		line = strip(line); 
		if len>0; 
run; 

data m(keep=first_name);
	set m; 
	if find(line,'<tr><td>') then do; 
		first_name =compress(scan(line, 3,'>'), '</td'); 
		output; 
	end;
run ;

data f; 
	infile src1 length=len lrecl=32767; 
	input line $varying32767. len; 
		line = strip(line); 
		if len>0; 
run; 

data f(keep=first_name);
	set f; 
	if find(line,'<tr><td>') then do; 
		first_name =compress(scan(line, 3,'>'), '</td'); 
		output; 
	end;
run ;

data lastnames; 
	infile src2 length=len lrecl=32767; 
	input line $varying32767. len; 
		line = strip(line); 
		if len>0; 
run;

data lastnames(keep=last_name); 
	set lastnames;
	if find(line,'<tr><td>') then do; 
		last_name=compress(scan(line,3,'>'), '</td class="c1"'); 
		output; 
	end; 
run; 

data lastnames;
	set lastnames; 	
	if _n_=1 then delete; 
run; 

data cities; 
	infile src3 length=len lrecl=32767; 
	input line $varying32767. len ;
		line=strip(line); 
		if len>0; 
run; 

data cities; 
	set cities; 
	if find(line, '/city/') then do; 
		pickup=_n_+1; 
		/* read ahead one line */
		set cities (rename=(line=city)) point=pickup;
		output;
	end; 
	drop line; 
run; 
		
proc sql outobs=2000; 
	create table firstnames as
		select * from m
		union
		select * from f
		order by ranuni(0); 
quit; 

data want;
	call streaminit(27182818);
	set firstnames;
	do _n_=1 to 1000;
		set lastnames nobs=n1 point=p1;
		p2=rand('integer',n2);
		set cities nobs=n2 point=p2;
	 	r=ranuni(31416); 
		output; 
	end;

run; 

proc sort data=want out=want(drop=r); 
	by r; 
run;

juanvenegas · Posted 11-01-2018 08:23 AM

first_name=ANDRE i=1001 n1=1000 p1=0 last_name=  p2=108 n2=1000 city=Oxnard r=0.4124817473 _ERROR_=1 _N_=1
first_name=VON i=1001 n1=1000 p1=0 last_name=  p2=314 n2=1000 city=Orem r=0.0159212062 _ERROR_=1 _N_=2
first_name=KIMBERLY i=1001 n1=1000 p1=0 last_name=  p2=426 n2=1000 city=Tustin r=0.9027105714 _ERROR_=1 _N_=3
first_name=HECTOR i=1001 n1=1000 p1=0 last_name=  p2=327 n2=1000 city=Yuma r=0.2278609286 _ERROR_=1 _N_=4
first_name=OTHA i=1001 n1=1000 p1=0 last_name=  p2=597 n2=1000 city=Springfield r=0.8277588453 _ERROR_=1 _N_=5
first_name=NAOMI i=1001 n1=1000 p1=0 last_name=  p2=282 n2=1000 city=Santa Maria r=0.6695287682 _ERROR_=1 _N_=6
first_name=AUTUMN i=1001 n1=1000 p1=0 last_name=  p2=648 n2=1000 city=Dubuque r=0.0367212226 _ERROR_=1 _N_=7
first_name=BENITA i=1001 n1=1000 p1=0 last_name=  p2=619 n2=1000 city=Lakewood r=0.8849451397 _ERROR_=1 _N_=8
first_name=COREY i=1001 n1=1000 p1=0 last_name=  p2=299 n2=1000 city=South Bend r=0.4537835002 _ERROR_=1 _N_=9
2                                                          The SAS System                           08:10 Thursday, November 1, 2018

first_name=DESHAWN i=1001 n1=1000 p1=0 last_name=  p2=367 n2=1000 city=Bellingham r=0.0297729597 _ERROR_=1 _N_=10
first_name=DANA i=1001 n1=1000 p1=0 last_name=  p2=349 n2=1000 city=Greenville r=0.1076641763 _ERROR_=1 _N_=11
first_name=VAUGHN i=1001 n1=1000 p1=0 last_name=  p2=592 n2=1000 city=Grand Junction r=0.9345960836 _ERROR_=1 _N_=12
first_name=LINDSEY i=1001 n1=1000 p1=0 last_name=  p2=375 n2=1000 city=Citrus Heights r=0.6866326126 _ERROR_=1 _N_=13
first_name=VIOLET i=1001 n1=1000 p1=0 last_name=  p2=989 n2=1000 city=Hurst r=0.002655746 _ERROR_=1 _N_=14
first_name=LIDIA i=1001 n1=1000 p1=0 last_name=  p2=149 n2=1000 city=Elk Grove r=0.7080130888 _ERROR_=1 _N_=15
first_name=MURIEL i=1001 n1=1000 p1=0 last_name=  p2=100 n2=1000 city=Des Moines r=0.3503743826 _ERROR_=1 _N_=16
first_name=NETTIE i=1001 n1=1000 p1=0 last_name=  p2=292 n2=1000 city=League City r=0.436807743 _ERROR_=1 _N_=17
first_name=ANGEL i=1001 n1=1000 p1=0 last_name=  p2=695 n2=1000 city=Manhattan r=0.6702266404 _ERROR_=1 _N_=18
first_name=BENITO i=1001 n1=1000 p1=0 last_name=  p2=303 n2=1000 city=Sparks r=0.2132111556 _ERROR_=1 _N_=19
WARNING: Limit set by ERRORS= option reached.  Further errors of this type will not be printed.

This is my log message.

FreelanceReinh · Posted 11-01-2018 08:28 AM

@juanvenegas wrote:

data want;
	call streaminit(27182818);
	set firstnames;
	do _n_=1 to 1000;
		set lastnames nobs=n1 point=p1;
		p2=rand('integer',n2);
		set cities nobs=n2 point=p2;
	 	r=ranuni(31416); 
		output; 
	end;

run;

Why did you delete this important line from the DATA step?

p1=rand('integer',n1);

juanvenegas · Posted 11-01-2018 08:33 AM

0__o....... thank you for pointing that out. let me try again.

Classroom Training Available!