@FreelanceReinh I had thought a bit about the full sample issue as you mention @juanvenegas did not give much detail about that. If we generate permutations for last name and city in blocks we can insure full use of all first last and city names. Back to PROC PLAN one of my favorite procedures.
data firstname; do fn=1 to 341; firstname='F: '||put(fn,words64.); output; end; run; %let f0=&SYSNOBS;
data lastname; do ln=1 to 134; lastname ='L: '||put(ln,words64.); output; end; run; %let l0=&SYSNOBS;
data city; do cn=1 to 15; city ='C: '||put(cn,words64.); output; end; run; %let c0=&SYSNOBS;
proc plan seed=1234;
factors f=&f0 random / noprint; output out=fobs; run;
factors r1=%sysevalf(&f0/&l0,ceil) ordered l=&l0 random / noprint; output out=lobs; run;
factors r2=%sysevalf(&f0/&c0,ceil) ordered c=&c0 random / noprint; output out=cobs; run;
quit;
data sample;
merge fobs(in=in1) lobs cobs;
if not in1 then stop;
p1=f; p2=l; p3=c;
set firstname point=p1;
set lastname point=p2;
set city point=p3;
run;
proc print data=sample(obs=100); run;
Thanks @data_null__, looks interesting. I'll take a closer look later.
@juanvenegas: As you can see, your question was really inspiring. 🙂
John King,
Why not PROC SURVEYSELECT which is more efficient I think.
proc surveyselect data=firstname out=a method=srs sampsize=200 reps=10 ;
run;
proc surveyselect data=lastname out=b method=srs sampsize=200 reps=10 ;
run;
proc surveyselect data=city out=c method=urs sampsize=200 reps=10 ;
run;
data want;
merge a b c;
run;
Thanks @Ksharp for presenting another elegant solution. Of course, with METHOD=URS again not all first names would be used (in general). Also, the DATA step solution doesn't need hardcoded sample sizes.
Regarding efficiency:
216 data want; 217 call streaminit(27182818); 218 set firstname; 219 p1=rand('integer',n1); 220 set lastname nobs=n1 point=p1; 221 p2=rand('integer',n2); 222 set city nobs=n2 point=p2; 223 run; NOTE: There were 2000 observations read from the data set WORK.FIRSTNAME. NOTE: The data set WORK.WANT has 2000 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds
These are all great. Last question. I now have a dataset with 2000+ observations with variables such as first_name, last_name, city, etc. If I wanted to create a dataset with over 2,000,000 observations using the dataset I already have, how would I go about that? Would it be efficient to just to stack datasets on top of one another until i reach 2,000,000 and then randomly sort? Thanks!
To keep it simple, I'd probably make only minor modifications to the existing code:
data want; call streaminit(27182818); set firstname; do _n_=1 to 1000; p1=rand('integer',n1); set lastname nobs=n1 point=p1; p2=rand('integer',n2); set city nobs=n2 point=p2; r=ranuni(31416); output; end; run;
(The PROC SORT step remains unchanged.)
Thus, you get each of the first names exactly a thousand times. Last names and cities are assigned independently and randomly as before.
With 2000 first names, 1000 last names and 100 cities you're likely to get about 10,000 duplicate records with the above method. If this was an issue (e.g. you don't want any duplicate records), you could sufficiently increase the number of DO-loop iterations and then remove the duplicates, e.g., by sorting NODUPKEY by firstname lastname city. Finally, sort by r as before.
Thank you for the solution. However, it has 40 mins since i executed the code. Is there another way of creating a dataset with 2 million observations from one that has 2000 observations? Thanks!!
@juanvenegas wrote:
Thank you for the solution. However, it has 40 mins since i executed the code. Is there another way of creating a dataset with 2 million observations from one that has 2000 observations? Thanks!!
Are you saying that the DATA step creating (about) 2 million observations from three datasets containing (about) 2000, 1000 and 100 observations, respectively, took 40 (forty) minutes to complete?
With my test data (using numeric variables rather than character variables, though) it took less than one second:
208 data want; 209 call streaminit(27182818); 210 set firstname; 211 do _n_=1 to 1000; 212 p1=rand('integer',n1); 213 set lastname nobs=n1 point=p1; 214 p2=rand('integer',n2); 215 set city nobs=n2 point=p2; 216 r=ranuni(31416); 217 output; 218 end; 219 run; NOTE: There were 2000 observations read from the data set WORK.FIRSTNAME. NOTE: The data set WORK.WANT has 2000000 observations and 4 variables. NOTE: DATA statement used (Total process time): real time 0.69 seconds cpu time 0.68 seconds
Does your log look essentially the same apart from the time data?
Edit: Same for the PROC SORT step:
220 proc sort data=want out=want(drop=r); 221 by r; 222 run; NOTE: There were 2000000 observations read from the data set WORK.WANT. NOTE: SAS threaded sort was used. NOTE: The data set WORK.WANT has 2000000 observations and 3 variables. NOTE: PROCEDURE SORT used (Total process time): real time 0.52 seconds cpu time 1.18 seconds
So unfortunately, I never received log output because It never finished executing. I had to stop it manually. I'm not sure why it doesn't work.
@juanvenegas wrote:
So unfortunately, I never received log output because It never finished executing. I had to stop it manually. I'm not sure why it doesn't work.
What happens if you reduce the number of DO-loop iterations, e.g., from 1000 to 100 or only 10 and/or similarly the number of observations used from dataset FIRSTNAME, e.g. by using
set firstname(obs=10);
@juanvenegas wrote:
So unfortunately, I never received log output because It never finished executing. I had to stop it manually. I'm not sure why it doesn't work.
You need to show your code.
filename src temp; filename src1 temp; filename src2 temp; filename src3 temp; proc http method='get' url='https://names.mongabay.com/male_names_alpha.htm' out=src; run; proc http method='get' url='https://names.mongabay.com/female_names.htm' out=src1; run; proc http method='get' url='https://names.mongabay.com/data/1000.html' out=src2; run; proc http method='get' url='https://www.biggestuscities.com/' out=src3; run; data m; infile src length=len lrecl=32767; input line $varying32767. len; line = strip(line); if len>0; run; data m(keep=first_name); set m; if find(line,'<tr><td>') then do; first_name =compress(scan(line, 3,'>'), '</td'); output; end; run ; data f; infile src1 length=len lrecl=32767; input line $varying32767. len; line = strip(line); if len>0; run; data f(keep=first_name); set f; if find(line,'<tr><td>') then do; first_name =compress(scan(line, 3,'>'), '</td'); output; end; run ; data lastnames; infile src2 length=len lrecl=32767; input line $varying32767. len; line = strip(line); if len>0; run; data lastnames(keep=last_name); set lastnames; if find(line,'<tr><td>') then do; last_name=compress(scan(line,3,'>'), '</td class="c1"'); output; end; run; data lastnames; set lastnames; if _n_=1 then delete; run; data cities; infile src3 length=len lrecl=32767; input line $varying32767. len ; line=strip(line); if len>0; run; data cities; set cities; if find(line, '/city/') then do; pickup=_n_+1; /* read ahead one line */ set cities (rename=(line=city)) point=pickup; output; end; drop line; run; proc sql outobs=2000; create table firstnames as select * from m union select * from f order by ranuni(0); quit; data want; call streaminit(27182818); set firstnames; do _n_=1 to 1000; set lastnames nobs=n1 point=p1; p2=rand('integer',n2); set cities nobs=n2 point=p2; r=ranuni(31416); output; end; run; proc sort data=want out=want(drop=r); by r; run;
first_name=ANDRE i=1001 n1=1000 p1=0 last_name= p2=108 n2=1000 city=Oxnard r=0.4124817473 _ERROR_=1 _N_=1 first_name=VON i=1001 n1=1000 p1=0 last_name= p2=314 n2=1000 city=Orem r=0.0159212062 _ERROR_=1 _N_=2 first_name=KIMBERLY i=1001 n1=1000 p1=0 last_name= p2=426 n2=1000 city=Tustin r=0.9027105714 _ERROR_=1 _N_=3 first_name=HECTOR i=1001 n1=1000 p1=0 last_name= p2=327 n2=1000 city=Yuma r=0.2278609286 _ERROR_=1 _N_=4 first_name=OTHA i=1001 n1=1000 p1=0 last_name= p2=597 n2=1000 city=Springfield r=0.8277588453 _ERROR_=1 _N_=5 first_name=NAOMI i=1001 n1=1000 p1=0 last_name= p2=282 n2=1000 city=Santa Maria r=0.6695287682 _ERROR_=1 _N_=6 first_name=AUTUMN i=1001 n1=1000 p1=0 last_name= p2=648 n2=1000 city=Dubuque r=0.0367212226 _ERROR_=1 _N_=7 first_name=BENITA i=1001 n1=1000 p1=0 last_name= p2=619 n2=1000 city=Lakewood r=0.8849451397 _ERROR_=1 _N_=8 first_name=COREY i=1001 n1=1000 p1=0 last_name= p2=299 n2=1000 city=South Bend r=0.4537835002 _ERROR_=1 _N_=9 2 The SAS System 08:10 Thursday, November 1, 2018 first_name=DESHAWN i=1001 n1=1000 p1=0 last_name= p2=367 n2=1000 city=Bellingham r=0.0297729597 _ERROR_=1 _N_=10 first_name=DANA i=1001 n1=1000 p1=0 last_name= p2=349 n2=1000 city=Greenville r=0.1076641763 _ERROR_=1 _N_=11 first_name=VAUGHN i=1001 n1=1000 p1=0 last_name= p2=592 n2=1000 city=Grand Junction r=0.9345960836 _ERROR_=1 _N_=12 first_name=LINDSEY i=1001 n1=1000 p1=0 last_name= p2=375 n2=1000 city=Citrus Heights r=0.6866326126 _ERROR_=1 _N_=13 first_name=VIOLET i=1001 n1=1000 p1=0 last_name= p2=989 n2=1000 city=Hurst r=0.002655746 _ERROR_=1 _N_=14 first_name=LIDIA i=1001 n1=1000 p1=0 last_name= p2=149 n2=1000 city=Elk Grove r=0.7080130888 _ERROR_=1 _N_=15 first_name=MURIEL i=1001 n1=1000 p1=0 last_name= p2=100 n2=1000 city=Des Moines r=0.3503743826 _ERROR_=1 _N_=16 first_name=NETTIE i=1001 n1=1000 p1=0 last_name= p2=292 n2=1000 city=League City r=0.436807743 _ERROR_=1 _N_=17 first_name=ANGEL i=1001 n1=1000 p1=0 last_name= p2=695 n2=1000 city=Manhattan r=0.6702266404 _ERROR_=1 _N_=18 first_name=BENITO i=1001 n1=1000 p1=0 last_name= p2=303 n2=1000 city=Sparks r=0.2132111556 _ERROR_=1 _N_=19 WARNING: Limit set by ERRORS= option reached. Further errors of this type will not be printed.
This is my log message.
@juanvenegas wrote:data want; call streaminit(27182818); set firstnames; do _n_=1 to 1000; set lastnames nobs=n1 point=p1; p2=rand('integer',n2); set cities nobs=n2 point=p2; r=ranuni(31416); output; end; run;
Why did you delete this important line from the DATA step?
p1=rand('integer',n1);
0__o....... thank you for pointing that out. let me try again.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.