PROBLEM STATEMENT
I have 4 large data tables from which I have to create a random sample of certain sizes from each data table. Each data table has a field called COMMAND. The values for the COMMAND field are as follows: DC, TX, FL and GL. Each data table has only one command value eg either DC or TX or FL or GL. Each data table has more than 5 million records. The sample sizes are as follows for each COMMAND:
Data Table Command Sample size
Table_1 DC 1000
Table_2 TX 1500
Table_3 FL 2000
Table_4 GL 1400
What I want to do is a MACRO that creates a series of random samples using PROC SURVEYSELECT from those 4 data tables based on the values of COMMAND variable in those tables.
I need to create a random sample for each of these command (DC, TX,FL and GL) with the sample size of 1000, 1500,2000, and 1400 respectively as shown ABOVE. I need a macro that can do this using PROC SURVEYSELECT . What I want is, a random sample of 1000 for DC, 1500 for TX, 2000 for FL and 1400 for GL from those 4 tables . For example, if the command field DC has 4 million records in table_1, I want to create a random sample size of 1000 for DC from that table (table_1) and likewise, 1500 random sample for TX (another table, table_2) and likewise for all(table_3 and table_4). Please help me on this. I really appreciate the kind assistance.
I want to use the following code for random sampling OR ANOTHER CODE IF YOU WANT TO SUGGEST
PROC SURVEYSELECT DATA=ERP_MASTER
METHOD=SRS
REP=1
SAMPSIZE=&samplesize.
OUT= &COMMAND._SRS;
ID _ALL_;
I hope this helpS you understand the problem STATEMENT. I highly appreciate your kind assistance. Thanks
Gopi Upreti
You can do this with call execute:
data _null_;
length line $200;
input Table $ Command $ SampleSize;
line = catt("proc surveyselect data=",Table," sampsize=",SampleSize," out=",Command,"_SRS; run;");
call execute(line);
datalines;
Table_1 DC 1000
Table_2 TX 1500
Table_3 FL 2000
Table_4 GL 1400
;
I think CALL EXECUTE is the best tool for the job. Assuming that your program is basically working the way you want, here is how you could automate:
data _null_;
set commands;
call execute('PROC SURVEYSELECT DATA=');
call execute(data_table);
call execute('ERP_MASTER
METHOD=SRS
REP=1
SAMPSIZE=');
call execute(put(samplesize, 4.);
call execute('OUT=');
call execute(COMMAND);
call execute('_SRS; ID _ALL_; run;');
run;
Theoretically, you could put everything together in a single CALL EXECUTE, but I think this is easier to read.
There are details to attend to ... always using 4 digits as the sample size (hope that's enough but you can increase it if needed), no trailing blanks allowed at the end of COMMAND.
I assumed you wanted to add the RUN statements.
Good luck.
Just in case you had gone to the work of segregating the data just for this step, you could have used STRAT=command and a list for sampsize parameter. Just make sure that the order of the list on the sampsize matches your strata order.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9.
Early bird rate extended! Save $200 when you sign up by March 31.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.