12-17-2015 04:26 PM
I have 4 large data tables from which I have to create a random sample of certain sizes from each data table. Each data table has a field called COMMAND. The values for the COMMAND field are as follows: DC, TX, FL and GL. Each data table has only one command value eg either DC or TX or FL or GL. Each data table has more than 5 million records. The sample sizes are as follows for each COMMAND:
Data Table Command Sample size
Table_1 DC 1000
Table_2 TX 1500
Table_3 FL 2000
Table_4 GL 1400
What I want to do is a MACRO that creates a series of random samples using PROC SURVEYSELECT from those 4 data tables based on the values of COMMAND variable in those tables.
I need to create a random sample for each of these command (DC, TX,FL and GL) with the sample size of 1000, 1500,2000, and 1400 respectively as shown ABOVE. I need a macro that can do this using PROC SURVEYSELECT . What I want is, a random sample of 1000 for DC, 1500 for TX, 2000 for FL and 1400 for GL from those 4 tables . For example, if the command field DC has 4 million records in table_1, I want to create a random sample size of 1000 for DC from that table (table_1) and likewise, 1500 random sample for TX (another table, table_2) and likewise for all(table_3 and table_4). Please help me on this. I really appreciate the kind assistance.
I want to use the following code for random sampling OR ANOTHER CODE IF YOU WANT TO SUGGEST
PROC SURVEYSELECT DATA=ERP_MASTER
I hope this helpS you understand the problem STATEMENT. I highly appreciate your kind assistance. Thanks
12-17-2015 05:24 PM - edited 12-17-2015 05:25 PM
You can do this with call execute:
length line $200; input Table $ Command $ SampleSize; line = catt("proc surveyselect data=",Table," sampsize=",SampleSize," out=",Command,"_SRS; run;"); call execute(line); datalines; Table_1 DC 1000 Table_2 TX 1500 Table_3 FL 2000 Table_4 GL 1400 ;
12-17-2015 05:26 PM
I think CALL EXECUTE is the best tool for the job. Assuming that your program is basically working the way you want, here is how you could automate:
call execute('PROC SURVEYSELECT DATA=');
call execute(put(samplesize, 4.);
call execute('_SRS; ID _ALL_; run;');
Theoretically, you could put everything together in a single CALL EXECUTE, but I think this is easier to read.
There are details to attend to ... always using 4 digits as the sample size (hope that's enough but you can increase it if needed), no trailing blanks allowed at the end of COMMAND.
I assumed you wanted to add the RUN statements.
12-17-2015 06:19 PM
Just in case you had gone to the work of segregating the data just for this step, you could have used STRAT=command and a list for sampsize parameter. Just make sure that the order of the list on the sampsize matches your strata order.