06-22-2016 07:29 PM - edited 06-22-2016 07:30 PM
Hello fellow SAS Users,
Hopefully someone is able to help. It's not direly urgent as I have written the code and it works fine. But, I need to repeat this code for 3 similar tests on 20-30 different sample combinations, each combination having 10s of millions of observations. So, it could get rather lengthy.
The code is below, and here is (esentially) the steps that need to be followed:
*POOLED PRE/POST FOR TREATMENT AND BENCHMARK FIRMS* = Combining necessary datasets with various sample partitions (in this case, four), based on a bootstrap procedure that generated 1,000 random samples
*RUN REGRESSION AND OBTAIN RESIDUALS FOR CHG IN NI* = The tests mentioned use the residuals from the OLS Regression model. The regressions are run by replicate (from the bootstrap).
*OBTAIN VARIANCES* = For each regression and each sample partition, I obtain the variance of the residuals. Since there are 1,000 replicates, this means there are 1,000 variances for each sample partition.
*TEST FOR PRE/POST DIFFERENCE IN VARIANCES* = t-test for difference in the means of the 1,000 variances between pre/post for each of the two types of firms.
*TO CALCULATE DIFFERENCE-IN-DIFFERENCES* = Final step, calculating the difference for each of the above 1,000 pre/post combinations, then testing for difference in the differences.
ODS HTML CLOSE; /*CLOSE PREVIOUS*/ ODS HTML; /*OPEN NEW*/ *POOLED PRE/POST FOR TREATMENT AND BENCHMARK FIRMS*;; PROC SORT DATA=PRETR_R; BY DSCD FYEAR; RUN; PROC SORT DATA=POSTTR_R; BY DSCD FYEAR; RUN; PROC SORT DATA=PREBN_R; BY DSCD FYEAR; RUN; PROC SORT DATA=POSTBN_R; BY DSCD FYEAR; RUN; DATA POOLED1; MERGE PRETR_R POSTTR_R PREBN_R POSTBN_R; BY DSCD FYEAR; RUN; proc sort data=pooled1; by replicate; run; *TEST 1 -- FULL SAMPLE*; *RUN REGRESSION AND OBTAIN RESIDUALS FOR CHG IN NI;; PROC REG DATA=pooled1 PLOTS PLOTS=NONE NOPRINT ; BY REPLICATE; MODEL NI_D = GROWTH EISSUE LEV DISSUE TURN SIZE CF_S NUMEX XLIST CLOSE ROL; OUTPUT OUT=POOLEDTR_REG1 R=NI_DRESID; RUN; *OBTAIN VARIANCES*;; PROC SORT DATA=POOLEDTR_REG1; BY POST TYPE2 REPLICATE; RUN; PROC MEANS NOPRINT DATA=POOLEDTR_REG1 VAR; BY POST TYPE2 REPLICATE; VAR NI_DRESID; OUTPUT OUT=POOLEDTR_NI VAR= NI_VAR ; RUN; *TEST FOR PRE/POST DIFFERENCE IN VARIANCES*;; PROC SORT DATA=POOLEDTR_NI; BY TYPE2; RUN; PROC TTEST DATA=POOLEDTR_NI PLOTS=NONE; VAR NI_VAR; CLASS post ; by type2; RUN; *TO CALCULATE AND TEST DIFFERENCE-IN-DIFFERENCES*;; DATA POOLEDTR_NI2; SET POOLEDTR_NI; BY TYPE2 POST REPLICATE; LAG = LAG1000(NI_VAR); DIFF = NI_VAR-LAG; IF POST=0 THEN DELETE; RUN; PROC TTEST DATA=POOLEDTR_NI2 PLOTS=NONE; VAR DIFF; CLASS TYPE2 ; RUN;
As I said, this code WORKS. My concern is that it is not EFFICIENT. Considering how many data sets this one test creates, combined with having to do this another 20-30 times.....
So, if anyone can help simplify or consolidate this code, I would greatly appreciate it.
Thanks in advance!
06-22-2016 10:05 PM
About the first step. POOLED1 is the merge of four datasets, where does replicate come from? Why does the merging not involve the replicate number?
06-22-2016 10:10 PM
06-22-2016 10:32 PM
Merge is not the same as append. Merge is for when the datasets have some by-values in common. If your 4 datasets don't have DSCD values in common, then a simple dataset concatenation followed by a sort is all you would need to complete the first step (you would save four sorting steps).
06-22-2016 10:11 PM
06-22-2016 11:15 PM
OK. So your first step can be done in a single data step. Replace
PROC SORT DATA=PRETR_R; BY DSCD FYEAR; RUN; PROC SORT DATA=POSTTR_R; BY DSCD FYEAR; RUN; PROC SORT DATA=PREBN_R; BY DSCD FYEAR; RUN; PROC SORT DATA=POSTBN_R; BY DSCD FYEAR; RUN; DATA POOLED1; MERGE PRETR_R POSTTR_R PREBN_R POSTBN_R; BY DSCD FYEAR; RUN; proc sort data=pooled1; by replicate; run;
data pooled1; set PRETR_R POSTTR_R PREBN_R POSTBN_R; by replicate; run;
06-24-2016 11:02 AM
If I understand your post correctly, you're going to be running the same model with different inputs. I would wrap the entire thing in a macro, and run it once for each combination. If you're using different input datasets each time, add them to the macro.
The neat trick to doing it this way is that you have to put the commas in, but don't have to supply values for everything, as it's just text substitution (I tend to put the ones that might be short values at the end).
Something else you might want to consider is to manipulate your results for a run into a dataset with a little selecting and PROC TRANSPOSE if neccessary. If you name the datasets with a leadoff that is not used for anything else, you can put them all together using the colon notation, and not have to know (and type) all the names.
Another thought might be to use PROC SQL to do your initial dataset and avoid all that sorting. I've put a complete guess at the bottom .
Hope this helps!
%MACRO TESTTREAT(MODELID, VAR1, VAR2, VAR3, VAR4, VAR5, VAR6, VAR7, VAR8, VAR9, VAR10, VAR11) ; .... program statments here MODEL &MODELID = &VAR1 &VAR2 &VAR3 &VAR4 &VAR5 &VAR6 &VAR7 &VAR8 &VAR9 &VAR10 &VAR11 ; .... program statements here .... put results for the run together .... DATA RESULT_&MODELID ; %MEND ; %TESTTREAT(NI_D, GROWTH, EISSUE, LEV, DISSUE, TURN, SIZE, CF_S, NUMEX, XLIST, CLOSE, ROL) ; %TESTTREAT(aaa, bbb, ccc, ddd, eee, fff, ggg, hhh, iii, jjj, kkk, lll) ; %TESTTREAT(aaa, bbb, ccc, ddd, eee, fff, ggg, hhh, iii, jjj, , , ) ; .... however many times you need for all the models .... .... put the result datasets together .... DATA ALL_RESULTS ; SET RESULT_: ; RUN ; ... shot in the dark using SQL ... PROC SQL NOPRINT ; CREATE TABLE POOLED1 AS SELECT * FROM PRETR_R, POSTTR_R, PREBN_R, POSTBN_R GROUP BY DSCD FYEAR, ORDER BY REPLICATE; QUIT ;