BookmarkSubscribeRSS Feed
wriccar
Fluorite | Level 6

Hello fellow SAS Users,

 

Hopefully someone is able to help. It's not direly urgent as I have written the code and it works fine. But, I need to repeat this code for 3 similar tests on 20-30 different sample combinations, each combination having 10s of millions of observations. So, it could get rather lengthy. 

 

The code is below, and here is (esentially) the steps that need to be followed:
*POOLED PRE/POST FOR TREATMENT AND BENCHMARK FIRMS* = Combining necessary datasets with various sample partitions (in this case, four), based on a bootstrap procedure that generated 1,000 random samples

 

*RUN REGRESSION AND OBTAIN RESIDUALS FOR CHG IN NI* = The tests mentioned use the residuals from the OLS Regression model. The regressions are run by replicate (from the bootstrap).

 

*OBTAIN VARIANCES* = For each regression and each sample partition, I obtain the variance of the residuals. Since there are 1,000 replicates, this means there are 1,000 variances for each sample partition.

 

*TEST FOR PRE/POST DIFFERENCE IN VARIANCES* = t-test for difference in the means of the 1,000 variances between pre/post for each of the two types of firms. 

 

*TO CALCULATE DIFFERENCE-IN-DIFFERENCES* = Final step, calculating the difference for each of the above 1,000 pre/post combinations, then testing for difference in the differences. 

 

ODS HTML CLOSE; /*CLOSE PREVIOUS*/
ODS HTML; /*OPEN NEW*/
*POOLED PRE/POST FOR TREATMENT AND BENCHMARK FIRMS*;;
PROC SORT DATA=PRETR_R; BY DSCD FYEAR; RUN; PROC SORT DATA=POSTTR_R; BY DSCD FYEAR; RUN;
PROC SORT DATA=PREBN_R; BY DSCD FYEAR; RUN; PROC SORT DATA=POSTBN_R; BY DSCD FYEAR; RUN;
DATA POOLED1; 
MERGE PRETR_R POSTTR_R PREBN_R POSTBN_R; 
BY DSCD FYEAR; 
RUN;

proc sort data=pooled1; by replicate; run;

*TEST 1 -- FULL SAMPLE*;
*RUN REGRESSION AND OBTAIN RESIDUALS FOR CHG IN NI;;
PROC REG DATA=pooled1 PLOTS PLOTS=NONE NOPRINT ; 
BY REPLICATE;
MODEL NI_D = GROWTH EISSUE LEV DISSUE TURN SIZE CF_S NUMEX XLIST CLOSE ROL;
OUTPUT OUT=POOLEDTR_REG1
R=NI_DRESID;
RUN;
*OBTAIN VARIANCES*;;
PROC SORT DATA=POOLEDTR_REG1; BY POST TYPE2 REPLICATE; RUN;
PROC MEANS NOPRINT DATA=POOLEDTR_REG1 VAR; 
BY POST TYPE2 REPLICATE; 
VAR NI_DRESID;
OUTPUT OUT=POOLEDTR_NI VAR= NI_VAR ; 
RUN;
*TEST FOR PRE/POST DIFFERENCE IN VARIANCES*;;
PROC SORT DATA=POOLEDTR_NI; BY TYPE2; RUN;
PROC TTEST DATA=POOLEDTR_NI PLOTS=NONE;
VAR NI_VAR; 
CLASS post ; 
by type2;
RUN;
*TO CALCULATE AND TEST DIFFERENCE-IN-DIFFERENCES*;;
DATA POOLEDTR_NI2; 
SET POOLEDTR_NI;
BY TYPE2 POST REPLICATE;
LAG = LAG1000(NI_VAR); 
DIFF = NI_VAR-LAG;
IF POST=0 THEN DELETE;
RUN;
PROC TTEST DATA=POOLEDTR_NI2 PLOTS=NONE;
VAR DIFF; 
CLASS TYPE2 ; 
RUN;

As I said, this code WORKS. My concern is that it is not EFFICIENT. Considering how many data sets this one test creates, combined with having to do this another 20-30 times..... 

 

So, if anyone can help simplify or consolidate this code, I would greatly appreciate it. 

Thanks in advance!

8 REPLIES 8
PGStats
Opal | Level 21

About the first step. POOLED1 is the merge of four datasets, where does replicate come from? Why does the merging not involve the replicate number? 

PG
wriccar
Fluorite | Level 6
Hi @PGStats, thanks for the reply.

Replicate is from a bootstrap procedure that produces 1,000 random samples on each of the four subsamples. So Replicate is 1-1,000 in each of the four datasets but not unique between them.

DSCD is the identifier variable and FYEAR is the time period. (I could have left out this step and used APPEND instead--the goal is just to combine the datasets completely.)
PGStats
Opal | Level 21

Merge is not the same as append. Merge is for when the datasets have some by-values in common. If your 4 datasets don't have DSCD values in common, then a simple dataset concatenation followed by a sort is all you would need to complete the first step (you would save four sorting steps). 

PG
wriccar
Fluorite | Level 6
However, in later steps I compare differences BY Replicate. That is, each random sample compared to the corresponding iteration of the other random sample.
wriccar
Fluorite | Level 6
I know what merge is for but I've always used it to combine datasets as well, as in this case. Not saying it is efficient. 😛
PGStats
Opal | Level 21

OK. So your first step can be done in a single data step. Replace

 

PROC SORT DATA=PRETR_R; BY DSCD FYEAR; RUN; PROC SORT DATA=POSTTR_R; BY DSCD FYEAR; RUN;
PROC SORT DATA=PREBN_R; BY DSCD FYEAR; RUN; PROC SORT DATA=POSTBN_R; BY DSCD FYEAR; RUN;
DATA POOLED1; 
MERGE PRETR_R POSTTR_R PREBN_R POSTBN_R; 
BY DSCD FYEAR; 
RUN;

proc sort data=pooled1; by replicate; run;

with

 

data pooled1;
set PRETR_R POSTTR_R PREBN_R POSTBN_R; 
by replicate;
run;
PG
WendyT
Pyrite | Level 9

Hi Wriccar-

 

If I understand your post correctly, you're going to be running the same model with different inputs.  I would wrap the entire thing in a macro, and run it once for each combination.  If you're using different input datasets each time, add them to the macro. 

 

The neat trick to doing it this way is that you have to put the commas in, but don't have to supply values for everything, as it's just text substitution (I tend to put the ones that might be short values at the end).  

 

Something else you might want to consider is to manipulate your results for a run into a dataset with a little selecting and PROC TRANSPOSE if neccessary.  If you name the datasets with a leadoff that is not used for anything else, you can put them all together using the colon notation, and not have to know (and type) all the names.

 

Another thought might be to use PROC SQL to do your initial dataset and avoid all that sorting. I've put a complete guess at the bottom .   

 

Hope this helps!

 

Wendy T

 

 

%MACRO TESTTREAT(MODELID, VAR1, VAR2, VAR3, VAR4, VAR5, VAR6, VAR7, VAR8, VAR9, VAR10, VAR11) ;
.... program statments here
MODEL &MODELID = &VAR1 &VAR2 &VAR3 &VAR4 &VAR5 &VAR6 &VAR7 &VAR8 &VAR9 &VAR10 &VAR11 ;
.... program statements here

.... put results for the run together ....
DATA RESULT_&MODELID ; 

%MEND ;
 
%TESTTREAT(NI_D, GROWTH, EISSUE, LEV, DISSUE, TURN, SIZE, CF_S, NUMEX, XLIST, CLOSE, ROL) ; 
%TESTTREAT(aaa,     bbb,    ccc, ddd,    eee,  fff,  ggg,  hhh,   iii,   jjj,   kkk, lll) ;
%TESTTREAT(aaa,     bbb,    ccc, ddd,    eee,  fff,  ggg,  hhh,   iii,   jjj,   ,  ,    ) ; 
.... however many times you need for all the models ....

.... put the result datasets together ....
DATA ALL_RESULTS ; SET RESULT_: ;
RUN ;

... shot in the dark using SQL ...

PROC SQL NOPRINT ;
CREATE TABLE POOLED1
AS SELECT *
FROM PRETR_R, POSTTR_R, PREBN_R, POSTBN_R
GROUP BY DSCD FYEAR,
ORDER BY REPLICATE;
QUIT ;
 
 
WendyT
Pyrite | Level 9
whoops... the comma after FYEAR in the GROUP BY ought to be after DSCD. sorry about that.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 8 replies
  • 1975 views
  • 0 likes
  • 3 in conversation