Solved: Re: Simulation Sampling and OLS regression in SAS

tritringuyen · Posted 08-14-2018 03:05 PM

Hello,

I want to randomly match observations from two datasets. Data1 contains treatment firms and data2 contain control firms.

Data1 (treatment)

gvkey	Fyear	sic3	AT	Q	Y	X1	X2
1012	1987	335	16.858	2	0.837307	0.3018	0.246012
1037	1983	366	24.403	2	0.780522	0.411516	0.309363
1050	2007	356	96.535	1	0.762024	0.192556	0.893492
1050	2008	356	120.017	1	0.131183	0.775842	0.306371
1056	2000	382	248.707	3	0.780103	0.886534	0.583451
1056	2001	382	310.252	3	0.506279	0.471617	0.82575
1056	2002	382	318.465	3	0.513091	0.701931	0.92213

Data2 (eligible control)

GVKEY	FYEAR	sic3	AT	Q	Y	X1	X2
1000	1973	308	21.771	2	0.150945	0.34661	0.820642
1000	1977	308	44.025	3	0.195324	0.823802	0.258745
1000	1971	308	29.33	2	0.402987	0.579026	0.036652
1000	1976	308	38.586	3	0.175518	0.346498	0.691525
1000	1974	308	25.638	2	0.282994	0.695814	0.549823
1000	1975	308	23.905	2	0.184889	0.634319	0.003082
1000	1970	308	33.45	3	0.227734	0.688334	0.985432
1000	1972	308	19.907	2	0.690881	0.34722	0.360407

Control and treatment datasets have the same variables. I want to match each treatment with one control with the same SIC3 and the same Q (asset quintile) (control and treatment do not need to be in year (fyear)). The matched sample should be like this:

Want dataset (matched sample):

sic3	Q	fyear	gvkey	at	Y	X1	X2	gvkey_c	at_c	Y_C	X1_C	X2_C
100	1	1980	7920	38.362	0.187701	0.301751	0.979782	8942	47.288	0.268632	0.314711	0.811104
104	2	1980	5560	53.779	0.25313	0.501512	0.317621	12822	26.643	0.710939	0.926753	0.706577
104	3	1980	5686	367.111	0.025062	0.466618	0.102241	1856	100.133	0.971438	0.77248	0.397624
104	3	1980	3153	14.781	0.103293	0.672411	0.366022	2127	9.006	0.735427	0.978271	0.624185
131	4	1980	2164	1.894	0.148038	0.320704	0.628284	8238	1.896	0.643021	0.880709	0.607355
131	1	1980	5129	4.368	0.78328	0.96015	0.670081	5498	4.474	0.022384	0.868951	0.877956
131	2	1980	2073	18.815	0.560716	0.514982	0.979269	2873	18.472	0.201125	0.922818	0.736144
131	3	1980	6870	142.959	0.784038	0.892376	0.323672	3420	140.806	0.404023	0.236646	0.159027
131	3	1980	2437	209.629	0.618297	0.118454	0.854774	1364	204.723	0.82115	0.163351	0.524918
131	4	1980	1544	210.22	0.577997	0.668211	0.218588	7560	203.871	0.29084	0.638787	0.077885

As you can see in the matched sample, the last 5 columns belong to controls and other belong to treatment (sic3 and Q are the same for treatment and control). I need to repeat this process 1,000 times to have 1,000 matched samples. Then, I run OLS regression Y = X1 + X2 for 1,000 matched samples to obtain the mean and standard deviation, p-value of coefficient X1, X2. The SAS code for this regression is something like:

proc surveyreg data=want ;

cluster gvkey;

model Y = X1 X2 X1_C X2_C;

ods output parameterestimates = param fitstatistics = statistics datasummary = datasummary;

run;

quit;

(I need output dataset to obtain some statistics I need).

I hope that anyone can give me some suggestions on how to perform this simulation. I can match one treatment with one control in the same sic3 and closest AT (assets), but not random match.

Best regards,

Thierry

tritringuyen · Posted 08-15-2018 05:50 AM

Thank @PGStats for providing some guidelines. I have modified the code as following:


* Random match each treatment with one firm in the same SIC3 and Q;
%macro simulation;                                                                                                                              
%do i=1 %to 5;                                                                                                                                                                                                                                                                     

data t1&i;
    set treatment;
    rnd = rand("uniform");
    output;
run;
                                                                                                                                                                                                   

data c1&i;
    set eligible_controls;
    rnd = rand("uniform");
    output;
run;

proc sort data=t1&i; by sic3 Q rnd; run;
proc sort data=c1&i; by sic3 Q rnd; run;


proc rank data=t1&i out=t2&i;
by sic3 Q;
var rnd;
run;

proc rank data=c1&i out=c2&i;
by sic3 Q;
var rnd;
run;

data matched&i;
merge 
    t2&i (in=in_t)
    c2&i (in=in_c rename=(gvkey=gvkey_c at=at_c));
by sic3 Q rnd;
if in_t and in_c;
run;

data matched&i; set matched&i; rep = &i; run;

* Combine all matched samples;
proc append base = matched_full data = matched&i force; run; 
proc datasets library=work; delete stat t1&i t2&i c1&i c2&i matched&i; quit; run; 
                                                                                                                                       
%end;                                                                                                                                       
%mend simulation;                                                                                                                               
%simulation

It seems that the code works. @PGStats, what do you think? Thanks.

View solution in original post

Rick_SAS · Posted 08-14-2018 03:22 PM

You say that you can do the "random matching." Do it 1000 times and put all matched samples into a single data set. (Use an ID variable such as REPLICATE with values 1,2,...,1000 to know which observations belong to each sample.) Then run your regression plot by using

BY REPLICATE;

For an overview and some details (such as "remember to suppress ODS output!:), see

"Simulation in SAS: The slow way or the BY way"

"Turn off ODS when running simulations in SAS"

tritringuyen · Posted 08-14-2018 03:28 PM

Thanks @Rick_SAS for your reply, but actually it does not help much. Best regards, Thierry

ballardw · Posted 08-14-2018 04:06 PM

@tritringuyen wrote:

Hello,

I want to randomly match observations from two datasets. Data1 contains treatment firms and data2 contain control firms. I want to match each treatment with one control in the same SIC3 and the same AT (asset) quintile group.

Data1 (treatment)

gvkey FYEAR at sic3 Y X1 X2

1000 1970 33.45 308 0.185866 0.231087 0.009421

1000 1971 29.33 308 0.757783 0.321722 0.246141

1000 1972 19.907 308 0.896117 0.150721 0.996615

1000 1973 21.771 308 0.83092 0.354851 0.821562

1000 1974 25.638 308 0.348816 0.008351 0.833004

1000 1975 23.905 308 0.561266 0.331876 0.42865

1000 1976 38.586 308 0.93423 0.171456 0.660347

1000 1977 44.025 308 0.851992 0.450207 0.425079

Data2 (control)

Fyear GVKEY AT Sic3 Y X1 X2

1983 3734 8.08 335 0.026617 0.236824 0.797894

1983 4839 12.617 366 0.460664 0.347329 0.337429

1983 11284 24.403 356 0.749856 0.384469 0.900897

2007 5073 96.535 382 0.89703 0.007704 0.151084

2002 6774 318.465 382 0.420823 0.012027 0.085926

2000 10453 248.707 382 0.809201 0.124128 0.778542

2001 141845 310.252 366 0.034532 0.958971 0.213459

1983 3229 1.537 367 0.854481 0.164966 0.675804

2002 23671 1700.513 737 0.032982 0.015465 0.70077

I want to repeat this process 1,000 times to have 1,000 samples. Then, I want to simulate OLS regression Y = X1 + X2 for 1,000 matched samples to obtain the mean and standard deviation, p-value of coefficient X1, X2.

I hope that anyone can give me some suggestions on how to perform this simulation. I can match one to one and run the regression, but not 1,000 times.

Thank you very much!

Best regards,

Thierry

One thing that might help is to define what you mean by quintile group in this context. Do you mean a quintile within each SIC, fyear, gvkey, some combination of those? Very likely a first step is assigning that quintile so it is available for matching.

I suggest that you 1) start over by picking data from the two sets that have the SIC codes that match, 2) show some of the possible matches using that example data (random can come later after we understand what you are attempting to do), 3) indicate which other variables you want. And it won't hurt to show what the base regression code you are planning on running.

Do you know which regression you are planning? If so show an example of the code.

PGStats · Posted 08-14-2018 04:22 PM

"I can match one to one and run the regression". That's where we should start from. Please show us that SAS code.

PG

tritringuyen · Posted 08-14-2018 07:43 PM

Thank you all for your response. I have edited the message to make it more clear. Please see the original post for edited question. What I already have is a code to match one treatment with one control based on sic3 and Q (same quintile), but not random match. I suppose that I need new code for random match. Thank you very much!

Best regards,

Tri

PGStats · Posted 08-15-2018 12:20 AM

I guess you may want to do the pairing like this:

%let nbRep=10;    /* start small for testing */

data t1;
do rep = 1 to &nbRep;
    set treatment;
    rnd = rand("uniform");
    output;
    end;
run;

data c1;
do rep = 1 to &nbRep;
    set control;
    rnd = rand("uniform");
    output;
    end;
run;

proc sort data=t1; by rep sic3 rnd; run;
proc sort data=c1; by rep sic3 rnd; run;

proc rank data=t1 out=t2;
by rep sic3;
var rnd;
run;

proc rank data=c1 out=c2;
by rep sic3;
var rnd;
run;

data want;
merge 
    t2 (in=in_t)
    c2 (in=in_c rename=(gvkey=gvkey_c at=at_c Y=Y_c X1=X1_c X2=X2_c) drop=Q fyear);
by rep sic3 rnd;
if in_t and in_c;
run;

and then do your regression analysis by rep;

(untested)

PG

tritringuyen · Posted 08-15-2018 05:21 AM

Thank you very much! The code is helpful although it does not work exactly what I want. The code seems to divide the treatment sample into nbRep of subsamples (I tested when nbRep=10, treatment will be matched with controls, but total observations in all nbRep are exactly the same treatment sample). What I want is that, assume that I have 200 treatment and need to find 200 control in the same SIC 3 and same Q (your code does not consider same Q, unfortunately). The first matched sample have 200 observations (or 200 pairs of controls and treatments). I need to replicate this process for 1,000 times. So, the final matched sample must have 200*1,000 = 200,000 observations. I hope that it is a little bit more clear now.

tritringuyen · Posted 08-15-2018 05:50 AM

Thank @PGStats for providing some guidelines. I have modified the code as following:


* Random match each treatment with one firm in the same SIC3 and Q;
%macro simulation;                                                                                                                              
%do i=1 %to 5;                                                                                                                                                                                                                                                                     

data t1&i;
    set treatment;
    rnd = rand("uniform");
    output;
run;
                                                                                                                                                                                                   

data c1&i;
    set eligible_controls;
    rnd = rand("uniform");
    output;
run;

proc sort data=t1&i; by sic3 Q rnd; run;
proc sort data=c1&i; by sic3 Q rnd; run;


proc rank data=t1&i out=t2&i;
by sic3 Q;
var rnd;
run;

proc rank data=c1&i out=c2&i;
by sic3 Q;
var rnd;
run;

data matched&i;
merge 
    t2&i (in=in_t)
    c2&i (in=in_c rename=(gvkey=gvkey_c at=at_c));
by sic3 Q rnd;
if in_t and in_c;
run;

data matched&i; set matched&i; rep = &i; run;

* Combine all matched samples;
proc append base = matched_full data = matched&i force; run; 
proc datasets library=work; delete stat t1&i t2&i c1&i c2&i matched&i; quit; run; 
                                                                                                                                       
%end;                                                                                                                                       
%mend simulation;                                                                                                                               
%simulation

It seems that the code works. @PGStats, what do you think? Thanks.

PGStats · Posted 08-15-2018 02:43 PM

Macro coding will only complicate things for you. SAS can handle sample sizes of 200*1000 very easily with by-processing, and you will get your regression results in a single dataset.

The matching that I illustrated was without replacement, i.e. a control would not be matched with multiple treatment cases in any given repetition. You seem to want random matching with replacement. That would require a different algorithm.

PG

tritringuyen · Posted 08-15-2018 02:59 PM

Dear @PGStats, you are right. My macro code results in matched samples with replacement. The fact is that I have more than 10,000 treatments. Thus, the full matched sample after 1,000 times has more than 10 millions observations (more than 8GB for the full matched sample with variables needed). I need a macro for matching because I think it is more efficient. I tested my code and it works perfectly (with replacement). I wonder if you can help me adjust my code a little bit from matching with replacement to without replacement. I would be helpful. Thank you very much!

Best regards, Thierry

Registration is open

SAS Training: Just a Click Away