Hello,
I want to randomly match observations from two datasets. Data1 contains treatment firms and data2 contain control firms.
Data1 (treatment)
gvkey | Fyear | sic3 | AT | Q | Y | X1 | X2 |
1012 | 1987 | 335 | 16.858 | 2 | 0.837307 | 0.3018 | 0.246012 |
1037 | 1983 | 366 | 24.403 | 2 | 0.780522 | 0.411516 | 0.309363 |
1050 | 2007 | 356 | 96.535 | 1 | 0.762024 | 0.192556 | 0.893492 |
1050 | 2008 | 356 | 120.017 | 1 | 0.131183 | 0.775842 | 0.306371 |
1056 | 2000 | 382 | 248.707 | 3 | 0.780103 | 0.886534 | 0.583451 |
1056 | 2001 | 382 | 310.252 | 3 | 0.506279 | 0.471617 | 0.82575 |
1056 | 2002 | 382 | 318.465 | 3 | 0.513091 | 0.701931 | 0.92213 |
Data2 (eligible control)
GVKEY | FYEAR | sic3 | AT | Q | Y | X1 | X2 |
1000 | 1973 | 308 | 21.771 | 2 | 0.150945 | 0.34661 | 0.820642 |
1000 | 1977 | 308 | 44.025 | 3 | 0.195324 | 0.823802 | 0.258745 |
1000 | 1971 | 308 | 29.33 | 2 | 0.402987 | 0.579026 | 0.036652 |
1000 | 1976 | 308 | 38.586 | 3 | 0.175518 | 0.346498 | 0.691525 |
1000 | 1974 | 308 | 25.638 | 2 | 0.282994 | 0.695814 | 0.549823 |
1000 | 1975 | 308 | 23.905 | 2 | 0.184889 | 0.634319 | 0.003082 |
1000 | 1970 | 308 | 33.45 | 3 | 0.227734 | 0.688334 | 0.985432 |
1000 | 1972 | 308 | 19.907 | 2 | 0.690881 | 0.34722 | 0.360407 |
Control and treatment datasets have the same variables. I want to match each treatment with one control with the same SIC3 and the same Q (asset quintile) (control and treatment do not need to be in year (fyear)). The matched sample should be like this:
Want dataset (matched sample):
sic3 | Q | fyear | gvkey | at | Y | X1 | X2 | gvkey_c | at_c | Y_C | X1_C | X2_C |
100 | 1 | 1980 | 7920 | 38.362 | 0.187701 | 0.301751 | 0.979782 | 8942 | 47.288 | 0.268632 | 0.314711 | 0.811104 |
104 | 2 | 1980 | 5560 | 53.779 | 0.25313 | 0.501512 | 0.317621 | 12822 | 26.643 | 0.710939 | 0.926753 | 0.706577 |
104 | 3 | 1980 | 5686 | 367.111 | 0.025062 | 0.466618 | 0.102241 | 1856 | 100.133 | 0.971438 | 0.77248 | 0.397624 |
104 | 3 | 1980 | 3153 | 14.781 | 0.103293 | 0.672411 | 0.366022 | 2127 | 9.006 | 0.735427 | 0.978271 | 0.624185 |
131 | 4 | 1980 | 2164 | 1.894 | 0.148038 | 0.320704 | 0.628284 | 8238 | 1.896 | 0.643021 | 0.880709 | 0.607355 |
131 | 1 | 1980 | 5129 | 4.368 | 0.78328 | 0.96015 | 0.670081 | 5498 | 4.474 | 0.022384 | 0.868951 | 0.877956 |
131 | 2 | 1980 | 2073 | 18.815 | 0.560716 | 0.514982 | 0.979269 | 2873 | 18.472 | 0.201125 | 0.922818 | 0.736144 |
131 | 3 | 1980 | 6870 | 142.959 | 0.784038 | 0.892376 | 0.323672 | 3420 | 140.806 | 0.404023 | 0.236646 | 0.159027 |
131 | 3 | 1980 | 2437 | 209.629 | 0.618297 | 0.118454 | 0.854774 | 1364 | 204.723 | 0.82115 | 0.163351 | 0.524918 |
131 | 4 | 1980 | 1544 | 210.22 | 0.577997 | 0.668211 | 0.218588 | 7560 | 203.871 | 0.29084 | 0.638787 | 0.077885 |
As you can see in the matched sample, the last 5 columns belong to controls and other belong to treatment (sic3 and Q are the same for treatment and control). I need to repeat this process 1,000 times to have 1,000 matched samples. Then, I run OLS regression Y = X1 + X2 for 1,000 matched samples to obtain the mean and standard deviation, p-value of coefficient X1, X2. The SAS code for this regression is something like:
proc surveyreg data=want ;
cluster gvkey;
model Y = X1 X2 X1_C X2_C;
ods output parameterestimates = param fitstatistics = statistics datasummary = datasummary;
run;
quit;
(I need output dataset to obtain some statistics I need).
I hope that anyone can give me some suggestions on how to perform this simulation. I can match one treatment with one control in the same sic3 and closest AT (assets), but not random match.
Best regards,
Thierry
Thank @PGStats for providing some guidelines. I have modified the code as following:
* Random match each treatment with one firm in the same SIC3 and Q;
%macro simulation;
%do i=1 %to 5;
data t1&i;
set treatment;
rnd = rand("uniform");
output;
run;
data c1&i;
set eligible_controls;
rnd = rand("uniform");
output;
run;
proc sort data=t1&i; by sic3 Q rnd; run;
proc sort data=c1&i; by sic3 Q rnd; run;
proc rank data=t1&i out=t2&i;
by sic3 Q;
var rnd;
run;
proc rank data=c1&i out=c2&i;
by sic3 Q;
var rnd;
run;
data matched&i;
merge
t2&i (in=in_t)
c2&i (in=in_c rename=(gvkey=gvkey_c at=at_c));
by sic3 Q rnd;
if in_t and in_c;
run;
data matched&i; set matched&i; rep = &i; run;
* Combine all matched samples;
proc append base = matched_full data = matched&i force; run;
proc datasets library=work; delete stat t1&i t2&i c1&i c2&i matched&i; quit; run;
%end;
%mend simulation;
%simulation
It seems that the code works. @PGStats, what do you think? Thanks.
You say that you can do the "random matching." Do it 1000 times and put all matched samples into a single data set. (Use an ID variable such as REPLICATE with values 1,2,...,1000 to know which observations belong to each sample.) Then run your regression plot by using
BY REPLICATE;
For an overview and some details (such as "remember to suppress ODS output!:), see
Thanks @Rick_SAS for your reply, but actually it does not help much. Best regards, Thierry
@tritringuyen wrote:
Hello,
I want to randomly match observations from two datasets. Data1 contains treatment firms and data2 contain control firms. I want to match each treatment with one control in the same SIC3 and the same AT (asset) quintile group.
Data1 (treatment)
gvkey FYEAR at sic3 Y X1 X2 1000 1970 33.45 308 0.185866 0.231087 0.009421 1000 1971 29.33 308 0.757783 0.321722 0.246141 1000 1972 19.907 308 0.896117 0.150721 0.996615 1000 1973 21.771 308 0.83092 0.354851 0.821562 1000 1974 25.638 308 0.348816 0.008351 0.833004 1000 1975 23.905 308 0.561266 0.331876 0.42865 1000 1976 38.586 308 0.93423 0.171456 0.660347 1000 1977 44.025 308 0.851992 0.450207 0.425079
Data2 (control)
Fyear GVKEY AT Sic3 Y X1 X2 1983 3734 8.08 335 0.026617 0.236824 0.797894 1983 4839 12.617 366 0.460664 0.347329 0.337429 1983 11284 24.403 356 0.749856 0.384469 0.900897 2007 5073 96.535 382 0.89703 0.007704 0.151084 2002 6774 318.465 382 0.420823 0.012027 0.085926 2000 10453 248.707 382 0.809201 0.124128 0.778542 2001 141845 310.252 366 0.034532 0.958971 0.213459 1983 3229 1.537 367 0.854481 0.164966 0.675804 2002 23671 1700.513 737 0.032982 0.015465 0.70077
I want to repeat this process 1,000 times to have 1,000 samples. Then, I want to simulate OLS regression Y = X1 + X2 for 1,000 matched samples to obtain the mean and standard deviation, p-value of coefficient X1, X2.
I hope that anyone can give me some suggestions on how to perform this simulation. I can match one to one and run the regression, but not 1,000 times.
Thank you very much!
Best regards,
Thierry
One thing that might help is to define what you mean by quintile group in this context. Do you mean a quintile within each SIC, fyear, gvkey, some combination of those? Very likely a first step is assigning that quintile so it is available for matching.
I suggest that you 1) start over by picking data from the two sets that have the SIC codes that match, 2) show some of the possible matches using that example data (random can come later after we understand what you are attempting to do), 3) indicate which other variables you want. And it won't hurt to show what the base regression code you are planning on running.
Do you know which regression you are planning? If so show an example of the code.
"I can match one to one and run the regression". That's where we should start from. Please show us that SAS code.
Thank you all for your response. I have edited the message to make it more clear. Please see the original post for edited question. What I already have is a code to match one treatment with one control based on sic3 and Q (same quintile), but not random match. I suppose that I need new code for random match. Thank you very much!
Best regards,
Tri
I guess you may want to do the pairing like this:
%let nbRep=10; /* start small for testing */
data t1;
do rep = 1 to &nbRep;
set treatment;
rnd = rand("uniform");
output;
end;
run;
data c1;
do rep = 1 to &nbRep;
set control;
rnd = rand("uniform");
output;
end;
run;
proc sort data=t1; by rep sic3 rnd; run;
proc sort data=c1; by rep sic3 rnd; run;
proc rank data=t1 out=t2;
by rep sic3;
var rnd;
run;
proc rank data=c1 out=c2;
by rep sic3;
var rnd;
run;
data want;
merge
t2 (in=in_t)
c2 (in=in_c rename=(gvkey=gvkey_c at=at_c Y=Y_c X1=X1_c X2=X2_c) drop=Q fyear);
by rep sic3 rnd;
if in_t and in_c;
run;
and then do your regression analysis by rep;
(untested)
Thank @PGStats for providing some guidelines. I have modified the code as following:
* Random match each treatment with one firm in the same SIC3 and Q;
%macro simulation;
%do i=1 %to 5;
data t1&i;
set treatment;
rnd = rand("uniform");
output;
run;
data c1&i;
set eligible_controls;
rnd = rand("uniform");
output;
run;
proc sort data=t1&i; by sic3 Q rnd; run;
proc sort data=c1&i; by sic3 Q rnd; run;
proc rank data=t1&i out=t2&i;
by sic3 Q;
var rnd;
run;
proc rank data=c1&i out=c2&i;
by sic3 Q;
var rnd;
run;
data matched&i;
merge
t2&i (in=in_t)
c2&i (in=in_c rename=(gvkey=gvkey_c at=at_c));
by sic3 Q rnd;
if in_t and in_c;
run;
data matched&i; set matched&i; rep = &i; run;
* Combine all matched samples;
proc append base = matched_full data = matched&i force; run;
proc datasets library=work; delete stat t1&i t2&i c1&i c2&i matched&i; quit; run;
%end;
%mend simulation;
%simulation
It seems that the code works. @PGStats, what do you think? Thanks.
Macro coding will only complicate things for you. SAS can handle sample sizes of 200*1000 very easily with by-processing, and you will get your regression results in a single dataset.
The matching that I illustrated was without replacement, i.e. a control would not be matched with multiple treatment cases in any given repetition. You seem to want random matching with replacement. That would require a different algorithm.
Dear @PGStats, you are right. My macro code results in matched samples with replacement. The fact is that I have more than 10,000 treatments. Thus, the full matched sample after 1,000 times has more than 10 millions observations (more than 8GB for the full matched sample with variables needed). I need a macro for matching because I think it is more efficient. I tested my code and it works perfectly (with replacement). I wonder if you can help me adjust my code a little bit from matching with replacement to without replacement. I would be helpful. Thank you very much!
Best regards, Thierry
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.