BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
tritringuyen
Quartz | Level 8

Hello, 

I want to randomly match observations from two datasets. Data1 contains treatment firms and data2 contain control firms. 

 

Data1 (treatment)

 

gvkeyFyearsic3ATQYX1X2
1012198733516.85820.8373070.30180.246012
1037198336624.40320.7805220.4115160.309363
1050200735696.53510.7620240.1925560.893492
10502008356120.01710.1311830.7758420.306371
10562000382248.70730.7801030.8865340.583451
10562001382310.25230.5062790.4716170.82575
10562002382318.46530.5130910.7019310.92213

 

Data2 (eligible control) 

GVKEYFYEARsic3ATQYX1X2
1000197330821.77120.1509450.346610.820642
1000197730844.02530.1953240.8238020.258745
1000197130829.3320.4029870.5790260.036652
1000197630838.58630.1755180.3464980.691525
1000197430825.63820.2829940.6958140.549823
1000197530823.90520.1848890.6343190.003082
1000197030833.4530.2277340.6883340.985432
1000197230819.90720.6908810.347220.360407

 

Control and treatment datasets have the same variables. I want to match each treatment with one control with the same SIC3 and the same Q (asset quintile) (control and treatment do not need to be in year (fyear)). The matched sample should be like this:

 

Want dataset (matched sample):

 

sic3QfyeargvkeyatYX1X2gvkey_cat_cY_CX1_CX2_C
10011980792038.3620.1877010.3017510.979782894247.2880.2686320.3147110.811104
10421980556053.7790.253130.5015120.3176211282226.6430.7109390.9267530.706577
104319805686367.1110.0250620.4666180.1022411856100.1330.9714380.772480.397624
10431980315314.7810.1032930.6724110.36602221279.0060.7354270.9782710.624185
1314198021641.8940.1480380.3207040.62828482381.8960.6430210.8807090.607355
1311198051294.3680.783280.960150.67008154984.4740.0223840.8689510.877956
13121980207318.8150.5607160.5149820.979269287318.4720.2011250.9228180.736144
131319806870142.9590.7840380.8923760.3236723420140.8060.4040230.2366460.159027
131319802437209.6290.6182970.1184540.8547741364204.7230.821150.1633510.524918
131419801544210.220.5779970.6682110.2185887560203.8710.290840.6387870.077885

 

As you can see in the matched sample, the last 5 columns belong to controls and other belong to treatment (sic3 and Q are the same for treatment and control). I need to repeat this process 1,000 times to have 1,000 matched samples. Then, I run OLS regression Y = X1 + X2 for 1,000 matched samples to obtain the mean and standard deviation, p-value of coefficient X1, X2. The SAS code for this regression is something like:

 

proc surveyreg data=want ;                                                                                                                                                 

 cluster gvkey;                                                                                                                                                                            

 model  Y = X1 X2 X1_C X2_C;                                                                                           

ods output parameterestimates = param fitstatistics = statistics datasummary = datasummary;                                                                                           

run;                                                                                                                                                                                       

quit;                                                                                                                                                                                     

 

(I need output dataset to obtain some statistics I need). 

 

I hope that anyone can give me some suggestions on how to perform this simulation. I can match one treatment with one control in the same sic3 and closest AT (assets), but not random match. 

 

Best regards,

Thierry

1 ACCEPTED SOLUTION

Accepted Solutions
tritringuyen
Quartz | Level 8

Thank @PGStats for providing some guidelines. I have modified the code as following:

 

                                                                                                                                            


* Random match each treatment with one firm in the same SIC3 and Q;
%macro simulation;
%do i=1 %to 5;

data t1&i;
set treatment;
rnd = rand("uniform");
output;
run;


data c1&i;
set eligible_controls;
rnd = rand("uniform");
output;
run;

proc sort data=t1&i; by sic3 Q rnd; run;
proc sort data=c1&i; by sic3 Q rnd; run;


proc rank data=t1&i out=t2&i;
by sic3 Q;
var rnd;
run;

proc rank data=c1&i out=c2&i;
by sic3 Q;
var rnd;
run;

data matched&i;
merge
t2&i (in=in_t)
c2&i (in=in_c rename=(gvkey=gvkey_c at=at_c));
by sic3 Q rnd;
if in_t and in_c;
run;

data matched&i; set matched&i; rep = &i; run;

* Combine all matched samples;
proc append base = matched_full data = matched&i force; run;
proc datasets library=work; delete stat t1&i t2&i c1&i c2&i matched&i; quit; run;

%end;
%mend simulation;
%simulation

     

It seems that the code works. @PGStats, what do you think? Thanks.                                                                                                 

 

View solution in original post

10 REPLIES 10
Rick_SAS
SAS Super FREQ

You say that you can do the "random matching." Do it 1000 times and put all matched samples into a single data set. (Use an ID variable such as REPLICATE with values 1,2,...,1000 to know which observations belong to each sample.) Then run your regression plot by using 

BY REPLICATE;

 

For an overview and some details (such as "remember to suppress ODS output!:), see

"Simulation in SAS: The slow way or the BY way"

"Turn off ODS when running simulations in SAS"

tritringuyen
Quartz | Level 8

Thanks @Rick_SAS for your reply, but actually it does not help much. Best regards, Thierry

ballardw
Super User

@tritringuyen wrote:

Hello, 

I want to randomly match observations from two datasets. Data1 contains treatment firms and data2 contain control firms. I want to match each treatment with one control in the same SIC3 and the same AT (asset) quintile group.

 

Data1 (treatment)

gvkey FYEAR at sic3 Y X1 X2
1000 1970 33.45 308 0.185866 0.231087 0.009421
1000 1971 29.33 308 0.757783 0.321722 0.246141
1000 1972 19.907 308 0.896117 0.150721 0.996615
1000 1973 21.771 308 0.83092 0.354851 0.821562
1000 1974 25.638 308 0.348816 0.008351 0.833004
1000 1975 23.905 308 0.561266 0.331876 0.42865
1000 1976 38.586 308 0.93423 0.171456 0.660347
1000 1977 44.025 308 0.851992 0.450207 0.425079

 

Data2 (control)

Fyear GVKEY AT Sic3 Y X1 X2
1983 3734 8.08 335 0.026617 0.236824 0.797894
1983 4839 12.617 366 0.460664 0.347329 0.337429
1983 11284 24.403 356 0.749856 0.384469 0.900897
2007 5073 96.535 382 0.89703 0.007704 0.151084
2002 6774 318.465 382 0.420823 0.012027 0.085926
2000 10453 248.707 382 0.809201 0.124128 0.778542
2001 141845 310.252 366 0.034532 0.958971 0.213459
1983 3229 1.537 367 0.854481 0.164966 0.675804
2002 23671 1700.513 737 0.032982 0.015465 0.70077


I want to repeat this process 1,000 times to have 1,000 samples. Then, I want to simulate OLS regression Y = X1 + X2 for 1,000 matched samples to obtain the mean and standard deviation, p-value of coefficient X1, X2. 

 

I hope that anyone can give me some suggestions on how to perform this simulation. I can match one to one and run the regression, but not 1,000 times.

 

Thank you very much!

Best regards,

Thierry


One thing that might help is to define what you mean by quintile group in this context. Do you mean a quintile within each SIC, fyear, gvkey, some combination of those? Very likely a first step is assigning that quintile so it is available for matching.

 

I suggest that you 1) start over by picking data from the two sets that have the SIC codes that match, 2) show some of the possible matches using that example data (random can come later after we understand what you are attempting to do), 3) indicate which other variables you want. And it won't hurt to show what the base regression code you are planning on running.

 

 

Do you know which regression you are planning? If so show an example of the code.

PGStats
Opal | Level 21

"I can match one to one and run the regression". That's where we should start from. Please show us that SAS code.

PG
tritringuyen
Quartz | Level 8

Thank you all for your response. I have edited the message to make it more clear. Please see the original post for edited question. What I already have is a code to match one treatment with one control based on sic3 and Q (same quintile), but not random match. I suppose that I need new code for random match. Thank you very much!

 

Best regards,

Tri

PGStats
Opal | Level 21

I guess you may want to do the pairing like this:

 

%let nbRep=10;    /* start small for testing */

data t1;
do rep = 1 to &nbRep;
    set treatment;
    rnd = rand("uniform");
    output;
    end;
run;

data c1;
do rep = 1 to &nbRep;
    set control;
    rnd = rand("uniform");
    output;
    end;
run;

proc sort data=t1; by rep sic3 rnd; run;
proc sort data=c1; by rep sic3 rnd; run;

proc rank data=t1 out=t2;
by rep sic3;
var rnd;
run;

proc rank data=c1 out=c2;
by rep sic3;
var rnd;
run;

data want;
merge 
    t2 (in=in_t)
    c2 (in=in_c rename=(gvkey=gvkey_c at=at_c Y=Y_c X1=X1_c X2=X2_c) drop=Q fyear);
by rep sic3 rnd;
if in_t and in_c;
run;

and then do your regression analysis by rep;

 

 

(untested)

PG
tritringuyen
Quartz | Level 8
Thank you very much! The code is helpful although it does not work exactly what I want. The code seems to divide the treatment sample into nbRep of subsamples (I tested when nbRep=10, treatment will be matched with controls, but total observations in all nbRep are exactly the same treatment sample). What I want is that, assume that I have 200 treatment and need to find 200 control in the same SIC 3 and same Q (your code does not consider same Q, unfortunately). The first matched sample have 200 observations (or 200 pairs of controls and treatments). I need to replicate this process for 1,000 times. So, the final matched sample must have 200*1,000 = 200,000 observations. I hope that it is a little bit more clear now.
tritringuyen
Quartz | Level 8

Thank @PGStats for providing some guidelines. I have modified the code as following:

 

                                                                                                                                            


* Random match each treatment with one firm in the same SIC3 and Q;
%macro simulation;
%do i=1 %to 5;

data t1&i;
set treatment;
rnd = rand("uniform");
output;
run;


data c1&i;
set eligible_controls;
rnd = rand("uniform");
output;
run;

proc sort data=t1&i; by sic3 Q rnd; run;
proc sort data=c1&i; by sic3 Q rnd; run;


proc rank data=t1&i out=t2&i;
by sic3 Q;
var rnd;
run;

proc rank data=c1&i out=c2&i;
by sic3 Q;
var rnd;
run;

data matched&i;
merge
t2&i (in=in_t)
c2&i (in=in_c rename=(gvkey=gvkey_c at=at_c));
by sic3 Q rnd;
if in_t and in_c;
run;

data matched&i; set matched&i; rep = &i; run;

* Combine all matched samples;
proc append base = matched_full data = matched&i force; run;
proc datasets library=work; delete stat t1&i t2&i c1&i c2&i matched&i; quit; run;

%end;
%mend simulation;
%simulation

     

It seems that the code works. @PGStats, what do you think? Thanks.                                                                                                 

 

PGStats
Opal | Level 21

Macro coding will only complicate things for you. SAS can handle sample sizes of 200*1000 very easily with by-processing, and you will get your regression results in a single dataset.

 

The matching that I illustrated was without replacement, i.e. a control would not be matched with multiple treatment cases in any given repetition. You seem to want random matching with replacement. That would require a different algorithm.

PG
tritringuyen
Quartz | Level 8

Dear @PGStats, you are right. My macro code results in matched samples with replacement. The fact is that I have more than 10,000 treatments. Thus, the full matched sample after 1,000 times has more than 10 millions observations (more than 8GB for the full matched sample with variables needed). I need a macro for matching because I think it is more efficient. I tested my code and it works perfectly (with replacement). I wonder if you can help me adjust my code a little bit from matching with replacement to without replacement. I would be helpful. Thank you very much!

 

Best regards, Thierry

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 1887 views
  • 0 likes
  • 4 in conversation