Solved: Re: FINDING MATCHED COUNTERPARTS

redrover99 · Posted 08-03-2012 06:49 AM

To this post I have attached two databases.

Database1: Contains the event firm information, involving the headings: Permno1, Dclrdt, marketcap1, Year1 and industry1. There are approximately 10869 event firm observations in database 1.

Database2: Contains the NON-Event firm information, involving the headings: Permno2, Marketcap2, year2 and industry2.

For each of the event firm observations in database1 (DB1) I need to match with a non-event firm observation in database2 (DB2) to create a database 3 (DB3)

I would like to match the observations on the basis of:

-Industry (must be the same number, between 1-7)

-Year (must be the same year)

-Size (no limits, just the closet match available for the given industry and year)

Ultimately my objective to achieve:

Database 3: Matched database. This contains 10869 event firms and its counterpart, the matched non event firm. DB3 holds the following headings: Permno1, Dclrdt, marketcap1, Year1 , industry1, Permno2, Marketcap2, year2 and industry2.

I seek help on this highly complex command, data is attached below.

PGStats · Posted 08-05-2012 04:48 PM

You'll get decent results using brute force (takes about 15 minutes to run on my modest machine) :

/* Add a unique sequence number to DB1 */
data DB1S;
set DB1;

source = _n_;
run;

/* Considering the distributions of marketcap1 and BMratio1, combine the
difference in log10(marketcap) and twice the difference in BMratio
as the components of distance between DB1 and DB2 observations.
Find the pairs with the lowest distances.

Takes about 15 min to run on a 3 GHz Intel processor with 2Gb memory */

proc sql;
create table matches as
select db1s.*, db2.*,
log10(marketcap1/marketcap2)**2 + 4*(BMratio1-BMratio2)**2 as dist
from db1s inner join db2 on industry1=industry2 and year1=year2
group by source
having dist=min(dist);
quit;

/* Remove duplicates originating from duplicates in DB2 */
proc sort data=matches noduprecs out=DB5; by source; run;

PG

View solution in original post

Tom · Posted 08-03-2012 11:42 AM

Search for propensity score matching algorithms.

In this case within a particular combination of Industry and Year the value of SIZE can be treated as your propensity score.

PGStats · Posted 08-03-2012 12:37 PM

Can the same NON-event firm be matched to many event firm for a given year, and vice-versa? Or are you looking for one-to-one matches?

PG

redrover99 · Posted 08-03-2012 08:50 PM

"The same NON-event firm be matched to many event firm for a given year" yes this should be fine.

But can we give both codes a go?

One code for where the same non-event firm can be matched, and another code where only one-to-one matches are permitted??

Mkeintz:You are right, my mistake it should be "size (no limits, just the closet match available for the given industry and year)". This is not random selection.

If I matched 1a to 2b this is a difference of 152 and if 1b with 2a this is a difference of 149. So, I would like to match 1a with 2a (difference of 49), and 1b with 2b (Difference of 52). Though I understand what you are trying to say, perhaps if we can impose a preference to those closest matches holding the same first digit. ie: 200 with 252. The preference is to be exercised only when two non-event observations are competing for selection.

With my SAS capabilities this matching process is proving a difficult command to conquer. Help would be greatly appreciated.

PGStats · Posted 08-04-2012 02:25 PM

For the closest matches, allowing duplicates :

data DB1S;
set DB1; source = _n_;
run;

proc sql;
create table db as
select source, industry1 as industry, year1 as year, permno1 as permno, dclrdt, marketcap1 as size
from DB1S
union all
select 0, industry2, year2, permno2, 0, marketcap2
from DB2
order by industry, year, size, dclrdt;

data up;
set db; by industry year;
retain p2 s2;
if first.year then call missing(p2, s2);
if source > 0 then do;
     match = coalesce(abs(size - s2), 999999);
     output;
     end;
else do;
     p2 = permno; s2 = size;
     end;
run;

proc sort data=db; by industry year descending size descending dclrdt; run;

data down;
set db; by industry year;
retain p2 s2;
if first.year then call missing(p2, s2);
if source > 0 then do;
     match = coalesce(abs(size - s2), 999999);
     output;
     end;
else do;
     p2 = permno; s2 = size;
     end;
run;

proc sql;
create table DB3 as
select U.industry, U.year, U.permno as permno1, U.dclrdt, U.size as marketcap1,
case when U.match < D.match then U.p2 else D.p2 end as permno2,
case when U.match < D.match then U.s2 else D.s2 end as marketcap2
from up as U inner join down as D on U.source = D.source
order by U.source;

drop table DB1S, db, up, down;
quit;

If duplicates are not allowed, the problem becomes a much harder optimization challenge.

PG

redrover99 · Posted 08-05-2012 01:24 AM

Thanks, allowing duplicates is fine. I like the database 3 I am receiving, however I wonder is it possible to tweak the code to add one more restriction?

So matching the data on 4 levels of criterion:

-Industry (must be the same number)

-Year (must be the same year)

-Size ( the closet match available )

-BM ratio ( the closet match available given matching industry, matching year and closet size)

By adding one more restriction this reduces the number of observations in the event firm DB1 to 5085. Still looking to match each observation with a non event firm db2 thankfully this database is still rather large about 1mill observations for SAS to select from.

So the final product would be much the same:

Database 3: Matched database. This contains 5085 event firms and its counterpart, the matched non event firm. DB3 holds the following headings: Permno1, Dclrdt, marketcap1, Year1 , industry1, mbratio1, Permno2, Marketcap2, year2, industry2 and mbratio2.

I have attached the MBratio data with this post. see below

PGStats · Posted 08-05-2012 08:04 AM

The algorithm implemented in the code is based on sorting. Sorting brings matching sizes in consecutive order. Unless there is a way to combine size and MBratio to produce a unique order, this simple algorithm cannot be adapted to more than one matching criteria.

PG

redrover99 · Posted 08-05-2012 08:18 AM

Hmm, what if I multiplied size and MB ratio in each database, to obtain a "size_MBratio" variable? Then matched DB1 and DB2 on the "size_MBratio"? Could the code provided then potentially match on size and BM ratio?

To the readers of this post, I am very eager to match on 4 levels of criterion, if you have this programming capabilities to help us out, please do! This requires high level programming skills.

PGStats · Posted 08-05-2012 03:26 PM

I tried matching with the product of marketcap and bmratio, this is the code :

/* Add a unique sequence number to DB1 */
data DB1S;
set DB1; source = _n_;
run;

/* Concatenate DB1 and DB2, use the variable SOURCE to distinguish the
source of each observation, sort the resulting dataset by size=marketcap*bmratio */
proc sql;
create table db as
select source, industry1 as industry, year1 as year, permno1 as permno, dclrdt,
marketcap1 as marketcap, bmratio1 as bmratio, marketcap1*bmratio1 as size
from DB1S
union all
select 0, industry2, year2, permno2, 0, marketcap2, bmratio2, marketcap2*bmratio2
from DB2
order by industry, year, size, dclrdt;

/* Scan the sorted data and associate the last observation met from DB2 to each
observation from DB1 */
data up;
set db; by industry year;
retain p2 m2 b2 s2;
if first.year then call missing(p2, m2, b2, s2);
if source > 0 then do;
match = coalesce(abs(size - s2), 999999);
output;
end;
else do;
p2 = permno; m2=marketcap; b2=bmratio; s2 = size;
end;
run;

/* reverse the sort order within each industry-year */
proc sort data=db; by industry year descending size descending dclrdt; run;

/* Scan the sorted data and associate the last observation met from DB2 to each
observation from DB1 */
data down;
set db; by industry year;
retain p2 m2 b2 s2;
if first.year then call missing(p2, m2, b2, s2);
if source > 0 then do;
match = coalesce(abs(size - s2), 999999);
output;
end;
else do;
p2 = permno; m2=marketcap; b2=bmratio; s2 = size;
end;
run;

/* Merge datasets UP and DOWN, keeping the closest match between the size immediately
before and the size immediately after each DB1 observation. Reestablish the original
DB1 order */
proc sql;
create table DB4 as
select
U.industry label="Industry",
U.year label="Year",
U.permno as permno1 label="PermNo1",
U.dclrdt,
case when U.match < D.match then U.p2 else D.p2 end as permno2 label="PermNo2",
U.marketcap as marketcap1 label="MarketCap1",
case when U.match < D.match then U.m2 else D.m2 end as marketcap2 label="MarketCap2",
U.bmratio as bmratio1 label="BMratio1",
case when U.match < D.match then U.b2 else D.b2 end as bmratio2 label="BMratio2"
from up as U inner join down as D on U.source = D.source
order by U.source;
drop table DB1S, db, up, down;
quit;

Compared to the original results (BD3), I don't find these results very satisfactory. I will give it a shot with the brute force approach, using some kind of euclidian distance as matching criteria. Contrary to first impression, it might be manageable within industry and year.

PG

PGStats · Posted 08-05-2012 04:48 PM

You'll get decent results using brute force (takes about 15 minutes to run on my modest machine) :

/* Add a unique sequence number to DB1 */
data DB1S;
set DB1;

source = _n_;
run;

/* Considering the distributions of marketcap1 and BMratio1, combine the
difference in log10(marketcap) and twice the difference in BMratio
as the components of distance between DB1 and DB2 observations.
Find the pairs with the lowest distances.

Takes about 15 min to run on a 3 GHz Intel processor with 2Gb memory */

proc sql;
create table matches as
select db1s.*, db2.*,
log10(marketcap1/marketcap2)**2 + 4*(BMratio1-BMratio2)**2 as dist
from db1s inner join db2 on industry1=industry2 and year1=year2
group by source
having dist=min(dist);
quit;

/* Remove duplicates originating from duplicates in DB2 */
proc sort data=matches noduprecs out=DB5; by source; run;

PG

redrover99 · Posted 08-05-2012 08:10 PM

PGStats you are a commendable mathematician! Thank you

mkeintz · Posted 08-03-2012 01:25 PM

You said:

"Size (no limits, just the closet match available for the given industry and size)"

Did you mean "given industry and year"?.

I think you want, for each record in DB 1, a single record from DB2 with the "closest match" in size, given it has the same industry and year. I.e. this is NOT a random selection, am I correct?

As stated, that would be easy (even the possibility of tied SIZE in DB2 could be easily addressed).

The real problem is whether you are willing to have the same DB2 record ending up being found as the closest to multiple DB1 records (i.e. "sampling with replacement"). I doubt you would want that.

So if you don't want that, my question to you is what rule do you want to use when a given DB2 record is the closest to multiple DB1 recs?

And before you say assign the DB2 record to the DB1 it is closest to, and then go get matches for the other DB1 recs, consider this scenario:

db1rec db1size                 db2rec db2size
1a          100                        2a         51
1b          200                        2b       252

2a is closest to both, but if you assign 2a to 1b (it's closer to 1b), then that leaves 2b (252) assigned to 1a (100) - probably not what you would want. OTOH, introducing some criterion like minimizing the total of all size differences (in a given industry/year group) would be, to me, a difficult programming task.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!