BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
redrover99
Calcite | Level 5

To this post I have attached two databases.

Database1: Contains the event firm information, involving the headings: Permno1, Dclrdt, marketcap1, Year1 and industry1. There are approximately 10869 event firm observations in database 1.

Database2: Contains the NON-Event firm information, involving the headings: Permno2, Marketcap2, year2 and industry2.

For each of the event firm observations in database1 (DB1) I need to match with a non-event firm observation in database2 (DB2) to create a database 3 (DB3)

I would like to match the observations on the basis of:

-Industry (must be the same number, between 1-7)

-Year (must be the same year)

-Size (no limits, just the closet match available for the given industry and year)

Ultimately my objective to achieve:

Database 3: Matched database. This contains 10869 event firms and its counterpart, the matched non event firm.  DB3 holds the following headings: Permno1, Dclrdt, marketcap1, Year1 , industry1, Permno2, Marketcap2, year2 and industry2.

I seek help on this highly complex command, data is attached below.

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

You'll get decent results using brute force (takes about 15 minutes to run on my modest machine) :

/* Add a unique sequence number to DB1 */
data DB1S;
set DB1;

source = _n_;
run;

/* Considering the distributions of marketcap1 and BMratio1, combine the
difference in log10(marketcap) and twice the difference in BMratio
as the components of distance between DB1 and DB2 observations.
Find the pairs with the lowest distances.

Takes about 15 min to run on a 3 GHz Intel processor with 2Gb memory */


proc sql;
create table matches as
select db1s.*, db2.*,
     log10(marketcap1/marketcap2)**2 + 4*(BMratio1-BMratio2)**2 as dist
from db1s inner join db2 on industry1=industry2 and year1=year2
group by source
having dist=min(dist);
quit;

/* Remove duplicates originating from duplicates in DB2 */
proc sort data=matches noduprecs out=DB5; by source; run;

PG

PG

View solution in original post

11 REPLIES 11
Tom
Super User Tom
Super User

Search for propensity score matching algorithms. 

In this case within a particular combination of Industry and Year the value of SIZE can be treated as your propensity score.

PGStats
Opal | Level 21

Can the same NON-event firm be matched to many event firm for a given year, and vice-versa? Or are you looking for one-to-one matches?

PG

PG
redrover99
Calcite | Level 5

"The same NON-event firm be matched to many event firm for a given year" yes this should be fine.

But can we give both codes a go?

One code for where the same non-event firm can be matched, and another code where only one-to-one matches are permitted??

Mkeintz:You are right, my mistake it should be "size (no limits, just the closet match available for the given industry and year)". This is not random selection.

If I matched 1a to 2b this is a difference of 152 and if 1b with 2a this is a difference of 149. So, I would like to match 1a with 2a (difference of 49), and 1b with 2b (Difference of 52). Though I understand what you are trying to say, perhaps if we can impose a preference to those closest matches holding the same first digit. ie: 200 with 252. The preference is to be exercised only when two non-event observations are competing for selection.

With my SAS capabilities this matching process is proving a difficult command to conquer. Help would be greatly appreciated.

PGStats
Opal | Level 21

For the closest matches, allowing duplicates :

data DB1S;
set DB1; source = _n_;
run;

proc sql;
create table db as
select source, industry1 as industry, year1 as year, permno1 as permno, dclrdt, marketcap1 as size
from DB1S
union all
select 0, industry2, year2, permno2, 0, marketcap2
from DB2
order by industry, year, size, dclrdt;

data up;
set db; by industry year;
retain p2 s2;
if first.year then call missing(p2, s2);
if source > 0 then do;
     match = coalesce(abs(size - s2), 999999);
     output;
     end;
else do;
     p2 = permno; s2 = size;
     end;
run;

proc sort data=db; by industry year descending size descending dclrdt; run;

data down;
set db; by industry year;
retain p2 s2;
if first.year then call missing(p2, s2);
if source > 0 then do;
     match = coalesce(abs(size - s2), 999999);
     output;
     end;
else do;
     p2 = permno; s2 = size;
     end;
run;

proc sql;
create table DB3 as
select U.industry, U.year, U.permno as permno1, U.dclrdt, U.size as marketcap1,
     case when U.match < D.match then U.p2 else D.p2 end as permno2,
     case when U.match < D.match then U.s2 else D.s2 end as marketcap2
from up as U inner join down as D on U.source = D.source
order by U.source;


drop table DB1S, db, up, down;
quit;

If duplicates are not allowed, the problem becomes a much harder optimization challenge.

PG

PG
redrover99
Calcite | Level 5

Thanks, allowing duplicates is fine. I like the database 3 I am receiving, however I wonder is it possible to tweak the code to add one more restriction?

So matching the data on 4 levels of criterion:

-Industry (must be the same number)

-Year (must be the same year)

-Size ( the closet match available )

-BM ratio ( the closet match available given matching industry, matching year and closet size)

By adding one more restriction this reduces the number of observations in the event firm DB1 to 5085. Still looking to match each observation with a non event firm db2 thankfully this database is still rather large about 1mill observations for SAS to select from.

So the final product would be much the same:

Database 3: Matched database. This contains 5085 event firms and its counterpart, the matched non event firm.  DB3 holds the following headings: Permno1, Dclrdt, marketcap1, Year1 , industry1, mbratio1, Permno2, Marketcap2, year2, industry2 and mbratio2.

I have attached the MBratio data with this post. see below

PGStats
Opal | Level 21

The algorithm implemented in the code is based on sorting. Sorting brings matching sizes in consecutive order. Unless there is a way to combine size and MBratio to produce a unique order, this simple algorithm cannot be adapted to more than one matching criteria.

PG

PG
redrover99
Calcite | Level 5

Hmm, what if I multiplied size and MB ratio in each database, to obtain a "size_MBratio" variable? Then matched DB1 and DB2 on the "size_MBratio"? Could the code provided then potentially match on size and BM ratio?

To the readers of this post, I am very eager to match on 4 levels of criterion, if you have this programming capabilities to help us out, please do! This requires high level programming skills.

PGStats
Opal | Level 21

I tried matching with the product of marketcap and bmratio, this is the code :

/* Add a unique sequence number to DB1 */
data DB1S;
set DB1; source = _n_;
run;

/* Concatenate DB1 and DB2, use the variable SOURCE to distinguish the
source of each observation, sort the resulting dataset by size=marketcap*bmratio */
proc sql;
create table db as
select source, industry1 as industry, year1 as year, permno1 as permno, dclrdt,
marketcap1 as marketcap, bmratio1 as bmratio, marketcap1*bmratio1 as size
from DB1S
union all
select 0, industry2, year2, permno2, 0, marketcap2, bmratio2, marketcap2*bmratio2
from DB2
order by industry, year, size, dclrdt;

/* Scan the sorted data and associate the last observation met from DB2 to each
observation from DB1 */
data up;
set db; by industry year;
retain p2 m2 b2 s2;
if first.year then call missing(p2, m2, b2, s2);
if source > 0 then do;
match = coalesce(abs(size - s2), 999999);
output;
end;
else do;
p2 = permno; m2=marketcap; b2=bmratio; s2 = size;
end;
run;

/* reverse the sort order within each industry-year */
proc sort data=db; by industry year descending size descending dclrdt; run;

/* Scan the sorted data and associate the last observation met from DB2 to each
observation from DB1 */
data down;
set db; by industry year;
retain p2 m2 b2 s2;
if first.year then call missing(p2, m2, b2, s2);
if source > 0 then do;
match = coalesce(abs(size - s2), 999999);
output;
end;
else do;
p2 = permno; m2=marketcap; b2=bmratio; s2 = size;
end;
run;

/* Merge datasets UP and DOWN, keeping the closest match between the size immediately
before and the size immediately after each DB1 observation. Reestablish the original
DB1 order */
proc sql;
create table DB4 as
select
U.industry label="Industry",
U.year label="Year",
U.permno as permno1 label="PermNo1",
U.dclrdt,
case when U.match < D.match then U.p2 else D.p2 end as permno2 label="PermNo2",
U.marketcap as marketcap1 label="MarketCap1",
case when U.match < D.match then U.m2 else D.m2 end as marketcap2 label="MarketCap2",
U.bmratio as bmratio1 label="BMratio1", 
case when U.match < D.match then U.b2 else D.b2 end as bmratio2 label="BMratio2"
from up as U inner join down as D on U.source = D.source
order by U.source;
drop table DB1S, db, up, down;
quit;

Compared to the original results (BD3), I don't find these results very satisfactory. I will give it a shot with the brute force approach, using some kind of euclidian distance as matching criteria. Contrary to first impression, it might be manageable within industry and year.

PG

PG
PGStats
Opal | Level 21

You'll get decent results using brute force (takes about 15 minutes to run on my modest machine) :

/* Add a unique sequence number to DB1 */
data DB1S;
set DB1;

source = _n_;
run;

/* Considering the distributions of marketcap1 and BMratio1, combine the
difference in log10(marketcap) and twice the difference in BMratio
as the components of distance between DB1 and DB2 observations.
Find the pairs with the lowest distances.

Takes about 15 min to run on a 3 GHz Intel processor with 2Gb memory */


proc sql;
create table matches as
select db1s.*, db2.*,
     log10(marketcap1/marketcap2)**2 + 4*(BMratio1-BMratio2)**2 as dist
from db1s inner join db2 on industry1=industry2 and year1=year2
group by source
having dist=min(dist);
quit;

/* Remove duplicates originating from duplicates in DB2 */
proc sort data=matches noduprecs out=DB5; by source; run;

PG

PG
redrover99
Calcite | Level 5

PGStats you are a commendable mathematician! Thank you

mkeintz
PROC Star

You said:

  "Size (no limits, just the closet match available for the given industry and size)"

Did you mean "given industry and year"?.

I think you want, for each record in DB 1, a single record from DB2 with the "closest match" in size, given it has the same industry and year.  I.e. this is NOT a random selection, am I correct? 

As stated, that would be easy (even the possibility of tied SIZE in DB2 could be easily addressed).

The real problem is whether you are willing to have the same DB2 record ending up being found as the closest to multiple DB1 records (i.e. "sampling with replacement").  I doubt you would want that.

So if you don't want that, my question to you is what rule do you want to use when a given DB2 record is the closest to multiple DB1 recs?

And before you say assign the DB2 record to the DB1 it is closest to, and then go get matches for the other DB1 recs, consider this scenario:

db1rec  db1size                 db2rec  db2size
  1a          100                        2a         51
  1b          200                        2b       252

2a is closest to both, but if you assign 2a to 1b (it's closer to 1b), then that leaves 2b (252) assigned to 1a (100) - probably not what you would want.   OTOH, introducing some criterion like minimizing the total of all size differences (in a given industry/year group) would be, to me, a difficult programming task.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 2843 views
  • 14 likes
  • 4 in conversation