BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Almond
Calcite | Level 5


Hello all,

Could anyone help me on data matching?

I have a data set and separate it into two: sample and total control based on a dummy variable.

In the sample, I have about 5000 firm-year observations. In the total control, I have about 30,000 firm-year observation.

I need a performance (retrun on assets) based 1:1 matching if possible.

I tried following the proc sql, but get lost.

Please provide some guidance.

Your response is very much appreciated.

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

Are you trying to match on single or multiple characteristics?

View solution in original post

9 REPLIES 9
esjackso
Quartz | Level 8

Can you provide the code you are using? Perhaps a little more details about your issue and data would help as well.

EJ

Almond
Calcite | Level 5

Hello there,

Thank you for the quick response.

I have been trying to get the 1:1 matching based on (1)performance (ROA) ( within 30% of absolute difference), (2) the same industry and (3) year.

For example,

I created the Sample (5000 obs.) and the Totalcontrol (30,000 obs.) based on my dummy variable "Suspect";

Each observation has an id, year, industry (sic), ROA, and other variables.

I followed the

proc sql;

     create table potentialmatch1 as 

     select *

     from sample, totalcontrol

     where sample. sic = totalcontrol.sic;

quit;

It seems that I did not get it right. The output "potentialmatch1" has over 8 million observations.

Did I get a wrong start? I mean do I have to rename the variables in the Sample so that these two datasets have different variable names?

Thank you very much. I really appreciate it.

ballardw
Super User

Are you trying to match on single or multiple characteristics?

Almond
Calcite | Level 5

Hello there,

Thank you for the quick response.

I have been trying to get the 1:1 matching based on (1)performance (ROA) ( within 30% of absolute difference), (2) the same industry and (3) year.

For example,

I created the Sample (5000 obs.) and the Totalcontrol (30,000 obs.) based on my dummy variable "Suspect";

Each observation has an id, year, industry (sic), ROA, and other variables.

I followed the

proc sql;

     create table potentialmatch1 as 

     select *

     from sample, totalcontrol

     where sample. sic = totalcontrol.sic;

quit;

It seems that I did not get it right. The output "potentialmatch1" has over 8 million observations.

Did I get a wrong start? I mean do I have to rename the variables in the Sample so that these two datasets have different variable names?

Thank you very much. I really appreciate it.

esjackso
Quartz | Level 8

It looks like to me you are merging on just industry so I guessing you are getting a cartesian product for like industries between the two files. So here are the assumptions I am using to figure out what you want:

1) Both files were created from the same source file but the 5000 are the suspect = 1 (or what ever) and the totalcontrol is all the rest??

2) you dont want to match company to company but you want to find a like company from the non suspect pool that is like the companies in the suspect pool??? If so  you will probably get multiples and will have to have another step to pick the ones for each suspect that you want based on some criteria.

I would try something like the following (untested):

proc sql;

     create table potentialmatch1 as

     select a.year, a.sic, a.id, a.roa, b.id, b.roa

     from sample as a, totalcontrol as b

     where a. sic = b.sic and a.year = b.year and a.roa between (.7*b.roa) and (1.3*b.roa)

;

quit;

See if that helps

EJ

Almond
Calcite | Level 5

Hello EJ,

Thank you very much. It seems that it is working. As I change the performance criteria value, the number of observations in the "potentialmatch1" changes.

I am really new to sas programming. Please forgive my slow understanding.

For the "potentialmatch1" data we get from the proc sql step, does it include the sample and the possible match, or just the possible match? And I also notice some repeated observations, should I delete them?

Thanks for the help.

esjackso
Quartz | Level 8

So here so further explanation of what I did:

the comma in the from statement of sql means an INNER JOIN of the two datasets listed. Inner join return the variables youve asked for in the select statement (a. and b. indicate which dataset that piece comes from), where the two datasets match on the where criteria specified (in this case they have to match on industry year and within the roa range specified). If there are no matches then there will not be a row returned from the inner join (you could change to a LEFT JOIN (switch the "where" to an "on" and the comma to "left join") which would return at least one row for every observation in sample and multiple rows if there are multiple matches from totalcontrol (the single no match row would have the information from sample but empty fields for the totalcontrol variables).

To answer the question about deleting observations is harder not knowing your data well. If they are exact matches then you might have duplicates in one of the source files based on the three criteria given. You can add the "distinct" keyword right after the select which should get rid of exact duplicates. I imagine most are not exact duplicates because I am assuming the range on the roa is allowing multiple matches.

Hope that helps!

EJ

Astounding
PROC Star

If you have 5K records in 1 data set, and 30K records in the other, what does it mean that you need a 1:1 match?

hsharmas
Fluorite | Level 6

More details please as you are not clear. 1:1 match can be done if you have  a unique key.  What is the issue.?

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 1585 views
  • 0 likes
  • 5 in conversation