Solved: Re: Help on data matching

Almond · Posted 10-03-2013 09:57 AM

Hello all,

Could anyone help me on data matching?

I have a data set and separate it into two: sample and total control based on a dummy variable.

In the sample, I have about 5000 firm-year observations. In the total control, I have about 30,000 firm-year observation.

I need a performance (retrun on assets) based 1:1 matching if possible.

I tried following the proc sql, but get lost.

Please provide some guidance.

Your response is very much appreciated.

ballardw · Posted 10-03-2013 10:54 AM

Are you trying to match on single or multiple characteristics?

View solution in original post

esjackso · Posted 10-03-2013 10:38 AM

Can you provide the code you are using? Perhaps a little more details about your issue and data would help as well.

EJ

Almond · Posted 10-03-2013 12:52 PM

Hello there,

Thank you for the quick response.

I have been trying to get the 1:1 matching based on (1)performance (ROA) ( within 30% of absolute difference), (2) the same industry and (3) year.

For example,

I created the Sample (5000 obs.) and the Totalcontrol (30,000 obs.) based on my dummy variable "Suspect";

Each observation has an id, year, industry (sic), ROA, and other variables.

I followed the

proc sql;

create table potentialmatch1 as

select *

from sample, totalcontrol

where sample. sic = totalcontrol.sic;

quit;

It seems that I did not get it right. The output "potentialmatch1" has over 8 million observations.

Did I get a wrong start? I mean do I have to rename the variables in the Sample so that these two datasets have different variable names?

Thank you very much. I really appreciate it.

ballardw · Posted 10-03-2013 10:54 AM

Are you trying to match on single or multiple characteristics?

Almond · Posted 10-03-2013 12:50 PM

Hello there,

Thank you for the quick response.

I have been trying to get the 1:1 matching based on (1)performance (ROA) ( within 30% of absolute difference), (2) the same industry and (3) year.

For example,

I created the Sample (5000 obs.) and the Totalcontrol (30,000 obs.) based on my dummy variable "Suspect";

Each observation has an id, year, industry (sic), ROA, and other variables.

I followed the

proc sql;

create table potentialmatch1 as

select *

from sample, totalcontrol

where sample. sic = totalcontrol.sic;

quit;

It seems that I did not get it right. The output "potentialmatch1" has over 8 million observations.

Did I get a wrong start? I mean do I have to rename the variables in the Sample so that these two datasets have different variable names?

Thank you very much. I really appreciate it.

esjackso · Posted 10-03-2013 01:17 PM

It looks like to me you are merging on just industry so I guessing you are getting a cartesian product for like industries between the two files. So here are the assumptions I am using to figure out what you want:

1) Both files were created from the same source file but the 5000 are the suspect = 1 (or what ever) and the totalcontrol is all the rest??

2) you dont want to match company to company but you want to find a like company from the non suspect pool that is like the companies in the suspect pool??? If so you will probably get multiples and will have to have another step to pick the ones for each suspect that you want based on some criteria.

I would try something like the following (untested):

proc sql;

create table potentialmatch1 as

select a.year, a.sic, a.id, a.roa, b.id, b.roa

from sample as a, totalcontrol as b

where a. sic = b.sic and a.year = b.year and a.roa between (.7*b.roa) and (1.3*b.roa)

;

quit;

See if that helps

EJ

Almond · Posted 10-03-2013 03:02 PM

Hello EJ,

Thank you very much. It seems that it is working. As I change the performance criteria value, the number of observations in the "potentialmatch1" changes.

I am really new to sas programming. Please forgive my slow understanding.

For the "potentialmatch1" data we get from the proc sql step, does it include the sample and the possible match, or just the possible match? And I also notice some repeated observations, should I delete them?

Thanks for the help.

esjackso · Posted 10-03-2013 03:35 PM

So here so further explanation of what I did:

the comma in the from statement of sql means an INNER JOIN of the two datasets listed. Inner join return the variables youve asked for in the select statement (a. and b. indicate which dataset that piece comes from), where the two datasets match on the where criteria specified (in this case they have to match on industry year and within the roa range specified). If there are no matches then there will not be a row returned from the inner join (you could change to a LEFT JOIN (switch the "where" to an "on" and the comma to "left join") which would return at least one row for every observation in sample and multiple rows if there are multiple matches from totalcontrol (the single no match row would have the information from sample but empty fields for the totalcontrol variables).

To answer the question about deleting observations is harder not knowing your data well. If they are exact matches then you might have duplicates in one of the source files based on the three criteria given. You can add the "distinct" keyword right after the select which should get rid of exact duplicates. I imagine most are not exact duplicates because I am assuming the range on the roa is allowing multiple matches.

Hope that helps!

EJ

Astounding · Posted 10-03-2013 11:18 AM

If you have 5K records in 1 data set, and 30K records in the other, what does it mean that you need a 1:1 match?

hsharmas · Posted 10-03-2013 12:40 PM

More details please as you are not clear. 1:1 match can be done if you have a unique key. What is the issue.?

Registration is open

SAS Training: Just a Click Away