Help using Base SAS procedures

Help on data matching

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 6
Accepted Solution

Help on data matching


Hello all,

Could anyone help me on data matching?

I have a data set and separate it into two: sample and total control based on a dummy variable.

In the sample, I have about 5000 firm-year observations. In the total control, I have about 30,000 firm-year observation.

I need a performance (retrun on assets) based 1:1 matching if possible.

I tried following the proc sql, but get lost.

Please provide some guidance.

Your response is very much appreciated.


Accepted Solutions
Solution
‎10-03-2013 10:54 AM
Super User
Posts: 10,507

Re: Help on data matching

Are you trying to match on single or multiple characteristics?

View solution in original post


All Replies
Super Contributor
Posts: 333

Re: Help on data matching

Can you provide the code you are using? Perhaps a little more details about your issue and data would help as well.

EJ

Occasional Contributor
Posts: 6

Re: Help on data matching

Hello there,

Thank you for the quick response.

I have been trying to get the 1:1 matching based on (1)performance (ROA) ( within 30% of absolute difference), (2) the same industry and (3) year.

For example,

I created the Sample (5000 obs.) and the Totalcontrol (30,000 obs.) based on my dummy variable "Suspect";

Each observation has an id, year, industry (sic), ROA, and other variables.

I followed the

proc sql;

     create table potentialmatch1 as 

     select *

     from sample, totalcontrol

     where sample. sic = totalcontrol.sic;

quit;

It seems that I did not get it right. The output "potentialmatch1" has over 8 million observations.

Did I get a wrong start? I mean do I have to rename the variables in the Sample so that these two datasets have different variable names?

Thank you very much. I really appreciate it.

Solution
‎10-03-2013 10:54 AM
Super User
Posts: 10,507

Re: Help on data matching

Are you trying to match on single or multiple characteristics?

Occasional Contributor
Posts: 6

Re: Help on data matching

Hello there,

Thank you for the quick response.

I have been trying to get the 1:1 matching based on (1)performance (ROA) ( within 30% of absolute difference), (2) the same industry and (3) year.

For example,

I created the Sample (5000 obs.) and the Totalcontrol (30,000 obs.) based on my dummy variable "Suspect";

Each observation has an id, year, industry (sic), ROA, and other variables.

I followed the

proc sql;

     create table potentialmatch1 as 

     select *

     from sample, totalcontrol

     where sample. sic = totalcontrol.sic;

quit;

It seems that I did not get it right. The output "potentialmatch1" has over 8 million observations.

Did I get a wrong start? I mean do I have to rename the variables in the Sample so that these two datasets have different variable names?

Thank you very much. I really appreciate it.

Super Contributor
Posts: 333

Re: Help on data matching

It looks like to me you are merging on just industry so I guessing you are getting a cartesian product for like industries between the two files. So here are the assumptions I am using to figure out what you want:

1) Both files were created from the same source file but the 5000 are the suspect = 1 (or what ever) and the totalcontrol is all the rest??

2) you dont want to match company to company but you want to find a like company from the non suspect pool that is like the companies in the suspect pool??? If so  you will probably get multiples and will have to have another step to pick the ones for each suspect that you want based on some criteria.

I would try something like the following (untested):

proc sql;

     create table potentialmatch1 as

     select a.year, a.sic, a.id, a.roa, b.id, b.roa

     from sample as a, totalcontrol as b

     where a. sic = b.sic and a.year = b.year and a.roa between (.7*b.roa) and (1.3*b.roa)

;

quit;

See if that helps

EJ

Occasional Contributor
Posts: 6

Re: Help on data matching

Hello EJ,

Thank you very much. It seems that it is working. As I change the performance criteria value, the number of observations in the "potentialmatch1" changes.

I am really new to sas programming. Please forgive my slow understanding.

For the "potentialmatch1" data we get from the proc sql step, does it include the sample and the possible match, or just the possible match? And I also notice some repeated observations, should I delete them?

Thanks for the help.

Super Contributor
Posts: 333

Re: Help on data matching

So here so further explanation of what I did:

the comma in the from statement of sql means an INNER JOIN of the two datasets listed. Inner join return the variables youve asked for in the select statement (a. and b. indicate which dataset that piece comes from), where the two datasets match on the where criteria specified (in this case they have to match on industry year and within the roa range specified). If there are no matches then there will not be a row returned from the inner join (you could change to a LEFT JOIN (switch the "where" to an "on" and the comma to "left join") which would return at least one row for every observation in sample and multiple rows if there are multiple matches from totalcontrol (the single no match row would have the information from sample but empty fields for the totalcontrol variables).

To answer the question about deleting observations is harder not knowing your data well. If they are exact matches then you might have duplicates in one of the source files based on the three criteria given. You can add the "distinct" keyword right after the select which should get rid of exact duplicates. I imagine most are not exact duplicates because I am assuming the range on the roa is allowing multiple matches.

Hope that helps!

EJ

Super User
Posts: 5,085

Re: Help on data matching

If you have 5K records in 1 data set, and 30K records in the other, what does it mean that you need a 1:1 match?

Occasional Contributor
Posts: 15

Re: Help on data matching

More details please as you are not clear. 1:1 match can be done if you have  a unique key.  What is the issue.?

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 9 replies
  • 715 views
  • 0 likes
  • 5 in conversation