Solved: Re: Matching without replacement

mspak · Posted 09-24-2012 04:22 AM

Dear all,

I matched a set of company (tem1_tic) with another set of company (tic) by year, industry (sic) and closest_size (see attached). However, there are many matched firms for every tem1_tic. I wish only have one matched firm for each tem1.tic.

I wish to select the first matched firms. However, a condition that I have to consider is that: some matched firms might be the matched firm of another tem1.tic. My question is that: how to ensure that I will have one unique matched firm for every tem1.tic with the following conditions?

1. the matched firm must not be selected as matched firm for another tem1.tic.

2. If there is the only (1) matched firm for a given tem1.tic, that matched firm must be selected as matched firm, in priority.

I am thinking to solve the problem by sorting the data into the number of matched firms, then select the firms according to the priority condition, with firms with lowest number of matched firms should be matched earlier.

I also wish to know firms that do not have any matched firm, as their match firm might have been a matched firm of another firms (in priority)?

Thank you.

Regards,

Mspak

Ksharp · Posted 09-25-2012 01:30 AM

OK. mspak .

The following code is worked for your sample data, but you need to check it more. I have no time to make it all right.

There is a special scenario maybe you need to consider, if two TEM1_tic ,tem1_fyear have the same 'tic','fyear' , which one you need to retain ?

0213B 2004 TAM 2005 <- 0213B 2004 has only one obs

0234A 2003 TAM 2005 <- 0234A 2003 has only one obs

libname x v9 'c:\temp';

proc sql;
create table x as
 select *,count(*) as n
  from x.temp
   group by TEM1_tic ,tem1_fyear
    order by calculated n,TEM1_tic ,tem1_fyear;
quit;

data want;
if _n_ eq 1 then do;
 if 0 then set x;
  declare hash ha();
   ha.definekey('tic','fyear');
   ha.definedone();
end;
 set x;
 by n TEM1_tic tem1_fyear ;
 retain found;
 if first.tem1_fyear then found=0;
 if not found and ha.check() ne 0 then do;found=1;ha.add();output;end;
run;


/* this is important to make sure all of TEM1_tic ,tem1_fyear have been involved.
if result is blank then all of them are involved.
For your sample data, it all been involved. i.e. the result is blank */
proc sql;
 select TEM1_tic,tem1_fyear from want
 except
 select TEM1_tic,tem1_fyear from x.temp;
quit;

Ksharp

View solution in original post

Patrick · Posted 09-24-2012 05:48 AM

What about loading your firms into a hash with option multidata? You then delete the entry in case of a match unless it is the last entry for this key.

You could also add something like a "match flag" field in your hash and in the endoutput all records where the flag is 0 (=firms which never matched).

Do you think you can code this on your own or do you need some sample code?

I just had a look at your sample data. This seems to be the result of your join. In case you're after code: Could you please post the 2 source data sets.

This closest_size: Is this a exact match or a match where the difference is minimal?

mspak · Posted 09-24-2012 10:44 AM

Hi Patrick,

Thanks for your suggestion. I have no idea on how to code it as I am not familiar with hash object. My objective is to match all the firms (tem1.tic) with ONE matched firm (tic) at my best effort.

The lg_asset is the firms size. I tried to matched firms with the smallest difference in firm size.

The following is my code prior to the sample data I posted previously:

data BENCH1(keep=fyear tic lg_asset sic1) BENCH0 (keep=fyear tic lg_asset sic1);

set geodata.exfinance_match;

if BENCH= 1 then output BENCH1;

if BENCH= 0 then output BENCH0;

run;

proc sql;

create table geodata.temp as

select a.tic as TEM1_tic, a.fyear as tem1_fyear, b.*,

abs(sum(a.LG_ASSET, -b.LG_ASSET)) as diffPS,

min(calculated diffPS) as closest_size

from BENCH1(where=(LG_ASSET is not missing)) a ,

BENCH0(where=(LG_ASSET is not missing)) b

group by a.tic, a.fyear

having diffPS = closest_size and a.sic1=b.sic1;

quit;

The data that I posted, after running the above programs; the firms, tem1.tic have been matched with tic based on the condition: diffPS = closest_size and a.sic1=b.sic1 grouped by a.tic, a.fyear.

This is stage that I wish to ensure every tem1.tic firm has a matched tic (for a given year ). I have to ensure that there is no duplication, that is, a particular tic cannot be matched with more than one tem1.tic. Firms which have no (little) alternative(s) matched firm should be given a priority to be matched earlier.

I hope that my explanation is clear now.

Thank you.

Regards,

MSPAK

Ksharp · Posted 09-24-2012 10:51 PM

well, mspak.

I just notice there are lots of obs have the same LG_ASSET , that is right ? Or you can simply remove these duplicated obs by using

proc sort data=xxx nodupkey; by TEM1_tic tem1_fyear ;run;

Otherwise, You should use Hash Table as Patrick suggested to get it ,but that is a tough thing.

Ksharp

mspak · Posted 09-24-2012 11:21 PM

Dear Ksharp,

It is right that there are many firms can be matched with a particular tem1.tic. If I were to delete the firms with duplicate of tem1.tic and tem1.fyear, I will get one matched firms for each tem1.tic for a given year (tem1.fyear). However, there is a possibility that the same matched firm (with tic and fyear) also a matched firm for another tem1.tic in given year (tem1.fyear).

At the end, I have to ensure there are unique pairs of firms. Once there is a duplicate matched firms, I have to eliminate the duplicates, the pairing procedure is not done properly.

Hope to get any suggestions on codes, if any.

Thank you.

Regards,

mspak

Ksharp · Posted 09-25-2012 01:30 AM

OK. mspak .

The following code is worked for your sample data, but you need to check it more. I have no time to make it all right.

There is a special scenario maybe you need to consider, if two TEM1_tic ,tem1_fyear have the same 'tic','fyear' , which one you need to retain ?

0213B 2004 TAM 2005 <- 0213B 2004 has only one obs

0234A 2003 TAM 2005 <- 0234A 2003 has only one obs

libname x v9 'c:\temp';

proc sql;
create table x as
 select *,count(*) as n
  from x.temp
   group by TEM1_tic ,tem1_fyear
    order by calculated n,TEM1_tic ,tem1_fyear;
quit;

data want;
if _n_ eq 1 then do;
 if 0 then set x;
  declare hash ha();
   ha.definekey('tic','fyear');
   ha.definedone();
end;
 set x;
 by n TEM1_tic tem1_fyear ;
 retain found;
 if first.tem1_fyear then found=0;
 if not found and ha.check() ne 0 then do;found=1;ha.add();output;end;
run;


/* this is important to make sure all of TEM1_tic ,tem1_fyear have been involved.
if result is blank then all of them are involved.
For your sample data, it all been involved. i.e. the result is blank */
proc sql;
 select TEM1_tic,tem1_fyear from want
 except
 select TEM1_tic,tem1_fyear from x.temp;
quit;

Ksharp

mspak · Posted 09-25-2012 02:55 AM

Thank you Ksharp,

It is workable!!!

Have a nice days!

Regards,

mspak

Registration is open

SAS Training: Just a Click Away