match two data sets with company names - Page 2

PGStats · Posted 01-05-2019 06:46 PM

Here is what I tried, using complev. Seems to work pretty well. You might have to experiment with the best distance limit (10 in this example)

libname t01 excel "&sasforum.\datasets\temp01a.xlsx";
libname t02 excel "&sasforum.\datasets\temp02.xlsx";

proc sql;
create table t01 as select * from t01.temp01a;
create table t02 as select * from t02.temp02;
quit;

proc sql;
create table names as
select ticker, upcase(comnam) as name from t01 
union
select ticker, upcase(name) as name from t02
order by ticker;
quit; 

data nameEqv;
array n{10} $60;
array e{10} $60;

do i = 1 by 1 until(last.ticker);
    set names; by ticker;
    n{i} = name;
    e{i} = name;
    do j = 1 to i-1;
        if complev(n{j}, n{i}, 10) < 10 then do;
            e{i} = e{j};
            leave;
            end;
        end;
    eqvName = e{i};
    output;
    end;

keep ticker name eqvName;
run;

proc sql;
create table t03 as
select 
    A.*,
    upcase(B.name) as otherName,
    B.issuerDescription,
    B.startDate,
    B.endDate
from 
    (select T01.*, n01.eqvName 
        from 
            T01 inner join 
            nameEqv as n01 
                on T01.ticker=n01.ticker and upcase(T01.comnam)=n01.name) as A inner join
    (select T02.*, n02.eqvName 
        from 
            T02 inner join 
            nameEqv as n02 
                on T02.ticker=n02.ticker and upcase(T02.name)=n02.name) as B
        on A.ticker=B.ticker and A.eqvName=B.eqvName;
quit;

PG

Songchan · Posted 01-05-2019 07:03 PM

Thank you for your reply. I never used macro before, could you explain what do the first two rows mean (i.e. what "&sasforum.\datasets "stands for) ?

PGStats · Posted 01-06-2019 12:19 AM

The two libname statements and the first proc SQL step bring the data from your Excel files into two datasets : t01 and t02. The real work starts on the second proc SQL step where a list of all distinct ticker-name combinations from both datasets is assembled.

PG

Songchan · Posted 01-06-2019 10:41 AM

Thank you PG! That works! there are about 1,600s out of 2,889 observations matched. For the unmatched ones, do i have to match them manually or there is a better way?

PGStats · Posted 01-06-2019 02:24 PM

Should there always be a match?

You could play with the edit distance limit (set to 10 above). As you increase the limit you will get more matches, but you risk also getting some false matches.

Cases involving name changes will have to be handled separately.

PG

Reeza · Posted 01-05-2019 05:21 PM

Isn’t there a data set somewhere that has ticker symbols and the dates they belong to a particular company? That would likely be a better way. Also, you’re going to miss name changes, same ticker but company rebrands under a different name. That’s more likely to occur IME than finding companies reusing the same symbol over time.

I would do a distance calc like suggested and take the nearest match.

For sample data, please try this, i know it’s not perfect but it’s a lot easier for people to work with. You should try to include as many variations of the issue as you can.

https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat...

@Songchan wrote:

Hi guys,

I'm trying to merge two data sets by ticker symbols, but ticker symbols are not unique to every company, they can be reused by other companies, so I also need to merge two data sets by company names, The problem is names for same company in different data sets may be different in formats (such as up or lower case, inc. or corp, with "-" or " " in two words), how could I solve this problem? Please have a look of below codes

Kind regards,

Songchan
proc sql;
     create table temp03 as
	 select distinct a.*, b.permno, b.comnam, b.ticker, b.date
	 from temp02 as a 
	      left join temp01a as b
		  on a.ticker=b.ticker;
quit; 

mkeintz · Posted 01-06-2019 01:42 PM

@Reeza

Yes, there is a data set for CRSP data (the first of the OP's two data sets), that allows construction of a date range connecting a particular TICKER to a particular PERMNO (the CRSP id variable).

However the OP might not have such a table for the other source, which only uses company name, not an id variable. As I'm sure you realize, that's why I asked the OP whether the ticker in the 2nd data set is correct for the stated date range, or even the ENDDATE (or STARTDATE).

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Reeza · Posted 01-06-2019 06:36 PM

If the second data set has any date though, it would help with the merge, or at least be better than a text merge.

Re: match two data sets with company names

Re: match two data sets with company names

Re: match two data sets with company names

Re: match two data sets with company names

Re: match two data sets with company names

Re: match two data sets with company names

Re: match two data sets with company names

Re: match two data sets with company names

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!