BookmarkSubscribeRSS Feed
PGStats
Opal | Level 21

Here is what I tried, using complev. Seems to work pretty well. You might have to experiment with the best distance limit (10 in this example)

 

libname t01 excel "&sasforum.\datasets\temp01a.xlsx";
libname t02 excel "&sasforum.\datasets\temp02.xlsx";

proc sql;
create table t01 as select * from t01.temp01a;
create table t02 as select * from t02.temp02;
quit;

proc sql;
create table names as
select ticker, upcase(comnam) as name from t01 
union
select ticker, upcase(name) as name from t02
order by ticker;
quit; 

data nameEqv;
array n{10} $60;
array e{10} $60;

do i = 1 by 1 until(last.ticker);
    set names; by ticker;
    n{i} = name;
    e{i} = name;
    do j = 1 to i-1;
        if complev(n{j}, n{i}, 10) < 10 then do;
            e{i} = e{j};
            leave;
            end;
        end;
    eqvName = e{i};
    output;
    end;

keep ticker name eqvName;
run;

proc sql;
create table t03 as
select 
    A.*,
    upcase(B.name) as otherName,
    B.issuerDescription,
    B.startDate,
    B.endDate
from 
    (select T01.*, n01.eqvName 
        from 
            T01 inner join 
            nameEqv as n01 
                on T01.ticker=n01.ticker and upcase(T01.comnam)=n01.name) as A inner join
    (select T02.*, n02.eqvName 
        from 
            T02 inner join 
            nameEqv as n02 
                on T02.ticker=n02.ticker and upcase(T02.name)=n02.name) as B
        on A.ticker=B.ticker and A.eqvName=B.eqvName;
quit;
PG
Songchan
Calcite | Level 5

Thank you for your reply. I never used macro before, could you explain what do the first two rows mean (i.e. what "&sasforum.\datasets "stands for) ?

PGStats
Opal | Level 21

The two libname statements and the first proc SQL step bring the data from your Excel files into two datasets : t01 and t02. The real work starts on the second proc SQL step where a list of all distinct ticker-name combinations from both datasets is assembled.

PG
Songchan
Calcite | Level 5

Thank you PG! That works! there are about 1,600s out of 2,889 observations matched. For the unmatched ones, do i have to match them manually or there is a better way?

PGStats
Opal | Level 21

Should there always be a match?

 

You could play with the edit distance limit (set to 10 above). As you increase the limit you will get more matches, but you risk also getting some false matches.

 

Cases involving name changes will have to be handled separately.

PG
Reeza
Super User

Isn’t there a data set somewhere that has ticker symbols and the dates they belong to a particular company? That would likely be a better way. Also, you’re going to miss name changes, same ticker but company rebrands under a different name. That’s more likely to occur IME than finding companies reusing the same symbol over time. 

 

I would do a distance calc like suggested and take the nearest match. 

 

For sample data, please try this, i know it’s not perfect but it’s a lot easier for people to work with. You should try to include as many variations of the issue as you can. 

 

https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat...

 


@Songchan wrote:

Hi guys,

 

I'm trying  to merge two data sets by ticker symbols, but ticker symbols are not unique to every company, they can be reused by other companies, so I also need to merge two data sets by company names, The problem is names for same company in different data sets may be different in formats (such as up or lower case, inc. or corp, with "-" or " " in two words), how could I solve this problem? Please have a look of below codes

 

Kind regards,

Songchan

proc sql;
     create table temp03 as
	 select distinct a.*, b.permno, b.comnam, b.ticker, b.date
	 from temp02 as a 
	      left join temp01a as b
		  on a.ticker=b.ticker;
quit; 

 

mkeintz
PROC Star

@Reeza

 

Yes, there is a data set for CRSP data (the first of the OP's two data sets), that allows construction of a date range connecting a particular TICKER to a particular PERMNO (the CRSP id variable). 

 

However the OP might not have such a table for the other source, which only uses company name, not an id variable.  As I'm sure you realize, that's why I asked the OP whether the ticker in the 2nd data set is correct for the stated date range, or even the ENDDATE (or STARTDATE).

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Reeza
Super User

If the second data set has any date though, it would help with the merge, or at least be better than a text merge.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 22 replies
  • 2757 views
  • 11 likes
  • 7 in conversation