Hi guys,
I'm trying to merge two data sets by ticker symbols, but ticker symbols are not unique to every company, they can be reused by other companies, so I also need to merge two data sets by company names, The problem is names for same company in different data sets may be different in formats (such as up or lower case, inc. or corp, with "-" or " " in two words), how could I solve this problem? Please have a look of below codes
Kind regards,
Songchan
proc sql;
create table temp03 as
select distinct a.*, b.permno, b.comnam, b.ticker, b.date
from temp02 as a
left join temp01a as b
on a.ticker=b.ticker;
quit;
Can you provide some sample data for us to work with?
.xlsx files are not too good for sample data. However, there are a few ways to do this.
You could use the COMPGED or COMPLEV Functions. These are functions that compare strings and return some numeric constant of the 'distance' between the two strings. The smaller the returned constant, the more the two strings look alike.
Hi,
What kind of sample data you wanna see? Because it's very hard for me to find which ticker has several companies to corresponded.
if everytime it is last word is one different then you can something like this may work
proc sql;
create table temp03 as
select distinct a.*, b.permno, b.comnam, b.ticker as lookupticker, b.date
from temp02 as a
left join temp01a as b
on upcase(scan(a.ticker,-1))=upcase(scan(b.ticker,-1))
;
quit;
Thank you for your reply,
I need two conditions now, one is with the same ticker, another is with similar company names (because i have to make sure they stand for the same company), the company names vary in many ways ( up or lower case, "blank between two words" or use "-")
@kiranv_s code can be edited to look something like this. Does the logic that he describes work for you?
proc sql;
create table temp03 as
select distinct a.*, b.permno, b.comnam, b.ticker as lookupticker, b.date
from temp02 as a
left join temp01a as b
on a.ticker=b.ticker
and upcase(scan(a.name,-1))=upcase(scan(b.name,-1))
;
quit;
It works to some extents, but still not a good match. I think maybe there is no better way.
Following things might help.
1. you should have lookup table for this as suggested by @Reeza or you need to make a lookup table(with help of business users) and include new variations whenever you have new variations.
2. In absence of lookup table you need to have a program( small ETL code) which address this problem(business people might be able to help on this) to tackle issues. your code should also should updates on regular basis as per business users
Otherwise It is almost impossible to fix this kind of issues.
3. Most importantly you need to sit with your business users and if possible define the rules along with them to handle the data. It would be easier if you can get some help from business users on this. Mostly senior members/business analysts in your Team will have good idea on this kind of things.
As @PeterClemmensen explains a sample will be helpful, otherwise it would be difficult. It made a template for you. Just try to make different ways you can see data will be helpful. And also mention how your output should be from your input datasets
data temp02;
input ticker $ 1-7 anotherval;
datalines;
AB INC 100
AC INC 200
;
data temp01a;
input ticker $ 1-7 anotherval1 9-11;
datalines;
AB INC 500
AC INC 200
AB 200
AD CORP 700
;
data temp02;
input ticker $ 1-7 anotherval;
datalines;ABC INC ABC
AC INC AC
A D INC AD;
data temp01a;
input ticker $ 1-7 anotherval1 9-11;
datalines;abc inc ABC
AC corp AC
A-D corp AD;
the first column is company name, the second is ticker symbol. The result i desire is:
ABC INC ABC abc inc
AC INC AC AC corp
A D INC AD A-D corp
Another company names are various in someways, but they are the same companies and having the same ticker symbols
Hi @Songchan I'm afraid none of the suggestions will work for this scenario
inc. or corp, This is really messy
with "-" or " " in two words) - This can be handled
Can you list all the possible messy stuff or is that all?
If this all, no biggie, but i seriously doubt if that's the case
Hi, thank you for your reply. Yes, that's all.
I am not going to suggest sas code to you, because I don't propose to convert your excel sheets into a sas data step for use in this forum. However, if you do so, I'd be glad to give it a go.
I see the first spreadsheet is CRSP data (id variable PERMNO) and it includes a single DATE variable. And the other data has a NAME and TICKER, together with a STARTDATE and ENDDATE. In the other data set, is the specified ticker in use by that company for the entire date range from STARTDATE to ENDDATE? If so then match on ticker conditional on the CRSP date falling between the startdate and enddate of the other data set.
Now if TICKER in the other data set only represents the status on ENDDATE, then note that you can go back to the CRSP data and make a ticker date range. CRSP publishes another data set, NAMES, which includes events such as ticker change. This would allow you to determine the full crsp date range for a ticker (say TICKERSTART and TICKEREND) that includes the date you already have in your CRSP data. Then you could match on ticker, condition on the ENDDATE from the other dataset falling between TICKERSTART and TICKEREND.
Then you would have all the easy matches.
Edit, due to unnoticed cut and paste result:
which allows you to build a date range for a ticker belonging to a particular PERMNO. Then you could do the same as above, but match on ticker, conditional on the ENDDATE of the other data set falling between the start and end of a ticker in CRSP.
that CRSP publishes other datasets, including a "names" dataset which will allow you to construct the actual time span a given ticker is assigned to a given PERMNO. So you could enhance your CRSP data to include a ticker_start_date and ticker_end_date. Then match on ticker, without having to struggle with trying to standardize company names.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.