BookmarkSubscribeRSS Feed
Songchan
Calcite | Level 5

Hi guys,

 

I'm trying  to merge two data sets by ticker symbols, but ticker symbols are not unique to every company, they can be reused by other companies, so I also need to merge two data sets by company names, The problem is names for same company in different data sets may be different in formats (such as up or lower case, inc. or corp, with "-" or " " in two words), how could I solve this problem? Please have a look of below codes

 

Kind regards,

Songchan

proc sql;
     create table temp03 as
	 select distinct a.*, b.permno, b.comnam, b.ticker, b.date
	 from temp02 as a 
	      left join temp01a as b
		  on a.ticker=b.ticker;
quit; 
22 REPLIES 22
PeterClemmensen
Tourmaline | Level 20

Can you provide some sample data for us to work with?

Songchan
Calcite | Level 5
Hi, thanking for replying me. Please see attached documents
PeterClemmensen
Tourmaline | Level 20

.xlsx files are not too good for sample data. However, there are a few ways to do this. 

 

You could use the COMPGED or COMPLEV Functions. These are functions that compare strings and return some numeric constant of the 'distance' between the two strings. The smaller the returned constant, the more the two strings look alike.

Songchan
Calcite | Level 5

Hi,

What kind of sample data you wanna see? Because it's very hard for me to find which ticker has several companies to corresponded.

 

kiranv_
Rhodochrosite | Level 12

if everytime it is last word is one different then you can something like this may work

proc sql;
     create table temp03 as
	 select distinct a.*, b.permno, b.comnam, b.ticker as lookupticker, b.date
	 from temp02 as a 
	      left join temp01a as b
		  on upcase(scan(a.ticker,-1))=upcase(scan(b.ticker,-1))
		;  
quit; 

 

Songchan
Calcite | Level 5

Thank you for your reply,

 

I need two conditions now, one is with the same ticker, another is with similar company names (because i have to make sure they stand for the same company), the company names vary in many ways ( up or lower case, "blank between two words" or use "-")

PeterClemmensen
Tourmaline | Level 20

@kiranv_s code can be edited to look something like this. Does the logic that he describes work for you?

 

proc sql;
     create table temp03 as
	 select distinct a.*, b.permno, b.comnam, b.ticker as lookupticker, b.date
	 from temp02 as a 
	      left join temp01a as b
         on a.ticker=b.ticker
		   and upcase(scan(a.name,-1))=upcase(scan(b.name,-1))
		;  
quit; 
Songchan
Calcite | Level 5

It works to some extents, but still not a good match. I think maybe there is no better way.

kiranv_
Rhodochrosite | Level 12

 

 

Following things might help. 

 

1. you should have lookup table for this as suggested by @Reeza or you need to make a lookup table(with help of business users) and include new variations whenever you have new variations.

 

2. In absence of lookup table you need to have a program( small ETL code) which address this problem(business people might be able to help on this)  to tackle issues.  your code should also should  updates on regular basis as per business users

Otherwise It is almost impossible to fix this kind of issues.

 

3. Most importantly you need  to sit with your business users and if possible define the rules along with them to handle the data.  It would be easier if you can get some help from business users on this. Mostly senior members/business analysts in your Team will have good idea on this kind of things.

kiranv_
Rhodochrosite | Level 12

As @PeterClemmensen  explains a sample will be helpful, otherwise it would be difficult. It made a template for you. Just try to make different ways you can see data will be helpful. And also mention how your output should be from your input datasets

 

data temp02;
input ticker $  1-7  anotherval;
datalines;
AB INC 100
AC INC 200
;

data temp01a;
input ticker $ 1-7  anotherval1 9-11;
datalines;
AB INC  500
AC INC  200
AB      200
AD CORP 700
;
Songchan
Calcite | Level 5
data temp02;
input ticker $  1-7  anotherval;
datalines;ABC INC   ABC
AC INC    AC
A D INC AD
;
data temp01a; input ticker $ 1-7 anotherval1 9-11; datalines;abc inc ABC AC corp AC A-D corp AD;
the first column is company name, the second is ticker symbol. The result i desire is:
ABC INC   ABC   abc inc
AC INC    AC    AC corp
A D INC AD A-D corp
Another company names are various in someways, but they are the same companies and having the same ticker symbols




novinosrin
Tourmaline | Level 20

Hi @Songchan  I'm afraid none of the suggestions will work for this scenario

 

 inc. or corp,  This is really messy

with "-" or " " in two words) - This can be handled

 

Can you list all the possible messy stuff or is that all?

 

If this all, no biggie, but i seriously doubt if that's the case 

Songchan
Calcite | Level 5

Hi, thank you for your reply. Yes, that's all.

mkeintz
PROC Star

I am not going to suggest sas code to you, because I don't propose to convert your excel sheets into a sas data step for use in this forum.  However, if you do so, I'd be glad to give it a go.

 

I see the first spreadsheet is CRSP data (id variable PERMNO) and it includes a single DATE variable.  And the other data has a NAME and TICKER, together with a STARTDATE and ENDDATE.  In the other data set, is the specified ticker in use by that company for the entire date range from STARTDATE to ENDDATE? If so then match on ticker conditional on the CRSP date falling between the startdate and enddate of the other data set.

 

Now if TICKER in the other data set only represents the status on ENDDATE, then note that you can go back to the CRSP data and make a ticker date range.  CRSP publishes another data set, NAMES, which includes events such as ticker change.  This would allow you to determine the full crsp date range for a ticker (say TICKERSTART and TICKEREND) that includes the date you already have in your CRSP data.  Then you could match on ticker, condition on the ENDDATE from the other dataset falling between TICKERSTART and TICKEREND.

 

Then you would have all the easy matches.

 

Edit, due to unnoticed cut and paste result: 

which allows you to build a date range for a ticker belonging to a particular PERMNO.  Then you could do the same as above, but match on ticker, conditional on the ENDDATE of the other data set falling between the start and end of a ticker in CRSP.

 

 

that CRSP publishes other datasets, including a "names" dataset which will allow you to construct the actual time span a given ticker is assigned to a given PERMNO.  So you could enhance your CRSP data to include a ticker_start_date and ticker_end_date.   Then match on ticker, without having to struggle with trying to standardize company names.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 22 replies
  • 2751 views
  • 11 likes
  • 7 in conversation