Dear All,
I have two sas files and the only common variable between these two datasets is the company names. Nevertheless, these names are not exactly identical. For example;
dataset1 is the following form:
name
Amazon Inc
Google Inc
Hospitals Inc
Kellwood Company
dataset2 is the following form:
name
Amazon Inc
Hospitals
Kellwood Inc
I have about 100,000 obs in the first dataset and 10,000 obs in the second dataset. In such a case, I obtain only one match (Amazon Inc) when I merge them. ( i.e. merge dataset1 dataset2; by name)
I was wondering if there was a way to have SAS recognize these names as identical. I would very much appreciate your help on this issue.
Thanks in advance
Thanks art297!
my sas code:
data one;
informat name $50. date yymmdd10.;
input name & date purchase sale;
newname=upcase(scan(name,1));
format date yymmdd10.;
cards;
Amazon Inc 20061017 1 0
Amazon Inc 20070808 0 1
Google Inc 20091212 0 1
Google Inc 20091214 0 1
Hospitals Inc 20070425 1 0
Kellwood Co 20100621 1 0
run;
proc sort;
by newname;
data two;
informat name $50.;
input name & permo;
newname=upcase(scan(name,1));
cards;
Amazon Inc 121
Google 131
Hospitals 133
Kellwood Inc 147
run;
proc sort;
by newname;
run;
data want (drop=newname);
merge one (in=a) two;
by newname;
if a;
run;
The expensive but least intensive staff time solution: the SAS component called dataflux.
However, if that isn't in your budget, take a look at: http://www.sconsig.com/sastips/tip00000.htm and some of the other fuzzy matching suggestions offered at that site.
Thank you so much for your fast response. I already started looking at the suggestions on that website. Hopefully, I will be able to figure it out.
I have a golden retriever who is telling me that I have more important things to do, thus I can only provide what I've done thus far. You can take the results of the following, turn them into a format (using the cntlnin option of proc format) and, I think, come close to what you want:
data dataset1;
informat name $50.;
input name &;
cards;
Amazon Inc
Google Inc
Hospitals Inc
Kellwood Company
;
data dataset2;
informat name $50.;
input name &;
cards;
Amazon Inc
Hospitals
Kellwood Inc
;
data both;
set data:;
run;
/*get rid of duplicates*/
proc sort data=both nodupkey;
by name;
run;
/*** get number of records in lookup dataset ***/
data _null_;
if 0 then set both nobs=nobs; /*** no need to read dataset - just metadata ***/
CALL SYMPUT('NUMREC',nobs); /*** put # of records into NUMREC macro var ***/
stop; /*** stop, got number of records ***/
run;
/*** now read the lookup records into an array, and then read the querry records
and modify with tranwrd ***/
data matches (keep=name name2);
array Lookup(&numrec) $50; /*** create an array with same number of elements
as there are records in both ***/
i=0;
do until (eof1); /*** load the array with the lookup records ***/
set both end=eof1;
i+1;
lookup(i)=name;
end;
do until (eof2); /*** now read each querry record and apply compged function***/
set both end=eof2;
recnum+1;
do i=recnum+1 to &numrec;
if name ne lookup(i) then do;
if compged(name,lookup(i)) le 600 then do;
name2=lookup(i);
output;
end;
end;
end;
end;
run;
data dataset ;
input names$ 1-30;
cards;
name
Amazon Inc
Google Inc
Hospitals Inc
Kellwood Company
;
run;
data dataset1 ;
input names$ 1-30;
cards;
name
Amazon Inc
Google
Hospitals
Kellwood Inc
;
run;
proc sort data=dataset;
by names;
run;
proc sort data=dataset1;
by names;
run;
data both;
merge dataset(in=a) dataset1(in=b);
by names;
if a=b;
run;
proc print;
run;
If the first word is the same in both datasets. then can merge these two tables based it.
Thank you so much for all of your kind answers. I already implemented art297's suggestion with some partial success. I will work through your suggestions again this afternoon. The issue I had was that the real data actually looks like the following (I apologize for not making it clear in the beginning):
dataset1 is the following form:
name date purchase sale
Amazon Inc 20061017 1 0
Amazon Inc 20070808 0 1
Google Inc 20091212 0 1
Google Inc 20091214 0 1
Hospitals Inc 20070425 1 0
Kellwood Co 20100621 1 0
dataset2 is the following form:
name permno
Amazon Inc 121
Google 131
Hospitals 133
Kellwood Inc 147
My goal is to have:
name date purchase sale permno
Amazon Inc 20061017 1 0 121
Amazon Inc 20070808 0 1 121
Google Inc 20091212 0 1 131
Google Inc 20091214 0 1 131
Hospitals Inc 20070425 1 0 133
Kellwood Co 20100621 1 0 147
The first word is the same most of the cases, but it is misleading (e.g. Goldman Sachs vs Goldman Sachs Equity Fund vs Goldman Sachs Securities-all unique ids).
I thank you so much again for your help and patience with me.
Hi All,
I'd like to answer the question but I can't copy-paste my SAS code here. Any suggestions?
Thank you very much!
Take a look at: http://communities.sas.com/message/106343#106343
Thanks art297!
my sas code:
data one;
informat name $50. date yymmdd10.;
input name & date purchase sale;
newname=upcase(scan(name,1));
format date yymmdd10.;
cards;
Amazon Inc 20061017 1 0
Amazon Inc 20070808 0 1
Google Inc 20091212 0 1
Google Inc 20091214 0 1
Hospitals Inc 20070425 1 0
Kellwood Co 20100621 1 0
run;
proc sort;
by newname;
data two;
informat name $50.;
input name & permo;
newname=upcase(scan(name,1));
cards;
Amazon Inc 121
Google 131
Hospitals 133
Kellwood Inc 147
run;
proc sort;
by newname;
run;
data want (drop=newname);
merge one (in=a) two;
by newname;
if a;
run;
Dear All,
I am very thankful to each one of you for your help and patience with me. The code that Linlin provided does exactly what I needed at this stage. I expect to run into further difficulties when the match should be done on the second or third part of the name, but at this stage, these solutions are a great leap forward for me and I now should be able to handle more complicated cases.
Best
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.