Dear All,
I have two sas files and the only common variable between these two datasets is the company names. Nevertheless, these names are not exactly identical. For example;
dataset1 is the following form:
name
Amazon Inc
Google Inc
Hospitals Inc
Kellwood Company
dataset2 is the following form:
name
Amazon Inc
Hospitals
Kellwood Inc
I have about 100,000 obs in the first dataset and 10,000 obs in the second dataset. In such a case, I obtain only one match (Amazon Inc) when I merge them. ( i.e. merge dataset1 dataset2; by name)
I was wondering if there was a way to have SAS recognize these names as identical. I would very much appreciate your help on this issue.
Thanks in advance
Thanks art297!
my sas code:
data one;
informat name $50. date yymmdd10.;
input name & date purchase sale;
newname=upcase(scan(name,1));
format date yymmdd10.;
cards;
Amazon Inc 20061017 1 0
Amazon Inc 20070808 0 1
Google Inc 20091212 0 1
Google Inc 20091214 0 1
Hospitals Inc 20070425 1 0
Kellwood Co 20100621 1 0
run;
proc sort;
by newname;
data two;
informat name $50.;
input name & permo;
newname=upcase(scan(name,1));
cards;
Amazon Inc 121
Google 131
Hospitals 133
Kellwood Inc 147
run;
proc sort;
by newname;
run;
data want (drop=newname);
merge one (in=a) two;
by newname;
if a;
run;
The expensive but least intensive staff time solution: the SAS component called dataflux.
However, if that isn't in your budget, take a look at: http://www.sconsig.com/sastips/tip00000.htm and some of the other fuzzy matching suggestions offered at that site.
Thank you so much for your fast response. I already started looking at the suggestions on that website. Hopefully, I will be able to figure it out.
I have a golden retriever who is telling me that I have more important things to do, thus I can only provide what I've done thus far. You can take the results of the following, turn them into a format (using the cntlnin option of proc format) and, I think, come close to what you want:
data dataset1;
informat name $50.;
input name &;
cards;
Amazon Inc
Google Inc
Hospitals Inc
Kellwood Company
;
data dataset2;
informat name $50.;
input name &;
cards;
Amazon Inc
Hospitals
Kellwood Inc
;
data both;
set data:;
run;
/*get rid of duplicates*/
proc sort data=both nodupkey;
by name;
run;
/*** get number of records in lookup dataset ***/
data _null_;
if 0 then set both nobs=nobs; /*** no need to read dataset - just metadata ***/
CALL SYMPUT('NUMREC',nobs); /*** put # of records into NUMREC macro var ***/
stop; /*** stop, got number of records ***/
run;
/*** now read the lookup records into an array, and then read the querry records
and modify with tranwrd ***/
data matches (keep=name name2);
array Lookup(&numrec) $50; /*** create an array with same number of elements
as there are records in both ***/
i=0;
do until (eof1); /*** load the array with the lookup records ***/
set both end=eof1;
i+1;
lookup(i)=name;
end;
do until (eof2); /*** now read each querry record and apply compged function***/
set both end=eof2;
recnum+1;
do i=recnum+1 to &numrec;
if name ne lookup(i) then do;
if compged(name,lookup(i)) le 600 then do;
name2=lookup(i);
output;
end;
end;
end;
end;
run;
data dataset ;
input names$ 1-30;
cards;
name
Amazon Inc
Google Inc
Hospitals Inc
Kellwood Company
;
run;
data dataset1 ;
input names$ 1-30;
cards;
name
Amazon Inc
Google
Hospitals
Kellwood Inc
;
run;
proc sort data=dataset;
by names;
run;
proc sort data=dataset1;
by names;
run;
data both;
merge dataset(in=a) dataset1(in=b);
by names;
if a=b;
run;
proc print;
run;
If the first word is the same in both datasets. then can merge these two tables based it.
Thank you so much for all of your kind answers. I already implemented art297's suggestion with some partial success. I will work through your suggestions again this afternoon. The issue I had was that the real data actually looks like the following (I apologize for not making it clear in the beginning):
dataset1 is the following form:
name date purchase sale
Amazon Inc 20061017 1 0
Amazon Inc 20070808 0 1
Google Inc 20091212 0 1
Google Inc 20091214 0 1
Hospitals Inc 20070425 1 0
Kellwood Co 20100621 1 0
dataset2 is the following form:
name permno
Amazon Inc 121
Google 131
Hospitals 133
Kellwood Inc 147
My goal is to have:
name date purchase sale permno
Amazon Inc 20061017 1 0 121
Amazon Inc 20070808 0 1 121
Google Inc 20091212 0 1 131
Google Inc 20091214 0 1 131
Hospitals Inc 20070425 1 0 133
Kellwood Co 20100621 1 0 147
The first word is the same most of the cases, but it is misleading (e.g. Goldman Sachs vs Goldman Sachs Equity Fund vs Goldman Sachs Securities-all unique ids).
I thank you so much again for your help and patience with me.
Hi All,
I'd like to answer the question but I can't copy-paste my SAS code here. Any suggestions?
Thank you very much!
Take a look at: http://communities.sas.com/message/106343#106343
Thanks art297!
my sas code:
data one;
informat name $50. date yymmdd10.;
input name & date purchase sale;
newname=upcase(scan(name,1));
format date yymmdd10.;
cards;
Amazon Inc 20061017 1 0
Amazon Inc 20070808 0 1
Google Inc 20091212 0 1
Google Inc 20091214 0 1
Hospitals Inc 20070425 1 0
Kellwood Co 20100621 1 0
run;
proc sort;
by newname;
data two;
informat name $50.;
input name & permo;
newname=upcase(scan(name,1));
cards;
Amazon Inc 121
Google 131
Hospitals 133
Kellwood Inc 147
run;
proc sort;
by newname;
run;
data want (drop=newname);
merge one (in=a) two;
by newname;
if a;
run;
Dear All,
I am very thankful to each one of you for your help and patience with me. The code that Linlin provided does exactly what I needed at this stage. I expect to run into further difficulties when the match should be done on the second or third part of the name, but at this stage, these solutions are a great leap forward for me and I now should be able to handle more complicated cases.
Best
Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.
Register today!Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.