Solved: Merging Two Datasets with Characters

finans_sas · Posted 10-31-2011 07:49 PM

Dear All,

I have two sas files and the only common variable between these two datasets is the company names. Nevertheless, these names are not exactly identical. For example;

dataset1 is the following form:

name

Amazon Inc

Google Inc

Hospitals Inc

Kellwood Company

dataset2 is the following form:

name

Amazon Inc

Google

Hospitals

Kellwood Inc

I have about 100,000 obs in the first dataset and 10,000 obs in the second dataset. In such a case, I obtain only one match (Amazon Inc) when I merge them. ( i.e. merge dataset1 dataset2; by name)

I was wondering if there was a way to have SAS recognize these names as identical. I would very much appreciate your help on this issue.

Thanks in advance

Linlin · Posted 11-01-2011 02:10 PM

Thanks art297!

my sas code:

data one;
informat name $50. date yymmdd10.;
input name & date purchase sale;
newname=upcase(scan(name,1));
format date yymmdd10.;
cards;
Amazon Inc        20061017           1             0
Amazon Inc       20070808            0              1
Google Inc         20091212            0              1
Google Inc         20091214            0              1
Hospitals Inc      20070425            1              0
Kellwood Co       20100621            1              0
run;
proc sort;
by newname;

data two;
informat name $50.;
input name & permo;
newname=upcase(scan(name,1));
cards;
Amazon Inc    121
Google            131
Hospitals         133
Kellwood Inc     147
run;
proc sort;
by newname;
run;
data want (drop=newname);
merge one (in=a) two;
by newname;
if a;
run;

View solution in original post

art297 · Posted 10-31-2011 08:15 PM

The expensive but least intensive staff time solution: the SAS component called dataflux.

However, if that isn't in your budget, take a look at: http://www.sconsig.com/sastips/tip00000.htm and some of the other fuzzy matching suggestions offered at that site.

finans_sas · Posted 10-31-2011 09:47 PM

Thank you so much for your fast response. I already started looking at the suggestions on that website. Hopefully, I will be able to figure it out.

art297 · Posted 10-31-2011 10:11 PM

I have a golden retriever who is telling me that I have more important things to do, thus I can only provide what I've done thus far. You can take the results of the following, turn them into a format (using the cntlnin option of proc format) and, I think, come close to what you want:

data dataset1;

informat name $50.;

input name &;

cards;

Amazon Inc

Google Inc

Hospitals Inc

Kellwood Company

;

data dataset2;

informat name $50.;

input name &;

cards;

Amazon Inc

Google

Hospitals

Kellwood Inc

;

data both;

set data:;

run;

/*get rid of duplicates*/

proc sort data=both nodupkey;

by name;

run;

/*** get number of records in lookup dataset ***/

data _null_;

if 0 then set both nobs=nobs; /*** no need to read dataset - just metadata ***/

CALL SYMPUT('NUMREC',nobs); /*** put # of records into NUMREC macro var ***/

stop; /*** stop, got number of records ***/

run;

/*** now read the lookup records into an array, and then read the querry records

and modify with tranwrd ***/

data matches (keep=name name2);

array Lookup(&numrec) $50; /*** create an array with same number of elements

as there are records in both ***/

i=0;

do until (eof1); /*** load the array with the lookup records ***/

set both end=eof1;

i+1;

lookup(i)=name;

end;

do until (eof2); /*** now read each querry record and apply compged function***/

set both end=eof2;

recnum+1;

do i=recnum+1 to &numrec;

if name ne lookup(i) then do;

if compged(name,lookup(i)) le 600 then do;

name2=lookup(i);

output;

end;

run;

My_SAS · Posted 11-01-2011 03:52 AM

data dataset ;
input names$ 1-30;
cards;
name
Amazon Inc
Google Inc
Hospitals Inc
Kellwood Company
;
run;
data dataset1 ;
input names$ 1-30;
cards;
name
Amazon Inc
Google
Hospitals
Kellwood Inc
;
run;

proc sort data=dataset;
by names;
run;

proc sort data=dataset1;
by names;
run;

data both;
merge dataset(in=a) dataset1(in=b);
by names;
if a=b;
run;
proc print;
run;

Ksharp · Posted 11-01-2011 04:04 AM

If the first word is the same in both datasets. then can merge these two tables based it.

finans_sas · Posted 11-01-2011 09:34 AM

Thank you so much for all of your kind answers. I already implemented art297's suggestion with some partial success. I will work through your suggestions again this afternoon. The issue I had was that the real data actually looks like the following (I apologize for not making it clear in the beginning):

dataset1 is the following form:

name date purchase sale

Amazon Inc 20061017 1 0

Amazon Inc 20070808 0 1

Google Inc 20091212 0 1

Google Inc 20091214 0 1

Hospitals Inc 20070425 1 0

Kellwood Co 20100621 1 0

dataset2 is the following form:

name permno

Amazon Inc 121

Google 131

Hospitals 133

Kellwood Inc 147

My goal is to have:

name date purchase sale permno

Amazon Inc 20061017 1 0 121

Amazon Inc 20070808 0 1 121

Google Inc 20091212 0 1 131

Google Inc 20091214 0 1 131

Hospitals Inc 20070425 1 0 133

Kellwood Co 20100621 1 0 147

The first word is the same most of the cases, but it is misleading (e.g. Goldman Sachs vs Goldman Sachs Equity Fund vs Goldman Sachs Securities-all unique ids).

I thank you so much again for your help and patience with me.

Linlin · Posted 11-01-2011 01:51 PM

Hi All,

I'd like to answer the question but I can't copy-paste my SAS code here. Any suggestions?

Thank you very much!

art297 · Posted 11-01-2011 01:55 PM

Take a look at: http://communities.sas.com/message/106343#106343

Linlin · Posted 11-01-2011 02:10 PM

Thanks art297!

my sas code:

data one;
informat name $50. date yymmdd10.;
input name & date purchase sale;
newname=upcase(scan(name,1));
format date yymmdd10.;
cards;
Amazon Inc        20061017           1             0
Amazon Inc       20070808            0              1
Google Inc         20091212            0              1
Google Inc         20091214            0              1
Hospitals Inc      20070425            1              0
Kellwood Co       20100621            1              0
run;
proc sort;
by newname;

data two;
informat name $50.;
input name & permo;
newname=upcase(scan(name,1));
cards;
Amazon Inc    121
Google            131
Hospitals         133
Kellwood Inc     147
run;
proc sort;
by newname;
run;
data want (drop=newname);
merge one (in=a) two;
by newname;
if a;
run;

finans_sas · Posted 11-01-2011 08:57 PM

Dear All,

I am very thankful to each one of you for your help and patience with me. The code that Linlin provided does exactly what I needed at this stage. I expect to run into further difficulties when the match should be done on the second or third part of the name, but at this stage, these solutions are a great leap forward for me and I now should be able to handle more complicated cases.

Best

Merging Two Datasets with Characters

Re: Merging Two Datasets with Characters

Merging Two Datasets with Characters

Merging Two Datasets with Characters

Merging Two Datasets with Characters

Merging Two Datasets with Characters

Merging Two Datasets with Characters

Merging Two Datasets with Characters

Merging Two Datasets with Characters

Merging Two Datasets with Characters

Re: Merging Two Datasets with Characters

Merging Two Datasets with Characters

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!