Help using Base SAS procedures

Merging Two Datasets with Characters

Accepted Solution Solved
Reply
Contributor
Posts: 57
Accepted Solution

Merging Two Datasets with Characters

Dear All,

I have two sas files and the only common variable between these two datasets is the company names. Nevertheless, these names are not exactly identical. For example;

dataset1 is the following form:

name

Amazon Inc

Google Inc

Hospitals Inc

Kellwood Company

dataset2 is the following form:

name

Amazon Inc

Google

Hospitals

Kellwood Inc

I have about 100,000 obs in the first dataset and 10,000 obs in the second dataset. In such a case, I obtain only one match (Amazon Inc) when I merge them. ( i.e. merge dataset1 dataset2; by name)

I was wondering if there was a way to have SAS recognize these names as identical. I would very much appreciate your help on this issue.

Thanks in advance


Accepted Solutions
Solution
‎11-01-2011 02:10 PM
Super Contributor
Posts: 1,636

Re: Merging Two Datasets with Characters

Thanks  art297!

my sas code:

data one;
informat name $50. date yymmdd10.;
input name & date purchase sale;
newname=upcase(scan(name,1));
format date yymmdd10.;
cards;
Amazon Inc        20061017           1             0
Amazon Inc       20070808            0              1
Google Inc         20091212            0              1
Google Inc         20091214            0              1
Hospitals Inc      20070425            1              0
Kellwood Co       20100621            1              0
run;
proc sort;
by newname;

data two;
informat name $50.;
input name & permo;
newname=upcase(scan(name,1));
cards;
Amazon Inc    121
Google            131
Hospitals         133
Kellwood Inc     147
run;
proc sort;
by newname;
run;
data want (drop=newname);
merge one (in=a) two;
by newname;
if a;
run;

View solution in original post


All Replies
PROC Star
Posts: 7,492

Merging Two Datasets with Characters

Posted in reply to finans_sas

The expensive but least intensive staff time solution: the SAS component called dataflux.

However, if that isn't in your budget, take a look at: http://www.sconsig.com/sastips/tip00000.htm and some of the other fuzzy matching suggestions offered at that site.

Contributor
Posts: 57

Merging Two Datasets with Characters

Thank you so much for your fast response. I already started looking at the suggestions on that website. Hopefully, I will be able to figure it out.

PROC Star
Posts: 7,492

Merging Two Datasets with Characters

Posted in reply to finans_sas

I have a golden retriever who is telling me that I have more important things to do, thus I can only provide what I've done thus far.  You can take the results of the following, turn them into a format (using the cntlnin option of proc format) and, I think, come close to what you want:

data dataset1;

  informat name $50.;

  input name &;

  cards;

Amazon Inc

Google Inc

Hospitals Inc

Kellwood Company

;

data dataset2;

  informat name $50.;

  input name &;

  cards;

Amazon Inc

Google

Hospitals

Kellwood Inc

;

data both;

  set data:;

run;

/*get rid of duplicates*/

proc sort data=both nodupkey;

  by name;

run;

/*** get number of records in lookup dataset ***/

data _null_;

  if 0 then set both nobs=nobs;  /*** no need to read dataset - just metadata ***/

  CALL SYMPUT('NUMREC',nobs);      /*** put # of records into NUMREC macro var ***/

  stop;                            /*** stop, got number of records ***/

run;

/*** now read the lookup records into an array, and then read the querry records

     and modify with tranwrd ***/

data matches (keep=name name2);

  array Lookup(&numrec) $50; /*** create an array with same number of elements

                                  as there are records in both ***/

  i=0;

  do until (eof1);           /*** load the array with the lookup records ***/

    set both end=eof1;

    i+1;

    lookup(i)=name;

  end;

  do until (eof2); /*** now read each querry record and apply compged function***/

    set both end=eof2;

    recnum+1;

    do i=recnum+1 to &numrec;

            if name ne lookup(i) then do;

        if compged(name,lookup(i)) le 600 then do;

          name2=lookup(i);

          output;

        end;

      end;

    end;

  end;

run;

Contributor
Posts: 37

Merging Two Datasets with Characters

Posted in reply to finans_sas


data dataset ;
input names$ 1-30;
cards;
name
Amazon Inc
Google Inc
Hospitals Inc
Kellwood Company
;
run;
data dataset1 ;
input names$ 1-30;
cards;
name
Amazon Inc
Google
Hospitals
Kellwood Inc
;
run;

proc sort data=dataset;
by names;
run;

proc sort data=dataset1;
by names;
run;

data both;
merge dataset(in=a) dataset1(in=b);
by names;
if a=b;
run;
proc print;
run;

Super User
Posts: 10,044

Merging Two Datasets with Characters

Posted in reply to finans_sas

If the first word is the same in both datasets. then can merge these two tables based it.

Contributor
Posts: 57

Merging Two Datasets with Characters

Posted in reply to finans_sas

Thank you so much for all of your kind answers. I already implemented art297's suggestion with some partial success. I will work through your suggestions again this afternoon. The issue I had was that the real data actually looks like the following (I apologize for not making it clear in the beginning):

dataset1 is the following form:

name                date            purchase      sale

Amazon Inc        20061017           1             0

Amazon Inc       20070808            0              1

Google Inc         20091212            0              1

Google Inc         20091214            0              1

Hospitals Inc      20070425            1              0

Kellwood Co       20100621            1              0

dataset2 is the following form:

name             permno

Amazon Inc    121

Google            131

Hospitals         133

Kellwood Inc     147

My goal is to have:

name                date            purchase         sale     permno

Amazon Inc        20061017           1              0         121

Amazon Inc       20070808            0              1         121

Google Inc         20091212            0              1         131

Google Inc         20091214            0              1         131

Hospitals Inc      20070425            1              0         133

Kellwood Co       20100621            1              0         147 

The first word is the same most of the cases, but it is misleading (e.g. Goldman Sachs vs Goldman Sachs Equity Fund vs Goldman Sachs Securities-all unique ids).

I thank you so much again for your help and patience with me.

Super Contributor
Posts: 1,636

Merging Two Datasets with Characters

Posted in reply to finans_sas

Hi All,

I'd like to answer the question but I can't copy-paste my SAS code here. Any suggestions? 

Thank you very much!

PROC Star
Posts: 7,492

Merging Two Datasets with Characters

Solution
‎11-01-2011 02:10 PM
Super Contributor
Posts: 1,636

Re: Merging Two Datasets with Characters

Thanks  art297!

my sas code:

data one;
informat name $50. date yymmdd10.;
input name & date purchase sale;
newname=upcase(scan(name,1));
format date yymmdd10.;
cards;
Amazon Inc        20061017           1             0
Amazon Inc       20070808            0              1
Google Inc         20091212            0              1
Google Inc         20091214            0              1
Hospitals Inc      20070425            1              0
Kellwood Co       20100621            1              0
run;
proc sort;
by newname;

data two;
informat name $50.;
input name & permo;
newname=upcase(scan(name,1));
cards;
Amazon Inc    121
Google            131
Hospitals         133
Kellwood Inc     147
run;
proc sort;
by newname;
run;
data want (drop=newname);
merge one (in=a) two;
by newname;
if a;
run;

Contributor
Posts: 57

Merging Two Datasets with Characters

Posted in reply to finans_sas

Dear All,

I am very thankful to each one of you for your help and patience with me. The code that Linlin provided does exactly what I needed at this stage. I expect to run into further difficulties when the match should be done on the second or third part of the name, but at this stage, these solutions are a great leap forward for me and I now should be able to handle more complicated cases.

Best

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 10 replies
  • 226 views
  • 8 likes
  • 5 in conversation