BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
finans_sas
Quartz | Level 8

Dear All,

I have two sas files and the only common variable between these two datasets is the company names. Nevertheless, these names are not exactly identical. For example;

dataset1 is the following form:

name

Amazon Inc

Google Inc

Hospitals Inc

Kellwood Company

dataset2 is the following form:

name

Amazon Inc

Google

Hospitals

Kellwood Inc

I have about 100,000 obs in the first dataset and 10,000 obs in the second dataset. In such a case, I obtain only one match (Amazon Inc) when I merge them. ( i.e. merge dataset1 dataset2; by name)

I was wondering if there was a way to have SAS recognize these names as identical. I would very much appreciate your help on this issue.

Thanks in advance

1 ACCEPTED SOLUTION

Accepted Solutions
Linlin
Lapis Lazuli | Level 10

Thanks  art297!

my sas code:

data one;
informat name $50. date yymmdd10.;
input name & date purchase sale;
newname=upcase(scan(name,1));
format date yymmdd10.;
cards;
Amazon Inc        20061017           1             0
Amazon Inc       20070808            0              1
Google Inc         20091212            0              1
Google Inc         20091214            0              1
Hospitals Inc      20070425            1              0
Kellwood Co       20100621            1              0
run;
proc sort;
by newname;

data two;
informat name $50.;
input name & permo;
newname=upcase(scan(name,1));
cards;
Amazon Inc    121
Google            131
Hospitals         133
Kellwood Inc     147
run;
proc sort;
by newname;
run;
data want (drop=newname);
merge one (in=a) two;
by newname;
if a;
run;

View solution in original post

10 REPLIES 10
art297
Opal | Level 21

The expensive but least intensive staff time solution: the SAS component called dataflux.

However, if that isn't in your budget, take a look at: http://www.sconsig.com/sastips/tip00000.htm and some of the other fuzzy matching suggestions offered at that site.

finans_sas
Quartz | Level 8

Thank you so much for your fast response. I already started looking at the suggestions on that website. Hopefully, I will be able to figure it out.

art297
Opal | Level 21

I have a golden retriever who is telling me that I have more important things to do, thus I can only provide what I've done thus far.  You can take the results of the following, turn them into a format (using the cntlnin option of proc format) and, I think, come close to what you want:

data dataset1;

  informat name $50.;

  input name &;

  cards;

Amazon Inc

Google Inc

Hospitals Inc

Kellwood Company

;

data dataset2;

  informat name $50.;

  input name &;

  cards;

Amazon Inc

Google

Hospitals

Kellwood Inc

;

data both;

  set data:;

run;

/*get rid of duplicates*/

proc sort data=both nodupkey;

  by name;

run;

/*** get number of records in lookup dataset ***/

data _null_;

  if 0 then set both nobs=nobs;  /*** no need to read dataset - just metadata ***/

  CALL SYMPUT('NUMREC',nobs);      /*** put # of records into NUMREC macro var ***/

  stop;                            /*** stop, got number of records ***/

run;

/*** now read the lookup records into an array, and then read the querry records

     and modify with tranwrd ***/

data matches (keep=name name2);

  array Lookup(&numrec) $50; /*** create an array with same number of elements

                                  as there are records in both ***/

  i=0;

  do until (eof1);           /*** load the array with the lookup records ***/

    set both end=eof1;

    i+1;

    lookup(i)=name;

  end;

  do until (eof2); /*** now read each querry record and apply compged function***/

    set both end=eof2;

    recnum+1;

    do i=recnum+1 to &numrec;

            if name ne lookup(i) then do;

        if compged(name,lookup(i)) le 600 then do;

          name2=lookup(i);

          output;

        end;

      end;

    end;

  end;

run;

My_SAS
Calcite | Level 5


data dataset ;
input names$ 1-30;
cards;
name
Amazon Inc
Google Inc
Hospitals Inc
Kellwood Company
;
run;
data dataset1 ;
input names$ 1-30;
cards;
name
Amazon Inc
Google
Hospitals
Kellwood Inc
;
run;

proc sort data=dataset;
by names;
run;

proc sort data=dataset1;
by names;
run;

data both;
merge dataset(in=a) dataset1(in=b);
by names;
if a=b;
run;
proc print;
run;

Ksharp
Super User

If the first word is the same in both datasets. then can merge these two tables based it.

finans_sas
Quartz | Level 8

Thank you so much for all of your kind answers. I already implemented art297's suggestion with some partial success. I will work through your suggestions again this afternoon. The issue I had was that the real data actually looks like the following (I apologize for not making it clear in the beginning):

dataset1 is the following form:

name                date            purchase      sale

Amazon Inc        20061017           1             0

Amazon Inc       20070808            0              1

Google Inc         20091212            0              1

Google Inc         20091214            0              1

Hospitals Inc      20070425            1              0

Kellwood Co       20100621            1              0

dataset2 is the following form:

name             permno

Amazon Inc    121

Google            131

Hospitals         133

Kellwood Inc     147

My goal is to have:

name                date            purchase         sale     permno

Amazon Inc        20061017           1              0         121

Amazon Inc       20070808            0              1         121

Google Inc         20091212            0              1         131

Google Inc         20091214            0              1         131

Hospitals Inc      20070425            1              0         133

Kellwood Co       20100621            1              0         147 

The first word is the same most of the cases, but it is misleading (e.g. Goldman Sachs vs Goldman Sachs Equity Fund vs Goldman Sachs Securities-all unique ids).

I thank you so much again for your help and patience with me.

Linlin
Lapis Lazuli | Level 10

Hi All,

I'd like to answer the question but I can't copy-paste my SAS code here. Any suggestions? 

Thank you very much!

Linlin
Lapis Lazuli | Level 10

Thanks  art297!

my sas code:

data one;
informat name $50. date yymmdd10.;
input name & date purchase sale;
newname=upcase(scan(name,1));
format date yymmdd10.;
cards;
Amazon Inc        20061017           1             0
Amazon Inc       20070808            0              1
Google Inc         20091212            0              1
Google Inc         20091214            0              1
Hospitals Inc      20070425            1              0
Kellwood Co       20100621            1              0
run;
proc sort;
by newname;

data two;
informat name $50.;
input name & permo;
newname=upcase(scan(name,1));
cards;
Amazon Inc    121
Google            131
Hospitals         133
Kellwood Inc     147
run;
proc sort;
by newname;
run;
data want (drop=newname);
merge one (in=a) two;
by newname;
if a;
run;

finans_sas
Quartz | Level 8

Dear All,

I am very thankful to each one of you for your help and patience with me. The code that Linlin provided does exactly what I needed at this stage. I expect to run into further difficulties when the match should be done on the second or third part of the name, but at this stage, these solutions are a great leap forward for me and I now should be able to handle more complicated cases.

Best

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 1202 views
  • 8 likes
  • 5 in conversation