How to create unique ID if firm ID changes across years

Accepted Solution Solved
Reply
New Contributor
Posts: 2
Accepted Solution

How to create unique ID if firm ID changes across years

Hi all,

I have a firm-year panel data set, and I want to create a panel data set. However, some firms's ID changed across years for some reason, and firms' names are inconsistent across years. So, it's difficult to sort the firms.

I would like to create a unique ID as "Desired ID" for each firm, but it seems a data challenge to me. I really appreciate if you can help me and give me some hints. Thank you!

Below is an example.

NameYearIDDesired ID
ST. Lucys Firm19964005041712
ST. Lucys Firm19974005041712
ST. Lucys Firm19984005041712
ST. Lucy s Firm19994171241712
Saint Lucy s Firm20014171241712
Saint Lucy s Firm20024171241712
Saint Lucy s Firm20034171241712
Saint Lucy s Firm20044171241712
Sun Firm19965101751820
Sun Firm19975101751820
Sun Firm19985101751820
Sun Firm19995101751820
Sun County Firm20005101751820
Sun County Firm20015101751820
Sun County Firm20025182051820
Sun County Firm20035182051820

I also have another dataset that contains the timing the firm changes its ID, but again, firms' name may be inconsistent to the main dataset...

NameYearID
ST. Lucy's Firm199941712
Sun County Firm200251820

Accepted Solutions
Solution
‎04-26-2014 01:23 PM
PROC Star
Posts: 7,364

Re: How to create unique ID if firm ID changes across years

The task you are asking about is almost exactly like the one I had to solve for an Expert Panel at last year's MWSUG meeting (see: Expert Panel Solution MWSUG 2013-Tabachneck - sasCommunity )

See if the following accomplishes what you want:

data main;

  informat name $50.;

  informat id $5.;

  infile cards delimiter='09'x;

  input Name Year ID;

  cards;

ST. Lucys Firm 1996 40050

ST. Lucys Firm 1997 40050

ST. Lucys Firm 1998 40050

ST. Lucy s Firm 1999 41712

Saint Lucy s Firm 2001 41712

Saint Lucy s Firm 2002 41712

Saint Lucy s Firm 2003 41712

Saint Lucy s Firm 2004 41712

Sun Firm 1996 51017

Sun Firm 1997 51017

Sun Firm 1998 51017

Sun Firm 1999 51017

Sun County Firm 2000 51017

Sun County Firm 2001 51017

Sun County Firm 2002 51820

Sun County Firm 2003 51820

;

data changes;

  informat name $50.;

  informat id $5.;

  infile cards delimiter='09'x;

  input Name Year ID;

  cards;

ST. Lucy's Firm 1999 41712

Sun County Firm 2002 51820

;

/*Create partial main dataset*/

proc sort data=main(keep=Name)

          out=partmain nodupkey;

  by Name;

run;

/* get number of records in changes dataset */

data _null_;

if 0 then set changes nobs=nobs;

call symput('numrec',nobs);

stop;

run;

/*Create format dataset by reading firm names into an array and using

  Compged function to find closest matches for the main dataset names */

data fmtDataset (keep=fmtname start label type);

  retain fmtname '$firmid' type 'C';

  array firm(&numrec) $50.;

  array ids(&numrec) $5.;

  do i=1 to &numrec;                  /*load the array with bank names*/

    set changes;

    firm(i)=Name;

    ids(i)=id;

  end;

  do until (eof);    /* read banks names from summarized customer file*/

    set partmain (rename=(Name=start)) end=eof;

    lowscore=5000;

    do i=1 to &numrec;       /*find lowest generalized edit distance*/

      score= compged(start,firm(i));

      if score le lowscore then do;

        lowscore=score;

        closest=i;                    /*keep index with lowest value*/

      end;

     end;

    label=ids(closest);

    output;

  end;

run;

/*Create the necessary format*/

proc format cntlin=fmtDataset;

run;

/*recode firm ids*/

data newmain;

  set main (rename=(id=old_id));

  id=put(Name,$firmid.);

run;

View solution in original post


All Replies
Solution
‎04-26-2014 01:23 PM
PROC Star
Posts: 7,364

Re: How to create unique ID if firm ID changes across years

The task you are asking about is almost exactly like the one I had to solve for an Expert Panel at last year's MWSUG meeting (see: Expert Panel Solution MWSUG 2013-Tabachneck - sasCommunity )

See if the following accomplishes what you want:

data main;

  informat name $50.;

  informat id $5.;

  infile cards delimiter='09'x;

  input Name Year ID;

  cards;

ST. Lucys Firm 1996 40050

ST. Lucys Firm 1997 40050

ST. Lucys Firm 1998 40050

ST. Lucy s Firm 1999 41712

Saint Lucy s Firm 2001 41712

Saint Lucy s Firm 2002 41712

Saint Lucy s Firm 2003 41712

Saint Lucy s Firm 2004 41712

Sun Firm 1996 51017

Sun Firm 1997 51017

Sun Firm 1998 51017

Sun Firm 1999 51017

Sun County Firm 2000 51017

Sun County Firm 2001 51017

Sun County Firm 2002 51820

Sun County Firm 2003 51820

;

data changes;

  informat name $50.;

  informat id $5.;

  infile cards delimiter='09'x;

  input Name Year ID;

  cards;

ST. Lucy's Firm 1999 41712

Sun County Firm 2002 51820

;

/*Create partial main dataset*/

proc sort data=main(keep=Name)

          out=partmain nodupkey;

  by Name;

run;

/* get number of records in changes dataset */

data _null_;

if 0 then set changes nobs=nobs;

call symput('numrec',nobs);

stop;

run;

/*Create format dataset by reading firm names into an array and using

  Compged function to find closest matches for the main dataset names */

data fmtDataset (keep=fmtname start label type);

  retain fmtname '$firmid' type 'C';

  array firm(&numrec) $50.;

  array ids(&numrec) $5.;

  do i=1 to &numrec;                  /*load the array with bank names*/

    set changes;

    firm(i)=Name;

    ids(i)=id;

  end;

  do until (eof);    /* read banks names from summarized customer file*/

    set partmain (rename=(Name=start)) end=eof;

    lowscore=5000;

    do i=1 to &numrec;       /*find lowest generalized edit distance*/

      score= compged(start,firm(i));

      if score le lowscore then do;

        lowscore=score;

        closest=i;                    /*keep index with lowest value*/

      end;

     end;

    label=ids(closest);

    output;

  end;

run;

/*Create the necessary format*/

proc format cntlin=fmtDataset;

run;

/*recode firm ids*/

data newmain;

  set main (rename=(id=old_id));

  id=put(Name,$firmid.);

run;

New Contributor
Posts: 2

Re: How to create unique ID if firm ID changes across years

Thank you guys. I am trying to understand your codes and logic, and I'll try to extent it to my data.

I really appreciate your help!

PROC Star
Posts: 7,364

Re: How to create unique ID if firm ID changes across years

Let us know if you run into any problems.  It should be rather straight forward.

Respected Advisor
Posts: 3,900

Re: How to create unique ID if firm ID changes across years

Do you have by any chance SAS DataFlux / SAS Data Quality Server licensed? If so then I believe standardisation of company names comes "out-of-the-box" meaning that a lot of the information and rules required are already part of the Data Quality Knowledge Base.

Super User
Posts: 9,687

Re: How to create unique ID if firm ID changes across years

It is not a good solution, but you can try it.

data main;
  infile cards expandtabs;
  input Name & $40. Year ID;
  cards;
ST. Lucys Firm     1996     40050
ST. Lucys Firm     1997     40050     
ST. Lucys Firm     1998     40050     
ST. Lucy s Firm      1999     41712     
Saint Lucy s Firm     2001     41712     
Saint Lucy s Firm     2002     41712     
Saint Lucy s Firm     2003     41712     
Saint Lucy s Firm     2004     41712     
Sun Firm     1996     51017     
Sun Firm     1997     51017     
Sun Firm     1998     51017     
Sun Firm     1999     51017     
Sun County Firm      2000     51017     
Sun County Firm      2001     51017     
Sun County Firm      2002     51820     
Sun County Firm      2003     51820     
;

 

data changes;
  infile cards expandtabs;
  input Name  & $40. Year ID;
  cards;
ST. Lucy's Firm  1999  41712
Sun County Firm  2002  51820
;
run;
proc sql;
 create table want as
  select a.*,b.id as desired_id
   from main as a,changes as b
    group by a.name
     having spedis(a.name,b.name)=min(spedis(a.name,b.name));
quit;

Xia Keshan

PROC Star
Posts: 7,364

Re: How to create unique ID if firm ID changes across years

Xia: Two points.  First, compged (in my experience) does a lot better than spedis.  Doesn't work as quickly, but is quite a bit more accurate.  e.g., try your code on the datasets below where the only thing I changed was which spelling of sun country would be used.  The compged method worked correctly for both change files.

Second, we don't know how many records the OP has.  In the example I took this from, the main file had 1,000,000 records but, sorting with nodupkey brought the number of unique spellings down to about 5,000.  It was for that reason that I created a format based on the smaller dataset, and then applied it to the larger file.

data main;

  informat name $50.;

  informat id $5.;

  infile cards delimiter='09'x;

  input Name Year ID;

  cards;

ST. Lucys Firm 1996 40050

ST. Lucys Firm 1997 40050

ST. Lucys Firm 1998 40050

ST. Lucy s Firm 1999 41712

Saint Lucy s Firm 2001 41712

Saint Lucy s Firm 2002 41712

Saint Lucy s Firm 2003 41712

Saint Lucy s Firm 2004 41712

Sun Firm 1996 51017

Sun Firm 1997 51017

Sun Firm 1998 51017

Sun Firm 1999 51017

Sun County Firm 2000 51017

Sun County Firm 2001 51017

Sun County Firm 2002 51820

Sun County Firm 2003 51820

;

data changes;

  informat name $50.;

  informat id $5.;

  infile cards delimiter='09'x;

  input Name Year ID;

  cards;

ST. Lucy's Firm 1999 41712

Sun County 2002 51820

;

Super User
Posts: 9,687

Re: How to create unique ID if firm ID changes across years

Agree, That is reason why I called mine is not a good solution.

Xia Keshan

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 459 views
  • 3 likes
  • 4 in conversation