Thank you, Art! I downloaded the paper and the zipped data, and I completed the first two steps from your paper:

1. Create a unique set of bank names using the customer data.
2. Get the number of records in the bank info dataset.

By page 6 of the paper, I could not quite follow the block of code because of my limited knowledge of SAS. I have pasted it here and will call it block 1 so I can refer back to it later:

    data fmtDataset (keep=fmtname start label type);
      retain fmtname '$banks' type 'C';
      array bank(&numrec) $57;
      do i=1 to &numrec;
        set bankinfo;
        bank(i)=BankName;
      end;
      do until (eof);
        set banks (rename=(BankName=start)) end=eof;
        if length(start) le 4 then label=start;
        else do;
          lowscore=5000;
          do i=1 to &numrec;
            score=compged(start,bank(i));
            if score le lowscore then do;
              lowscore=score;
              closest=i;
            end;
          end;
          label=bank(closest);
        end;
        output;
      end;
    run;

My questions:

1. I have used compged in the past, but I generally set a low score cutoff to ensure close matches. Your code has the line lowscore=5000; and I do not know whether that means you allow very distant matches, i.e., pairs of names that are not actually close.

2. My data has company names in two databases. Some are easy to match, e.g. AB INC. vs. AB INCORPORATED, and I could use

SAS Code:
    &name = tranwrd(&name, "INCORPORATED","INC");

to account for those. But I notice you use

    /*Create the necessary format*/
    proc format cntlin=fmtDataset;
    run;
    /*recode bank names*/
    data dcandh;
      set dcandh (rename=(BankName=_BankName));
      BankName=put(_BankName,$banks.);
    run;

My question is: should I use the block 1 code for my case? My two datasets are: dataset one, a group of firms that are customers of other firms, and dataset two, all publicly traded firms in the U.S. market. The two sources may use different abbreviations (Inc., Corp., spelled out or not) and different letter case. To account for these, I can convert both datasets to lower case and standardize the abbreviations I can think of.
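For concreteness, the cleanup I have in mind is roughly the following. This is only a sketch: the dataset name one, the variable name, and the outputs one_clean and name_clean are placeholders for my own data, not anything from the paper. It abbreviates rather than spells out, because tranwrd replaces substrings anywhere in the string and expanding "inc" could damage names that merely contain those letters.

```sas
/* Sketch only: "one" and "name" are placeholder names for my data */
data one_clean;
  set one;
  length name_clean $100;
  /* lower-case everything so case differences do not affect the match */
  name_clean = lowcase(name);
  /* drop periods and commas so "inc." and "inc" compare equal */
  name_clean = compress(name_clean, '.,');
  /* map spelled-out suffixes to a single abbreviated form */
  name_clean = tranwrd(name_clean, 'incorporated', 'inc');
  name_clean = tranwrd(name_clean, 'corporation', 'corp');
  /* collapse repeated blanks and trim the ends */
  name_clean = strip(compbl(name_clean));
run;
```

I would run the same step on the second dataset so both sides are normalized identically before any fuzzy matching.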
Besides these, should I use your block 1 to build the format?

Sincerely,
Lan