hi ... before you start any of the above, since step #1 relies on matching by literals, have you looked at the names in both files and determined if there are thigs you should do before you even start ... for example ... #1 in holding_company, I see ... O.S.K. HOLDINGS BERHAD OSK HOLDINGS BERHAD are they the same company and should you get rid of those periods #2 there's a mix of lower and upper case letters ... should you convert to all uppercase #3 most (90+%) of all the name variables you cite have "BHD" or "BERHAD" as part of the name ... if you are going to look for similarity in names you don't want the fact that the "BHD' or "BERHAD" part of the match contributing anything to a score given to a name comparison #4 sometimes a location is in parentheses (MALAYSIA) and sometimes it's not MALAYSIA just using PROC FREQ on the various name variables would give you some idea as to how to fix up the names before you even try to match names for example, clean up the names and make some new variables to hold those names ... data new_maluw; set z.maluw; * add a record number for later use; mnrec+1; * convert to uppercase, only keep numbers/letters/spaces, convert multiple spaces to one space; nm = compbl(compress(upcase(name),,'kdas')); * get rid of BHD and BERHAD; nm = tranwrd(nm,' BHD',''); nm = tranwrd(nm,' BERHAD',''); run; data new_uw_match; set z.uw_match; unrec+1; nmh = compbl(compress(upcase(holding_company),,'kdas')); nmh = tranwrd(nmh,' BHD',''); nmh = tranwrd(nmh,' BERHAD',''); nmu = compbl(compress(upcase(underwriters_names),,'kdas')); nmu = tranwrd(nmu,' BHD',''); nmu = tranwrd(nmu,' BERHAD',''); run; then run PROC FREQ again on the new variables (nm, nmh, and nmu) and see if there are any other things you should do before you start to match the nm in one file to nmh and nmu in another once you have done the above, here's a suggestion for a start ... haven't used COMPGED much (maybe other folk know about a "good score" level) I usually do this stuf in stages, evaluating the success of each step (e.g. the name match) before I move onto the next ... * use SQL to match the files by a comparison of names, use the COMPGED function to compare names; * you don't have to use all the data since you have pointers (mnrec and unrec); * nm_nmh and nm_nmu are matching scores; proc sql; create table both as select mnrec, unrec, compged(nm, nmh) as nm_nmh, compged(nm, nmu) as nm_nmu from new_maluw, new_uw_match having nm_nmh lt 50 or nm_nmu lt 50; quit; * reconstruct the data using the pointers; * maybe you only add the dates and other vars you need for more work at this point; data both; set both; p1=mnrec; p2=unrec; set new_maluw (keep=nm closdate) point=p1; set new_uw_match (keep=nmh nmu ipo_date) point=p2; run; etc ...
... View more