Help using Base SAS procedures

Lookup solution

Reply
Occasional Contributor
Posts: 6

Lookup solution

Hi All!

I'm new in SAS and I need your help. I need to look up for account id in a database, I have a spreed sheet that contains the account names. I've used an index function but I just can retrieves some of the account ids due to names are different (some of them include dots, comma, etc) and there isn't and exact match.

I was thinking in separete the account name into several fields (in excel) i.e. "Best Buy", will be divided into 2 columns "Best" and "Buy" and then column 1 will be a format, column 2 another one, etc. The excel file will contain these headers Rowid, Index1, Index2, Index3, etc (I want to use Rowid in order to match the excel with the results)

I will import the excel file into SAS and make it a format

Code:

%makefmt(dslib=Work,dsin=Data,fmtname=$indexone,start=index1,lv=Y, other=N);

data AccountsUS (keep= account_name account_id region industry);

set Accounts;

where index(upcase(put(account_name,$indexone.))) >0 ;

run;

How can I add the row id in the table results, due to I cab match some strings, but I would like to know to which string belongs that match with the row id)

Any tip or hint?

Thanks

Occasional Contributor
Posts: 6

Re: Lookup solution

I forgot to correct this line:

where index(upcase(put(account_name,$indexone.)))>0;

Super User
Posts: 10,500

Re: Lookup solution

Some examples of what you have and what you want for output would be helpful.

Occasional Contributor
Posts: 6

Re: Lookup solution

Thank you ballardw:

I would like to have something like this:

Excel File

Row IDIndex1Index2Index3Index4
1DesjardinsGroupMouvementdes
2CanadaPostCorporation(Canada
3NovaScotiaLiquorCorporation
4GoodyearCanadaInc.
5ShellCanadaProducts
6HydroOneNetworksInc.
7BritishColumbiaMinistryof
8WhirlpoolCanadaLP

SAS Match: (accound_id is taken from the database, index1 and RowId are taken of excel file)

Accound_idIndex1Row ID
523223Desjardins1
1348864Nova3
1253532Hydro6
1258563Whirlpool8
Super User
Posts: 10,500

Re: Lookup solution

Have you examined both datasets to verify that the first word is unique per id?

You have British Columbia as part of one account name, what happens if you also have British Petroleum?

The format idea is a good one for establishing look ups but requires fairly clean data to create the format with unique keys.

Approaches can vary depending on:

How many records are involved? The more accounts the less acceptable human eyeball approaches will be.

How reliable is the spelling outside of the punctuation issues? Matching Qoise to Boise isn't going to be automatted completely.

Do you get a significant number of exact matches? If most match then human intervention may be acceptable.

Will you have to do this again with other spreadsheets? Multiple times means it will be worth spending more time to generate a more general approach than would be needed for a one time approach.

A rough outline I do with similar projects:

     See what I can match directly, usually means using as a minimum UPCASE or LOWCASE on comparison text

     With the ones that don't match, try find close matches. This is where the number of records comes into play as the simplest and possibly best means creating a comparison score for every combination of nams in both files (minus the ones that matched from your spreadsheet) and examing the "best" scores for matches. The SAS functions COMPGED, COMPARE and COMPLEV are some ways to generate those scores. These can be used as a threshold or to bring likely matches to human attention for acceptance/rejection.

If everything does match, lower the threshold with the unmatched and reiterate.

Occasional Contributor
Posts: 6

Re: Lookup solution

Thank you ballardw.. let me try with those functions you mentioned.

PROC Star
Posts: 7,363

Re: Lookup solution

Can you attach the two files or at least sub-sets of each?  I just completed a similar task as part of a presentation that I'll be doing at the September MWSUG meeting that uses the technique described by

If I see that it will work with your data, I'd be glad to post the code.

Art

Occasional Contributor
Posts: 6

Re: Lookup solution

Thank you Arthur, I've attached an excel file that contains an extract of the database in sheet named Database and in the other tab named lookup contains some accounts to be matched, as you can see there is just an exact match on row 76.

PROC Star
Posts: 7,363

Re: Lookup solution

Can't test it with that file.  Other than that one record, none of the others (that I noticed at least) have a match.

To see if it works, the database file would have to contain acct_names that match (not exact match) the records in the lookup table.

Super Contributor
Posts: 339

Re: Lookup solution

Hi Carlos,

As pointed out by Ballard, if I were you, I'd use one of the text-distance metrics. However, since you've mentioned sometimes its word separators and the like that are causing inexact match, I would do so on a pre processed "compressed" string (removing spaces, hyphens, dots, commas, etc. basically any special character). The issue with a discrimination rule with just those distances are word ordering ("mouvement desjardins" and "desjardins mouvement" would be extremely far appart).

There are other text distance metrics although I am not aware that they are built-in SAS. Another alternative is to try to minimize character transitions with a 27*27 matrix of transition for each string.

Here is a simplified version of something I had done a while back. I had build a FCMP function to compare a single variable, for each record, against an entire Levenstein dictionary (complev function). I'm not providing anything with regards to proc FCMP but rather how you can use _TEMPORARY_ arrays as a dictionary when dictionary size is not too big.

/* steps to put the number of records in your accounts dataset into macro variable dbsize*/

data want;

    if _N_=1 then do;

                array accountname{&dbsize} $80. _temporary_;

                array accountid{&dbsize} $3. _temporary_;

                do i = 1 to &dbsize;

                        set database;

                        accountname{i} =

                         accountid{i} = accountid;

                end; /* read your masters dataset accountname/id into temporary arrays that are retained in memory through the entire data step */

        end;

     set have; /* loop on dataset have or, in your case, the lookup excel sheet */

     bestmatchid=""; bestmatchdistance=80;

     do i=1 to &dbsize;

           temp=complev(compress(accountname{i}, " -;:,.\/#'*&?!$@"), compress(account_name, " -;:,.\/#'*&?!$@"), "nl"); 

           if temp<bestmatchdistance then bestmatchid=accountid{i};

      end;

     drop bestmatchdistance temp;

run;

for testing you may want to remove the drop statement~

Basically what this does is run every single account_name in your lookup dataset against a "dictionary" of account names & id pairs preformatting the data. Note that the complev modifier n ignores case hence I did not add an upcase or lowcase function to the compress.

There's currently no case handling if multiple matches are found or if the distance is too small or even if the best distance is way too big. Levenstein computes the number of transformation to go from string A to string B comp GED might suit your need more or you may even need a custom string distance function but the above concept of using temporary arrays to read your masters data only once still holds and can be quite useful. Significantly more than doing a cartesian product of the 2 DS and run a distance function on each such record.

Vincent

Occasional Contributor
Posts: 6

Re: Lookup solution

Thank you all for your help, I just did something outside SAS that was faster, for this time, I just used Access 2010 and simply did a SQL string containing LIKE instruction and it worked.

Super User
Posts: 17,826

Re: Lookup solution

You can use SQL inside SAS as well, and the like function is there as well.

Ask a Question
Discussion stats
  • 11 replies
  • 423 views
  • 4 likes
  • 5 in conversation