Re: Lookup solution

Carlosmx · Posted 07-30-2013 04:50 PM

Hi All!

I'm new in SAS and I need your help. I need to look up for account id in a database, I have a spreed sheet that contains the account names. I've used an index function but I just can retrieves some of the account ids due to names are different (some of them include dots, comma, etc) and there isn't and exact match.

I was thinking in separete the account name into several fields (in excel) i.e. "Best Buy", will be divided into 2 columns "Best" and "Buy" and then column 1 will be a format, column 2 another one, etc. The excel file will contain these headers Rowid, Index1, Index2, Index3, etc (I want to use Rowid in order to match the excel with the results)

I will import the excel file into SAS and make it a format

Code:

%makefmt(dslib=Work,dsin=Data,fmtname=$indexone,start=index1,lv=Y, other=N);

data AccountsUS (keep= account_name account_id region industry);

set Accounts;

where index(upcase(put(account_name,$indexone.))) >0 ;

run;

How can I add the row id in the table results, due to I cab match some strings, but I would like to know to which string belongs that match with the row id)

Any tip or hint?

Thanks

Carlosmx · Posted 07-30-2013 04:58 PM

I forgot to correct this line:

where index(upcase(put(account_name,$indexone.)))>0;

ballardw · Posted 07-30-2013 05:02 PM

Some examples of what you have and what you want for output would be helpful.

Carlosmx · Posted 07-30-2013 05:11 PM

Thank you ballardw:

I would like to have something like this:

Excel File

Row ID	Index1	Index2	Index3	Index4
1	Desjardins	Group	Mouvement	des
2	Canada	Post	Corporation	(Canada
3	Nova	Scotia	Liquor	Corporation
4	Goodyear	Canada	Inc.
5	Shell	Canada	Products
6	Hydro	One	Networks	Inc.
7	British	Columbia	Ministry	of
8	Whirlpool	Canada	LP

SAS Match: (accound_id is taken from the database, index1 and RowId are taken of excel file)

Accound_id	Index1	Row ID
523223	Desjardins	1
1348864	Nova	3
1253532	Hydro	6
1258563	Whirlpool	8

ballardw · Posted 07-30-2013 05:46 PM

Have you examined both datasets to verify that the first word is unique per id?

You have British Columbia as part of one account name, what happens if you also have British Petroleum?

The format idea is a good one for establishing look ups but requires fairly clean data to create the format with unique keys.

Approaches can vary depending on:

How many records are involved? The more accounts the less acceptable human eyeball approaches will be.

How reliable is the spelling outside of the punctuation issues? Matching Qoise to Boise isn't going to be automatted completely.

Do you get a significant number of exact matches? If most match then human intervention may be acceptable.

Will you have to do this again with other spreadsheets? Multiple times means it will be worth spending more time to generate a more general approach than would be needed for a one time approach.

A rough outline I do with similar projects:

See what I can match directly, usually means using as a minimum UPCASE or LOWCASE on comparison text

With the ones that don't match, try find close matches. This is where the number of records comes into play as the simplest and possibly best means creating a comparison score for every combination of nams in both files (minus the ones that matched from your spreadsheet) and examing the "best" scores for matches. The SAS functions COMPGED, COMPARE and COMPLEV are some ways to generate those scores. These can be used as a threshold or to bring likely matches to human attention for acceptance/rejection.

If everything does match, lower the threshold with the unmatched and reiterate.

Carlosmx · Posted 07-31-2013 10:27 AM

Thank you ballardw.. let me try with those functions you mentioned.

art297 · Posted 07-31-2013 10:47 AM

Can you attach the two files or at least sub-sets of each? I just completed a similar task as part of a presentation that I'll be doing at the September MWSUG meeting that uses the technique described by

If I see that it will work with your data, I'd be glad to post the code.

Art

Carlosmx · Posted 07-31-2013 11:31 AM

Thank you Arthur, I've attached an excel file that contains an extract of the database in sheet named Database and in the other tab named lookup contains some accounts to be matched, as you can see there is just an exact match on row 76.

art297 · Posted 07-31-2013 12:37 PM

Can't test it with that file. Other than that one record, none of the others (that I noticed at least) have a match.

To see if it works, the database file would have to contain acct_names that match (not exact match) the records in the lookup table.

Vince28_Statcan · Posted 07-31-2013 02:48 PM

Hi Carlos,

As pointed out by Ballard, if I were you, I'd use one of the text-distance metrics. However, since you've mentioned sometimes its word separators and the like that are causing inexact match, I would do so on a pre processed "compressed" string (removing spaces, hyphens, dots, commas, etc. basically any special character). The issue with a discrimination rule with just those distances are word ordering ("mouvement desjardins" and "desjardins mouvement" would be extremely far appart).

There are other text distance metrics although I am not aware that they are built-in SAS. Another alternative is to try to minimize character transitions with a 27*27 matrix of transition for each string.

Here is a simplified version of something I had done a while back. I had build a FCMP function to compare a single variable, for each record, against an entire Levenstein dictionary (complev function). I'm not providing anything with regards to proc FCMP but rather how you can use _TEMPORARY_ arrays as a dictionary when dictionary size is not too big.

/* steps to put the number of records in your accounts dataset into macro variable dbsize*/

data want;

if _N_=1 then do;

array accountname{&dbsize} $80. _temporary_;

array accountid{&dbsize} $3. _temporary_;

do i = 1 to &dbsize;

set database;

accountname{i} =

accountid{i} = accountid;

end; /* read your masters dataset accountname/id into temporary arrays that are retained in memory through the entire data step */

end;

set have; /* loop on dataset have or, in your case, the lookup excel sheet */

bestmatchid=""; bestmatchdistance=80;

do i=1 to &dbsize;

temp=complev(compress(accountname{i}, " -;:,.\/#'*&?!$@"), compress(account_name, " -;:,.\/#'*&?!$@"), "nl");

if temp<bestmatchdistance then bestmatchid=accountid{i};

end;

drop bestmatchdistance temp;

run;

for testing you may want to remove the drop statement~

Basically what this does is run every single account_name in your lookup dataset against a "dictionary" of account names & id pairs preformatting the data. Note that the complev modifier n ignores case hence I did not add an upcase or lowcase function to the compress.

There's currently no case handling if multiple matches are found or if the distance is too small or even if the best distance is way too big. Levenstein computes the number of transformation to go from string A to string B comp GED might suit your need more or you may even need a custom string distance function but the above concept of using temporary arrays to read your masters data only once still holds and can be quite useful. Significantly more than doing a cartesian product of the 2 DS and run a distance function on each such record.

Vincent

Carlosmx · Posted 08-13-2013 05:42 PM

Thank you all for your help, I just did something outside SAS that was faster, for this time, I just used Access 2010 and simply did a SQL string containing LIKE instruction and it worked.

Reeza · Posted 08-13-2013 05:59 PM

You can use SQL inside SAS as well, and the like function is there as well.

Registration is open

SAS Training: Just a Click Away