BookmarkSubscribeRSS Feed
Carlosmx
Calcite | Level 5

Hi All!

I'm new in SAS and I need your help. I need to look up for account id in a database, I have a spreed sheet that contains the account names. I've used an index function but I just can retrieves some of the account ids due to names are different (some of them include dots, comma, etc) and there isn't and exact match.

I was thinking in separete the account name into several fields (in excel) i.e. "Best Buy", will be divided into 2 columns "Best" and "Buy" and then column 1 will be a format, column 2 another one, etc. The excel file will contain these headers Rowid, Index1, Index2, Index3, etc (I want to use Rowid in order to match the excel with the results)

I will import the excel file into SAS and make it a format

Code:

%makefmt(dslib=Work,dsin=Data,fmtname=$indexone,start=index1,lv=Y, other=N);

data AccountsUS (keep= account_name account_id region industry);

set Accounts;

where index(upcase(put(account_name,$indexone.))) >0 ;

run;

How can I add the row id in the table results, due to I cab match some strings, but I would like to know to which string belongs that match with the row id)

Any tip or hint?

Thanks

11 REPLIES 11
Carlosmx
Calcite | Level 5

I forgot to correct this line:

where index(upcase(put(account_name,$indexone.)))>0;

ballardw
Super User

Some examples of what you have and what you want for output would be helpful.

Carlosmx
Calcite | Level 5

Thank you ballardw:

I would like to have something like this:

Excel File

Row IDIndex1Index2Index3Index4
1DesjardinsGroupMouvementdes
2CanadaPostCorporation(Canada
3NovaScotiaLiquorCorporation
4GoodyearCanadaInc.
5ShellCanadaProducts
6HydroOneNetworksInc.
7BritishColumbiaMinistryof
8WhirlpoolCanadaLP

SAS Match: (accound_id is taken from the database, index1 and RowId are taken of excel file)

Accound_idIndex1Row ID
523223Desjardins1
1348864Nova3
1253532Hydro6
1258563Whirlpool8
ballardw
Super User

Have you examined both datasets to verify that the first word is unique per id?

You have British Columbia as part of one account name, what happens if you also have British Petroleum?

The format idea is a good one for establishing look ups but requires fairly clean data to create the format with unique keys.

Approaches can vary depending on:

How many records are involved? The more accounts the less acceptable human eyeball approaches will be.

How reliable is the spelling outside of the punctuation issues? Matching Qoise to Boise isn't going to be automatted completely.

Do you get a significant number of exact matches? If most match then human intervention may be acceptable.

Will you have to do this again with other spreadsheets? Multiple times means it will be worth spending more time to generate a more general approach than would be needed for a one time approach.

A rough outline I do with similar projects:

     See what I can match directly, usually means using as a minimum UPCASE or LOWCASE on comparison text

     With the ones that don't match, try find close matches. This is where the number of records comes into play as the simplest and possibly best means creating a comparison score for every combination of nams in both files (minus the ones that matched from your spreadsheet) and examing the "best" scores for matches. The SAS functions COMPGED, COMPARE and COMPLEV are some ways to generate those scores. These can be used as a threshold or to bring likely matches to human attention for acceptance/rejection.

If everything does match, lower the threshold with the unmatched and reiterate.

Carlosmx
Calcite | Level 5

Thank you ballardw.. let me try with those functions you mentioned.

art297
Opal | Level 21

Can you attach the two files or at least sub-sets of each?  I just completed a similar task as part of a presentation that I'll be doing at the September MWSUG meeting that uses the technique described by

If I see that it will work with your data, I'd be glad to post the code.

Art

Carlosmx
Calcite | Level 5

Thank you Arthur, I've attached an excel file that contains an extract of the database in sheet named Database and in the other tab named lookup contains some accounts to be matched, as you can see there is just an exact match on row 76.

art297
Opal | Level 21

Can't test it with that file.  Other than that one record, none of the others (that I noticed at least) have a match.

To see if it works, the database file would have to contain acct_names that match (not exact match) the records in the lookup table.

Vince28_Statcan
Quartz | Level 8

Hi Carlos,

As pointed out by Ballard, if I were you, I'd use one of the text-distance metrics. However, since you've mentioned sometimes its word separators and the like that are causing inexact match, I would do so on a pre processed "compressed" string (removing spaces, hyphens, dots, commas, etc. basically any special character). The issue with a discrimination rule with just those distances are word ordering ("mouvement desjardins" and "desjardins mouvement" would be extremely far appart).

There are other text distance metrics although I am not aware that they are built-in SAS. Another alternative is to try to minimize character transitions with a 27*27 matrix of transition for each string.

Here is a simplified version of something I had done a while back. I had build a FCMP function to compare a single variable, for each record, against an entire Levenstein dictionary (complev function). I'm not providing anything with regards to proc FCMP but rather how you can use _TEMPORARY_ arrays as a dictionary when dictionary size is not too big.

/* steps to put the number of records in your accounts dataset into macro variable dbsize*/

data want;

    if _N_=1 then do;

                array accountname{&dbsize} $80. _temporary_;

                array accountid{&dbsize} $3. _temporary_;

                do i = 1 to &dbsize;

                        set database;

                        accountname{i} =

                         accountid{i} = accountid;

                end; /* read your masters dataset accountname/id into temporary arrays that are retained in memory through the entire data step */

        end;

     set have; /* loop on dataset have or, in your case, the lookup excel sheet */

     bestmatchid=""; bestmatchdistance=80;

     do i=1 to &dbsize;

           temp=complev(compress(accountname{i}, " -;:,.\/#'*&?!$@"), compress(account_name, " -;:,.\/#'*&?!$@"), "nl"); 

           if temp<bestmatchdistance then bestmatchid=accountid{i};

      end;

     drop bestmatchdistance temp;

run;

for testing you may want to remove the drop statement~

Basically what this does is run every single account_name in your lookup dataset against a "dictionary" of account names & id pairs preformatting the data. Note that the complev modifier n ignores case hence I did not add an upcase or lowcase function to the compress.

There's currently no case handling if multiple matches are found or if the distance is too small or even if the best distance is way too big. Levenstein computes the number of transformation to go from string A to string B comp GED might suit your need more or you may even need a custom string distance function but the above concept of using temporary arrays to read your masters data only once still holds and can be quite useful. Significantly more than doing a cartesian product of the 2 DS and run a distance function on each such record.

Vincent

Carlosmx
Calcite | Level 5

Thank you all for your help, I just did something outside SAS that was faster, for this time, I just used Access 2010 and simply did a SQL string containing LIKE instruction and it worked.

Reeza
Super User

You can use SQL inside SAS as well, and the like function is there as well.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 1577 views
  • 4 likes
  • 5 in conversation