topic Re: deal with the variations in merge in SAS Procedures

deal with the variations in merge

deleted_user — Mon, 02 Mar 2009 00:40:35 GMT

I think this is a very challenging problem I am facing and I have no idea how to deal with it
Suppose I have two csv files

A.csv
Toyota Camry,1998,blue
Honda Civic,1999,blue
Acura Inf,2000,yellow

B.csv
Toyota Inc. Camry, 2000km
Honda Corp Civic,1500km
HondaUSA Inf, 2000, 2300km

I want to generate C.csv
Toyota Camry,1998,blue ,2000km
Honda Civic,1999,blue,1500km
HondaUSA Inf,2000,yellow,2300km

The worst part of the task is that there needs to be error tolerance to deal with the variations in the company name
1.extra spaces
2.extra dots
3.phrases such as Inc, corp.
4.Create a list of manual translation tables(Acura translates to HondaUSA)

Is this mission impossible?

Re: deal with the variations in merge

venkatesh — Mon, 02 Mar 2009 09:52:30 GMT

I Don't know how much it will help to you....

From A.csv file split the name in to two parts by using scan function like Toyota and carmy as name1 and name2.
from B.csv file also do same thing.

finally merge by either name1 or name2 ........

Thanks....

Re: deal with the variations in merge

deleted_user — Tue, 03 Mar 2009 10:47:53 GMT

It looks a little complicated but by no means mission impossible.

You obviously need to get a standardised variable in each dataset (having read them in from the csv files.

Have a look at the INDEXW function...

http://support.sas.com/documentation/cdl/en/lrdict/61724/HTML/default/a000212235.htm

This will search a string for a specific word.

You could create a list of words you are looking for e.g. Honda Acura Nissan

and use the indexw function to look for them in the variable. If you find the word then put that word into a variable. Do this on both datasets, then change any Accura to to Honda etc.

Remember the INDEXW function is looking for words, so it uses a delimiter. So it won't pick the Honda out of HondaUSA. It is looking for words, not character strings. You can use regular expressions for that if you are really keen.

When you have a standard variable in each data set you can merge them.

Re: deal with the variations in merge

Flip — Tue, 03 Mar 2009 12:48:03 GMT

Just a random thought, but have you given the Sounds-Like (=*) opperator a try? Also, you can use Regular expressions to match oun similar text.