sos
Calcite | Level 5

Hello, folks,

I want to combine about 50 data sets whose first two variables are city names. Some of the names have different spellings: for example, 'Sarasota/Bradenton, FL' and 'Sarasota, FL' refer to the same city, so some of the data sets have 'Sarasota/Bradenton, FL' while others have 'Sarasota, FL'. To make them look exactly the same, I need to do two things:

1. find how many data sets have 'Sarasota/Bradenton, FL' and how many data sets have 'Sarasota, FL';

2. use some command to unify the two spellings, either changing 'Sarasota, FL' to 'Sarasota/Bradenton, FL' or the other way around.

Could someone help me out with this?

Thanks a lot!     

10 REPLIES
art297
Opal | Level 21

Are they SAS files already, or are you importing spreadsheets, delimited files, or something else? I would be surprised if the above example is the only problem you have: misspellings, abbreviations, and punctuation are also common problems in such tasks.

sos
Calcite | Level 5

They are SAS files. I did not realize the problem before I imported them into SAS. Here is the code I'm using:

data one;
   input city1 :$3. city2 :$3. pass;
   datalines;
chi det 3
chi ks 4
la dc 5
la chi 6
;
run;

proc sort data=one; by city1 city2; run;

data two;
   input city1 :$3. city2 :$3. pass;
   datalines;
chi det 13
chi ks 14
chi dc 6
la dc 51
la chi 16
;
run;

proc sort data=two; by city1 city2; run;

/* Combine the data sets by city1 and city2. Rename the PASS
   variables so they aren't overwritten. */
data total;
   merge one (in=in1 rename=(pass=pass1)) two (in=in2 rename=(pass=pass2));
   by city1 city2;
   if in1 and in2;
run;

proc print data=total; run;

I need to make sure the values of 'city1' and 'city2' are exactly the same across the data sets, right?
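So, for example, I would have to recode one spelling before sorting and merging. Something like this is what I have in mind (the recode below is hypothetical, just to show the idea, and assumes the city variables are long enough to hold the full names):

data one;
   set one;
   /* unify the two spellings before sorting and merging */
   if city1 = 'Sarasota, FL' then city1 = 'Sarasota/Bradenton, FL';
   if city2 = 'Sarasota, FL' then city2 = 'Sarasota/Bradenton, FL';
run;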

Thanks!

LinusH
Tourmaline | Level 20

If this is a large warehouse implementation, it could be a job for DataFlux/SAS DQ Server.

If not, you need to do something by hand.

If this job is repetitive, I would build a macro that searches specified columns in all your tables for your search string; in your case it would be 'Sarasota'. When found, report it to a table, from which you could produce a summary report or whatever else you want. A rough sketch of the idea follows.
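A minimal sketch, assuming the 50 data sets are named YR1960 through YR2009 in a library called PANEL and that the city columns are CITY1 and CITY2 (all of those names are just placeholders):

/* Scan every yearly data set for a search string in the city */
/* columns and log each hit, tagged with its source data set. */
%macro find_city(string=);
   data hits;
      length source $41;
      set panel.yr1960-panel.yr2009 indsname=ds;  /* numbered range list */
      source = ds;                                /* LIBNAME.MEMNAME     */
      if index(upcase(city1), upcase("&string")) > 0
         or index(upcase(city2), upcase("&string")) > 0;
   run;

   /* Summary: which data sets contain which spelling? */
   proc freq data=hits;
      tables source*city1 source*city2 / list nocum;
   run;
%mend find_city;

%find_city(string=Sarasota)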

/Linus

Data never sleeps
art297
Opal | Level 21

I agree with Linus that it depends on how big an effort it is. If it isn't a large implementation, and just a one-time thing, I would simply:

1. create one large dataset by just setting all of the files within one data step;

2. do two PROC SORT NODUPKEYs on that file, using each of the two city variables as the BY variable, respectively;

3. manually review the results to see just how many such errors exist;

4. create new city variables using such techniques as PROC FORMAT, capturing only the start of city names, or the various proximity functions (COMPLEV, COMPGED, SOUNDEX, etc.), depending upon the nature of what you find in the previous steps. A sketch of steps 1, 2, and 4 follows below.
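A minimal sketch of steps 1, 2, and 4, again assuming 50 yearly data sets named YR1960-YR2009 in a library called PANEL (placeholder names):

/* 1. Stack all 50 files into one data set */
data all;
   set panel.yr1960-panel.yr2009;
run;

/* 2. Distinct values of each city variable, for manual review */
proc sort data=all(keep=city1) out=u1 nodupkey; by city1; run;
proc sort data=all(keep=city2) out=u2 nodupkey; by city2; run;

/* 4. Flag near-duplicate names by edit distance (COMPLEV);  */
/*    the cutoff of 5 is arbitrary and needs tuning by eye.  */
proc sql;
   create table suspects as
   select a.city1 as name1, b.city1 as name2,
          complev(a.city1, b.city1) as distance
   from u1 as a, u1 as b
   where a.city1 < b.city1
     and calculated distance between 1 and 5;
quit;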

sos
Calcite | Level 5

Actually, I tried that SET approach, which combines all the data sets, and used FINDW to unify the spellings of the city names. It is not that hard since everything is in one big data set. But there is no way to split the big data set back into the 50 original ones, right? Can we go back to the original data sets after correcting the city names?

My case is like this: I have 50 years of data, with each year as a separate data set. I want to combine these 50 data sets into one panel data set. Since the 50 data sets do not all cover the same cities, I want to keep only the overlap of the cities in the final big data set. Ideally, my final data set would be a balanced panel with 50 time periods. Something like the step below is what I have in mind for that last part.
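For illustration (a sketch, assuming the stacked data set ALL from above has a YEAR variable and that the pair of city variables identifies a route):

/* Keep only city pairs that appear in all 50 years, */
/* giving a balanced panel.                          */
proc sql;
   create table balanced as
   select *
   from all
   group by city1, city2
   having count(distinct year) = 50;
quit;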

Thanks!

art297
Opal | Level 21

Not to interfere with the latest suggestion(s) you were offered but, in answer to your question: why not? One can always analyze files to discover which approach would be optimal.

The only problem with going back and forth is how long it takes. Unless one deletes or corrupts the files, there is no reason not to take multiple passes through them, unless doing so is economically infeasible in itself.

sos
Calcite | Level 5

Right, I can do that, but I would have to repeat the same procedure 50 times. Is there an easier way, such as splitting the corrected big data set back into 50 smaller data sets with the other variables the same as before? Something like the sketch below is what I mean.
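A possible sketch (assuming the corrected data set TOTAL has a YEAR variable with values 1960-2009; the macro name %SPLIT is just for illustration):

/* Write one data set per year back out from the corrected file */
%macro split(from=total, lib=work);
   %do yr = 1960 %to 2009;
      data &lib..yr&yr;
         set &from;
         if year = &yr;
      run;
   %end;
%mend split;

%split(from=total, lib=work)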

Thanks!

art297
Opal | Level 21

You've never (that I remember at least) mentioned how big the datasets are.  How many records would you guess would be in all 50 files combined?

ballardw
Super User

Do you by any chance have ZIP codes in the data? If so, you might find the ZIPCITY function helpful.

If not, restricting a search of the SASHELP.ZIPCODE data set to your state(s) of interest, together with the LIKE operator or the SOUNDEX function, may be a way to get a consistent name.
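A small sketch of the ZIPCITY idea (assuming each record carries a ZIP variable; ZIPCITY returns a standardized 'City, ST' string):

/* Derive one consistent city name from the ZIP code */
data standardized;
   set all;                 /* combined data set from earlier */
   length city_std $40;
   if not missing(zip) then city_std = zipcity(zip);
run;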

sos
Calcite | Level 5

Thanks for your suggestion. I will try it.

