Help using Base SAS procedures

How to compare variables in different data sets.

Reply
Occasional Contributor sos
Occasional Contributor
Posts: 12

How to compare variables in different data sets.

Hello, folks,

I want to combine about 50 data sets; the first two variables are city names. For some of the names, they have diffrent spellings. Like they put 'Sarasota/Bradenton, FL' the same as 'Sarasota, FL'. So some of the data sets have 'Sarasota/Bradenton, FL' while some 'Sarasota, FL'. In order to make them look exactly the same, I should do two things:

1. find how many data sets have 'Sarasota/Bradenton, FL' and how many data sets have 'Sarasota, FL';

2. try to use command to unify these two terms-either change it into 'Sarasota/Bradenton, FL' or the other way around.

Could someone help me this out?

Thanks a lot!     

PROC Star
Posts: 7,363

How to compare variables in different data sets.

Are they SAS files already or are you importing spreadsheets or delimited files or what?  I would be surprised if the above example is the only problem you have.  Misspellings, abbreviations and punctuation are also common problems in such tasks.

Occasional Contributor sos
Occasional Contributor
Posts: 12

How to compare variables in different data sets.

They are sas files. I did not realize the problem before I imported them into SAS.

Since I use the following code:

Data one;

input city1 :$3. city2 :$3.  pass;

datalines;

chi    det    3

chi     ks    4

la      dc    5

la      chi   6

;

proc sort; by city1 city2; run;

data two;

input city1 :$3. city2 :$3.  pass;

datalines;

chi    det    13

chi     ks    14

chi     dc    6

la      dc    51

la      chi   16

;

run;

proc sort; by city1 city2; run;

/*combine the data sets by city1 and city2. Rename the PASS variables so they aren't overwritten*/

Data total;

merge one (in=in1 rename=(pass=pass1)) two(in=in2 rename=(pass = pass2));

by city1 city2;

if in1 and in2;

run;

proc print; run;

I need to make sure the observation in 'city1' and 'city2' are exactly the same, right?

Thanks!

Super User
Posts: 5,256

How to compare variables in different data sets.

If this is a large warehouse implementation, it could be a work for DataFlux/SAS DQ Server.

If not, you need to do sopmething by hand.

If this job is repetitive, I would build a macro that searches all your tables in specified columns for you search string, in you case it would be 'Sarasota'. When found, report it to a table, from which you could do a summary report or whatever you want to do.

/Linus

Data never sleeps
PROC Star
Posts: 7,363

How to compare variables in different data sets.

I agree with Linus that it depends on how huge of an effort it is.  If it isn't a large implementation, and just a one time thing, I would simply:

1. create one large dataset by just setting all of the files within one datastep.

2. do two proc sort nodupkeys on that file using the two city variables (as the by variable) for each, respectively

3. manually review the file to see just how many such errors exist

4. create new city variables using such techniques as proc format, only capturing the start of city names, or the various proximity functions (complev, comged, soundex, etc.) depending upon the nature of what you find as a result of the previous steps.

Occasional Contributor sos
Occasional Contributor
Posts: 12

How to compare variables in different data sets.

Actually, I tried that 'set' procedure, which combines all the data sets. I use 'findw' to unify the spellings of the city names. It is not that hard since it is in one big data set. But there is no way that we can split the big data set into 50 original ones, right? Can we go back to the original data sets after correcting the names of cities?

My case is like this: I have 50 years of data sets, with each year as a data set. I want to combine this 50 data sets into a panel one. Since those 50 data sets do not have the same amount of cities, I want keep the overlap of the cities in the final big data set. Ideally, my final data set would be a balanced panel with 50 time periods.

Thanks!

PROC Star
Posts: 7,363

How to compare variables in different data sets.

Not to interfere with the latest suggestion(s) you were offered but, in answer to your question, why not?  One can always analyze files to discover what approach would be optimal.

The only problem with going back and forth is the question of how long it takes.  Unless one deletes or corrupts their files, there is no reason one can't take multiple passes through them unless doing so is economically infeasable in itself.

Occasional Contributor sos
Occasional Contributor
Posts: 12

Re: How to compare variables in different data sets.

Right. I can do it. But I have to implement the same procedure for 50 times. Is there an easier way to do it? Like to split the corrected big total data set into 50 smaller data sets with other variables the same as before?

Thanks!

PROC Star
Posts: 7,363

Re: How to compare variables in different data sets.

You've never (that I remember at least) mentioned how big the datasets are.  How many records would you guess would be in all 50 files combined?

Super User
Posts: 10,500

How to compare variables in different data sets.

Do you by any chance have ZIP codes in the data? If so you might find the ZIPCITY function helpful.

If not restricing a search in the SASHELP.ZIP code data set to your state(s) of interest and LIKE or SOUNDEX functions to get a consistent name may be possible.

Occasional Contributor sos
Occasional Contributor
Posts: 12

How to compare variables in different data sets.

Thanks for your suggestion. I will try it.

Ask a Question
Discussion stats
  • 10 replies
  • 180 views
  • 0 likes
  • 4 in conversation