sos
Calcite | Level 5

Hello, folks,

I want to combine about 50 data sets whose first two variables are city names. Some of the names have different spellings: for example, 'Sarasota/Bradenton, FL' and 'Sarasota, FL' refer to the same city, so some of the data sets have 'Sarasota/Bradenton, FL' while others have 'Sarasota, FL'. To make them look exactly the same, I need to do two things:

1. find how many data sets have 'Sarasota/Bradenton, FL' and how many data sets have 'Sarasota, FL';

2. use some command to unify the two spellings, either changing 'Sarasota, FL' to 'Sarasota/Bradenton, FL' or the other way around.

Could someone help me out with this?

Thanks a lot!     

10 REPLIES
art297
Opal | Level 21

Are they SAS files already, or are you importing spreadsheets, delimited files, or something else? I would be surprised if the above example is the only problem you have: misspellings, abbreviations, and punctuation are also common problems in such tasks.

sos
Calcite | Level 5

They are SAS files. I did not realize the problem before I imported them into SAS. Here is the code I'm using:

data one;
   input city1 :$3. city2 :$3. pass;
   datalines;
chi det 3
chi ks 4
la dc 5
la chi 6
;
run;

proc sort data=one; by city1 city2; run;

data two;
   input city1 :$3. city2 :$3. pass;
   datalines;
chi det 13
chi ks 14
chi dc 6
la dc 51
la chi 16
;
run;

proc sort data=two; by city1 city2; run;

/* Combine the data sets by city1 and city2. Rename the PASS
   variables so they aren't overwritten. */
data total;
   merge one (in=in1 rename=(pass=pass1)) two (in=in2 rename=(pass=pass2));
   by city1 city2;
   if in1 and in2;
run;

proc print data=total; run;

I need to make sure the values of 'city1' and 'city2' are exactly the same across the data sets, right?
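So, for example, I would have to recode one spelling before sorting and merging. Something like this is what I have in mind (the recode below is hypothetical, just to show the idea, and assumes the city variables are long enough to hold the full names):

data one;
   set one;
   /* unify the two spellings before sorting and merging */
   if city1 = 'Sarasota, FL' then city1 = 'Sarasota/Bradenton, FL';
   if city2 = 'Sarasota, FL' then city2 = 'Sarasota/Bradenton, FL';
run;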

Thanks!

LinusH
Tourmaline | Level 20

If this is a large warehouse implementation, it could be a job for DataFlux/SAS DQ Server.

If not, you need to do something by hand.

If this job is repetitive, I would build a macro that searches specified columns in all your tables for your search string; in your case it would be 'Sarasota'. When found, report it to a table, from which you could produce a summary report or whatever else you want. A rough sketch of the idea follows.
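A minimal sketch, assuming the 50 data sets are named YR1960 through YR2009 in a library called PANEL and that the city columns are CITY1 and CITY2 (all of those names are just placeholders):

/* Scan every yearly data set for a search string in the city */
/* columns and log each hit, tagged with its source data set. */
%macro find_city(string=);
   data hits;
      length source $41;
      set panel.yr1960-panel.yr2009 indsname=ds;  /* numbered range list */
      source = ds;                                /* LIBNAME.MEMNAME     */
      if index(upcase(city1), upcase("&string")) > 0
         or index(upcase(city2), upcase("&string")) > 0;
   run;

   /* Summary: which data sets contain which spelling? */
   proc freq data=hits;
      tables source*city1 source*city2 / list nocum;
   run;
%mend find_city;

%find_city(string=Sarasota)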

/Linus

Data never sleeps
art297
Opal | Level 21

I agree with Linus that it depends on how big an effort it is. If it isn't a large implementation, and just a one-time thing, I would simply:

1. create one large dataset by just setting all of the files within one data step;

2. do two PROC SORT NODUPKEYs on that file, using each of the two city variables as the BY variable, respectively;

3. manually review the results to see just how many such errors exist;

4. create new city variables using such techniques as PROC FORMAT, capturing only the start of city names, or the various proximity functions (COMPLEV, COMPGED, SOUNDEX, etc.), depending upon the nature of what you find in the previous steps. A sketch of steps 1, 2, and 4 follows below.
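A minimal sketch of steps 1, 2, and 4, again assuming 50 yearly data sets named YR1960-YR2009 in a library called PANEL (placeholder names):

/* 1. Stack all 50 files into one data set */
data all;
   set panel.yr1960-panel.yr2009;
run;

/* 2. Distinct values of each city variable, for manual review */
proc sort data=all(keep=city1) out=u1 nodupkey; by city1; run;
proc sort data=all(keep=city2) out=u2 nodupkey; by city2; run;

/* 4. Flag near-duplicate names by edit distance (COMPLEV);  */
/*    the cutoff of 5 is arbitrary and needs tuning by eye.  */
proc sql;
   create table suspects as
   select a.city1 as name1, b.city1 as name2,
          complev(a.city1, b.city1) as distance
   from u1 as a, u1 as b
   where a.city1 < b.city1
     and calculated distance between 1 and 5;
quit;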

sos
Calcite | Level 5

Actually, I tried that SET approach, which combines all the data sets, and used FINDW to unify the spellings of the city names. It is not that hard since everything is in one big data set. But there is no way to split the big data set back into the 50 original ones, right? Can we go back to the original data sets after correcting the city names?

My case is like this: I have 50 years of data, with each year as a separate data set. I want to combine these 50 data sets into one panel data set. Since the 50 data sets do not all cover the same cities, I want to keep only the overlap of the cities in the final big data set. Ideally, my final data set would be a balanced panel with 50 time periods. Something like the step below is what I have in mind for that last part.
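For illustration (a sketch, assuming the stacked data set ALL from above has a YEAR variable and that the pair of city variables identifies a route):

/* Keep only city pairs that appear in all 50 years, */
/* giving a balanced panel.                          */
proc sql;
   create table balanced as
   select *
   from all
   group by city1, city2
   having count(distinct year) = 50;
quit;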

Thanks!

art297
Opal | Level 21

Not to interfere with the latest suggestion(s) you were offered but, in answer to your question: why not? One can always analyze files to discover which approach would be optimal.

The only problem with going back and forth is how long it takes. Unless one deletes or corrupts the files, there is no reason not to take multiple passes through them, unless doing so is economically infeasible in itself.

sos
Calcite | Level 5

Right, I can do that, but I would have to repeat the same procedure 50 times. Is there an easier way, such as splitting the corrected big data set back into 50 smaller data sets with the other variables the same as before? Something like the sketch below is what I mean.
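A possible sketch (assuming the corrected data set TOTAL has a YEAR variable with values 1960-2009; the macro name %SPLIT is just for illustration):

/* Write one data set per year back out from the corrected file */
%macro split(from=total, lib=work);
   %do yr = 1960 %to 2009;
      data &lib..yr&yr;
         set &from;
         if year = &yr;
      run;
   %end;
%mend split;

%split(from=total, lib=work)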

Thanks!

art297
Opal | Level 21

You've never (that I remember at least) mentioned how big the datasets are.  How many records would you guess would be in all 50 files combined?

ballardw
Super User

Do you by any chance have ZIP codes in the data? If so, you might find the ZIPCITY function helpful.

If not, restricting a search of the SASHELP.ZIPCODE data set to your state(s) of interest, together with the LIKE operator or the SOUNDEX function, may be a way to get a consistent name.
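A small sketch of the ZIPCITY idea (assuming each record carries a ZIP variable; ZIPCITY returns a standardized 'City, ST' string):

/* Derive one consistent city name from the ZIP code */
data standardized;
   set all;                 /* combined data set from earlier */
   length city_std $40;
   if not missing(zip) then city_std = zipcity(zip);
run;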

sos
Calcite | Level 5

Thanks for your suggestion. I will try it.

