BookmarkSubscribeRSS Feed
Rohit_1990
Calcite | Level 5

Hi all,


I have two tables
as follows:

Table A

City                                    ZIP Code(s)
Alabama (AL)Huntsville       35801
Alaska (AK)Anchorage        99501
Arizona (AZ)Phoenix           85001
Arkansas (AR)Little Rock   72201
California (CA)Sacramento 94203
Los Angeles                        90001
Beverly Hills                        90209
Colorado (CO)Denver        80201
Conneticut (CT)Hartford    06101
Deleware (DE)Dover         19901
District of Columbia           20001

Table B

Zip      subcity
35801 alabama
35801 albama
35801 alabam
35801 hills1
35801 hills2
67980 hills3
67980 hills
67980 hills
90001 los
90001 albama
90001 rtff
80076 hartford
85001 Arizona
39801 alabama


I need to extract subcity from city of table A using look up within zip.

so that final output would be something like this


City                                          ZIP Code(s)     ZIP        SUBcity
Alabama (AL)Huntsville               35801        35801      alabama
Alaska (AK)Anchorage                 99501
Arizona (AZ)Phoenix                    85001       85001      Arizona
Arkansas (AR)Little Rock             72201
California (CA)Sacrament            94203
Los Angeles                                 90001       90001      los
Beverly Hills                                90209
Colorado (CO)Denver                80201
Conneticut (CT)Hartford             06101
Deleware (DE)Dover 19901
District of Columbia 20001


I have used left join using where condition of zip=ZIP Code(s)
AND CITY CONTAINS SUBCITY

BUT EXECUTION TIME IS VERY SLOW.
Is there any way to achieve result more efficiently?

 

Also,can we use a fuzzy search so that we have result like this as well


City                            ZIP Code(s) ZIP    SUBcity
Alabama (AL)Huntsville 35801     35801 alabama
Alabama (AL)Huntsville 35801     35801 albama
Alabama (AL)Huntsville 35801     35801 alabam

 

 

Thanks a lot !!!!!!!!!

 

10 REPLIES 10
tomrvincent
Rhodochrosite | Level 12
Other than a horrible number of misspellings, I'd suggest pulling everything to the right of ')' as the subcity. That won't fix your *Delaware* record but handles most of the other records.
Rohit_1990
Calcite | Level 5

Thanks for your response .

 

The part displayed as part in example has data in certain pattern where' ( 'can be used to delimit .

 

But This is not the actual data as actual data does not has any fixed pattern and somehow I need to first index my subcity in city within a zip and if condition holds true then need to pull out subcity.

ballardw
Super User

It would probably help to have the entire Proc SQL code you attempted.

 

Your "execution is very slow" might be improved by coding changes but need to see the whole thing you are using.

 

For example it seems likely your choice of " where condition of zip=ZIP Code(s)" might be better done as a JOIN ON of some flavor.

 

Reeza
Super User
Are the spelling issues in your actual data then?

There's several different versions of Alabama in there. If so I would say, first clean up Table 2. Then use it as a look up to table 1, via zip. You can then find the state name and remove that from the text. You also want to remove the (ST) or the state abbreviation and you'll be left with the city name. If you just merged on zip wouldn't you get the results indicated though? So not sure what more you may need here.
Rohit_1990
Calcite | Level 5
Yes spelling variations is present in data itself ,if it is a challenge to capture variations it can be left out but is there a way to capture exact matches.

Simply put I need to check subcity string in city string based on zip .
ChrisNZ
Tourmaline | Level 20

1. Merging on equality (equi-joins) is the fastest you can merge. If that's not fast enough you need to presort the tables.

2. As soon as you do fuzzy joins, performance plummets.

 

So the course of action would be:

1. Sort the tables by ZIP

2. Merge on ZIP equality and SUBCITY equal to the start of city (use scan() for the first word if long enough   or   scan() using the parentheses as delimiter  or  operator  =:  ,  or all these successively) 

3. What's hasn't been matched can be retried with other criteria including fuzzy ones, like using the function compged() . The cost get higher but the volumes get smaller.

 

Rohit_1990
Calcite | Level 5
Hi Chris thanks for you reply.

But I can use scan function as it is not necessary that the subcity exist in city at 1st position.

The city string can be also like this

Cuz ABC Alabama 80501
Ftt ytr ghi 80501 Alabama

Since pattern is not consistent I used contain operator that is city contains subcity but performance of such query is very slow.


Regards,
Rohit
ChrisNZ
Tourmaline | Level 20

You missed the essence of my reply:

To speed up the matches AND increase the match rate, perform successive matches rather than try to do it all in one go.

Rohit_1990
Calcite | Level 5
Hi Chris,

Can you please elaborate more on how to carry out successive matches. It would be of great help.

Regards
ChrisNZ
Tourmaline | Level 20

> Can you please elaborate more on how to carry out successive matches.

 

Not too sure whats unclear.

So the course of action would be:

1. Sort the tables by ZIP

2. Merge on ZIP equality and SUBCITY equal to the start of city (use scan() for the first word if long enough   or   scan() using the parentheses as delimiter  or  operator  =:  ,  or all these successively) 

3. What's hasn't been matched can be retried with other criteria including fuzzy ones, like using the function compged() . The cost get higher but the volumes get smaller.

1. Join the easily found matches using an obvious criterion like ZIP equality and SUBCITY = first word  => function scan()

2. Join the unmatched data on a less direct criterion like ZIP equality and SUBCITY = any word     => function index()

3. Repeat the process for unmatched data until satisfied: the volume to match goes down as the criterion increases in fuzziness.

4. When finished, append the successive matches. It is a good idea to keep track of the match method so the data includes some sort of match-quality score.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 1179 views
  • 2 likes
  • 5 in conversation