Re: INDEX of a substring in as string

Rohit_1990 · Posted 05-16-2019 02:13 PM

Hi all,

I have two tables
as follows:

Table A

City ZIP Code(s)
Alabama (AL)Huntsville 35801
Alaska (AK)Anchorage 99501
Arizona (AZ)Phoenix 85001
Arkansas (AR)Little Rock 72201
California (CA)Sacramento 94203
Los Angeles 90001
Beverly Hills 90209
Colorado (CO)Denver 80201
Conneticut (CT)Hartford 06101
Deleware (DE)Dover 19901
District of Columbia 20001

Table B

Zip subcity
35801 alabama
35801 albama
35801 alabam
35801 hills1
35801 hills2
67980 hills3
67980 hills
67980 hills
90001 los
90001 albama
90001 rtff
80076 hartford
85001 Arizona
39801 alabama

I need to extract subcity from city of table A using look up within zip.

so that final output would be something like this

City ZIP Code(s) ZIP SUBcity
Alabama (AL)Huntsville 35801 35801 alabama
Alaska (AK)Anchorage 99501
Arizona (AZ)Phoenix 85001 85001 Arizona
Arkansas (AR)Little Rock 72201
California (CA)Sacrament 94203
Los Angeles 90001 90001 los
Beverly Hills 90209
Colorado (CO)Denver 80201
Conneticut (CT)Hartford 06101
Deleware (DE)Dover 19901
District of Columbia 20001

I have used left join using where condition of zip=ZIP Code(s)
AND CITY CONTAINS SUBCITY

BUT EXECUTION TIME IS VERY SLOW.
Is there any way to achieve result more efficiently?

Also,can we use a fuzzy search so that we have result like this as well

City ZIP Code(s) ZIP SUBcity
Alabama (AL)Huntsville 35801 35801 alabama
Alabama (AL)Huntsville 35801 35801 albama
Alabama (AL)Huntsville 35801 35801 alabam

Thanks a lot !!!!!!!!!

tomrvincent · Posted 05-16-2019 02:29 PM

Other than a horrible number of misspellings, I'd suggest pulling everything to the right of ')' as the subcity. That won't fix your *Delaware* record but handles most of the other records.

Rohit_1990 · Posted 05-16-2019 02:41 PM

Thanks for your response .

The part displayed as part in example has data in certain pattern where' ( 'can be used to delimit .

But This is not the actual data as actual data does not has any fixed pattern and somehow I need to first index my subcity in city within a zip and if condition holds true then need to pull out subcity.

ballardw · Posted 05-16-2019 03:29 PM

It would probably help to have the entire Proc SQL code you attempted.

Your "execution is very slow" might be improved by coding changes but need to see the whole thing you are using.

For example it seems likely your choice of " where condition of zip=ZIP Code(s)" might be better done as a JOIN ON of some flavor.

Reeza · Posted 05-16-2019 03:44 PM

Are the spelling issues in your actual data then?

There's several different versions of Alabama in there. If so I would say, first clean up Table 2. Then use it as a look up to table 1, via zip. You can then find the state name and remove that from the text. You also want to remove the (ST) or the state abbreviation and you'll be left with the city name. If you just merged on zip wouldn't you get the results indicated though? So not sure what more you may need here.

Rohit_1990 · Posted 05-16-2019 04:01 PM

Yes spelling variations is present in data itself ,if it is a challenge to capture variations it can be left out but is there a way to capture exact matches.

Simply put I need to check subcity string in city string based on zip .

ChrisNZ · Posted 05-16-2019 10:25 PM

1. Merging on equality (equi-joins) is the fastest you can merge. If that's not fast enough you need to presort the tables.

2. As soon as you do fuzzy joins, performance plummets.

So the course of action would be:

1. Sort the tables by ZIP

2. Merge on ZIP equality and SUBCITY equal to the start of city (use scan() for the first word if long enough or scan() using the parentheses as delimiter or operator =: , or all these successively)

3. What's hasn't been matched can be retried with other criteria including fuzzy ones, like using the function compged() . The cost get higher but the volumes get smaller.

High-Performance SAS Coding - Third Edition

Rohit_1990 · Posted 05-16-2019 10:59 PM

Hi Chris thanks for you reply.

But I can use scan function as it is not necessary that the subcity exist in city at 1st position.

The city string can be also like this

Cuz ABC Alabama 80501
Ftt ytr ghi 80501 Alabama

Since pattern is not consistent I used contain operator that is city contains subcity but performance of such query is very slow.

Regards,
Rohit

ChrisNZ · Posted 05-16-2019 11:47 PM

You missed the essence of my reply:

To speed up the matches AND increase the match rate, perform successive matches rather than try to do it all in one go.

High-Performance SAS Coding - Third Edition

Rohit_1990 · Posted 05-17-2019 12:38 PM

Hi Chris,

Can you please elaborate more on how to carry out successive matches. It would be of great help.

Regards

ChrisNZ · Posted 05-19-2019 05:58 PM

> Can you please elaborate more on how to carry out successive matches.

Not too sure whats unclear.

So the course of action would be:

1. Sort the tables by ZIP

2. Merge on ZIP equality and SUBCITY equal to the start of city (use scan() for the first word if long enough or scan() using the parentheses as delimiter or operator =: , or all these successively)

3. What's hasn't been matched can be retried with other criteria including fuzzy ones, like using the function compged() . The cost get higher but the volumes get smaller.

1. Join the easily found matches using an obvious criterion like ZIP equality and SUBCITY = first word => function scan()

2. Join the unmatched data on a less direct criterion like ZIP equality and SUBCITY = any word => function index()

3. Repeat the process for unmatched data until satisfied: the volume to match goes down as the criterion increases in fuzziness.

4. When finished, append the successive matches. It is a good idea to keep track of the match method so the data includes some sort of match-quality score.

High-Performance SAS Coding - Third Edition

INDEX of a substring in as string