Solved: Re: How to isolate and select the specific string matching on the stri...

Cruise · Posted 05-05-2020 02:17 PM

Hi Folks:

I have about 20 survey tables that needed to be aligned to a REFERENCE data. ID1NAME and IDNAME are keys to merge them. They are province(ID1NAME) and city(IDNAME) names. City names are not unique but combination of ID1NAME and IDNAME are. ID1NAME is cleanable since at least there is a logic. However, IDNAME in survey tables are messy and no logic, to me, at least. Province name is included in the city name at front OR at the end of the variable varying across rows. Explanatory but non-standardized texts appear like: integrated, combined or altogether just to name few. There can be 1-4 open spaces as a delimiter varying across rows. I see the way to standardize IDNAME in survey table probably is to create separate each string separated by space into different variables and use array. Then find if IDNAME of the REFERENCE data in the array of these newly created variables. For example, the IDNAME of third row would lead to three variables such as: var1 for Integrated, var2 for Cheongju and var3 for City. Which is the array of var1-var3.

If IDNAME of REFERENCE data is found in this array of var1-var3 then output as a clean desired IDNAME as a new variable in survey datasets.

Can you please help execute this idea in SAS if this makes sense? Below is mock datasets. But I couldn't figure how to input strings separated by space(s) as one variable. I apologize!

DATA SURVEY;
INPUT ID1NAME $25 IDNAME $25; 
CARDS;
Chung sub Total 
Chung Chungju Chung
Chung Jecheon
Chung Integrated Cheongju City
Chung Boeun-gun
Chung Okcheon-gun
Chang Yeongdong-gun
Chang Chang Jincheon-gun
Chang Chang Goesan-gun 
Chang Combined Eumseong City 
Chang Danyang-gun 
Chang Jeungpyeong
;
DATA REFERENCE; 
INPUT ID1NAME $25 IDNAME $25; 
CARDS;
Chung Chungju
Chung Jecheon
Chung Cheongju
Chung Boeun
Chung Okcheon
Chang Yeongdong
Chang Jincheon
Chang Goesan
Chang Eumseong
Chang Danyang
Chang Jeungpyeong
;

ChrisNZ · Posted 05-05-2020 08:51 PM

Hopefully you needn't work on a row-by-row basis. I'd hope many rows have similar patterns, and once you code for a pattern all the concerned rows are processed adequately.

High-Performance SAS Coding - Third Edition

View solution in original post

biopharma · Posted 05-05-2020 06:43 PM

PRXMATCH is one way to do this.

proc sql ;
    create table match as
       select r.id1name, r.idname, s.idname as messyname 
          from reference as r
          left join survey as s
          on r.id1name = s.id1name 
          where prxmatch("/"||strip(r.idname)||"/",s.idname)>0;
quit ;

ChrisNZ · Posted 05-05-2020 07:45 PM

I would separate the values so that you have:

Chang Combined Eumseong City

going to:

PROVINCE=Chang

CITY=Eumseong

EXPL=Combined City

You can then focus on matching the City names.

PS:

> I couldn't figure how to input strings separated by space(s)

Use the &. See https://support.sas.com/resources/papers/proceedings/proceedings/sugi29/253-29.pdf

High-Performance SAS Coding - Third Edition

Cruise · Posted 05-05-2020 07:49 PM

@ChrisNZ

Thank you so much. I was just thinking of no one would respond to this post.

The problem is that for this particular row '

Eumseong

is the city name in the third column location. But, for another row the city name appear in the second column. Which means that I don't know which column matched to the city in the reference data, i.e., whether which one of Chang, Yeumseong or Combined City were the name for a city. And I have 20 survey tables which makes manual review almost impossible.

ChrisNZ · Posted 05-05-2020 07:54 PM

My point was to remove noise words such as Combined into a third variable.

High-Performance SAS Coding - Third Edition

Cruise · Posted 05-05-2020 07:56 PM

Noise words also happen at the beginning of the spring. Such as Integrated, Combined, Sub total et.c.

ChrisNZ · Posted 05-05-2020 08:02 PM

>Noise words also happen at the beginning of the spring.

Their location does not matter. They are known words, so easy to spot.

*Where they appear* may be used as separator however, maybe between province and city. So when you remove them is also a good time to look at that.

High-Performance SAS Coding - Third Edition

ChrisNZ · Posted 05-05-2020 07:59 PM

This kind of work is rather iterative:

You match data, then look at what hasn't matched, then discover new words or new defects or new patterns, match again, etc.

High-Performance SAS Coding - Third Edition

Cruise · Posted 05-05-2020 08:05 PM

This is exactly what I'm doing right now. I wondered if there was more automated way of doing this. I'm working on the mismatched data one on one basis for each mis-matched row. This work involves Google Translate as well.

ChrisNZ · Posted 05-05-2020 08:51 PM

Hopefully you needn't work on a row-by-row basis. I'd hope many rows have similar patterns, and once you code for a pattern all the concerned rows are processed adequately.

High-Performance SAS Coding - Third Edition

Cruise · Posted 05-05-2020 09:51 PM

Thanks for reassuring that all my efforts are at least not an unnecessary iterative work. This is very important to know. Thanks Chris.

How to isolate and select the specific string matching on the string int he second reference dataset

Re: How to isolate and select the specific string matching on the string int he second reference dat

Re: How to isolate and select the specific string matching on the string int he second reference dat

Re: How to isolate and select the specific string matching on the string int he second reference dat

Re: How to isolate and select the specific string matching on the string int he second reference dat

Re: How to isolate and select the specific string matching on the string int he second reference dat

Re: How to isolate and select the specific string matching on the string int he second reference dat

Re: How to isolate and select the specific string matching on the string int he second reference dat

Re: How to isolate and select the specific string matching on the string int he second reference dat

Re: How to isolate and select the specific string matching on the string int he second reference dat

Re: How to isolate and select the specific string matching on the string int he second reference dat

Re: How to isolate and select the specific string matching on the string int he second reference dat

Classroom Training Available!