BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Cruise
Ammonite | Level 13

Hi Folks:

 

I have about 20 survey tables that needed to be aligned to a REFERENCE data. ID1NAME and IDNAME are keys to merge them. They are province(ID1NAME) and city(IDNAME) names. City names are not unique but combination of ID1NAME and IDNAME are. ID1NAME is cleanable since at least there is a logic. However, IDNAME in survey tables are messy and no logic, to me, at least. Province name is included in the city name at front OR at the end of the variable varying across rows. Explanatory but non-standardized texts appear like: integrated, combined or altogether just to name few. There can be 1-4 open spaces as a delimiter varying across rows. I see the way to standardize IDNAME in survey table probably is to create separate each string separated by space into different variables and use array. Then find if IDNAME of the REFERENCE data in the array of these newly created variables. For example, the IDNAME of third row would lead to three variables such as: var1 for Integrated, var2 for Cheongju and var3 for City. Which is the array of var1-var3. 

If IDNAME of REFERENCE data is found in this array of var1-var3 then output as a clean desired IDNAME as a new variable in survey datasets. 

 

Can you please help execute this idea in SAS if this makes sense? Below is mock datasets. But I couldn't figure how to input strings separated by space(s) as one variable. I apologize!  

 

DATA SURVEY;
INPUT ID1NAME $25 IDNAME $25; 
CARDS;
Chung sub Total 
Chung Chungju Chung
Chung Jecheon
Chung Integrated Cheongju City
Chung Boeun-gun
Chung Okcheon-gun
Chang Yeongdong-gun
Chang Chang Jincheon-gun
Chang Chang Goesan-gun 
Chang Combined Eumseong City 
Chang Danyang-gun 
Chang Jeungpyeong
;
DATA REFERENCE; 
INPUT ID1NAME $25 IDNAME $25; 
CARDS;
Chung Chungju
Chung Jecheon
Chung Cheongju
Chung Boeun
Chung Okcheon
Chang Yeongdong
Chang Jincheon
Chang Goesan
Chang Eumseong
Chang Danyang
Chang Jeungpyeong
; 

 

1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

Hopefully you needn't work on a row-by-row basis. I'd hope many rows have similar patterns, and once you code for a pattern all the concerned rows are processed adequately. 

View solution in original post

10 REPLIES 10
biopharma
Quartz | Level 8

PRXMATCH is one way to do this.

proc sql ;
    create table match as
       select r.id1name, r.idname, s.idname as messyname 
          from reference as r
          left join survey as s
          on r.id1name = s.id1name 
          where prxmatch("/"||strip(r.idname)||"/",s.idname)>0;
quit ;
ChrisNZ
Tourmaline | Level 20

I would separate the values so that you have:

Chang Combined Eumseong City

going to:

PROVINCE=Chang

CITY=Eumseong

EXPL=Combined City

 

You can then focus on matching the City names.

 

PS: 

>  I couldn't figure how to input strings separated by space(s)

Use the &. See https://support.sas.com/resources/papers/proceedings/proceedings/sugi29/253-29.pdf

Cruise
Ammonite | Level 13

@ChrisNZ 

Thank you so much. I was just thinking of no one would respond to this post. 

 

The problem is that for this particular row '

Eumseong

is the city name in the third column location. But, for another row the city name appear in the second column. Which means that I don't know which column matched to the city in the reference data, i.e., whether which one of Chang, Yeumseong or Combined City were the name for a city. And I have 20 survey tables which makes manual review almost impossible.  

ChrisNZ
Tourmaline | Level 20

My point was to remove noise words such as Combined into a third variable.

Cruise
Ammonite | Level 13
Noise words also happen at the beginning of the spring. Such as Integrated, Combined, Sub total et.c.
ChrisNZ
Tourmaline | Level 20

>Noise words also happen at the beginning of the spring. 

Their location does not matter. They are known words, so easy to spot.

*Where they appear* may be used as separator however, maybe between province and city. So when you remove them is also a good time to look at that.

ChrisNZ
Tourmaline | Level 20

This kind of work is rather iterative:

You match data, then look at what hasn't matched, then discover new words or new defects or new patterns, match again, etc.

Cruise
Ammonite | Level 13
This is exactly what I'm doing right now. I wondered if there was more automated way of doing this. I'm working on the mismatched data one on one basis for each mis-matched row. This work involves Google Translate as well.
ChrisNZ
Tourmaline | Level 20

Hopefully you needn't work on a row-by-row basis. I'd hope many rows have similar patterns, and once you code for a pattern all the concerned rows are processed adequately. 

Cruise
Ammonite | Level 13
Thanks for reassuring that all my efforts are at least not an unnecessary iterative work. This is very important to know. Thanks Chris.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 784 views
  • 3 likes
  • 3 in conversation