Hi Folks:
I have about 20 survey tables that needed to be aligned to a REFERENCE data. ID1NAME and IDNAME are keys to merge them. They are province(ID1NAME) and city(IDNAME) names. City names are not unique but combination of ID1NAME and IDNAME are. ID1NAME is cleanable since at least there is a logic. However, IDNAME in survey tables are messy and no logic, to me, at least. Province name is included in the city name at front OR at the end of the variable varying across rows. Explanatory but non-standardized texts appear like: integrated, combined or altogether just to name few. There can be 1-4 open spaces as a delimiter varying across rows. I see the way to standardize IDNAME in survey table probably is to create separate each string separated by space into different variables and use array. Then find if IDNAME of the REFERENCE data in the array of these newly created variables. For example, the IDNAME of third row would lead to three variables such as: var1 for Integrated, var2 for Cheongju and var3 for City. Which is the array of var1-var3.
If IDNAME of REFERENCE data is found in this array of var1-var3 then output as a clean desired IDNAME as a new variable in survey datasets.
Can you please help execute this idea in SAS if this makes sense? Below is mock datasets. But I couldn't figure how to input strings separated by space(s) as one variable. I apologize!
DATA SURVEY;
INPUT ID1NAME $25 IDNAME $25;
CARDS;
Chung sub Total
Chung Chungju Chung
Chung Jecheon
Chung Integrated Cheongju City
Chung Boeun-gun
Chung Okcheon-gun
Chang Yeongdong-gun
Chang Chang Jincheon-gun
Chang Chang Goesan-gun
Chang Combined Eumseong City
Chang Danyang-gun
Chang Jeungpyeong
;
DATA REFERENCE;
INPUT ID1NAME $25 IDNAME $25;
CARDS;
Chung Chungju
Chung Jecheon
Chung Cheongju
Chung Boeun
Chung Okcheon
Chang Yeongdong
Chang Jincheon
Chang Goesan
Chang Eumseong
Chang Danyang
Chang Jeungpyeong
;
Hopefully you needn't work on a row-by-row basis. I'd hope many rows have similar patterns, and once you code for a pattern all the concerned rows are processed adequately.
PRXMATCH is one way to do this.
proc sql ; create table match as select r.id1name, r.idname, s.idname as messyname from reference as r left join survey as s on r.id1name = s.id1name where prxmatch("/"||strip(r.idname)||"/",s.idname)>0; quit ;
I would separate the values so that you have:
Chang Combined Eumseong City
going to:
PROVINCE=Chang
CITY=Eumseong
EXPL=Combined City
You can then focus on matching the City names.
PS:
> I couldn't figure how to input strings separated by space(s)
Use the &. See https://support.sas.com/resources/papers/proceedings/proceedings/sugi29/253-29.pdf
Thank you so much. I was just thinking of no one would respond to this post.
The problem is that for this particular row '
Eumseong
is the city name in the third column location. But, for another row the city name appear in the second column. Which means that I don't know which column matched to the city in the reference data, i.e., whether which one of Chang, Yeumseong or Combined City were the name for a city. And I have 20 survey tables which makes manual review almost impossible.
My point was to remove noise words such as Combined into a third variable.
>Noise words also happen at the beginning of the spring.
Their location does not matter. They are known words, so easy to spot.
*Where they appear* may be used as separator however, maybe between province and city. So when you remove them is also a good time to look at that.
This kind of work is rather iterative:
You match data, then look at what hasn't matched, then discover new words or new defects or new patterns, match again, etc.
Hopefully you needn't work on a row-by-row basis. I'd hope many rows have similar patterns, and once you code for a pattern all the concerned rows are processed adequately.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.