BookmarkSubscribeRSS Feed
amail94
Fluorite | Level 6

Hey all,

 

I currently attempting to create a new database that is just an improved version of one already in existence. One of the big problems (and massive eyesore) of the old dataset is that a particular variable (genotype) is a free text categorical variable so if you misspelled a diagnosis or accidentally added an extra space it counts that as a new level of that variable.

 

To paint the picture of how annoying this is and why we need to change it for the new database: This variable has approximately 35 levels that we are interested in but the current database has about 117 unique values for that variable.

 

I want to consolidate the old diagnoses into the new format (which is a categorical variable that takes a value between 1-13, where a value of 12 allows for a drop-down box to free text if in Redcap). Is there a way to quicken the process of changing the levels so that they don't repeat with weird (and useless) additions or am I about to just write a massive "if-then" statement?

4 REPLIES 4
Reeza
Super User
Does this need to repeatable or is this a one time process? If it's one time, then use Open Refine which is great to help you clean up categories quickly. But otherwise you're stuck writing IF/THEN statements. You could try fuzzy matching but you don't have much data so cleaning it manually is the fastest approach but won't scale though.
amail94
Fluorite | Level 6

Ok, I was worried that was probably the better, albeit not quickest, solution. It will be a one-time process (in theory) since once the new database goes live your only choice will be to select a value between 1-13 corresponding to that diagnosis.

 

When you say fuzzy matching, are you talking about the COMPGED function?

 

Thanks for your help

Reeza
Super User
There's more than one way to do fuzzy matching, but compged is a good start.
Patrick
Opal | Level 21

If this is a one time process and only about mapping 117 values to 35 then doing this "manually" in code appears as the quickes solution to me. 

Instead of IF/THEN you could create and use a SAS format.

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 4 replies
  • 788 views
  • 2 likes
  • 3 in conversation