I have three variables that contain genotyping data for patients. Each data line is a patient.
Spoligotype is a character variable which has 15 digits. Each sequence corresponds to a unique genotype and several patients can share the same genotype/sequence (meaning patients in one cluster will have the same genotype). This is how the proc freq output looks like. The dataset has ~700 observation.
Variable 2 is called MIRU. This contains a string of characters, alphabets and digits. Each unique string corresponds to a unique genotype which can be shared by a number of patients (meaning patients in one cluster will have the same genotype). The proc freq looks like this:
I want to group patients by the i) same MIRU ii) same spoligotype and iii) same MIRU and same spoligotype patterns. Since there are a large number of combinations and patterns, I am unable to use 'if/then' statements. How can I accomplish it?
If say patient 1, 11 & 200 have the same spoligotype value, I want to assign the value A to all three patients in a new variable. Each value in this new variable should represent a set of patients with same sequence of digits for spoligotype.
This is what my data looks like. Each line is a patient.
@Maheen123 wrote:
This is what my data looks like. Each line is a patient.
Show us the desired output for a portion of your data, please.
You didn't specify what rule(s) might be involved in assigning a value such as your example of A to patients 1, 11 and 200. Which is the wrong way to think about things. It is based on the value of the existing variable not the patient so if you think and refer the VALUE of spoligotype the answer gets a bit simpler. Since you have two variables, names starting with S and M I would think of assigning codes like S_1, S_2, S_3 and M_1, M_2 and M_3. The question would be how to pick the value that gets the code S_1 and then how to use it.
My choice would be to create custom formats based on the frequency of the values with S_1 being assigned to the most common S value, S_2 the next.
Here is an example that counts separate values of two variables from a data set, orders them by descending frequency, creates codes and then a format to use with those values. Note that since this is for an example the "codes" end up longer than the starting values. You should have the data set SASHELP.Class available so can run this code. One thing to remember with this approach is below the original values are not changed just using a format to display the value in Proc print. For many purposes you would not need to actually create a new variable as long as you have the formats, or the sets used to create the formats to rebuild them as needed.
proc freq data=sashelp.class noprint order=freq; tables sex /out=sexcount; tables age /out=agecount; run; data sexcntlin; set sexcount(keep=sex); length label $ 5; fmtname='Sexcode'; type='C'; retain label ; label = cats('S_',_n_); rename sex=start; run; data agecntlin; set agecount (keep=age); length label $ 5; fmtname='Agecode'; type='N'; retain label ; label = cats('A_',put(_n_,z2.)); rename age=start; run; proc format cntlin=sexcntlin; run; proc format cntlin=agecntlin; run; proc print data=sashelp.class noobs; format sex $sexcode. age agecode.; run;
What the code does:
Proc freq creates two data sets that count the values in descending frequency of the value. Proc SQL could do this as well but would require two Proc SQL calls to create separate sets. If wanted the codes in code value order remove the Order= option and the codes will be assigned in alphabetical order of the variable value.
The next two data steps use those sets to add in specially named variables that Proc Format uses: Fmtname, the name of a format; type whether the values are character, C, or numeric, N; start the value to base the code; label the value to display as the coded value.
Then there are two calls to the Proc Format using those created data sets to actually create the formats. Last is an example of Proc Print displaying the original values with the format.
This is appropriate even if you want to create new variables as can use Agecode = put(age, agecode.); to create those variables.
If you want to have a code for the combination of sex and age: Sex_Age_code = catx('_',put(sex,$sexcode.),put(age,agecode.)) would put another _ between the codes.
This avoids use of even attempting to use If/Then/Else or for multiple values the Select/when coding option.
Custom formats like this can be used in reports, analysis and usually for graphing (exceptions for custom date, time or datetime formats) removing much of the need for extra variables.
Some caveats: Use of formats is very flexible but you need to make sure the format(s) are available. One way to to have the created format catalog in a permanent library then you can point to that with the SAS FMTSEARCH option.
Or rerun the Proc Format code that generates the format in the SAS session you need to use them. If you have cntlin data sets as above then you would want to have them in a permanent library so you can rerun the code using them.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.