Ann297
Calcite | Level 5

Hello, everyone.

I am new to SAS and am learning GWAS. I want to do the following in a SAS program:


I want to treat each genotype as one of three patterns and convert it to a number: the pattern "X_Y" (for example "A_G") = 0, the pattern "X_X" (for example "A_A") = 1, and the pattern "Y_Y" (for example "G_G") = 2.

Moreover, I have thousands of these columns. The same rule applies column by column: for another SNP, "G_C" = 0 (the X_Y pattern), "G_G" = 1 (X_X), and "C_C" = 2 (Y_Y).
My SAS program follows. My confusion is how to define the patterns (X_Y), (X_X), and (Y_Y) so that I can select all the observations with the same pattern in SAS.

 

I wrote a loop to convert the genotypes into 0, 1, 2, but it only covers the genotypes 'AA', 'AG', 'AC':

Ann297_1-1677725259324.png

 

In this one I want to use XY to define a pattern for selecting genotypes that have the same pattern:

Ann297_0-1677725238693.png

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

In the DATA step, use the IN operator, and the mutually exclusive ELSE IF:

if svars{i} in ("A_A","T_T","C_C") then gvars{i} = 0;
else if substr(svars{i},1,2) in ("A_","T_","C_") then gvars{i} = 1;
else gvars{i} = 2;
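
For context, here is a minimal sketch of a complete DATA step around the snippet above. The dataset names (genotypes, coded), the variable names (id, s1-s3, g1-g3), and the sample values are hypothetical and only serve the illustration; the arrays svars and gvars are assumed to be declared as shown. The check for missing values is an addition that is not part of the original answer; without it, a blank genotype would fall through to the final ELSE and be coded as 2.

```sas
/* Hypothetical sample data: one row per sample, one character column per SNP */
data genotypes;
   input id $ s1 $ s2 $ s3 $;
datalines;
p1 A_A T_C G_G
p2 A_G T_T C_G
p3 G_G C_C C_C
;

data coded;
   set genotypes;
   array svars{3} s1-s3;   /* character genotype columns       */
   array gvars{3} g1-g3;   /* new numeric columns for the codes */
   do i = 1 to dim(svars);
      /* assumption: keep missing genotypes missing (not in the original answer) */
      if missing(svars{i}) then gvars{i} = .;
      else if svars{i} in ("A_A","T_T","C_C") then gvars{i} = 0;
      else if substr(svars{i},1,2) in ("A_","T_","C_") then gvars{i} = 1;
      else gvars{i} = 2;
   end;
   drop i;
run;
```

With thousands of SNP columns, the arrays could instead be declared over a variable range from the real data, such as the s38563_1--s11633_1 range mentioned later in this thread.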


8 REPLIES 8
ballardw
Super User

If you have a lot of variables with the same values that need to be mapped, then a custom informat or format may be better: the conversion becomes a single assignment statement instead of a series of if/then/else or CASE/WHEN blocks.

 

An example of some of what I think you are wanting:

proc format;
invalue xvalues (upcase)
'X_Y','A_A'=0
'X_X','A_G'=1
'Y_Y','A_C'=2
other= . 
;
run;

data example;
   input s $ t $ u $ v $;
datalines;
X_x z_x X_ X_Y
X_X Y_Y X_Y Y_Y
A_A X_Y A_c A_g
;
/* how to use to create a numeric variable*/
data use;
   set example;
   array in(*) s t u v;
   array out(4);
   do i=1 to dim(in);
      out[i]=input(in[i],xvalues.);
   end;
run;

Some details: In proc format, the INVALUE statement says that you can read a value on the left side of the = and get the result on the right. Since the name of the informat (the word after invalue) does not start with a $, the result is intended to be numeric. The optional (UPCASE) means the string value is converted to upper case before it is compared to the values in the lists. This is very helpful if your data entry has issues with values like X_X, X_x, x_x and x_X, as it handles them all the same.

The data step creates some values and shows code parallel to yours to assign the values to new variables but uses INPUT instead of if/then/else. Note that the Invalue has 2 values that both get assigned to the same result, which I think may be part of what you are looking for.

 

Now for two other bits. If I knew I needed this sort of transformation, I would do it when the data is read into SAS with a data step, which saves the whole conversion step (unless you have something else that needs the original codes). This step behaves like reading an external file and reads all those strings directly into numeric values.

data example2;
   informat s t u v xvalues.;
   input s  t  u  v ;
datalines;
X_x z_x X_ X_Y
X_X Y_Y X_Y Y_Y
A_A X_Y A_c A_g
;

And another feature, and it is a real feature: you can build some data checking into the informat, such as reporting an error for any value other than the explicitly stated values that map to 0, 1 and 2. That is quite helpful in some situations.

 

proc format;
invalue yvalues (upcase)
'X_Y','A_A'=0
'X_X','A_G'=1
'Y_Y','A_C'=2
other= _error_
;
run;
data example3;
   informat s t u v yvalues.;
   input s  t  u  v ;
datalines;
X_x z_x X_ X_Y
X_X Y_Y X_Y Y_Y
A_A X_Y A_c A_g
;

 

Ann297
Calcite | Level 5

Thank you so much, this example helps me understand more.

Sorry, I am not a native English speaker. What I am trying to say is that X_Y, X_X, Y_Y are not a specific value but a specific format, like below:

Ann297_0-1677900333528.png

In my case, this can only convert one kind of SNP, where the genotypes are 'A_A', 'A_G', 'C_C'.

Ann297_1-1677900543551.png

 

 

 

Tom
Super User

What I am trying to say is that X_Y, X_X, Y_Y are not a specific value but a specific format

I suspect you are using the word FORMAT as a synonym for STYLE or PATTERN instead.

 

The word FORMAT has a very specific meaning in SAS, and I don't think your use of format in this sentence is the SAS meaning.  In SAS a FORMAT is used to translate values into text, normally for PRINTING or DISPLAYING in a report.  So the DATE format is used to translate a number that represents the number of days since January 1, 1960 into a string in the style of DDMONYYYY, where DD is the day of the month, MON is the 3-letter abbreviation of the month name and YYYY is the year.

 

SAS also has INFORMATs, which are similar to formats but work the other way.  You use an INFORMAT to convert text into values.  So the DATE informat can convert a string like '01DEC2022' into the number that SAS uses to store that date.  If your goal is to convert a string like 'A_A' to a number like 0, then you will probably want to use an INFORMAT.
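
As a small illustration of that difference, the sketch below round-trips the date mentioned above through the DATE9. informat and format. The variable names days and text are hypothetical.

```sas
data _null_;
   /* informat: text '01DEC2022' -> number of days since 01JAN1960 */
   days = input('01DEC2022', date9.);
   /* format: the same number -> text in DDMONYYYY style */
   text = put(days, date9.);
   /* write both values to the log */
   put days= text=;
run;
```

INPUT with an informat goes from text to a stored value; PUT with a format goes from the stored value back to text.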

 

 

A little more SAS terminology.  SAS has datasets.  Datasets have variables and observations.  In the picture of your data you appear to have 7 variables, but you have shown the variable name you want to use for only 3 of them.  Every variable in a SAS dataset has to have its own name.  Your picture appears to have 4 observations, but perhaps the ... in the value of that first unnamed variable is supposed to indicate that your real data has many more than just 4 observations?

 

What is the pattern that you want to detect?  Do you want to convert the string 'A_G' to the number 0 no matter which of the three VARIABLES in your example data (the ones that hold similar three-character strings) it appears in?  Or is A_A somehow supposed to indicate that you want to match strings where the same two letters are separated by an underscore?  But if we are talking about genetic SNPs, then there is a very limited set of letters that can appear, so perhaps it would be easiest to just write out all of the possible combinations and the number you want each of them mapped onto.
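
If the second reading is the intended one, a minimal sketch of such a pattern test could compare the two letters on either side of the underscore. The dataset and variable names (pattern_demo, geno, allele1, allele2, homozygous) are hypothetical. Note that this only separates "same letter twice" from "two different letters"; deciding which homozygous genotype is X_X and which is Y_Y would still require knowing the two alleles of each SNP column.

```sas
data pattern_demo;
   input geno $3.;
   /* the two allele letters sit on either side of the underscore */
   allele1 = substr(geno, 1, 1);
   allele2 = substr(geno, 3, 1);
   /* 1 if both letters match (X_X style), 0 if they differ (X_Y style) */
   if length(geno) = 3 and substr(geno, 2, 1) = '_' then
      homozygous = (allele1 = allele2);
   else homozygous = .;   /* not a letter_letter string */
datalines;
A_A
A_G
G_G
G_C
;

proc print data=pattern_demo;
run;
```

In this sample, A_A and G_G should be flagged with 1, and A_G and G_C with 0.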

 

Also what is the overall analysis you want to perform?  Converting 'A_A' to zero is probably just a step along the way to your real goal.  But perhaps the real path to your goal does not need to include that step at all.

Ann297
Calcite | Level 5

Thank you both very much! I really appreciate your helping me get to the point!

I think I can make the issue clearer now.

 

With this code I can run a loop but can't use "like" to select specific observations.

Ann297_1-1678433178834.png

 

And with this code I can use "like" but cannot loop over the columns (s38563_1--s11633_1).

Ann297_2-1678433245823.png

I want to combine the two pieces of code by modifying one of them. Thank you in advance.

 

 

Kurt_Bremser
Super User

Neither of your code samples makes sense, because both contain semantically invalid conditions.

Do NOT post pictures of code, post code as text by copy/pasting it into a code box opened with the appropriate buttons. It is then easier for us to point out mistakes and make suggestions.

Kurt_Bremser
Super User

In the DATA step, use the IN operator, and the mutually exclusive ELSE IF:

if svars{i} in ("A_A","T_T","C_C") then gvars{i} = 0;
else if substr(svars{i},1,2) in ("A_","T_","C_") then gvars{i} = 1;
else gvars{i} = 2;
Ann297
Calcite | Level 5
Thank you so much! It worked!
Kurt_Bremser
Super User

Define an informat, so you can use a simple INPUT function for the conversion.

But it will still be a lot of writing in SQL, or need macro programming, as SQL does not have the array concept.

Either stay with the DATA step, or transpose to long first, run the SQL, and transpose back to wide.

Or create the SQL code from a variable list (kept in a dataset) and write it to a temporary file which you then %INCLUDE.
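
The "transpose to long, run the SQL, transpose back to wide" route could look roughly like the sketch below. All dataset and variable names (wide, long, long_num, wide_num, id, snp1, snp2, geno, g) and the sample values are hypothetical, and the CASE expression simply mirrors the IF/ELSE IF logic from the accepted solution rather than using an informat.

```sas
/* Hypothetical wide data: one row per sample, one character column per SNP */
data wide;
   input id $ snp1 $ snp2 $;
datalines;
s1 A_A G_C
s2 A_G G_G
;

/* Wide -> long: one row per sample and SNP (input must be sorted by id) */
proc transpose data=wide out=long(rename=(col1=geno)) name=snp;
   by id;
   var snp1 snp2;
run;

/* One expression now handles every SNP, since they share a single column */
proc sql;
   create table long_num as
   select id, snp,
          case
             when geno in ('A_A','T_T','C_C') then 0
             when substr(geno,1,2) in ('A_','T_','C_') then 1
             else 2
          end as g
   from long
   order by id, snp;
quit;

/* Long -> wide: numeric columns named after the original SNP columns */
proc transpose data=long_num out=wide_num(drop=_name_);
   by id;
   id snp;
   var g;
run;
```

The long layout is what lets a single SQL expression replace per-column code; the price is the two extra PROC TRANSPOSE steps.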
