BookmarkSubscribeRSS Feed
rwwalker
Calcite | Level 5

I have a file in which rows represent individual IDs and columns are genotype data for a given genetic variant. Below is a sample of the structure, but the overall data is ~300K IDs x ~400 variants.

 

I need to recode the "0, 1, 2" numerical categories for any given variant (column) into corresponding genotypes.

 

For example, I need to convert the 0, 1, 2 data of the column variable "rs12044597_G" into AA, GA, GG, where 0 = AA, 1 = GA and 2 = GG.

 

IDrs12044597_Grs12711521_A
112
202
322

 

I have tried the following:

 

data chr1_recode;
set chr1;

rs12044597 = .;

if (rs12044597_G =0) then rs12044597 = "AA";

if (rs12044597_G =1) then rs12044597 = "GA";

if (rs12044597_G =2) then rs12044597 = "GG";
RUN;

 

This does not work and results in large #s of errors, that look like this:

 

NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).
76:56 76:101 76:146
NOTE: Invalid numeric data, 'GA' , at line 76 column 101.
 
I realize that I am trying to convert numeric to character data, which is likely causing the issue. I'm familiar with the PUT statement for converting single data points from numeric to character, but how can this be incorporated into an if-then statement to achieve the data outcome I need, which should look like this:
IDrs12044597_Grs12711521_A
1GAAA
2AAAA
3GGAA
4GGAA
 
I'm using SAS University Edition.
Thanks!
7 REPLIES 7
SuryaKiran
Meteorite | Level 14

rs12044597 = .;  --> Remove this in your code. Your giving a value period which is missing numeric values and then later your giving a character value, that column is already defined as numeric and won't take character values.

 

You can also use format:

proc format ;
value cat 0='AA'
		  1='GA'
		  2='GG'
		  ;
run;

data chr1_recode;
format rs12044597_G rs12711521_A cat.;
set chr1;
RUN;

 

Thanks,
Suryakiran
PGStats
Opal | Level 21

Do you have a table/dataset listing the correspondance between numerical categories and genotypes for each genetic variant? If so, that table could easily be converted to formats (one for each variant).

 

The other way to proceed is to switch to a long data format (Variables: ID variant genotype) and join that table with the correspondance table.

PG
rwwalker
Calcite | Level 5

PGStats, I just generated a table. Sample below:

 

snp012
rs12044597_GAAGAGG
rs12711521_ACCACAA
rs3820071_AGGAGAA
rs3766160_AGGAGAA
rs10799790_TCCTCTT
rs3765407_TGGTGTT
rs477830_TCCTCTT

 

What would code look like to get me where I need to go? Keep in mind this is for ~500 variants across almost 400,000 ids

 

thanks

Reeza
Super User
What is the SNP in this context? The variable that this would map to? What do you mean by 500 variants? The number of records doesn't really matter here.
andreas_lds
Jade | Level 19

I thought that using a hash-object could be an easy way to solve the problem, totally forgetting that hash-objects are something beginners should not torture theirs minds with.

 

The following code uses the example-datasets, kindly supplied by @PGStats:

 

proc sql noprint;
   select cats('str_', Name, '=', Name)
      into :renameList separated by ' '
      from sashelp.vcolumn
         where LibName = 'WORK' and MemName = 'CHR1' and Name like 'rs%'
   ;

   select cats('str_', Name)
      into :nameList separated by ' '
      from sashelp.vcolumn
         where LibName = 'WORK' and MemName = 'CHR1' and Name like 'rs%'
   ;
quit;


data want;
   if 0 then set snp;

   if _n_ = 1 then do;
      declare hash h(dataset: 'work.snp');
      h.defineKey('snp', 'variant');
      h.defineData('code');
      h.defineDone();
   end;

   set chr1;
   length &nameList. $ 2;

   array old rs:;
   array new str_:;

   do i = 1 to dim(old);
      Snp = vname(old[i]);
      variant = old[i];

      if h.find() = 0 then do;
         new[i] = Code;
      end;
      else do;
         new[i] = .;
         put 'WARNING: No Code found';
         put _all_;
      end;
   end; 

   drop rs: i snp variant code;
   rename &renameList.;
run;

proc print;run;
PGStats
Opal | Level 21

The join method would go like this:

 

data chr1;
input ID rs12044597_G rs12711521_A ;
datalines;
1 1 2 
2 0 2 
3 2 2 
;

data snp;
input snp :$12. @;
do variant = 0 to 2;
    input code :$2. @;
    output;
    end;
datalines; 
rs12044597_G AA GA GG 
rs12711521_A CC AC AA 
rs3820071_A GG AG AA 
rs3766160_A GG AG AA 
rs10799790_T CC TC TT 
rs3765407_T GG TG TT 
rs477830_T CC TC TT 
;

proc transpose data=chr1 out=chrLong prefix=chr name=genotype;
by id;
var rs: ;
run;

proc sql;
create table chrLongCodes as
select 
    ID,
    genotype,
    code
from chrLong as a inner join snp on snp=genotype and chr1=variant
order by id, genotype;
quit;

proc transpose data=chrLongCodes out=chrCodes(drop=_name_);
by id;
id genotype;
var code;
run;

proc print data=chrCodes noobs; run;
PG
rwwalker
Calcite | Level 5

Thanks, I'll give this a shot

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 7 replies
  • 1074 views
  • 3 likes
  • 5 in conversation