I have a file in which rows represent individual IDs and columns are genotype data for a given genetic variant. Below is a sample of the structure, but the overall data is ~300K IDs x ~400 variants.
I need to recode the "0, 1, 2" numerical categories for any given variant (column) into corresponding genotypes.
For example, I need to convert the 0, 1, 2 data of the column variable "rs12044597_G" into AA, GA, GG, where 0 = AA, 1 = GA and 2 = GG.
ID | rs12044597_G | rs12711521_A |
1 | 1 | 2 |
2 | 0 | 2 |
3 | 2 | 2 |
I have tried the following:
data chr1_recode;
set chr1;
rs12044597 = .;
if (rs12044597_G =0) then rs12044597 = "AA";
if (rs12044597_G =1) then rs12044597 = "GA";
if (rs12044597_G =2) then rs12044597 = "GG";
RUN;
This does not work and results in large #s of errors, that look like this:
ID | rs12044597_G | rs12711521_A |
1 | GA | AA |
2 | AA | AA |
3 | GG | AA |
4 | GG | AA |
rs12044597 = .; --> Remove this in your code. Your giving a value period which is missing numeric values and then later your giving a character value, that column is already defined as numeric and won't take character values.
You can also use format:
proc format ;
value cat 0='AA'
1='GA'
2='GG'
;
run;
data chr1_recode;
format rs12044597_G rs12711521_A cat.;
set chr1;
RUN;
Do you have a table/dataset listing the correspondance between numerical categories and genotypes for each genetic variant? If so, that table could easily be converted to formats (one for each variant).
The other way to proceed is to switch to a long data format (Variables: ID variant genotype) and join that table with the correspondance table.
PGStats, I just generated a table. Sample below:
snp | 0 | 1 | 2 |
rs12044597_G | AA | GA | GG |
rs12711521_A | CC | AC | AA |
rs3820071_A | GG | AG | AA |
rs3766160_A | GG | AG | AA |
rs10799790_T | CC | TC | TT |
rs3765407_T | GG | TG | TT |
rs477830_T | CC | TC | TT |
What would code look like to get me where I need to go? Keep in mind this is for ~500 variants across almost 400,000 ids
thanks
I thought that using a hash-object could be an easy way to solve the problem, totally forgetting that hash-objects are something beginners should not torture theirs minds with.
The following code uses the example-datasets, kindly supplied by @PGStats:
proc sql noprint;
select cats('str_', Name, '=', Name)
into :renameList separated by ' '
from sashelp.vcolumn
where LibName = 'WORK' and MemName = 'CHR1' and Name like 'rs%'
;
select cats('str_', Name)
into :nameList separated by ' '
from sashelp.vcolumn
where LibName = 'WORK' and MemName = 'CHR1' and Name like 'rs%'
;
quit;
data want;
if 0 then set snp;
if _n_ = 1 then do;
declare hash h(dataset: 'work.snp');
h.defineKey('snp', 'variant');
h.defineData('code');
h.defineDone();
end;
set chr1;
length &nameList. $ 2;
array old rs:;
array new str_:;
do i = 1 to dim(old);
Snp = vname(old[i]);
variant = old[i];
if h.find() = 0 then do;
new[i] = Code;
end;
else do;
new[i] = .;
put 'WARNING: No Code found';
put _all_;
end;
end;
drop rs: i snp variant code;
rename &renameList.;
run;
proc print;run;
The join method would go like this:
data chr1;
input ID rs12044597_G rs12711521_A ;
datalines;
1 1 2
2 0 2
3 2 2
;
data snp;
input snp :$12. @;
do variant = 0 to 2;
input code :$2. @;
output;
end;
datalines;
rs12044597_G AA GA GG
rs12711521_A CC AC AA
rs3820071_A GG AG AA
rs3766160_A GG AG AA
rs10799790_T CC TC TT
rs3765407_T GG TG TT
rs477830_T CC TC TT
;
proc transpose data=chr1 out=chrLong prefix=chr name=genotype;
by id;
var rs: ;
run;
proc sql;
create table chrLongCodes as
select
ID,
genotype,
code
from chrLong as a inner join snp on snp=genotype and chr1=variant
order by id, genotype;
quit;
proc transpose data=chrLongCodes out=chrCodes(drop=_name_);
by id;
id genotype;
var code;
run;
proc print data=chrCodes noobs; run;
Thanks, I'll give this a shot
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.