Hello,
I am relatively new to SAS and I am working with microarray SNP data, fairly large. I need to reshape my date from long to wide. I tried the code below on a smaller data set and works well. An example data set is shown below:
Data MAlong
ID | SNPs | Genotype |
23456 | rs1234 | CC |
23456 | rs1235 | CC |
23456 | rs1236 | TT |
23456 | rs1237 | AA |
23456 | rs1238 | TT |
23456 | rs1239 | GG |
23456 | rs1240 | GG |
23456 | rs1241 | TT |
23456 | rs1242 | CC |
23456 | rs1243 | AA |
17235 | rs1234 | TT |
17235 | rs1235 | GG |
17235 | rs1236 | TT |
17235 | rs1237 | CC |
17235 | rs1238 | AA |
17235 | rs1239 | AA |
17235 | rs1240 | AG |
17235 | rs1241 | GG |
17235 | rs1242 | GG |
17235 | rs1243 | TC |
25342 | rs1234 | AA |
25342 | rs1235 | AG |
SAS code:
PROC TRANSPOSE data = MAlong out= MAWide ;
by ID not sorted;
var Genotype ;
ID SNPs;
run;
Resulting data set
Data MAwide
ID | rs1234 | rs1235 | rs1236 | rs1237 | rs1238 | rs1239 | rs1240 | rs1241 | rs1242 | rs1243 |
23456 | CC | CC | TT | AA | TT | GG | GG | TT | CC | AA |
17235 | TT | GG | TT | CC | AA | AA | AG | GG | GG | TC |
25342 | AA | AG | AA | AG | TT | CC | -- | GG | GG | GG |
For the larger data set, I am getting duplicate error by the ID SNPS that I am using. I know the SNP names are not duplicates, they are slightly different at the tale end of name but it seems SAS is assuming they are duplicates early. How can I make my script ignore duplicates? Using nodupkeys does not work as the SNP names are re-used for all SNPs. This is data from Illumina infinium array and their report lists genotypes per sample in long format, one at a time, and then start again.
Thanks,
Joy
@joa2 wrote:
... stuff deleted ...
For the larger data set, I am getting duplicate error by the ID SNPS that I am using. I know the SNP names are not duplicates, they are slightly different at the tale end of name but it seems SAS is assuming they are duplicates early. How can I make my script ignore duplicates? Using nodupkeys does not work as the SNP names are re-used for all SNPs. This is data from Illumina infinium array and their report lists genotypes per sample in long format, one at a time, and then start again.
Thanks,
Joy
You "know SNP names are not duplicates", but apparently SAS doesn't agree (you haven't shown your log, so I am assuming that is what you are describing). So find the duplicates that sas sees.
proc sort data=MAlong out=non_unique nouniquekey;
by id snps;
run;
Every obs in non_unique is a duplicate ID/SNPS dyad. You can examine them and decide what to do with them to prepare for proc transpose.
This is probably because the length of the SNPs variable is 8. Try increasing it. For example, these statements without the LENGTH statement produces an error, but is ok with the LENGTH statement.
data a;
length SNPs $ 200;
input ID SNPs $ Genotype $;
datalines;
23456 rs1234aaaa CC
23456 rs1234aaab CC
23456 rs1236 TT
23456 rs1237 AA
23456 rs1238 TT
23456 rs1239 GG
23456 rs1240 GG
23456 rs1241 TT
23456 rs1242 CC
23456 rs1243 AA
17235 rs1234 TT
17235 rs1235 GG
17235 rs1236 TT
17235 rs1237 CC
17235 rs1238 AA
17235 rs1239 AA
17235 rs1240 AG
17235 rs1241 GG
17235 rs1242 GG
17235 rs1243 TC
25342 rs1234 AA
25342 rs1235 AG
;
PROC TRANSPOSE data = a out= b ;
by ID notsorted;
var Genotype ;
ID SNPs;
run;
Then you should recode the SNPs character variable to an integer to generate a 1 to 1 mapping. Use that code as the ID variable, and make the original SNPs as the IDLABEL:
data MAlong ;
input ID SNPs :$200. Genotype :$2. ;
datalines;
23456 rs1234 CC
23456 rs1235 CC
23456 rs1236 TT
23456 rs1237 AA
23456 rs1238 TT
23456 rs1239 GG
23456 rs1240 GG
23456 rs1241 TT
23456 rs1242 CC
23456 rs1243 AA
17235 rs1234 TT
17235 rs1235 GG
17235 rs1236 TT
17235 rs1237 CC
17235 rs1238 AA
17235 rs1239 AA
17235 rs1240 AG
17235 rs1241 GG
17235 rs1242 GG
17235 rs1243 TC
25342 rs1234 AA
25342 rs1235 AG
25342 rs3456789_123456789_123456789_12A XA 33 characters for SNPs
25342 rs3456789_123456789_123456789_12B XB 33 characters for SNPs
run;
proc sort data=malong (keep=snps) out=snp_list nodupkey;
by snps;
run;
data pre_transpose (drop=_:) / view=pre_transpose;
if _n_=1 then do;
if 0 then set snp_list;
length snp_code $10;
declare hash h ();
h.definekey('snps');
h.definedata('snp_code');
h.definedone();
do until (end_of_snps);
set snp_list end=end_of_snps;
_s+1;
snp_code=catx('_','SCode',put(_s,z4.));
h.add();
end;
end;
set malong;
h.find();
run;
proc transpose data=pre_transpose out=MAWide (drop=_name_);
by ID notsorted;
var Genotype ;
id snp_code;
IDLABEL SNPs;
run;
The reason for the sort is so that the order of the columns represent the snps values in alphabetical order.
If SNP values are longer than 32 characters, and identical in the first 32, then that is where your duplicates happen. Keep in mind that SAS variable names are limited to 32 characters.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.