BookmarkSubscribeRSS Feed
joa2
Calcite | Level 5

Hello,

I am relatively  new to SAS and I am working with microarray SNP data, fairly large. I need to reshape my date from long to wide. I tried the code below on a smaller data set and works well. An example data set is shown below:

Data MAlong

IDSNPsGenotype
23456rs1234CC
23456rs1235CC
23456rs1236TT
23456rs1237AA
23456rs1238TT
23456rs1239GG
23456rs1240GG
23456rs1241TT
23456rs1242CC
23456rs1243AA
17235rs1234TT
17235rs1235GG
17235rs1236TT
17235rs1237CC
17235rs1238AA
17235rs1239AA
17235rs1240AG
17235rs1241GG
17235rs1242GG
17235rs1243TC
25342rs1234AA
25342rs1235AG

 

SAS code:

PROC TRANSPOSE data = MAlong out= MAWide ;
by ID not sorted;
var Genotype ;
ID SNPs;
run;

 

 

Resulting data set

Data MAwide

IDrs1234rs1235rs1236rs1237rs1238rs1239rs1240rs1241rs1242rs1243
23456CCCCTTAATTGGGGTTCCAA
17235TTGGTTCCAAAAAGGGGGTC
25342AAAGAAAGTTCC--GGGGGG

 

For the larger data set, I am getting duplicate error by the ID SNPS that I am using. I know the SNP names are not duplicates, they are slightly different at the tale end of name but it seems SAS is assuming they are duplicates early. How can I make my script ignore duplicates? Using nodupkeys does not work as the SNP names are re-used for all SNPs. This is data from Illumina infinium array and their report lists genotypes per sample in long format, one at a time, and then start again.

 

Thanks,

Joy

6 REPLIES 6
mkeintz
PROC Star

@joa2 wrote:

 

... stuff deleted ...

 

For the larger data set, I am getting duplicate error by the ID SNPS that I am using. I know the SNP names are not duplicates, they are slightly different at the tale end of name but it seems SAS is assuming they are duplicates early. How can I make my script ignore duplicates? Using nodupkeys does not work as the SNP names are re-used for all SNPs. This is data from Illumina infinium array and their report lists genotypes per sample in long format, one at a time, and then start again.

 

Thanks,

Joy


You "know SNP names are not duplicates", but apparently SAS doesn't agree (you haven't shown your log, so I am assuming that is what you are describing).  So find the duplicates that sas sees.

 

proc sort data=MAlong  out=non_unique nouniquekey;
  by id snps;
run;

Every obs in non_unique is a duplicate ID/SNPS dyad.  You can examine them and decide what to do with them to prepare for proc transpose.  

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
joa2
Calcite | Level 5
The data set is 650K by 1265 SNPs so it is hard to look at all duplicates.
The log file only me tion a few before stopping. What will be the best way
to solve this issue?
StatDave
SAS Super FREQ

This is probably because the length of the SNPs variable is 8. Try increasing it. For example, these statements without the LENGTH statement produces an error, but is ok with the LENGTH statement.

data a; 
length SNPs $ 200;
input ID	SNPs $	Genotype $;
datalines;
23456	rs1234aaaa	CC
23456	rs1234aaab	CC
23456	rs1236	TT
23456	rs1237	AA
23456	rs1238	TT
23456	rs1239	GG
23456	rs1240	GG
23456	rs1241	TT
23456	rs1242	CC
23456	rs1243	AA
17235	rs1234	TT
17235	rs1235	GG
17235	rs1236	TT
17235	rs1237	CC
17235	rs1238	AA
17235	rs1239	AA
17235	rs1240	AG
17235	rs1241	GG
17235	rs1242	GG
17235	rs1243	TC
25342	rs1234	AA
25342	rs1235	AG
;
PROC TRANSPOSE data = a out= b ;
by ID notsorted;
var Genotype ;
ID SNPs;
run;

 
joa2
Calcite | Level 5
The SNP variable length is 200, is there a way to add length to the proc
transpose statement? The actual data set is relatively large, 650K SNPs for
1265 samples.
mkeintz
PROC Star

Then you should recode the SNPs character variable to an integer to generate a 1 to 1 mapping.  Use that code as the ID variable, and make the original SNPs as the IDLABEL:

data MAlong ;
  input ID	SNPs :$200.	Genotype :$2. ;
datalines;
23456	rs1234	CC
23456	rs1235	CC
23456	rs1236	TT
23456	rs1237	AA
23456	rs1238	TT
23456	rs1239	GG
23456	rs1240	GG
23456	rs1241	TT
23456	rs1242	CC
23456	rs1243	AA
17235	rs1234	TT
17235	rs1235	GG
17235	rs1236	TT
17235	rs1237	CC
17235	rs1238	AA
17235	rs1239	AA
17235	rs1240	AG
17235	rs1241	GG
17235	rs1242	GG
17235	rs1243	TC
25342	rs1234	AA
25342	rs1235	AG
25342   rs3456789_123456789_123456789_12A   XA     33 characters for SNPs
25342   rs3456789_123456789_123456789_12B   XB     33 characters for SNPs
run;

proc sort data=malong (keep=snps) out=snp_list nodupkey;
  by snps;
run;

data pre_transpose (drop=_:) / view=pre_transpose;
  if _n_=1 then do;
    if 0 then set snp_list;
    length snp_code $10;
    declare hash h ();
      h.definekey('snps');
      h.definedata('snp_code');
      h.definedone();
    do until (end_of_snps);
      set snp_list end=end_of_snps;
      _s+1;
      snp_code=catx('_','SCode',put(_s,z4.));
      h.add();
    end;
  end;
  set malong;
  h.find();
run;

proc transpose data=pre_transpose out=MAWide (drop=_name_);
  by ID notsorted;
  var Genotype ;
  id snp_code;
  IDLABEL SNPs;
run;

The reason for the sort is so that the order of the columns represent the snps values in alphabetical order.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Kurt_Bremser
Super User

If SNP values are longer than 32 characters, and identical in the first 32, then that is where your duplicates happen. Keep in mind that SAS variable names are limited to 32 characters.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 6 replies
  • 817 views
  • 2 likes
  • 4 in conversation