topic Re: Reshaping long data to wide in Statistical Procedures

Reshaping long data to wide

joa2 — Tue, 08 Dec 2020 01:15:38 GMT

Hello,

I am relatively new to SAS and I am working with microarray SNP data, fairly large. I need to reshape my date from long to wide. I tried the code below on a smaller data set and works well. An example data set is shown below:

Data MAlong

ID	SNPs	Genotype
23456	rs1234	CC
23456	rs1235	CC
23456	rs1236	TT
23456	rs1237	AA
23456	rs1238	TT
23456	rs1239	GG
23456	rs1240	GG
23456	rs1241	TT
23456	rs1242	CC
23456	rs1243	AA
17235	rs1234	TT
17235	rs1235	GG
17235	rs1236	TT
17235	rs1237	CC
17235	rs1238	AA
17235	rs1239	AA
17235	rs1240	AG
17235	rs1241	GG
17235	rs1242	GG
17235	rs1243	TC
25342	rs1234	AA
25342	rs1235	AG

SAS code:

PROC TRANSPOSE data = MAlong out= MAWide ;
by ID not sorted;
var Genotype ;
ID SNPs;
run;

Resulting data set

Data MAwide

ID	rs1234	rs1235	rs1236	rs1237	rs1238	rs1239	rs1240	rs1241	rs1242	rs1243
23456	CC	CC	TT	AA	TT	GG	GG	TT	CC	AA
17235	TT	GG	TT	CC	AA	AA	AG	GG	GG	TC
25342	AA	AG	AA	AG	TT	CC	--	GG	GG	GG

For the larger data set, I am getting duplicate error by the ID SNPS that I am using. I know the SNP names are not duplicates, they are slightly different at the tale end of name but it seems SAS is assuming they are duplicates early. How can I make my script ignore duplicates? Using nodupkeys does not work as the SNP names are re-used for all SNPs. This is data from Illumina infinium array and their report lists genotypes per sample in long format, one at a time, and then start again.

Thanks,

Joy

Re: Reshaping long data to wide

mkeintz — Tue, 08 Dec 2020 01:46:48 GMT

@joa2 wrote:

... stuff deleted ...

For the larger data set, I am getting duplicate error by the ID SNPS that I am using. I know the SNP names are not duplicates, they are slightly different at the tale end of name but it seems SAS is assuming they are duplicates early. How can I make my script ignore duplicates? Using nodupkeys does not work as the SNP names are re-used for all SNPs. This is data from Illumina infinium array and their report lists genotypes per sample in long format, one at a time, and then start again.

Thanks,

Joy

You "know SNP names are not duplicates", but apparently SAS doesn't agree (you haven't shown your log, so I am assuming that is what you are describing). So find the duplicates that sas sees.

proc sort data=MAlong  out=non_unique nouniquekey;
  by id snps;
run;

Every obs in non_unique is a duplicate ID/SNPS dyad. You can examine them and decide what to do with them to prepare for proc transpose.

Re: Reshaping long data to wide

StatDave — Tue, 08 Dec 2020 02:02:50 GMT

This is probably because the length of the SNPs variable is 8. Try increasing it. For example, these statements without the LENGTH statement produces an error, but is ok with the LENGTH statement.

data a; 
length SNPs $ 200;
input ID	SNPs $	Genotype $;
datalines;
23456	rs1234aaaa	CC
23456	rs1234aaab	CC
23456	rs1236	TT
23456	rs1237	AA
23456	rs1238	TT
23456	rs1239	GG
23456	rs1240	GG
23456	rs1241	TT
23456	rs1242	CC
23456	rs1243	AA
17235	rs1234	TT
17235	rs1235	GG
17235	rs1236	TT
17235	rs1237	CC
17235	rs1238	AA
17235	rs1239	AA
17235	rs1240	AG
17235	rs1241	GG
17235	rs1242	GG
17235	rs1243	TC
25342	rs1234	AA
25342	rs1235	AG
;
PROC TRANSPOSE data = a out= b ;
by ID notsorted;
var Genotype ;
ID SNPs;
run;

Re: Reshaping long data to wide

joa2 — Tue, 08 Dec 2020 02:21:07 GMT

The data set is 650K by 1265 SNPs so it is hard to look at all duplicates.
The log file only me tion a few before stopping. What will be the best way
to solve this issue?

Re: Reshaping long data to wide

joa2 — Tue, 08 Dec 2020 02:39:07 GMT

The SNP variable length is 200, is there a way to add length to the proc
transpose statement? The actual data set is relatively large, 650K SNPs for
1265 samples.

Re: Reshaping long data to wide

mkeintz — Tue, 08 Dec 2020 04:23:52 GMT

Then you should recode the SNPs character variable to an integer to generate a 1 to 1 mapping. Use that code as the ID variable, and make the original SNPs as the IDLABEL:

data MAlong ;
  input ID	SNPs :$200.	Genotype :$2. ;
datalines;
23456	rs1234	CC
23456	rs1235	CC
23456	rs1236	TT
23456	rs1237	AA
23456	rs1238	TT
23456	rs1239	GG
23456	rs1240	GG
23456	rs1241	TT
23456	rs1242	CC
23456	rs1243	AA
17235	rs1234	TT
17235	rs1235	GG
17235	rs1236	TT
17235	rs1237	CC
17235	rs1238	AA
17235	rs1239	AA
17235	rs1240	AG
17235	rs1241	GG
17235	rs1242	GG
17235	rs1243	TC
25342	rs1234	AA
25342	rs1235	AG
25342   rs3456789_123456789_123456789_12A   XA     33 characters for SNPs
25342   rs3456789_123456789_123456789_12B   XB     33 characters for SNPs
run;

proc sort data=malong (keep=snps) out=snp_list nodupkey;
  by snps;
run;

data pre_transpose (drop=_:) / view=pre_transpose;
  if _n_=1 then do;
    if 0 then set snp_list;
    length snp_code $10;
    declare hash h ();
      h.definekey('snps');
      h.definedata('snp_code');
      h.definedone();
    do until (end_of_snps);
      set snp_list end=end_of_snps;
      _s+1;
      snp_code=catx('_','SCode',put(_s,z4.));
      h.add();
    end;
  end;
  set malong;
  h.find();
run;

proc transpose data=pre_transpose out=MAWide (drop=_name_);
  by ID notsorted;
  var Genotype ;
  id snp_code;
  IDLABEL SNPs;
run;

The reason for the sort is so that the order of the columns represent the snps values in alphabetical order.

Re: Reshaping long data to wide

Kurt_Bremser — Tue, 08 Dec 2020 08:46:47 GMT

If SNP values are longer than 32 characters, and identical in the first 32, then that is where your duplicates happen. Keep in mind that SAS variable names are limited to 32 characters.