Data Step: trailing @@ for GWAS genotype data

Alchemi — Thu, 27 Aug 2020 15:25:19 GMT

I have been sent some genotype data and need help reading it into SAS.

There are two files. The first is a listing of subjects (a text file with a .sample extension); the second is a single record containing the genotype data (a text file with a .gen extension).

The .gen file starts as follows (this is made-up data): 19 rs123456 12345678 T G 0 0 1 0.873 0.127 0.002... and so on. It runs to column 38064 and is a single record.

The first five items in the .gen file are: chromosome=19, SNP ID=rs123456, position=12345678, allele A='T' and allele B='G'.

The remaining items are triplets of numbers (genotype probabilities). The first three numbers (0, 0, 1) relate to the first record in the .sample file and can be called P_AA, P_AB, P_BB; the second three numbers (0.873, 0.127, 0.002) relate to the second record (with the same variable names P_AA, P_AB, P_BB), and so on. Every triplet corresponds to a subject in the .sample file, based on the ordering in both data files.

If I edit the .gen file and remove the "19 rs123456 12345678 T G " from the start of the .gen file I can read the data using:

data test;
infile 'myfile.gen' dlm=' ' lrecl=38039;
input P_AA P_AB P_BB @@;
run;

I then do a 1:1 merge with the .sample file and get what I want. It's a solution, but there must be a better way.

What I would like to do is read the original .gen file, essentially starting from column 26. I tried:

input @26  P_AA P_AB P_BB @@;

But this does not work. I tried other variations using two INPUT statements, but no luck.

Does anyone have any suggestions? This data structure is common in genotyping and GWAS, so I feel someone (smarter than me) must have solved this before.

Thank you.

Michael.

Re: Data Step: trailing @@ for GWAS genotype data

ballardw — Thu, 27 Aug 2020 17:02:49 GMT

"Doesn't work" is awfully vague.

Are there errors in the log? Post the code and log in a code box opened with the <> icon to maintain formatting of error messages.

No output? Post any log in a code box.

Unexpected output? Provide input data in the form of data step code pasted into a code box, the actual results and the expected results. Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the <> icon or attached as text to show exactly what you have and that we can test code against.

Since this is data step read syntax, then an actual example of the text to read would be helpful.

Things like the number and consistency of spaces, commas and other characters are very likely to play a role in reading a complex layout. Even things like the structure of your non-problem variables are an issue.

You show examples as triplets that are comma delimited inside () but the code you show would not read those. So that makes actual example data more critical.

It sounds like there should be a fixed number of triplets, so perhaps an array based read will work.

This example reads 2 triplets. If the data looks at all like yours then replace 2 with the number of triplets involved.

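data example;
   infile datalines ;
   /* read the leading field and hold the line with a trailing @ */
   input id $ @;
   /* one array per member of the triplet: AA1-AA2, AB1-AB2, BB1-BB2 */
   array AA{2};
   array AB{2};
   array BB{2};
   /* read each triplet in turn, holding the line between reads */
   do i=1 to 2;
     input AA[i] AB[i] BB[i] @;
   end;
   /* release the held line */
   input;
   /* write one observation per triplet */
   do i=1 to 2;
      P_AA=AA[i];
      P_AB=AB[i];
      P_BB=BB[i];
      output;
   end;
   keep id  P_AA P_AB P_BB;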

datalines;
abc   0.873 0.127 0.002    0.893 0.927 0.092 
pdq   0.973 0.197 0.902    0.879 0.129 0.009 
xyz   0.883 0.128 0.082    0.878 0.187 0.802 
;
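
For the .gen layout described in the question, the same hold-the-line idea could be sketched roughly as below. This is only an illustration under assumptions: NSUBJ is a hypothetical macro variable holding the number of subjects in the .sample file, the five leading fields are assumed to be space-separated, the names chr, snp_id, pos, allele_a, allele_b and subj are made up here, and the second data line is invented purely to show that each record is handled on its own.

%let nsubj = 4;  /* hypothetical: number of subject rows in the .sample file */

data gen_long;
   infile datalines truncover;
   length chr $2 snp_id $20 allele_a allele_b $1;
   /* SNP-level fields first; the trailing @ holds the record */
   input chr $ snp_id $ pos allele_a $ allele_b $ @;
   /* then one triplet per subject, writing one row for each */
   do subj = 1 to &nsubj;
      input P_AA P_AB P_BB @;
      output;
   end;
datalines;
19 rs123456 12345678 T G 0 0 1 0.873 0.127 0.002 0.252 0.746 0 0 1 0
19 rs123457 12345999 A C 1 0 0 0.5 0.5 0 0 0 1 0.1 0.2 0.7
;

With a real file, the INFILE statement would name the file and set LRECL= to at least the length of the longest record.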

Re: Data Step: trailing @@ for GWAS genotype data

FreelanceReinh — Thu, 27 Aug 2020 16:50:05 GMT

Hello @Alchemi and welcome to the SAS Support Communities!

You can move the pointer to column 26 with an INPUT statement that is executed only in the first iteration of the DATA step and then continue reading with the "@@" modifier until the end of the file (see EOF= option).

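data test;
/* LRECL= must cover the full unedited record (38064 columns in the question) */
infile 'myfile.gen' eof=finish lrecl=38064;
/* first iteration only: move the pointer to column 26 and hold the line */
if _n_=1 then input @26 @;
/* read one triplet per iteration; @@ holds the line across iterations */
input P_AA P_AB P_BB @@;
return;
/* reached when INPUT tries to read past the end of the file */
finish: stop;
run;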
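
A single statement such as input @26 P_AA P_AB P_BB @@; cannot do the job because the @26 pointer control is executed again in every iteration, so the pointer keeps jumping back to the first triplet. If the SNP-level fields are wanted as well, a variation on the code above could read them by list input instead of skipping to a fixed column. A minimal sketch, assuming a single-record file with space-separated fields as described; the names chr, snp_id, pos, allele_a and allele_b are my own choice, and the data line is the made-up example from the question:

data test_with_snp;
   infile datalines eof=finish;
   length chr $2 snp_id $20 allele_a allele_b $1;
   /* keep the SNP-level values on every output row */
   retain chr snp_id pos allele_a allele_b;
   /* first iteration only: read the five leading fields and hold the line */
   if _n_=1 then input chr $ snp_id $ pos allele_a $ allele_b $ @;
   /* every iteration: read one triplet; @@ holds the line across iterations */
   input P_AA P_AB P_BB @@;
   return;
finish: stop;
datalines;
19 rs123456 12345678 T G 0 0 1 0.873 0.127 0.002 0.252 0.746 0 0 1 0
;

This should give one observation per triplet, each carrying the same SNP-level values. With the real file, replace DATALINES with the file name and an LRECL= at least as long as the record.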

Re: Data Step: trailing @@ for GWAS genotype data

Alchemi — Thu, 27 Aug 2020 18:41:45 GMT

Thank you. I'd like to summarize.

Here are example datasets:

subjects.sample (header, second record, and then 4 subject records, but my real data had more subjects)

ID_1 ID_2 missing sex status
0 0 0 D B
A123E123 A123E123 0 0 -9
A123F456 A123F456 0 0 -9
A123G789 A123G789 0 0 -9
A456G123 A456G123 0 0 -9

genotype.gen (single record; this is a truncated example, but my real data had length 30,000 characters with many more triplets)

19 rs123456 12345678 T G 0 0 1 0.873 0.127 0.002 0.252 0.746 0 0 1 0

My solution was to edit the above to get simple triplets

editedgenotype.gen

0 0 1 0.873 0.127 0.002 0.252 0.746 0 0 1 0

data test0;
infile 'Q:\USERS\MEJones\temp\subjects.sample' dlm=' ' firstobs=3;
input ID_1 $ ID_2 $ missing $ sex $ status $;
run;

data test1;
infile 'Q:\USERS\MEJones\temp\editedgenotype.gen' dlm=' ' lrecl=43;
input P_AA P_AB P_BB @@;
run;

data all;
merge test0 test1;
run;

proc print data=all;
run;

My output

The SAS System

Obs  ID_1      ID_2      missing  sex  status  P_AA   P_AB   P_BB
1    A123E123  A123E123  0        0    -9      0.000  0.000  1.000
2    A123F456  A123F456  0        0    -9      0.873  0.127  0.002
3    A123G789  A123G789  0        0    -9      0.252  0.746  0.000
4    A456G123  A456G123  0        0    -9      0.000  1.000  0.000

The solution from FreelanceReinh

data test2;
infile 'Q:\USERS\MEJones\temp\genotype.gen' dlm=' ' lrecl=69  eof=finish ;
if _n_=1 then input @26 @;
input P_AA P_AB P_BB @@;
return;
finish: stop;
run;

data new;
merge test0 test2;
run;
proc print data=new;
run;

Reads the original dataset with no need to edit it 😁 and produces

The SAS System

Obs  ID_1      ID_2      missing  sex  status  P_AA   P_AB   P_BB
1    A123E123  A123E123  0        0    -9      0.000  0.000  1.000
2    A123F456  A123F456  0        0    -9      0.873  0.127  0.002
3    A123G789  A123G789  0        0    -9      0.252  0.746  0.000
4    A456G123  A456G123  0        0    -9      0.000  1.000  0.000
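
Because the MERGE has no BY statement, rows are paired purely by position, so it may be worth confirming that both data sets have the same number of observations before trusting the result. A minimal sketch, assuming the data sets test0 and test2 created above; the variable names n_subjects and n_triplets are arbitrary:

data _null_;
   /* NOBS= is set at compile time, so no rows are actually read */
   if 0 then set test0 nobs=n_subjects;
   if 0 then set test2 nobs=n_triplets;
   if n_subjects ne n_triplets then
      put 'WARNING: row counts differ: ' n_subjects= n_triplets=;
   else
      put 'NOTE: row counts match: ' n_subjects=;
   stop;
run;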

NOTE: Copyright (c) 2002-2012 by SAS Institute Inc., Cary, NC, USA.
NOTE: SAS (r) Proprietary Software 9.4 (TS1M3)
      Licensed to INSTITUTE OF CANCER RESEARCH, Site 70092393.
NOTE: This session is executing on the W32_8PRO platform.

NOTE: Updated analytical products:
      SAS/STAT 14.1
      SAS/ETS 14.1
      SAS/OR 14.1
      SAS/IML 14.1
      SAS/QC 14.1

NOTE: Additional host information:
      W32_8PRO WIN 6.2.9200 Workstation

NOTE: SAS initialization used:
      real time           2.06 seconds
      cpu time            0.71 seconds

1    data test0;
2    infile 'Q:\USERS\MEJones\temp\subjects.sample' dlm=' ' firstobs=3;
3    input ID_1 $ ID_2 $ missing $ sex $ status $;
4    run;

NOTE: The infile 'Q:\USERS\MEJones\temp\subjects.sample' is:
      Filename=Q:\USERS\MEJones\temp\subjects.sample,
      RECFM=V,LRECL=32767,File Size (bytes)=143,
      Last Modified=27 August 2020 19:04:52 o'clock,
      Create Time=27 August 2020 18:58:53 o'clock

NOTE: 4 records were read from the infile 'Q:\USERS\MEJones\temp\subjects.sample'.
      The minimum record length was 24.
      The maximum record length was 24.
NOTE: The data set WORK.TEST0 has 4 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      cpu time            0.00 seconds

5
6    data test1;
7    infile 'Q:\USERS\MEJones\temp\editedgenotype.gen' dlm=' ' lrecl=43
7  ! ;
8    input P_AA P_AB P_BB @@;
9    run;

NOTE: The infile 'Q:\USERS\MEJones\temp\editedgenotype.gen' is:
      Filename=Q:\USERS\MEJones\temp\editedgenotype.gen,
      RECFM=V,LRECL=43,File Size (bytes)=45,
      Last Modified=27 August 2020 19:04:29 o'clock,
      Create Time=27 August 2020 19:04:29 o'clock

NOTE: 1 record was read from the infile 'Q:\USERS\MEJones\temp\editedgenotype.gen'.
      The minimum record length was 43.
      The maximum record length was 43.
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.TEST1 has 4 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.01 seconds

10
11   data all;
12   merge test0 test1;
13   run;

NOTE: There were 4 observations read from the data set WORK.TEST0.
NOTE: There were 4 observations read from the data set WORK.TEST1.
NOTE: The data set WORK.ALL has 4 observations and 8 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.00 seconds

14
15   proc print data=all;
NOTE: Writing HTML Body file: sashtml.htm
16   run;

NOTE: There were 4 observations read from the data set WORK.ALL.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.34 seconds
      cpu time            0.06 seconds

17
18
19   data test2;
20   infile 'Q:\USERS\MEJones\temp\genotype.gen' dlm=' ' lrecl=69
20 ! eof=finish ;
21   if _n_=1 then input @26 @;
22   input P_AA P_AB P_BB @@;
23   return;
24   finish: stop;
25   run;

NOTE: The infile 'Q:\USERS\MEJones\temp\genotype.gen' is:
      Filename=Q:\USERS\MEJones\temp\genotype.gen,
      RECFM=V,LRECL=69,File Size (bytes)=71,
      Last Modified=27 August 2020 19:01:18 o'clock,
      Create Time=27 August 2020 18:59:38 o'clock

NOTE: 1 record was read from the infile 'Q:\USERS\MEJones\temp\genotype.gen'.
      The minimum record length was 69.
      The maximum record length was 69.
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.TEST2 has 4 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.06 seconds
      cpu time            0.03 seconds

26
27   data new;
28   merge test0 test2;
29   run;

NOTE: There were 4 observations read from the data set WORK.TEST0.
NOTE: There were 4 observations read from the data set WORK.TEST2.
NOTE: The data set WORK.NEW has 4 observations and 8 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.00 seconds

30   proc print data=new;
31   run;

NOTE: There were 4 observations read from the data set WORK.NEW.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds