I have been sent some genotype data and need help reading it into SAS.
Two files: first is a listing of subjects (text file with .sample extension); second is a single record containing the genotype data (text file with .gen extension).
The .gen file starts (this is made up data): 19 rs123456 12345678 T G 0 0 1 0.873 0.127 0.002... etc to column 38064 and is only one record.
The first five items in the .gen file are the: chromosome=19, SNP ID=rs123456, position=12345678, allele A='T' and allele B='G'
The next items are triplets of numbers (genotype probabilities). The first three numbers (0, 0, 1) relate to the first record in the .sample file and can be called P_AA, P_AB, P_BB; the second three numbers (0.873, 0.127, 0.002) relate to the second record (with the same variable names P_AA, P_AB, P_BB), and so on. Every triplet corresponds to a subject in the .sample file, based on the ordering in both data files.
If I edit the .gen file and remove the "19 rs123456 12345678 T G " from the start of the .gen file I can read the data using:
data test;
infile 'myfile.gen' dlm=' ' lrecl=38039;
input P_AA P_AB P_BB @@;
run;
And then I 1:1 merge this with the .sample file and get what I want. It's a solution, but there must be a better way.
What I would like to do is read the original .gen file, essentially starting from column 26. I tried:
input @26 P_AA P_AB P_BB @@;
But this does not work. I tried other variations using two input lines, but no luck.
Does anyone have any suggestions? This data structure is common in genotyping and GWAS, so I feel someone (smarter than me) must have solved this before.
Thank you.
Michael.
"Doesn't work" is awfully vague.
Are there errors in the log? Post the code and log in a code box opened with the <> icon to maintain the formatting of error messages.
No output? Post any log in a code box.
Unexpected output? Provide input data in the form of data step code pasted into a code box, the actual results and the expected results. Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the <> icon or attached as text to show exactly what you have and that we can test code against.
Since this is data step read syntax, then an actual example of the text to read would be helpful.
Things like the number and consistency of spaces, commas and other characters are very likely to play a role in reading a complex layout. Even things like the structure of your non-problem variables are an issue.
You show examples as triplets that are comma delimited inside () but the code you show would not read those. So that makes actual example data more critical.
It sounds like there should be a fixed number of triplets, so perhaps an array based read will work.
This example reads 2 triplets. If the data looks at all like yours then replace 2 with the number of triplets involved.
data example;
   infile datalines;
   input id $ @;
   array AA{2};
   array AB{2};
   array BB{2};
   do i=1 to 2;
      input AA[i] AB[i] BB[i] @;
   end;
   input;
   do i=1 to 2;
      P_AA=AA[i];
      P_AB=AB[i];
      P_BB=BB[i];
      output;
   end;
   keep id P_AA P_AB P_BB;
datalines;
abc 0.873 0.127 0.002 0.893 0.927 0.092
pdq 0.973 0.197 0.902 0.879 0.129 0.009
xyz 0.883 0.128 0.082 0.878 0.187 0.802
;
Hello @Alchemi and welcome to the SAS Support Communities!
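You can move the pointer to column 26 with an INPUT statement that is executed only in the first iteration of the DATA step and then continue reading with the "@@" modifier until the end of the file (see EOF= option). Without the _N_=1 condition, @26 would move the pointer back to column 26 in every iteration, so the first triplet would be read again and again.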
data test;
infile 'myfile.gen' eof=finish lrecl=38064; /* LRECL must cover the full original record (to column 38064) */
if _n_=1 then input @26 @;
input P_AA P_AB P_BB @@;
return;
finish: stop;
run;
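If the five leading fields should be kept alongside each triplet, a variation on the same idea might look like the sketch below. It is not from the original posts: the data set name gen_long, the variable names chrom, snp_id, position, allele_a, allele_b and subject_order, and the character lengths are all assumptions for illustration. It uses the INFILE options COLUMN= and LENGTH= to stop when the pointer has passed the end of the record, and it assumes every triplet is complete and non-missing.

```sas
/* Sketch only: names and lengths below are assumptions, not part of the thread */
data gen_long;
   /* COLUMN= holds the current pointer position, LENGTH= the record length */
   infile 'myfile.gen' lrecl=38064 truncover column=col length=len;
   length chrom $5 snp_id $20 allele_a allele_b $1;
   /* read the five leading fields once and hold the record */
   input chrom $ snp_id $ position allele_a $ allele_b $ @;
   /* read one triplet per pass until the pointer is past the end of the record */
   do subject_order = 1 by 1 while (col <= len);
      input P_AA P_AB P_BB @;
      /* guard against trailing blanks producing an empty triplet */
      if not missing(P_AA) then output;
   end;
run;
```

Each output row then carries the SNP fields plus subject_order, which gives the position of the matching subject in the .sample file.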
Thank you. I'd like to summarize.
Here are example datasets:
subjects.sample (header, second record, and then 4 subject records, but my real data had more subjects)
ID_1 ID_2 missing sex status
0 0 0 D B
A123E123 A123E123 0 0 -9
A123F456 A123F456 0 0 -9
A123G789 A123G789 0 0 -9
A456G123 A456G123 0 0 -9
genotype.gen (single record; this is a truncated example, but my real data had a length of 30,000 characters with many more triplets)
19 rs123456 12345678 T G 0 0 1 0.873 0.127 0.002 0.252 0.746 0 0 1 0
My solution was to edit the above to get simple triplets
editedgenotype.gen
0 0 1 0.873 0.127 0.002 0.252 0.746 0 0 1 0
data test0;
infile 'Q:\USERS\MEJones\temp\subjects.sample' dlm=' ' firstobs=3;
input ID_1 $ ID_2 $ missing $ sex $ status $;
run;
data test1;
infile 'Q:\USERS\MEJones\temp\editedgenotype.gen' dlm=' ' lrecl=43;
input P_AA P_AB P_BB @@;
run;
data all;
merge test0 test1;
run;
proc print data=all;
run;
My output
The SAS System
ID_1 | ID_2 | missing | sex | status | P_AA | P_AB | P_BB |
A123E123 | A123E123 | 0 | 0 | -9 | 0.000 | 0.000 | 1.000 |
A123F456 | A123F456 | 0 | 0 | -9 | 0.873 | 0.127 | 0.002 |
A123G789 | A123G789 | 0 | 0 | -9 | 0.252 | 0.746 | 0.000 |
A456G123 | A456G123 | 0 | 0 | -9 | 0.000 | 1.000 | 0.000 |
The solution from FreelanceReinhard:
data test2;
infile 'Q:\USERS\MEJones\temp\genotype.gen' dlm=' ' lrecl=69 eof=finish ;
if _n_=1 then input @26 @;
input P_AA P_AB P_BB @@;
return;
finish: stop;
run;
data new;
merge test0 test2;
run;
proc print data=new;
run;
Reads the original dataset with no need to edit it 😁 and produces
The SAS System
ID_1 | ID_2 | missing | sex | status | P_AA | P_AB | P_BB |
A123E123 | A123E123 | 0 | 0 | -9 | 0.000 | 0.000 | 1.000 |
A123F456 | A123F456 | 0 | 0 | -9 | 0.873 | 0.127 | 0.002 |
A123G789 | A123G789 | 0 | 0 | -9 | 0.252 | 0.746 | 0.000 |
A456G123 | A456G123 | 0 | 0 | -9 | 0.000 | 1.000 | 0.000 |
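Because the merge has no BY statement, it pairs rows purely by position, so it is only right if both data sets have the same number of rows in the same order. A minimal sanity check on the row counts could look like this sketch; the variable names n_subjects and n_triplets are made up for illustration, and the check says nothing about ordering.

```sas
/* Sketch: compare row counts before trusting the positional merge */
data _null_;
   /* NOBS= is set at compile time, so no rows are actually read */
   if 0 then set test0 nobs=n_subjects;
   if 0 then set test2 nobs=n_triplets;
   if n_subjects ne n_triplets then
      put 'WARNING: row counts differ: ' n_subjects= n_triplets=;
   else
      put 'NOTE: row counts match: ' n_subjects=;
   stop;
run;
```

The message is written to the SAS log. A mismatch would suggest a truncated record (LRECL too small) or a .sample file that does not belong to this .gen file.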
NOTE: Copyright (c) 2002-2012 by SAS Institute Inc., Cary, NC, USA.
NOTE: SAS (r) Proprietary Software 9.4 (TS1M3)
Licensed to INSTITUTE OF CANCER RESEARCH, Site 70092393.
NOTE: This session is executing on the W32_8PRO platform.
NOTE: Updated analytical products:
SAS/STAT 14.1
SAS/ETS 14.1
SAS/OR 14.1
SAS/IML 14.1
SAS/QC 14.1
NOTE: Additional host information:
W32_8PRO WIN 6.2.9200 Workstation
NOTE: SAS initialization used:
real time 2.06 seconds
cpu time 0.71 seconds
1 data test0;
2 infile 'Q:\USERS\MEJones\temp\subjects.sample' dlm=' ' firstobs=3;
3 input ID_1 $ ID_2 $ missing $ sex $ status $;
4 run;
NOTE: The infile 'Q:\USERS\MEJones\temp\subjects.sample' is:
Filename=Q:\USERS\MEJones\temp\subjects.sample,
RECFM=V,LRECL=32767,File Size (bytes)=143,
Last Modified=27 August 2020 19:04:52 o'clock,
Create Time=27 August 2020 18:58:53 o'clock
NOTE: 4 records were read from the infile
'Q:\USERS\MEJones\temp\subjects.sample'.
The minimum record length was 24.
The maximum record length was 24.
NOTE: The data set WORK.TEST0 has 4 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.00 seconds
5
6 data test1;
7 infile 'Q:\USERS\MEJones\temp\editedgenotype.gen' dlm=' ' lrecl=43
7 ! ;
8 input P_AA P_AB P_BB @@;
9 run;
NOTE: The infile 'Q:\USERS\MEJones\temp\editedgenotype.gen' is:
Filename=Q:\USERS\MEJones\temp\editedgenotype.gen,
RECFM=V,LRECL=43,File Size (bytes)=45,
Last Modified=27 August 2020 19:04:29 o'clock,
Create Time=27 August 2020 19:04:29 o'clock
NOTE: 1 record was read from the infile
'Q:\USERS\MEJones\temp\editedgenotype.gen'.
The minimum record length was 43.
The maximum record length was 43.
NOTE: SAS went to a new line when INPUT statement reached past the end
of a line.
NOTE: The data set WORK.TEST1 has 4 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.02 seconds
cpu time 0.01 seconds
10
11 data all;
12 merge test0 test1;
13 run;
NOTE: There were 4 observations read from the data set WORK.TEST0.
NOTE: There were 4 observations read from the data set WORK.TEST1.
NOTE: The data set WORK.ALL has 4 observations and 8 variables.
NOTE: DATA statement used (Total process time):
real time 0.02 seconds
cpu time 0.00 seconds
14
15 proc print data=all;
NOTE: Writing HTML Body file: sashtml.htm
16 run;
NOTE: There were 4 observations read from the data set WORK.ALL.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.34 seconds
cpu time 0.06 seconds
17
18
19 data test2;
20 infile 'Q:\USERS\MEJones\temp\genotype.gen' dlm=' ' lrecl=69
20 ! eof=finish ;
21 if _n_=1 then input @26 @;
22 input P_AA P_AB P_BB @@;
23 return;
24 finish: stop;
25 run;
NOTE: The infile 'Q:\USERS\MEJones\temp\genotype.gen' is:
Filename=Q:\USERS\MEJones\temp\genotype.gen,
RECFM=V,LRECL=69,File Size (bytes)=71,
Last Modified=27 August 2020 19:01:18 o'clock,
Create Time=27 August 2020 18:59:38 o'clock
NOTE: 1 record was read from the infile
'Q:\USERS\MEJones\temp\genotype.gen'.
The minimum record length was 69.
The maximum record length was 69.
NOTE: SAS went to a new line when INPUT statement reached past the end
of a line.
NOTE: The data set WORK.TEST2 has 4 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.06 seconds
cpu time 0.03 seconds
26
27 data new;
28 merge test0 test2;
29 run;
NOTE: There were 4 observations read from the data set WORK.TEST0.
NOTE: There were 4 observations read from the data set WORK.TEST2.
NOTE: The data set WORK.NEW has 4 observations and 8 variables.
NOTE: DATA statement used (Total process time):
real time 0.02 seconds
cpu time 0.00 seconds
30 proc print data=new;
31 run;
NOTE: There were 4 observations read from the data set WORK.NEW.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds