10-12-2016 07:22 AM
Dear friends, I need your help again.
I have a data set with more than 2,000,000 variables:
id snp1 snp2 snp3 ... snp2000000
I want to export it to any kind of delimited file and preserve the header.
I tried a few things, like:

proc export data=bikovi (obs=0) outfile="C:\Users\Maja\Desktop\gv50.csv" dbms=csv replace; run;
data _null_;
  set bikovi;
  file "C:\Users\Maja\Desktop\gv50.csv" mod dsd dlm=',' lrecl=5000000;
  put (_all_) (:);
run;
and while the data are fine, my header stops at snp3708 every time.
What could be the reason?
I tried it on a smaller data set (40,000 variables) and the story was the same.
My data are very simple:
id is a character value of length 15 (but if needed I can make it numeric: 1, 2, 3, ..., 671), while the SNPs are the numbers 0, 1, 2.
I also tried the robust macro from
http://support.sas.com/kb/24/727.html but this one only complicated everything.
Hope you guys can help me.
10-12-2016 04:30 PM - edited 10-12-2016 04:32 PM
Just replace your PROC EXPORT with a data step to write the header.
filename snp "C:\Users\Maja\Desktop\gv50.csv";

data _null_;
  file snp dsd lrecl=10000000;
  put 'id' @;
  do i=1 to 2000000;
    name = cats('snp',i);
    put name @;
  end;
  put;
run;

data _null_;
  file snp dsd lrecl=10000000 mod;
  set bikovi;
  put ( id snp1-snp2000000 ) (+0);
run;
(or do it in the other data step)
10-12-2016 07:50 AM
First of all, I have the notion that a data set with 2 million variables is the result of a design failure (in how the data is structured); I guess that switching from a wide to a long format would be a big improvement for further handling of the data.
With which application are you viewing your csv?
10-12-2016 07:56 AM
Well, it is not useful to discuss here why my colleague needs wide instead of long format. I have the data in long format, but as I said, this is not up to me...
However, those are genotype data, a lot of info... This is a kind of standard format.
TextPad opens the file nicely, no issues at all. If issues appeared, I would use a large-file text viewer or something to check.
As I said, the data are fine. Only the header stops.
No problem with exporting the data. But the guy needs the header too...
10-12-2016 08:18 AM
The data step example you posted will not create a header line; only the PROC EXPORT does.
Now, PROC EXPORT generates a data step that does the export, and runs it.
I suspect that in this generated data step, SAS runs into the limit on statement length (32767) when it tries to write the header line.
I guess you might have to roll your own macro for this:
%macro output_file;
  data _null_;
    set dataset;
    file "c:\outfile.csv" lrecl=1000000000 dlm=',';
    if _n_ = 1 then do;
      put "ID" @;
      %do i = 1 %to 2000000;
        put ",SNP&i" @;
      %end;
    end;
    put id @;
    %do i = 1 %to 2000000;
      put snp&i. @;
    %end;
  run;
%mend;
This splits the big put for the header into single puts, as is done in the proc export-generated data step for the variables.
10-12-2016 08:45 AM
I tried it on a smaller data set (40,000 variables).
But after the last header column (snp40000), the data start to appear on the same line.
Oops. Add a simple put; statement immediately before the end; of the if _n_ = 1 block (but after the %end;).
10-12-2016 10:23 AM
If you have that many variables and the column labels are longer than the values, then you need to make sure that the output line (LRECL) is long enough to hold the text of the headers.
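For example, one way to estimate the required LRECL before exporting is to sum the lengths of the variable names from the dictionary view. This is only a sketch; WORK.BIKOVI stands in for whatever data set is actually being exported:

```sas
data _null_;
  set sashelp.vcolumn end=last;
  where libname='WORK' and memname='BIKOVI';
  total + length(name) + 1;  /* +1 per name for the delimiter */
  if last then put 'Header needs at least ' total 'bytes; LRECL must exceed this.';
run;
```

With 2,000,000 names like snp1234567 plus a comma each, the header line alone approaches 22 MB, so an LRECL in the default range is far too short.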
10-13-2016 02:20 AM
hm... still not working...
%macro output_file;
  %let maxvar=2000000;
  data _null_;
    set dataset;
    file "c:\outfile.csv" lrecl=1000000000 dlm=',';
    if _n_ = 1 then do;
      put "ID" @;
      %do i = 1 %to %eval(&maxvar-1);
        put ",SNP&i" @;
      %end;
      put ",SNP&maxvar"; /* no trailing @, so the header line terminates */
    end;
    put id @;
    %do i = 1 %to &maxvar;
      put snp&i. @;
    %end;
    put; /* terminate the data line */
  run;
%mend;
That should deal with the issue of the header line not terminating correctly.
10-12-2016 10:47 AM
Actually, handling fat 'blobs' is important, and SAS has RECFM=F and RECFM=N to handle these situations. Also useful with binary data. Thanks for the question.

HAVE: 2,000,000 character variables with values

Variable   Type Len Value
T00000001  C    10  0000000000
T00000002  C    10  0000000002
T00000003  C    10  0000000003
T00000004  C    10  0000000004
T00000005  C    10  0000000005
T00000006  C    10  0000000006
T00000007  C    10  0000000007
T00000008  C    10  0000000008
....
T02000000  C    10  0020000000

WANT: CSV FILE, 2,000,000 COLUMNS WIDE

"T0000000" ,"T00000002","T00000003","T00000004","T00000005",..."T20000000"
"00000000" ,"000000002","000000003","000000004","000000005",..."020000000"
"10000000" ,"100000002","100000003","100000004","100000005",..."120000000"
"20000000" ,"200000002","200000003","200000004","200000005",..."220000000"

SOLUTION (MIGHT BE EASIER WITH RECFM=N?)
========================================
This is not a complete solution, but it should scale. Might get you started. FSLIST is critical to understanding.

filename fixblk "d:/txt/fixblk.txt" lrecl=12 recfm=f;

data _null_;
  length varout $12;
  file fixblk;
  varout=' "T0000000"';
  put varout @;
  do var=2 to 1000;
    varout=cats(',"T',put(var,z8.),'"');
    put varout @;
  end;
  put '202020202020202020200D0A'x @;
  varout=put(0,z9.);
  put varout @;
  do var=2 to 1000;
    varout=cats(',',put(var,z10.));
    put varout;
  end;
  put '202020202020202020200D0A'x @;
run;quit;

* turn numbers on and hex on;
filename fixblk "d:/txt/fixblk.txt" lrecl=120 recfm=f;
proc fslist file=fixblk;
run;quit;

* import to check;
data fixblk;
  infile "d:/txt/fixblk.txt" firstobs=2 lrecl=32756;
  array t $10 t0001-t1000;
  input;
  do i=1 to 1000;
    t[i]=scan(_infile_,i,',');
  end;
  output;
run;quit;
10-12-2016 10:48 AM
The code should be right.
You must have generated that file under UNIX and browsed it under Windows.
They have different end-of-line characters, so you will see everything on the same row.
I suggest you open that file with Notepad++.
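If mismatched line endings are indeed the cause, one option (a sketch, not tested against this data; the fileref snpout is made up) is to force Windows-style line endings explicitly with the TERMSTR= option when writing the file:

```sas
filename snpout "C:\Users\Maja\Desktop\gv50.csv" termstr=crlf;

data _null_;
  file snpout dsd lrecl=10000000;  /* CRLF line endings regardless of host OS */
  set bikovi;
  put (_all_) (:);
run;
```

That way the file displays correctly row by row in Windows viewers even when it is produced on UNIX.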