BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.

Dear friends I need your help again.

I have data set with more than 2000000 variables

 

id snp1 snp2 snp3 ... snp2000000

 

i want to export it to any kind of delimited file, and preserve header.

I tried few stuff like:

 

PROC EXPORT DATA= bikovi (OBS=0)
OUTFILE= "C:\Users\Maja\Desktop\gv50.csv"
DBMS=CSV REPLACE;
PUTNAMES=YES;
RUN;


DATA _NULL_;
SET bikovi;
FILE "C:\Users\Maja\Desktop\gv50.csv" MOD DSD DLM=',' LRECL=5000000 ;
PUT (_ALL_) (:) ;
RUN;

 

and while the data are fine, my header is stopping at snp3708 every time. 

What can be a reason?

I tried on smaller data set (40000 variables) and the story was the same.

 

My data are very simple

id is character value length 15 but if needed i can have it as numeric 1,2,3...671 while SNPs are numbers 0,1,2.

I tried also a robust macro from 

http://support.sas.com/kb/24/727.html but this one only complicated everything

Hope you guys can help me

Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

Just replace your PROC EXPORT with a data step to write the header.

 

filename snp  "C:\Users\Maja\Desktop\gv50.csv";

data _null_ ;
  file snp dsd lrecl=10000000 ;
  put 'id' @;
  do i=1 to 2000000;
    name = cats('snp',i);
    put name @;
  end;
  put;
run;

data _null_ ;
  file snp dsd lrecl=10000000 mod ;
  set bikovi;
  put ( id snp1-snp2000000 ) (+0) ;
run;

 

(or do it in the other data step)

 

View solution in original post

18 REPLIES 18
Kurt_Bremser
Super User

First of all, I have the notion that a dataset with 2 million variables is the result of a design failure (holding data in structure), I guess that switching from a wide to a long format would be a big improvement in further handling the data.

 

With which application are you viewing your csv?

MajaFerencakovic
Fluorite | Level 6

well, here it is not usefull to discuss why my colegue need wide instead of long format. I have them in long. But as I said it, this is not up to me...

However those are genotype data, lot of info... This is kind of standard format.

 

TextPad opens this one nicely. No issues at all. If issues would apear I woul use long file text viewer or something to check. 

As I said, data are fine. Only header stops. 

No problem with exporting data. But guy needs header to... 

LinusH
Tourmaline | Level 20

Why on earth do you have 2'' variables?

What is the purpose on such a layout?

General advice - transpose from wide to long.

Data never sleeps
MajaFerencakovic
Fluorite | Level 6
guys, this is not helpful.
This is standard .ped format used in genetics. So can we skip the part why would I need to have 2000000 variables. I have them, I must deal with them...
Kurt_Bremser
Super User

The data step example you posted will not create a header line, only the proc export.

 

Now, proc export creates a data step that does the export, and runs that.

I have the suspicion that in this data step, SAS runs into the limitation for statements (32767) when it tries to write the header line.

 

I guess you might have to roll your own macro for this:

%macro output_file;
data _null_;
set dataset;
file "c:\outfile.csv" lrecl=1000000000 dlm=',';
if _n_ = 1
then do;
  put "ID"@;
%do i = 1 %to 2000000;
  put ",SNP&i"@;
%end;
end;
put id@;
%do i = 1 %to 2000000;
put snp&i.@;
%end;
run;
%mend;

This splits the big put for the header into single puts, as is done in the proc export-generated data step for the variables.

MajaFerencakovic
Fluorite | Level 6

looks promising.

I tried it on smaller data set (40000). 

But, after the last one header (snp40000), data starts to apear in the same line.

 

 

Kurt_Bremser
Super User

@MajaFerencakovic wrote:

looks promising.

I tried it on smaller data set (40000). 

But, after the last one header (snp40000), data starts to apear in the same line.

 

 


Ups. Add a simple put; statement immediately before the end; of the if _n_ = 1 block, but after the %end;

MajaFerencakovic
Fluorite | Level 6

hm... still not working...

 

😞

ballardw
Super User

If you have that many variables and the column labels are longer than the values then you need to make sure that the output line is long enough to hold the text of the headers.

MajaFerencakovic
Fluorite | Level 6

the file by itself is fine. All headers are in but data starts in the same line as headers

Kurt_Bremser
Super User

@MajaFerencakovic wrote:

hm... still not working...

 

😞


%macro output_file;
%let maxvar=2000000;
data _null_;
set dataset;
file "c:\outfile.csv" lrecl=1000000000 dlm=',';
if _n_ = 1
then do;
  put "ID"@;
%do i = 1 %to %eval(&maxvar-1);
  put ",SNP&i"@;
%end;
end;
put ",SNP&maxvar";
put id@;
%do i = 1 %to &maxvar;
put snp&i.@;
%end;
run;
%mend;

That should deal with the issue of the header line not terminating correctly.

MajaFerencakovic
Fluorite | Level 6
Works!!! Thank you! It is nice to have so many working solutions.
rogerjdeangelis
Barite | Level 11
Actually handling fat 'blobs' is important and SAS has recforms f and n to handle these situations. Also useful with binary data.

Thanks for the question

HAVE 2,000,000 variables with values

 -- CHARACTER --
Variable          Len     Value

T00000001    C    10      0000000000
T00000002    C    10      0000000002
T00000003    C    10      0000000003
T00000004    C    10      0000000004
T00000005    C    10      0000000005
T00000006    C    10      0000000006
T00000007    C    10      0000000007
T00000008    C    10      0000000008
....
T02000000    C    10      0020000000


WANT CSV FILE 9 2,000,000 WIDE

"T0000000"  ,"T00000002","T00000003","T00000004","T00000005",..."T20000000"

"00000000"  ,"000000002","000000003","000000004","000000005",..."020000000"
"10000000"  ,"100000002","100000003","100000004","100000005",..."120000000"
"20000000"  ,"200000002","200000003","200000004","200000005",..."220000000"


SOLUTION (MIGHT BE EASIER WITH RECFM=N?)
=======================================
this is not a complete solution but it should scale.
Might get you started

FSLIST is critical to understanding


filename fixblk "d:/txt/fixblk.txt" lrecl=12 recfm=f;

data _null_;
  length varout $12;
  file fixblk;
  varout=' "T0000000"';
  put varout @;
  do var=2 to 1000;
    varout=cats(',"T',put(var,z8.),'"');
    put varout @;
  end;
  put '202020202020202020200D0A'x @;
  varout=put(0,z9.);
  put varout @;
  do var=2 to 1000;
    varout=cats(',',put(var,z10.));
    put varout;
  end;
  put '202020202020202020200D0A'x @;
run;quit;

* turn numbers on and hex on;
filename fixblk "d:/txt/fixblk.txt" lrecl=120 recfm=f;
proc fslist file=fixblk;
run;quit;

* import to check;
data fixblk;
  infile "d:/txt/fixblk.txt" firstobs=2 lrecl=32756;
  array t[1000] $10 t0001-t1000;
  input;
  do i=1 to 1000;
    t[i]=scan(_infile_,i,',');
  end;
  output;
run;quit;
Ksharp
Super User

Code should be right.

You must generate that file under UNIX and browse it under WINDOWS.

They have different end of row character , so you will see they are in the same row.

suggest you open that file by Notepad++ .

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 18 replies
  • 3158 views
  • 7 likes
  • 8 in conversation