- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I need to export a very large dataset to csv, but I need to break the output into multiple, smaller files for the downstream process. The split need to incorporate logic so that certain groups of records are not split across files. Can anyone suggest ways to do this?
My current plan: look at what proc export sticks in the log (it seems to be a macro, despite the name), scrape that code and doctor it do the split logic. I'm pretty sure this will work (I haven't seen the spec for the logic yet), but there may be a less hacky way. If so, RSVP.export
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
CSV files are trivial to write with a SAS datastep. If you ignore the header record you just need:
data _null_;
set MYDATA ;
file MYFILE dsd dlm=',' lrecl=32000 ;
put (_all_) (:) ;
run;
To make multiple files you can use the FILEVAR option on the FILE statement to control the name of the target file with data step logic.
You can build macro variables with the list of variable names from the metadata (SASHELP.VCOLUMN or use PROC CONTENTS) to drive the put statement. You can also build a macro variable with valid code to put the variable names (or labels) as the header.
You can use a double DOW loop to count how many records you have per by group before you start the group so that you can decide if you should start a new file before the current BY value.
So something like this: (UNTESTED)
%let varlist=A B C;
%let titles="A" ',' "B" ',' "C" ;
%let maxobs=10000 ;
data _null_;
do nobs=1 by 1 until (last.group);
set mydata ;
by group ;
end;
do until (last.group) ;
set mydata ;
by group ;
if nrecs=0 or (nrecs + nobs > &maxobs) then do;
filenumber+1;
nrecs=0;
filename = cats('myfile',filenumber,'.csv');
end;
file dummy filevar=filename dsd dlm=',' lrecl=32000 ;
nrecs+1;
if nrecs=1 then put &titles;
put &varlist ;
end;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The support.sas.com site has several macros that will split files by groups.
38430 - Exporting BY groups from a SAS data set to multiple Microsoft Excel or text files
I don't know if you have a true "BY" group though without more details.
I think this code can also be modified to export to CSV simply enough:
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I do a lot of macro programming, so this approach is, at least, familiar territory. I think I'm going to try the ODS solution, but I worry that I will have mygrp in the output (and it wasn't in the original input).
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi: Rather than tinkering with PROC EXPORT, I would use the NEWFILE=BYGROUP with ODS CSV. There's a code example below. In the first example, a simple BY AGE was used in PROC PRINT to create a separate output file for each age. In the second example, I used a DATA step program to make some groups using IF logic, then I use the new group variable in my BY statement. In the second example, you should get 3 output files.
cynthia
proc sort data=sashelp.class;
by age;
run;
title; footnote;
** 1) will get 1 csv file for each age;
ods csv file='c:\temp\age11.csv'
newfile=bygroup;
proc print data=sashelp.class;
by age;
run;
ods csv close;
** 2) use my logic to make groups;
data newfile;
** create new groups using logic;
set sashelp.class;
if age in (11,12) then
mygrp = 1;
else if age in (13, 14) then
mygrp = 2;
else mygrp = 3;
run;
proc sort data=newfile;
by mygrp;
run;
** will make mgrp1, mgrp2, mgrp3 files;
ods csv file='c:\temp\mgrp1.csv'
newfile=bygroup;
proc print data=newfile;
by mygrp;
run;
ods csv close;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I've never used ODS, and this seems like the newthink. When I learn more about the split logic, I'll give this a try.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Without a VAR statement all variables except the ones listed in the BY statement will appear.
Use OPTIONS NOBYLINE to suppress any BY line that PROC PRINT might generate.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I've never done this. I've always used drop= or keep= on the input dataset to get the columns I want. (Any way, I will have to enumerate the columns I want, which means being aware of them. It would be nice to avoid that: drop= would only need to know about the columns I create. But I can't BY process on a dropped column.)
It's always good to post here and get help.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Cynthia,
I'm trying your solution because it seems require the least coding on my part. I have to say, however, that ODS is a multi-headed monster, and I can't find anything in the help about "ods csv". (I can find "ods csvall", but that's a different beast.) Can you point me to some documentation?
Second question: I'd like NULLs in my dataset to be represented by consecutive delimiters (i.e. an empty field), but I'm getting a '.' (missing). Can you tell me how to turn off the dot?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
One way regarding how to "turn off" the dots:
options missing=' ';
Some documentation: I got a number of related hits doing a Google search on:
ods csv sas
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks.
I'd like to write a general purpose macro (although that's a secondary priority),so I'd certainly prefer to only turn off the dots for the proc print and not in general. Maybe I can detect the state of that option and restore it after I do my work.
edit:
%let missopt = %sysfunc(getoption(missing,keyword)); /* store the existing option away */
options missing='';
...
options &missopt;
This seems to be what I wanted.
Interesting about the google search; I've been searching both my local help file and support.sas.com and not getting good results. (I get results, just not relevant one.)
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Well, I was disappointed today. I tested my code on a 100K row dataset, splitting into 10k chunks but not splitting across a special set of vars. It worked.
BUT, when I ran it on the real 47M row dataset, I got
ERROR: The SAS System stopped processing this step because of insufficient memory.
on a machine with 24Gig of RAM and a Terabyte of drive space.
Now, to try Tom's approach.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Just out of curiosity I have two questions: (1) Did you sort your real dataset before running the code? AND (2)does your work directory happen to be on a drive that has very limited memory?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
1. No: I used an index
2. No. Plenty of drive space
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
ods csv involves far mode than you need
Other ODS destinations take better advantage of the overheads of the destination approach.
To generate a csv file requires very little memory
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello, This looks like something that I can use, but I need some help refining the script. I'd like to use a version of the following script: ** 1) will get 1 csv file for each age; ods csv file='c:\temp\age11.csv' newfile=bygroup; proc print data=sashelp.class; by age; run; ods csv close; but I'd like for each exported csv file to have a specific name in the form "Q_age_value.csv", where "age-value" is the value of the variable age. Can someone suggest a way to do this?