js5
Pyrite | Level 9

Hello,

 

I need to export an approximately 55 GB dataset with 160 million rows as .csv. Due to issues with quoting I ended up using the %ds2csv macro. On our set-up this takes between 4 and 5 hours, which is a lot. Looking at Task Manager, the dataset is being downloaded in chunks at 65 Mbps, and the SAS CPU load hovers at 15%, which would correspond to fully loading one CPU core.

Is there a simple way of speeding the export up? Reducing the amount of data that needs uploading would require a significant rewrite of the ETL pipeline. Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions
js5
Pyrite | Level 9

It turns out %ds2csv was the actual culprit. I switched back to proc export because %ds2csv was not producing any timer information. To my surprise, writing to gzip was miles faster, finishing in under 25 minutes:

 

NOTE: 142213133 records were written to the file CSVXPORT.
      The minimum record length was 155.
      The maximum record length was 486.
NOTE: There were 142213132 observations read from the data set SAS_LIB.xxx.
NOTE: There were 142213132 observations read from the data set SPARKDTM.xxx.
NOTE: DATA statement used (Total process time):
      real time           22:32.09
      cpu time            17:38.70
      

Since I do not need the superior quoting of %ds2csv for all the datasets, I can probably use it only when absolutely necessary and stick to proc export otherwise.
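For reference, a minimal sketch of the set-up described above, assuming the CSVXPORT fileref from the log points at a gzip file via the ZIP access method (available since SAS 9.4M5); the path and dataset names are placeholders:

```sas
/* Sketch: PROC EXPORT writing through a gzip fileref.
   Path and dataset names below are placeholders. */
filename csvxport zip "/path/to/export.csv.gz" gzip;

proc export data=sas_lib.xxx
            outfile=csvxport
            dbms=csv replace;
run;
```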

 


12 REPLIES
andreas_lds
Jade | Level 19

Please post proc contents of the dataset. 

 

 

js5
Pyrite | Level 9
                                                       The CONTENTS Procedure

                         Data Set Name        xxxxxxxx.yyyyy_yyyyyyy_v3    Observations           142213132
                         Member Type          DATA                         Variables              39       
                         Engine               V9                           Indexes                0        
                         Created              11/10/2025 12:19:06          Observation Length     802      
                         Last Modified        11/10/2025 12:19:06          Deleted Observations   0        
                         Protection                                        Compressed             CHAR     
                         Data Set Type                                     Reuse Space            NO       
                         Label                                             Point to Observations  YES      
                         Data Representation  WINDOWS_64                   Sorted                 NO       
                         Encoding             utf-8  Unicode (UTF-8)                                       


                                                 Engine/Host Dependent Information

        Data Set Page Size          65536                                                                                  
        Number of Data Set Pages    801994                                                                                 
        Number of Data Set Repairs  0                                                                                      
        ExtendObsCounter            YES                                                                                    
        Filename                    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\yyyyy_yyyyyyy_v3.sas7bdat
        Release Created             9.0401M7                                                                               
        Host Created                X64_DSRV19                                                                             
        Owner Name                  aaaaaaaaaaaaaaa                                                                        
        File Size                   49GB                                                                                   
        File Size (bytes)           52559544320                                                                            


                                             Alphabetic List of Variables and Attributes
 
                                #    Variable              Type    Len    Format      Label

                               39    a                     Char      2                               
                               28    b                     Char      5                               
                               29    c                     Char      5                               
                               30    d                     Char      5                               
                               31    e                     Char      5                               
                               32    f                     Char      5                               
                               25    g                     Char      5                               
                               33    h                     Char      5                               
                               26    i                     Char      5                               
                               21    j                     Char      5                               
                               24    k                     Char      5                               
                               22    l                     Char      5                               
                               23    m                     Char      5                               
                               27    n                     Char      5                               
                               11    COMMENT               Char    100                COMMENT        
                               12    CONTROLs              Char      2                CONTROLs       
                               36    COUNT                 Num       8                Frequency Count
                               14    Cells per Well        Num       8                               
                                3    EXPERIMENT NAME       Char     50                EXPERIMENT NAME
                               10    EXPERT COMMENT        Char    100                EXPERT COMMENT 
                                2    FILE NAME             Char     50                FILE NAME      
                                8    FlowJo VERSION        Char     10                FlowJo VERSION 
                               38    Frequency             Num       8                               
                               34    o                     Char      5                               
                                9    PANEL ID              Char     20                PANEL ID       
                                4    PATIENT ID            Char     20                PATIENT ID     
                                7    PROTOCOL ID           Char     50                PROTOCOL ID    
                               16    RFSTDT                Num       4    E8601DA.                   
                               15    SAMPLE DATE           Num       4    E8601DA.                   
                                6    SAMPLE ID             Char     20                SAMPLE ID      
                                5    SAMPLE TYPE           Char     20                SAMPLE TYPE    
                               17    Day                   Num       3                               
                               35    p                     Char      5                               
                               13    Visit                 Char     20                               
                               18    basename              Char    100                               
                               20    q                     Char     10                               
                               19    r                     Char     10                               
                               37    count_sum_carstate    Num       8                               
                                1    source                Char    100                               
Quentin
Super User

You mention upload and download.  Can you describe your environment a bit more?  

 

Is this SAS running on a server, or local PC SAS?  If you've got a SAS dataset sitting on a Windows server, and you're then exporting and downloading a CSV, it could be that the network speed is causing the long run time.  As a test, you could change your code to write the CSV file to the Work library.  That should show how long it takes to write the CSV file, without the overhead of network transfer.  If you determine the issue is network latency, you could work with your admin on that, and perhaps consider compressing the CSV file before it is transferred.

 

Typically I've found PROC EXPORT to generate a CSV to be fast, so I would give that a try to see if it can handle your data.
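A minimal sketch of that baseline test, writing the CSV into the Work location; the fileref and dataset name are placeholders:

```sas
/* Baseline test: write the CSV to the Work directory to time the
   write alone, without network transfer. Dataset name is a placeholder. */
options fullstimer;
filename testcsv "%sysfunc(pathname(work))/test_export.csv";

proc export data=mylib.bigdata
            outfile=testcsv
            dbms=csv replace;
run;
```

Comparing the real time of this run against the current 4-5 hours should show how much of the elapsed time is network overhead.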

js5
Pyrite | Level 9

SAS is running on a server. It fetches the data from a remote server and exports the .csv into SAS's work directory (I am using filename temp). I do not think that network bandwidth is the issue, as the download speed goes up and down, indicative of the network transfer waiting for something.

The reason I am using %ds2csv instead of proc export is that the latter does not apply sufficient quoting to values with line breaks in them.

Tom
Super User

@js5 wrote:

SAS is running on a server. It fetches the data from a remote server and exports the .csv into SAS's work directory (I am using filename temp). I do not think that network bandwidth is the issue, as the download speed goes up and down, indicative of the network transfer waiting for something.

The reason I am using %ds2csv instead of proc export is that the latter does not apply sufficient quoting to values with line breaks in them.


A more likely reason for the download speed to go up and down is competition for the network bandwidth from other processes.

 

To force SAS to add quotes around values it does not think need them, you can just use the ~ modifier in the PUT statement. If the remote system does not mind the numeric variables having quotes around them (the SAS INPUT statement does not mind them when the DSD option is used on the INFILE statement, for example), then you could just add quotes around everything.

put (_all_) (~);

Which means your export step could be as simple as:

proc transpose data=&indsn(obs=0) out=names;
  var _all_;
run;
data _null_;
   file &outfile dsd lrecl=1000000;
   length _name_ $32 _label_ $256 ;
   set names;
   _label_=coalescec(_label_,_name_);
   put _label_ ~ @ ;
run;
data _null_;
  file &outfile mod dsd lrecl=1000000;
  set &indsn;
  put (_all_) (~);
run;

Results

"Name","Sex","Age","Height","Weight"
"Alfred","M","14","69","112.5"
"Alice","F","13","56.5","84"
"Barbara","F","13","65.3","98"
"Carol","F","14","62.8","102.5"
"Henry","M","14","63.5","102.5"
"James","M","12","57.3","83"

If you NEED to add the ~ only for the character variables, then use PROC CONTENTS to get the list of variable names, so you also have the TYPE of each variable to determine when to add the ~ modifier.

filename csv temp;
%let outfile=csv;
%let indsn=sashelp.class;
proc contents data=&indsn out=contents noprint;
run;

proc sql noprint;
  select quote(trim(coalesce(label,name)))
       , catx(' ',nliteral(name)
             ,case when type=2 then '~' else ' ' end
             )
  into :headers separated by ','
     , :varlist separated by ' '
  from contents            
  order by varnum
  ;
quit;

data _null_;
  file &outfile dsd lrecl=1000000;
  put %sysfunc(quote(%superq(headers),%str(%')));
run;
data _null_;
  file &outfile mod dsd lrecl=1000000;
  set &indsn;
  put &varlist;
run;

Results

"Name","Sex","Age","Height","Weight"
"Alfred","M",14,69,112.5
"Alice","F",13,56.5,84
"Barbara","F",13,65.3,98
"Carol","F",14,62.8,102.5
"Henry","M",14,63.5,102.5
"James","M",12,57.3,83

 

js5
Pyrite | Level 9
The problem with writing handcrafted export code is that I need to do this for close to 100 datasets. proc export and/or %ds2csv have the advantage of being universally usable without needing to know the dataset's structure.
LinusH
Tourmaline | Level 20
Please share the full log, with
options fullstimer msglevel=i;
If you have the possibility, monitor your server while your export is running (CPU, MEM, I/O).
Data never sleeps
Tom
Super User

What feature of %DS2CSV is it that you need? I doubt that it is slowing things down since I think it also just generates a data step, like PROC EXPORT does.  You should be able to see what code it actually generated by turning on the MPRINT option.

 

The best way to improve performance would be to skip the CSV file and have SAS send the data directly to the ultimate location, if you can.

 

Other things to do to improve the performance:

 

  • Generate a ZIP file (or GZIP file) instead. The other end would need to know how to handle that change, but it should be as simple as pointing %DS2CSV to a FILEREF you defined using the ZIP engine.
  • Split the task into multiple tasks so they can run in parallel (if your system can handle the extra load). For example, you could split by observation number by adding the FIRSTOBS= and OBS= dataset options to the input dataset reference. Again, your downstream process would need to know how to handle multiple files.
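A sketch combining both ideas, with placeholder file and dataset names: one slice of the data is written as gzip-compressed CSV through a ZIP-engine fileref, and a second job could run the next FIRSTOBS=/OBS= range in parallel.

```sas
/* Sketch: write rows 1-50,000,000 to a gzip-compressed CSV.
   A parallel job would use firstobs=50000001 and a different file.
   Path and dataset names are placeholders. */
filename gzout zip "/data/out/part1.csv.gz" gzip;

data _null_;
  file gzout dsd lrecl=1000000;
  set mylib.bigdata(firstobs=1 obs=50000000);  /* first slice only */
  put (_all_) (~);                             /* quote every value */
run;
```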
js5
Pyrite | Level 9
I tried the gzip route as we do not have SAS/CONNECT licensed. While it reduced the size of the generated file to ~500 MB, it also took almost 10 hours. Given that uploading and importing the file into Databricks only take approx. 20 minutes each, this is not a favourable trade.
Kurt_Bremser
Super User

A simple process like writing a dataset to an external text file is ALWAYS I/O bound. Given your further description, the main bottleneck is the download from the remote server (database or base SAS library on a network share, or anything else? Please clarify), followed by the bandwidth of the local storage where WORK resides.

If you fetch from a database, the observed fluctuations may come from network saturation and/or load on the database system itself.

js5
Pyrite | Level 9
I will investigate this further, thanks!
The permanent library from which the export is supposed to happen is a network share. It being the limiting factor would be interesting, as creating the same dataset in the same network location, with CHAR compression enabled, only takes ~15 minutes.
Our SAS runs on EC2, so the work library resides on EBS. Checking with a small subset of data does not show significant performance differences vs. exporting to the network share on which the dataset is stored.
I could also export to FSx and see if that helps.
Additionally, the export runs through a view, because I need to re-format date and datetime variables into a format understood by Spark. Could this be causing the high computational demand?
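A hypothetical sketch of such a view, attaching ISO 8601 formats so the exported text is directly parseable by Spark; the library, dataset, and variable names are placeholders (only RFSTDT and SAMPLE DATE appear in the proc contents above):

```sas
/* Hypothetical view: render date variables in ISO 8601.
   Names are placeholders. */
data sparkdtm.xxx / view=sparkdtm.xxx;
  set sas_lib.xxx;
  format rfstdt 'SAMPLE DATE'n e8601da.;  /* dates -> yyyy-mm-dd */
run;
```

Attaching formats in a view like this should add little CPU beyond the normal formatted write, since the formatting happens once per value as the row is output.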
Discussion stats
  • 12 replies
  • 409 views
  • 4 likes
  • 6 in conversation