C_Myers221
Fluorite | Level 6

I have a project where someone wants 38 million rows of data to be exported from the SAS server to their local computer. What would be the best way to approach this? Can I just use a data step to export it to a csv file?

13 REPLIES
Kurt_Bremser
Super User

You can try several methods.

  • Use the Export wizard in EG (it creates the csv data and transfers it to the PC in one action)
  • Create the csv with PROC EXPORT or a DATA step in a code node, and use the EG Copy Files task for download (see the sketch below)
  • Create the csv as above, and use an external tool for download (SAS servers will usually have the SSH port available, so you can use something like WinSCP from the desktop)

If you want it automated from SAS code (batch), it is best to use a shared network resource for storing the csv. Installing and running an SFTP server on the desktop is possible, but not recommended.
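For the PROC EXPORT route from the list above, a minimal sketch might look like this; mylib.bigdata and the output path are placeholders, not names from your environment:

/* Write the dataset to a server-side CSV; replace the libref, dataset */
/* name, and path with your own.                                       */
proc export
   data=mylib.bigdata
   outfile="/path/on/server/bigdata.csv"
   dbms=csv
   replace;
run;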

C_Myers221
Fluorite | Level 6

Won't this take a very long time, though? I just learned that the dataset is over 130 GB. Is there any way to export this quickly, or am I pretty much stuck?

Kurt_Bremser
Super User

130G is 130G, so it WILL take time. Writing to a ZIP-compressed CSV gives you the best compression available, so do that first and see how large the resulting ZIP file is.

Also check how the SAS dataset is stored, with or without the COMPRESS option. If it is not compressed, the size of the ZIP will likely surprise you (in a good way).
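A quick way to check is PROC CONTENTS, which reports the observation length and whether the dataset is compressed; mylib.bigdata below is a placeholder for your library and dataset:

/* The output lists the observation length and the "Compressed" */
/* attribute (NO, CHAR, or BINARY).                              */
proc contents data=mylib.bigdata;
run;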

C_Myers221
Fluorite | Level 6

Ok, thank you for the help. 

 

I'm somewhat new to SAS so I might have messed up the original question, but the dataset is stored on a server as a .sas7bdat file. I'm not really sure how to write the code to get the file into the program like you did with the cars dataset. Sorry if this isn't a clear question.

Kurt_Bremser
Super User

First, you need to assign a LIBNAME to the directory where the dataset file resides (if that is not already done). Then you use the

library.dataset

notation to address your dataset.

In the DATA step, adapt the PUT to include all variables of the dataset, and add an LRECL= option to the FILE statement with a value large enough to accommodate all values (use the observation size reported by PROC CONTENTS as a base, and add 3 per column for the commas and the eventual quotes).
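Put together, a sketch of that could look like the following, mirroring the ZIP/CSV DATA step shown further down in this thread. The directory, libref, dataset name, variable list, and the LRECL value are all placeholders to adapt to your own data:

/* Point a library at the directory that contains the .sas7bdat file. */
libname mylib "/path/to/directory/with/the/dataset";

/* Write a CSV member straight into a ZIP archive on the server.      */
filename outzip zip "/path/on/server/bigdata.zip";

data _null_;
   set mylib.bigdata;                               /* library.dataset notation    */
   file outzip(bigdata.csv) dlm="," dsd lrecl=2000; /* size LRECL as described above */
   put var1 var2 var3;                              /* list ALL variables here       */
run;

filename outzip clear;
libname mylib clear;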

PatKellyNH
Calcite | Level 5

 

Hi,

My suggestion is to export using the XLSX engine on the LIBNAME statement. I've noticed that, for the same data, xlsx files are smaller than csv files; however, csv files can be compressed to be much smaller than xlsx files, while xlsx files don't shrink much further when compressed. Here is some code. It could be improved, but it's a start: as written, it has to read your big 8M dataset 10 times, which can be fixed (see the sketch after the code). I'm just trying to show that you can write data directly from a DATA step to Excel, with all the DATA step statements available to you.

Good luck…

 

/*This will create one big excel file with 8-10 tabs;*/
/*in reality, you should break it up into multiple files b/c 8M rows for*/
/*one excel file might be too big, depending on number of columns*/

/*define libname with xlsx engine, including the excel file name and .xlsx*/
libname outFile XLSX "/put your Unix path here - It is Case Sensitive/ExcelFileName.xlsx";

/*Define macro*/
%macro WriteOutData;
/*   loop thru 10 times*/
/*   and put 1M rows on each tab*/
/*   careful, anything over 10M rows will not get output (as written now)*/
     %do i=1 %to 10;
           data outFile.partial_&i;
                set myBigFile;
                /*output 1M rows per tab*/
                if (&i-1)*1000000 < _n_ <= &i*1000000 then output outFile.partial_&i;
           run;
     %end;
%mend WriteOutData;

/*call macro - note no semicolon is needed*/
%WriteOutData

/*clear the libname*/
libname outFile clear;
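As mentioned above, re-reading the whole table on every pass can be avoided. One possible variant (a sketch, reusing the outFile libref and myBigFile name from the code above) uses FIRSTOBS=/OBS= dataset options so each iteration only processes its own slice; how much actual I/O that saves depends on how the dataset is stored (uncompressed datasets can be positioned directly, compressed ones still have to be read through):

/* Same idea as WriteOutData, but each pass only touches its own      */
/* 1M-row slice. Adjust CHUNKS and CHUNKSIZE so they cover the actual */
/* row count, just like the 10M-row cap in the original macro.        */
%macro WriteOutDataFast(chunks=10, chunksize=1000000);
   %local i;
   %do i=1 %to &chunks;
      data outFile.partial_&i;
         set myBigFile(firstobs=%eval((&i-1)*&chunksize+1)
                       obs=%eval(&i*&chunksize));
      run;
   %end;
%mend WriteOutDataFast;

%WriteOutDataFast()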

Kurt_Bremser
Super User

Since xlsx files are already ZIP-compressed archives of XML data, you won't get much, if any, further compression.

 

Since CSV files do not have the overhead of the XML tags (which is considerable), they will end up being much smaller than XLSX files, and have the advantage of being in "one piece" (instead of the 38 sheets you would need).

Next, because of the structure of XLSX, every time a new sheet is added the whole file has to be uncompressed, the new XML data created, and the resulting directory tree compressed again. Performance-wise, that's deadly.

 

Let's do some tests (run in University Edition on a 2012 MacBook Pro):

Create fake data with > 800000 obs:

 

data work.cars;
set sashelp.cars;
do i = 1 to 2000;
  output;
end;
run;

Export to a single sheet in an XLSX:

libname myex xlsx '/folders/myfolders/test.xlsx';

proc copy in=work out=myex;
select cars;
run;

libname myex clear;

Create a zip file in a data step:

 

filename outzip zip '/folders/myfolders/test.zip';

data _null_;
set work.cars;
file outzip(cars.csv) dlm="," dsd;
put
  cylinders drivetrain enginesize horsepower invoice length
  make model mpg_city mpg_highway msrp origin type weight wheelbase
;
run;

filename outzip clear;

See the combined log:

 73         data work.cars;
 74         set sashelp.cars;
 75         do i = 1 to 2000;
 76           output;
 77         end;
 78         run;
 
 NOTE: There were 428 observations read from the data set SASHELP.CARS.
 NOTE: The data set WORK.CARS has 856000 observations and 16 variables.
 NOTE: DATA statement used (Total process time):
       real time           0.87 seconds
       cpu time            0.80 seconds
       
 
 79         
 80         libname myex xlsx '/folders/myfolders/test.xlsx';
 NOTE: Libref MYEX was successfully assigned as follows: 
       Engine:        XLSX 
       Physical Name: /folders/myfolders/test.xlsx
 81         
 82         proc copy in=work out=myex;
 83         select cars;
 84         run;
 
 NOTE: Copying WORK.CARS to MYEX.CARS (memtype=DATA).
 NOTE: There were 856000 observations read from the data set WORK.CARS.
 NOTE: The data set MYEX.CARS has 856000 observations and 16 variables.
 NOTE: The export data set has 856000 observations and 16 variables.
 NOTE: PROCEDURE COPY used (Total process time):
       real time           44.76 seconds
       cpu time            44.20 seconds
       
 
 85         
 86         libname myex clear;
 NOTE: Libref MYEX has been deassigned.
 87         
 88         filename outzip zip '/folders/myfolders/test.zip';
 89         
 90         data _null_;
 91         set work.cars;
 92         file outzip(cars.csv) dlm="," dsd;
 93         put
 94           cylinders drivetrain enginesize horsepower invoice length
 95           make model mpg_city mpg_highway msrp origin type weight wheelbase
 96         ;
 97         run;
 
 NOTE: The file library OUTZIP is:
       Directory=/folders/myfolders/test.zip
 
 NOTE: The file OUTZIP(cars.csv) is:
       Filename=/folders/myfolders/test.zip,
       Member Name=cars.csv
 
 NOTE: A total of 856000 records were written to the file library OUTZIP.
       The minimum record length was 68.
       The maximum record length was 111.
 NOTE: 856000 records were written to the file OUTZIP(cars.csv).
       The minimum record length was 68.
       The maximum record length was 111.
 NOTE: There were 856000 observations read from the data set WORK.CARS.
 NOTE: DATA statement used (Total process time):
       real time           2.93 seconds
       cpu time            2.92 seconds
       
 
 98         
 99         filename outzip clear;

Note that creating the XLSX takes about 15 times as long as writing the zip file.

The XLSX is 42.3 MB, the zip file a meager 277 KB!

Now let's add another sheet to the XLSX:

libname myex xlsx '/folders/myfolders/test.xlsx';

data myex.cars2;
set work.cars;
run;

libname myex clear;

and the log from that:

 73         libname myex xlsx '/folders/myfolders/test.xlsx';
 NOTE: Libref MYEX was successfully assigned as follows: 
       Engine:        XLSX 
       Physical Name: /folders/myfolders/test.xlsx
 74         
 75         data myex.cars2;
 76         set work.cars;
 77         run;
 
 NOTE: There were 856000 observations read from the data set WORK.CARS.
 NOTE: The data set MYEX.cars2 has 856000 observations and 16 variables.
 NOTE: The export data set has 856000 observations and 16 variables.
 NOTE: DATA statement used (Total process time):
       real time           56.83 seconds
       cpu time            54.71 seconds
       
 
 78         
 79         libname myex clear;
 NOTE: Libref MYEX has been deassigned.

As expected, the time for further sheets increases.

 

Bottom line: do not use Excel files for data transfer. A text file plus a standard compression tool beats them hands down.

C_Myers221
Fluorite | Level 6

Thank you! I am going to try this now.

Kurt_Bremser
Super User

BTW, I also created an uncompressed CSV file; it's about double the size of the XLSX and takes about as much time to write as the ZIP-compressed CSV. Given standard network transfer rates, the additional time to transfer the CSV is outweighed by the much longer time spent creating the XLSX.
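For reference, the uncompressed CSV can be produced with the same DATA step, just writing to a plain FILENAME instead of the ZIP access method (a sketch reusing the work.cars test data and variable list from above):

filename outcsv '/folders/myfolders/test.csv';

data _null_;
set work.cars;
file outcsv dlm="," dsd;
put
  cylinders drivetrain enginesize horsepower invoice length
  make model mpg_city mpg_highway msrp origin type weight wheelbase
;
run;

filename outcsv clear;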

Reeza
Super User

What are they planning to do with the file? Do they need to use it in a different software application? If they just want to work locally in SAS (also, this seems like a security risk), save it as a SAS7BDAT file.

Otherwise, a CSV or XML (too big, most likely) are your best options, as they're fairly versatile across many different application systems.
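If the target really is SAS on the local machine, copying the dataset in its native format to a shared location can be as simple as this sketch; the librefs, dataset name, and path are placeholders:

/* Copy the .sas7bdat as-is to a shared/network folder. */
libname share "/path/to/shared/folder";

proc copy in=mylib out=share;
select bigdata;
run;

libname share clear;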
ballardw
Super User

Are you going to provide a data dictionary with descriptions of the type of values (character or numeric), the lengths of variables, possibly ranges (or lists of fixed values), descriptions beyond the column headers, and the layout of date or time columns?

 

These things are very helpful for CSV file formats so people know what to expect.

Especially if someone is creating dates that look like 010203.
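One way to produce such a data dictionary is to dump the dataset metadata with PROC CONTENTS and hand it over alongside the CSV; a sketch, with mylib.bigdata as a placeholder:

/* Variable names, types, lengths, formats, and labels as a dataset,  */
/* which can then be exported as a small companion CSV.               */
proc contents data=mylib.bigdata
              out=work.dictionary(keep=name type length format label)
              noprint;
run;

proc export data=work.dictionary
            outfile="/path/on/server/bigdata_dictionary.csv"
            dbms=csv replace;
run;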

RichardDeVen
Barite | Level 11
What application on their local computer is going to consume the export? Does the app read SAS data sets natively? If so, copy the 130GB data set file to a device (thumb drive/passport'ish drive) connected to the server's USB port and hand that device off to them to contend with.
Oligolas
Barite | Level 11

Since the IPoAC protocol isn't fast enough for me, I usually use this one-liner to copy whole folders from a mapped network drive:

x "IF EXIST ""Z:\remoteFolder\oneFile.sas7bdat"" (robocopy Z:\remoteFolder C:\Temp\&sysuserid.\localfolder /MIR /MT:16 )";

________________________

- Cheers -
