- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have a project where someone wants 38 million rows of data to be exported from the SAS server to their local computer. What would be the best way to approach this? Can I just use a data step to export it to a csv file?
You can try several methods.
- Use the Export wizard in EG (it creates the csv data and transfers it to the PC in one action)
- Create the csv with PROC EXPORT or a DATA step in a code node, and use the EG Copy Files task for download
- Create the csv as above, and use an external tool for download (SAS servers will usually have the SSH port available, so you can use something like WinSCP from the desktop)
If you want it automated from the SAS code (batch), you had best use a shared network resource for storing the csv. Installing and running an SFTP server on the desktop is possible, but not recommended.
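For the PROC EXPORT route, a minimal sketch looks like this (the library name `mylib`, dataset name `bigdata`, and output path are placeholders; adapt them to your server):

```sas
/* Sketch only: mylib.bigdata and the output path are placeholders */
proc export data=mylib.bigdata
    outfile="/path/on/server/bigdata.csv"
    dbms=csv
    replace;
run;
```

The resulting CSV can then be downloaded with the Copy Files task or an SFTP client as described above.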
Won't this take a very long time, though? I just learned that the dataset is over 130 GB. Is there any way to export this quickly, or am I pretty much stuck?
130 GB is 130 GB, so it WILL take time. Writing to a compressed CSV gives you the best compression available, so do that first and see how large the resulting ZIP file is.
Also check how the SAS dataset is stored, with or without the COMPRESS option. If it is not compressed, the size of the ZIP will surprise you.
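You can check the dataset's storage with PROC CONTENTS; a sketch (`mylib.bigdata` is a placeholder for your library and dataset):

```sas
/* Inspect the dataset's metadata; mylib.bigdata is a placeholder */
proc contents data=mylib.bigdata;
run;
/* In the output, the "Compressed" attribute shows whether the
   COMPRESS option was used when the dataset was created */
```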
Ok, thank you for the help.
I'm somewhat new to SAS so I might have messed up the original question, but the dataset is stored on a server as a .sas7bdat file. I'm not really sure how to write the code to get the file into the program like you did with the cars dataset. Sorry if this isn't a clear question.
First, you need to assign a LIBNAME to the directory where the dataset file resides (if that is not already done). Then you use the
library.dataset
notation to address your dataset.
In the DATA step, adapt the PUT statement to include all variables of the dataset, and add an LRECL= option to the FILE statement with a value large enough to accommodate all values (use the observation size reported by PROC CONTENTS as a base, and add 3 per column for the comma and any quotes).
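Put together, a sketch of the LIBNAME and DATA step looks like this (the directory path, output path, dataset name, and variable names are all placeholders):

```sas
/* Sketch only: path, library, dataset, and variable names are placeholders */
libname mylib "/path/to/directory/with/sas7bdat/files";

data _null_;
    set mylib.bigdata;
    /* lrecl= should be at least the observation size plus 3 per column */
    file "/path/on/server/bigdata.csv" dlm="," dsd lrecl=32767;
    put var1 var2 var3;  /* list every variable of the dataset here */
run;
```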
Hi,
My suggestion is to export using the XLSX engine on the LIBNAME statement. I've noticed that, for the same data, XLSX files are smaller than CSV files; however, CSV files can be compressed to be much smaller than XLSX, while XLSX files compress but don't get much smaller than the original. Here is some code. It could be improved (as written now, it has to read your big 8M dataset 10 times, which can be fixed), but it's a start. I'm just trying to show that you can write data directly from a DATA step to Excel, with all the DATA step statements available to you.
Good luck…
/*This will create one big Excel file with 8-10 tabs*/
/*in reality, you should break it up into multiple files b/c 8M rows for*/
/*one Excel file might be too big, depending on number of columns*/
/*define libname with XLSX engine, including the Excel file name and .xlsx*/
libname outFile XLSX "/put your Unix path here (case sensitive)/ExcelFileName.xlsx";
/*Define macro*/
%macro WriteOutData;
/* loop thru 10 times*/
/* and put 1M rows on each tab*/
/* careful, anything over 10M rows will not get output (as written now)*/
%do i=1 %to 10;
data outFile.partial_&i;
set myBigFile;
/*output 1M rows per tab*/
if (&i-1)*1000000 < _n_ <= &i*1000000 then output outFile.partial_&i;
run;
%end;
%mend WriteOutData;
/*call macro - note no semicolon is needed*/
%WriteOutData
/*clear the libname*/
libname outFile;
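The repeated full reads mentioned above can be avoided with the firstobs= and obs= dataset options, so each pass only touches its own 1M-row slice; a sketch, reusing the placeholder names `myBigFile` and `outFile` from the code above:

```sas
/* Sketch: each loop iteration reads only its own 1M-row slice,
   instead of filtering the full dataset by _n_ on every pass */
%macro WriteOutData2;
  %do i = 1 %to 10;
    data outFile.partial_&i;
      set myBigFile(firstobs=%eval((&i-1)*1000000 + 1)
                    obs=%eval(&i*1000000));
    run;
  %end;
%mend WriteOutData2;

%WriteOutData2
```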
Since xlsx files are already ZIP-compressed archives of XML data, you won't get much, if any, further compression.
Since CSV files do not have the overhead of the XML tags (which is considerable), they will end up being much smaller than XLSX files, and have the advantage of being in "one piece" (instead of the 38 sheets you would need).
Next, because of the structure of XLSX, every time a sheet is added, the whole file has to be uncompressed, the new XML data created, and the resulting directory tree compressed again. Performance-wise, that's deadly.
Let's do some tests (run in University Edition on a 2012 MacBook Pro):
Create fake data with > 800000 obs:
data work.cars;
set sashelp.cars;
do i = 1 to 2000;
output;
end;
run;
Export to a single sheet in a XLSX:
libname myex xlsx '/folders/myfolders/test.xlsx';
proc copy in=work out=myex;
select cars;
run;
libname myex clear;
Create a zip file in a data step:
filename outzip zip '/folders/myfolders/test.zip';
data _null_;
set work.cars;
file outzip(cars.csv) dlm="," dsd;
put
cylinders drivetrain enginesize horsepower invoice length
make model mpg_city mpg_highway msrp origin type weight wheelbase
;
run;
filename outzip clear;
See the combined log:
73   data work.cars;
74     set sashelp.cars;
75     do i = 1 to 2000;
76       output;
77     end;
78   run;

NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: The data set WORK.CARS has 856000 observations and 16 variables.
NOTE: DATA statement used (Total process time):
      real time           0.87 seconds
      cpu time            0.80 seconds

79
80   libname myex xlsx '/folders/myfolders/test.xlsx';
NOTE: Libref MYEX was successfully assigned as follows:
      Engine:        XLSX
      Physical Name: /folders/myfolders/test.xlsx
81
82   proc copy in=work out=myex;
83     select cars;
84   run;

NOTE: Copying WORK.CARS to MYEX.CARS (memtype=DATA).
NOTE: There were 856000 observations read from the data set WORK.CARS.
NOTE: The data set MYEX.CARS has 856000 observations and 16 variables.
NOTE: The export data set has 856000 observations and 16 variables.
NOTE: PROCEDURE COPY used (Total process time):
      real time           44.76 seconds
      cpu time            44.20 seconds

85
86   libname myex clear;
NOTE: Libref MYEX has been deassigned.
87
88   filename outzip zip '/folders/myfolders/test.zip';
89
90   data _null_;
91     set work.cars;
92     file outzip(cars.csv) dlm="," dsd;
93     put
94       cylinders drivetrain enginesize horsepower invoice length
95       make model mpg_city mpg_highway msrp origin type weight wheelbase
96     ;
97   run;

NOTE: The file library OUTZIP is:
      Directory=/folders/myfolders/test.zip
NOTE: The file OUTZIP(cars.csv) is:
      Filename=/folders/myfolders/test.zip,
      Member Name=cars.csv
NOTE: A total of 856000 records were written to the file library OUTZIP.
      The minimum record length was 68.
      The maximum record length was 111.
NOTE: 856000 records were written to the file OUTZIP(cars.csv).
      The minimum record length was 68.
      The maximum record length was 111.
NOTE: There were 856000 observations read from the data set WORK.CARS.
NOTE: DATA statement used (Total process time):
      real time           2.93 seconds
      cpu time            2.92 seconds

98
99   filename outzip clear;
Note that creating the XLSX takes about 15 times as long as writing the zip file.
The XLSX is 42.3 MB, the zip file a meager 277 KB!
Now let's add another sheet to the XLSX:
libname myex xlsx '/folders/myfolders/test.xlsx';
data myex.cars2;
set work.cars;
run;
libname myex clear;
and the log from that:
73   libname myex xlsx '/folders/myfolders/test.xlsx';
NOTE: Libref MYEX was successfully assigned as follows:
      Engine:        XLSX
      Physical Name: /folders/myfolders/test.xlsx
74
75   data myex.cars2;
76     set work.cars;
77   run;

NOTE: There were 856000 observations read from the data set WORK.CARS.
NOTE: The data set MYEX.cars2 has 856000 observations and 16 variables.
NOTE: The export data set has 856000 observations and 16 variables.
NOTE: DATA statement used (Total process time):
      real time           56.83 seconds
      cpu time            54.71 seconds
78
79   libname myex clear;
NOTE: Libref MYEX has been deassigned.
As expected, the time for further sheets increases.
Bottom line: do not use Excel files for data transfer. Text files plus a standard compression tool beat them hands down.
Thank you! I am going to try this now.
BTW, I also created an uncompressed CSV file; it's about double the size of the XLSX and takes about as much time to write as the ZIP-compressed CSV. Given standard network transfer rates, the extra time to transfer the larger CSV is far outweighed by the time spent creating the XLSX.
Otherwise, a CSV or XML file (the latter most likely too big) are your best options, as they're fairly versatile across many different application systems.
Are you going to provide a data dictionary with descriptions of the type of values (character or numeric), lengths of variables, possibly ranges (or lists of fixed values), descriptions beyond the column headers, and the layout of date or time columns?
These things are very helpful for CSV files, so people know what to expect.
Especially if someone is creating dates that look like 010203.
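A basic data dictionary can be generated from the dataset's own metadata with PROC CONTENTS; a sketch, where `mylib.bigdata` and the output path are placeholders:

```sas
/* Dump variable names, types, lengths, formats, and labels to a
   small CSV that can travel with the data; mylib.bigdata is a placeholder */
proc contents data=mylib.bigdata
    out=work.dict(keep=name type length format label)
    noprint;
run;

proc export data=work.dict
    outfile="/path/on/server/bigdata_dictionary.csv"
    dbms=csv
    replace;
run;
```

Ranges and value lists still need to be documented by hand, but this covers the structural part.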
Since the IPoAC protocol isn't fast enough for me, I usually use this one-liner to copy whole folders from a mapped network drive:
x "IF EXIST ""Z:\remoteFolder\oneFile.sas7bdat"" (robocopy Z:\remoteFolder C:\Temp\&sysuserid.\localfolder /MIR /MT:16 )";
- Cheers -