Hello,
I wrote a program to generate data. It works well.
Now I have a data set that should contain about 25,000,000 observations.
I have tried the code below to open the data set, and after 20 minutes it is still running.
Has anyone here already worked with huge data sets?
Are there particular options I should adjust, or extra instructions I should add to my code?
Thanks in advance for your help.
libname mywork "\\....\Documents\My SAS Files\work";
Data Temp;
set mywork.d2_..._ent;
run;
You are not just opening the dataset; you are opening it, reading every observation, and then writing every observation back to disk. That is why it is taking so long: 25 million observations is a fair bit. Why are you running a data step just to open and read/write the data? That's a waste of processing and read/write time.
I'm not sure what your goal is here. The code you shared does not simply open your data set (which would be quick); it copies it from the source location to your WORK library. (That's what the SET statement does -- it brings all of the records from the data set that you name into the current data set you're creating.)
That's not needed unless you plan to change/adjust the data for further work in your process.
For a quick view, try:
proc print data=mywork.d2_..._ent (obs=100);
run;
If using SAS Enterprise Guide, you can use the Server List to navigate to the MYWORK library and just add the data set to your project. It should open quickly and show you the first few pages of records. You can scroll through as needed.
Since mywork resides on a remote share, the network will be your bottleneck. Do you have an idea of the observation size (observation count * observation size will give us a clue about the physical file size)?
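If you're not sure, PROC CONTENTS reports both the observation length and the number of observations, so you can estimate the physical size from its output:
proc contents data=mywork.d2_..._ent;
run;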
121 033 080 ko
What is "ko"?
And whatever "ko" is, 121 million of it for a single observation does not sound right to me at all.
121 gigaoctets (gigabytes)
So you are reading 121 GB across a network. If you have a bandwidth of 1 Gbit/s, which translates (at best) to 100 MB/s, you'll need 121,000 MB / 100 MB/s = 1,210 seconds, or roughly 20 minutes. That is for the raw data transfer alone, without any communication overhead. Expect slower rates in the real world.
That's why it is always a VERY BAD IDEA to assign libnames on remote shares (unless you're dealing with small data in the ~100 MB range). If you need distributed storage, set up a Storage Area Network with fibre optics; the volumes will appear as local disks in your file system.
Hello Kurt,
This is the first time I have worked with a SAS table of this size.
Now that I have my synthetic data, I have to validate two programs that I have produced.
The first program uses an audit file to validate the data. The second program converts the data into an xml file.
So far, I have tested with 2 million observations.
Will SAS be able to handle such a big table or would it be more prudent to break it up?
We also have SAS on a UNIX server, but as I do not have access to the server yet, I have not run any tests to see whether we gain efficiency, mainly in terms of the capability to work with a huge table and in terms of time.
It's a simple problem of the right tool for the right job. Given that there is (most probably) a powerful UNIX server already present, you'll have a hard time convincing people to give you a high-end workstation with striped multi-terabyte SSDs, which would immediately alleviate the throughput problem.
SAS itself does not really impose limitations.
Its limits are so ridiculously high these days (on 64-bit systems) in the number of observations, bytes per observation, and physical file size that they don't matter. The maximum number of observations in a 64-bit environment is about 9.2 quintillion (I needed to go to Wikipedia to find out how much that is).
What will limit you mainly are the available disk space and the I/O bandwidth of your storage. Second to that come CPU power and available memory.
So I suggest you run tests on your desktop with smaller data (using only local disks!), and then move your verified code to the server for mass-data processing; a sketch follows below.
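As a sketch of that approach (the local path and the sample size are just illustrations), you could pull a manageable slice of the data to a local disk and develop against it:
libname local "C:\sas_test";                 /* hypothetical local folder on your workstation */
data local.sample;
    set mywork.d2_..._ent (obs=2000000);     /* read only the first 2,000,000 observations */
run;
Once the logic is verified against the sample, point the code at the full table on the server.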
Do bear in mind that you have a binary file - the SAS dataset - and that XML is a verbose file format. Moving the data from a dataset to an XML file will likely increase the overall size quite considerably, and I mean quite considerably - by a large factor. So that probably isn't the best approach to getting the data into that format.
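If the XML output is a hard requirement, one possible sketch uses the XMLV2 libname engine (the output path here is just an illustration):
libname outxml xmlv2 "C:\sas_test\d2_export.xml"; /* hypothetical output file */
data outxml.d2;
    set mywork.d2_..._ent;   /* every observation becomes a set of tagged XML elements */
run;
libname outxml clear;
Since each value gets wrapped in element tags, expect the XML file to be several times larger than the binary dataset.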
Why do you want to copy the whole dataset into WORK? Run the process on the permanent dataset instead of copying it into the temporary WORK location.
Also consider options such as compressing the dataset when storing it, using KEEP= or DROP= to read only the required variables, OBS= to limit the number of records read, and subsetting the dataset, as in the sketch below.
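A minimal sketch combining those options (the variable names id, var1, and var2 are hypothetical):
data want (compress=yes);                       /* store the result compressed */
    set mywork.d2_..._ent (keep=id var1 var2);  /* read only the variables you need */
    where var1 > 0;                             /* subset to the rows of interest */
run;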
When working with really large data, we also highly recommend unchecking Automatically open data when added to project (in EG's Tools->Options->Data->Data General) and Automatically open data or results when generated (in EG's Tools->Options->Results->Results General). Opening really large data can be very expensive, so it is better for the user to request it explicitly (knowing that it could take a while) than to have it open automatically.
Casey
Thanks for your advice. I will test it.
Along the same line of thought, I would like to know if it is possible to obtain a data set on the work server without it being shown in the SAS window (see the French-labelled window "données de sortie" [output data] in the attached file).
Regards,
@alepage wrote:
Thanks for your advice. I will test it.
Along the same line of thought, I would like to know if it is possible to obtain a data set on the work server without it being shown in the SAS window (see the French-labelled window "données de sortie" [output data] in the attached file).
Regards,
Why would you want that? With the right setting, it would not open automatically, but you could still view it if you wanted to. Without that setting, you'd need an extra run of PROC PRINT, and that is more (and unnecessary) work than just clicking on the tab.