Hello,
I wrote a program to generate data. It works well.
Now I have a data set that should contain about 25,000,000 observations.
I have tried the code below to open the data set, and after 20 minutes it is still running.
Has anyone here already worked with huge data sets?
Are there particular options I should adjust, or extra instructions I should add to my code?
Thanks in advance for your help.
libname mywork "\\....\Documents\My SAS Files\work";
Data Temp;
set mywork.d2_..._ent;
run;
You are not just opening the dataset; you are opening it, reading every observation, and then writing every observation back to disk. That is why it is taking so long: 25 million observations is a fair bit. Why are you running a data step just to open and read/write the data? That's a waste of processing and read/write time.
I'm not sure what your goal is here. The code you shared does not simply open your data set (which would be quick); it copies it from the source location to your WORK library. (That's what the SET statement does -- it brings all of the records from the data set that you name into the current data set you're creating.)
That's not needed unless you plan to change/adjust the data for further work in your process.
For a quick view, try:
proc print data=mywork.d2_..._ent (obs=100);
run;
If using SAS Enterprise Guide, you can use the Server List to navigate to the MYWORK library and just add the data set to your project. It should open quickly and show you the first few pages of records. You can scroll through as needed.
Since mywork resides on a remote share, the network will be your bottleneck. Do you have an idea of the observation size (observation count * observation size will give us a clue about the physical file size)?
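If you're not sure, PROC CONTENTS reports both the observation length and the number of observations, so you can estimate the physical size from its output:
proc contents data=mywork.d2_..._ent;
run;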
121 033 080 ko
What is "ko"?
And whatever "ko" is, 121 million of it for a single observation does not sound right to me at all.
121 gigaoctets (gigabytes)
So you are reading 121 GB across a network. If you have a bandwidth of 1 Gbit/s, which translates (at best) to 100 MB/s, you'll need 121,000 MB / 100 MB/s = 1,210 seconds, or roughly 20 minutes. That is for the raw data transfer alone, without any communication overhead. Expect slower rates in the real world.
That's why it is always a VERY BAD IDEA to assign libnames on remote shares (unless you're dealing with small data in the ~100 MB range). If you need distributed storage, set up a Storage Area Network with fibre optics; the volumes will appear as local disks in your file system.
Hello Kurt,
This is the first time I have worked with a SAS table of this size.
Now that I have my synthetic data, I have to validate two programs that I have produced.
The first program uses an audit file to validate the data. The second program converts the data into an xml file.
So far, I have tested with 2 million observations.
Will SAS be able to handle such a big table or would it be more prudent to break it up?
We also have SAS on a UNIX server, but as I do not have access to the server yet, I have not run any tests to see whether we gain efficiency, mainly in terms of the capability to work with a huge table and in terms of time.
It's a simple problem of the right tool for the right job. Given that there is (most probably) a powerful UNIX server already present, you'll have a hard time convincing people to give you a high-end workstation with striped multi-terabyte SSDs, which would immediately alleviate the throughput problem.
SAS itself does not really impose limitations.
Its limits are so ridiculously high these days (on 64-bit systems) in the number of observations, bytes per observation, and physical file size that they don't matter. The maximum number of observations in a 64-bit environment is about 9.2 quintillion (I needed to go to Wikipedia to find out how much that is).
What will limit you mainly are the available disk space and the I/O bandwidth of your storage. Second to that come CPU power and available memory.
So I suggest you run tests on your desktop with smaller data (using only local disks!), and then move your verified code to the server for mass-data processing; a sketch follows below.
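As a sketch of that approach (the local path and the sample size are just illustrations), you could pull a manageable slice of the data to a local disk and develop against it:
libname local "C:\sas_test";                 /* hypothetical local folder on your workstation */
data local.sample;
    set mywork.d2_..._ent (obs=2000000);     /* read only the first 2,000,000 observations */
run;
Once the logic is verified against the sample, point the code at the full table on the server.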
Do bear in mind that you have a binary file - the SAS dataset - and that XML is a verbose file format. Moving the data from a dataset to an XML file will likely increase the overall size quite considerably, and I mean quite considerably - by a large factor. So that probably isn't the best approach to getting the data into that format.
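If the XML output is a hard requirement, one possible sketch uses the XMLV2 libname engine (the output path here is just an illustration):
libname outxml xmlv2 "C:\sas_test\d2_export.xml"; /* hypothetical output file */
data outxml.d2;
    set mywork.d2_..._ent;   /* every observation becomes a set of tagged XML elements */
run;
libname outxml clear;
Since each value gets wrapped in element tags, expect the XML file to be several times larger than the binary dataset.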
Why do you want to copy the whole dataset into WORK? Run the process on the permanent dataset instead of copying it into the temporary WORK location.
Also consider options such as compressing the dataset when storing it, using KEEP= or DROP= to read only the required variables, OBS= to limit the number of records read, and subsetting the dataset, as in the sketch below.
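A minimal sketch combining those options (the variable names id, var1, and var2 are hypothetical):
data want (compress=yes);                       /* store the result compressed */
    set mywork.d2_..._ent (keep=id var1 var2);  /* read only the variables you need */
    where var1 > 0;                             /* subset to the rows of interest */
run;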
When working with really large data, we also highly recommend unchecking Automatically open data when added to project (in EG's Tools->Options->Data->Data General) and Automatically open data or results when generated (in EG's Tools->Options->Results->Results General). Opening really large data can be very expensive, so it is better for the user to request it explicitly (knowing that it could take a while) than to have it open automatically.
Casey
Thanks for your advice. I will test it.
Along the same line of thought, I would like to know if it is possible to obtain a data set on the work server without it being shown in the SAS window (see the French-labelled window "données de sortie" [output data] in the attached file).
Regards,
@alepage wrote:
Thanks for your advice. I will test it.
Along the same line of thought, I would like to know if it is possible to obtain a data set on the work server without it being shown in the SAS window (see the French-labelled window "données de sortie" [output data] in the attached file).
Regards,
Why would you want that? With the right setting, it would not open automatically, but you could still view it if you wanted to. Without that setting, you'd need an extra run of PROC PRINT, and that is more (and unnecessary) work than just clicking on the tab.