Help using Base SAS procedures

very large file and slow speed

Super Contributor
Posts: 312

very large file and slow speed

My data is so large that just two columns (ID, datetime) already take 39 GB (>76 million rows). I will be adding about 10 more variables. Everything is extremely slow, even a simple import or DATA step. I estimate the full table will exceed 100 GB.

 

I work on a PC (i7-4790 CPU @ 3.6 GHz, 32 GB memory). Have any of you worked with files this large, and what can I do without buying a better computer?

 

Thanks.

 

SAS Super FREQ
Posts: 3,477

Re: very large file and slow speed

We'll need more information about what analyses you intend to conduct. For simple descriptive statistics (PROC MEANS, PROC FREQ) and least squares regression or ANOVA (PROC REG, PROC GLM), you shouldn't have any problems. Do you know which procedures you want to run?

Super User
Posts: 5,256

Re: very large file and slow speed

Performance for simple table scans is simple math: your I/O throughput will probably set the limit, so measure that first.

It is unrealistic to expect decent response times with such data volumes on a stand-alone PC.

And next, you may wish to join several tables, apply statistical models, etc. That can dramatically increase the demand for processing power. You need to get some iron!

Or, if you have the possibility, reconsider your requirements. Can you accomplish your task by querying small samples instead?

Data never sleeps
Trusted Advisor
Posts: 1,115

Re: very large file and slow speed

Hi @fengyuwuzu,

 

Some of your numbers seem a bit inconsistent to me:


fengyuwuzu wrote:

my data is so large, that only two columns (ID, datetime) is already 39G (>76 million rows).


Please note that 39E9/77E6 > 500. So you must have an observation length (one of the numbers displayed in the PROC CONTENTS output) of more than 500 bytes. The datetime variable should be numeric, hence of length 8 at most. How long must the ID be to fill the remaining 492+ bytes (minus a little overhead)?

 

Theoretically, 4 (four) bytes would suffice for an ID variable capable of holding distinct IDs for 2^(8*4)=4.29 billion observations.

 

Also, if your datetime values are only integers (i.e. no fractions of a second occur) and do not exceed the year 6314 (!), you could even reduce the length of this variable to 6 bytes without losing information.

 

So, theoretically, your (uncompressed) data could have an observation length of 10 and hence a total size of approx. 750 MB, which is less than 2% of 39 GB.
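
 

For illustration, a minimal sketch of such a shrink (HAVE and WANT are placeholder dataset names, and NEWID assumes the ID has already been recoded to a small integer):

    data want;
        /* LENGTH must precede the SET statement to override the stored lengths. */
        /* 6-byte numerics hold integers exactly up to 2**37, which covers       */
        /* integer datetime values (seconds) into the 64th century.              */
        length newid 6 datetime 6;
        set have;
    run;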

 

Obviously, a dataset of this comparably moderate size would even fit on a RAM disk* within your 32 GB RAM, which could further improve processing speed (depending on your current storage medium).

 

* (requires RAM disk software)

Super User
Posts: 6,936

Re: very large file and slow speed

a) Run PROC CONTENTS on the dataset and post the output (a one-line call; see the snippet after this list).

b) Where did you get the data from? Your numbers make me suspect that a (more or less failed) import step created variables that are much too large for their contents.

c) The processing power of your CPU is often secondary when dealing with large tables in SAS. Instead, invest in several fast disks (to spread the I/O load with striping), preferably SSDs.
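
For reference, (a) is just this (library and member name are placeholders):

    proc contents data=yourlib.yourdata;
    run;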

 

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super Contributor
Posts: 312

Re: very large file and slow speed


Below is the PROC CONTENTS output for the two-column data set. I need to add more columns later (I have data from 7 companies, and I used the KEEP= option in the SET statement when combining them).

 

The SAS System

 

The CONTENTS Procedure

 

Data Set Name        ANAWAGER.DUPI_WAGERTIME   Observations           717674921
Member Type          DATA                      Variables              2
Engine               V9                        Indexes                0
Created              03/04/2016 14:55:28       Observation Length     56
Last Modified        03/04/2016 14:55:28       Deleted Observations   0
Protection                                     Compressed             NO
Data Set Type                                  Sorted                 NO
Label
Data Representation  WINDOWS_64
Encoding             utf-8 Unicode (UTF-8)


Engine/Host Dependent Information
Data Set Page Size 65536
Number of Data Set Pages 614975
First Data Page 1
Max Obs per Page 1167
Obs in First Data Page 1145
Number of Data Set Repairs 0
ExtendObsCounter YES
Filename D:\Backup\Analysis_wager\dupi_wagertime.sas7bdat
Release Created 9.0401M2
Host Created X64_7PRO


Alphabetic List of Variables and Attributes

#  Variable       Type  Len  Format       Informat
1  DUPI           Char   42  $42.         $42.
2  wagerdatetime  Num     8  DATETIME16.  ANYDTDTM40.
Super Contributor
Posts: 312

Re: very large file and slow speed

Do you think increasing the memory from 32 GB to 64 GB or 128 GB will help?

 

Super User
Posts: 6,936

Re: very large file and slow speed

Your initially stated number of obs was too low by an order of magnitude. That confused us.

 

So the next question is: what is stored in DUPI? Examine this column and check whether the full length of 42 is actually needed; reduce the variable's length to the maximum length present in the data. If the contents vary (widely) in length, the COMPRESS=YES option will save you a lot of space and I/O.
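
A sketch of both steps (the $20 length is only a placeholder; substitute the MAXLEN that the query returns):

    proc sql;                                 /* find the longest actual DUPI value */
        select max(length(dupi)) as maxlen
            from anawager.dupi_wagertime;
    quit;

    data anawager.dupi_small (compress=yes);  /* rewrite: shorter length + compression */
        length dupi $20;                      /* set to the MAXLEN found above; SAS     */
                                              /* warns about the changed length, and    */
                                              /* longer values would be truncated       */
        set anawager.dupi_wagertime;
    run;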

Tinkering with the length of numeric variables can also help (see Reinhard's post above).

 

Before you throw hardware at a data problem, try to repair the data. ;-)

 

If you encounter datasets of this size in the future, increasing your RAM will help (it allows the system to cache file data in memory), but you will still need more I/O bandwidth, as you will repeatedly read/write big datasets from/to disk.

That means better disks and parallelisation.

Keep in mind that (normal) PCs do not have the throughput on the system bus (PCIe) that dedicated big-data servers (an IBM POWER system or similar) can provide. So, although your i7 may be very capable in terms of CPU specs, your system is very probably not built from the ground up for this kind of data (not number) crunching.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super Contributor
Posts: 312

Re: very large file and slow speed

Do you mind explaining a little how to increase I/O bandwidth? Thanks.
Super User
Posts: 6,936

Re: very large file and slow speed

a) get faster disks. SSDs preferred.

b) parallelization. Determine the throughput of each component, and then try to saturate the one bottleneck that you have.

 

E.g.:

you have a system bus with 500 MB/s maximum.

your SATA adapter delivers 200 MB/s.

-> so put in two adapters.

each disk delivers 100 MB/s.

-> so put two disks on each adapter, and use striping to parallelize.

Now you can read/write 400 MB/s, which should let you scan through your 39 GB dataset in about 100 seconds.

 

My fairly old SAS server has two independent PCI Express buses, each with a 2 Gbit FC adapter and an UltraSCSI LVD adapter. Internal disks (WORK and UTILLOC!) are connected to the SCSI adapters, and the large storage boxes via fiber optic. The storage was/is much more expensive than the computer itself.

 

 

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super Contributor
Posts: 312

Re: very large file and slow speed

Thank you very much!

Trusted Advisor
Posts: 1,115

Re: very large file and slow speed

In addition to KurtBremser's good advice, you could consider replacing the variable DUPI with a different ID variable (to be defined). For example, a character variable of length 5 can store approx. 916 million different values (62^5 = 916,132,832), even if the characters are restricted to the 62-element set {A, ..., Z, a, ..., z, 0, ..., 9}.


Are the DUPI values unique in your dataset, or do they represent objects or subjects each of which contributes many observations (with different datetime values)? In the latter case, the lookup table translating between DUPI and the new ID would be much smaller than the current dataset, and possibly a length of 4 would suffice for the ID.

 

Together with the reduction of the length of WAGERDATETIME to 6 as proposed earlier, you might be able to reduce the observation length from 56 to 10, resulting in a file size of about 6.7 GB.
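
A sketch of how such a surrogate key could be built (HAVE, WANT, LOOKUP, and NEWID are placeholder names; note that sorting 700+ million rows is itself an expensive step):

    proc sort data=have out=sorted;
        by dupi;
    run;

    data want(keep=newid wagerdatetime)
         lookup(keep=dupi newid);
        length newid 6;
        set sorted;
        by dupi;
        if first.dupi then newid + 1;     /* sum statement: retained counter      */
        output want;                      /* every observation, with the short key */
        if first.dupi then output lookup; /* one row per distinct DUPI             */
    run;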

SAS Super FREQ
Posts: 3,477

Re: very large file and slow speed

The advice you have received is generic advice for fast I/O. But you posted this message to the Statistical Analysis forum, so what analyses do you intend to perform? The answer to that question is important because not every algorithm scales well to 700 million observations. You might go out and spend thousands of dollars on a new state-of-the-art system, only to discover that you still can't perform the analysis that you want.

Super User
Posts: 6,936

Re: very large file and slow speed

About IDs:

 

Several years ago, we switched the primary keys in our DB2 database from timestamps (with microsecond precision) to UUIDs.

Technically, a UUID is a hexadecimal number with 32 digits, which can be stored in a 16-byte (binary) string.

Since UUIDs are the universal method of creating unique identifiers, I'd consider replacing your current ID with them.
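
In SAS, a UUID can be created with the UUIDGEN() function; a small sketch of generating one and packing its 32 hex digits into 16 bytes:

    data _null_;
        length uuid $36 raw $16;
        uuid = uuidgen();                            /* 36-character display form     */
        raw  = input(compress(uuid, '-'), $hex32.);  /* 32 hex digits -> 16 raw bytes */
        put uuid= / raw= $hex32.;
    run;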

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super Contributor
Posts: 312

Re: very large file and slow speed

Just want to post an update on what I did:

Our school has a high-performance computing facility (a Linux platform with SAS 9.4 installed) that we can access. We tested running SAS there, and it is much faster than my desktop PC. I am running jobs in batch mode, so for now I can do the large calculations on the HPC system.

 

Thank you very much for all your time and suggestions on this issue.
