fengyuwuzu
Pyrite | Level 9

My data is so large that just two columns (ID, datetime) already take 39 GB (>76 million rows). I will be working with about 10 more variables, and it is extremely slow, even for an import or a simple data step. I estimate it will be >100 GB with all columns.

 

I work on a PC (i7-4790 CPU, 3.6 GHz, 32 GB memory). I wonder if any of you have worked with such large files, and what I can do without buying a better computer?

 

Thanks.

 

14 REPLIES
Rick_SAS
SAS Super FREQ

We'll need more information about what analyses you intend to conduct. For simple descriptive statistics (PROC MEANS, PROC FREQ) and least squares regression or ANOVA (PROC REG, PROC GLM), you shouldn't have any problems. Do you know which procedures you want to run?
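If it helps, here is a minimal sketch of that kind of first descriptive pass (the dataset name is taken from the PROC CONTENTS output posted further down in this thread; adjust the libref and variables to your own data):

proc means data=anawager.dupi_wagertime n nmiss min max;
   var wagerdatetime;                 /* quick missing-value and range check on the datetime */
run;

proc freq data=anawager.dupi_wagertime nlevels;
   /* NLEVELS reports the number of distinct DUPI values; this can be       */
   /* memory-hungry if there are very many distinct IDs.                    */
   tables dupi / noprint;
run;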

LinusH
Tourmaline | Level 20

Performance for simple table scans is simple math: just examine your I/O throughput; that will probably set your limit.

Sounds crazy to expect decent response times with such data volumes on a stand-alone PC.

And next, you may wish to join several tables, or apply statistical models etc. That can exponentially increase the demand for processing speed. You need to get some iron!

Or, if you have the possibility, reconsider your requirements. Can you in some way accomplish your task by querying small samples?
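For instance, a rough sketch of developing against a subset first (the dataset name is taken from the PROC CONTENTS output further down the thread; obs= simply reads the first n rows, while ranuni() gives an approximate random sample):

/* Quick and dirty: develop and test code against the first million rows only. */
data work.dev_sample;
   set anawager.dupi_wagertime (obs=1000000);
run;

/* Or an approximate 1% random sample. */
data work.random_sample;
   set anawager.dupi_wagertime;
   if ranuni(12345) < 0.01;
run;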

Data never sleeps
FreelanceReinh
Jade | Level 19

Hi @fengyuwuzu,

 

Some of your numbers seem a bit inconsistent to me:


@fengyuwuzu wrote:

my data is so large, that only two columns (ID, datetime) is already 39G (>76 million rows).


Please note that 39E9/77E6>500. So, you must have an observation length (this is one of the numbers displayed in PROC CONTENTS output) of more than 500 bytes. The datetime variable should be numeric, hence of length 8 as a maximum. How long is the ID to fill more than 492 (minus a little overhead) bytes?

 

Theoretically, 4 (four) bytes would suffice for an ID variable capable of holding distinct IDs for 2^(8*4)=4.29 billion observations.

 

Also, if your datetime values are only integers (i.e. no fractions of a second occur) and do not exceed the year 6314 (!), you could even reduce the length of this variable to 6 bytes without losing information.

 

So, theoretically, your (uncompressed) data could have an observation length of 10 and hence a total size of approx. 750 MB, which is less than 2% of 39 GB.
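To make that arithmetic concrete, a minimal sketch, assuming DUPI has already been replaced by a short numeric surrogate key (newid and work.have are placeholders; building such a key is discussed further down the thread). The 4-byte figure above is the theoretical minimum; an ordinary SAS numeric shorter than 8 bytes only stores integers exactly up to a limit (about 2.1 million for length 4, about 137 billion for length 6), so the sketch uses 6 bytes:

data work.slim;
   /* newid is a hypothetical numeric surrogate for the long ID;            */
   /* 6-byte numerics store integers exactly up to 2**37.                   */
   length newid wagerdatetime 6;
   set work.have;                     /* placeholder input dataset          */
run;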

 

Obviously, a dataset of this comparably moderate size would even fit on a RAM disk* within your 32 GB RAM, which could further improve processing speed (depending on your current storage medium).

 

* (requires RAM disk software)
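If you do set one up, pointing a libref at the RAM disk is a one-liner; the drive letter and paths below are just placeholders:

/* Assumes RAM disk software has already created an R: drive.               */
libname ramlib "R:\sasdata";

/* The WORK library can also be redirected there at SAS invocation,         */
/* e.g. with the -work "R:\saswork" system option or the config file.       */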

Kurt_Bremser
Super User

a) Do a PROC CONTENTS on the dataset and post the output (a minimal call is sketched after this list).

b) Where did you get the data from? Your numbers make me suspect that a (more or less failed) import step created variables that are much too large for their contents.

c) The processing power of your CPU is often secondary when dealing with large tables in SAS. Instead, invest in several fast disks (to spread the I/O load with striping), preferably SSDs.
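For a), the call is simply (libref and member taken from the output posted below):

proc contents data=anawager.dupi_wagertime;
run;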

 

fengyuwuzu
Pyrite | Level 9

Below is the PROC CONTENTS output for the two-column data set. I need to add more columns later (I have data from 7 companies, and I used the keep= option on the SET statement when combining them).

 

The SAS System

 

The CONTENTS Procedure

 

Data Set Name        ANAWAGER.DUPI_WAGERTIME   Observations          717674921
Member Type          DATA                      Variables             2
Engine               V9                        Indexes               0
Created              03/04/2016 14:55:28       Observation Length    56
Last Modified        03/04/2016 14:55:28       Deleted Observations  0
Protection                                     Compressed            NO
Data Set Type                                  Sorted                NO
Label
Data Representation  WINDOWS_64
Encoding             utf-8 Unicode (UTF-8)


Engine/Host Dependent Information
Data Set Page Size          65536
Number of Data Set Pages    614975
First Data Page             1
Max Obs per Page            1167
Obs in First Data Page      1145
Number of Data Set Repairs  0
ExtendObsCounter            YES
Filename                    D:\Backup\Analysis_wager\dupi_wagertime.sas7bdat
Release Created             9.0401M2
Host Created                X64_7PRO


Alphabetic List of Variables and Attributes
#  Variable        Type  Len  Format       Informat
1  DUPI            Char   42  $42.         $42.
2  wagerdatetime   Num     8  DATETIME16.  ANYDTDTM40.
fengyuwuzu
Pyrite | Level 9

Do you think increasing the memory from 32 GB to 64 GB or 128 GB would help?

 

Kurt_Bremser
Super User

Your initially stated number of obs was too low by an order of magnitude. That confused us.

 

So the next question is: what is stored in DUPI? Examine this column and check whether the full length is really needed. Reduce the length of the variable to the maximum length of the data actually present. If the contents vary (widely) in length, using the compress=yes option will save you a lot of space and I/O.

Tinkering with the length of numeric variables can also help (see Reinhard's post).
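A sketch of both ideas combined; the length of 20 below is purely a placeholder, so use whatever step 1 actually reports:

/* Step 1: find the longest DUPI value actually present. */
proc sql;
   select max(length(dupi)) as max_dupi_len
   from anawager.dupi_wagertime;
quit;

/* Step 2: rebuild the table with tightened lengths and compression.        */
/* SAS will warn about multiple lengths for DUPI; values longer than the    */
/* new length would be truncated, so verify step 1 first.                   */
data anawager.dupi_wagertime_small (compress=yes);
   length dupi $20 wagerdatetime 6;
   set anawager.dupi_wagertime;
run;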

 

Before you throw hardware at a data problem, try to repair the data 😉

 

If you encounter datasets of this size in the future, increasing your RAM will be helpful (allowing the system to cache file data in memory), but you will still need to increase your I/O bandwidth, as you will repeatedly read/write big datasets from/to disk.

That means better disks and parallelisation.

Keep in mind that (normal) PCs will not have the throughput on the system bus (PCIe) that dedicated big-data servers (like an IBM POWER or similar) can provide. So, although your i7 may be very capable in terms of CPU specs, your system is very probably not built from the ground up for this kind of data (not number) crunching.

fengyuwuzu
Pyrite | Level 9
Do you mind explaining a little bit how to increase I/O bandwidth? Thanks.
Kurt_Bremser
Super User

a) get faster disks. SSDs preferred.

b) parallelization. Determine the throughput of each component, and then try to saturate the one bottleneck that you have.

 

e.g.:

Your system bus has a maximum of 500 MB/sec.

Your SATA adapter has 200 MB/sec
-> so put in two adapters.

Your disk has 100 MB/sec
-> so put two disks on each adapter, and use striping to parallelize.

Now you can read/write 400 MB/sec, which should let you scan through your 39 GB dataset in about 100 seconds.

 

My fairly old SAS server has two independent PCI Express buses, each with a 2 Gbit FC adapter and an UltraSCSI LVD adapter. Internal disks (WORK and UTILLOC!) are connected to the SCSI adapters, and the large storage boxes are attached via fiber optic. Storage was/is much more expensive than the computer itself.

 

 

fengyuwuzu
Pyrite | Level 9

Thank you very much!

FreelanceReinh
Jade | Level 19

In addition to KurtBremser's good advice, you could consider replacing variable DUPI with a different ID variable (to be defined). For example, a character variable of length 5 can store approx. 916 million different values, even if the characters are restricted to the 62-element set {A, ..., Z, a, ..., z, 0, ..., 9}.


Are the DUPI values unique in your dataset or do they represent certain objects or subjects each of which contributes many observations (with different datetime values)? In the latter case, the lookup table for the translation between DUPI and the new ID could be much smaller than the current dataset and possibly length 4 would be sufficient for the ID.

 

Together with the reduction of the length of WAGERDATETIME to 6 as proposed earlier, you might be able to reduce the observation length from 56 to 10, resulting in a file size of about 6.7 GB.
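A hedged sketch of that translation using a sort and merge (a hash object would avoid sorting the big table if the distinct IDs fit in memory); everything except DUPI and wagerdatetime is a made-up name:

/* Build a lookup table of the distinct DUPI values with a numeric key. */
proc sort data=anawager.dupi_wagertime (keep=dupi)
          out=work.dupi_lookup nodupkey;
   by dupi;
run;

data work.dupi_lookup;
   length subj_id 6;                  /* exact integers up to ~137 billion */
   set work.dupi_lookup;
   subj_id = _n_;
run;

/* Replace DUPI with the short key in the main table. */
proc sort data=anawager.dupi_wagertime out=work.sorted;
   by dupi;
run;

data anawager.wagertime_small (drop=dupi);
   merge work.sorted work.dupi_lookup;
   by dupi;
run;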

Rick_SAS
SAS Super FREQ

The advice you have received is generic advice for fast I/O. But you posted this message to the Statistical Analysis forum, so what analyses do you intend to perform? The answer to that question is important because not every algorithm scales well to 700 million observations. You might go out and spend thousands of dollars on a new state-of-the-art system, only to discover that you still can't perform the analysis that you want.

Kurt_Bremser
Super User

About IDs:

 

Several years ago, we switched the primary keys in our DB/2 from timestamps (that had a precision of microseconds) to UUIDs.

Technically, a UUID is a hexadecimal number with 32 digits, which can be stored as a 16-byte string.

Since UUIDs are the universal method of creating unique identifiers, I'd consider replacing your current ID with them.
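SAS can generate these directly with the UUIDGEN function; a quick hedged sketch (whether a 36- or 32-character key actually saves space compared with a short numeric surrogate is a separate question):

data work.uuid_demo;
   length uuid_txt $36 uuid_hex $32;
   do i = 1 to 3;
      uuid_txt = uuidgen();                  /* 36-character form: 8-4-4-4-12 hex digits with hyphens */
      uuid_hex = compress(uuid_txt, '-');    /* 32 hex digits without hyphens                         */
      output;
   end;
   drop i;
run;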

fengyuwuzu
Pyrite | Level 9

Just want to update what I did later:

Our school has a high-performance computing facility (a Linux platform with SAS 9.4 installed) that we can access. We tested running SAS on it, and it is much faster than on my desktop PC. I am running jobs in batch mode, so for now I can do the large calculations on the HPC system.

 

Thank you very much for all your time and suggestions on this issue.

