fengyuwuzu
Pyrite | Level 9

My data is so large that just two columns (ID, datetime) already take 39 GB (>76 million rows). I will be working with about 10 more variables, and it is extremely slow, even for an import or a simple data step. I estimate it will be >100 GB with all columns.

 

I work on a PC (i7-4790 CPU, 3.6 GHz, 32 GB memory). I wonder if any of you have worked with such large files, and what I can do without buying a better computer?

 

Thanks.

 

14 REPLIES
Rick_SAS
SAS Super FREQ

We'll need more information about what analyses you intend to conduct. For simple descriptive statistics (PROC MEANS, PROC FREQ) and least squares regression or ANOVA (PROC REG, PROC GLM), you shouldn't have any problems. Do you know which procedures you want to run?
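If it helps, here is a minimal sketch of that kind of first descriptive pass (the dataset name is taken from the PROC CONTENTS output posted further down in this thread; adjust the libref and variables to your own data):

proc means data=anawager.dupi_wagertime n nmiss min max;
   var wagerdatetime;                 /* quick missing-value and range check on the datetime */
run;

proc freq data=anawager.dupi_wagertime nlevels;
   /* NLEVELS reports the number of distinct DUPI values; this can be       */
   /* memory-hungry if there are very many distinct IDs.                    */
   tables dupi / noprint;
run;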

LinusH
Tourmaline | Level 20

Performance for simple table scans is simple math: just examine your I/O throughput; that will probably set your limit.

Sounds crazy to expect decent response times with such data volumes on a stand-alone PC.

And next, you may wish to join several tables, or apply statistical models etc. That can exponentially increase the demand for processing speed. You need to get some iron!

Or, if you have the possibility, reconsider your requirements. Can you in some way accomplish your task by querying small samples?
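For instance, a rough sketch of developing against a subset first (the dataset name is taken from the PROC CONTENTS output further down the thread; obs= simply reads the first n rows, while ranuni() gives an approximate random sample):

/* Quick and dirty: develop and test code against the first million rows only. */
data work.dev_sample;
   set anawager.dupi_wagertime (obs=1000000);
run;

/* Or an approximate 1% random sample. */
data work.random_sample;
   set anawager.dupi_wagertime;
   if ranuni(12345) < 0.01;
run;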

Data never sleeps
FreelanceReinh
Jade | Level 19

Hi @fengyuwuzu,

 

Some of your numbers seem a bit inconsistent to me:


@fengyuwuzu wrote:

my data is so large, that only two columns (ID, datetime) is already 39G (>76 million rows).


Please note that 39E9/77E6>500. So, you must have an observation length (this is one of the numbers displayed in PROC CONTENTS output) of more than 500 bytes. The datetime variable should be numeric, hence of length 8 as a maximum. How long is the ID to fill more than 492 (minus a little overhead) bytes?

 

Theoretically, 4 (four) bytes would suffice for an ID variable capable of holding distinct IDs for 2^(8*4)=4.29 billion observations.

 

Also, if your datetime values are only integers (i.e. no fractions of a second occur) and do not exceed the year 6314 (!), you could even reduce the length of this variable to 6 bytes without losing information.

 

So, theoretically, your (uncompressed) data could have an observation length of 10 and hence a total size of approx. 750 MB, which is less than 2% of 39 GB.
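To make that arithmetic concrete, a minimal sketch, assuming DUPI has already been replaced by a short numeric surrogate key (newid and work.have are placeholders; building such a key is discussed further down the thread). The 4-byte figure above is the theoretical minimum; an ordinary SAS numeric shorter than 8 bytes only stores integers exactly up to a limit (about 2.1 million for length 4, about 137 billion for length 6), so the sketch uses 6 bytes:

data work.slim;
   /* newid is a hypothetical numeric surrogate for the long ID;            */
   /* 6-byte numerics store integers exactly up to 2**37.                   */
   length newid wagerdatetime 6;
   set work.have;                     /* placeholder input dataset          */
run;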

 

Obviously, a dataset of this comparably moderate size would even fit on a RAM disk* within your 32 GB RAM, which could further improve processing speed (depending on your current storage medium).

 

* (requires RAM disk software)
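If you do set one up, pointing a libref at the RAM disk is a one-liner; the drive letter and paths below are just placeholders:

/* Assumes RAM disk software has already created an R: drive.               */
libname ramlib "R:\sasdata";

/* The WORK library can also be redirected there at SAS invocation,         */
/* e.g. with the -work "R:\saswork" system option or the config file.       */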

Kurt_Bremser
Super User

a) Do a PROC CONTENTS on the dataset and post the output (a minimal call is sketched after this list).

b) Where did you get the data from? Your numbers make me suspect that a (more or less failed) import step created variables that are much too large for their contents.

c) The processing power of your CPU is often secondary when dealing with large tables in SAS. Instead, invest in several fast disks (to spread the I/O load with striping), preferably SSDs.
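For a), the call is simply (libref and member taken from the output posted below):

proc contents data=anawager.dupi_wagertime;
run;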

 

fengyuwuzu
Pyrite | Level 9

Below is the PROC CONTENTS output for the two-column data set. I need to add more columns later (I have data from 7 companies, and I used the keep= option on the SET statement when combining them).

 

The SAS System

 

The CONTENTS Procedure

 

Data Set Name        ANAWAGER.DUPI_WAGERTIME   Observations          717674921
Member Type          DATA                      Variables             2
Engine               V9                        Indexes               0
Created              03/04/2016 14:55:28       Observation Length    56
Last Modified        03/04/2016 14:55:28       Deleted Observations  0
Protection                                     Compressed            NO
Data Set Type                                  Sorted                NO
Label
Data Representation  WINDOWS_64
Encoding             utf-8 Unicode (UTF-8)


Engine/Host Dependent Information
Data Set Page Size          65536
Number of Data Set Pages    614975
First Data Page             1
Max Obs per Page            1167
Obs in First Data Page      1145
Number of Data Set Repairs  0
ExtendObsCounter            YES
Filename                    D:\Backup\Analysis_wager\dupi_wagertime.sas7bdat
Release Created             9.0401M2
Host Created                X64_7PRO


Alphabetic List of Variables and Attributes
#  Variable        Type  Len  Format       Informat
1  DUPI            Char   42  $42.         $42.
2  wagerdatetime   Num     8  DATETIME16.  ANYDTDTM40.
fengyuwuzu
Pyrite | Level 9

Do you think increasing the memory from 32 GB to 64 GB or 128 GB would help?

 

Kurt_Bremser
Super User

Your initially stated number of obs was too low by an order of magnitude. That confused us.

 

So the next question is: what is stored in DUPI? Examine this column and check whether the full length is really needed. Reduce the length of the variable to the maximum length of the data actually present. If the contents vary (widely) in length, using the compress=yes option will save you a lot of space and I/O.

Tinkering with the length of numeric variables can also help (see Reinhard's post).
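A sketch of both ideas combined; the length of 20 below is purely a placeholder, so use whatever step 1 actually reports:

/* Step 1: find the longest DUPI value actually present. */
proc sql;
   select max(length(dupi)) as max_dupi_len
   from anawager.dupi_wagertime;
quit;

/* Step 2: rebuild the table with tightened lengths and compression.        */
/* SAS will warn about multiple lengths for DUPI; values longer than the    */
/* new length would be truncated, so verify step 1 first.                   */
data anawager.dupi_wagertime_small (compress=yes);
   length dupi $20 wagerdatetime 6;
   set anawager.dupi_wagertime;
run;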

 

Before you throw hardware at a data problem, try to repair the data 😉

 

If you encounter datasets of this size in the future, increasing your RAM will be helpful (allowing the system to cache file data in memory), but you will still need to increase your I/O bandwidth, as you will repeatedly read/write big datasets from/to disk.

That means better disks and parallelisation.

Keep in mind that (normal) PCs will not have the throughput on the system bus (PCIe) that dedicated big-data servers (like an IBM POWER or similar) can provide. So, although your i7 may be very capable in terms of CPU specs, your system is very probably not built from the ground up for this kind of data (not number) crunching.

fengyuwuzu
Pyrite | Level 9
Do you mind explaining a little bit how to increase I/O bandwidth? Thanks.
Kurt_Bremser
Super User

a) get faster disks. SSDs preferred.

b) parallelization. Determine the throughput of each component, and then try to saturate the one bottleneck that you have.

 

e.g.:

Your system bus has a maximum of 500 MB/sec.

Your SATA adapter has 200 MB/sec
-> so put in two adapters.

Your disk has 100 MB/sec
-> so put two disks on each adapter, and use striping to parallelize.

Now you can read/write 400 MB/sec, which should let you scan through your 39 GB dataset in about 100 seconds.

 

My fairly old SAS server has two independent PCI Express buses, each with a 2 Gbit FC adapter and an UltraSCSI LVD adapter. Internal disks (WORK and UTILLOC!) are connected to the SCSI adapters, and the large storage boxes are attached via fiber optic. Storage was/is much more expensive than the computer itself.

 

 

fengyuwuzu
Pyrite | Level 9

Thank you very much!

FreelanceReinh
Jade | Level 19

In addition to KurtBremser's good advice, you could consider replacing variable DUPI with a different ID variable (to be defined). For example, a character variable of length 5 can store approx. 916 million different values, even if the characters are restricted to the 62-element set {A, ..., Z, a, ..., z, 0, ..., 9}.


Are the DUPI values unique in your dataset or do they represent certain objects or subjects each of which contributes many observations (with different datetime values)? In the latter case, the lookup table for the translation between DUPI and the new ID could be much smaller than the current dataset and possibly length 4 would be sufficient for the ID.

 

Together with the reduction of the length of WAGERDATETIME to 6 as proposed earlier, you might be able to reduce the observation length from 56 to 10, resulting in a file size of about 6.7 GB.
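A hedged sketch of that translation using a sort and merge (a hash object would avoid sorting the big table if the distinct IDs fit in memory); everything except DUPI and wagerdatetime is a made-up name:

/* Build a lookup table of the distinct DUPI values with a numeric key. */
proc sort data=anawager.dupi_wagertime (keep=dupi)
          out=work.dupi_lookup nodupkey;
   by dupi;
run;

data work.dupi_lookup;
   length subj_id 6;                  /* exact integers up to ~137 billion */
   set work.dupi_lookup;
   subj_id = _n_;
run;

/* Replace DUPI with the short key in the main table. */
proc sort data=anawager.dupi_wagertime out=work.sorted;
   by dupi;
run;

data anawager.wagertime_small (drop=dupi);
   merge work.sorted work.dupi_lookup;
   by dupi;
run;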

Rick_SAS
SAS Super FREQ

The advice you have received is generic advice for fast I/O. But you posted this message to the Statistical Analysis forum, so what analyses do you intend to perform? The answer to that question is important because not every algorithm scales well to 700 million observations. You might go out and spend thousands of dollars on a new state-of-the-art system, only to discover that you still can't perform the analysis that you want.

Kurt_Bremser
Super User

About IDs:

 

Several years ago, we switched the primary keys in our DB/2 from timestamps (that had a precision of microseconds) to UUIDs.

Technically, a UUID is a hexadecimal number with 32 digits, which can be stored as a 16-byte string.

Since UUIDs are the universal method of creating unique identifiers, I'd consider replacing your current ID with them.
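SAS can generate these directly with the UUIDGEN function; a quick hedged sketch (whether a 36- or 32-character key actually saves space compared with a short numeric surrogate is a separate question):

data work.uuid_demo;
   length uuid_txt $36 uuid_hex $32;
   do i = 1 to 3;
      uuid_txt = uuidgen();                  /* 36-character form: 8-4-4-4-12 hex digits with hyphens */
      uuid_hex = compress(uuid_txt, '-');    /* 32 hex digits without hyphens                         */
      output;
   end;
   drop i;
run;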

fengyuwuzu
Pyrite | Level 9

Just want to update what I did later:

Our school has a high-performance computing facility (a Linux platform with SAS 9.4 installed) that we can access. We tested running SAS on it, and it is much faster than on my desktop PC. I am running jobs in batch mode, so for now I can do the large calculations on the HPC system.

 

Thank you very much for all your time and suggestions on this issue.

