Manije72
Calcite | Level 5

Hi,

I am working with a large dataset containing 500,000 rows and 4,000 columns, which requires 10 GB of memory. I imported the data into SAS, but only a portion of the data was successfully loaded, and many variables were not detected during analysis. My computer has limited memory, with only 16 GB available. Consequently, SAS is unable to open the entire dataset. How can I open the full dataset and ensure all variables are available for analysis in SAS?

AhmedAl_Attar
Rhodochrosite | Level 12

Hi @Manije72 

A typical SAS session comes with default settings that you may need to customize to work with a "large" data set like yours.

To see the default settings, execute this code:

proc options group=performance; run;
/* To see the memory settings - which is a subset of the performance */
proc options group=memory; run;

If you want to change the default settings, here are two ways to do it before starting the SAS session:

  1. Windows shortcut Target properties --> Add -memsize 12G anywhere after the sas.exe 
  2. Create sasv9.cfg file under your windows user home directory (C:\Users\<YourName>\), and within the sasv9.cfg have the following lines, as an example 
-memsize 12G
-sortsize 4G
-cpucount 8
-threads

Note: You can always execute the PROC OPTIONS statement to check the effects of your customizations.
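As a quick sanity check, a single option can also be inspected from inside a running session (a minimal sketch; MEMSIZE shown as an example):

```sas
/* Show the current value of one option */
proc options option=memsize value;
run;

/* Or capture it programmatically in the log */
%put NOTE: MEMSIZE is currently %sysfunc(getoption(memsize));
```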

Hope this can help

Tom
Super User

SAS does not normally load the whole dataset into memory, so the fact that it takes 10 GB of disk space to store the data should not prevent you from using it.

 

How did you "import" the data into SAS?  What format is the data in now?  If you have it in a TEXT file, such as a CSV file, then SAS can easily READ such a file.  Show the data step you used to read the file and explain how it failed to read all of the data.

Quentin
Super User

Can you post the code you are running, and the log that shows the error messages about running out of memory?

 

4,000 columns is a lot of columns, but 500,000 rows is not much data.  One of the beautiful designs of the SAS DATA step is that it does NOT need to open the entire dataset it reads.  It reads data one row at a time.  The DATA step was designed to use very little memory, because there wasn't much memory around when SAS was created.  So the design decision was to minimize memory usage by increasing disk I/O.  Generally, in a DATA step it's unusual to hit a memory limit.  Some PROCs, like PROC SORT, can be more memory intensive.  In a DATA step, if you create a hash table, that can be memory intensive.  So depending on what step you are running when you have the problem, there may be different solutions.

 

Ideally, the best way to explain the problem would be to post some code that causes the problem, and then post the log you get from running that code.

 

For example, the code below will create a dataset with 500,000 rows and 4,001 variables.  I would think it would run fine on your PC; does it?  Then try running your PROC on data like this and see if you can get the memory error to occur.

 

data have ;
  array x {4000} (1000*(3 6 9 12)) ;
  do id=1 to 500000 ;
    output ;
  end ;
run ;

 

 

BASUG is hosting free webinars ! Check out our recordings of past webinars: https://www.basug.org/videos. Be sure to subscribe to our email list for notification of future BASUG events.
Manije72
Calcite | Level 5

SAS cannot read the last 1,000 variables correctly and displays them with generic names like VAR4680, VAR4681, VAR4682, ... instead of their actual names. Additionally, some of these variables are numeric, but SAS recognizes them as character variables when I use PROC CONTENTS. Consequently, when I run PROC MEANS on these variables, I receive the error: 'Variable VAR4680 in list does not match type prescribed for this list.' I tried to convert the character variables to numeric with the code below, but the mean and SD I got were not what I expected for that variable.

 

data kl;
set kk;
VAR4679_num=input(VAR4679, best32.);
run;
proc means data=kl;
var VAR4679_num;
run;
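One possible cause, assuming the column contains strings such as "NA" or "NaN", is that INPUT turns those rows into missing values (or logs data errors), which shifts the mean and SD. A hedged sketch, reusing the KK/KL names from the step above: inspect the raw values first, then convert quietly with the ?? modifier:

```sas
/* Inspect the most frequent raw values before converting */
proc freq data=kk order=freq;
  tables VAR4679 / maxlevels=20;
run;

data kl;
  set kk;
  /* ?? suppresses invalid-data notes; "NA"/"NaN" become missing (.) */
  VAR4679_num = input(VAR4679, ?? best32.);
run;

/* NMISS shows how many rows failed to convert */
proc means data=kl n nmiss mean std;
  var VAR4679_num;
run;
```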

ballardw
Super User

Please describe how you know " only a portion of the data was successfully loaded, and many variables were not detected during analysis".

 

What type of analysis? Many regression procedures by default will not use any observations with one or more missing values for variables on the MODEL statement. Other statements in other procedures will have similar limitations.

 

 

Manije72
Calcite | Level 5

Some of these variables are numeric, but SAS recognizes them as character variables when I use PROC CONTENTS. Consequently, when I run PROC MEANS on these variables, I receive the error: 'Variable VAR4680 in list does not match type prescribed for this list.' How can I solve this problem?

Tom
Super User

@Manije72 wrote:

Some of these variables are numeric, but SAS recognizes them as character variables when I use PROC CONTENTS. Consequently, when I run PROC MEANS on these variables, I receive the error: 'Variable VAR4680 in list does not match type prescribed for this list.' How can I solve this problem?


PROC CONTENTS shows how the variables are defined in the SAS dataset.  If PROC IMPORT created character variables for columns that you think should be numeric then one of three things happened.

 

1) There is non numeric data somewhere in the column.  Perhaps it has those superfluous NA strings that programs like R like to write into CSV files.

2) The variable is empty on every observation.  PROC IMPORT makes a 1 character variable in that case to save space in the SAS dataset.

3) You did not tell it to use enough observations to guess how to read the data, and all of the observations checked had empty values in that column.

Manije72
Calcite | Level 5

1) There is non numeric data somewhere in the column.  Perhaps it has those superfluous NA strings that programs like R like to write into CSV files.

Yes, maybe there are some string variables like "NaN" . How should I solve this problem now?

2) The variable is empty on every observation.  PROC IMPORT makes a 1 character variable in that case to save space in the SAS dataset.

How can I solve this problem?

3) You did not tell it to use enough observations to guess how to read the data, and all of the observations checked had empty values in that column.

How should I solve this problem?
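All three cases have the same practical fix: get the values into numeric variables after (or instead of) the guessed import. A hypothetical sketch, assuming VAR4601-VAR4680 are character columns that should be numeric (the range and dataset name are placeholders; adjust them to your data):

```sas
data fixed;
  set your_dataset;
  array c{*} $ VAR4601-VAR4680;   /* character columns to convert (hypothetical range) */
  array n{*}   n4601-n4680;       /* new numeric columns */
  do _i = 1 to dim(c);
    n{_i} = input(c{_i}, ?? best32.);  /* "NA"/"NaN"/blank become missing */
  end;
  drop _i VAR4601-VAR4680;
run;
```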

 

I previously imported the data by clicking File > Import Data > CSV, and it was successfully imported. Now, I tried to import it using this code, but it has been running for two hours.

 

proc import datafile='path_to_your_file.csv'
     out=your_dataset
     dbms=csv
     replace;
  getnames=yes;
  guessingrows=max;
run;
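GUESSINGROWS=MAX makes PROC IMPORT scan the entire 10 GB file before it even starts reading the data, which is likely why it has been running for hours. A hedged compromise, assuming the first few thousand rows are representative of the rest, is to scan only a sample:

```sas
proc import datafile='path_to_your_file.csv'
     out=your_dataset
     dbms=csv
     replace;
  getnames=yes;
  guessingrows=5000;  /* scan a sample instead of the whole file */
run;
```

Note this only speeds up the guessing pass; it does not change which columns PROC IMPORT decides are character.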

Tom
Super User

Know your data.  Look at your CSV file.  Figure out which variables are numeric.  For character variables figure out how long each one needs to be to store all of the data (or all of the data you actually need). Write the data step to read it.

 

To read a CSV file the data step is as simple as this.

data want;
  infile 'myfile.csv' dsd firstobs=2 truncover lrecl=1000000;
  length firstvar 8 secondvar $20 ..... lastvar 8 ;
  informat datevar mmddyy.;
  format datevar yymmdd10.;
  input firstvar -- lastvar;
run;

So for your file just update the LENGTH statement that is setting the variable types and storage length. Define the variables in the same order they appear on the lines of the file and then the INPUT statement can be as simple as the one I show that uses a simple position based variable list.  You only need to attach formats or informats to variables that NEED them.  Most variables will NOT need them.  Usually only things like DATE, TIME and DATETIME variables will need either informats or formats attached to them.

 

If you don't know how to LOOK at your CSV file you can use a simple data step like this to read in the first 5 lines and dump them to the SAS log where you can look at them.

data _null_;
  infile 'myfile.csv' obs=5;
  input;
  list;
run;

You might want to include the option LRECL= like I used above in case the lines in the file are longer than the default 32K bytes.

Reeza
Super User
Do you have a data dictionary for your data? If so, show us the format and it will be easier to write code to read your data from that.

For data of this size, using PROC IMPORT will cause headaches. It guesses...and not always correctly.
Kurt_Bremser
Super User

What is the format (database, text file, Excel) of your source data, and how did you import it into SAS?

With 4000 variables my "data hidden in structure" alarm bell goes off, so you should consider converting to a long dataset structure during the import process.
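If the 4,000 columns really are repeated measures of the same kind, one way to get a long structure, sketched here under the assumptions that the table has a unique row identifier (ID here) and that the measure columns share a type, is PROC TRANSPOSE:

```sas
proc transpose data=your_dataset
               out=long(rename=(_name_=varname col1=value));
  by id;            /* assumes a unique row-identifier variable exists */
  var VAR1-VAR4000; /* hypothetical range of measure columns */
run;
```

If the VAR list mixes character and numeric variables, the output VALUE column comes out as character.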

Manije72
Calcite | Level 5

My dataset is in CSV format. I import the data by clicking File > Import Data > CSV. While I can perform analysis, I am unable to view the data using PROC PRINT. Furthermore, it appears that SAS cannot read the last 1,000 variables correctly and displays them with generic names like VAR4680, VAR4681, VAR4682, ... instead of their actual names. Additionally, some of these variables are numeric, but SAS recognizes them as character variables when I use PROC CONTENTS. Consequently, when I run PROC MEANS on these variables, I receive the error: 'Variable VAR4680 in list does not match type prescribed for this list.'

Tom
Super User

@Manije72 wrote:

My dataset is in CSV format. I import the data by clicking File > Import Data > CSV. While I can perform analysis, I am unable to view the data using PROC PRINT. Furthermore, it appears that SAS cannot read the last 1,000 variables correctly and displays them with generic names like VAR4680, VAR4681, VAR4682, ... instead of their actual names. Additionally, some of these variables are numeric, but SAS recognizes them as character variables when I use PROC CONTENTS. Consequently, when I run PROC MEANS on these variables, I receive the error: 'Variable VAR4680 in list does not match type prescribed for this list.'


PROC IMPORT (or even the IMPORT tools in many point-and-click interfaces) is a good way to take a QUICK AND DIRTY look at your data.  But because it has to GUESS how to define the variables, you should not use it for anything important.

 

Instead write your own data step to READ the file. That way you can control how each variable is defined (name, type, storage length, display format, any special informat need to convert the text in the CSV file into valid data values, etc.).

 

And in particular PROC IMPORT has a bug that prevents it from seeing more than 32K bytes of the header line of the CSV file.  That is what is causing your variable names to be generic VAR4681 etc.

 

Another common trouble is that it will define empty variables as character of length 1, since that uses 7 fewer bytes than a numeric variable would.

 

If you want a tool that does a better guess about how to read a CSV file try using this macro instead:  https://github.com/sasutils/macros/blob/master/csv2ds.sas

 

In addition to reading all of the header row, it generates cleaner, easier-to-use SAS code that you can use as a starting point for writing your own data step to read the file.

Manije72
Calcite | Level 5
Does it only prevent viewing the header? I mean, do the variables with generic names still contain the actual data in SAS?


Discussion stats
  • 21 replies
  • 761 views
  • 10 likes
  • 7 in conversation