I have a basic laptop, and working with my huge dataset (millions of rows and 100 variables) has turned into a nightmare. Therefore I am planning to request an upgrade, either to my laptop or a desktop, from my advisor. I do not have an IT background, and I need to know what laptop/desktop characteristics (IT-related specifications including CPU, etc.) are recommended to have SAS run properly and faster on huge-dataset analysis. I appreciate any advice on this matter.
The dataset itself is almost 3 GB (it has >100 million rows). Recently I used "guessingrows" in a PROC and it took 1 hour to run. I will need to run regression PROCs later, but for now I am mostly doing descriptive statistics and data transformation, and it takes too long.
If the issue is with GUESSINGROWS, then you need to stop doing that with large data sets (it is limited to 32K rows anyway). Any data set of 100 million rows that has no documentation, "forcing" the use of PROC IMPORT (basically the only use of GUESSINGROWS), is garbage to begin with. Use the documentation to write a proper DATA step program.
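As a rough sketch of what such a hand-written DATA step looks like (the file path, delimiter, and variable names/informats here are placeholders; take the real ones from your data documentation):

```sas
/* Sketch only: path, delimiter, and variable definitions are
   assumptions -- replace them with what your documentation says. */
data work.bigdata;
    infile "C:\data\bigfile.csv" dlm=',' dsd firstobs=2 truncover;
    input
        id       : $12.        /* character key            */
        visit_dt : yymmdd10.   /* date read with informat  */
        score    : 8.          /* numeric measurement      */
        group    : $1.         /* character code           */
    ;
    format visit_dt yymmdd10.;
run;
```

Because the variable types and lengths are stated explicitly, SAS never has to scan the file to guess them, which is exactly the cost GUESSINGROWS adds.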
Note that reading 100 million rows of data in one hour is roughly 28,000 rows per second. It takes time to read and process big files. Period.
A faster disk may be of more benefit than memory or CPU upgrades with large files.
Obviously, I don't know your project, but chances are that your current hardware is fully sufficient for a much larger part of your work than you might think.
As others have already suggested, I would start by writing a DATA step that reads a few thousand records of the raw data (using the OBS= option on the INFILE statement). Then, after a very short run time and a careful check of the log, you will have a fairly small SAS dataset which you can explore with PROC PRINT, PROC FREQ, PROC MEANS, etc., with technical data correctness being the first objective. Are all variables populated as expected? Check for missing values, truncated character values, incorrect (in)formats and so on, and correct the reading step as necessary. A single run of the "pre-final" DATA step on the full raw data file will show whether unexpected issues occur further down in the raw data.
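For illustration, a development version of the reading step might look like this (path and input list are placeholders; OBS= here counts raw lines, so with FIRSTOBS=2 the value 5001 yields 5,000 data records):

```sas
/* Sketch: read only the first 5,000 records for development. */
data work.sample;
    infile "C:\data\bigfile.csv" dlm=',' dsd
           firstobs=2 obs=5001 truncover;
    input id :$12. visit_dt :yymmdd10. score group :$1.;
    format visit_dt yymmdd10.;
run;

/* Quick technical check: counts, missings, ranges. */
proc means data=work.sample n nmiss min max;
run;
```

This runs in seconds, and once the log and the PROC MEANS output look clean, removing the OBS= option turns it into the full-data step unchanged.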
Then continue the investigation of the small dataset to become familiar with the data structure. What are the key variables? How many observations are there per key variable combination? Is the dataset already properly sorted? Are there redundant variables? ...
Based on that knowledge together with the data description you can decide if the existing dataset is a sufficient basis for writing the DATA and PROC steps for the "descriptive process and data transformation" that you mentioned and the regression or if you need to read more records from the raw data (not necessarily the first n records, possibly only selected variables).
But even a sufficiently extended subset of the data is most likely small enough that run time will not be an issue until you have developed almost production-ready programs for plausibility checks, descriptive tables and graphs, as well as inferential analyses.
At this advanced stage you can (step by step) increase the number of records involved and see how the run times of the various DATA and PROC steps change. Thanks to the extensive preparations you will hardly ever need to run a step many times because of mistakes.
So, probably most of your work can be done on a relatively small subset of the data (e.g., a suitable random sample), and until the final few (overnight) runs of the programs on the full data, your hardware limitations will not be an obstacle.
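One way to draw such a random sample, assuming SAS/STAT is licensed (the dataset names and sampling rate below are placeholders):

```sas
/* Sketch: draw a ~1% simple random sample for development.
   Requires SAS/STAT; names and rate are assumptions. */
proc surveyselect data=work.bigdata out=work.devsample
                  method=srs samprate=0.01 seed=12345;
run;
```

Fixing SEED= makes the sample reproducible, so every development run works on the same subset.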
What does "nightmare" mean? Does it mean painfully slow, or were you actually getting errors? If you were getting errors, what errors did you get, and what were you doing when you got them?
Also, since you mention having an advisor I assume that means you are at a university? You may want to ask if the university has a SAS server environment that you could use for this work.
Though to be fair, you typically only do this once and then save it to a drive, so it's not a huge time saver, just something to know.
Also, if you have an SSD, it's faster than a typical spinning hard drive.
One would hope so. But how many examples do we find on this forum of beginners re-importing files in each session because they don't save the data to a permanent library for reuse?
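The fix is a one-time LIBNAME pointing at a folder on disk (the path and names below are placeholders):

```sas
/* Sketch: save the imported data to a permanent library once,
   then reuse it in later sessions. Path is an assumption. */
libname mylib "C:\sasdata";

data mylib.bigdata;   /* written to disk, survives the session */
    set work.bigdata;
run;
```

In any later session, the same LIBNAME statement makes mylib.bigdata directly available, with no re-import and no GUESSINGROWS cost at all.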