xyxu
Quartz | Level 8

Hi folks,

 

I am working with a few large datasets (> 20 GB) and my desktop can barely handle them efficiently, so I was wondering about the best way to run my code on cloud services. I tried AWS, but they only seem to offer SAS University Edition, which restricts data files to 10 MB or less (ridiculously small nowadays). What is your recommended solution?

13 REPLIES
Reeza
Super User

Do you have a purchased SAS license? If so, you can run it on any cloud service by setting up a virtual machine and installing SAS on it. Another option is SaasNow, which sells hosted access: https://www.saasnow.com

 

SAS UE is designed for learning, not for use with large datasets. If you're affiliated with a university, it may sell you access to SAS for a very low price; mine, for example, charges $29 for an annual license.

xyxu
Quartz | Level 8

Yes, I do have a paid SAS license. My school sold it to me for $90/year.

How much did you pay to set up a virtual machine? On AWS, a reasonably good one seems to cost quite a bit:

[Screenshot: AWS virtual machine pricing]

 

 

SASKiwi
PROC Star

Adding a solid state drive (SSD) to your PC would give you much better performance and they are not that expensive these days.

xyxu
Quartz | Level 8

I installed the operating system on a 250GB SSD and run SAS from it. When I compare speeds, a sort procedure takes 3 minutes in SAS but only 1.5 seconds in Julia. I am considering buying a bigger SSD, though that might not help with speed.
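For timing comparisons like this, SAS's FULLSTIMER option breaks real time out from CPU time and memory use, which shows whether a sort is I/O-bound. A minimal sketch (the dataset and BY variable names are made up):

```sas
options fullstimer;   /* log detailed real time / CPU time / memory stats */

/* Sort a WORK dataset; FULLSTIMER statistics appear in the log */
proc sort data=work.big out=work.big_sorted;
    by key_var;
run;

/* In the log, if real time greatly exceeds CPU time, the sort is
   I/O-bound, and faster storage for WORK will help more than CPU. */
```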

SASKiwi
PROC Star

If you run SAS from one SSD and keep your data on another, I'd expect that to be faster. Also check where your SAS WORK library is pointing. It needs to be on an SSD too, since that is where all of the SAS sort utility files go.
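A quick way to check and relocate WORK (the paths below are examples; adjust them to your own drives):

```sas
/* Print the current WORK library location to the log */
%put %sysfunc(pathname(work));
```

WORK itself can only be moved at startup, e.g. with `-WORK "D:\saswork"` on the command line or in sasv9.cfg. The related UTILLOC option controls where PROC SORT's utility files go; it defaults to the WORK location but can be pointed at a separate fast drive with `-UTILLOC "E:\sasutil"`.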

Reeza
Super User
Isn't Julia in-memory, though, whereas Base SAS is disk-based? I wouldn't expect the same performance at all between those types of systems. Can you work with the 20GB file on your desktop with Julia?
xyxu
Quartz | Level 8

Yeah, that's why I found the speed difference a bit shocking. I exported the SAS dataset to CSV and had Julia read it, then ran the same sort procedure and compared the times.

Reeza
Super User

@xyxu wrote:

Yeah, that's why I found the speed difference a bit shocking. I exported the SAS dataset to CSV and had Julia read it, then ran the same sort procedure and compared the times.


So Julia is processing a 20GB file on your desktop?

 

SAS CAS operates in memory and will likely meet your requirements, but it's not a cheap or easy setup by any means; it's very much enterprise software. I'm not sure I'd even feel comfortable tackling that setup personally.

 

I'm genuinely curious, by the way. I no longer work with SAS except on the side; I work with Python/R, but not Julia. We run into massive issues getting R or Python to work with our datasets (900 million rows) at the moment. We're likely looking at setting up a Spark cluster, but I'm not sure now, as there may be budget restrictions.

xyxu
Quartz | Level 8

My original SAS dataset is 20.8 GB, and it becomes 6.2GB when exported to CSV. So what Julia actually processes is the 6.2GB file. 

ballardw
Super User

@xyxu wrote:

My original SAS dataset is 20.8 GB, and it becomes 6.2GB when exported to CSV. So what Julia actually processes is the 6.2GB file. 


I might suggest looking at your variable properties. Do you have a lot of character variables with large assigned lengths, where most of the reserved characters actually go unused?

If so you might want to look at how that SAS set was built and if all those variables really need that length.

 

Any analysis done with SAS can always use a reduced set of variables via the data set KEEP= option, so you needn't drag along a bunch of variables you aren't using.
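A minimal sketch of the KEEP= data set option (the library, dataset, and variable names are made up):

```sas
/* Only the three listed variables are read from disk; the rest of
   the wide dataset is never loaded, which cuts I/O substantially. */
proc sort data=mylib.big_table(keep=firm_id sales year)
          out=work.sorted_subset;
    by firm_id year;
run;
```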

xyxu
Quartz | Level 8

Yes, you are right. When I read raw data files into SAS, I set the lengths fairly long ($200.) to ensure strings such as firm names and addresses are never truncated. This increases the size a lot. Do you have any suggested practice for reducing unnecessary size while avoiding potential truncation?
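One common approach is to measure the longest value actually stored in each character variable, then rebuild the dataset with lengths that match the data plus a small margin. A sketch, with hypothetical library, dataset, and variable names:

```sas
/* Step 1: find the longest stored value per character variable */
proc sql;
    select max(length(firm_name)) as max_name,
           max(length(address))   as max_addr
    from mylib.big_table;
quit;

/* Step 2: rebuild with lengths from step 1, padded for safety.
   The LENGTH statement must come before SET to take effect; SAS
   logs a "multiple lengths" warning and truncates longer values,
   so pad above the observed maximums. */
data mylib.big_table_small;
    length firm_name $60 address $80;
    set mylib.big_table;
run;
```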

Reeza
Super User
You didn't use PROC IMPORT on the 6.2GB CSV, did you? If so, that would definitely lead to some issues in terms of data size.
xyxu
Quartz | Level 8

No, I didn't. I think the dataset size is related to the string lengths.

