Hi all,
I am working with big data of around 158,000,000 KB (roughly 160 GB). I am using SAS EG on a PC. It's taking a long time to bring in the data. I will be using 'views' to work with the data and also limiting the data to a few required variables.
What are some things I should take into account while working with 'views', and are there other ways I could work with the data faster? Your input will be really helpful. Please let me know as soon as possible. Thanks!
@Tom Thanks for all your suggestions. I will definitely be implementing them. I have about 20,000 ids, so it will be hard to put them in a macro variable. It would certainly have helped to subset to those ids in the first step itself.
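For example (a sketch only, using the dataset and variable names from the plan later in this thread, so the join target and variable list are assumptions), the first pass could drop the unneeded variables and keep only the required ids in a single view:

proc sql;
  create view orig_1v as
    select a.id, a.start_date2, a.end_date, a.type, b.start_date
    from orig1 as a
         inner join required as b
         on a.id = b.id
    where a.id is not missing;   /* only non-missing ids that also appear in 'required' */
quit;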
Define "bring in the data". Do you mean reading a SAS dataset for analysis, or importing the data from external sources?
@Kurt_Bremser Sure. I am working with a SAS dataset that is 186 GB in size. The 'view' for the SAS dataset 'have' gets created instantly, but the run time for the next DATA step doesn't seem to end. Below is the code I am running at present. I submitted both data steps together; it's been about 20 minutes now and it's still running.
data want(keep=id var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 var11) /view=want;
set libref.have;
run;
data want_2;
set want;
if id=. then id2=0;
else id2=1;
run;
Don't make a copy of the large data if you can avoid it. What you have posted as two steps could all be done in the first step.
data want /view=want;
set libref.have (keep=id var1-var11);
id2 = not missing(id);
run;
What are you going to DO with the data?
@Tom Thanks for the advice. I actually need to do a lot more work with these views. Doing a PROC SORT on a view is also time-consuming. Is that normal?
For example, for the view 'want':
proc sort data=want out=want_sort; by id; run;
You are again making a copy of the data.
If you need the data sorted in a way it is not already sorted, then you might look at using an INDEX instead.
I am not sure that an index can be used with a view, however.
Do you really need to sort the data?
What are you ultimately trying to do?
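For reference, a minimal sketch of creating a simple index on id, assuming the permanent dataset libref.have from the earlier step (how much a later query benefits depends on how the data is actually accessed):

proc datasets library=libref nolist;
  modify have;
  index create id;   /* simple index on the id variable */
quit;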
@Tom I get your point, but I have a larger task at hand. The system is not built to handle big data, and I am thinking of ways to process it faster. Here is a detailed description of my task. Please advise me on the best way to go about it:
(1) I have 5 large SAS datasets, orig1 to orig5, around 186 GB each. I need only a few variables from each of these datasets, with non-missing ids, for which I would like to create views called orig_1v to orig_5v.
The code will be as follows. Creating a view lets me get what I need instantly:
data orig_1v (keep=id start_date2 end_date type)/view=orig_1v;
set orig1;
where id ne .;
run;
Below is a sample of each of the orig_1v to orig_5v views (sample table not reproduced here).
(2) I need only certain ids from orig_1v to orig_5v, namely those present in the dataset 'required'. Below is a sample of the 'required' SAS dataset:

id     start_date
001    1/1/2001
002    1/2/2002
003    1/2/3002
004    1/3/2002
(3) I could stack the views first (though I don't know whether that's a good idea):
data orig_vall;
  set orig_1v
      orig_2v
      orig_3v
      orig_4v
      orig_5v;
run;
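Or, if writing the stacked copy to disk is too costly, the stack itself could be a view as well (a sketch only; not tested on data of this size):

data orig_vall / view=orig_vall;
  set orig_1v orig_2v orig_3v orig_4v orig_5v;   /* concatenates the five views without copying them */
run;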
(4) Then I can do an inner join to create another view:
proc sql;
  create view want as
  select a.*, b.start_date
  from orig_vall as a
       inner join required as b
       on a.id = b.id;
quit;
Once I have the view 'want', I have to create derived variables and flag them based on conditions involving the dates start_date, start_date2, and end_date. Could I do that using the view 'want'?
Is this a good approach? Please let me know. Thanks.
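For illustration, the kind of derivation meant here could look like the sketch below; the condition is only a placeholder (the real rules are more involved), and it assumes the date variables are numeric SAS dates:

data want_flags;
  set want;                                               /* reading the view streams the joined rows */
  in_window = (start_date2 <= start_date <= end_date);    /* placeholder condition */
run;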
As the first step, create new datasets (not views) that only contain the needed variables. These will sort a lot faster.
The first step will of course take lots of time, but a sequential read will be the fastest you can do.
Take a look at the overall data rate of your storage. With data this size, you need something that can deliver 500 MB/s or faster.
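A minimal sketch of that first step for one of the five datasets, assuming the variable names from the earlier posts:

data thin1;
  set orig1 (keep=id start_date2 end_date type);
  where id ne .;   /* drop records with missing ids during the sequential read */
run;

proc sort data=thin1;
  by id;           /* the thin dataset sorts far faster than the full 186 GB original */
run;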
@Kurt_Bremser thanks for your suggestion. I will do that. How is it possible to estimate the overall data rate of storage? I would like to see that and provide information to our technical analyst in case we need any changes. Please let me know.
@dr2014 wrote:
@Kurt_Bremser thanks for your suggestion. I will do that. How is it possible to estimate the overall data rate of storage? I would like to see that and provide information to our technical analyst in case we need any changes. Please let me know.
Just do a
options fullstimer;
data _null_;
set yourlib.yourdata;
run;
and look at the log. Divide the dataset size by the real time and you have your sustained read throughput; for example, reading 186 GB in 400 seconds works out to roughly 465 MB/s.
If you pull data from a remote DBMS, the network will be your bottleneck. In this case, unload as much of the initial "slicing" to the database, with explicit pass-through, if needed.
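A sketch of explicit pass-through, assuming an Oracle source purely as an example (the engine name, connection options, and table name are placeholders); the inner query runs entirely in the database, so only the reduced result crosses the network:

proc sql;
  connect to oracle as db (user=myuser password=XXXX path=mydb);   /* placeholder connection */
  create table work.slice as
    select * from connection to db
      ( select id, start_date2, end_date, type
        from schema1.orig1                                         /* placeholder table name */
        where id is not null );
  disconnect from db;
quit;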
@Kurt_Bremser I will use your code to check the timing. Thanks for all your advice. More than one person has contributed to the solution. Thanks much!
If you have SAS on a remote server available to you, it would be far quicker there.
It is on a remote server, but it's not helping much with the run time.
I suggest making your post subject a bit clearer then.
Since (2) seems to be the part that will subset the data the most, do that first.
How large is this list? Can you fit it in a macro variable?
Can you fit it in memory?
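If the list fits in memory (20,000 ids easily will), a hash lookup during the single sequential pass is one sketch worth considering; the dataset and variable names below are taken from the earlier posts and are otherwise assumptions:

data orig1_req;
  if _n_ = 1 then do;
    if 0 then set required;                  /* defines start_date in the PDV */
    declare hash h (dataset: "required");    /* ~20,000 ids load into memory easily */
    h.definekey("id");
    h.definedata("start_date");
    h.definedone();
  end;
  set orig1 (keep=id start_date2 end_date type);
  where id ne .;
  if h.find() = 0;                           /* keep only ids present in 'required' */
run;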