Hi,
I have an extremely large dataset that crashes my SAS for being too large if I try to perform any commands on it (e.g., proc sort or merge). I only need to work with a small subset of that data; however, the subset I need to work on is identified by a list of id's that is currently in a separate dataset ("smallDataset").
Is there any way to do something to the effect of:
data x; set largeDataset; where id in smallDataset and largeDataset; run;
I could then merge the large and small datasets no problem.
Any other suggestions of how to accomplish this would be great.
Current code:
data largeDataset; set '/folders/myfolders/largeDataset.sas7bdat'; run; proc sort data=largeDataset; by caseid; run;
**Crashes here^^^
proc sort data=smallDataset;
by caseid;
run;
data merged;
merge smallDataset largeDataset;
by caseid;
run;
Yeah, it's likely too large for SAS University Edition which it appears you're using. If you can use a full version you won't run into these issues. You also increased the RAM settings and dual core I assume?
You can get a subset by using the following code.
proc sql;
create table sub as
select *
from bigData
where ID in (select Id from smallTable);
quit;
This is one case where it may make sense to make several subsets and loop through them to get what you need, if possible.
@thanksforhelp12 wrote:
Large dataset has information on 300 variables for 2 million cases.
Small dataset has an additional 20 variables for 20,000 of those 2 million cases.
I would like to perform analyses on those 20,000 cases including both the original 300 and additional 20 variables (columns).
Therefore, my goal is to merge smallDataset with the subset of cases from largeDataset whose caseid is also in smallDataset (i.e., to end up with 20,000 cases (rows) of 320 variables).
How large is 'extremely large'?
It's around 10gb (1 million by 300 - I know some people say that isn't that big but it has been crashing every time I try to do anything with it - got it working by using a custom user folder and selecting a subset of data (with where) before I ever load it in with a "data" command, which has worked well until I am now faced with the need to somehow select the subset based on another .sas7bat. Here is the error it throws when I try to sort (there is 300gb of HDD space free).
Ok. What about your small dataset?
You can write your subsetting code like this without sorting
data merged;
if _N_ = 1 then do;
declare hash h(dataset:'smallDataset');
h.defineKey('caseid');
h.defineDone();
end;
set largeDataset;
if h.check();
run;
Small dataset is small. Like 20k rows by 19 col. Thanks a lot for the subset code.
It executed successfully; however, it returned a large dataset without the new columns from smallDataset added
What new columns? Please be more specific 🙂
Large dataset has information on 300 variables for 2 million cases.
Small dataset has an additional 20 variables for 20,000 of those 2 million cases.
I would like to perform analyses on those 20,000 cases including both the original 300 and additional 20 variables (columns).
Therefore, my goal is to merge smallDataset with the subset of cases from largeDataset whose caseid is also in smallDataset (i.e., to end up with 20,000 cases (rows) of 320 variables).
The code above did run; however, it just returned the original 2 million rows by 300 columns.
Yeah, it's likely too large for SAS University Edition which it appears you're using. If you can use a full version you won't run into these issues. You also increased the RAM settings and dual core I assume?
You can get a subset by using the following code.
proc sql;
create table sub as
select *
from bigData
where ID in (select Id from smallTable);
quit;
This is one case where it may make sense to make several subsets and loop through them to get what you need, if possible.
@thanksforhelp12 wrote:
Large dataset has information on 300 variables for 2 million cases.
Small dataset has an additional 20 variables for 20,000 of those 2 million cases.
I would like to perform analyses on those 20,000 cases including both the original 300 and additional 20 variables (columns).
Therefore, my goal is to merge smallDataset with the subset of cases from largeDataset whose caseid is also in smallDataset (i.e., to end up with 20,000 cases (rows) of 320 variables).
I'm glad you found your answer.
If @Reezas answer worked for you, then her answer is the one to mark as a solution 🙂
Ah ok. Do like this then and insert your variables from smallDataset instead of Var1, Var2, var3 in the call missing statement
data merged;
if 0 then set smallDataset;
if _N_ = 1 then do;
declare hash h(dataset:'smallDataset');
h.defineKey('caseid');
h.definedata(all:'Y');
h.defineDone();
/* Insert your actual variables from smallDataset instead of Var1, Var2, Var3... */
call missing('var1', 'var2', 'var3');
end;
set largeDataset;
if h.find();
run;
@thanksforhelp12 wrote:
It's around 10gb (1 million by 300 - I know some people say that isn't that big but it has been crashing every time I try to do anything with it - got it working by using a custom user folder and selecting a subset of data (with where) before I ever load it in with a "data" command, which has worked well until I am now faced with the need to somehow select the subset based on another .sas7bat. Here is the error it throws when I try to sort (there is 300gb of HDD space free).
ERROR: No disk space is available for the write operation. Filename =/tmp/SAS_util000100000A6F_localhost.localdomain/ut0A6F000005.utl.ERROR: Failure while attempting to write page 3728 of sorted run 6.ERROR: Failure while attempting to write page 44778 to utility file 1.ERROR: Failure encountered while creating initial set of sorted runs.ERROR: Failure encountered during external sort.ERROR: Sort execution failure.
If you are working in a networked environment such as Enterprise Guide perhaps your SAS admin has restricted the amount of disk space you are allowed to use.
If you are using Proc Sort then perhaps you want to try using the TAGSORT option. Only the sort keys and an observation number are used. Which means most of the 300 variables are ignored during the mechanics of the sort and then the values are brought in the sorted result of the key variables.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.