BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
thanksforhelp12
Calcite | Level 5

Hi, 

 

I have an extremely large dataset that crashes my SAS for being too large if I try to perform any commands on it (e.g., proc sort or merge). I only need to work with a small subset of that data; however, the subset I need to work on is identified by a list of id's that is currently in a separate dataset ("smallDataset").


Is there any way to do something to the effect of:

data x;
   set largeDataset;
   where id in smallDataset and largeDataset;
run;

I could then merge the large and small datasets no problem.

 

Any other suggestions of how to accomplish this would be great.

 

Current code:

data largeDataset; set '/folders/myfolders/largeDataset.sas7bdat'; run;
proc sort data=largeDataset; by caseid; run;

 

**Crashes here^^^

 

proc sort data=smallDataset;
by caseid;
run;


data merged;
merge smallDataset largeDataset;
by caseid;
run;

 

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

Yeah, it's likely too large for SAS University Edition which it appears you're using. If you can use a full version you won't run into these issues. You also increased the RAM settings and dual core I assume?

 

You can get a subset by using the following code. 

proc sql;
create table sub as
select *
from bigData
where ID in (select Id from smallTable);
quit;

This is one case where it may make sense to make several subsets and loop through them to get what you need, if possible.


@thanksforhelp12 wrote:

Large dataset has information on 300 variables for 2 million cases.

Small dataset has an additional 20 variables for 20,000 of those 2 million cases.

 

I would like to perform analyses on those 20,000 cases including both the original 300 and additional 20 variables (columns).


Therefore, my goal is to merge smallDataset with the subset of cases from largeDataset whose caseid is also in smallDataset (i.e., to end up with 20,000 cases (rows) of 320 variables).


 

View solution in original post

11 REPLIES 11
thanksforhelp12
Calcite | Level 5

It's around 10gb (1 million by 300 - I know some people say that isn't that big but it has been crashing every time I try to do anything with it - got it working by using a custom user folder and selecting a subset of data (with where) before I ever load it in with a "data" command, which has worked well until I am now faced with the need to somehow select the subset based on another .sas7bat. Here is the error it throws when I try to sort (there is 300gb of HDD space free).

 

 

ERROR: No disk space is available for the write operation. Filename =
/tmp/SAS_util000100000A6F_localhost.localdomain/ut0A6F000005.utl.
ERROR: Failure while attempting to write page 3728 of sorted run 6.
ERROR: Failure while attempting to write page 44778 to utility file 1.
ERROR: Failure encountered while creating initial set of sorted runs.
ERROR: Failure encountered during external sort.
ERROR: Sort execution failure.
 

 

PeterClemmensen
Tourmaline | Level 20

Ok. What about your small dataset?

 

You can write your subsetting code like this without sorting

 

data merged;
   if _N_ = 1 then do;
      declare hash h(dataset:'smallDataset');
      h.defineKey('caseid');
      h.defineDone();
   end;

   set largeDataset;

   if h.check();
run;
thanksforhelp12
Calcite | Level 5

Small dataset is small. Like 20k rows by 19 col. Thanks a lot for the subset code.

 

It executed successfully; however, it returned a large dataset without the new columns from smallDataset added 

thanksforhelp12
Calcite | Level 5

Large dataset has information on 300 variables for 2 million cases.

Small dataset has an additional 20 variables for 20,000 of those 2 million cases.

 

I would like to perform analyses on those 20,000 cases including both the original 300 and additional 20 variables (columns).


Therefore, my goal is to merge smallDataset with the subset of cases from largeDataset whose caseid is also in smallDataset (i.e., to end up with 20,000 cases (rows) of 320 variables).

 

 

The code above did run; however, it just returned the original 2 million rows by 300 columns. 

Reeza
Super User

Yeah, it's likely too large for SAS University Edition which it appears you're using. If you can use a full version you won't run into these issues. You also increased the RAM settings and dual core I assume?

 

You can get a subset by using the following code. 

proc sql;
create table sub as
select *
from bigData
where ID in (select Id from smallTable);
quit;

This is one case where it may make sense to make several subsets and loop through them to get what you need, if possible.


@thanksforhelp12 wrote:

Large dataset has information on 300 variables for 2 million cases.

Small dataset has an additional 20 variables for 20,000 of those 2 million cases.

 

I would like to perform analyses on those 20,000 cases including both the original 300 and additional 20 variables (columns).


Therefore, my goal is to merge smallDataset with the subset of cases from largeDataset whose caseid is also in smallDataset (i.e., to end up with 20,000 cases (rows) of 320 variables).


 

thanksforhelp12
Calcite | Level 5
This is exactly what I was looking to do and worked perfectly.

@PeterClemmensen, thank you so much for your help as well, I really appreciate it, and I am sure your new solution would have worked, too. As a newbie around here, I do not how to select which of you to accept as solution. Does it affect you all in any way? @Reeza, thanks again this is exactly what I was looking to do in the original code.
PeterClemmensen
Tourmaline | Level 20

I'm glad you found your answer.

 

If @Reezas answer worked for you, then her answer is the one to mark as a solution 🙂 

PeterClemmensen
Tourmaline | Level 20

Ah ok. Do like this then and insert your variables from smallDataset instead of Var1, Var2, var3 in the call missing statement 

 

data merged;
   if 0 then set smallDataset;

   if _N_ = 1 then do;
      declare hash h(dataset:'smallDataset');
      h.defineKey('caseid');
      h.definedata(all:'Y');
      h.defineDone();

      /* Insert your actual variables from smallDataset instead of Var1, Var2, Var3... */
      call missing('var1', 'var2', 'var3');
   end;

   set largeDataset;

   if h.find();
run;
ballardw
Super User

@thanksforhelp12 wrote:

It's around 10gb (1 million by 300 - I know some people say that isn't that big but it has been crashing every time I try to do anything with it - got it working by using a custom user folder and selecting a subset of data (with where) before I ever load it in with a "data" command, which has worked well until I am now faced with the need to somehow select the subset based on another .sas7bat. Here is the error it throws when I try to sort (there is 300gb of HDD space free).

 

 

ERROR: No disk space is available for the write operation. Filename =
/tmp/SAS_util000100000A6F_localhost.localdomain/ut0A6F000005.utl.
ERROR: Failure while attempting to write page 3728 of sorted run 6.
ERROR: Failure while attempting to write page 44778 to utility file 1.
ERROR: Failure encountered while creating initial set of sorted runs.
ERROR: Failure encountered during external sort.
ERROR: Sort execution failure.
 

 


If you are working in a networked environment such as Enterprise Guide perhaps your SAS admin has restricted the amount of disk space you are allowed to use.

 

If you are  using Proc Sort then perhaps you want to try using the TAGSORT option. Only the sort keys and an observation number are used. Which means most of the 300 variables are ignored during the mechanics of the sort and then the values are brought in the sorted result of the key variables.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 11 replies
  • 1112 views
  • 3 likes
  • 4 in conversation