Solved: Re: Perform logic in one dataset based on a condition in a separate da...

thanksforhelp12 · Posted 02-12-2019 02:33 PM

Hi,

I have an extremely large dataset that crashes my SAS for being too large if I try to perform any commands on it (e.g., proc sort or merge). I only need to work with a small subset of that data; however, the subset I need to work on is identified by a list of id's that is currently in a separate dataset ("smallDataset").

Is there any way to do something to the effect of:

data x;
   set largeDataset;
   where id in smallDataset and largeDataset;
run;

I could then merge the large and small datasets no problem.

Any other suggestions of how to accomplish this would be great.

Current code:

data largeDataset; set '/folders/myfolders/largeDataset.sas7bdat'; run;
proc sort data=largeDataset; by caseid; run;

**Crashes here^^^

proc sort data=smallDataset;
by caseid;
run;


data merged;
merge smallDataset largeDataset;
by caseid;
run;

Reeza · Posted 02-12-2019 03:39 PM

Yeah, it's likely too large for SAS University Edition which it appears you're using. If you can use a full version you won't run into these issues. You also increased the RAM settings and dual core I assume?

You can get a subset by using the following code.

proc sql;
create table sub as
select *
from bigData
where ID in (select Id from smallTable);
quit;

This is one case where it may make sense to make several subsets and loop through them to get what you need, if possible.

@thanksforhelp12 wrote:

Large dataset has information on 300 variables for 2 million cases.

Small dataset has an additional 20 variables for 20,000 of those 2 million cases.

I would like to perform analyses on those 20,000 cases including both the original 300 and additional 20 variables (columns).

Therefore, my goal is to merge smallDataset with the subset of cases from largeDataset whose caseid is also in smallDataset (i.e., to end up with 20,000 cases (rows) of 320 variables).

View solution in original post

PeterClemmensen · Posted 02-12-2019 02:35 PM

How large is 'extremely large'?

The DATA to DATA Step Macro
Blog: SASnrd

thanksforhelp12 · Posted 02-12-2019 02:40 PM

It's around 10gb (1 million by 300 - I know some people say that isn't that big but it has been crashing every time I try to do anything with it - got it working by using a custom user folder and selecting a subset of data (with where) before I ever load it in with a "data" command, which has worked well until I am now faced with the need to somehow select the subset based on another .sas7bat. Here is the error it throws when I try to sort (there is 300gb of HDD space free).

ERROR: No disk space is available for the write operation. Filename =

/tmp/SAS_util000100000A6F_localhost.localdomain/ut0A6F000005.utl.

ERROR: Failure while attempting to write page 3728 of sorted run 6.

ERROR: Failure while attempting to write page 44778 to utility file 1.

ERROR: Failure encountered while creating initial set of sorted runs.

ERROR: Failure encountered during external sort.

ERROR: Sort execution failure.

PeterClemmensen · Posted 02-12-2019 02:43 PM

Ok. What about your small dataset?

You can write your subsetting code like this without sorting

data merged;
   if _N_ = 1 then do;
      declare hash h(dataset:'smallDataset');
      h.defineKey('caseid');
      h.defineDone();
   end;

   set largeDataset;

   if h.check();
run;

The DATA to DATA Step Macro
Blog: SASnrd

thanksforhelp12 · Posted 02-12-2019 03:11 PM

Small dataset is small. Like 20k rows by 19 col. Thanks a lot for the subset code.

It executed successfully; however, it returned a large dataset without the new columns from smallDataset added

PeterClemmensen · Posted 02-12-2019 03:31 PM

What new columns? Please be more specific 🙂

The DATA to DATA Step Macro
Blog: SASnrd

thanksforhelp12 · Posted 02-12-2019 03:35 PM

Large dataset has information on 300 variables for 2 million cases.

Small dataset has an additional 20 variables for 20,000 of those 2 million cases.

I would like to perform analyses on those 20,000 cases including both the original 300 and additional 20 variables (columns).

Therefore, my goal is to merge smallDataset with the subset of cases from largeDataset whose caseid is also in smallDataset (i.e., to end up with 20,000 cases (rows) of 320 variables).

The code above did run; however, it just returned the original 2 million rows by 300 columns.

Reeza · Posted 02-12-2019 03:39 PM

Yeah, it's likely too large for SAS University Edition which it appears you're using. If you can use a full version you won't run into these issues. You also increased the RAM settings and dual core I assume?

You can get a subset by using the following code.

proc sql;
create table sub as
select *
from bigData
where ID in (select Id from smallTable);
quit;

This is one case where it may make sense to make several subsets and loop through them to get what you need, if possible.

@thanksforhelp12 wrote:

Large dataset has information on 300 variables for 2 million cases.

Small dataset has an additional 20 variables for 20,000 of those 2 million cases.

I would like to perform analyses on those 20,000 cases including both the original 300 and additional 20 variables (columns).

Therefore, my goal is to merge smallDataset with the subset of cases from largeDataset whose caseid is also in smallDataset (i.e., to end up with 20,000 cases (rows) of 320 variables).

thanksforhelp12 · Posted 02-12-2019 03:53 PM

This is exactly what I was looking to do and worked perfectly.

@PeterClemmensen, thank you so much for your help as well, I really appreciate it, and I am sure your new solution would have worked, too. As a newbie around here, I do not how to select which of you to accept as solution. Does it affect you all in any way? @Reeza, thanks again this is exactly what I was looking to do in the original code.

PeterClemmensen · Posted 02-12-2019 03:57 PM

I'm glad you found your answer.

If @Reezas answer worked for you, then her answer is the one to mark as a solution 🙂

The DATA to DATA Step Macro
Blog: SASnrd

PeterClemmensen · Posted 02-12-2019 03:43 PM

Ah ok. Do like this then and insert your variables from smallDataset instead of Var1, Var2, var3 in the call missing statement

data merged;
   if 0 then set smallDataset;

   if _N_ = 1 then do;
      declare hash h(dataset:'smallDataset');
      h.defineKey('caseid');
      h.definedata(all:'Y');
      h.defineDone();

      /* Insert your actual variables from smallDataset instead of Var1, Var2, Var3... */
      call missing('var1', 'var2', 'var3');
   end;

   set largeDataset;

   if h.find();
run;

The DATA to DATA Step Macro
Blog: SASnrd

ballardw · Posted 02-12-2019 03:08 PM

@thanksforhelp12 wrote:

It's around 10gb (1 million by 300 - I know some people say that isn't that big but it has been crashing every time I try to do anything with it - got it working by using a custom user folder and selecting a subset of data (with where) before I ever load it in with a "data" command, which has worked well until I am now faced with the need to somehow select the subset based on another .sas7bat. Here is the error it throws when I try to sort (there is 300gb of HDD space free).

ERROR: No disk space is available for the write operation. Filename =

/tmp/SAS_util000100000A6F_localhost.localdomain/ut0A6F000005.utl.

ERROR: Failure while attempting to write page 3728 of sorted run 6.

ERROR: Failure while attempting to write page 44778 to utility file 1.

ERROR: Failure encountered while creating initial set of sorted runs.

ERROR: Failure encountered during external sort.

ERROR: Sort execution failure.
 

If you are working in a networked environment such as Enterprise Guide perhaps your SAS admin has restricted the amount of disk space you are allowed to use.

If you are using Proc Sort then perhaps you want to try using the TAGSORT option. Only the sort keys and an observation number are used. Which means most of the 300 variables are ignored during the mechanics of the sort and then the values are brought in the sorted result of the key variables.

Perform logic in one dataset based on a condition in a separate dataset

Re: Perform logic in one dataset based on a condition in a separate dataset

Re: Perform logic in one dataset based on a condition in a separate dataset

Re: Perform logic in one dataset based on a condition in a separate dataset

Re: Perform logic in one dataset based on a condition in a separate dataset

Re: Perform logic in one dataset based on a condition in a separate dataset

Re: Perform logic in one dataset based on a condition in a separate dataset

Re: Perform logic in one dataset based on a condition in a separate dataset

Re: Perform logic in one dataset based on a condition in a separate dataset

Re: Perform logic in one dataset based on a condition in a separate dataset

Re: Perform logic in one dataset based on a condition in a separate dataset

Re: Perform logic in one dataset based on a condition in a separate dataset

Re: Perform logic in one dataset based on a condition in a separate dataset

Ready to join fellow brilliant minds for the SAS Hackathon?