SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

Extracting subset of observations without merging

Accepted Solution Solved
Reply
Frequent Learner
Posts: 1
Accepted Solution

Extracting subset of observations without merging

Hi, so I have two SAS datasets, one of which is approximately 780GB and the other with approximately 5700 observations. In the smaller dataset, the 5700 observations have a unique identifier also present in the 780GB dataset.  I was wondering if there was a way to extract all the observations in the large dataset that match the identifiers from the smaller dataset (there are multiple observations for each identifier in this dataset).  I have tried merging the two using an inner join; however, it took SAS over 72 hours to complete this.  Is there a way to reference the variable in the smaller dataset in a where statement or any other way to trim down the length of time it takes to perform this function? Thank you in advance.


Accepted Solutions
Solution
‎02-15-2016 05:00 PM
Super User
Posts: 3,255

Re: Extracting subset of observations without merging

[ Edited ]

Sounds like the ideal candidate for a hash table solution:

 

data Want;
  attrib ID length = $8; * Define key variable.;
  drop rc;
* Load the small dataset as a hash table.;
  if _n_ = 1 then do;
    declare hash h(dataset: "WORK.Small");
    h.defineKey("ID");
    h.defineDone();
    call missing(ID);
  end;
* Read in large table;
	set Large;  
  rc = h.find(); * Search in data for match.; 
  if rc = 0; * Keep only rows where ID matches.;  
run;

 

View solution in original post


All Replies
Solution
‎02-15-2016 05:00 PM
Super User
Posts: 3,255

Re: Extracting subset of observations without merging

[ Edited ]

Sounds like the ideal candidate for a hash table solution:

 

data Want;
  attrib ID length = $8; * Define key variable.;
  drop rc;
* Load the small dataset as a hash table.;
  if _n_ = 1 then do;
    declare hash h(dataset: "WORK.Small");
    h.defineKey("ID");
    h.defineDone();
    call missing(ID);
  end;
* Read in large table;
	set Large;  
  rc = h.find(); * Search in data for match.; 
  if rc = 0; * Keep only rows where ID matches.;  
run;

 

Super User
Posts: 5,430

Re: Extracting subset of observations without merging

If this type of query will occur frequently it would be wise to index on the id column, and use SQL join. This will prevent a full table scan and sorting.
Data never sleeps
Super User
Posts: 7,809

Re: Extracting subset of observations without merging

With 5700 obs in the smaller dataset it should be possible to create a format that yields "yes" for any of the 5700 identifiers, and "no" for all other values.

Then you only need to do a single sequential pass through the large dataset and look if put(identifier,format.) = "yes".

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super User
Super User
Posts: 7,970

Re: Extracting subset of observations without merging

Would this not also yeild one pass through the data:

data _null_;
  set small_dataset end=last;
  if _n_=1 then call execute('data want;  set big_dataset; select(id);');
  call execute(' when ('||strip(id)||') output;');
  if last then call execute(' otherwise; end; run;');
run;

I.e. a big datastep is generated from the small one, that code is then run on the dataset once.  

Super User
Posts: 7,809

Re: Extracting subset of observations without merging


RW9 wrote:

Would this not also yeild one pass through the data:

data _null_;
  set small_dataset end=last;
  if _n_=1 then call execute('data want;  set big_dataset; select(id);');
  call execute(' when ('||strip(id)||') output;');
  if last then call execute(' otherwise; end; run;');
run;

I.e. a big datastep is generated from the small one, that code is then run on the dataset once.  


Nice. Still another method to avoid the merge.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super User
Posts: 3,255

Re: Extracting subset of observations without merging

Posted in reply to KurtBremser

Love the way there are so many methods for processing big data in SAS. It would be interesting to see some of these compared.

Super User
Posts: 7,809

Re: Extracting subset of observations without merging

I guess that anything that allows you to only perform one sequential read through a big dataset will only be bottlenecked by the I/O throughput, therefore being equal in terms of performance. What then guides the decision will be personal programming preferences and which method is most easily unterstood by the next one to maintain the code.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super User
Super User
Posts: 7,970

Re: Extracting subset of observations without merging

Posted in reply to KurtBremser

Just to note, in my opinion from:

"What then guides the decision will be personal programming preferences and which method is most easily unterstood by the next one to maintain the code."

 

The personal programming preference is very low on the list.  In terms of importance:

1) Documentation - Functional Design Spec, Testing document, User Guides etc.

2) Good programming practice

3) Company standards

4) Personal preference

 

A programmers afterlife consists of meeting all those programmers who have had to use your code after you, be prepared Smiley Surprised)

Contributor
Posts: 39

Re: Extracting subset of observations without merging

@RW9 said:

A programmers afterlife consists of meeting all those programmers who have had to use your code after you, be prepared Smiley Surprised)

 

 

Ahahaha, I am laughing so hard I am crying. Or maybe I am just crying... The struggle is real.

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 9 replies
  • 838 views
  • 7 likes
  • 6 in conversation