🔒 This topic is solved and locked.
Timmay
Calcite | Level 5

Hi, I have two SAS datasets: one is approximately 780GB, and the other has approximately 5700 observations. The 5700 observations in the smaller dataset carry a unique identifier that is also present in the 780GB dataset. Is there a way to extract all observations in the large dataset that match the identifiers from the smaller dataset (the large dataset has multiple observations per identifier)? I have tried merging the two with an inner join; however, it took SAS over 72 hours to complete. Is there a way to reference the variable from the smaller dataset in a WHERE statement, or any other way to cut the run time? Thank you in advance.


9 Replies
SASKiwi
PROC Star (Accepted Solution)

Sounds like the ideal candidate for a hash table solution:

 

data Want;
  attrib ID length = $8; * Define key variable.;
  drop rc;
* Load the small dataset as a hash table.;
  if _n_ = 1 then do;
    declare hash h(dataset: "WORK.Small");
    h.defineKey("ID");
    h.defineDone();
    call missing(ID);
  end;
* Read in large table.;
  set Large;
  rc = h.find(); * Search in data for match.;
  if rc = 0; * Keep only rows where ID matches.;
run;

 

LinusH
Tourmaline | Level 20
If this type of query will occur frequently, it would be wise to create an index on the id column and use an SQL join. This prevents a full table scan and sorting.
Data never sleeps
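
A minimal sketch of the index-plus-join approach; the dataset and column names (work.Large, work.Small, id) are assumed from the thread:

proc sql;
  /* One-off cost: build an index on the join key of the large table. */
  create index id on work.large(id);

  /* With the index in place, the optimizer can do keyed lookups
     instead of scanning and sorting the full 780GB table. */
  create table want as
  select l.*
  from work.large as l
       inner join work.small as s
       on l.id = s.id;
quit;

Note that building the index still requires a pass over the large table plus extra disk space, so this pays off mainly when the query is repeated.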
Kurt_Bremser
Super User

With 5700 obs in the smaller dataset it should be possible to create a format that yields "yes" for any of the 5700 identifiers, and "no" for all other values.

Then you only need to do a single sequential pass through the large dataset and look if put(identifier,format.) = "yes".
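
A sketch of that format-based lookup, assuming the identifier is character and the datasets are named Small and Large (all names assumed). PROC FORMAT's CNTLIN= option builds the format from a control dataset with FMTNAME/START/LABEL variables; HLO='O' supplies the catch-all "no" for all other values:

/* Build the format control dataset from the small table. */
data idcntl;
  set work.small(keep=id) end=last;
  retain fmtname 'idfmt' type 'C';   /* character format $IDFMT */
  start = id;
  label = 'yes';
  output;
  if last then do;
    hlo   = 'O';                     /* 'O' = other: every value not listed */
    label = 'no';
    output;
  end;
run;

proc format cntlin=idcntl;
run;

/* Single sequential pass through the large dataset. */
data want;
  set work.large;
  if put(id, $idfmt.) = 'yes';
run;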

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Would this not also yield one pass through the data:

data _null_;
  set small_dataset end=last;
  if _n_=1 then call execute('data want;  set big_dataset; select(id);');
  call execute(' when ('||strip(id)||') output;');
  if last then call execute(' otherwise; end; run;');
run;

I.e. a big DATA step is generated from the small dataset; that code is then run over the large dataset once.
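
For illustration, with hypothetical numeric identifiers 1001 and 1002 in the small dataset, the generated step would look something like this (character identifiers would need quoting in the WHEN values):

data want;
  set big_dataset;
  select(id);
    when (1001) output;
    when (1002) output;
  otherwise;
  end;
run;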

Kurt_Bremser
Super User

@RW9 wrote:

Would this not also yield one pass through the data:

data _null_;
  set small_dataset end=last;
  if _n_=1 then call execute('data want;  set big_dataset; select(id);');
  call execute(' when ('||strip(id)||') output;');
  if last then call execute(' otherwise; end; run;');
run;

I.e. a big DATA step is generated from the small dataset; that code is then run over the large dataset once.


Nice. Still another method to avoid the merge.

SASKiwi
PROC Star

Love the way there are so many methods for processing big data in SAS. It would be interesting to see some of these compared.

Kurt_Bremser
Super User

I guess that anything that allows you to perform only one sequential read through a big dataset will be bottlenecked only by I/O throughput, and therefore roughly equal in performance. What then guides the decision is personal programming preference and which method is most easily understood by the next person to maintain the code.

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Just to note, my opinion on:

"What then guides the decision is personal programming preference and which method is most easily understood by the next person to maintain the code."

 

Personal programming preference is very low on the list. In order of importance:

1) Documentation - Functional Design Spec, Testing document, User Guides etc.

2) Good programming practice

3) Company standards

4) Personal preference

 

A programmer's afterlife consists of meeting all the programmers who had to use your code after you. Be prepared 🐵

Michelle
Obsidian | Level 7

@RW9 said:

A programmer's afterlife consists of meeting all the programmers who had to use your code after you. Be prepared 🐵

 

 

Ahahaha, I am laughing so hard I am crying. Or maybe I am just crying... The struggle is real.

