Here are two ways: one uses SQL, the other a hash table.
This is also my first time writing DS2 with the hash package.
data have;
   set sashelp.class;
run;

data key;
   age = 11; output;
   age = 14; output;
run;

proc ds2;
data want(overwrite=yes);
   method run();
      set {select * from have where age not in (select age from key)};
   end;
enddata;
run;
quit;
data have;
   set sashelp.class;
run;

data key;
   age = 11; output;
   age = 14; output;
run;

proc ds2;
data want(overwrite=yes);
   declare package hash h(20, 'key');
   method init();
      h.defineKey('age');
      h.defineDone();
   end;
   method run();
      set have;
      if h.check() ne 0 then output;
   end;
enddata;
run;
quit;
@Ksharp Unless 1) the data is partitioned across different I/O sub-systems and 2) DS2 uses threads, DS2 will not be faster, will it?
Chris,
Since you say "data is partitioned in different I/O sub-systems", why do you think "DS2 will not be faster"? I think it is the other way around.
What I am saying is that DS2 can only be beneficial if these two conditions are met.
There is no threading in your code, and no partitioning is mentioned by the OP.
No, I don't think so. SAS knows how to optimize performance, how to split the data, and how to use threads once you are using DS2.
I made a test under SAS University Edition; the interesting thing is that both hash versions take almost the same time.
data have;
   set sashelp.class;
   do i = 1 to 1000000;
      output;
   end;
   drop i;
run;

data key;
   age = 11; output;
   age = 14; output;
run;

proc ds2;
data want(overwrite=yes);
   declare package hash h(20, 'key');
   method init();
      h.defineKey('age');
      h.defineDone();
   end;
   method run();
      set have;
      if h.check() ne 0 then output;
   end;
enddata;
run;
quit;
32.2 seconds
data want;
   if _n_ = 1 then do;
      declare hash h(dataset:'key', hashexp:20);
      h.defineKey('age');
      h.defineDone();
   end;
   set have;
   if h.check() ne 0;
run;
34.2 seconds
It seems the DS2 hash has a slight edge.
Chris,
I ran a couple of tests again and found that the DS2 hash is a little slower than the DATA step hash. How weird.
Can somebody explain it? Or is it, as you said, that "DS2 can only be beneficial if these 2 conditions are met"?
If you are right, I am disappointed with DS2. Why can't DS2 be faster than the DATA step?
Chris,
I think you are right. No threads, no benefit.
DS2 is interesting if
1- you can split the process (across multiple data and processing sources),
or
2- you need the precision offered by typed variables,
or
3- you need the packaging capabilities.
Honestly, I am disappointed with DS2:
1) the code is not friendly to write,
2) some functions are limited and cannot compare with the DATA step,
3) it is hard to use.
They both have their uses.
The DATA step is much easier and more familiar to program.
DS2 can do a lot more, but only in those specific cases.
Don't throw the baby out with the bath water.
DS2 can't do everything, but it has its place.
@Ksharp It's actually more limited than I thought.
PROC DS2 doesn't support concatenated libraries: http://support.sas.com/kb/51/043.html
So any attempt to multithread on SAS tables split across several paths (using SPDE) will fail.
It looks like PROC DS2 was not intended for SAS tables and brings limited benefits when used on Base SAS storage.
Base SAS is not the intended target at all. Sad....
In DS2, use RETURN instead of the DATA step's DELETE statement.
data have;
   set sashelp.class;
run;

proc ds2;
data Non_male(overwrite=yes);
   method run();
      set have;
      if sex eq 'M' then return;
   end;
enddata;
run;
quit;
You raised some questions in your comment about using hash objects.
[1] Loading both data sets into hash tables:
A huge data set with key and non-key variables is likely to run out
of memory, which means the program aborts in the middle.
The first part of the program:
data want;
   if _n_ = 1 then do;
      if 0 then set SMALL;
      declare hash h(dataset:'SMALL');
      h.definekey('K1', 'K2');
      h.definedone();
   end;
As SMALL is very small compared to HUGE, I loaded only the
keys (K1 and K2) into the hash table.
Internally, when the data portion of the hash table is left out,
each hash entry consumes by default a number of bytes equivalent to
twice the combined length of K1 and K2, because the key variables
also serve as the data portion.
We can save some memory by introducing an explicit data portion:
h.definedata('K1');
before the statement
h.definedone();
Then the memory consumed per entry corresponds to K1, K2, and K1,
and we get a nominal reduction.
[2] Sophisticated strategies to increase performance:
This is a good question for production jobs.
SAS has more than one way to solve any problem. I recommend
trying out all possible SAS solutions, noting both time and memory.
By time I mean both REAL and CPU time. Then choose the best method.
Use of HASHEXP:
With respect to the hash-object solution, we can use the HASHEXP
option. By default it is 8; one can use values from 0 to 20. Again, the
correct value for a given data set has to be tried out.
This means that blindly using any value other than 8 has to be
empirically verified.
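To make that empirical check concrete, here is a hedged sketch (the data set and variable names HUGE, SMALL, K1, and K2 follow the discussion in this thread) that times the lookup step for a few HASHEXP values; FULLSTIMER writes real and CPU time to the log for comparison:

```sas
options fullstimer;

%macro try_hashexp(values=0 8 16 20);
   %local i e;
   %let i = 1;
   %do %while(%scan(&values, &i) ne );
      %let e = %scan(&values, &i);
      %put NOTE: timing HASHEXP=&e;
      data _null_;
         if _n_ = 1 then do;
            declare hash h(dataset:'SMALL', hashexp:&e);
            h.definekey('K1', 'K2');
            h.definedone();
         end;
         set HUGE(keep=K1 K2);
         rc = h.check();   /* lookup only; timings go to the log */
      run;
      %let i = %eval(&i + 1);
   %end;
%mend try_hashexp;

%try_hashexp()
```

Compare the REAL and CPU times in the log across the passes before settling on a value.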
Keep only K1, K2 of HUGE:
The I/O time to read each entire record of HUGE can be reduced, but
further processing is warranted as noted below.
The statement
set HUGE(keep = K1 K2);
is expected to bring only K1 and K2 into the Program Data Vector (PDV).
The record ID (_N_) of HUGE can be saved for each matched
record. In a subsequent data step, the POINT= option can then be used
to mark those records as deleted.
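One way that two-step idea might look (a sketch, assuming the names HUGE, SMALL, K1, and K2 from this discussion; MODIFY with POINT= and REMOVE physically marks rows as deleted in place):

```sas
/* Step 1: collect record IDs of HUGE rows whose key appears in SMALL */
data matched(keep=rid);
   if _n_ = 1 then do;
      declare hash h(dataset:'SMALL');
      h.definekey('K1', 'K2');
      h.definedone();
   end;
   set HUGE(keep=K1 K2);
   rid = _n_;
   if h.check() = 0 then output;
run;

/* Step 2: mark those rows as deleted, updating HUGE in place */
data HUGE;
   set matched;
   modify HUGE point=rid;
   remove;
run;
```

Note that MODIFY rewrites HUGE in place, so take a backup before trying this on real data.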
Alternatively,
set HUGE;
flag = 0;
if h.find() = 0 then flag = 1;
run;
will identify the records to delete (flag = 1).
[3] The system does not have enough RAM:
This brings up the question of what goes into the hash table.
HUGE does not affect RAM much, because only one of its records is
brought into the PDV at a time.
Take the case of the SMALL data set. If both K1 and K2 are numeric,
then with the first version of the code each hash entry takes
32 bytes, plus an extra 16 bytes of SAS overhead. We can reduce it
by only 8 bytes by introducing the hash data portion as observed above.
If they are both character variables, the length of K1 is rounded up
to a multiple of 8 bytes, and similarly K2. Suppose K1 is 10 bytes and
K2 is 3 bytes; then the memory for K1 is 16 bytes and for K2 is 8 bytes.
This amounts to 24 + 24 bytes plus the extra 16 bytes SAS requires.
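Rather than working these sizes out by hand, the hash object's ITEM_SIZE attribute reports the actual bytes consumed per entry. A sketch, again assuming SMALL, K1, and K2 from above:

```sas
data _null_;
   if 0 then set SMALL;            /* make K1 and K2 known to the PDV */
   declare hash h(dataset:'SMALL');
   h.definekey('K1', 'K2');
   h.definedone();
   bytes = h.item_size;            /* bytes consumed by one hash entry */
   put 'Bytes per hash entry: ' bytes;
   stop;
run;
```

Multiplying ITEM_SIZE by the number of rows in SMALL gives a rough estimate of the memory the table will need.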
When everything else fails, one possibility is to
split the SMALL data set into a number
of parts such that each part fits in the hash table.
The HUGE data set then has to be processed as many times as there are
parts.
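A hedged sketch of that splitting idea (the macro name and the modulo chunking rule are mine, not from the thread; names HUGE, SMALL, K1, and K2 as above). Each pass loads one slice of SMALL and raises FLAG on the matching rows of the accumulating result:

```sas
%macro flag_in_parts(nparts=4);
   /* start from HUGE with flag initialized to 0 */
   data want;
      set HUGE;
      flag = 0;
   run;
   %do p = 1 %to &nparts;
      data want;
         if _n_ = 1 then do;
            declare hash h();
            h.definekey('K1', 'K2');
            h.definedone();
            n = 0;
            do until(last);
               set SMALL(keep=K1 K2) end=last;
               n + 1;
               if mod(n, &nparts) + 1 = &p then h.add();  /* this slice only */
            end;
         end;
         set want;
         if h.check() = 0 then flag = 1;  /* flags from earlier passes persist */
         drop n;
      run;
   %end;
%mend flag_in_parts;

%flag_in_parts(nparts=4)
```

Each pass rewrites WANT, so the cost is &nparts full passes over HUGE in exchange for a hash table &nparts times smaller.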
[4] Revised program that flags records with 1 for deletion
and 0 for inclusion:
data want;
   if _n_ = 1 then do;
      declare hash h(hashexp:8);
      h.definekey('K1', 'K2');
      h.definedata('flag');
      h.definedone();
      /* load all of SMALL before the first record of HUGE is read */
      do until(last);
         set SMALL end=last;
         h.add();
      end;
   end;
   set HUGE;
   flag = 0;
   if h.find() = 0 then flag = 1;   /* key found in SMALL: mark for deletion */
run;
The WANT data set is HUGE with the additional variable FLAG.
For further processing, the WHERE= option of the SET statement can be used on WANT:
data NEED;
   set WANT(where=(flag=0));
   ... other processing ...
run;