BookmarkSubscribeRSS Feed
Ksharp
Super User

Here is two way , One is for SQL ,Another is for Hash Table. 

This is also the first time to write DS2 with Hash Table 

 

data have;
 set sashelp.class;
run;

data key;
 age=11;output;
 age=14;output;
run;

proc ds2;
data want(overwrite=yes);
 method run();
  set {select * from have where age not in (select age from key)};
 end;
enddata;
run;
quit;

 

 

 

data have;
 set sashelp.class;
run;

data key;
 age=11;output;
 age=14;output;
run;


proc ds2;
data want(overwrite=yes);

 declare package hash h(20,'key');

 method init();
  h.defineKey('age');
  h.defineDone();
 end;

 method run();
  set have;
  if h.check() ne 0 then output;
 end;

enddata;
run;
quit;

 

ChrisNZ
Tourmaline | Level 20

 @Ksharp Unless the 1) data is partitioned in different I/O sub-systems and 2) DS2 uses threads, DS2 will not be faster, will it?

Ksharp
Super User

Chris,

Since you think "data is partitioned in different I/O sub-systems" , Why do you think "DS2 will not be faster," ? . I think it is vice verse.

ChrisNZ
Tourmaline | Level 20

What I am saying is that ds2 can only be beneficial if these 2 conditions are met.

There is no threading in your code and no partitioning mentioned by the op.

Ksharp
Super User

No, I don't think so .  SAS know how to optimize performance ,how to split the data,how to use thread , once you are using DS2.

I maked a test under UE, interesting thing is Both version of Hash has almost the same time.

 

 

data have;
 set sashelp.class;
 do i=1 to 1000000;
  output;
 end;
 drop i;
run;

data key;
 age=11;output;
 age=14;output;
run;


proc ds2;
data want(overwrite=yes);

 declare package hash h(20,'key');

 method init();
  h.defineKey('age');
  h.defineDone();
 end;

 method run();
  set have;
  if h.check() ne 0 then output;
 end;

enddata;
run;
quit;

32.2 seconds

 

 

 

data want;
 if _n_=1 then do;
  declare hash h(dataset:'key',hashexp:20);
  h.defineKey('age');
  h.defineDone();
  end;
 set have;
 if h.check() ne 0 ;
run;

34.2 seconds

 

 

 

Seems Hash of DS2 has a little edge .

 

 

 

Ksharp
Super User

Chris,

I made a couple of test again. And find out Hash of DS2 is a little solwer than Hash of Data Step. What weird thing it is.

Somebody can explain it ? Or As your said  "ds2 can only be beneficial if these 2 conditions are met." ?

If your words is right, I would disappoint about DS2 , why DS2 could not be faster than Data Step ?

Ksharp
Super User

Chris,

I think you are right . No threads No benefit .

ChrisNZ
Tourmaline | Level 20

Ds2 is interesting if

1- you can split the process (across multiple data and processing sources)

Or

2- you need the precision offered by typed variables

Or

3- you need to use the packaging capabilities

 

 

Ksharp
Super User

Honestly.

I am disappoint about DS2.

1)Not friendly to code,

2)Some function are limited , can't compare with Data Step.

3)Hard to use .

ChrisNZ
Tourmaline | Level 20

They both have their uses.

The data step is much easier and familiar to program.

DS2 can do a lot more but only in those specific cases.

Don't throw the baby out with the bath water.

Ds2 can't do all but it has its place.

ChrisNZ
Tourmaline | Level 20

@Ksharp It's actually more limited than I thought.

proc ds2 doesn't support concatenated libraries: http://support.sas.com/kb/51/043.html

So any attempt to multithread on SAS tables split across several paths (using SPDE) will fail.

It looks like proc ds2 was not intended for SAS tables and brings limited benefits when used on SAS/Base storage.

SAS/Base is not the intended target at all. Sad....

 

Ksharp
Super User
Me sad too. Anyway I am going to take a look of PROC DS2 in new future. But NOW I am focus on Statistical Analysis .
Ksharp
Super User

Use RETURN instead of DELETE of data step.

 

data have;
 set sashelp.class;
run;


proc ds2;
data Non_male(overwrite=yes);
 method run();
  set have;
  if sex eq 'M' then return;
 end;
enddata;
run;
quit;
KachiM
Rhodochrosite | Level 12

 

 

You have made some queries in your comment on using Hash objects.
[1] Loading Both data sets int hash tables:
Huge dataset with KEY and non-key variables is likely to run out
of memory. This amounts to program aborting in the middle.

The first part of the program,

data want;
if _n_ = 1 then do;
if 0 then set SMALL;
declare hash h(dataset:'K1', 'K2');
h.definekey('K1', 'K2');
h.definedone();
end;

As SMALL is very small compared to HUGE, I loaded the
KEY(K1 and K2) into Hash table.
Internally, each hash entry in the hash table consumes the
number bytes equivalent to twice the length of K1 and K2
as default when data part of the hash table is left out.
We can save some memory by introducing the data part as:

h.definedata('K1');

before the statement

h.definedone();

Then the memory consumed will be length of K1, K2, K1 and
we get a nominal reduction.

 

[2] Sophisticated srategies to increase performance:
It is a good question for using the code for Production Jobs.
SAS has more than one way to solve any problem. I recommend
to try out all possible SAS solutions noting both time and memory.
By time I mean both REAL and CPU time. Then choose the best method.

Use of HASEXP:
With respect to solution using Hash Objects, we can use HASHEXP
option. By default, it is 8. One can use from 0 to 20. Again the
correct value for a given data set has to be tried out.
This means that blindly using any value other tha 8 has to be
empirically verified.

 

Keep K1, K2 of HUGE:
The I/O time to read the entire record of HUGE can be reduced but
further processing is warranted as noted below.


The statement


set HUGE(keep = K1 K2);


is expected to bring K1 and K2 only to the Program Data Vector(PDV).
The Record ID(_N_) of HUGE can be saved for the matched
record. In the next data step, use POINT= option to mark them
as deleted.

 

Alternatively using,
set HUGE;
flag = 0;
if h.find() =0 then flag = 1;
run;
will identify records to delete(flag=1).

[3] System does not have enough Ram:
This takes to the question of what goes into the hash table.
HUGE is not affecting the RAM as one record of it is brought
to PDV at a time.
Take the case of SMALL data set. If both K1 and K2 are numbers
then each hash table entry will take for the first part of the
code, 32 bytes plus SAS taking extra 16 bytes. We can reduce 8 bytes
only by introducing the hash data part as observed above.
If they are both character-types, then length of K1 rounded up
to 8 bytes and similarly K2. Suppose K1 is 10 bytes and K2 is
3 bytes, the memory for K1 is 16 and K2 is 8 bytes. This amounts
to 24 + 24 bytes and the extra 16 bytes SAS requires.

When everything fails, one possibility is to

split the SMALL data set into number
of parts such that a part can be saved in the hash table.
The HUGE data set has to be processed as many times as the
number of parts.

[4] Revised program that flags the records with 1 for deletion
and 0 for inclusion.

 

data want;
   if _n_ = 1 then do;
      declare hash h(HASHEXP:8);
      h.definekey('K1', 'K2');
      h.definedata('flag');
      h.definedone();
      do until(last);
         set SMALL end = last;
         h.add();
      end;
   end;
 
   set HUGE;
   flag = 0;
   if h.find() = 0 then flag = 1;
run;

 

 The WANT data set is HUGE with additional variable FLAG.

 

For further processing, WHERE = option of SET statement can be used in WANT as:

 

data NEED;

   set WANT(where=(flag=0));

  ... other processing  ...

run;

 

ChrisNZ
Tourmaline | Level 20
Use check() instead of find(). But I doubt it will make a huge difference.
Sadly or its not rare that sas is much faster copying a whole table than updating it in place. I guess it depends on the proportion of records to update.
Well done!

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 35 replies
  • 9841 views
  • 12 likes
  • 6 in conversation