Re: Fastest way to delete rows from a dataset by key - Page 2

Ksharp · Posted 06-23-2016 01:46 AM

Here is two way , One is for SQL ,Another is for Hash Table.

This is also the first time to write DS2 with Hash Table

data have;
 set sashelp.class;
run;

data key;
 age=11;output;
 age=14;output;
run;

proc ds2;
data want(overwrite=yes);
 method run();
  set {select * from have where age not in (select age from key)};
 end;
enddata;
run;
quit;

data have;
 set sashelp.class;
run;

data key;
 age=11;output;
 age=14;output;
run;


proc ds2;
data want(overwrite=yes);

 declare package hash h(20,'key');

 method init();
  h.defineKey('age');
  h.defineDone();
 end;

 method run();
  set have;
  if h.check() ne 0 then output;
 end;

enddata;
run;
quit;

ChrisNZ · Posted 06-23-2016 07:11 PM

@Ksharp Unless the 1) data is partitioned in different I/O sub-systems and 2) DS2 uses threads, DS2 will not be faster, will it?

High-Performance SAS Coding - Third Edition

Ksharp · Posted 06-23-2016 09:01 PM

Chris,

Since you think "data is partitioned in different I/O sub-systems" , Why do you think "DS2 will not be faster," ? . I think it is vice verse.

ChrisNZ · Posted 06-24-2016 02:41 AM

What I am saying is that ds2 can only be beneficial if these 2 conditions are met.

There is no threading in your code and no partitioning mentioned by the op.

High-Performance SAS Coding - Third Edition

Ksharp · Posted 06-24-2016 03:11 AM

No, I don't think so . SAS know how to optimize performance ,how to split the data,how to use thread , once you are using DS2.

I maked a test under UE, interesting thing is Both version of Hash has almost the same time.

data have;
 set sashelp.class;
 do i=1 to 1000000;
  output;
 end;
 drop i;
run;

data key;
 age=11;output;
 age=14;output;
run;


proc ds2;
data want(overwrite=yes);

 declare package hash h(20,'key');

 method init();
  h.defineKey('age');
  h.defineDone();
 end;

 method run();
  set have;
  if h.check() ne 0 then output;
 end;

enddata;
run;
quit;

32.2 seconds

data want;
 if _n_=1 then do;
  declare hash h(dataset:'key',hashexp:20);
  h.defineKey('age');
  h.defineDone();
  end;
 set have;
 if h.check() ne 0 ;
run;

34.2 seconds

Seems Hash of DS2 has a little edge .

Ksharp · Posted 06-24-2016 03:31 AM

Chris,

I made a couple of test again. And find out Hash of DS2 is a little solwer than Hash of Data Step. What weird thing it is.

Somebody can explain it ? Or As your said "ds2 can only be beneficial if these 2 conditions are met." ?

If your words is right, I would disappoint about DS2 , why DS2 could not be faster than Data Step ?

Ksharp · Posted 06-24-2016 03:56 AM

Chris,

I think you are right . No threads No benefit .

ChrisNZ · Posted 06-24-2016 04:41 AM

Ds2 is interesting if

1- you can split the process (across multiple data and processing sources)

Or

2- you need the precision offered by typed variables

Or

3- you need to use the packaging capabilities

High-Performance SAS Coding - Third Edition

Ksharp · Posted 06-24-2016 04:59 AM

Honestly.

I am disappoint about DS2.

1)Not friendly to code,

2)Some function are limited , can't compare with Data Step.

3)Hard to use .

ChrisNZ · Posted 06-24-2016 05:21 AM

They both have their uses.

The data step is much easier and familiar to program.

DS2 can do a lot more but only in those specific cases.

Don't throw the baby out with the bath water.

Ds2 can't do all but it has its place.

High-Performance SAS Coding - Third Edition

ChrisNZ · Posted 07-07-2016 10:47 PM

@Ksharp It's actually more limited than I thought.

proc ds2 doesn't support concatenated libraries: http://support.sas.com/kb/51/043.html

So any attempt to multithread on SAS tables split across several paths (using SPDE) will fail.

It looks like proc ds2 was not intended for SAS tables and brings limited benefits when used on SAS/Base storage.

SAS/Base is not the intended target at all. Sad....

High-Performance SAS Coding - Third Edition

Ksharp · Posted 07-08-2016 05:01 AM

Me sad too. Anyway I am going to take a look of PROC DS2 in new future. But NOW I am focus on Statistical Analysis .

Ksharp · Posted 06-22-2016 04:58 AM

Use RETURN instead of DELETE of data step.

data have;
 set sashelp.class;
run;


proc ds2;
data Non_male(overwrite=yes);
 method run();
  set have;
  if sex eq 'M' then return;
 end;
enddata;
run;
quit;

KachiM · Posted 06-22-2016 08:34 AM

You have made some queries in your comment on using Hash objects.
[1] Loading Both data sets int hash tables:
Huge dataset with KEY and non-key variables is likely to run out
of memory. This amounts to program aborting in the middle.

The first part of the program,

data want;
if _n_ = 1 then do;
if 0 then set SMALL;
declare hash h(dataset:'K1', 'K2');
h.definekey('K1', 'K2');
h.definedone();
end;

As SMALL is very small compared to HUGE, I loaded the
KEY(K1 and K2) into Hash table.
Internally, each hash entry in the hash table consumes the
number bytes equivalent to twice the length of K1 and K2
as default when data part of the hash table is left out.
We can save some memory by introducing the data part as:

h.definedata('K1');

before the statement

h.definedone();

Then the memory consumed will be length of K1, K2, K1 and
we get a nominal reduction.

[2] Sophisticated srategies to increase performance:
It is a good question for using the code for Production Jobs.
SAS has more than one way to solve any problem. I recommend
to try out all possible SAS solutions noting both time and memory.
By time I mean both REAL and CPU time. Then choose the best method.

Use of HASEXP:
With respect to solution using Hash Objects, we can use HASHEXP
option. By default, it is 8. One can use from 0 to 20. Again the
correct value for a given data set has to be tried out.
This means that blindly using any value other tha 8 has to be
empirically verified.

Keep K1, K2 of HUGE:
The I/O time to read the entire record of HUGE can be reduced but
further processing is warranted as noted below.

The statement

set HUGE(keep = K1 K2);

is expected to bring K1 and K2 only to the Program Data Vector(PDV).
The Record ID(_N_) of HUGE can be saved for the matched
record. In the next data step, use POINT= option to mark them
as deleted.

Alternatively using,
set HUGE;
flag = 0;
if h.find() =0 then flag = 1;
run;
will identify records to delete(flag=1).

[3] System does not have enough Ram:
This takes to the question of what goes into the hash table.
HUGE is not affecting the RAM as one record of it is brought
to PDV at a time.
Take the case of SMALL data set. If both K1 and K2 are numbers
then each hash table entry will take for the first part of the
code, 32 bytes plus SAS taking extra 16 bytes. We can reduce 8 bytes
only by introducing the hash data part as observed above.
If they are both character-types, then length of K1 rounded up
to 8 bytes and similarly K2. Suppose K1 is 10 bytes and K2 is
3 bytes, the memory for K1 is 16 and K2 is 8 bytes. This amounts
to 24 + 24 bytes and the extra 16 bytes SAS requires.

When everything fails, one possibility is to

split the SMALL data set into number
of parts such that a part can be saved in the hash table.
The HUGE data set has to be processed as many times as the
number of parts.

[4] Revised program that flags the records with 1 for deletion
and 0 for inclusion.

data want;
   if _n_ = 1 then do;
      declare hash h(HASHEXP:8);
      h.definekey('K1', 'K2');
      h.definedata('flag');
      h.definedone();
      do until(last);
         set SMALL end = last;
         h.add();
      end;
   end;
 
   set HUGE;
   flag = 0;
   if h.find() = 0 then flag = 1;
run;

The WANT data set is HUGE with additional variable FLAG.

For further processing, WHERE = option of SET statement can be used in WANT as:

data NEED;

set WANT(where=(flag=0));

... other processing ...

run;

ChrisNZ · Posted 06-22-2016 04:49 AM

Use check() instead of find(). But I doubt it will make a huge difference.
Sadly or its not rare that sas is much faster copying a whole table than updating it in place. I guess it depends on the proportion of records to update.
Well done!

High-Performance SAS Coding - Third Edition

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!