Fast count distinct per group using hash tables

Reply
Occasional Contributor
Posts: 5

Fast count distinct per group using hash tables

Dear all,

I am trying to get a count of unique values per group using hash tables.

My data looks like this:

data Transactions ;

  informat cust prdt $1. ;

  input cust prdt ;

  cards;

1 A

3 A

1 A

2 A

3 A

4 B

2 B

2 B

run ;

I need to get a dataset with the following results:

prdt       |    cust

----------------------

A          |      3

B          |      2

Solutions involving a sort or a proc sql with a COUNT(DISTINCT...) are excluded since my dataset has 10 billions records.

I am struggling to get the correct syntax using hash objects, could you please point me to the right direction ?

Thanks in advance for your help

Thomas

Super User
Posts: 19,791

Re: Fast count distinct per group using hash tables

Posted in reply to Thomas_Cabrol

Hash requires pulling the dataset into memory, do you have the RAM space for the 10 billion records?

Respected Advisor
Posts: 3,799

Re: Fast count distinct per group using hash tables

A hash count would only require holding the distinct levels of the grouping(KEYS) plus the accumulation statistics.

Occasional Contributor
Posts: 5

Re: Fast count distinct per group using hash tables

Hi Reeza,

From my understanding I will not need need to pull the entire dataset into memory. I'll just need to read the dataset and store in hash objects the combination of unique pairs prdt / cust, which is way smaller. I already successfully used this approach for cumulative sum using suminc, and I did not find the trick for unique counts.

Thanks

Respected Advisor
Posts: 3,799

Re: Fast count distinct per group using hash tables

Posted in reply to Thomas_Cabrol

Does

variable_name=object.NUM_ITEMS;

give you what you want. 

Occasional Contributor
Posts: 5

Re: Fast count distinct per group using hash tables

Posted in reply to data_null__

I didn't manage to use it since my hash object is keyed by 2 fields. NUM_ITEMS will give me the total length of the object, but what I need the length of the second key by value in the first key. I was looking at the hash-of-hash option but my SAS skills are a bit rusted...

Respected Advisor
Posts: 3,799

Re: Fast count distinct per group using hash tables

Posted in reply to Thomas_Cabrol
Occasional Contributor
Posts: 5

Re: Fast count distinct per group using hash tables

Posted in reply to data_null__

Thanks, it does. I will try to understand "Example 16: Using Multi-Level Hash of Hashes to Create a Single-Pass Equivalent of Count Distinct Output Fitted to the PROC SUMMARY non-NWAY Template" which does almost what I need, except this is a bit simpler in my case.

New Contributor
Posts: 3

Re: Fast count distinct per group using hash tables

Posted in reply to Thomas_Cabrol

Hi Thomas

Did you manage to get this count unique working on hashtables?

I have the exact same scenario and would love to know if/how you got the count working over more than one grouping

cheers

Occasional Contributor
Posts: 5

Re: Fast count distinct per group using hash tables

Also, this is what I do to create sums (my real transaction table has a "revenue" field, and I need to do a lookup with a product table to get the "prdt" field):

data revenue ;

  if 0 then set transactions ;

  format prdt $20. ;

  if _n_ = 1 then do ; 

    declare hash p (dataset: 'product_lookup') ;

    p.defineKey('ean') ;

    p.defineData(all:'Y') ;

    p.defineDone() ;

    declare hash h (suminc: 'revenue', ordered: "a") ;

    h.defineKey('prdt') ;

    h.defineDone() ;

  end ;

  do until (eof) ;

    set work.transactions end = eof ;

    call missing(prdt) ;

    rc = p.find() ;    * Append the product lookup table ;

    h.ref() ;            * Increment the accumulator ;

  end ;

  declare hiter hi("h");

  rc = hi.first();

  do while (rc = 0);

    h.sum(sum: rev);

    output revenue;

    rc = hi_h.next();

  end;

run ;

Super User
Posts: 10,028

Re: Fast count distinct per group using hash tables

Posted in reply to Thomas_Cabrol

How about this:

data Transactions ;
  informat cust prdt $1. ;
  input cust prdt ;
  cards;
1 A
3 A
1 A
2 A
3 A
4 B
2 B
2 B
;
run;
data want;
if 0 then set Transactions;
  declare hash ha(dataset:'Transactions',ordered:'y');
  declare hiter hi('ha');
   ha.definekey('prdt','cust');
   ha.definedata('prdt','cust');
   ha.definedone();
do while(hi.next()=0);
 n+1;
 lag_prdt=lag(prdt); 
 if (prdt ne lag_prdt) and not missing(lag_prdt) then do;prdt=lag_prdt;n=n-1;output;n=1;end;
end;
output;
keep prdt n;
run;


Ksharp

Super User
Posts: 5,504

Re: Fast count distinct per group using hash tables

Posted in reply to Thomas_Cabrol

Are your data in order by CUST, as in the example?  If so, there are DATA step variations that don't require hashing.  It's possible that they're faster as well, but that would have to be tested.

Super User
Posts: 10,028

Re: Fast count distinct per group using hash tables

Posted in reply to Astounding

Astounding is right .If your data is pre-sorted then the following will be better. Or you can use PROC FREQ to get it.

data Transactions ;
  informat cust prdt $1. ;
  input cust prdt ;
  cards;
1 A
1 A
2 A
4 B
;
run;

data want;
if _n_ eq 1 then do;
if 0 then set Transactions;
  declare hash ha();
   ha.definekey('prdt','cust');
   ha.definedone();
end;
do until(last.prdt);
 set  Transactions ;
 by prdt;
 ha.replace();
end;
n=ha.num_items;
ha.clear();
keep prdt n;
run;









ods listing close;
ods output nlevels=want;
proc freq data=Transactions nlevels;
by prdt;
table cust;
run;
ods listing;

Ksharp

Ask a Question
Discussion stats
  • 12 replies
  • 1701 views
  • 0 likes
  • 6 in conversation