01-17-2013 12:36 PM
I am trying to get a count of unique values per group using hash tables.
My data looks like this:
data Transactions ;
informat cust prdt $1. ;
input cust prdt ;
I need to get a dataset with the following results:
prdt | cust
A | 3
B | 2
Solutions involving a sort or a proc sql with a COUNT(DISTINCT...) are excluded since my dataset has 10 billions records.
I am struggling to get the correct syntax using hash objects, could you please point me to the right direction ?
Thanks in advance for your help
01-17-2013 12:52 PM
From my understanding I will not need need to pull the entire dataset into memory. I'll just need to read the dataset and store in hash objects the combination of unique pairs prdt / cust, which is way smaller. I already successfully used this approach for cumulative sum using suminc, and I did not find the trick for unique counts.
01-17-2013 01:05 PM
I didn't manage to use it since my hash object is keyed by 2 fields. NUM_ITEMS will give me the total length of the object, but what I need the length of the second key by value in the first key. I was looking at the hash-of-hash option but my SAS skills are a bit rusted...
01-17-2013 01:56 PM
I searched "hash of hash" at Lex's web site and found these don't know if it will help.
01-17-2013 02:02 PM
Thanks, it does. I will try to understand "Example 16: Using Multi-Level Hash of Hashes to Create a Single-Pass Equivalent of Count Distinct Output Fitted to the PROC SUMMARY non-NWAY Template" which does almost what I need, except this is a bit simpler in my case.
04-22-2013 08:29 AM
Did you manage to get this count unique working on hashtables?
I have the exact same scenario and would love to know if/how you got the count working over more than one grouping
01-17-2013 01:01 PM
Also, this is what I do to create sums (my real transaction table has a "revenue" field, and I need to do a lookup with a product table to get the "prdt" field):
data revenue ;
if 0 then set transactions ;
format prdt $20. ;
if _n_ = 1 then do ;
declare hash p (dataset: 'product_lookup') ;
declare hash h (suminc: 'revenue', ordered: "a") ;
do until (eof) ;
set work.transactions end = eof ;
call missing(prdt) ;
rc = p.find() ; * Append the product lookup table ;
h.ref() ; * Increment the accumulator ;
declare hiter hi("h");
rc = hi.first();
do while (rc = 0);
rc = hi_h.next();
04-22-2013 10:11 AM
How about this:
data Transactions ; informat cust prdt $1. ; input cust prdt ; cards; 1 A 3 A 1 A 2 A 3 A 4 B 2 B 2 B ; run; data want; if 0 then set Transactions; declare hash ha(dataset:'Transactions',ordered:'y'); declare hiter hi('ha'); ha.definekey('prdt','cust'); ha.definedata('prdt','cust'); ha.definedone(); do while(hi.next()=0); n+1; lag_prdt=lag(prdt); if (prdt ne lag_prdt) and not missing(lag_prdt) then do;prdt=lag_prdt;n=n-1;output;n=1;end; end; output; keep prdt n; run;
04-22-2013 12:15 PM
Are your data in order by CUST, as in the example? If so, there are DATA step variations that don't require hashing. It's possible that they're faster as well, but that would have to be tested.
04-22-2013 11:56 PM
Astounding is right .If your data is pre-sorted then the following will be better. Or you can use PROC FREQ to get it.
data Transactions ; informat cust prdt $1. ; input cust prdt ; cards; 1 A 1 A 2 A 4 B ; run; data want; if _n_ eq 1 then do; if 0 then set Transactions; declare hash ha(); ha.definekey('prdt','cust'); ha.definedone(); end; do until(last.prdt); set Transactions ; by prdt; ha.replace(); end; n=ha.num_items; ha.clear(); keep prdt n; run; ods listing close; ods output nlevels=want; proc freq data=Transactions nlevels; by prdt; table cust; run; ods listing;