topic Re: Distinct count other than proc sql in SAS Programming

Distinct count other than proc sql

Srigyan — Mon, 25 Mar 2019 11:47:49 GMT

It's more about optimization question. I have a table with 1b record. Now I want to check if the total record is matched with distinct count based on 3 different variables. Though I have a query which runs great on small data, but takes lots of time when I run this on huge data like 1b record.

table1

product	brand	date
1	b1	01/10/2013
1	b3	03/10/2013
1	b3	03/10/2013
1	b4	04/10/2013
2	b1	01/10/2013
2	b2	01/10/2013
2	b3	01/10/2013
2	b4	01/10/2013
1	b1	01/10/2013
4	b1	02/10/2013
5	b3	03/10/2013
5	b3	03/10/2013
6	b4	03/10/2013
6	b4	04/10/2013

The query is;

data have;
infile datalines dlm='09'x;
input product brand $ date :mmddyy10.;
format date mmddyy10.;
datalines;1	b1	01/10/2013
1	b3	03/10/2013
1	b3	03/10/2013
1	b4	04/10/2013
2	b1	01/10/2013
2	b2	01/10/2013
2	b3	01/10/2013
2	b4	01/10/2013
1	b1	01/10/2013
4	b1	02/10/2013
5	b3	03/10/2013
5	b3	03/10/2013
6	b4	03/10/2013
6	b4	04/10/2013;
run;




Proc SQL; Select Count(distinct(catx(product,brand,date))),count(*)  from have; quit;

I am using catx becuase if there any space in any of these 3 column, that will be removed.

Above query will give the output 11 & 14.

Is there any way to optimise this query. so I can run on the huge data set.

Re: Distinct count other than proc sql

RW9 — Mon, 25 Mar 2019 11:47:40 GMT

Do a proc freq and put all three variables as tables:

proc freq data=have;
  tables product * brand * date / out=want;
run;

This will give a you a large table with the various counts - note you will need to refine it (not going to type test data in to do this for you) to only get the results you want out, but it should be way faster than SQL. Note, I assume all three are character variables, you say date, but do not convert it in the catx so you are either implicitly converting it (bad) or its text (again not optimal).

Re: Distinct count other than proc sql

Srigyan — Mon, 25 Mar 2019 11:49:31 GMT

i put the code for example dataset. your query is giving me another huge table. I just need two number as an ouput

Re: Distinct count other than proc sql

RW9 — Mon, 25 Mar 2019 11:52:06 GMT

Refer to: "This will give a you a large table with the various counts - note you will need to refine it (not going to type test data in to do this for you) to only get the results you want out"

Re: Distinct count other than proc sql

Srigyan — Mon, 25 Mar 2019 11:54:18 GMT

I know proc freq, my query is getting these numbers. Whats the point of this answer which is not gonna help... I am not very good in SAS thats why I am putting this question here.Any way if you can't, thanks for your effort.

Re: Distinct count other than proc sql

PaigeMiller — Mon, 25 Mar 2019 12:19:27 GMT

@Srigyan wrote:
Whats the point of this answer which is not gonna help...

What's the point of insisting that you can't use PROC SQL, when it is the perfect tool for this situation and gets you the exact answer that you want?

Re: Distinct count other than proc sql

Srigyan — Mon, 25 Mar 2019 12:23:57 GMT

Because I am running this proc sql on more than a billioin record and running proc sql will take lot of time. So I am looking for an optimisation. Hope you can help.

Re: Distinct count other than proc sql

Ksharp — Mon, 25 Mar 2019 12:37:55 GMT

Firstly make an index on it ,and no need catx() which can make your code very slow .

data have;
infile datalines expandtabs ;
input product brand $ date :mmddyy10.;
format date mmddyy10.;
datalines;
1	b1	01/10/2013
1	b3	03/10/2013
1	b3	03/10/2013
1	b4	04/10/2013
2	b1	01/10/2013
2	b2	01/10/2013
2	b3	01/10/2013
2	b4	01/10/2013
1	b1	01/10/2013
4	b1	02/10/2013
5	b3	03/10/2013
5	b3	03/10/2013
6	b4	03/10/2013
6	b4	04/10/2013
;
run;

options bufsize=128K bufno=100;
Proc SQL; 
create index x on have(product,brand,date);

Select Count(*)
from (select distinct product,brand,date from have); 
select count(*) from have;
quit;

Re: Distinct count other than proc sql

Srigyan — Mon, 25 Mar 2019 12:56:20 GMT

Perfect!!!

Thanks a lot.

Re: Distinct count other than proc sql

mkeintz — Mon, 25 Mar 2019 12:58:02 GMT

Are you merely trying to generate a true/false result on whether the cardinality of the 3-variable-combination is the same as the number of obs in the data set? Or do you need to know exactly how many 3-var-combos there are?

If you just want the true/false, then

you can stop checking your data upon discovery of the first duplicate, which might occur long before you read then entire file.

If the data are sorted, as in your example then your can

data _null_;
  set have end=end_of_have;
  by product brand date;
  if last.date=0 then do;
    put 'found duplicate at ' _N_= (product brand date) (=);
	stop;
  end;
  if end_of_have then put  'No duplicates for dataset of ' _n_ ' observations.';
run;

But if it's NOT sorted the I would recommend a hash object to track the combinations already discovered.

data _null_;
  set have end=end_of_have;
  if _n_=1 then do;
    declare hash h (dataset:'have (obs=0 keep=product brand date)');
	  h.definekey(all:'Y');
	  h.definedone();
  end;
  if h.add()^=0 then do;
    put 'found duplicate at ' _N_= (product brand date) (=);
	stop;
  end;
  if end_of_have then put  'No duplicates for dataset of ' _n_ ' observations.';
run;

But if you actually want frequencies of each combination, then a modification of the two above (depending on whether the data is sorted) will be needed.

Re: Distinct count other than proc sql

Tom — Mon, 25 Mar 2019 13:11:20 GMT

Using CATX() is going to give you the wrong answer. Especially misusing it by using one of the variables as the delimiter instead of constant.

It will map different sets of values to the same resulting string, thereby giving you an under count of the combinations.

Re: Distinct count other than proc sql

novinosrin — Mon, 25 Mar 2019 13:12:50 GMT

Hi @Ksharp As usual your solutions are spot on and great. I am just wondering how the index which indeed is an implicit sort optimizes performance if indeed it is an extra pass. I lack clarity in the trade-off to determine how sometimes index can be costly or efficient not necessarily pertaining to this thread.

Of course I am reading some papers of Michael raithel et al but not really getting a through grasping of performance of B-tree should the unique combination be very large in proportion to the original count.

If it's coming from you, it's ought to be good. I can conform to blind obedience.

Re: Distinct count other than proc sql

Kurt_Bremser — Mon, 25 Mar 2019 13:33:33 GMT

@Srigyan wrote:
I know proc freq, my query is getting these numbers. Whats the point of this answer which is not gonna help... I am not very good in SAS thats why I am putting this question here.Any way if you can't, thanks for your effort.

Then it's ******* time you start to learn how to really use SAS, and that's where we're trying to help you. Rejecting that help won't make you lots of friends.

The time to learn how to properly use the tools available is NOW. See Maxim 13.

Re: Distinct count other than proc sql

Srigyan — Mon, 25 Mar 2019 14:02:52 GMT

Thanks, I definetly look in to this. It is just I had some delivery.

Re: Distinct count other than proc sql

Ksharp — Tue, 26 Mar 2019 11:53:38 GMT

Hi nov,

"I am just wondering how the index which indeed is an implicit sort optimizes performance if indeed it is an extra pass."

you are right. it is indeed an extra pass if OP only run this code once. But if OP run this code many times ,it would be efficient .

And for " select count(*) from have ", if OP's dataset is in locale ,then the following code would be efficient.

%let dsid=%sysfunc(open(have));

%let nobs=%sysfunc(attrn(&dsid,nlobs));

%let dsid=%sysfunc(close(&dsid));

%put &nobs ;

P.S. I am not a perfect one, sometimes I would make some stupid errors ,so don't matter my code.

You could do better than me. 🙂