Solved: Re: Distinct count other than proc sql

Srigyan · Posted 03-25-2019 07:42 AM

It's more about optimization question. I have a table with 1b record. Now I want to check if the total record is matched with distinct count based on 3 different variables. Though I have a query which runs great on small data, but takes lots of time when I run this on huge data like 1b record.

table1

product	brand	date
1	b1	01/10/2013
1	b3	03/10/2013
1	b3	03/10/2013
1	b4	04/10/2013
2	b1	01/10/2013
2	b2	01/10/2013
2	b3	01/10/2013
2	b4	01/10/2013
1	b1	01/10/2013
4	b1	02/10/2013
5	b3	03/10/2013
5	b3	03/10/2013
6	b4	03/10/2013
6	b4	04/10/2013

The query is;

data have;
infile datalines dlm='09'x;
input product brand $ date :mmddyy10.;
format date mmddyy10.;
datalines;1	b1	01/10/2013
1	b3	03/10/2013
1	b3	03/10/2013
1	b4	04/10/2013
2	b1	01/10/2013
2	b2	01/10/2013
2	b3	01/10/2013
2	b4	01/10/2013
1	b1	01/10/2013
4	b1	02/10/2013
5	b3	03/10/2013
5	b3	03/10/2013
6	b4	03/10/2013
6	b4	04/10/2013;
run;




Proc SQL; Select Count(distinct(catx(product,brand,date))),count(*)  from have; quit;

I am using catx becuase if there any space in any of these 3 column, that will be removed.

Above query will give the output 11 & 14.

Is there any way to optimise this query. so I can run on the huge data set.

Ksharp · Posted 03-25-2019 08:37 AM

Firstly make an index on it ,and no need catx() which can make your code very slow .

data have;
infile datalines expandtabs ;
input product brand $ date :mmddyy10.;
format date mmddyy10.;
datalines;
1	b1	01/10/2013
1	b3	03/10/2013
1	b3	03/10/2013
1	b4	04/10/2013
2	b1	01/10/2013
2	b2	01/10/2013
2	b3	01/10/2013
2	b4	01/10/2013
1	b1	01/10/2013
4	b1	02/10/2013
5	b3	03/10/2013
5	b3	03/10/2013
6	b4	03/10/2013
6	b4	04/10/2013
;
run;

options bufsize=128K bufno=100;
Proc SQL; 
create index x on have(product,brand,date);

Select Count(*)
from (select distinct product,brand,date from have); 
select count(*) from have;
quit;

View solution in original post

RW9 · Posted 03-25-2019 07:47 AM

Do a proc freq and put all three variables as tables:

proc freq data=have;
  tables product * brand * date / out=want;
run;

This will give a you a large table with the various counts - note you will need to refine it (not going to type test data in to do this for you) to only get the results you want out, but it should be way faster than SQL. Note, I assume all three are character variables, you say date, but do not convert it in the catx so you are either implicitly converting it (bad) or its text (again not optimal).

Srigyan · Posted 03-25-2019 07:49 AM

i put the code for example dataset. your query is giving me another huge table. I just need two number as an ouput

14

11

RW9 · Posted 03-25-2019 07:52 AM

Refer to: "This will give a you a large table with the various counts - note you will need to refine it (not going to type test data in to do this for you) to only get the results you want out"

Srigyan · Posted 03-25-2019 07:54 AM

I know proc freq, my query is getting these numbers. Whats the point of this answer which is not gonna help... I am not very good in SAS thats why I am putting this question here.Any way if you can't, thanks for your effort.

PaigeMiller · Posted 03-25-2019 08:19 AM

@Srigyan wrote:
Whats the point of this answer which is not gonna help...

What's the point of insisting that you can't use PROC SQL, when it is the perfect tool for this situation and gets you the exact answer that you want?

--
Paige Miller

Srigyan · Posted 03-25-2019 08:23 AM

Because I am running this proc sql on more than a billioin record and running proc sql will take lot of time. So I am looking for an optimisation. Hope you can help.

Kurt_Bremser · Posted 03-25-2019 09:33 AM

@Srigyan wrote:
I know proc freq, my query is getting these numbers. Whats the point of this answer which is not gonna help... I am not very good in SAS thats why I am putting this question here.Any way if you can't, thanks for your effort.

Then it's ******* time you start to learn how to really use SAS, and that's where we're trying to help you. Rejecting that help won't make you lots of friends.

The time to learn how to properly use the tools available is NOW. See Maxim 13.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Srigyan · Posted 03-25-2019 10:02 AM

Thanks, I definetly look in to this. It is just I had some delivery.

Ksharp · Posted 03-25-2019 08:37 AM

Firstly make an index on it ,and no need catx() which can make your code very slow .

data have;
infile datalines expandtabs ;
input product brand $ date :mmddyy10.;
format date mmddyy10.;
datalines;
1	b1	01/10/2013
1	b3	03/10/2013
1	b3	03/10/2013
1	b4	04/10/2013
2	b1	01/10/2013
2	b2	01/10/2013
2	b3	01/10/2013
2	b4	01/10/2013
1	b1	01/10/2013
4	b1	02/10/2013
5	b3	03/10/2013
5	b3	03/10/2013
6	b4	03/10/2013
6	b4	04/10/2013
;
run;

options bufsize=128K bufno=100;
Proc SQL; 
create index x on have(product,brand,date);

Select Count(*)
from (select distinct product,brand,date from have); 
select count(*) from have;
quit;

Srigyan · Posted 03-25-2019 08:56 AM

Perfect!!!

Thanks a lot.

novinosrin · Posted 03-25-2019 09:12 AM

Hi @Ksharp As usual your solutions are spot on and great. I am just wondering how the index which indeed is an implicit sort optimizes performance if indeed it is an extra pass. I lack clarity in the trade-off to determine how sometimes index can be costly or efficient not necessarily pertaining to this thread.

Of course I am reading some papers of Michael raithel et al but not really getting a through grasping of performance of B-tree should the unique combination be very large in proportion to the original count.

PS

If it's coming from you, it's ought to be good. I can conform to blind obedience.

Ksharp · Posted 03-26-2019 07:47 AM

Hi nov,

"I am just wondering how the index which indeed is an implicit sort optimizes performance if indeed it is an extra pass."

you are right. it is indeed an extra pass if OP only run this code once. But if OP run this code many times ,it would be efficient .

And for " select count(*) from have ", if OP's dataset is in locale ,then the following code would be efficient.

%let dsid=%sysfunc(open(have));

%let nobs=%sysfunc(attrn(&dsid,nlobs));

%let dsid=%sysfunc(close(&dsid));

%put &nobs ;

P.S. I am not a perfect one, sometimes I would make some stupid errors ,so don't matter my code.

You could do better than me. 🙂

mkeintz · Posted 03-25-2019 08:58 AM

Are you merely trying to generate a true/false result on whether the cardinality of the 3-variable-combination is the same as the number of obs in the data set? Or do you need to know exactly how many 3-var-combos there are?

If you just want the true/false, then

you can stop checking your data upon discovery of the first duplicate, which might occur long before you read then entire file.

If the data are sorted, as in your example then your can

data _null_;
  set have end=end_of_have;
  by product brand date;
  if last.date=0 then do;
    put 'found duplicate at ' _N_= (product brand date) (=);
	stop;
  end;
  if end_of_have then put  'No duplicates for dataset of ' _n_ ' observations.';
run;

But if it's NOT sorted the I would recommend a hash object to track the combinations already discovered.

data _null_;
  set have end=end_of_have;
  if _n_=1 then do;
    declare hash h (dataset:'have (obs=0 keep=product brand date)');
	  h.definekey(all:'Y');
	  h.definedone();
  end;
  if h.add()^=0 then do;
    put 'found duplicate at ' _N_= (product brand date) (=);
	stop;
  end;
  if end_of_have then put  'No duplicates for dataset of ' _n_ ' observations.';
run;

But if you actually want frequencies of each combination, then a modification of the two above (depending on whether the data is sorted) will be needed.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Tom · Posted 03-25-2019 09:11 AM

Using CATX() is going to give you the wrong answer. Especially misusing it by using one of the variables as the delimiter instead of constant.

It will map different sets of values to the same resulting string, thereby giving you an under count of the combinations.

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away