It's more about optimization question. I have a table with 1b record. Now I want to check if the total record is matched with distinct count based on 3 different variables. Though I have a query which runs great on small data, but takes lots of time when I run this on huge data like 1b record.
table1
product | brand | date |
1 | b1 | 01/10/2013 |
1 | b3 | 03/10/2013 |
1 | b3 | 03/10/2013 |
1 | b4 | 04/10/2013 |
2 | b1 | 01/10/2013 |
2 | b2 | 01/10/2013 |
2 | b3 | 01/10/2013 |
2 | b4 | 01/10/2013 |
1 | b1 | 01/10/2013 |
4 | b1 | 02/10/2013 |
5 | b3 | 03/10/2013 |
5 | b3 | 03/10/2013 |
6 | b4 | 03/10/2013 |
6 | b4 | 04/10/2013 |
The query is;
data have;
infile datalines dlm='09'x;
input product brand $ date :mmddyy10.;
format date mmddyy10.;
datalines;1 b1 01/10/2013
1 b3 03/10/2013
1 b3 03/10/2013
1 b4 04/10/2013
2 b1 01/10/2013
2 b2 01/10/2013
2 b3 01/10/2013
2 b4 01/10/2013
1 b1 01/10/2013
4 b1 02/10/2013
5 b3 03/10/2013
5 b3 03/10/2013
6 b4 03/10/2013
6 b4 04/10/2013;
run;
Proc SQL; Select Count(distinct(catx(product,brand,date))),count(*) from have; quit;
I am using catx becuase if there any space in any of these 3 column, that will be removed.
Above query will give the output 11 & 14.
Is there any way to optimise this query. so I can run on the huge data set.
Firstly make an index on it ,and no need catx() which can make your code very slow .
data have;
infile datalines expandtabs ;
input product brand $ date :mmddyy10.;
format date mmddyy10.;
datalines;
1 b1 01/10/2013
1 b3 03/10/2013
1 b3 03/10/2013
1 b4 04/10/2013
2 b1 01/10/2013
2 b2 01/10/2013
2 b3 01/10/2013
2 b4 01/10/2013
1 b1 01/10/2013
4 b1 02/10/2013
5 b3 03/10/2013
5 b3 03/10/2013
6 b4 03/10/2013
6 b4 04/10/2013
;
run;
options bufsize=128K bufno=100;
Proc SQL;
create index x on have(product,brand,date);
Select Count(*)
from (select distinct product,brand,date from have);
select count(*) from have;
quit;
Do a proc freq and put all three variables as tables:
proc freq data=have; tables product * brand * date / out=want; run;
This will give a you a large table with the various counts - note you will need to refine it (not going to type test data in to do this for you) to only get the results you want out, but it should be way faster than SQL. Note, I assume all three are character variables, you say date, but do not convert it in the catx so you are either implicitly converting it (bad) or its text (again not optimal).
i put the code for example dataset. your query is giving me another huge table. I just need two number as an ouput
14
11
Refer to: "This will give a you a large table with the various counts - note you will need to refine it (not going to type test data in to do this for you) to only get the results you want out"
@Srigyan wrote:
Whats the point of this answer which is not gonna help...
What's the point of insisting that you can't use PROC SQL, when it is the perfect tool for this situation and gets you the exact answer that you want?
@Srigyan wrote:
I know proc freq, my query is getting these numbers. Whats the point of this answer which is not gonna help... I am not very good in SAS thats why I am putting this question here.Any way if you can't, thanks for your effort.
Then it's ******* time you start to learn how to really use SAS, and that's where we're trying to help you. Rejecting that help won't make you lots of friends.
The time to learn how to properly use the tools available is NOW. See Maxim 13.
Thanks, I definetly look in to this. It is just I had some delivery.
Firstly make an index on it ,and no need catx() which can make your code very slow .
data have;
infile datalines expandtabs ;
input product brand $ date :mmddyy10.;
format date mmddyy10.;
datalines;
1 b1 01/10/2013
1 b3 03/10/2013
1 b3 03/10/2013
1 b4 04/10/2013
2 b1 01/10/2013
2 b2 01/10/2013
2 b3 01/10/2013
2 b4 01/10/2013
1 b1 01/10/2013
4 b1 02/10/2013
5 b3 03/10/2013
5 b3 03/10/2013
6 b4 03/10/2013
6 b4 04/10/2013
;
run;
options bufsize=128K bufno=100;
Proc SQL;
create index x on have(product,brand,date);
Select Count(*)
from (select distinct product,brand,date from have);
select count(*) from have;
quit;
Perfect!!!
Thanks a lot.
Hi @Ksharp As usual your solutions are spot on and great. I am just wondering how the index which indeed is an implicit sort optimizes performance if indeed it is an extra pass. I lack clarity in the trade-off to determine how sometimes index can be costly or efficient not necessarily pertaining to this thread.
Of course I am reading some papers of Michael raithel et al but not really getting a through grasping of performance of B-tree should the unique combination be very large in proportion to the original count.
PS
If it's coming from you, it's ought to be good. I can conform to blind obedience.
Hi nov,
"I am just wondering how the index which indeed is an implicit sort optimizes performance if indeed it is an extra pass."
you are right. it is indeed an extra pass if OP only run this code once. But if OP run this code many times ,it would be efficient .
And for " select count(*) from have ", if OP's dataset is in locale ,then the following code would be efficient.
%let dsid=%sysfunc(open(have));
%let nobs=%sysfunc(attrn(&dsid,nlobs));
%let dsid=%sysfunc(close(&dsid));
%put &nobs ;
P.S. I am not a perfect one, sometimes I would make some stupid errors ,so don't matter my code.
You could do better than me. 🙂
Are you merely trying to generate a true/false result on whether the cardinality of the 3-variable-combination is the same as the number of obs in the data set? Or do you need to know exactly how many 3-var-combos there are?
If you just want the true/false, then
data _null_;
set have end=end_of_have;
by product brand date;
if last.date=0 then do;
put 'found duplicate at ' _N_= (product brand date) (=);
stop;
end;
if end_of_have then put 'No duplicates for dataset of ' _n_ ' observations.';
run;
data _null_;
set have end=end_of_have;
if _n_=1 then do;
declare hash h (dataset:'have (obs=0 keep=product brand date)');
h.definekey(all:'Y');
h.definedone();
end;
if h.add()^=0 then do;
put 'found duplicate at ' _N_= (product brand date) (=);
stop;
end;
if end_of_have then put 'No duplicates for dataset of ' _n_ ' observations.';
run;
Using CATX() is going to give you the wrong answer. Especially misusing it by using one of the variables as the delimiter instead of constant.
It will map different sets of values to the same resulting string, thereby giving you an under count of the combinations.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.