BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Srigyan
Quartz | Level 8
 

 

table1

productbranddate
1b101/10/2013
1b303/10/2013
1b303/10/2013
1b404/10/2013
2b101/10/2013
2b201/10/2013
2b301/10/2013
2b401/10/2013
1b101/10/2013
4b102/10/2013
5b303/10/2013
5b303/10/2013
6b403/10/2013
6b404/10/2013

 

The query is;

 

data have;
infile datalines dlm='09'x;
input product brand $ date :mmddyy10.;
format date mmddyy10.;
datalines;1	b1	01/10/2013
1	b3	03/10/2013
1	b3	03/10/2013
1	b4	04/10/2013
2	b1	01/10/2013
2	b2	01/10/2013
2	b3	01/10/2013
2	b4	01/10/2013
1	b1	01/10/2013
4	b1	02/10/2013
5	b3	03/10/2013
5	b3	03/10/2013
6	b4	03/10/2013
6	b4	04/10/2013;
run;



Proc SQL; Select Count(distinct(catx(product,brand,date))),count(*)  from have; quit;

 

I am using catx becuase if there any space in any of these 3 column, that will be removed.

 

Above query will give the output 11 & 14.

 

Is there any way to optimise this query. so I can run on the huge data set.

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

Firstly make an index on it ,and no need catx() which can make your code very slow .

 

data have;
infile datalines expandtabs ;
input product brand $ date :mmddyy10.;
format date mmddyy10.;
datalines;
1	b1	01/10/2013
1	b3	03/10/2013
1	b3	03/10/2013
1	b4	04/10/2013
2	b1	01/10/2013
2	b2	01/10/2013
2	b3	01/10/2013
2	b4	01/10/2013
1	b1	01/10/2013
4	b1	02/10/2013
5	b3	03/10/2013
5	b3	03/10/2013
6	b4	03/10/2013
6	b4	04/10/2013
;
run;

options bufsize=128K bufno=100;
Proc SQL; 
create index x on have(product,brand,date);

Select Count(*)
from (select distinct product,brand,date from have); 
select count(*) from have;
quit;

View solution in original post

14 REPLIES 14
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Do a proc freq and put all three variables as tables:

proc freq data=have;
  tables product * brand * date / out=want;
run;

This will give a you a large table with the various counts - note you will need to refine it (not going to type test data in to do this for you) to only get the results you want out, but it should be way faster than SQL.  Note, I assume all three are character variables, you say date, but do not convert it in the catx so you are either implicitly converting it (bad) or its text (again not optimal).

Srigyan
Quartz | Level 8

i put the code for example dataset. your query is giving me another huge table. I just need two number as an ouput

14

11

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Refer to: "This will give a you a large table with the various counts - note you will need to refine it (not going to type test data in to do this for you) to only get the results you want out"

Srigyan
Quartz | Level 8
I know proc freq, my query is getting these numbers. Whats the point of this answer which is not gonna help... I am not very good in SAS thats why I am putting this question here.Any way if you can't, thanks for your effort.
PaigeMiller
Diamond | Level 26

@Srigyan wrote:
Whats the point of this answer which is not gonna help... 

What's the point of insisting that you can't use PROC SQL, when it is the perfect tool for this situation and gets you the exact answer that you want?

--
Paige Miller
Srigyan
Quartz | Level 8
Because I am running this proc sql on more than a billioin record and running proc sql will take lot of time. So I am looking for an optimisation. Hope you can help.
Kurt_Bremser
Super User

@Srigyan wrote:
I know proc freq, my query is getting these numbers. Whats the point of this answer which is not gonna help... I am not very good in SAS thats why I am putting this question here.Any way if you can't, thanks for your effort.

Then it's ******* time you start to learn how to really use SAS, and that's where we're trying to help you. Rejecting that help won't make you lots of friends.

The time to learn how to properly use the tools available is NOW. See Maxim 13.

Srigyan
Quartz | Level 8

Thanks, I definetly look in to this. It is just I had some delivery.

Ksharp
Super User

Firstly make an index on it ,and no need catx() which can make your code very slow .

 

data have;
infile datalines expandtabs ;
input product brand $ date :mmddyy10.;
format date mmddyy10.;
datalines;
1	b1	01/10/2013
1	b3	03/10/2013
1	b3	03/10/2013
1	b4	04/10/2013
2	b1	01/10/2013
2	b2	01/10/2013
2	b3	01/10/2013
2	b4	01/10/2013
1	b1	01/10/2013
4	b1	02/10/2013
5	b3	03/10/2013
5	b3	03/10/2013
6	b4	03/10/2013
6	b4	04/10/2013
;
run;

options bufsize=128K bufno=100;
Proc SQL; 
create index x on have(product,brand,date);

Select Count(*)
from (select distinct product,brand,date from have); 
select count(*) from have;
quit;
Srigyan
Quartz | Level 8

Perfect!!!

Thanks a lot.Smiley Happy

novinosrin
Tourmaline | Level 20

Hi @Ksharp   As usual your solutions are spot on and great. I am just wondering how the index which indeed is an implicit sort optimizes performance if indeed it is an extra pass. I lack clarity in the trade-off to determine how sometimes index can be costly or efficient not necessarily pertaining to this thread.

Of course I am reading some papers of Michael raithel et al but not really getting a through grasping of performance of B-tree should the unique combination be very large in proportion to the original count.

 

PS

If it's coming from you, it's ought to be good. I can conform to blind obedience. 

Ksharp
Super User

Hi nov,

"I am just wondering how the index which indeed is an implicit sort optimizes performance if indeed it is an extra pass."

you are right. it is indeed an extra pass if OP only run this code once. But if OP run this code many times ,it would be efficient .

 

And for  " select count(*) from have ", if OP's dataset is in locale  ,then the following code would be efficient.

 

%let dsid=%sysfunc(open(have));

%let nobs=%sysfunc(attrn(&dsid,nlobs));

%let dsid=%sysfunc(close(&dsid));

%put &nobs ;

 

P.S. I am not a perfect one, sometimes I would make some stupid errors ,so don't matter my code.

You could do better than me. 🙂 

mkeintz
PROC Star

Are you merely trying to generate a true/false result on whether the cardinality of the 3-variable-combination is the same as the number of obs in the data set?  Or do you need to know exactly how many 3-var-combos there are?

 

If you just want the true/false, then

  1. you can stop checking your data upon discovery of the first duplicate, which might occur long before you read then entire file.
    1. If  the data are sorted, as  in your example then your can

      data _null_;
        set have end=end_of_have;
        by product brand date;
        if last.date=0 then do;
          put 'found duplicate at ' _N_= (product brand date) (=);
      	stop;
        end;
        if end_of_have then put  'No duplicates for dataset of ' _n_ ' observations.';
      run;
    2. But if it's NOT sorted the I would recommend a hash object to track the combinations already discovered.
      data _null_;
        set have end=end_of_have;
        if _n_=1 then do;
          declare hash h (dataset:'have (obs=0 keep=product brand date)');
      	  h.definekey(all:'Y');
      	  h.definedone();
        end;
        if h.add()^=0 then do;
          put 'found duplicate at ' _N_= (product brand date) (=);
      	stop;
        end;
        if end_of_have then put  'No duplicates for dataset of ' _n_ ' observations.';
      run;
  2. But if  you actually want frequencies of each combination, then a modification of the two above (depending on whether the data is sorted) will be needed.
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Tom
Super User Tom
Super User

Using CATX() is going to give you the wrong answer.  Especially misusing it by using one of the variables as the delimiter instead of constant.

It will map different sets of values to the same resulting string, thereby giving you an under count of the combinations.

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 14 replies
  • 3798 views
  • 0 likes
  • 8 in conversation