It's more of an optimization question. I have a table with 1 billion records, and I want to check whether the total record count matches the distinct count based on 3 different variables. I have a query that runs great on small data, but it takes a long time on huge data like 1 billion records.
For example:
| product | brand | date       |
|---------|-------|------------|
| 1       | b1    | 01/10/2013 |
| 1       | b3    | 03/10/2013 |
| 1       | b3    | 03/10/2013 |
| 1       | b4    | 04/10/2013 |
| 2       | b1    | 01/10/2013 |
| 2       | b2    | 01/10/2013 |
| 2       | b3    | 01/10/2013 |
| 2       | b4    | 01/10/2013 |
| 1       | b1    | 01/10/2013 |
| 4       | b1    | 02/10/2013 |
| 5       | b3    | 03/10/2013 |
| 5       | b3    | 03/10/2013 |
| 6       | b4    | 03/10/2013 |
| 6       | b4    | 04/10/2013 |
The query is:
Proc SQL;
Select Count(distinct(catx(product,brand,date))),count(*) from table1;
quit;
I am using CATX because if there are any spaces in any of these 3 columns, they will be removed.
The above query gives the output 11 and 14.
Is there any way to optimise this query so I can run it on huge data?
---
"I have a table with 1b record"
1 billion????
and proc sql?
Blimey!!!!!!!
---
I edited my question with an additional example; please check and share your answer. Thanks for your effort.
---
Please make it easier for us to run tests by providing example data in a readily usable form; see my example below:
data have;
infile datalines dlm='09'x;
input product brand $ date :mmddyy10.;
format date mmddyy10.;
datalines;
1 b1 01/10/2013
1 b3 03/10/2013
1 b3 03/10/2013
1 b4 04/10/2013
2 b1 01/10/2013
2 b2 01/10/2013
2 b3 01/10/2013
2 b4 01/10/2013
1 b1 01/10/2013
4 b1 02/10/2013
5 b3 03/10/2013
5 b3 03/10/2013
6 b4 03/10/2013
6 b4 04/10/2013
;
run;
proc sort
data=have
out=test
dupout=duplicates
nodupkey
;
by product brand date;
run;
The log from that:
27   data have;
28      infile datalines dlm='09'x;
29      input product brand $ date :mmddyy10.;
30      format date mmddyy10.;
31   datalines;
NOTE: The data set WORK.HAVE has 14 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.00 seconds
46   ;
47   run;
48
49   proc sort
50      data=have
51      out=test
52      dupout=duplicates
53      nodupkey
54   ;
55      by product brand date;
56   run;
NOTE: There were 14 observations read from the data set WORK.HAVE.
NOTE: 3 observations with duplicate key values were deleted.
NOTE: The data set WORK.TEST has 11 observations and 3 variables.
NOTE: The data set WORK.DUPLICATES has 3 observations and 3 variables.
---
I will create the data set going forward. I need these numbers in a table so I can update them somewhere else; I can't check the log for this.
They will be updated in another table, where it will say:
Total Count=14
Level count=11
---
You could just use a sub-query in PROC SQL.
Into a table:
proc sql noprint;
create table want as
select sum(nobs) as Total_count,count(*) as Level_count
from
(select col1,col2,col3,count(*) as nobs
from have
group by col1,col2,col3
)
;
quit;
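With the 14-row example above, WANT would contain Total_count=14 and Level_count=11: the inner query collapses the data to one row per distinct combination of the three columns, so the outer COUNT(*) yields the level count, and summing the per-group NOBS frequencies restores the total row count.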
Or into macro variables:
proc sql noprint;
select sum(nobs) format=32.,count(*) format=32.
into :total_count trimmed, :level_count trimmed
from
(select col1,col2,col3,count(*) as nobs
from have
group by col1,col2,col3
)
;
quit;
%put Total Count = %sysfunc(putn(&total_count,comma32.-L));
%put Level Count = %sysfunc(putn(&level_count,comma32.-L));
---
@Srigyan wrote:
I will create the data set going forward. I need these numbers in a table so I can update them somewhere else; I can't check the log for this.
They will be updated in another table, where it will say:
Total Count=14
Level count=11
You can pull the resulting counts from the output of the PROC SORT via dictionary.tables:
proc sql noprint;
select nobs into :distobs from dictionary.tables where libname = 'WORK' and memname = 'TEST';
select nobs into :dupobs from dictionary.tables where libname = 'WORK' and memname = 'DUPLICATES';
select nobs into :allobs from dictionary.tables where libname = 'WORK' and memname = 'HAVE';
quit;
%put allobs=&allobs. distobs=&distobs. dupobs=&dupobs.;
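With the example data above, this reports 14, 11, and 3 observations for HAVE, TEST, and DUPLICATES respectively.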
Or create a table:
proc sql;
create table want as
select "Total Count" as description, nobs as count
from dictionary.tables
where libname = 'WORK' and memname = 'HAVE'
union all
select "Level Count" as description, nobs as count
from dictionary.tables
where libname = 'WORK' and memname = 'TEST'
;
quit;
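With the example data, WANT would contain two rows: "Total Count" with 14 and "Level Count" with 11.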
You can now compare the performance of the suggested duplicate-detection methods and select which one to use.
---
One simple rule: if datasets are large, don't use proc sql at all.
---
@andreas_lds Bingo! I like that!
---
I edited my question with an additional example; please check and share your answer. Thanks for your effort.
---
@Srigyan wrote:
It's more of an optimization question. I have a table with 1 billion records, and I want to check whether the total record count matches the distinct count based on 3 different variables. I have a query that runs great on small data, but it takes a long time on huge data like 1 billion records.
The query is:
proc sql;
Select Count(distinct(catx(col1,col2,col3))),count(*) from table1;
quit;
Is there any way to optimise this query so I can run it on huge data?
Can you provide a small example, maybe 15 records or so, of the data this "works" on correctly?
I am not sure I understand exactly what you are looking for, but when I test on a data set I have, this does not give a count of the distinct combinations of variables.
proc sql;
   create table work.sqlcount2 as
   select count(distinct(catx(sex,age))) as distinctcount,
          count(*) as recordcount
   from sashelp.class;
quit;
This shows 6 for distinctcount, but there are actually 11 different combinations of sex and age in the data set.
---
If you use CATX(sex,age) you are just going to get the same values as if you did CATS(age).
The first argument to CATX() is the delimiter string; if there is only one other argument, the delimiter is never inserted.
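A minimal sketch of the difference, using the two SASHELP.CLASS variables from the example above (the '|' delimiter is an arbitrary choice):
data _null_;
   set sashelp.class(obs=1);
   /* WRONG: SEX is consumed as the delimiter string, so the result
      is just the formatted value of AGE */
   bad = catx(sex, age);
   /* RIGHT: pass an explicit delimiter first, then all key columns */
   good = catx('|', sex, age);
   put bad= good=;
run;
For the first row of SASHELP.CLASS this prints bad=14 good=M|14, which is why the distinct count collapses to the number of distinct ages.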
---
Use a procedure whose whole purpose is to produce frequencies. Because PROC SQL is a much more general-purpose procedure, it likely has time-consuming overhead that you don't need. Because you are counting CATX(COL1,COL2,COL3), you apparently want a 3-way crosstabulation:
proc freq data=tables noprint;
tables col1*col2*col3 / out=freqs missing;
run;
To see what such a table would look like, run
proc freq data=sashelp.cars noprint;
tables model*type*origin / out=freqs missing;
run;
There will be one row per combination (including combinations in which one or more variables are missing) with the values of MODEL, TYPE, and ORIGIN, and new variables COUNT and PERCENT.
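If you go this route, the two numbers from the original question can then be read from the (now small) FREQS table in one cheap extra step. A sketch, assuming the OUT=FREQS data set produced above:
proc sql;
   create table want as
   select sum(f.count) as Total_count,  /* sum of cell frequencies = total rows */
          count(*)     as Level_count   /* one row per distinct combination */
   from freqs as f;
quit;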
---
@mkeintz wrote:
Use a procedure whose whole purpose is to produce frequencies. Because PROC SQL is a much more general-purpose procedure, it likely has time-consuming overhead that you don't need. Because you are counting CATX(COL1,COL2,COL3), you apparently want a 3-way crosstabulation:
proc freq data=tables noprint;
tables col1*col2*col3 / out=freqs missing;
run;
To see what such a table would look like, run
proc freq data=sashelp.cars noprint;
tables model*type*origin / out=freqs missing;
run;
There will be one row per combination (including combinations in which one or more variables are missing) with the values of MODEL, TYPE, and ORIGIN, and new variables COUNT and PERCENT.
Pretty much what I did, only I used SASHELP.CLASS as a smaller data set with only two variables. I asked what @Srigyan was actually attempting to count because his SQL returned a distinct count of six where there are 11 combinations, so his code seems to be returning a count of the variable with the single largest number of distinct values.
proc sql;
   select distinct(catx(sex,age)) as sexage
   from sashelp.class;
quit;
Yields:
sexage
------
11
12
13
14
15
16
---
The moment you hit large data volumes, the coding approach you choose can make a big difference in performance. Knowing your data will help you a lot in making "the right" choice.
For your use case:
1. Is this a table in a database (which one) or is it a SAS table?
2. If a SAS table: Is the table already sorted by col1-col3?
3. What number of distinct combinations of col1-col3 do you expect?
A SQL DISTINCT requires implicit sorting of the data; that's the costly step. I would assume PROC FREQ also requires implicit sorting.
If the expected number of distinct combinations of col1-col3 is much lower than 1 billion, then a data step hash approach would likely perform much better, as it avoids sorting 1 billion rows. But that only works if the distinct combinations of col1-col3 fit into memory.
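A minimal sketch of that hash approach, using the COL1-COL3 names from the question; HAVE stands in for the real 1-billion-row table, and this is untested at that scale:
data _null_;
   /* one sequential pass, no sorting: each distinct key is stored once */
   if _n_ = 1 then do;
      declare hash h();
      h.defineKey('col1', 'col2', 'col3');
      h.defineDone();
   end;
   set have end=last;
   total + 1;                        /* running count of all rows */
   if h.add() = 0 then levels + 1;   /* ADD() returns 0 only for a new key */
   if last then do;
      call symputx('total_count', total);
      call symputx('level_count', levels);
   end;
run;
The two macro variables can then be written to a tracking table in the same way as in the PROC SQL examples earlier in the thread.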