topic Re: distinct count in SAS Programming

distinct count

Srigyan — Mon, 25 Mar 2019 11:31:17 GMT

This is more of an optimization question. I have a table with 1 billion records. I want to check whether the total record count matches the distinct count based on 3 different variables. I have a query that runs well on small data, but it takes a long time on huge data such as 1 billion records.

For example:

product	brand	date
1	b1	01/10/2013
1	b3	03/10/2013
1	b3	03/10/2013
1	b4	04/10/2013
2	b1	01/10/2013
2	b2	01/10/2013
2	b3	01/10/2013
2	b4	01/10/2013
1	b1	01/10/2013
4	b1	02/10/2013
5	b3	03/10/2013
5	b3	03/10/2013
6	b4	03/10/2013
6	b4	04/10/2013

The query is:

proc sql;
  select count(distinct(catx(product,brand,date))), count(*) from table1;
quit;

I am using catx because if there is any space in any of these 3 columns, it will be removed.

The above query gives the output 11 and 14.

Is there any way to optimise this query so I can run it on huge data?

Re: distinct count

novinosrin — Fri, 22 Mar 2019 16:14:43 GMT

"I have a table with 1b record"

1billion????

and proc sql?

Blimey!!!!!!!

Re: distinct count

andreas_lds — Fri, 22 Mar 2019 16:36:15 GMT

One simple rule: if datasets are large, don't use proc sql at all.

Re: distinct count

novinosrin — Fri, 22 Mar 2019 16:53:25 GMT

@andreas_lds Bingo! I like that!

Re: distinct count

ballardw — Fri, 22 Mar 2019 17:16:20 GMT

@Srigyan wrote:

It's more of an optimisation question. I have a table with 1b records. Now I want to check if the total record count matches the distinct count based on 3 different variables. I have a query which runs great on small data but takes a lot of time when I run it on huge data like 1b records.

The query is:

proc sql;
  select count(distinct(catx(col1,col2,col3))), count(*) from table1;
quit;

Is there any way to optimise this query so I can run it on huge data?

Can you provide a small, maybe 15 records or so example of the data this "works" on correctly?

I am not sure I understand exactly what you are looking for, but when I test on a data set I have, this does not give a count of the distinct combinations of variables.

proc sql;
    create table work.sqlcount2 as
    Select Count(distinct(catx(sex,age))) as distinctcount,count(*) as recordcount
    from sashelp.class;
quit;

Shows 6 for the distinctcount but there are actually 11 different combinations of sex and age in the data set.

Re: distinct count

mkeintz — Fri, 22 Mar 2019 18:21:59 GMT

Use a procedure whose whole purpose is to produce frequencies. Because PROC SQL is a much more general-purpose procedure, it likely has time-consuming overhead that you don't need. Because you are counting CATX(COL1,COL2,COL3), you apparently want a 3-way crosstabulation:

proc freq data=tables noprint;
  tables col1*col2*col3 / out=freqs missing;
run;

To see what such a table would look like, run:

proc freq data=sashelp.cars noprint;
  tables model*type*origin / out=freqs missing;
run;

There will be one row per combination (including combinations in which one or more variables are missing) with the values of MODEL, TYPE and ORIGIN, and new variables COUNT and PERCENT.
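
If only the two totals are needed, the FREQS table can be collapsed in a second step. The following is a minimal sketch, not a tuned solution; the output table name COUNTS and the column names TOTAL_COUNT and LEVEL_COUNT are made up for illustration:

```sas
/* Build the one-row-per-combination table, as in the example above */
proc freq data=sashelp.cars noprint;
  tables model*type*origin / out=freqs missing;
run;

/* Collapse FREQS to two numbers:                          */
/*   sum of COUNT   = total number of rows in the source   */
/*   number of rows = number of distinct combinations      */
proc sql;
  create table counts as
  select sum(count) as total_count,
         count(*)   as level_count
  from freqs;
quit;
```

The second step reads only the (usually much smaller) FREQS table, so its cost should be minor compared with the PROC FREQ pass.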

Re: distinct count

ballardw — Fri, 22 Mar 2019 21:35:38 GMT

@mkeintz wrote:

Use a procedure whose whole purpose is to produce frequencies. Because PROC SQL is a much more general-purpose procedure, it likely has time-consuming overhead that you don't need. Because you are counting CATX(COL1,COL2,COL3), you apparently want a 3-way crosstabulation:

proc freq data=tables noprint;
  tables col1*col2*col3 / out=freqs missing;
run;

To see what such a table would look like, run:

proc freq data=sashelp.cars noprint;
  tables model*type*origin / out=freqs missing;
run;

There will be one row per combination (including combinations in which one or more variables are missing) with the values of MODEL, TYPE and ORIGIN, and new variables COUNT and PERCENT.

@mkeintz

Pretty much what I did, only I used SASHELP.CLASS as a smaller data set with only two variables. I asked what @Srigyan was actually attempting to count because his SQL returned a count of six distinct values where there are 11 combinations, so his code seems to be returning a count of the variable with the single largest number of distinct values.

proc sql;
   select distinct(catx(sex,age)) as sexage
   from sashelp.class;
quit;

Yields:

sexage

11
12
13
14
15
16

Re: distinct count

Patrick — Sat, 23 Mar 2019 01:09:55 GMT

@Srigyan

The moment you hit large data volumes the coding approach chosen can make a big difference in performance. Knowing your data will help you a lot in making "the right" choice.

For your use case:

1. Is this a table in a database (which one) or is it a SAS table?

2. If a SAS table: Is the table already sorted by col1-col3?

3. What number of distinct combinations of col1-col3 do you expect?

A SQL DISTINCT requires implicit sorting of the data. That's the costly step. I would assume PROC FREQ also requires implicit sorting.

If the expected number of distinct combinations of col1-col3 is much lower than 1 billion, then a data step hash approach would likely perform much better, as it avoids sorting 1 billion rows. But that's only going to work if the distinct combinations of col1-col3 fit into memory.
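
A minimal sketch of the hash idea, assuming a SAS data set named HAVE with key variables COL1-COL3 (placeholder names; the tiny HAVE below is made up for illustration). It reads the data once, keeps one hash entry per distinct key combination, and writes both counts to a one-row table called WANT (also a hypothetical name):

```sas
/* Illustrative data only: 5 rows, 3 distinct combinations */
data have;
  input col1 col2 $ col3;
  datalines;
1 a 10
1 a 10
2 b 20
2 b 20
3 c 30
;
run;

data want (keep=total_count level_count);
  /* Put the key variables into the PDV without reading any rows */
  if 0 then set have (keep=col1 col2 col3);
  /* One hash entry per distinct (col1, col2, col3) combination */
  declare hash h ();
  h.definekey('col1', 'col2', 'col3');
  h.definedone();
  do until (eof);
    set have (keep=col1 col2 col3) end=eof;
    total_count + 1;                  /* count every row read */
    if h.check() ne 0 then h.add();   /* store the key only if it is new */
  end;
  level_count = h.num_items;          /* number of distinct combinations */
  output;
  stop;
run;
```

No sort is involved, but every distinct combination is held in memory, so this only pays off under the memory condition described above.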

Re: distinct count

Tom — Sat, 23 Mar 2019 14:59:12 GMT

Why are you using CATX()?

That is for making delimited lists. Because it trims and skips missing values, different input values can produce the same output. For example, compare COL2=. and COL3=10 with COL2=10 and COL3=. when using CATX('-',col2,col3): both give '10' as the result.

Also, if you are using CATX(), why are you using COL1 as the delimiter instead of a constant string like '/' or '-'?
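
The collision can be reproduced with a small test. This sketch uses character values only (a blank standing in for the missing value), and the variable names X and Y are made up for illustration:

```sas
data _null_;
  length x y $20;
  x = catx('-', ' ', '10');   /* first item blank, second item '10' */
  y = catx('-', '10', ' ');   /* first item '10', second item blank */
  /* Blank items are skipped and no delimiter is inserted for them, */
  /* so the two different inputs give the same string               */
  if x = y then put 'Same result for different inputs: ' x= y=;
run;
```

Counting distinct values of such a concatenated string can therefore undercount the true number of combinations.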

Re: distinct count

Tom — Sat, 23 Mar 2019 04:41:51 GMT

If you use CATX(sex,age), you are just going to get the same values as if you did CATS(age).

The first argument to CATX() is the delimiter string. If there is only one other argument then the delimiter is never inserted.
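
A quick way to see this, as a sketch with literal character values in place of the SEX and AGE variables (the names A, B and C are hypothetical):

```sas
data _null_;
  length a b c $20;
  a = catx('F', '14');        /* 'F' is the delimiter; only '14' is an item */
  b = catx('M', '14');        /* different delimiter, same single item      */
  c = catx('|', 'F', '14');   /* constant delimiter, both values as items   */
  put a= b= c=;
run;
```

A and B come out identical because the delimiter is never inserted, while C keeps both values apart.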

Re: distinct count

Kurt_Bremser — Sat, 23 Mar 2019 08:39:40 GMT

Checking for duplicates always involves sorting, and doing it implicitly with proc sql will often be the least efficient way.

Try this instead:

proc sort
  data=have (keep=col1 col2 col3)
  out=test
  nodupkey
;
by col1 col2 col3;
run;

and look at the log. It will tell you how many duplicates were deleted.

Re: distinct count

Astounding — Sat, 23 Mar 2019 10:51:49 GMT

If your data set is already sorted by ANY of the three variables, that should help PROC FREQ run faster. For example, if your data set is sorted by COL2, you could use:

proc freq data=have noprint;
  by col2;
  tables col1*col3 / out=counts;
run;

The number of observations in COUNTS will be the right total, and PROC FREQ can take advantage of the sorted order.

Re: distinct count

mkeintz — Sun, 24 Mar 2019 20:02:40 GMT

A point I overlooked. You are using

count(distinct(catx(col1,col2,col3)))

It would probably be better practice to use

count(distinct(catx(' ',col1,col2,col3)))

It would certainly be better once you attempt to go beyond 3 variables.

And finally, if you only want the NUMBER of 3-way combinations, and not the DISTRIBUTION, then a hash object allows you to write out the count to the log in one step, as in:

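data _null_;
  /* Define the key variables in the PDV without reading any rows */
  if 0 then set sashelp.cars (keep=model type origin);
  /* Load the data set into a hash keyed on all three variables */
  declare hash h (dataset:'sashelp.cars (keep=model type origin)');
  h.definekey(all:'Y');
  h.definedone();
  /* Each key is stored only once, so NUM_ITEMS is the number of distinct combinations */
  n_combos=h.num_items;
  put n_combos=;
run;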

Re: distinct count

Srigyan — Mon, 25 Mar 2019 11:32:17 GMT

I have edited my question with an additional example. Please check and share your answer. Thanks for your effort.

Re: distinct count

Kurt_Bremser — Mon, 25 Mar 2019 11:42:09 GMT

Please make it easier for us to run tests by providing example data in a readily usable form; see my example below:

data have;
infile datalines dlm='09'x;
input product brand $ date :mmddyy10.;
format date mmddyy10.;
datalines;
1	b1	01/10/2013
1	b3	03/10/2013
1	b3	03/10/2013
1	b4	04/10/2013
2	b1	01/10/2013
2	b2	01/10/2013
2	b3	01/10/2013
2	b4	01/10/2013
1	b1	01/10/2013
4	b1	02/10/2013
5	b3	03/10/2013
5	b3	03/10/2013
6	b4	03/10/2013
6	b4	04/10/2013
;
run;

proc sort
  data=have
  out=test
  dupout=duplicates
  nodupkey
;
by product brand date;
run;

The log from that:

27         data have;
28         infile datalines dlm='09'x;
29         input product brand $ date :mmddyy10.;
30         format date mmddyy10.;
31         datalines;

NOTE: The data set WORK.HAVE has 14 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.00 seconds
      
46         ;

47         run;
48         
49         proc sort
50           data=have
51           out=test
52           dupout=duplicates
53           nodupkey
54         ;
55         by product brand date;
56         run;

NOTE: There were 14 observations read from the data set WORK.HAVE.
NOTE: 3 observations with duplicate key values were deleted.
NOTE: The data set WORK.TEST has 11 observations and 3 variables.
NOTE: The data set WORK.DUPLICATES has 3 observations and 3 variables.

Re: distinct count

Srigyan — Mon, 25 Mar 2019 11:44:57 GMT

I will create the data set going forward. I need these numbers in a table so I can update them somewhere else. I can't check the log for this.

It will be updated in another table, where it will say:

Total Count=14

Level count=11

Re: distinct count

Tom — Mon, 25 Mar 2019 12:35:07 GMT

You could just use a sub-query in PROC SQL.

Into a table:

proc sql noprint;
create table want as 
select sum(nobs) as Total_count,count(*) as Level_count
  from
  (select col1,col2,col3,count(*) as nobs
   from have 
   group by col1,col2,col3
  )
;
quit;

Or into macro variables:

proc sql noprint;
select sum(nobs) format=32.,count(*) format=32.
  into :total_count trimmed, :level_count trimmed
  from
  (select col1,col2,col3,count(*) as nobs
   from have 
   group by col1,col2,col3
  )
;
quit;
%put Total Count = %sysfunc(putn(&total_count,comma32.-L));
%put Level Count = %sysfunc(putn(&level_count,comma32.-L));

Re: distinct count

Kurt_Bremser — Mon, 25 Mar 2019 13:23:25 GMT

@Srigyan wrote:

I will create the data set going forward. I need these numbers in a table so I can update them somewhere else. I can't check the log for this.

It will be updated in another table, where it will say:

Total Count=14

Level count=11

You can pull the resulting counts for the proc sort input and output data sets from dictionary.tables:

proc sql noprint;
select nobs into :distobs from dictionary.tables where libname = 'WORK' and memname = 'TEST';
select nobs into :dupobs from dictionary.tables where libname = 'WORK' and memname = 'DUPLICATES';
select nobs into :allobs from dictionary.tables where libname = 'WORK' and memname = 'HAVE';
quit;

%put allobs=&allobs. distobs=&distobs. dupobs=&dupobs.;

Or create a table:

proc sql;
create table want as
select "Total Count" as description, nobs as count
from dictionary.tables
where libname = 'WORK' and memname = 'HAVE'
union all
select "Level Count" as description, nobs as count
from dictionary.tables
where libname = 'WORK' and memname = 'TEST'
;
quit;

You can now compare the performance of the suggested duplicate-detection methods and select which one to use.