SAS Programming

DATA Step, Macro, Functions and more
BookmarkSubscribeRSS Feed
Srigyan
Quartz | Level 8

It's more about optimization question. I have a table with 1b record. Now I want to check if the total record is matched with distinct count based on 3 different variables. Though I have a query which runs great on small data, but takes lots of time when I run this on huge data like 1b record.

 

i.e.

productbranddate
1b101/10/2013
1b303/10/2013
1b303/10/2013
1b404/10/2013
2b101/10/2013
2b201/10/2013
2b301/10/2013
2b401/10/2013
1b101/10/2013
4b102/10/2013
5b303/10/2013
5b303/10/2013
6b403/10/2013
6b404/10/2013

 

The query is;

 

Proc SQL;

Select Count(distinct(catx(product,brand,date))),count(*)  from table1;

quit;

 

I am using catx becuase if there any space in any of these 3 column, that will be removed.

 

Above query will give the output 11 & 14.

 

Is there any way to optimise this query. so I can run on huge data

18 REPLIES 18
novinosrin
Tourmaline | Level 20

"I have a table with 1b record"

 

1billion????

 

and proc sql?

 

Blimey!!!!!!! 

Srigyan
Quartz | Level 8

edit my question with an additional example, please check and share your answer. Thanks for your effort.

Kurt_Bremser
Super User

Please make it easier for use to run tests by providing example data in a readily usable form, see my below example:

data have;
infile datalines dlm='09'x;
input product brand $ date :mmddyy10.;
format date mmddyy10.;
datalines;
1	b1	01/10/2013
1	b3	03/10/2013
1	b3	03/10/2013
1	b4	04/10/2013
2	b1	01/10/2013
2	b2	01/10/2013
2	b3	01/10/2013
2	b4	01/10/2013
1	b1	01/10/2013
4	b1	02/10/2013
5	b3	03/10/2013
5	b3	03/10/2013
6	b4	03/10/2013
6	b4	04/10/2013
;
run;

proc sort
  data=have
  out=test
  dupout=duplicates
  nodupkey
;
by product brand date;
run;

The log from that:

27         data have;
28         infile datalines dlm='09'x;
29         input product brand $ date :mmddyy10.;
30         format date mmddyy10.;
31         datalines;

NOTE: The data set WORK.HAVE has 14 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.00 seconds
      
46         ;

47         run;
48         
49         proc sort
50           data=have
51           out=test
52           dupout=duplicates
53           nodupkey
54         ;
55         by product brand date;
56         run;

NOTE: There were 14 observations read from the data set WORK.HAVE.
NOTE: 3 observations with duplicate key values were deleted.
NOTE: The data set WORK.TEST has 11 observations and 3 variables.
NOTE: The data set WORK.DUPLICATES has 3 observations and 3 variables.
Srigyan
Quartz | Level 8

I will create the data set going forward. I need these numbers in a table so I can update this somewhere else. I can't check log for this.

it will be updated in another table where it will say

Total Count=14

Level count=11

Tom
Super User Tom
Super User

You could just use a sub-query in PROC SQL.

Into a table:

proc sql noprint;
create table want as 
select sum(nobs) as Total_count,count(*) as Level_count
  from
  (select col1,col2,col3,count(*) as nobs
   from have 
   group by col1,col2,col3
  )
;
quit;

Or into macro variables:

proc sql noprint;
select sum(nobs) format=32.,count(*) format=32.
  into :total_count trimmed, :level_count trimmed
  from
  (select col1,col2,col3,count(*) as nobs
   from have 
   group by col1,col2,col3
  )
;
quit;
%put Total Count = %sysfunc(putn(&total_count,comma32.-L));
%put Level Count = %sysfunc(putn(&level_count,comma32.-L));

 

Kurt_Bremser
Super User

@Srigyan wrote:

I will create the data set going forward. I need these numbers in a table so I can update this somewhere else. I can't check log for this.

it will be updated in another table where it will say

Total Count=14

Level count=11


You can pull the resulting sums from the output of the proc sort from dictionary.tables:

proc sql noprint;
select nobs into :distobs from dictionary.tables where libname = 'WORK' and memname = 'TEST';
select nobs into :dupobs from dictionary.tables where libname = 'WORK' and memname = 'DUPLICATES';
select nobs into :allobs from dictionary.tables where libname = 'WORK' and memname = 'HAVE';
quit;

%put allobs=&allobs. distobs=&distobs. dupobs=&dupobs.;

or create a table

proc sql;
create table want as
select "Total Count" as description, nobs as count
from dictionary.tables
where libname = 'WORK' and memname = 'HAVE'
union all
select "Level Count" as description, nobs as count
from dictionary.tables
where libname = 'WORK' and memname = 'TEST'
;
quit;

 

 

You can now compare the performance of the suggested duplicate-detection methods and select which one to use.

andreas_lds
Jade | Level 19

One simple rule: if datasets are large, don't use proc sql at all.

novinosrin
Tourmaline | Level 20

@andreas_lds   Bingo!  I like that!

Srigyan
Quartz | Level 8

edited my question with an additional example, please check and share your answer. Thanks for your effort.

ballardw
Super User

@Srigyan wrote:

It's more of optimisation question. I have a table with 1b record. Now I want to check if the total record is matching with distinct count based on 3 diferent variable. Though I have a query which runs great on small data but takes lot of time when I run this on huge data like 1b record.

 

Query is;

 

proc sql;

Select Count(distinct(catx(col1,col2,col3))),count(*)  from table1;

quit;

 

is there any way to optimise this query. so I can run on huge data


Can you provide a small, maybe 15 records or so example of the data this "works" on correctly?

 

I am not sure I understand exactly what you are looking for but when I test on a set I have this does not give a count of the distinct combinations of variables.

 

proc sql;
    create table work.sqlcount2 as
    Select Count(distinct(catx(sex,age))) as distinctcount,count(*) as recordcount
    from sashelp.class;
quit;

Shows 6 for the distinctcount but there are actually 11 different combinations of sex and age in the data set.

 

Tom
Super User Tom
Super User

If you use CATX(sex,age) you are just going to get the same values are if you did CATS(age).

The first argument to CATX() is the delimiter string.  If there is only one other argument then the delimiter is never inserted.

mkeintz
PROC Star

Use a procedure whose whole existence is to produce frequencies.  Because PROC SQL is a much more general purpose procedure, it likely has time-consuming overhead that you don't need.   Because you are counting CATX(COL1,COL2,COL3) you apparently want 3-ways crosstabulations:

 

proc freq data=tables noprint;
  tables col1*col2*col3 / out=freqs missing;

run;

 

To see what such a table would look like, run

 

proc freq data=sashelp.cars noprint;

  tables model*type*origin / out=freqs missing;

run;

 

There will be one row per combination (including combinations in which 1 or more variable is missing) with a the values of MODEL TYPE ORIGIN, and new variables COUNT and PERCENT.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ballardw
Super User

@mkeintz wrote:

Use a procedure whose whole existence is to produce frequencies.  Because PROC SQL is a much more general purpose procedure, it likely has time-consuming overhead that you don't need.   Because you are counting CATX(COL1,COL2,COL3) you apparently want 3-ways crosstabulations:

 

proc freq data=tables noprint;
  tables col1*col2*col3 / out=freqs missing;

run;

 

To see what such a table would look like, run

 

proc freq data=sashelp.cars noprint;

  tables model*type*origin / out=freqs missing;

run;

 

There will be one row per combination (including combinations in which 1 or more variable is missing) with a the values of MODEL TYPE ORIGIN, and new variables COUNT and PERCENT.


@mkeintz 

Pretty much what I did only I used SASHELP.CLASS as a smaller data set on only two variables. I asked what @Srigyan was actually attempting to count because his SQL returned a count of six values for the distinct where there are 11 combinations, so his code seems to be returning a count of the variable with the single largest number of distinct values.

proc sql;
   select distinct(catx(sex,age)) as sexage
   from sashelp.class;
quit;

Yields:

 

sexage

11
12
13
14
15
16
Patrick
Opal | Level 21

@Srigyan 

The moment you hit large data volumes the coding approach chosen can make a big difference in performance. Knowing your data will help you a lot in making "the right" choice.

 

For your use case:

1. Is this a table in a database (which one) or is it a SAS table?

2. If a SAS table: Is the table already sorted by col1-col3?

3. What number of distinct combinations of col1-col3 do you expect?

 

A SQL DISTINCT requires implicit sorting of the data. That's the costly step. I would assume also Proc Freq requires implicit sorting.

 

IF the expected number of distinct combinations of col1-col3 is much lower than 1 billion then using a data step hash approach would likely perform much better as it avoids sorting 1billion rows. But that's only going to work if the distinct combinations of col1-col3 fit into memory.

sas-innovate-white.png

Our biggest data and AI event of the year.

Don’t miss the livestream kicking off May 7. It’s free. It’s easy. And it’s the best seat in the house.

Join us virtually with our complimentary SAS Innovate Digital Pass. Watch live or on-demand in multiple languages, with translations available to help you get the most out of every session.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 18 replies
  • 3137 views
  • 6 likes
  • 9 in conversation