This is more of an optimization question. I have a table with 1 billion records. I want to check whether the total record count matches the distinct count based on 3 different variables. I have a query which runs fine on small data, but it takes a long time when I run it on huge data like 1 billion records.
i.e.
product | brand | date |
1 | b1 | 01/10/2013 |
1 | b3 | 03/10/2013 |
1 | b3 | 03/10/2013 |
1 | b4 | 04/10/2013 |
2 | b1 | 01/10/2013 |
2 | b2 | 01/10/2013 |
2 | b3 | 01/10/2013 |
2 | b4 | 01/10/2013 |
1 | b1 | 01/10/2013 |
4 | b1 | 02/10/2013 |
5 | b3 | 03/10/2013 |
5 | b3 | 03/10/2013 |
6 | b4 | 03/10/2013 |
6 | b4 | 04/10/2013 |
The query is:
Proc SQL;
Select Count(distinct(catx(product,brand,date))),count(*) from table1;
quit;
I am using CATX because if there is a space in any of these 3 columns, it will be removed.
The above query gives the output 11 and 14.
Is there any way to optimise this query so I can run it on huge data?
"I have a table with 1b record"
1billion????
and proc sql?
Blimey!!!!!!!
I edited my question with an additional example; please check and share your answer. Thanks for your effort.
Please make it easier for us to run tests by providing example data in a readily usable form; see my example below:
data have;
infile datalines dlm='09'x;
input product brand $ date :mmddyy10.;
format date mmddyy10.;
datalines;
1 b1 01/10/2013
1 b3 03/10/2013
1 b3 03/10/2013
1 b4 04/10/2013
2 b1 01/10/2013
2 b2 01/10/2013
2 b3 01/10/2013
2 b4 01/10/2013
1 b1 01/10/2013
4 b1 02/10/2013
5 b3 03/10/2013
5 b3 03/10/2013
6 b4 03/10/2013
6 b4 04/10/2013
;
run;
proc sort
data=have
out=test
dupout=duplicates
nodupkey
;
by product brand date;
run;
The log from that:
27   data have;
28     infile datalines dlm='09'x;
29     input product brand $ date :mmddyy10.;
30     format date mmddyy10.;
31   datalines;
NOTE: The data set WORK.HAVE has 14 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.00 seconds
46   ;
47   run;
48
49   proc sort
50     data=have
51     out=test
52     dupout=duplicates
53     nodupkey
54   ;
55     by product brand date;
56   run;
NOTE: There were 14 observations read from the data set WORK.HAVE.
NOTE: 3 observations with duplicate key values were deleted.
NOTE: The data set WORK.TEST has 11 observations and 3 variables.
NOTE: The data set WORK.DUPLICATES has 3 observations and 3 variables.
I will create the data set going forward. I need these numbers in a table so I can update them somewhere else; I can't check the log for this.
They will be written to another table where it will say
Total Count=14
Level count=11
You could just use a sub-query in PROC SQL.
Into a table:
proc sql noprint;
create table want as
select sum(nobs) as Total_count,count(*) as Level_count
from
(select col1,col2,col3,count(*) as nobs
from have
group by col1,col2,col3
)
;
quit;
Or into macro variables:
proc sql noprint;
select sum(nobs) format=32.,count(*) format=32.
into :total_count trimmed, :level_count trimmed
from
(select col1,col2,col3,count(*) as nobs
from have
group by col1,col2,col3
)
;
quit;
%put Total Count = %sysfunc(putn(&total_count,comma32.-L));
%put Level Count = %sysfunc(putn(&level_count,comma32.-L));
@Srigyan wrote:
I will create the data set going forward. I need these numbers in a table so I can update this somewhere else. I can't check log for this.
it will be updated in another table where it will say
Total Count=14
Level count=11
You can pull the resulting counts from the output of the proc sort via dictionary.tables:
proc sql noprint;
select nobs into :distobs from dictionary.tables where libname = 'WORK' and memname = 'TEST';
select nobs into :dupobs from dictionary.tables where libname = 'WORK' and memname = 'DUPLICATES';
select nobs into :allobs from dictionary.tables where libname = 'WORK' and memname = 'HAVE';
quit;
%put allobs=&allobs. distobs=&distobs. dupobs=&dupobs.;
or create a table
proc sql;
create table want as
select "Total Count" as description, nobs as count
from dictionary.tables
where libname = 'WORK' and memname = 'HAVE'
union all
select "Level Count" as description, nobs as count
from dictionary.tables
where libname = 'WORK' and memname = 'TEST'
;
quit;
You can now compare the performance of the suggested duplicate-detection methods and select which one to use.
One simple rule: if datasets are large, don't use proc sql at all.
@andreas_lds Bingo! I like that!
I edited my question with an additional example; please check and share your answer. Thanks for your effort.
@Srigyan wrote:
It's more of an optimisation question. I have a table with 1 billion records. Now I want to check if the total record count matches the distinct count based on 3 different variables. Though I have a query which runs great on small data, it takes a lot of time when I run it on huge data like 1 billion records.
The query is:
proc sql;
Select Count(distinct(catx(col1,col2,col3))),count(*) from table1;
quit;
Is there any way to optimise this query so I can run it on huge data?
Can you provide a small, maybe 15 records or so example of the data this "works" on correctly?
I am not sure I understand exactly what you are looking for but when I test on a set I have this does not give a count of the distinct combinations of variables.
proc sql;
create table work.sqlcount2 as
Select Count(distinct(catx(sex,age))) as distinctcount, count(*) as recordcount
from sashelp.class;
quit;
Shows 6 for the distinctcount but there are actually 11 different combinations of sex and age in the data set.
If you use CATX(sex,age) you are just going to get the same values as if you did CATS(age).
The first argument to CATX() is the delimiter string. If there is only one other argument then the delimiter is never inserted.
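To illustrate the fix, here is a minimal sketch (using the HAVE data set posted above): pass an explicit delimiter as the first CATX argument so all three columns actually take part in the key. The '|' delimiter is an assumption; pick any character that cannot occur in the data.

```sas
data keys;
  set have;
  length key $ 40;
  /* first CATX argument is the delimiter string, NOT a key column;
     catx(product,brand,date) would use PRODUCT's value as the delimiter */
  key = catx('|', product, brand, put(date, mmddyy10.));
run;

/* COUNT(DISTINCT key) now counts true 3-way combinations */
proc sql;
  select count(distinct key) as level_count, count(*) as total_count
  from keys;
quit;
```

On the sample data this should return the 11 and 14 the original poster expects, rather than a count driven by a single column.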
Use a procedure whose whole existence is to produce frequencies. Because PROC SQL is a much more general purpose procedure, it likely has time-consuming overhead that you don't need. Because you are counting CATX(COL1,COL2,COL3) you apparently want 3-ways crosstabulations:
proc freq data=tables noprint;
tables col1*col2*col3 / out=freqs missing;
run;
To see what such a table would look like, run
proc freq data=sashelp.cars noprint;
tables model*type*origin / out=freqs missing;
run;
There will be one row per combination (including combinations in which 1 or more variables are missing) with the values of MODEL TYPE ORIGIN, and new variables COUNT and PERCENT.
@mkeintz wrote:
Use a procedure whose whole existence is to produce frequencies. Because PROC SQL is a much more general purpose procedure, it likely has time-consuming overhead that you don't need. Because you are counting CATX(COL1,COL2,COL3) you apparently want 3-ways crosstabulations:
proc freq data=tables noprint;
tables col1*col2*col3 / out=freqs missing;
run;
To see what such a table would look like, run
proc freq data=sashelp.cars noprint;
tables model*type*origin / out=freqs missing;
run;
There will be one row per combination (including combinations in which 1 or more variables are missing) with the values of MODEL TYPE ORIGIN, and new variables COUNT and PERCENT.
Pretty much what I did, only I used SASHELP.CLASS as a smaller data set with only two variables. I asked what @Srigyan was actually attempting to count because his SQL returned a count of six distinct values where there are 11 combinations, so his code seems to be returning a count of the single variable with the largest number of distinct values.
proc sql;
select distinct(catx(sex,age)) as sexage
from sashelp.class;
quit;
Yields:
sexage
11
12
13
14
15
16
The moment you hit large data volumes the coding approach chosen can make a big difference in performance. Knowing your data will help you a lot in making "the right" choice.
For your use case:
1. Is this a table in a database (which one) or is it a SAS table?
2. If a SAS table: Is the table already sorted by col1-col3?
3. What number of distinct combinations of col1-col3 do you expect?
A SQL DISTINCT requires implicit sorting of the data; that's the costly step. I would assume Proc Freq also requires implicit sorting.
IF the expected number of distinct combinations of col1-col3 is much lower than 1 billion, then a data step hash approach would likely perform much better, as it avoids sorting 1 billion rows. But that only works if the distinct combinations of col1-col3 fit into memory.
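The hash idea above could be sketched roughly like this (an untested sketch, assuming the col1-col3 names from the earlier examples and that the distinct key combinations fit in memory): one sequential pass, no sort.

```sas
data _null_;
  /* declare the hash once; it persists across data step iterations */
  if _n_ = 1 then do;
    declare hash h();                  /* one entry per distinct combination */
    h.defineKey('col1','col2','col3');
    h.defineDone();
  end;
  set have end=last;
  total + 1;                           /* running total record count */
  rc = h.add();                        /* add() returns nonzero for duplicates,
                                          so duplicates are simply skipped */
  if last then do;
    levels = h.num_items;              /* number of distinct combinations */
    put 'Total Count=' total 'Level Count=' levels;
    /* write to a data set instead of the log if the numbers are
       needed downstream, as the original poster requested */
  end;
run;
```

Whether this beats PROC FREQ or PROC SORT on 1 billion rows depends on memory and cardinality, so it is worth benchmarking on a sample first.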