It's more of an optimization question. I have a table with 1 billion records, and I want to check whether the total record count matches the distinct count based on 3 different variables. I have a query that runs great on small data, but it takes a long time on huge data like 1 billion records.
For example:
| product | brand | date       |
|---------|-------|------------|
| 1       | b1    | 01/10/2013 |
| 1       | b3    | 03/10/2013 |
| 1       | b3    | 03/10/2013 |
| 1       | b4    | 04/10/2013 |
| 2       | b1    | 01/10/2013 |
| 2       | b2    | 01/10/2013 |
| 2       | b3    | 01/10/2013 |
| 2       | b4    | 01/10/2013 |
| 1       | b1    | 01/10/2013 |
| 4       | b1    | 02/10/2013 |
| 5       | b3    | 03/10/2013 |
| 5       | b3    | 03/10/2013 |
| 6       | b4    | 03/10/2013 |
| 6       | b4    | 04/10/2013 |
The query is:
Proc SQL;
Select Count(distinct(catx(product,brand,date))),count(*) from table1;
quit;
I am using CATX because if there are any spaces in any of these 3 columns, they will be removed.
The above query gives the output 11 and 14.
Is there any way to optimise this query so I can run it on huge data?
---
"I have a table with 1b record"
1 billion????
and proc sql?
Blimey!!!!!!!
---
I edited my question with an additional example; please check and share your answer. Thanks for your effort.
---
Please make it easier for us to run tests by providing example data in a readily usable form; see my example below:
data have;
infile datalines dlm='09'x;
input product brand $ date :mmddyy10.;
format date mmddyy10.;
datalines;
1 b1 01/10/2013
1 b3 03/10/2013
1 b3 03/10/2013
1 b4 04/10/2013
2 b1 01/10/2013
2 b2 01/10/2013
2 b3 01/10/2013
2 b4 01/10/2013
1 b1 01/10/2013
4 b1 02/10/2013
5 b3 03/10/2013
5 b3 03/10/2013
6 b4 03/10/2013
6 b4 04/10/2013
;
run;
proc sort
data=have
out=test
dupout=duplicates
nodupkey
;
by product brand date;
run;
The log from that:
27   data have;
28      infile datalines dlm='09'x;
29      input product brand $ date :mmddyy10.;
30      format date mmddyy10.;
31   datalines;
NOTE: The data set WORK.HAVE has 14 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.00 seconds
46   ;
47   run;
48
49   proc sort
50      data=have
51      out=test
52      dupout=duplicates
53      nodupkey
54   ;
55      by product brand date;
56   run;
NOTE: There were 14 observations read from the data set WORK.HAVE.
NOTE: 3 observations with duplicate key values were deleted.
NOTE: The data set WORK.TEST has 11 observations and 3 variables.
NOTE: The data set WORK.DUPLICATES has 3 observations and 3 variables.
---
I will create the data set going forward. I need these numbers in a table so I can update them somewhere else; I can't check the log for this.
They will be updated in another table, where it will say:
Total Count=14
Level count=11
---
You could just use a sub-query in PROC SQL.
Into a table:
proc sql noprint;
create table want as
select sum(nobs) as Total_count,count(*) as Level_count
from
(select col1,col2,col3,count(*) as nobs
from have
group by col1,col2,col3
)
;
quit;
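With the 14-row example above, WANT would contain Total_count=14 and Level_count=11: the inner query collapses the data to one row per distinct combination of the three columns, so the outer COUNT(*) yields the level count, and summing the per-group NOBS frequencies restores the total row count.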
Or into macro variables:
proc sql noprint;
select sum(nobs) format=32.,count(*) format=32.
into :total_count trimmed, :level_count trimmed
from
(select col1,col2,col3,count(*) as nobs
from have
group by col1,col2,col3
)
;
quit;
%put Total Count = %sysfunc(putn(&total_count,comma32.-L));
%put Level Count = %sysfunc(putn(&level_count,comma32.-L));
---
@Srigyan wrote:
I will create the data set going forward. I need these numbers in a table so I can update them somewhere else; I can't check the log for this.
They will be updated in another table, where it will say:
Total Count=14
Level count=11
You can pull the resulting counts from the output of the PROC SORT via dictionary.tables:
proc sql noprint;
select nobs into :distobs from dictionary.tables where libname = 'WORK' and memname = 'TEST';
select nobs into :dupobs from dictionary.tables where libname = 'WORK' and memname = 'DUPLICATES';
select nobs into :allobs from dictionary.tables where libname = 'WORK' and memname = 'HAVE';
quit;
%put allobs=&allobs. distobs=&distobs. dupobs=&dupobs.;
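With the example data above, this reports 14, 11, and 3 observations for HAVE, TEST, and DUPLICATES respectively.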
Or create a table:
proc sql;
create table want as
select "Total Count" as description, nobs as count
from dictionary.tables
where libname = 'WORK' and memname = 'HAVE'
union all
select "Level Count" as description, nobs as count
from dictionary.tables
where libname = 'WORK' and memname = 'TEST'
;
quit;
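With the example data, WANT would contain two rows: "Total Count" with 14 and "Level Count" with 11.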
You can now compare the performance of the suggested duplicate-detection methods and select which one to use.
---
One simple rule: if datasets are large, don't use proc sql at all.
---
@andreas_lds Bingo! I like that!
---
I edited my question with an additional example; please check and share your answer. Thanks for your effort.
---
@Srigyan wrote:
It's more of an optimization question. I have a table with 1 billion records, and I want to check whether the total record count matches the distinct count based on 3 different variables. I have a query that runs great on small data, but it takes a long time on huge data like 1 billion records.
The query is:
proc sql;
Select Count(distinct(catx(col1,col2,col3))),count(*) from table1;
quit;
Is there any way to optimise this query so I can run it on huge data?
Can you provide a small example, maybe 15 records or so, of the data this "works" on correctly?
I am not sure I understand exactly what you are looking for, but when I test on a data set I have, this does not give a count of the distinct combinations of variables.
proc sql;
   create table work.sqlcount2 as
   select count(distinct(catx(sex,age))) as distinctcount,
          count(*) as recordcount
   from sashelp.class;
quit;
This shows 6 for distinctcount, but there are actually 11 different combinations of sex and age in the data set.
---
If you use CATX(sex,age) you are just going to get the same values as if you did CATS(age).
The first argument to CATX() is the delimiter string; if there is only one other argument, the delimiter is never inserted.
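A minimal sketch of the difference, using the two SASHELP.CLASS variables from the example above (the '|' delimiter is an arbitrary choice):
data _null_;
   set sashelp.class(obs=1);
   /* WRONG: SEX is consumed as the delimiter string, so the result
      is just the formatted value of AGE */
   bad = catx(sex, age);
   /* RIGHT: pass an explicit delimiter first, then all key columns */
   good = catx('|', sex, age);
   put bad= good=;
run;
For the first row of SASHELP.CLASS this prints bad=14 good=M|14, which is why the distinct count collapses to the number of distinct ages.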
---
Use a procedure whose whole purpose is to produce frequencies. Because PROC SQL is a much more general-purpose procedure, it likely has time-consuming overhead that you don't need. Because you are counting CATX(COL1,COL2,COL3), you apparently want a 3-way crosstabulation:
proc freq data=tables noprint;
tables col1*col2*col3 / out=freqs missing;
run;
To see what such a table would look like, run
proc freq data=sashelp.cars noprint;
tables model*type*origin / out=freqs missing;
run;
There will be one row per combination (including combinations in which one or more variables are missing) with the values of MODEL, TYPE, and ORIGIN, and new variables COUNT and PERCENT.
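If you go this route, the two numbers from the original question can then be read from the (now small) FREQS table in one cheap extra step. A sketch, assuming the OUT=FREQS data set produced above:
proc sql;
   create table want as
   select sum(f.count) as Total_count,  /* sum of cell frequencies = total rows */
          count(*)     as Level_count   /* one row per distinct combination */
   from freqs as f;
quit;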
---
@mkeintz wrote:
Use a procedure whose whole purpose is to produce frequencies. Because PROC SQL is a much more general-purpose procedure, it likely has time-consuming overhead that you don't need. Because you are counting CATX(COL1,COL2,COL3), you apparently want a 3-way crosstabulation:
proc freq data=tables noprint;
tables col1*col2*col3 / out=freqs missing;
run;
To see what such a table would look like, run
proc freq data=sashelp.cars noprint;
tables model*type*origin / out=freqs missing;
run;
There will be one row per combination (including combinations in which one or more variables are missing) with the values of MODEL, TYPE, and ORIGIN, and new variables COUNT and PERCENT.
Pretty much what I did, only I used SASHELP.CLASS as a smaller data set with only two variables. I asked what @Srigyan was actually attempting to count because his SQL returned a distinct count of six where there are 11 combinations, so his code seems to be returning a count of the variable with the single largest number of distinct values.
proc sql;
   select distinct(catx(sex,age)) as sexage
   from sashelp.class;
quit;
Yields:
sexage
------
11
12
13
14
15
16
---
The moment you hit large data volumes, the coding approach you choose can make a big difference in performance. Knowing your data will help you a lot in making "the right" choice.
For your use case:
1. Is this a table in a database (which one) or is it a SAS table?
2. If a SAS table: Is the table already sorted by col1-col3?
3. What number of distinct combinations of col1-col3 do you expect?
A SQL DISTINCT requires implicit sorting of the data; that's the costly step. I would assume PROC FREQ also requires implicit sorting.
If the expected number of distinct combinations of col1-col3 is much lower than 1 billion, then a data step hash approach would likely perform much better, as it avoids sorting 1 billion rows. But that only works if the distinct combinations of col1-col3 fit into memory.
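A minimal sketch of that hash approach, using the COL1-COL3 names from the question; HAVE stands in for the real 1-billion-row table, and this is untested at that scale:
data _null_;
   /* one sequential pass, no sorting: each distinct key is stored once */
   if _n_ = 1 then do;
      declare hash h();
      h.defineKey('col1', 'col2', 'col3');
      h.defineDone();
   end;
   set have end=last;
   total + 1;                        /* running count of all rows */
   if h.add() = 0 then levels + 1;   /* ADD() returns 0 only for a new key */
   if last then do;
      call symputx('total_count', total);
      call symputx('level_count', levels);
   end;
run;
The two macro variables can then be written to a tracking table in the same way as in the PROC SQL examples earlier in the thread.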