I have a table with millions of records.
For example fields: a, b, c, d, e, f, g.
I need to select a total count of 'a' and total sum of 'b' for each 'c'.
SQL is not the best solution when we have many records.
So, here I should use hash objects.
Please, post your suggestions. Probably, somebody can post a code, that solves similar problem.
I always find it useful to see what the standard SAS procedures will do before I code something as an SQL query. For example, given this data:
input a $ b c $ d e;
a1 11 c1 10 5
a1 12 c1 10 5
a1 13 c1 10 5
a2 14 c1 10 5
a2 15 c2 10 5
a2 16 c2 10 5
a3 17 c2 10 5
a3 18 c3 10 5
a4 19 c3 10 5
a4 20 c3 10 5
in a SAS dataset called TESTDATA, I can run this code:
There's a good chance that your company already has procedures that they prefer to use for these kinds of reports or standards about which procedures should be used for tasks like this. You might consider looking for documentation about how your company produces reports with SAS.
It really depends on your data and the kind of analysis or summary you need, whether PROC MEANS is the right procedure for your task. Is your variable C a character variable or a numeric variable??? PROC MEANS needs a Numeric variable to calculate statistics. If the variable C is a character variable, as shown above, then you could use PROC FREQ, or PROC TABULATE or PROC REPORT or PROC SQL to get the count of distinct values of C. Using the above data as an example:
options nodate nonumber nocenter;
proc freq data=testdata;
title 'Proc Freq';
tables c /nocum nopercent;
title 'Proc SQL';
select c, count(c)
group by c;
Both procedures yield the same results:
*** *** *** *** *** *** ***
The documentation on PROC FREQ and PROC MEANS and some of the other basic SAS procedures will help you understand how they work and which one to use for what kind of data.
If you look in the SAS documentation, go to the Table of Contents and then follow this click path:
Contents --> Base SAS --> SAS Procedures -->Concepts --> Choosing the Right Procedure
If you need more help, you might consider contacting Tech Support or searching the Tech Support site for examples or looking for books that talk about SAS fundamental concepts and which procedures to use for different types of analysis.
In particular, for the Pharmaceutical Industry, you might find this book to be useful:
SAS® Programming in the Pharmaceutical Industry
by Jack Shostak
Publisher: SAS Publishing
Print ISBN-10: 1-59047-793-6
Print ISBN-13: 978-1-59047-793-9
You can find more books on the subject by visiting the SAS Press web site and using the Search feature.
If you have a numeric variable, then proc means will give you 5 default statistics:
N, MEAN, Std Dev, Minimum and Maximum...to get those statistics for your numeric variable, then you would do:
proc means data=testdata ;
If you had some variable (like Country) and you wanted to see those default statistics for variable C for each COUNTRY value, then you would do this:
proc means data=testdata ;
Reading some of the introductory material on the basic SAS procedures will be very useful to you. It is possible to get these statistics into a SAS data set so you can use them for further analysis or reporting. The documentation covers this ability in detail.
Can you please help me to create data step, which will perform the action, like this SQL Query: create table testtable as (select K.n, sum(P.b), count(P.a)
from P right join K on K.a = P.a
where K.n in (11, 12, 17, 21)
group by K.n);
I am not very good in datastep. I wrote this query, but it works a little long.
I hope that data step will perform faster.
Also it could be great to see the same query with using hash objects.
I doubt you'll save any time in in the second question (and better to start a new thread for a new question), by using a DATA step. You will need to sort both data sets (doing the K subsetting in the SORT), and then use a data step to merge them (as I showed code for earlier).
Then you can use MEANS to output the statistics (if both variables are numeric) or TABULATE if they are a mix.
Alternately, you could use RETAIN coding in the data step to aggregate the statistics.
These are moderately advanced topics, so you may want to read up in either
Aster & Seidman's "Professional SAS Programming secrets" or
Jaffe's "Mastering the SAS System"
(both were available from the SAS publications web site).