Re: Using hash objects for selecting data

deleted_user · Posted 01-28-2008 10:49 AM

Hi gyes

I have a table with millions of records.
For example fields: a, b, c, d, e, f, g.
I need to select a total count of 'a' and total sum of 'b' for each 'c'.
SQL is not the best solution when we have many records.
So, here I should use hash objects.
Please, post your suggestions. Probably, somebody can post a code, that solves similar problem.

Cynthia_sas · Posted 01-28-2008 12:04 PM

Hi:
I always find it useful to see what the standard SAS procedures will do before I code something as an SQL query. For example, given this data:
[pre]
data testdata;
infile datalines;
input a $ b c $ d e;
return;
datalines;
a1 11 c1 10 5
a1 12 c1 10 5
a1 13 c1 10 5
a2 14 c1 10 5
a2 15 c2 10 5
a2 16 c2 10 5
a3 17 c2 10 5
a3 18 c3 10 5
a4 19 c3 10 5
a4 20 c3 10 5
;
run;
[/pre]
in a SAS dataset called TESTDATA, I can run this code:
[pre]

ods listing;
options nodate nonumber nocenter;
proc freq data=testdata;
tables a c;
run;

proc means data=testdata sum n;
class c;
var b;
run;
[/pre]
and get these results from PROC FREQ:
[pre]
The FREQ Procedure

Cumulative Cumulative
a Frequency Percent Frequency Percent
-------------------------------------------------------
a1 3 30.00 3 30.00
a2 3 30.00 6 60.00
a3 2 20.00 8 80.00
a4 2 20.00 10 100.00

Cumulative Cumulative
c Frequency Percent Frequency Percent
-------------------------------------------------------
c1 4 40.00 4 40.00
c2 3 30.00 7 70.00
c3 3 30.00 10 100.00
[/pre]

and these results from PROC MEANS:
[pre]

Analysis Variable : b

N
c Obs Sum N
-------------------------------------
c1 4 50.0000000 4

c2 3 48.0000000 3

c3 3 57.0000000 3
-------------------------------------
[/pre]
or you could do it with PROC TABULATE:
[pre]
options formchar='|-++++++++';
proc tabulate data=testdata f=comma6.;
class a c;
var b;
table a all,n pctn;
table c,b*sum;
keylabel n = 'Count'
sum = 'Total';
run;
[/pre]
which results in these 2 output tables:
[pre]
+----------------------+------+------+
| |Count | PctN |
+----------------------+------+------+
|a | | |
+----------------------+ | |
|a1 | 3| 30|
+----------------------+------+------+
|a2 | 3| 30|
+----------------------+------+------+
|a3 | 2| 20|
+----------------------+------+------+
|a4 | 2| 20|
+----------------------+------+------+
|All | 10| 100|
+----------------------+------+------+

+----------------------+------+
| | b |
| +------+
| |Total |
+----------------------+------+
|c | |
+----------------------+ |
|c1 | 50|
+----------------------+------+
|c2 | 48|
+----------------------+------+
|c3 | 57|
+----------------------+------+

[/pre]
There's a good chance that your company already has procedures that they prefer to use for these kinds of reports or standards about which procedures should be used for tasks like this. You might consider looking for documentation about how your company produces reports with SAS.

cynthia

deleted_user · Posted 01-31-2008 05:11 AM

This is perfect.

How can do with the help of proc means this select:
select c, count(c) from x group by c?

Do proc freq and proc means use indexes?

Cynthia_sas · Posted 01-31-2008 11:16 AM

Hi:
It really depends on your data and the kind of analysis or summary you need, whether PROC MEANS is the right procedure for your task. Is your variable C a character variable or a numeric variable??? PROC MEANS needs a Numeric variable to calculate statistics. If the variable C is a character variable, as shown above, then you could use PROC FREQ, or PROC TABULATE or PROC REPORT or PROC SQL to get the count of distinct values of C. Using the above data as an example:
[pre]
options nodate nonumber nocenter;
proc freq data=testdata;
title 'Proc Freq';
tables c /nocum nopercent;
run;

title 'Proc SQL';
proc sql;
select c, count(c)
from testdata
group by c;
quit;

[/pre]

Both procedures yield the same results:
[pre]
Proc Freq

c Frequency
---------------
c1 4
c2 3
c3 3
*** *** *** *** *** *** ***
Proc SQL

c
------------------
c1 4
c2 3
c3 3

[/pre]

The documentation on PROC FREQ and PROC MEANS and some of the other basic SAS procedures will help you understand how they work and which one to use for what kind of data.

If you look in the SAS documentation, go to the Table of Contents and then follow this click path:
Contents --> Base SAS --> SAS Procedures -->Concepts --> Choosing the Right Procedure

If you need more help, you might consider contacting Tech Support or searching the Tech Support site for examples or looking for books that talk about SAS fundamental concepts and which procedures to use for different types of analysis.

In particular, for the Pharmaceutical Industry, you might find this book to be useful:
SAS® Programming in the Pharmaceutical Industry
by Jack Shostak

Publisher: SAS Publishing
Print ISBN-10: 1-59047-793-6
Print ISBN-13: 978-1-59047-793-9

You can find more books on the subject by visiting the SAS Press web site and using the Search feature.

cynthia

deleted_user · Posted 01-31-2008 11:24 AM

Thanks, I will take a look at this.
My 'c' is numeric.

Cynthia_sas · Posted 02-01-2008 01:26 PM

Hi:
If you have a numeric variable, then proc means will give you 5 default statistics:
N, MEAN, Std Dev, Minimum and Maximum...to get those statistics for your numeric variable, then you would do:
[pre]
proc means data=testdata ;
var c;
run;
[/pre]

If you had some variable (like Country) and you wanted to see those default statistics for variable C for each COUNTRY value, then you would do this:
[pre]
proc means data=testdata ;
class country;
var c;
run;
[/pre]

Reading some of the introductory material on the basic SAS procedures will be very useful to you. It is possible to get these statistics into a SAS data set so you can use them for further analysis or reporting. The documentation covers this ability in detail.

cynthia

deleted_user · Posted 02-05-2008 09:58 AM

Hi
Can you please help me to create data step, which will perform the action, like this SQL Query:
create table testtable as (select K.n, sum(P.b), count(P.a)
from P right join K on K.a = P.a
where K.n in (11, 12, 17, 21)
group by K.n);
I am not very good in datastep. I wrote this query, but it works a little long.
I hope that data step will perform faster.
Also it could be great to see the same query with using hash objects.

Doc_Duke · Posted 02-06-2008 10:33 AM

Hippo,

I doubt you'll save any time in in the second question (and better to start a new thread for a new question), by using a DATA step. You will need to sort both data sets (doing the K subsetting in the SORT), and then use a data step to merge them (as I showed code for earlier).

Then you can use MEANS to output the statistics (if both variables are numeric) or TABULATE if they are a mix.

Alternately, you could use RETAIN coding in the data step to aggregate the statistics.

These are moderately advanced topics, so you may want to read up in either

Aster & Seidman's "Professional SAS Programming secrets" or
Jaffe's "Mastering the SAS System"
(both were available from the SAS publications web site).

Doc Muhbaier
Duke

SAS Innovate 2025: Call for Content

Health and Life Sciences Learning