topic Re: Dividing row count of 2 datasets in SAS Programming

Dividing row count of 2 datasets

1ashg — Sat, 23 Jan 2016 16:17:22 GMT

I have a Dataset A and Dataset B and both data sets have only 1 variable, ID.

I'm having a hard time trying to do this : (# of rows in Dataset A / # of rows in Dataset B) and then display that answer somehow.

I've tried this but it takes way to long because it involves cartesian product join:

proc sql;
create table ppv as
select count(a.ID)/count(b.ID) as ppv5
from DatasetA as a, DatasetB as b;
quit;

Overall I'm trying to program to calculate ppv after a model is generated.

Re: Dividing row count of 2 datasets

FreelanceReinh — Sat, 23 Jan 2016 17:05:31 GMT

Although it seems uncommon to me to calculate the PPV this way (cf. http://support.sas.com/kb/24/170.html), you could do it as follows:

data _null_;
if 0 then set DatasetA nobs=na;
if 0 then set DatasetB nobs=nb;
ppv=na/nb;
put ppv=;
stop; /* Only to avoid NOTE: DATA STEP stopped due to looping. */
run;

The result is written to the log.

Re: Dividing row count of 2 datasets

stat_sas — Sat, 23 Jan 2016 17:26:04 GMT

proc sql;

create table ppv as
select count(ID)/(select count(ID) from DatasetB) as ppv5
from DatasetA;
quit;

Re: Dividing row count of 2 datasets

1ashg — Sat, 23 Jan 2016 17:56:29 GMT

Thank you for the reference, I'll have to read through that, it may be much more efficient way to get what we need.

Re: Dividing row count of 2 datasets

FreelanceReinh — Sat, 23 Jan 2016 21:06:09 GMT

You're welcome.

Please note that the two suggested solutions are not equivalent. You described your requirement as dividing the row counts of two datasets. This is what my suggested code does. The ppv5 value as per @stat_sas's solution, however, is the quotient of the numbers of rows with non-missing values of ID (see documentation). So, the results are likely to be different if there are one or more missing values in either of the two datasets.

If no missing values are involved, the results will be equal, but there is still a remarkable difference:

The data step solution with the SET statements preceded by "if 0" (i.e., a condition which is definitely not met) retrieves the numbers of observations from header information (cf. PROC CONTENTS output) already at compile time. The SET statements are not executed, because the IF conditions are not met. Therefore, the run time of this data step is almost zero, no matter how large the datasets are.

In contrast, the PROC SQL step must read through all observations of both datasets, because it has to look into the values of ID in order to check how many of them are missing and hence have to be disregarded in the count (see above). So, the run time of the PROC SQL step can be substantial if the datasets are large.

I've tested it with two datasets of about 7 GB each, stored on a fast SSD of a professional workstation. The PROC SQL step took 58.47 seconds, whereas the data step took between 0.00 and 0.02 seconds, >2000 times faster.

So, if your intention is to count rows rather than non-missing values of a particular variable, the data step solution could be a "much more efficient way to get what" you need.

Re: Dividing row count of 2 datasets

Babloo — Mon, 25 Jan 2016 07:10:15 GMT

What is "if 0" in following code? Will it execute only when there is zero observations?

if 0 then set DatasetA nobs=na;

Re: Dividing row count of 2 datasets

FreelanceReinh — Mon, 25 Jan 2016 11:45:07 GMT

@Babloo: No, the 0 has nothing to do with the number of observations. It stands for the logical (Boolean) value FALSE. You could replace it by any logical expression which is likewise never true, e.g. "if 4=7 then ..." or "if 1+1=3 then ...". Also, "if . then ..." (with the numeric missing value .) would work the same way, because both 0 and . are FALSE when evaluated as logical expressions. But 0 is simply the most "natural" way to express the Boolean value FALSE in SAS and it is evaluated more easily than a compound expression such as "1+1=3".

So, the idea of "if 0 then set ..." is to tell SAS: "Do not execute the SET statement" (in order to save time, because we do not need to read any variable values from the dataset; we are only interested in the number of observations in the dataset and we will get it thanks to the effect of NOBS=... at compile time).

It can be regarded as an improved version of

set DatasetA(obs=1) nobs=na;

The above SET statement will be executed (there is no IF condition to prevent that), but only one observation will be read from DatasetA (unnecessarily, though). So, run time will also be very low, but probably slightly longer than if SET was not executed at all. (By the way, "obs=0" wouldn't work for our purpose, because in this case the statements following the SET statement would not be executed.)

Re: Dividing row count of 2 datasets

Babloo — Mon, 25 Jan 2016 12:06:18 GMT

Thank you for detailed explanation. So we're using "if 0 then set ..." only to find out the total number of observations? To understand completely, any possibilities to replace this if clause ("if 0 then set ...") with some other if clause to achive the same task?

Re: Dividing row count of 2 datasets

FreelanceReinh — Mon, 25 Jan 2016 12:26:17 GMT

The construct "if 0 then set ..." can be used for other purposes as well. See, for example, this older thread and in particular p. 26, section E, of the paper linked there.

As mentioned in my previous post, the zero in "if 0" can be replaced equivalently by any logical expression which (always) evaluates to FALSE. An example which I have seen from time to time is "if 0=1 then set ...". Again, 0=1 is one of the simplest possible false expressions. But "if 0 then set ..." is still shorter, so perhaps more elegant (and programmers are sometimes lazy).