turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- SAS Programming
- /
- Base SAS Programming
- /
- Dividing row count of 2 datasets

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

01-23-2016 11:17 AM

I have a Dataset A and Dataset B and both data sets have only 1 variable, ID.

I'm having a hard time trying to do this : (# of rows in Dataset A / # of rows in Dataset B) and then display that answer somehow.

I've tried this but it takes way to long because it involves cartesian product join:

proc sql;

create table ppv as

select count(a.ID)/count(b.ID) as ppv5

from DatasetA as a, DatasetB as b;

quit;

Overall I'm trying to program to calculate ppv after a model is generated.

Accepted Solutions

Solution

01-23-2016
12:55 PM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to FreelanceReinhard

01-23-2016 12:26 PM

proc sql;

create table ppv as

select count(ID)/(select count(ID) from DatasetB) as ppv5

from DatasetA;

quit;

All Replies

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to 1ashg

01-23-2016 12:05 PM

Although it seems uncommon to me to calculate the PPV this way (cf. http://support.sas.com/kb/24/170.html), you could do it as follows:

```
data _null_;
if 0 then set DatasetA nobs=na;
if 0 then set DatasetB nobs=nb;
ppv=na/nb;
put ppv=;
stop; /* Only to avoid NOTE: DATA STEP stopped due to looping. */
run;
```

The result is written to the log.

Solution

01-23-2016
12:55 PM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to FreelanceReinhard

01-23-2016 12:26 PM

proc sql;

create table ppv as

select count(ID)/(select count(ID) from DatasetB) as ppv5

from DatasetA;

quit;

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to FreelanceReinhard

01-23-2016 12:56 PM

Thank you for the reference, I'll have to read through that, it may be much more efficient way to get what we need.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to 1ashg

01-23-2016 04:06 PM

You're welcome.

Please note that the two suggested solutions are not equivalent. You described your requirement as dividing the row counts of two datasets. This is what my suggested code does. The ppv5 value as per @stat_sas's solution, however, is the quotient of the numbers of rows __with non-missing values of ID__* *(see documentation). So, the results are likely to be different if there are one or more missing values in either of the two datasets.

If no missing values are involved, the *results* will be equal, but there is still a remarkable difference:

The data step solution with the SET statements preceded by "if 0" (i.e., a condition which is definitely not met) retrieves the numbers of observations from *header information* (cf. PROC CONTENTS output) already *at compile time. *The SET statements are *not executed*, because the IF conditions are not met. Therefore, the run time of this data step is almost zero, no matter how large the datasets are.

In contrast, the PROC SQL step *must* read through all observations of both datasets, because it has to look into the values of ID in order to check how many of them are missing and hence have to be disregarded in the count (see above). So, the run time of the PROC SQL step can be substantial if the datasets are large.

I've tested it with two datasets of about 7 GB each, stored on a fast SSD of a professional workstation. The PROC SQL step took 58.47 seconds, whereas the data step took between 0.00 and 0.02 seconds, >2000 times faster.

So, if your intention is to count rows rather than non-missing values of a particular variable, the data step solution could be a "much more efficient way to get what" you need.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to FreelanceReinhard

01-25-2016 02:10 AM

What is "if 0" in following code? Will it execute only when there is zero observations?

`if 0 then set DatasetA nobs=na;`

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to Babloo

01-25-2016 06:45 AM

@Babloo: No, the 0 has nothing to do with the number of observations. It stands for the logical (Boolean) value FALSE. You could replace it by any logical expression which is likewise never true, e.g. "if 4=7 then ..." or "if 1+1=3 then ...". Also, "if . then ..." (with the numeric missing value .) would work the same way, because both 0 and . are FALSE when evaluated as logical expressions. But 0 is simply the most "natural" way to express the Boolean value FALSE in SAS and it is evaluated more easily than a compound expression such as "1+1=3".

So, the idea of "if 0 then set ..." is to tell SAS: "Do not execute the SET statement" (in order to save time, because we do not need to read any variable values from the dataset; we are only interested in the number of observations in the dataset and we will get it thanks to the effect of NOBS=... at *compile* time).

It can be regarded as an improved version of

`set DatasetA(obs=1) nobs=na;`

The above SET statement will be executed (there is no IF condition to prevent that), but only one observation will be read from DatasetA (unnecessarily, though). So, run time will also be very low, but probably slightly longer than if SET was not executed at all. (By the way, "obs=0" wouldn't work for our purpose, because in this case the statements following the SET statement would not be executed.)

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to FreelanceReinhard

01-25-2016 07:06 AM

Thank you for detailed explanation. So we're using "if 0 then set ..." only to find out the total number of observations? To understand completely, any possibilities to replace this if clause ("if 0 then set ...") with some other if clause to achive the same task?

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to Babloo

01-25-2016 07:26 AM

The construct "if 0 then set ..." can be used for other purposes as well. See, for example, this older thread and in particular p. 26, section E, of the paper linked there.

As mentioned in my previous post, the zero in "if 0" can be replaced equivalently by any logical expression which (always) evaluates to FALSE. An example which I have seen from time to time is "if 0=1 then set ...". Again, 0=1 is one of the simplest possible false expressions. But "if 0 then set ..." is still shorter, so perhaps more elegant (and programmers are sometimes lazy).