Re: Count distinct for multiple tables

dht115 · Posted 02-21-2023 03:58 PM

Hello,

I need to write a count macro which is calculate distinct value by variable and populate the table. For example, in test1 dataset, I do have total 5 observations but out of 5, 4 are unique. If I run the macro, I need to have final table listing as below.

data test1;
input var1 var2 $;
datalines;
001
001
002
003
004
;
run;

data test2;
input var1 var2 $;
datalines;
001
001
002
002
004
;
run;

ballardw · Posted 02-21-2023 05:15 PM

First, make sure that any data step you provide for example data runs correctly. Both of your examples throw errors. Are you sure there should be an input for Var2 at all? Note that the number of observations is not matching your claim of "For example, in test1 dataset, I do have total 5 observations but out of 5, 4 are unique" but only 2 are in the output because of the error.

56   data test1;
57   input var1 var2 $;
58   datalines;

NOTE: LOST CARD.
RULE:      ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+--
64         ;
var1=4 var2=  _ERROR_=1 _N_=3
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.TEST1 has 2 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds


64   ;
65   run;
66
67   data test2;
68   input var1 var2 $;
69   datalines;

NOTE: LOST CARD.
RULE:      ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+--
75         ;
var1=4 var2=  _ERROR_=1 _N_=3
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.TEST2 has 2 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds


75   ;
76   run;

Consider this code:

data test1;
input var1;
datalines;
001
001
002
003
004
;
run;

data test2;
input var1;
datalines;
001
001
002
002
004
;
run;

data helpful;
   set test1 test2 indsname=source;
   table_name=source;
run;

proc freq data=helpful nlevels;
   ods output nlevels=want;
   by notsorted table_name;
   tables var1;
run;

This reads the data in a manner that seems to match your description.

The data step combines the data and adds a variable to the data that has the original dataset name. If you don't like the library part that is an exercise for the interested reader to remove that from the Table_name variable. The Proc Freq has an option, Nlevels that actually counts the unique levels of each variable. The BY means that there is an output record in the set Want for each table_name value. Renaming the NLevels variable to Count and dropping the other variables from the Want set are also nearly trivial exercises.

Note that if your data sets actually have more than one variable you add them to the Tables statement and you will have a count for each.

Note, no macro code needed at all.

PaigeMiller · Posted 02-21-2023 05:20 PM

No macro needed, and probably very inefficient if you did write one.

Combine all your data sets into one long data set. Then run PROC FREQ.

data test1;
input var1 $;
datalines;
001
001
002
003
004
;


data test2;
input var1 $;
datalines;
001
001
002
002
004
;

data combined;
    set test1 test2 indsname=indsname;
    table=indsname;
run;

proc freq data=combined nlevels;
    ods output nlevels=nlevels;
	by table notsorted;
    tables var1/noprint;
run;

The data set NLEVELS is what you want.

--
Paige Miller

Count distinct for multiple tables

Re: Count distinct for multiple tables

Re: Count distinct for multiple tables

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away