BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Srigyan
Quartz | Level 8

How to create a dataset with number of 

1) Missing

2) Non-Missing

3) Unique value

 

in all column of a dataset, irespective if the filed is numeric or char.

 

column nameMissingNon-MissingUnique Values
Colum1283
Colum2734
Colum31006
Colum40103
Colum5199
1 ACCEPTED SOLUTION

Accepted Solutions
gamotte
Rhodochrosite | Level 12

Hello,

 

data have;
    input x $ y;
    cards;
1 .
2 6
. 6
4 8
. 10
6 10
7 .
. 16
9 16
. 20
;
run;

data _NULL_;
    set have;
    length _NAME $32.;
    call execute('proc sql noprint; CREATE TABLE counts AS SELECT ');
    do until (_NAME='_NAME');
        i+1;
        call vnext(_NAME);
        if _NAME ne '_NAME' then
            call execute(cats('"', _NAME, '" AS var_name',i,', sum(missing(', strip(_NAME),')) AS nmiss', i,
                         ', (count(*)-CALCULATED nmiss', i,') AS nnonmiss', i, 
                         ', count(DISTINCT '|| _NAME, ') AS nuniq', i,','));
    end;
    call execute('1 AS kludge FROM have; quit;');
    stop;
run;

data want;
    set counts;
    keep variable missing nonmissing unique;
    array vars (*) var_name:;
    array miss (*) nmiss:;
    array nonmiss (*) nnonmiss:;
    array uni (*) nuniq:;

    do i=1 to dim(miss);
        variable=vars(i);
        missing=miss(i);
        nonmissing=nonmiss(i);
        unique=uni(i);
        output;
    end;
run;


Edit : added retrieval of variable names in the sql query.

View solution in original post

12 REPLIES 12
Kurt_Bremser
Super User

Use a data step run from sashelp.vcolumn that creates the necessary SQL with call execute:

%let library=SASHELP;
%let memname=CLASS;

data _null_;
set sashelp.vcolumn (where=(libname = "&library." and memname = "&memname.")) end=eof;
if _n_ = 1 then call execute("
proc sql;
create table want as
");
if _n_ > 1 then call execute(" union all ");
call execute("
select
  '" !! strip(name) !! "' as column_name,
  sum(case when " !! strip(name) !! " is missing then 1 else 0 end) as missing,
  sum(case when " !! strip(name) !! " is missing then 0 else 1 end) as non_missing,
  count(distinct " !! strip(name) !! ") as unique_values 
from sashelp.class
");
if eof then call execute("
;
quit;
");
run;
Srigyan
Quartz | Level 8

it keeps running for long time even for a small out. Can't see if this is giving any result or not. Please help...

Kurt_Bremser
Super User

@Srigyan wrote:

it keeps running for long time even for a small out. Can't see if this is giving any result or not. Please help...


Run it with a small subset (like obs=100) first; keep in mind that the number of columns is important, as it determines the number of passes through the dataset.

My test with sashelp.class took less than a tenth of a second here.

 

Srigyan
Quartz | Level 8

not working...taking too much time

 

data abc;
set sashelp.class(obs=10);
run;

%let library=work;
%let memname=abc;

data _null_;
set sashelp.vcolumn (where=(libname = "&library." and memname = "&memname.")) end=eof;
if _n_ = 1 then call execute("
proc sql;
create table want as
");
if _n_ > 1 then call execute(" union all ");
call execute("
select
  '" !! strip(name) !! "' as column_name,
  sum(case when " !! strip(name) !! " is missing then 1 else 0 end) as missing,
  sum(case when " !! strip(name) !! " is missing then 0 else 1 end) as non_missing,
  count(distinct " !! strip(name) !! ") as unique_values 
from sashelp.class
");
if eof then call execute("
;
quit;
");
run;
gamotte
Rhodochrosite | Level 12

Hello,

 

data have;
    input x $ y;
    cards;
1 .
2 6
. 6
4 8
. 10
6 10
7 .
. 16
9 16
. 20
;
run;

data _NULL_;
    set have;
    length _NAME $32.;
    call execute('proc sql noprint; CREATE TABLE counts AS SELECT ');
    do until (_NAME='_NAME');
        i+1;
        call vnext(_NAME);
        if _NAME ne '_NAME' then
            call execute(cats('"', _NAME, '" AS var_name',i,', sum(missing(', strip(_NAME),')) AS nmiss', i,
                         ', (count(*)-CALCULATED nmiss', i,') AS nnonmiss', i, 
                         ', count(DISTINCT '|| _NAME, ') AS nuniq', i,','));
    end;
    call execute('1 AS kludge FROM have; quit;');
    stop;
run;

data want;
    set counts;
    keep variable missing nonmissing unique;
    array vars (*) var_name:;
    array miss (*) nmiss:;
    array nonmiss (*) nnonmiss:;
    array uni (*) nuniq:;

    do i=1 to dim(miss);
        variable=vars(i);
        missing=miss(i);
        nonmissing=nonmiss(i);
        unique=uni(i);
        output;
    end;
run;


Edit : added retrieval of variable names in the sql query.

Srigyan
Quartz | Level 8
%let have=sd.customer;

data _NULL_;
    set &have.;
    call execute('proc sql noprint; CREATE TABLE counts AS SELECT ');
    do until (NAME='NAME');
        call vnext(NAME);
        if NAME ne 'NAME' then
            call execute(cat('sum(missing(', NAME,')) AS nmiss',NAME,', count(*)-CALCULATED nmiss', NAME,' AS nnonmiss', NAME, ', count(DISTINCT ', NAME, ') AS nuniq', NAME,','));
        end;
call execute('1 AS kludge FROM &have.; quit;');
stop;
run;

Giving me following error:

 

NOTE: CALL EXECUTE generated line.
1         + proc sql noprint;
1         +                   CREATE TABLE counts AS SELECT
2         + sum(missing(iden)) AS nmissiden, count(*)-CALCULATED nmissiden AS nnonmissiden, count(DISTINCT iden) AS nuniqiden,
3         + sum(missing(titl)) AS nmisstitl, count(*)-CALCULATED nmisstitl AS nnonmisstitl, count(DISTINCT titl) AS nuniqtitl,
4         + sum(missing(prin)) AS nmissprin, count(*)-CALCULATED nmissprin AS nnonmissprin, count(DISTINCT prin) AS nuniqprin,
5         + sum(missing(birt)) AS nmissbirt, count(*)-CALCULATED nmissbirt AS nnonmissbirt, count(DISTINCT birt) AS nuniqbirt,
6         + sum(missing(cred)) AS nmisscred, count(*)-CALCULATED nmisscred AS nnonmisscred, count(DISTINCT cred) AS nuniqcred,
7         + sum(missing(stat)) AS nmissstat, count(*)-CALCULATED nmissstat AS nnonmissstat, count(DISTINCT stat) AS nuniqstat,
8         + sum(missing(post)) AS nmisspost, count(*)-CALCULATED nmisspost AS nnonmisspost, count(DISTINCT post) AS nuniqpost,
9         + sum(missing(date)) AS nmissdate, count(*)-CALCULATED nmissdate AS nnonmissdate, count(DISTINCT date) AS nuniqdate,
10        + sum(missing(date)) AS nmissdate, count(*)-CALCULATED nmissdate AS nnonmissdate, count(DISTINCT date) AS nuniqdate,
11        + sum(missing(date)) AS nmissdate, count(*)-CALCULATED nmissdate AS nnonmissdate, count(DISTINCT date) AS nuniqdate,
12        + sum(missing(date)) AS nmissdate, count(*)-CALCULATED nmissdate AS nnonmissdate, count(DISTINCT date) AS nuniqdate,
13        + sum(missing(open)) AS nmissopen, count(*)-CALCULATED nmissopen AS nnonmissopen, count(DISTINCT open) AS nuniqopen,
14        + sum(missing(LTIM)) AS nmissLTIM, count(*)-CALCULATED nmissLTIM AS nnonmissLTIM, count(DISTINCT LTIM) AS nuniqLTIM,
15        + sum(missing(LTIM)) AS nmissLTIM, count(*)-CALCULATED nmissLTIM AS nnonmissLTIM, count(DISTINCT LTIM) AS nuniqLTIM,
16        + sum(missing(LTIM)) AS nmissLTIM, count(*)-CALCULATED nmissLTIM AS nnonmissLTIM, count(DISTINCT LTIM) AS nuniqLTIM,
17        + 1 AS kludge FROM sd.customer;
ERROR: The following columns were not found in the contributing tables: birt, cred, date, iden, LTIM, open, post, prin, stat, titl.

 

gamotte
Rhodochrosite | Level 12
Add the instruction :
length NAME $32.;
after "set have;".

Note that 32 is the maximal length for variable names and the sql procedure creates new variables names by addiing prefixes to existiong variables names. If your dataset has variables with very long names you may have to adapt the program.
Srigyan
Quartz | Level 8

Made small change in substr

 

variable=substr(vname(miss(i)),6,length(vname(miss(i))));

 

now its working.

gamotte
Rhodochrosite | Level 12

I have edited my initial post to avoid problems with long variables names.

ballardw
Super User

The easiest way I know to get a count of unique levels of a variable is:

 

ods select nlevels;
proc freq data=have nlevels;
ods output nlevels=work.levels;
run;

The output dataset work.levels will have the name of the variable in a variable named TableVar and the variable NNonMissLevels will have the number of unique non-missing values. If NNonMissLevels=0 then the variable has missing values for all records.

 

The NMissLevels variable will indicate if you have more than one "level" of missing when the value is greater than one. That indicates that you have one or more "special missing" values assigned, which may be something you need to consider in your overall project. If the value is 0 then that variable has no missing values.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 3883 views
  • 3 likes
  • 4 in conversation