BookmarkSubscribeRSS Feed
drawerder
Calcite | Level 5

Dears, 

 

I have a big data set containing more than 100 variables, am trying to have a statistical table representing Each Variable (Blank cells count and Blank cells percentage and an amount(another variable in the data set) that reflect the amount of blank cells of each variable.

Data Sample

SocksT-shirtsShirtsCost 
Large  20
 X largeSmall40
Small Large 60

 

Data output desired : 

 

VariablesBlanks CountBlanks %Amount representing blank dataAmount %
Socks133%4033%
T-shirts267%8067%
Shirts133%2017%

 

Either having it as a table representing all variables. Please not that i am a beginner on SAS enterprise guide and am not good with coding. my plan was to do it by the query builder or by statistics tables.

9 REPLIES 9
drawerder
Calcite | Level 5

Sorry for the inconveniences. I have updated the question for better understanding. 

 

Thank you for your reply.

Reeza
Super User
/*This program creates a report with the number and percent of
missing data for each variable in the data set.
The ony change should be to the macro variable, INPUT_DSN.
Author: F. Khurshed
Date: 2019-01-04
*/
*create sample data to work with;

data class;
    set sashelp.class;

    if age=14 then
        call missing(height, weight, sex);

    if name='Alfred' then
        call missing(sex, age, height);
    label age="Fancy Age Label";
run;

*set input data set name;
%let INPUT_DSN = class;
%let OUTPUT_DSN = want;
*create format for missing;

proc format;
    value $ missfmt ' '="Missing" other="Not Missing";
    value nmissfmt .="Missing" other="Not Missing";
run;

*Proc freq to count missing/non missing;
ods select none;
*turns off the output so the results do not get too messy;
ods table onewayfreqs=temp;

proc freq data=&INPUT_DSN.;
    table _all_ / missing;
    format _numeric_ nmissfmt. _character_ $missfmt.;
run;

ods select all;
*Format output;

data long;
    length variable $32. variable_value $50.;
    set temp;
    Variable=scan(table, 2);
    Variable_Value=strip(trim(vvaluex(variable)));
    presentation=catt(frequency, " (", trim(put(percent/100, percent7.1)), ")");
    keep variable variable_value frequency percent cum: presentation;
    label variable='Variable' variable_value='Variable Value';
run;

proc sort data=long;
    by variable;
run;

*make it a wide data set for presentation, with values as N (Percent);

proc transpose data=long out=wide_presentation (drop=_name_);
    by variable;
    id variable_value;
    var presentation;
run;

*transpose only N;

proc transpose data=long out=wide_N prefix=N_;
    by variable;
    id variable_value;
    var frequency;
run;

*transpose only percents;

proc transpose data=long out=wide_PCT prefix=PCT_;
    by variable;
    id variable_value;
    var percent;
run;

*final output file;

data &Output_DSN.;
    merge wide_N wide_PCT wide_presentation;
    by variable;
    drop _name_;
    label N_Missing='# Missing' N_Not_Missing='# Not Missing' 
        PCT_Missing='% Missing' N_Not_Missing='% Not Missing' Missing='Missing' 
        Not_missing='Not Missing';
run;

title "Missing Report of &INPUT_DSN.";

proc print data=&output_dsn. noobs label;
run;

@drawerder wrote:

Dears, 

 

I have a big data set containing more than 100 variables, am trying to have a statistical table representing Each Variable (Blank cells count and Blank cells percentage and an amount(another variable in the data set) that reflect the amount of blank cells of each variable

 

Thanks in advance.


 

ballardw
Super User

@drawerder wrote:

Dears, 

 

I have a big data set containing more than 100 variables, am trying to have a statistical table representing Each Variable (Blank cells count and Blank cells percentage and an amount(another variable in the data set) that reflect the amount of blank cells of each variable

 

Thanks in advance.


Hint: Create a small data set with 3 or 4 variables such that you can determine what the results for that set should look like. Provide the data and the desired result.

Is the result supposed to be a report that people read or a data set used by other processes?

drawerder
Calcite | Level 5

I updated the post, Kindly check it 

 

Thank you for your comment.

 

Reeza
Super User
You need to explain that logic. Why does Socks have count=1, amount=40? Where do those values come from? Did you run/test the code I posted? It does what you need.
drawerder
Calcite | Level 5

Socks Column has 3 cells with (Large,Blank,Small) so there are 2 non blanks and 1 blank. the amount 40 is the cost associated with the blank. 

 

Your code already has variables from the data (CLass), after i entered the code and changed data name. all summaries and tables are representing (Height, sex, Age).

 

 

Reeza
Super User

If you want to do this in EG yourself, you need to create a custom format to show MISSING/NON-MISSING, as I did above (see the comments). You can do this via the GUI interface to create a user format. 

 

Then you apply the formats to your data set - not sure how in EG you'll need to google that. One option would be to add it in via Query Builder or via the assign column properties options.

 

Then run a One Way Frequency task on it twice, once with all the variables you want summarized. 

Then again, with using the AMOUNT as the Weight variable. Then merge the two results together.

 

EDIT: Search on here and someone posted an answer that would solve this is less steps by using a transpose task first and then doing a  one way frequency on that instead. I don't have EG to test that though. You may also want to try the Characterize Data task, it may have it all done for you but I don't recall the options and don't have EG anymore to check. 

 

Here's a code version. First run it exactly as posted and see the results that are generated. If that's what you want then change the section in red, specifically replace CLASS to be the variable name of interest and WANT with the name of the output data set you want. 

 

Coding is always more efficient, since I've now written this program in a generic fashion, the next time I need to do this I change the input data set name and it works 🙂 More likely though, I can copy that first section which is a useful program for me. There are other ways but this works quite well. 

 

data class;
    set sashelp.class;

    if age=14 then
        call missing(height, weight, sex);

    if name='Alfred' then
        call missing(sex, age, height);
    label age="Fancy Age Label";
    amount=rand('integer', 10, 100);
run;

*set input data set name;
%let INPUT_DSN = class; *change these lines only - input data set name;
%let OUTPUT_DSN = want; *output data set name;

*create format for missing;

proc format;
    value $ missfmt ' '="Missing" other="Not Missing";
    value nmissfmt .="Missing" other="Not Missing";
run;

*Proc freq to count missing/non missing;
ods select none;
*turns off the output so the results do not get too messy;
ods table onewayfreqs=temp1;

proc freq data=&INPUT_DSN.;
    table _all_ / missing out=temp_Check;
    format _numeric_ nmissfmt. _character_ $missfmt.;
run;

*clean up output to reduce clutter;
data long1;
    length variable $32. variable_value $50.;
    set temp1;
    *find name of variable being summarized;
    Variable=scan(table, 2);
    
    *getting value from correct variable - using variable name as lookup retrieved via VVALUEX() function;
    *note this returns a character value;
    Variable_Value=strip(trim(vvaluex(variable)));
    
    *create a presentation variable that is N(##.#%) format;
    presentation=catt(frequency, " (", trim(put(percent/100, percent7.1)), ")");
    
    *keep only variables of interest;
    keep variable variable_value frequency percent presentation;
  
run;

*sort for merging;
proc sort data=long1;
    by variable;
run;

ods select all;

*Proc freq to count missing/non missing;
ods select none;
*turns off the output so the results do not get too messy;
ods table onewayfreqs=temp2;

proc freq data=&INPUT_DSN.;
    table _all_ / missing;
    weight amount;
    format _numeric_ nmissfmt. _character_ $missfmt.;
run;

ods select all;

*clean up output to reduce clutter;
data long2;
    length variable $32. variable_value $50.;
    set temp2;
    *find name of variable being summarized;
    Variable=scan(table, 2);
    
    *getting value from correct variable - using variable name as lookup;
    *note this is a character value;
    Variable_Value=strip(trim(vvaluex(variable)));
    
    *create a presentation variable that is N(##.#%) format;
    presentation=catt(frequency, " (", trim(put(percent/100, percent7.1)), ")");
    
    *keep only variables of interest;
    keep variable variable_value frequency percent presentation;
  
run;

*sort for merging;
proc sort data=long2;
    by variable;
run;

data want;
*merge but rename variables first to make them represent values you wanted;
*note that if you merge without renaming it's problematic because the variables will overwrite each other;

merge long1 (rename= (frequency = BlankCounts percent = BlankPct presentation=BlankPres)) 
      long2 (rename= (frequency =AmountBlank percent =AmountBlankPct presentation=AmountBlankPres ));

by variable variable_value;

*commented out where clause that would limit it to missing only;
where variable_value='Missing';
run;

Reeza
Super User
I've tried to comment the code quite thoroughly but if you have any questions about any line feel free to ask. HTH.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

SAS Enterprise Guide vs. SAS Studio

What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 2640 views
  • 0 likes
  • 4 in conversation