BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
bpearce
Calcite | Level 5


I'm working with a very large data set (over 15 million observations and over 1000 variables). I would like to count the number of non-zero entries for a handful of variables from within my data set.  For example, in the table below VAR 1 has 3 occurrences, and VAR 2 has two occurrences.  I'm a beginner, so please provide some explanation if possible.  I appreciate any help!

Client #Var 1Var 2

1

2452699
200
336940
415480
503554
1 ACCEPTED SOLUTION

Accepted Solutions
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Ah, ok, so you want to check if the sum per column is 0.  Then your best off checking one of the procedures associated with that.  Proc means, or proc summary.  Say:

proc means data=have;

     var var1-var1000;

     output out=results mean=mean;

run;

Then if the resulting mean for the variable is 0 you know they are all zeroes. 

View solution in original post

9 REPLIES 9
M_Maldonado
Barite | Level 11

If you have SAS IML, there is one really fast way to count missing/not missing.

Look at this post: http://blogs.sas.com/content/iml/2011/09/19/count-the-number-of-missing-values-for-each-variable.htm...

Do you have HP procs and a grid license? That is another route.

I hope it helps,

Miguel

Steelers_In_DC
Barite | Level 11

I'm sure there will be some good array solutions, but this is the first thing that came to mind.  If it is only a handful of variables it's a viable solution.

data have;

infile cards dsd;

input Client Var1 Var2;

cards;

1,245,2699

2,0,0

3,3694,0

4,1548,0

5,0,3554

;

run;

data want;

set have;

count = sum((var1 ne 0)+(var2 ne 0));

run;

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Hi,

Yes, you can use arrays.  This would be preferred if there were many variables. 

data have;

  client=1; var1=245; var2=2696; output;

  client=2; var1=0; var2=0; output;

run;

data want (drop=i);

  set have;

  array list{*} var1 var2;  

  count_miss=0;

  do i=1 to dim(list);

    if list{i}=0 then count_miss=sum(count_miss,1);

  end;

run;

If they all have the same prefix then its even easier:

data want (drop=i);

  set have;

  array var{*};  

  count_miss=0;

  do i=1 to dim(var);

    if var{i}=0 then count_miss=sum(count_miss,1);

  end;

run;

Steelers_In_DC
Barite | Level 11

RW9, I'm a beginner at arrays.  I ran what you have and get the following error.  Can you assist?

22     GOPTIONS ACCESSIBLE;
23     data want2 (drop=i);
24       set have;
25       array var{*};

ERROR: The array var has been defined with zero elements.

26       count_miss=0;
27       do i=1 to dim(var);
28         if var{i}=0 then count_miss=sum(count_miss,1);

ERROR: Too many array subscripts specified for array var.

29       end;
30     run;
bpearce
Calcite | Level 5

Hi RW9,

I ran the code, but I'm not getting what I had hoped for.  The code ran the number of misses (zero values), by row.  For example, client 1 would have 0 misses and client 2 would have 2 misses.  However, I want the total number of times each variable (column) has a value.  In my real data set, I have over 15 million clients and 1000 variables. I want separate totals for each variable, so variable 1 would have 8 million occurrences and variable 2 would have 9 million (as an example).  Can you help me modify the code?

Thanks,

Steelers_In_DC
Barite | Level 11

I thought you were checking "a handful" of variables.  Do you need to check 1000 now?  If so are they similarly named, as in VAR1 VAR2...VAR1000?

bpearce
Calcite | Level 5

I am only checking a handful of them.  They are not similarly named.  I think RW9's response did the trick. 

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Ah, ok, so you want to check if the sum per column is 0.  Then your best off checking one of the procedures associated with that.  Proc means, or proc summary.  Say:

proc means data=have;

     var var1-var1000;

     output out=results mean=mean;

run;

Then if the resulting mean for the variable is 0 you know they are all zeroes. 

ballardw
Super User

A different approach:

Use a custom format to identify ranges of values of interest. This one treats every non-missing value except 0 as nonzero and shows 0 as such.

proc format;

value nonzero

low - <0, 0<-high = 'Non-zero'

0 = 'Zero';

run;


_numeric_ is a special variable list identifier that says to do what ever for all numeric variables. So can be a short hand for large lists.

Using proc freq and the format will give you nonzero and zero counts and percentages for each variable plus the missing by default.

proc freq data=have;

tables _numeric_;

format _numeric_ nonzero. ;

run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to choose a machine learning algorithm

Use this tutorial as a handy guide to weigh the pros and cons of these commonly used machine learning algorithms.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 9 replies
  • 6034 views
  • 1 like
  • 5 in conversation