turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- Analytics
- /
- Data Mining
- /
- Count the number of non-zero occurrences for a var...

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

07-30-2015 10:12 AM

I'm working with a very large data set (over 15 million observations and over 1000 variables). I would like to count the number of non-zero entries for a handful of variables from within my data set. For example, in the table below VAR 1 has 3 occurrences, and VAR 2 has two occurrences. I'm a beginner, so please provide some explanation if possible. I appreciate any help!

Client # | Var 1 | Var 2 |
---|---|---|

1 | 245 | 2699 |

2 | 0 | 0 |

3 | 3694 | 0 |

4 | 1548 | 0 |

5 | 0 | 3554 |

Accepted Solutions

Solution

07-31-2015
10:01 AM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to bpearce

07-31-2015 10:01 AM

Ah, ok, so you want to check if the sum per column is 0. Then your best off checking one of the procedures associated with that. Proc means, or proc summary. Say:

proc means data=have;

var var1-var1000;

output out=results mean=mean;

run;

Then if the resulting mean for the variable is 0 you know they are all zeroes.

All Replies

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to bpearce

07-30-2015 10:20 AM

If you have SAS IML, there is one really fast way to count missing/not missing.

Look at this post: http://blogs.sas.com/content/iml/2011/09/19/count-the-number-of-missing-values-for-each-variable.htm...

Do you have HP procs and a grid license? That is another route.

I hope it helps,

Miguel

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to bpearce

07-30-2015 10:29 AM

I'm sure there will be some good array solutions, but this is the first thing that came to mind. If it is only a handful of variables it's a viable solution.

data have;

infile cards dsd;

input Client Var1 Var2;

cards;

1,245,2699

2,0,0

3,3694,0

4,1548,0

5,0,3554

;

run;

data want;

set have;

count = sum((var1 ne 0)+(var2 ne 0));

run;

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to bpearce

07-30-2015 10:47 AM

Hi,

Yes, you can use arrays. This would be preferred if there were many variables.

data have;

client=1; var1=245; var2=2696; output;

client=2; var1=0; var2=0; output;

run;

data want (drop=i);

set have;

array list{*} var1 var2;

count_miss=0;

do i=1 to dim(list);

if list{i}=0 then count_miss=sum(count_miss,1);

end;

run;

If they all have the same prefix then its even easier:

data want (drop=i);

set have;

array var{*};

count_miss=0;

do i=1 to dim(var);

if var{i}=0 then count_miss=sum(count_miss,1);

end;

run;

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

07-30-2015 10:56 AM

RW9, I'm a beginner at arrays. I ran what you have and get the following error. Can you assist?

22 | GOPTIONS ACCESSIBLE; |

23 | data want2 (drop=i); |

24 | set have; |

25 | array var{*}; |

ERROR: The array var has been defined with zero elements.

26 | count_miss=0; |

27 | do i=1 to dim(var); |

28 | if var{i}=0 then count_miss=sum(count_miss,1); |

ERROR: Too many array subscripts specified for array var.

29 | end; |

30 | run; |

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

07-31-2015 09:20 AM

Hi RW9,

I ran the code, but I'm not getting what I had hoped for. The code ran the number of misses (zero values), by row. For example, client 1 would have 0 misses and client 2 would have 2 misses. However, I want the total number of times each variable (column) has a value. In my real data set, I have over 15 million clients and 1000 variables. I want separate totals for each variable, so variable 1 would have 8 million occurrences and variable 2 would have 9 million (as an example). Can you help me modify the code?

Thanks,

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to bpearce

07-31-2015 09:51 AM

I thought you were checking "a handful" of variables. Do you need to check 1000 now? If so are they similarly named, as in VAR1 VAR2...VAR1000?

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to Steelers_In_DC

07-31-2015 10:30 AM

I am only checking a handful of them. They are not similarly named. I think RW9's response did the trick.

Solution

07-31-2015
10:01 AM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to bpearce

07-31-2015 10:01 AM

Ah, ok, so you want to check if the sum per column is 0. Then your best off checking one of the procedures associated with that. Proc means, or proc summary. Say:

proc means data=have;

var var1-var1000;

output out=results mean=mean;

run;

Then if the resulting mean for the variable is 0 you know they are all zeroes.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to bpearce

07-31-2015 10:37 AM

A different approach:

Use a custom format to identify ranges of values of interest. This one treats every non-missing value except 0 as nonzero and shows 0 as such.

proc format;

value nonzero

low - <0, 0<-high = 'Non-zero'

0 = 'Zero';

run;

_numeric_ is a special variable list identifier that says to do what ever for all numeric variables. So can be a short hand for large lists.

Using proc freq and the format will give you nonzero and zero counts and percentages for each variable plus the missing by default.

proc freq data=have;

tables _numeric_;

format _numeric_ nonzero. ;

run;