Re: Count Without Proc SQL

AllanN · Posted 10-18-2011 09:34 PM

Hi Everyone, first of all sorry about my english.

I need help, i have a large dataset and i need to do some time expensive operations, like count.

Example, in a variable (column) i have products codes (like 1001 1002 1003 1004). I need to count every time each code appears in total (incrementing a variable per example). The question is, i have to do this with a lot of columns, so, i use a lot of proc sql's or i use one proc sql with a large case clause.

Im thinking about to do this in Data Step's (Importing data set using the set command, after this using "if" for filtering, and after incrementing counting variables). Please anyone can tell me how i do this, ou post some documentation?

Thanks is advance.

Doc_Duke · Posted 10-18-2011 10:00 PM

Have you considered using PROC FREQ? That could get you the counts and would do it in one pass of the data. Something like

PROC FREQ DATA=mydata;

TABLES <mycolumn>/NOCUM NOPERCENT;

RUN;

The documentation will give you other options and an explanation of these. If you need to change the format, you could use ODS to output the results into another SAS dataset.

You can also count multiple columns in one pass. The number that you can do without writing temporary files is limited by your platform and memory.

Doc Muhlbaier

Duke

AllanN · Posted 10-18-2011 10:19 PM

Duke, Thx for Quick Reply.

I tried to do my code using proc freq today earlier. But reading the documentation and trying i encountered 2 problems.

1) Imagine the example table:

Product_ID

1001

1002

1003

1002

1004

1001

Suppose that i wanna count records containing just "1001" and "1002" values. When i use PROC FREQ, SAS will count and give me percentage about all records, so, i just wanna catch "1001" "1002" and put count value in new column each one, or put count values in same column.

The second problem that i encountered is about the PROC FREQ concept. The example above works because im using 1 column in tables statement (inside proc freq), if i want to count the others column value (this is what i want) isnt possible (i just read documentation, i dont know if exist other possibilities), PROC FREQ use "crossing tables" concept.

Any other suggestions?

Again, thx for quickly reply.

art297 · Posted 10-18-2011 11:31 PM

Why not use proc transpose to flip the table so that everything is in one column? e.g., take a look at: http://www.ats.ucla.edu/stat/sas/modules/wtol_transpose.htm

Then, you could analyze the entire table with one simple proc freq.

And, if you wanted to limit the analysis to just certain values, in specifying the data for proc freq you can use a where statement to limit precisely what is going to be analyzed.

AllanN · Posted 10-19-2011 06:40 AM

Thx For Reply art297.

I`ll try this, i liked ur idea. Usint this method its possible to create other record (or variable before transpose) with sum of certain values?

Example:

Product A ProductB

1001a 999

1002 1001b

1001a 987

1004 1004

After transpose ill have something like that

Products

1001a

1002

1001a

1004

999

1001b

987

1004

And after proc freq ill receive something like that:

1001a 2

1001b 1

1002 1

1004 2

999 1

987 1

Suppose that i wanna count total ou 1001 values, theres possibilities to do this in proc freq?

THX in advance

art297 · Posted 10-19-2011 08:16 AM

You can aways create a new field and analyze it. E.g.,

data have;

input x $;

cards;

1001a

1001b

1002

1002c

1001c

;

data want;

set have;

num_only_x=compress(x,,'kd');

run;

Ksharp · Posted 10-19-2011 12:13 AM

You can use where= dataset option to filter obs.

NOTE: name -- weight is a range, name is the first variable and weight is the last variable in class dataset.

data class;
 set sashelp.class;
run;
options missing=' '; 
ods output onewayfreqs=want ;
proc freq data=class(where=(age in (14 15 16) )) ;
 tables name -- weight/nocum nopercent ;
run;

data want(keep=table frequency value);
retain table  frequency  ; 
 set want;
 table=scan(table,-1);
 value=strip(coalescec(of name--weight));
run;

Ksharp

SAS_Niels · Posted 10-19-2011 10:19 AM

Hi Allan,

I think a very smart way to count the appearance of a lot of variables and values is to use the ACROSS option in a proc report.

Please take a look at the program code:

** Sample data with 3 variables wich contain repeated values ;

data prod;

input product_id product $ country $;

datalines;

1001 screws US

1002 nails D

1003 wood GB

1004 iron LT

1001 screws US

1002 nails D

1003 wood GB

1004 iron LT

;

run;

proc report data = prod nowd headline headskip;

column product_id product country;

define product_id / order across;

define product / display across;

define country / display country;

run;

As you can see in the output window all frequencies of values were accumulated.

Or do you also need the counted values in the dataset? Not only a report?

SAS_Niels · Posted 10-19-2011 12:03 PM

** Alternative;

Data prod;
input product_id product $ country $;
datalines;
1001 screws US
1002 nails D
1003 wood GB
1004 iron LT
1001 screws US
1002 nails D
1003 wood GB
1004 iron LT
;
run;

proc sort data=prod;
by product_id;
run;
data overview (drop= countid countprod countcountry);
set prod;

      countID=0;
   if product_id ne . then
      totalID = countID+1;
      countPROD=0;
   if product ne '' then
      totalPROD = countPROD+1;
   countcountry =0;
   if country ne '' then
     totcountry = countcountry+1;
   label totalid = "Sum of unique product Id's"
   totalprod = "Sum of unique products"
   totcountry ="Sum of unique countries";
run;

options nobyline;
title "Accumulated counts for #BYVAL(product_id)";

proc freq data=overview ;
by product_id;
table totalid totalprod totcountry / nocum nopercent;
run;

rtritz · Posted 10-19-2011 05:54 PM

Hi AllanN,

If you truly want to use data step logic for the counting and keeping it as simple as I can then the code below should get you started.

You can use the where statement to look at specific id values or

leave it out and get a running total for all id values.

data id;

input id $;

datalines;

1001

1002

1003

1002

1004

1001

1003

1004

1001

1002

;

run;

proc sort data = id;

by id;

run;

data counts;

set id;

by id;

where id in('1001' '1003');

if first.id then do;

count=0;

end;

count+1;

if last.id then do;

total = count;

end;

add+total;

if total ne .;

run;

HTH, Rich

AllanN · Posted 10-19-2011 11:36 PM

Thx Again Guys! Ver useful all tricks and suggestions.

Just for knownledge (and possible help), im running other process with a small base to do the process with the large base.

I represented the small base in this worksheets:

https://docs.google.com/spreadsheet/pub?key=0AgrPWWYXoaItdENpeHpJZXRUaVhHRUV4NUk2TDZRRHc&output=html

What i Have to Do:

1) Count product_desc column

2) Count location1_desc and location2_desc

3) Count and put score column in a range in new column ( example: total (count) in range 10 between 15...) do for all scores

4) Convert days columns to months and put ir in a range in new column (new column will have the count of 1-12 months... 12-24... and go...)

5) Do for chg_v column the same for steps 3 and 4

6) Count column status

7) Count column message

😎 Generate results for all above steps is less dataset`s possible

Easy, not? kkkk All these steps using a efficient proccess (not prohibited to use proc sql, but im searching for alternatives because in my point proc sql isnt the better way). The solution that approximate to what i want is using proc freq for columns and merging results, but in this case i have rename all columns names for a sequence (column1,column2,column3..).

I dont give a shot for rtritz solution yet.

A friend suggested me to think about the base like a cube (related to olap data), to minimize data and make the "querys" easier and faster... but i have no idea what he's talking about...

Thanks for everybody, and..

if everyone wants to help... Be my Guest.

Ksharp · Posted 10-20-2011 02:22 AM

I can not open the URL above.

Can you post it at this forum.

AllanN · Posted 10-20-2011 08:14 AM

Here is, just a line extracted from my other post.

date_ref	product_code	product_desc	proposal_num	contract_number	customer_code	location1_id	location2_id	location1_desc	location2_desc	Restrictive	rest_1	rest_2	rest_3	Rebuild	score	concept	days	chg_v	status
201107	0000115432	Cars	8972634	0053452	05075672	0406	2054	East	Maryland	D	Y	D	D	Y	10	A	1250	128	analyzing

Peter_C · Posted 10-20-2011 08:41 AM

It should be straightforward (if not easy) when the data are in the order required of the output - it's just not a generalised solution.

Is the result like an unclassed summary?

Some idea of preferred score ranges and day/month ranges, some idea of the volume of products, locations, statuses and messages would indicate whether a straightforward array collection would be adequate or whether hash-table counters would be better.

An example of output required would help too.

peterC

Ksharp · Posted 10-21-2011 12:42 AM

As Peter said. If you can post some more data or a attachment and output you want , that would be more helpful.

data temp;
input 
date_ref 
     (product_code     product_desc     proposal_num     contract_number     customer_code     location1_id     location2_id     location1_desc
     location2_desc     Restrictive
     rest_1
     rest_2
     rest_3
     Rebuild ) (: $20.)
     score
     concept : $20.
     days : $20.
     chg_v
     status : $20.;
cards;
201107     0000115432     ars     8972634     0053432     05075632     0306     2024     East     Maryland     D     Y     D     D     Y     12     A     1220 118     analyzing
201107     0000115432     Cars     8972632     0053452     05074672     0406     2054     West     Maryland     D     Y     Y     D     S     11     D     1250 128     lyzing
201208     0000115432     Cars     8972334     0033452     05075672     0406     2024     West     Mland     D     Y     D     D     Y     10     A     1150 128     analyzing
201208     0000115432     Bars     8972634     0053452     05075672     0206     2054     East     Maryland     D     Y     D     A     Y     10     A     1250 148     zing
;
run;
/*For 1) 2) 4) 6)*/
data temp;
 set temp;
 year= int(date_ref/100);
run;
options missing=' '; 
ods output onewayfreqs=want ;
proc freq data=temp ;
 tables product_desc location1_desc location2_desc status year /nocum nopercent ;
run;
data want(keep=table frequency value);
retain table  frequency  ; 
 set want;
 table=scan(table,-1);
 value=strip(coalescec(of product_desc location1_desc location2_desc status year ));
run;

/*For 3) */
proc means data=temp noprint  ;
 var score;
 output out=want2 n= min= max= /autoname;
run;
data want2(keep=score_new);
 set want2;
 score_new='Total('||strip(score_n)||') in range '||strip(score_min)||' between '||strip(score_max);
run;

/*For 5) */
proc means data=temp noprint  ;
 var chg_v;
 output out=want3 n= min= max= /autoname;
run;
data want3(keep=chv_new);
 set want3;
 chv_new='Total('||strip(chg_v_n)||') in range '||strip(chg_v_min)||' between '||strip(chg_v_max);
run;
proc freq data=temp noprint;
 tables year*chg_v /list nocum nopercent out=want4;
run;

Ksharp

Registration is open

SAS Training: Just a Click Away