BookmarkSubscribeRSS Feed
AllanN
Calcite | Level 5

Hi Everyone, first of all sorry about my english.

I need help, i have a large dataset and i need to do some time expensive operations, like count.

Example, in a variable (column) i have products codes (like 1001 1002 1003 1004). I need to count every time each code appears in total (incrementing a variable per example). The question is, i have to do this with a lot of columns, so, i use a lot of proc sql's or i use one proc sql with a large case clause.

Im thinking about to do this in Data Step's (Importing data set using the set command, after this using "if" for filtering, and after incrementing counting variables). Please anyone can tell me how i do this, ou post some documentation?

Thanks is advance.

17 REPLIES 17
Doc_Duke
Rhodochrosite | Level 12

Have you considered using PROC FREQ?  That could get you the counts and would do it in one pass of the data.  Something like

PROC FREQ DATA=mydata;

  TABLES <mycolumn>/NOCUM NOPERCENT;

  RUN;

The documentation will give you other options and an explanation of these.  If you need to change the format, you could use ODS to output the results into another SAS dataset. 

You can also count multiple columns in one pass.  The number that you can do without writing temporary files is limited by your platform and memory.

Doc Muhlbaier

Duke

AllanN
Calcite | Level 5

Duke, Thx for Quick Reply.

I tried to do my code using proc freq today earlier. But reading the documentation and trying i encountered 2 problems.

1) Imagine the example table:

Product_ID

1001

1002

1003

1002

1004

1001

Suppose that i wanna count records containing just "1001" and "1002" values. When i use PROC FREQ, SAS will count and give me percentage about all records, so, i just wanna catch "1001" "1002" and put count value in new column each one, or put count values in same column.

The second problem that i encountered is about the PROC FREQ concept. The example above works because im using 1 column in tables statement (inside proc freq), if i want to count the others column value (this is what i want) isnt possible (i just read documentation, i dont know if exist other possibilities), PROC FREQ use "crossing tables" concept.

Any other suggestions?

Again, thx for quickly reply.

art297
Opal | Level 21

Why not use proc transpose to flip the table so that everything is in one column?  e.g., take a look at: http://www.ats.ucla.edu/stat/sas/modules/wtol_transpose.htm

Then, you could analyze the entire table with one simple proc freq.

And, if you wanted to limit the analysis to just certain values, in specifying the data for proc freq you can use a where statement to limit precisely what is going to be analyzed.

AllanN
Calcite | Level 5

Thx For Reply art297.

I`ll try this, i liked ur idea. Usint this method its possible to create other record (or variable before transpose) with sum of certain values?

Example:

Product A     ProductB

1001a               999

1002               1001b

1001a               987   

1004               1004

After transpose ill have something like that

Products

1001a

1002

1001a

1004

999

1001b

987

1004

And after proc freq ill receive something like that:

1001a         2

1001b         1

1002          1

1004          2

999            1

987           1

Suppose that i wanna count total ou 1001 values, theres possibilities to do this in proc freq?

THX in advance

art297
Opal | Level 21

You can aways create a new field and analyze it.  E.g.,

data have;

  input x $;

  cards;

1001a

1001b

1002

1002c

1001c

;

data want;

  set have;

  num_only_x=compress(x,,'kd');

run;

Ksharp
Super User

You can use where= dataset option to filter obs.

NOTE: name -- weight is a range, name is the first variable and weight is the last variable in class dataset.

data class;
 set sashelp.class;
run;
options missing=' '; 
ods output onewayfreqs=want ;
proc freq data=class(where=(age in (14 15 16) )) ;
 tables name -- weight/nocum nopercent ;
run;

data want(keep=table frequency value);
retain table  frequency  ; 
 set want;
 table=scan(table,-1);
 value=strip(coalescec(of name--weight));
run;

Ksharp

SAS_Niels
Calcite | Level 5

Hi Allan,

I think a very smart way to count the appearance of a lot of variables and values is to use the ACROSS option in a proc report.

Please take a look at the program code:

** Sample data with 3 variables wich contain repeated values ;

data prod;

  input product_id product $ country $;

  datalines;

1001 screws US

1002 nails D

1003 wood GB

1004 iron LT

1001 screws US

1002 nails D

1003 wood GB

1004 iron LT

;

run;

proc report data = prod nowd headline headskip;

   column product_id product country;

   define product_id  / order across;

   define product      / display across;

   define  country     / display country;

run;

As you can see in the output window all frequencies of values were accumulated.

Or do you also need the counted values in the dataset? Not only a report?

SAS_Niels
Calcite | Level 5

** Alternative;

  Data prod;
    input product_id product $ country $;
    datalines;
  1001 screws US
  1002 nails D
  1003 wood GB
  1004 iron LT
  1001 screws US
  1002 nails D
  1003 wood GB
  1004 iron LT
  ;
run;

proc sort data=prod;
  by product_id;
run;
data overview (drop= countid countprod countcountry);
  set prod;

      countID=0;
   if product_id ne . then
      totalID = countID+1;
      countPROD=0;
   if product ne '' then
      totalPROD = countPROD+1;
   countcountry =0;
   if country ne '' then
     totcountry = countcountry+1;
   label totalid = "Sum of unique product Id's"
   totalprod = "Sum of unique products"
   totcountry ="Sum of unique countries";
run;

options nobyline;
title "Accumulated counts for #BYVAL(product_id)";

proc freq data=overview ;
  by product_id;
  table totalid totalprod totcountry / nocum nopercent;
run;

rtritz
Calcite | Level 5

Hi AllanN,

If you truly want to use data step logic for the counting and keeping it as simple as I can then the code below should get you started.

You can use the where statement to look at specific id values or

leave it out and get a running total for all id values.

data id;  

input id $;

datalines;

1001

1002

1003

1002

1004

1001

1003

1004

1001

1002

;

run;

proc sort data = id; 

by id;

run;

data counts;  

set id;  

by id; 

where id in('1001' '1003'); 

if first.id then do;  

count=0;  

end;  

count+1; 

if last.id then do;    

total = count;  

end;  

add+total; 

if total ne .;

run;

HTH, Rich

AllanN
Calcite | Level 5

Thx Again Guys! Ver useful all tricks and suggestions.

Just for knownledge (and possible help), im running other process with a small base to do the process with the large base.

I represented the small base in this worksheets:

https://docs.google.com/spreadsheet/pub?key=0AgrPWWYXoaItdENpeHpJZXRUaVhHRUV4NUk2TDZRRHc&output=html

What i Have to Do:

1) Count product_desc column

2) Count location1_desc and location2_desc

3) Count and put score column in a range in new column ( example: total (count) in range 10 between 15...) do for all scores

4) Convert days columns to months and put ir in a range in new column (new column will have the count of 1-12 months... 12-24... and go...)

5) Do for chg_v column the same for steps 3 and 4

6) Count column status

7) Count column message

😎 Generate results for all above steps is less dataset`s possible

Easy, not? kkkk All these steps using a efficient proccess (not prohibited to use proc sql, but im searching for alternatives because in my point proc sql isnt the better way). The solution that approximate to what i want is using proc freq for columns and merging results, but in this case i have rename all columns names for a sequence (column1,column2,column3..).

I dont give a shot for rtritz solution yet.

A friend suggested me to think about the base like a cube (related to olap data), to minimize data and make the "querys" easier and faster... but i have no idea what he's talking about...

Thanks for everybody, and..

if everyone wants to help... Be my Guest.

Ksharp
Super User

I can not open the URL above.

Can you post it at this forum.

AllanN
Calcite | Level 5

Here is, just a line extracted from my other post.

date_ref
product_codeproduct_descproposal_numcontract_numbercustomer_codelocation1_idlocation2_idlocation1_desc
location2_descRestrictive
rest_1
rest_2
rest_3
Rebuild
score
concept
days
chg_v
status
2011070000115432Cars897263400534520507567204062054EastMarylandDYDDY10A1250128analyzing
Peter_C
Rhodochrosite | Level 12

It should be straightforward (if not easy) when the data are in the order required of the output - it's just not a generalised solution.

Is the result like an unclassed summary? 

Some idea of preferred score ranges and day/month ranges, some idea of the volume of products, locations, statuses and messages would indicate whether a straightforward array collection would be adequate or whether hash-table counters would be better.

 

An example of output required would help too.

peterC

Ksharp
Super User

As Peter said. If you can post some more data or a attachment and output you want , that would be more helpful.

data temp;
input 
date_ref 
     (product_code     product_desc     proposal_num     contract_number     customer_code     location1_id     location2_id     location1_desc
     location2_desc     Restrictive
     rest_1
     rest_2
     rest_3
     Rebuild ) (: $20.)
     score
     concept : $20.
     days : $20.
     chg_v
     status : $20.;
cards;
201107     0000115432     ars     8972634     0053432     05075632     0306     2024     East     Maryland     D     Y     D     D     Y     12     A     1220 118     analyzing
201107     0000115432     Cars     8972632     0053452     05074672     0406     2054     West     Maryland     D     Y     Y     D     S     11     D     1250 128     lyzing
201208     0000115432     Cars     8972334     0033452     05075672     0406     2024     West     Mland     D     Y     D     D     Y     10     A     1150 128     analyzing
201208     0000115432     Bars     8972634     0053452     05075672     0206     2054     East     Maryland     D     Y     D     A     Y     10     A     1250 148     zing
;
run;
/*For 1) 2) 4) 6)*/
data temp;
 set temp;
 year= int(date_ref/100);
run;
options missing=' '; 
ods output onewayfreqs=want ;
proc freq data=temp ;
 tables product_desc location1_desc location2_desc status year /nocum nopercent ;
run;
data want(keep=table frequency value);
retain table  frequency  ; 
 set want;
 table=scan(table,-1);
 value=strip(coalescec(of product_desc location1_desc location2_desc status year ));
run;

/*For 3) */
proc means data=temp noprint  ;
 var score;
 output out=want2 n= min= max= /autoname;
run;
data want2(keep=score_new);
 set want2;
 score_new='Total('||strip(score_n)||') in range '||strip(score_min)||' between '||strip(score_max);
run;

/*For 5) */
proc means data=temp noprint  ;
 var chg_v;
 output out=want3 n= min= max= /autoname;
run;
data want3(keep=chv_new);
 set want3;
 chv_new='Total('||strip(chg_v_n)||') in range '||strip(chg_v_min)||' between '||strip(chg_v_max);
run;
proc freq data=temp noprint;
 tables year*chg_v /list nocum nopercent out=want4;
run;



Ksharp

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 17 replies
  • 2083 views
  • 0 likes
  • 7 in conversation