Hi Everyone, first of all sorry about my english.
I need help, i have a large dataset and i need to do some time expensive operations, like count.
Example, in a variable (column) i have products codes (like 1001 1002 1003 1004). I need to count every time each code appears in total (incrementing a variable per example). The question is, i have to do this with a lot of columns, so, i use a lot of proc sql's or i use one proc sql with a large case clause.
Im thinking about to do this in Data Step's (Importing data set using the set command, after this using "if" for filtering, and after incrementing counting variables). Please anyone can tell me how i do this, ou post some documentation?
Thanks is advance.
Have you considered using PROC FREQ? That could get you the counts and would do it in one pass of the data. Something like
PROC FREQ DATA=mydata;
TABLES <mycolumn>/NOCUM NOPERCENT;
RUN;
The documentation will give you other options and an explanation of these. If you need to change the format, you could use ODS to output the results into another SAS dataset.
You can also count multiple columns in one pass. The number that you can do without writing temporary files is limited by your platform and memory.
Doc Muhlbaier
Duke
Duke, Thx for Quick Reply.
I tried to do my code using proc freq today earlier. But reading the documentation and trying i encountered 2 problems.
1) Imagine the example table:
Product_ID
1001
1002
1003
1002
1004
1001
Suppose that i wanna count records containing just "1001" and "1002" values. When i use PROC FREQ, SAS will count and give me percentage about all records, so, i just wanna catch "1001" "1002" and put count value in new column each one, or put count values in same column.
The second problem that i encountered is about the PROC FREQ concept. The example above works because im using 1 column in tables statement (inside proc freq), if i want to count the others column value (this is what i want) isnt possible (i just read documentation, i dont know if exist other possibilities), PROC FREQ use "crossing tables" concept.
Any other suggestions?
Again, thx for quickly reply.
Why not use proc transpose to flip the table so that everything is in one column? e.g., take a look at: http://www.ats.ucla.edu/stat/sas/modules/wtol_transpose.htm
Then, you could analyze the entire table with one simple proc freq.
And, if you wanted to limit the analysis to just certain values, in specifying the data for proc freq you can use a where statement to limit precisely what is going to be analyzed.
Thx For Reply art297.
I`ll try this, i liked ur idea. Usint this method its possible to create other record (or variable before transpose) with sum of certain values?
Example:
Product A ProductB
1001a 999
1002 1001b
1001a 987
1004 1004
After transpose ill have something like that
Products
1001a
1002
1001a
1004
999
1001b
987
1004
And after proc freq ill receive something like that:
1001a 2
1001b 1
1002 1
1004 2
999 1
987 1
Suppose that i wanna count total ou 1001 values, theres possibilities to do this in proc freq?
THX in advance
You can aways create a new field and analyze it. E.g.,
data have;
input x $;
cards;
1001a
1001b
1002
1002c
1001c
;
data want;
set have;
num_only_x=compress(x,,'kd');
run;
You can use where= dataset option to filter obs.
NOTE: name -- weight is a range, name is the first variable and weight is the last variable in class dataset.
data class; set sashelp.class; run; options missing=' '; ods output onewayfreqs=want ; proc freq data=class(where=(age in (14 15 16) )) ; tables name -- weight/nocum nopercent ; run; data want(keep=table frequency value); retain table frequency ; set want; table=scan(table,-1); value=strip(coalescec(of name--weight)); run;
Ksharp
Hi Allan,
I think a very smart way to count the appearance of a lot of variables and values is to use the ACROSS option in a proc report.
Please take a look at the program code:
** Sample data with 3 variables wich contain repeated values ;
data prod;
input product_id product $ country $;
datalines;
1001 screws US
1002 nails D
1003 wood GB
1004 iron LT
1001 screws US
1002 nails D
1003 wood GB
1004 iron LT
;
run;
proc report data = prod nowd headline headskip;
column product_id product country;
define product_id / order across;
define product / display across;
define country / display country;
run;
As you can see in the output window all frequencies of values were accumulated.
Or do you also need the counted values in the dataset? Not only a report?
** Alternative;
Data prod;
input product_id product $ country $;
datalines;
1001 screws US
1002 nails D
1003 wood GB
1004 iron LT
1001 screws US
1002 nails D
1003 wood GB
1004 iron LT
;
run;
proc sort data=prod;
by product_id;
run;
data overview (drop= countid countprod countcountry);
set prod;
countID=0;
if product_id ne . then
totalID = countID+1;
countPROD=0;
if product ne '' then
totalPROD = countPROD+1;
countcountry =0;
if country ne '' then
totcountry = countcountry+1;
label totalid = "Sum of unique product Id's"
totalprod = "Sum of unique products"
totcountry ="Sum of unique countries";
run;
options nobyline;
title "Accumulated counts for #BYVAL(product_id)";
proc freq data=overview ;
by product_id;
table totalid totalprod totcountry / nocum nopercent;
run;
Hi AllanN,
If you truly want to use data step logic for the counting and keeping it as simple as I can then the code below should get you started.
You can use the where statement to look at specific id values or
leave it out and get a running total for all id values.
data id;
input id $;
datalines;
1001
1002
1003
1002
1004
1001
1003
1004
1001
1002
;
run;
proc sort data = id;
by id;
run;
data counts;
set id;
by id;
where id in('1001' '1003');
if first.id then do;
count=0;
end;
count+1;
if last.id then do;
total = count;
end;
add+total;
if total ne .;
run;
HTH, Rich
Thx Again Guys! Ver useful all tricks and suggestions.
Just for knownledge (and possible help), im running other process with a small base to do the process with the large base.
I represented the small base in this worksheets:
https://docs.google.com/spreadsheet/pub?key=0AgrPWWYXoaItdENpeHpJZXRUaVhHRUV4NUk2TDZRRHc&output=html
What i Have to Do:
1) Count product_desc column
2) Count location1_desc and location2_desc
3) Count and put score column in a range in new column ( example: total (count) in range 10 between 15...) do for all scores
4) Convert days columns to months and put ir in a range in new column (new column will have the count of 1-12 months... 12-24... and go...)
5) Do for chg_v column the same for steps 3 and 4
6) Count column status
7) Count column message
😎 Generate results for all above steps is less dataset`s possible
Easy, not? kkkk All these steps using a efficient proccess (not prohibited to use proc sql, but im searching for alternatives because in my point proc sql isnt the better way). The solution that approximate to what i want is using proc freq for columns and merging results, but in this case i have rename all columns names for a sequence (column1,column2,column3..).
I dont give a shot for rtritz solution yet.
A friend suggested me to think about the base like a cube (related to olap data), to minimize data and make the "querys" easier and faster... but i have no idea what he's talking about...
Thanks for everybody, and..
if everyone wants to help... Be my Guest.
I can not open the URL above.
Can you post it at this forum.
Here is, just a line extracted from my other post.
date_ref | product_code | product_desc | proposal_num | contract_number | customer_code | location1_id | location2_id | location1_desc | location2_desc | Restrictive | rest_1 | rest_2 | rest_3 | Rebuild | score | concept | days | chg_v | status |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
201107 | 0000115432 | Cars | 8972634 | 0053452 | 05075672 | 0406 | 2054 | East | Maryland | D | Y | D | D | Y | 10 | A | 1250 | 128 | analyzing |
It should be straightforward (if not easy) when the data are in the order required of the output - it's just not a generalised solution.
Is the result like an unclassed summary?
Some idea of preferred score ranges and day/month ranges, some idea of the volume of products, locations, statuses and messages would indicate whether a straightforward array collection would be adequate or whether hash-table counters would be better.
An example of output required would help too.
peterC
As Peter said. If you can post some more data or a attachment and output you want , that would be more helpful.
data temp; input date_ref (product_code product_desc proposal_num contract_number customer_code location1_id location2_id location1_desc location2_desc Restrictive rest_1 rest_2 rest_3 Rebuild ) (: $20.) score concept : $20. days : $20. chg_v status : $20.; cards; 201107 0000115432 ars 8972634 0053432 05075632 0306 2024 East Maryland D Y D D Y 12 A 1220 118 analyzing 201107 0000115432 Cars 8972632 0053452 05074672 0406 2054 West Maryland D Y Y D S 11 D 1250 128 lyzing 201208 0000115432 Cars 8972334 0033452 05075672 0406 2024 West Mland D Y D D Y 10 A 1150 128 analyzing 201208 0000115432 Bars 8972634 0053452 05075672 0206 2054 East Maryland D Y D A Y 10 A 1250 148 zing ; run; /*For 1) 2) 4) 6)*/ data temp; set temp; year= int(date_ref/100); run; options missing=' '; ods output onewayfreqs=want ; proc freq data=temp ; tables product_desc location1_desc location2_desc status year /nocum nopercent ; run; data want(keep=table frequency value); retain table frequency ; set want; table=scan(table,-1); value=strip(coalescec(of product_desc location1_desc location2_desc status year )); run; /*For 3) */ proc means data=temp noprint ; var score; output out=want2 n= min= max= /autoname; run; data want2(keep=score_new); set want2; score_new='Total('||strip(score_n)||') in range '||strip(score_min)||' between '||strip(score_max); run; /*For 5) */ proc means data=temp noprint ; var chg_v; output out=want3 n= min= max= /autoname; run; data want3(keep=chv_new); set want3; chv_new='Total('||strip(chg_v_n)||') in range '||strip(chg_v_min)||' between '||strip(chg_v_max); run; proc freq data=temp noprint; tables year*chg_v /list nocum nopercent out=want4; run;
Ksharp
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.