I have on the order of 60 million records, each with 12 monthly fields. I would like to accumulate counts across these fields, but their contents vary from record to record -- for example
id f1 f2 f3 ...
1 A B C
2 E G A
3 C C C
etc.
my hope was to create a series of variables called n_A, n_B, n_c, n_e, etc and then do something like
data counts;
set big_honker (keep=f1-f12) end=eof;
array f {12} f1-f12;
do i=1 to 12;
%let vbl=%magic(f{i});
&vbl+1;
end;
if eof then output;
drop f1-f12;
run;
where the %magic takes the value x in each of those monthly fields and creates the appropriate accumulator n_x.
But I'll be darned if I can figure it out.
Help, please!
There is a data step way:
data _null_;
set have end=end_of_have;
length value $10 freq 8;
if _n_=1 then do;
declare hash h (ordered:'a');
h.definekey('value');
h.definedata('value','freq');
h.definedone();
end;
array month {12} f1-f12;
do m=1 to 12;
value=month{m};
if h.find()^=0 then freq=1;
else freq=freq+1;
h.replace();
end;
if end_of_have then h.output(dataset:'want');
run;
or you could avoid constructing and using a hash. Just pipe the data (via a data set view) to proc FREQ, and let SAS do the counting:
data vneed / view=vneed;
set have ;
length value $10;
array month{12} f1-f12 ;
do m=1 to 12;
value=month{m};
output;
end;
run;
proc freq data=vneed noprint ;
tables value / out=want (keep=value count);
run;
There is a data step way:
data _null_;
set have end=end_of_have;
length value $10 freq 8;
if _n_=1 then do;
declare hash h (ordered:'a');
h.definekey('value');
h.definedata('value','freq');
h.definedone();
end;
array month {12} f1-f12;
do m=1 to 12;
value=month{m};
if h.find()^=0 then freq=1;
else freq=freq+1;
h.replace();
end;
if end_of_have then h.output(dataset:'want');
run;
or you could avoid constructing and using a hash. Just pipe the data (via a data set view) to proc FREQ, and let SAS do the counting:
data vneed / view=vneed;
set have ;
length value $10;
array month{12} f1-f12 ;
do m=1 to 12;
value=month{m};
output;
end;
run;
proc freq data=vneed noprint ;
tables value / out=want (keep=value count);
run;
Perfect! Thank you!
One thing that is a little weird: somehow this approach loses 1,800 person-months. But that's out of 818,600,000 million person-months, so I think I can live with the results in aggregate 🙂
@dwaldo wrote:
One thing that is a little weird: somehow this approach loses 1,800 person-months. But that's out of 818,600,000 million person-months, so I think I can live with the results in aggregate 🙂
Hard to tell which approach you may be discussing as there isn't any quote or reference to which...
In my head I had it right ... I meant
data _null_;
set have end=end_of_have;
length value $10 freq 8;
if _n_=1 then do;
declare hash h (ordered:'a');
h.definekey('value');
h.definedata('value','freq');
h.definedone();
end;
array month {12} f1-f12;
do m=1 to 12;
value=month{m};
if h.find()^=0 then freq=1;
else freq=freq+1;
h.replace();
end;
if end_of_have then h.output(dataset:'want');
run;
Maxim 19
Long beats wide.
(Don't keep data in structure)
In the world of spreadsheets, people tend to line up data side-by-side, and put data items (dates, categories, …) into column headers. This runs counter to all the methods available in SAS for group processing, and makes programming difficult, as one has variable column names and has to resort to creating dynamic code (with macros and/or call execute) where such is not necessary at all if categories were represented in their own column and data aligned vertically.
There are times where a wide format is needed, eg when preparing data for regression analysis. But for the processing and storing of data, long formats are always to be preferred.Dynamic variable names force unnecessary dynamic code.
In this case you don't need data step at all:
proc transpose data= have out= trance(drop= _name_ rename=(col1 = f));
by id;
var f:;
run;
proc freq data= trance noprint;
table f / out= want;
run;
Thanks to both of you for your comments; I agree that long beats wide and that tidy beats messy. And last night as I read my own post, I realized that macros are executed at compile-time execution and not at object-time.
However, I fear that I did not explain my problem very well. I was given a large file: 60-some million person-records, each with their county of residence in each of the 12 months of the year. My task is to count up person-months for each county. So transposing the existing dataset would not be a good thing.
The hit-it-with-a-stick approach would be to write out 12 records for each person and then summarize over counties; but I do not relish wrangling 720 million records. Another approach might be to create a monster SELECT(f{i}) "snippet" running from county 01000 to 99999, outputting at EOF and then transposing the results.
I was going for some kind of elegant run-time solution that points directly to the county counter that needs to be increased ... maybe a CALL function or something. But I also don't mind hitting something with a stick if it looks to be a one-time thing.
So just use the data step view.
Say you have data structured like:
data have;
input id year (month1-month12) (& :$40.);
cards;
1 2023 . . . . NEW YORK NEW YORK NEW YORK NEW YORK BRONX BRONX BRONX BRONX
1 2024 BRONX BRONX BRONX . . . . . . . . .
2 2023 QUEENS QUEENS QUEENS QUEENS QUEENS QUEENS QUEENS QUEENS QUEENS QUEENS QUEENS QUEENS
3 2024 RICHMOND RICHMOND RICHMOND RICHMOND . . . . . . . . .
;
Then you can convert it into a dataset (or to save disk space a view) that has ID, YEAR, MONTH and COUNTY instead.
data tall / view=tall;
set have;
array m month1-month12 ;
do month=1 to 12;
COUNTY=m[month];
if not missing(county) then output;
end;
keep id year month county;
run;
Now you can summarize by any of your different dimensions.
proc summary data=tall missing;
class year month county;
output out=want;
run;
Thanks! That seems to be the most elemental solution -- one that first occurred to me and that runs overnight. It has definite curb appeal. I have a wealth of options from which to choose!
@dwaldo wrote:
However, I fear that I did not explain my problem very well. I was given a large file: 60-some million person-records, each with their county of residence in each of the 12 months of the year. My task is to count up person-months for each county. So transposing the existing dataset would not be a good thing.
Why is transposing not a good thing?
Since transpose won't transpose any missing values you get one person per month with the value. Unless the person id is repeated (have you checked that yet?) then Transpose will have a data set that looks like
Person _name_of_month_variable col1 (or if you specify a prefix or id value a different name) with Col1 the value of the County.
Proc freq on the County variable and you have total person months per county.
Yes the data set is largish. But you are starting with large data.
IF the only thing you want is the count of the number of times a county value appears here is one way:
data have; input id f1 $ f2 $ f3 $ ; datalines; 1 A B C 2 E G A 3 C C C ; data want; set have end=lastone; array f(*) f1-f3; array t(7) $ 1 _temporary_ ('A','B','C','D','E','F','G'); array c(*) count1-count7; retain count: ; do i= 1 to dim(f); if not missing(f[i]) then c[ whichc(f[i],of t(*)) ] + 1; end; if lastone then output; drop i f: id; run;
The _temporary_ array would hold the values of the county name text -as it appears in the data-.
The C array would have the same the same number of elements as the number of counties in the temporary array.
This works by retaining the counter variables and incrementing them each time the value of a county is encountered. WHICHC searches for the first parameter (the monthly county variable value) in the remaining items, in this case an array holding all the names of the counties. The function returns the POSITION number in the searched list (or zero which is why we exclude the missing county if any). That returned value then increments the position counter by 1.
The data set option END is used so the only output written is when the last record is processed.
Thanks! I like the elegance of this ... but given that county FIPS code sets change from one year to the next, the code line
array t(7) $ 1 _temporary_ ('A','B','C','D','E','F','G');
could get cumbersome really fast.
I'll save this, though, for a time when there are fewer possible field values!
@dwaldo wrote:
Thanks! I like the elegance of this ... but given that county FIPS code sets change from one year to the next, the code line
array t(7) $ 1 _temporary_ ('A','B','C','D','E','F','G');could get cumbersome really fast.
I'll save this, though, for a time when there are fewer possible field values!
Counties don't change that often generally.My home state, Idaho, hasn't added a county since 1919. The newest county I find in a quick Google search dates from 2001. So changing county FIPS not an annual occurrence.
If your are using some other FIPS code that might be an issue.
Note that you very likely have a SASHELP.ZIPCODE data set. The STATE combined with the COUNTY code value: CountyFips= cats(put(state,z2.), put(county,z3.)); would create the 5-digit FIPS code in a data step or
cats(put(state,z2.), put(county,z3.)) as CountyFIPS in Proc SQL (likely use as you want distinct State and County values. Note that the county name is there as well though spelling sometimes isn't as locals use.
The SASHELP.ZIPCODE data set is updated by SAS periodically, 3 times in 2024 so far.
So a list of code should be easy enough to generate from that set.
If you don't have the data set you can get it with a valid SAS site code and your SAS Community login information. This message has a link you can follow to get it: https://communities.sas.com/t5/SAS-Hot-Fix-Announcements/Update-available-for-ZIPCODE-data-set-Augus... You can see the updates made available since 2019. They are listed in Year-MON order so the latest is currently 3 from the bottom as AUG comes before FEB and MAY.
In fact if done right you would have a the count to create the limits of the Count variables provided AND code that could assign the FIPS code values as the LABELs for variables such as the proposed Count1- CountXXX using that information.
Agreed that counties don't change very often ... Connecticut was the last big one of which I am aware. Thanks for pointing out that ZIP file -- always good to have something like that on tap.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.