Solved: Re: create macro variable from data field

dwaldo · Posted 09-05-2024 05:22 PM

I have on the order of 60 million records, each with 12 monthly fields. I would like to accumulate counts across these fields, but their contents vary from record to record -- for example

id f1 f2 f3 ...

1 A B C

2 E G A

3 C C C

etc.

my hope was to create a series of variables called n_A, n_B, n_c, n_e, etc and then do something like

data counts;

set big_honker (keep=f1-f12) end=eof;

array f {12} f1-f12;

do i=1 to 12;

%let vbl=%magic(f{i});

&vbl+1;

end;

if eof then output;

drop f1-f12;

run;

where the %magic takes the value x in each of those monthly fields and creates the appropriate accumulator n_x.

But I'll be darned if I can figure it out.

Help, please!

mkeintz · Posted 09-05-2024 11:35 PM

There is a data step way:

data _null_;
  set have end=end_of_have;
  length value $10 freq 8;
  if _n_=1 then do;
    declare hash h (ordered:'a');
      h.definekey('value');
      h.definedata('value','freq');
      h.definedone();
  end;
  array month {12} f1-f12;
  do m=1 to 12;
    value=month{m};
    if h.find()^=0 then freq=1;
    else freq=freq+1;
    h.replace();
  end;
  if end_of_have then h.output(dataset:'want');
run;

or you could avoid constructing and using a hash. Just pipe the data (via a data set view) to proc FREQ, and let SAS do the counting:


data vneed / view=vneed;
  set  have ;
  length value $10;
  array month{12} f1-f12 ;
  do m=1 to 12;
    value=month{m};
    output;
  end;
run;

proc freq data=vneed noprint ;
  tables value / out=want (keep=value count);
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

mkeintz · Posted 09-05-2024 11:35 PM

There is a data step way:

data _null_;
  set have end=end_of_have;
  length value $10 freq 8;
  if _n_=1 then do;
    declare hash h (ordered:'a');
      h.definekey('value');
      h.definedata('value','freq');
      h.definedone();
  end;
  array month {12} f1-f12;
  do m=1 to 12;
    value=month{m};
    if h.find()^=0 then freq=1;
    else freq=freq+1;
    h.replace();
  end;
  if end_of_have then h.output(dataset:'want');
run;

or you could avoid constructing and using a hash. Just pipe the data (via a data set view) to proc FREQ, and let SAS do the counting:


data vneed / view=vneed;
  set  have ;
  length value $10;
  array month{12} f1-f12 ;
  do m=1 to 12;
    value=month{m};
    output;
  end;
run;

proc freq data=vneed noprint ;
  tables value / out=want (keep=value count);
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

dwaldo · Posted 09-06-2024 08:15 AM

Perfect! Thank you!

dwaldo · Posted 09-06-2024 05:03 PM

One thing that is a little weird: somehow this approach loses 1,800 person-months. But that's out of 818,600,000 million person-months, so I think I can live with the results in aggregate 🙂

ballardw · Posted 09-06-2024 05:34 PM

@dwaldo wrote:

One thing that is a little weird: somehow this approach loses 1,800 person-months. But that's out of 818,600,000 million person-months, so I think I can live with the results in aggregate 🙂

Hard to tell which approach you may be discussing as there isn't any quote or reference to which...

dwaldo · Posted 09-07-2024 10:36 AM

In my head I had it right ... I meant

data _null_;
  set have end=end_of_have;
  length value $10 freq 8;
  if _n_=1 then do;
    declare hash h (ordered:'a');
      h.definekey('value');
      h.definedata('value','freq');
      h.definedone();
  end;
  array month {12} f1-f12;
  do m=1 to 12;
    value=month{m};
    if h.find()^=0 then freq=1;
    else freq=freq+1;
    h.replace();
  end;
  if end_of_have then h.output(dataset:'want');
run;

andreas_lds · Posted 09-06-2024 02:16 AM

Maxim 19

Long beats wide.

(Don't keep data in structure)

In the world of spreadsheets, people tend to line up data side-by-side, and put data items (dates, categories, …) into column headers. This runs counter to all the methods available in SAS for group processing, and makes programming difficult, as one has variable column names and has to resort to creating dynamic code (with macros and/or call execute) where such is not necessary at all if categories were represented in their own column and data aligned vertically.
There are times where a wide format is needed, eg when preparing data for regression analysis. But for the processing and storing of data, long formats are always to be preferred.

Dynamic variable names force unnecessary dynamic code.

from https://communities.sas.com/t5/SAS-Communities-Library/Maxims-of-Maximally-Efficient-SAS-Programmers...

In this case you don't need data step at all:

proc transpose data= have out= trance(drop= _name_ rename=(col1 = f));
    by id;
    var f:;
run;

proc freq data= trance noprint;
    table f / out= want;
run;

Kurt_Bremser · Posted 09-06-2024 03:45 AM

What @andreas_lds said. Long Beats Wide.

Bad data structure leads to bad code.

Also see Maxim 33.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

dwaldo · Posted 09-06-2024 08:04 AM

Thanks to both of you for your comments; I agree that long beats wide and that tidy beats messy. And last night as I read my own post, I realized that macros are executed at compile-time execution and not at object-time.

However, I fear that I did not explain my problem very well. I was given a large file: 60-some million person-records, each with their county of residence in each of the 12 months of the year. My task is to count up person-months for each county. So transposing the existing dataset would not be a good thing.

The hit-it-with-a-stick approach would be to write out 12 records for each person and then summarize over counties; but I do not relish wrangling 720 million records. Another approach might be to create a monster SELECT(f{i}) "snippet" running from county 01000 to 99999, outputting at EOF and then transposing the results.

I was going for some kind of elegant run-time solution that points directly to the county counter that needs to be increased ... maybe a CALL function or something. But I also don't mind hitting something with a stick if it looks to be a one-time thing.

Tom · Posted 09-06-2024 12:37 PM

So just use the data step view.

Say you have data structured like:

data have;
  input id year (month1-month12) (& :$40.);
cards;
1 2023 .  .  .  .  NEW YORK  NEW YORK  NEW YORK  NEW YORK  BRONX  BRONX  BRONX  BRONX
1 2024 BRONX  BRONX  BRONX  .  .  .  .  .  .  .  .  .
2 2023 QUEENS  QUEENS  QUEENS  QUEENS  QUEENS  QUEENS  QUEENS  QUEENS  QUEENS  QUEENS  QUEENS  QUEENS
3 2024 RICHMOND  RICHMOND  RICHMOND  RICHMOND    .  .  .  .  .  .  .  .  .
;

Then you can convert it into a dataset (or to save disk space a view) that has ID, YEAR, MONTH and COUNTY instead.

data tall / view=tall;
  set have;
  array m month1-month12 ;
  do month=1 to 12;
    COUNTY=m[month];
    if not missing(county) then output;
  end;
  keep id year month county;
run;

Now you can summarize by any of your different dimensions.

proc summary data=tall missing;
  class year month county;
  output out=want;
run;

dwaldo · Posted 09-06-2024 02:26 PM

Thanks! That seems to be the most elemental solution -- one that first occurred to me and that runs overnight. It has definite curb appeal. I have a wealth of options from which to choose!

ballardw · Posted 09-06-2024 01:20 PM

@dwaldo wrote:

However, I fear that I did not explain my problem very well. I was given a large file: 60-some million person-records, each with their county of residence in each of the 12 months of the year. My task is to count up person-months for each county. So transposing the existing dataset would not be a good thing.

Why is transposing not a good thing?

Since transpose won't transpose any missing values you get one person per month with the value. Unless the person id is repeated (have you checked that yet?) then Transpose will have a data set that looks like

Person _name_of_month_variable col1 (or if you specify a prefix or id value a different name) with Col1 the value of the County.

Proc freq on the County variable and you have total person months per county.

Yes the data set is largish. But you are starting with large data.

IF the only thing you want is the count of the number of times a county value appears here is one way:

data have;
   input id f1 $ f2 $ f3 $    ;
datalines;
1 A B C
2 E G A
3 C C C
;

data want;
   set have end=lastone;
   array f(*) f1-f3;
   array t(7) $ 1 _temporary_ ('A','B','C','D','E','F','G');
   array c(*) count1-count7;
   retain count: ;
   do i= 1 to dim(f);
      if not missing(f[i]) then c[ whichc(f[i],of t(*)) ] + 1;
   end;
   if lastone then output;
   drop i f: id;
run;

The _temporary_ array would hold the values of the county name text -as it appears in the data-.

The C array would have the same the same number of elements as the number of counties in the temporary array.

This works by retaining the counter variables and incrementing them each time the value of a county is encountered. WHICHC searches for the first parameter (the monthly county variable value) in the remaining items, in this case an array holding all the names of the counties. The function returns the POSITION number in the searched list (or zero which is why we exclude the missing county if any). That returned value then increments the position counter by 1.

The data set option END is used so the only output written is when the last record is processed.

dwaldo · Posted 09-06-2024 02:28 PM

Thanks! I like the elegance of this ... but given that county FIPS code sets change from one year to the next, the code line

array t(7) $ 1 _temporary_ ('A','B','C','D','E','F','G');

could get cumbersome really fast.

I'll save this, though, for a time when there are fewer possible field values!

ballardw · Posted 09-06-2024 03:19 PM

@dwaldo wrote:

Thanks! I like the elegance of this ... but given that county FIPS code sets change from one year to the next, the code line
array t(7) $ 1 _temporary_ ('A','B','C','D','E','F','G');
could get cumbersome really fast.

I'll save this, though, for a time when there are fewer possible field values!

Counties don't change that often generally.My home state, Idaho, hasn't added a county since 1919. The newest county I find in a quick Google search dates from 2001. So changing county FIPS not an annual occurrence.

If your are using some other FIPS code that might be an issue.

Note that you very likely have a SASHELP.ZIPCODE data set. The STATE combined with the COUNTY code value: CountyFips= cats(put(state,z2.), put(county,z3.)); would create the 5-digit FIPS code in a data step or

cats(put(state,z2.), put(county,z3.)) as CountyFIPS in Proc SQL (likely use as you want distinct State and County values. Note that the county name is there as well though spelling sometimes isn't as locals use.

The SASHELP.ZIPCODE data set is updated by SAS periodically, 3 times in 2024 so far.

So a list of code should be easy enough to generate from that set.

If you don't have the data set you can get it with a valid SAS site code and your SAS Community login information. This message has a link you can follow to get it: https://communities.sas.com/t5/SAS-Hot-Fix-Announcements/Update-available-for-ZIPCODE-data-set-Augus... You can see the updates made available since 2019. They are listed in Year-MON order so the latest is currently 3 from the bottom as AUG comes before FEB and MAY.

In fact if done right you would have a the count to create the limits of the Count variables provided AND code that could assign the FIPS code values as the LABELs for variables such as the proposed Count1- CountXXX using that information.

dwaldo · Posted 09-06-2024 04:13 PM

Agreed that counties don't change very often ... Connecticut was the last big one of which I am aware. Thanks for pointing out that ZIP file -- always good to have something like that on tap.

Maxim 19

Long beats wide.

Registration is open

SAS Training: Just a Click Away