Solved: Re: Combine Rows into New column and Keep Original Layout

jimbobob · Posted 01-21-2022 05:48 PM

I know there is data steps/arrays that will combine multiple rows into one row, I want to do that and keep the original table layout, and for the total column keep max value. For example;

Original Table:

Key	ID	NAME	TOTAL
A	262P00050013	Bob	15012.32
A	262P00050014	Joe	24484.12
A	262P00050015	Sally	18988.43

New Table:

Key	ID	NAME	TOTAL	ID_COMBINED	NAME_COMBINED	Max
A	262P00050013	Bob	15012.32	262P00050013, 262P00050014, 262P00050015	Bob, Joe, Sally	24484.12
A	262P00050014	Joe	24484.12	262P00050013, 262P00050014, 262P00050015	Bob, Joe, Sally	24484.12
A	262P00050015	Sally	18988.43	262P00050013, 262P00050014, 262P00050015	Bob, Joe, Sally	24484.12

Any help would be appreciated.

mkeintz · Posted 01-21-2022 09:47 PM

Yes, you have to read all obs for each key twice, once to build the MAX and COMBINED variables, and again to re-read (and output) each obs with the new variables added:

data have;
  input Key	$1. ID :$12.	NAME :$6.	TOTAL ;
datalines;
A	262P00050013	Bob	15012.32
A	262P00050014	Joe	24484.12
A	262P00050015	Sally	18988.43
run;

data want;
  if 0 then set have;  /*Insures that original vars in HAVE will be the leftmost vars in WANT*/

  /* Now read all obs for a key create the new variables for it */
  length id_combined $80 name_combined $60 ;
  do until (last.key);
    set have;
    by key;
    id_combined=catx(',',id_combined,id);
    name_combined=catx(',',name_combined,name);
    max=max(total,max);
  end;

  /* With new vars in hand, re-read old vars for the key and output each obs */
  do until (last.key);
    set have;
    by key;
    output;
  end;
run;

As @Reeza said, you have to read each key twice (the data are sorted by key, right?). But this approach does not create and write out a separate table/dataset with the combined variables. So if your dataset is large, you'll avoid writing out an intermediate dataset only to read it back in from disk for merging purposes. The complexity of the above does have its benefits.

You will need to know the lengths needed for id_combined and name_combined. For example, if the maximum number of observations possible for a key is (say) 10, then the length needed for ID_COMBINED would be 10*12 (each id is $12) plus 9*1 (9 commas) ==> $129 minimum length. Same concept for name_combined.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

Reeza · Posted 01-21-2022 07:43 PM

Create the ID combined in a separate table and merge it back with the main table.

A DoW loop is the only other way to do this but it’s also still technically two passes of the data so same amount of work as the merge and the merge is infinitely easier to understand.

jimbobob · Posted 01-25-2022 11:14 AM

Thanks Reeza.

mkeintz · Posted 01-21-2022 09:47 PM

Yes, you have to read all obs for each key twice, once to build the MAX and COMBINED variables, and again to re-read (and output) each obs with the new variables added:

data have;
  input Key	$1. ID :$12.	NAME :$6.	TOTAL ;
datalines;
A	262P00050013	Bob	15012.32
A	262P00050014	Joe	24484.12
A	262P00050015	Sally	18988.43
run;

data want;
  if 0 then set have;  /*Insures that original vars in HAVE will be the leftmost vars in WANT*/

  /* Now read all obs for a key create the new variables for it */
  length id_combined $80 name_combined $60 ;
  do until (last.key);
    set have;
    by key;
    id_combined=catx(',',id_combined,id);
    name_combined=catx(',',name_combined,name);
    max=max(total,max);
  end;

  /* With new vars in hand, re-read old vars for the key and output each obs */
  do until (last.key);
    set have;
    by key;
    output;
  end;
run;

As @Reeza said, you have to read each key twice (the data are sorted by key, right?). But this approach does not create and write out a separate table/dataset with the combined variables. So if your dataset is large, you'll avoid writing out an intermediate dataset only to read it back in from disk for merging purposes. The complexity of the above does have its benefits.

You will need to know the lengths needed for id_combined and name_combined. For example, if the maximum number of observations possible for a key is (say) 10, then the length needed for ID_COMBINED would be 10*12 (each id is $12) plus 9*1 (9 commas) ==> $129 minimum length. Same concept for name_combined.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

jimbobob · Posted 01-25-2022 11:13 AM

Thanks Mkeintz, this works great.

Combine Rows into New column and Keep Original Layout

Re: Combine Rows into New column and Keep Original Layout

Re: Combine Rows into New column and Keep Original Layout

Re: Combine Rows into New column and Keep Original Layout

Re: Combine Rows into New column and Keep Original Layout

Re: Combine Rows into New column and Keep Original Layout

Registration is open

SAS Training: Just a Click Away