Re: Collapse ID while keeping the max value of the column

amarikow57 · Posted 02-07-2021 06:30 PM

Note: I am still learning a lot about SAS, so a somewhat detailed response would be incredibly helpful.

I have a fairly large dataset (109 columns) that I would like to merge with another dataset to complete the information. However, this dataset has multiple rows per ID while the one I'd like to merge it to only has one row per ID. I would like to get the long dataset down to one row per ID.

Because of how the long dataset was compiled there is a lot of "missing" information. That is, each ID has a row for different surveys that assessed different questions. See simplified example below:

ID SURVEY 1Q1 1Q2 1Q3 2Q1 2Q2 3Q1 3Q2 3Q3 3Q4 3Q5 4Q1 4Q2 4Q3 4Q4 5Q1 5Q2 5Q3

1 1 1 2 3 . . . . . . . . . . . . . .

1 2 . . . 8 0 . . . . . . . . . . . .

1 3 . . . . . 4 0 1 5 2 . . . . . . .

1 4 . . . . . . . . . . 1 4 2 1 . . .

1 5 . . . . . . . . . . . . . . 1 3 1

2 1 0 3 6 . . . . . . . . . . . . . .

2 2 . . . 7 1 . . . . . . . . . . . .

2 3 . . . . . 5 1 1 5 3 . . . . . . .

2 5 . . . . . . . . . . . . . . 0 2 0

3 1 1 3 3 . . . . . . . . . . . . . .

3 2 . . . 6 0 . . . . . . . . . . . .

3 3 . . . . . 3 1 0 5 1 . . . . . . .

3 4 . . . . . . . . . . 1 4 2 1 . . .

3 5 . . . . . . . . . . . . . . 0 3 1

4 1 1 1 5 . . . . . . . . . . . . . .

4 2 . . . 4 0 . . . . . . . . . . . .

4 3 . . . . . 6 0 1 5 2 . . . . . . .

5 1 0 3 4 . . . . . . . . . . . . . .

5 3 . . . . . 9 0 0 4 1 . . . . . . .

5 4 . . . . . . . . . . 1 4 2 0 . . .

5 5 . . . . . . . . . . . . . . 1 3 1

Essentially I want to collapse the IDs without losing valuable information. So my questions are as followed:

(1) Is there a SAS function that will allow me to keep one observation per ID where the max value is kept? I can get down to one ID with the following function:

DATA OUTPUT.uniqueID;
 update OUTPUT.allID(obs = 1) OUTPUT.allID;
 by AUTISM_ID;
RUN;

However, it seems to only keep the first value, when I'd rather it keep the max.

(2) If not possible, is there a way to repeat the value for the entire ID in an efficient manner? I have tried:

PROC SQL;
 create table OUTPUT.allID as 
 	select *, 
                max(AGE_DIAG_M) as AGE_DIAG_MONTHS,
 		max(AGE_DIAG_Y) as AGE_DIAG_YEARS,
 		max(CHILD_GENDER) as SEX,
 		max(A_TOTAL_ADI_W1) as TOTAL_A,
 		max(B_V_TOTAL_ADI_W1) as TOTAL_B_VERB,
 		max(B_NV_TOTAL_ADI_W1) as TOTAL_B_NONVERB,
 		max(C_TOTAL_ADI_W1) as TOTAL_C,
 		max(D_TOTAL_ADI_W1) as TOTAL_D
 from OUTPUT.age
 group by AUTISM_ID;
QUIT;

Although it works, it seems inefficient considering how many variables I have. Is there a way to create some sort of DO LOOP or MACRO function that will run through all the variables and do it for me?

I appreciate any feedback. Thanks in advance!

mkeintz · Posted 02-07-2021 08:35 PM

The data you show is in the form of a block diagonal for each ID. In such a case, the maximum value for each column is also the ONLY value in that column for each ID. So I don't see the problem in getting the max.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

amarikow57 · Posted 02-07-2021 11:19 PM

Ah. So (1) code should automatically keep the only value then? Okay. I guess I thought I had less missing data than I do. Thank you.

Reeza · Posted 02-07-2021 11:33 PM

If you find you do need to take the max, use PROC MEANS. And ideally, change your variable names, adding quotes and an N each time must be a PITA since those are not valid SAS variable names.

proc means data=have noprint;
class ID;
var "1Q1"n -- "5Q3"n;
output out=want max= / autoname;
run;

@amarikow57 wrote:

Note: I am still learning a lot about SAS, so a somewhat detailed response would be incredibly helpful.

I have a fairly large dataset (109 columns) that I would like to merge with another dataset to complete the information. However, this dataset has multiple rows per ID while the one I'd like to merge it to only has one row per ID. I would like to get the long dataset down to one row per ID.

Because of how the long dataset was compiled there is a lot of "missing" information. That is, each ID has a row for different surveys that assessed different questions. See simplified example below:

ID SURVEY 1Q1 1Q2 1Q3 2Q1 2Q2 3Q1 3Q2 3Q3 3Q4 3Q5 4Q1 4Q2 4Q3 4Q4 5Q1 5Q2 5Q3

1 1 1 2 3 . . . . . . . . . . . . . .

1 2 . . . 8 0 . . . . . . . . . . . .

1 3 . . . . . 4 0 1 5 2 . . . . . . .

1 4 . . . . . . . . . . 1 4 2 1 . . .

1 5 . . . . . . . . . . . . . . 1 3 1

2 1 0 3 6 . . . . . . . . . . . . . .

2 2 . . . 7 1 . . . . . . . . . . . .

2 3 . . . . . 5 1 1 5 3 . . . . . . .

2 5 . . . . . . . . . . . . . . 0 2 0

3 1 1 3 3 . . . . . . . . . . . . . .

3 2 . . . 6 0 . . . . . . . . . . . .

3 3 . . . . . 3 1 0 5 1 . . . . . . .

3 4 . . . . . . . . . . 1 4 2 1 . . .

3 5 . . . . . . . . . . . . . . 0 3 1

4 1 1 1 5 . . . . . . . . . . . . . .

4 2 . . . 4 0 . . . . . . . . . . . .

4 3 . . . . . 6 0 1 5 2 . . . . . . .

5 1 0 3 4 . . . . . . . . . . . . . .

5 3 . . . . . 9 0 0 4 1 . . . . . . .

5 4 . . . . . . . . . . 1 4 2 0 . . .

5 5 . . . . . . . . . . . . . . 1 3 1

Essentially I want to collapse the IDs without losing valuable information. So my questions are as followed:

(1) Is there a SAS function that will allow me to keep one observation per ID where the max value is kept? I can get down to one ID with the following function:
DATA OUTPUT.uniqueID;
 update OUTPUT.allID(obs = 1) OUTPUT.allID;
 by AUTISM_ID;
RUN; 
However, it seems to only keep the first value, when I'd rather it keep the max.

(2) If not possible, is there a way to repeat the value for the entire ID in an efficient manner? I have tried:
PROC SQL;
 create table OUTPUT.allID as 
 	select *, 
                max(AGE_DIAG_M) as AGE_DIAG_MONTHS,
 		max(AGE_DIAG_Y) as AGE_DIAG_YEARS,
 		max(CHILD_GENDER) as SEX,
 		max(A_TOTAL_ADI_W1) as TOTAL_A,
 		max(B_V_TOTAL_ADI_W1) as TOTAL_B_VERB,
 		max(B_NV_TOTAL_ADI_W1) as TOTAL_B_NONVERB,
 		max(C_TOTAL_ADI_W1) as TOTAL_C,
 		max(D_TOTAL_ADI_W1) as TOTAL_D
 from OUTPUT.age
 group by AUTISM_ID;
QUIT;
Although it works, it seems inefficient considering how many variables I have. Is there a way to create some sort of DO LOOP or MACRO function that will run through all the variables and do it for me?

I appreciate any feedback. Thanks in advance!

Kurt_Bremser · Posted 02-09-2021 04:19 AM

You suffer from a bad dataset structure, causing you to use name literals, waste space for missing values, and write bad code.

First, transpose your dataset to a long layout:

options validvarname=any;
data have;
input
  ID  SURVEY
  "1Q1"n "1Q2"n "1Q3"n "2Q1"n "2Q2"n "3Q1"n "3Q2"n "3Q3"n "3Q4"n "3Q5"n
  "4Q1"n "4Q2"n "4Q3"n "4Q4"n "5Q1"n "5Q2"n "5Q3"n
;
datalines;
1    1               1      2      3     .       .       .       .       .      .       .       .       .      .       .       .       .       .
1    2               .       .        .     8      0      .       .       .      .       .       .       .      .       .       .       .       .
1    3               .       .        .     .       .       4      0      1     5      2      .       .      .       .       .       .       .
1    4               .       .        .     .       .       .       .       .      .       .       1      4     2      1      .       .       .
1    5               .       .        .     .       .       .       .       .      .       .       .       .      .       .       1      3      1
2    1               0      3      6     .       .       .       .       .      .       .       .       .      .       .       .       .       .
2    2               .       .        .     7      1      .       .       .      .       .       .       .      .       .       .       .       .
2    3               .       .        .     .       .       5      1      1     5      3      .       .      .       .       .       .       .
2    5               .       .        .     .       .       .       .       .      .       .       .       .      .       .       0      2      0
3   1               1      3      3     .       .       .       .       .      .       .       .       .      .       .       .       .       .
3    2               .       .        .     6      0      .       .       .      .       .       .       .      .       .       .       .       .
3    3               .       .        .     .       .       3      1      0     5      1      .       .      .       .       .       .       .
3    4               .       .        .     .       .       .       .       .      .       .       1      4     2      1      .       .       .
3    5               .       .        .     .       .       .       .       .      .       .       .       .      .       .       0      3      1
4    1               1      1      5     .       .       .       .       .      .       .       .       .      .       .       .       .       .
4    2               .       .        .     4      0      .       .       .      .       .       .       .      .       .       .       .       .
4    3               .       .        .     .       .       6      0      1     5      2      .       .      .       .       .       .       .
5    1               0      3      4     .       .       .       .       .      .       .       .       .      .       .       .       .       .
5    3               .       .        .     .       .       9      0      0     4      1      .       .      .       .       .       .       .
5    4               .       .        .     .       .       .       .       .      .       .       1      4     2      0      .       .       .
5    5               .       .        .     .       .       .       .       .      .       .       .       .      .       .       1      3      1
;

proc transpose
  data=have
  out=l1 (
    rename=(col1=answer)
    where=(answer ne .)
  )
;
by id survey;
var "1Q1"n--"5Q3"n;
run;

data long;
set l1;
surv = input(scan(_name_,1,"Q"),best.);
question = input(scan(_name_,2,"Q"),best.);
if surv ne survey then putlog "ERROR";
keep id survey question answer;
run;

In the resulting dataset, it is easy to group by (or select for) ID's, surveys, questions.

For reporting purposes, you can use the question as an ACROSS variable in proc report to create a wide layout.

For processing, a long layout is always preferred.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Collapse ID while keeping the max value of the column