Hello,
Any advice is much appreciated! I need to transpose data to long-form. I am getting tripped up because I initially assumed that the data I am cleaning was wide format and that the proc transpose command would be straightforward, however I determined that the data is BOTH wide and long. As such, the proc transpose command is only partially functioning as I intended. As a fake example, the raw data file looks like this:
Row | CaseID | Person1 | Person2 | Person3 | Person 4 | Person 5 |
1 | 2024-000-01 | Sally | James | |||
2 | 2024-000-01 | Gerry | Natalie | |||
3 | 2024-000-01 | Sam | ||||
4 | 2024-000-02 | Annie | Holly | Jaxson | Ned | Clark |
5 | 2024-000-03 | Camille | Henry | |||
6 | 2024-000-04 | Alice | Chris | |||
7 | 2024-000-05 | Maria | George | |||
8 | 2024-000-05 | Ronnie | Brenda |
I was initially under the impression that everything would be structured as wide-form, like row #4, #5, and #6. However about 15% of the CaseIDs are entered the way rows #1, #2, #3, #7 and #8 are. To complicate matters, for cases that have split rows, the number of rows they occupy varies - there are between 2 and 12 rows per case with multiple persons in each row.
I am trying to clean the data so that it is long-form with one row per person name, like this:
Row | CaseID | Person |
1 | 2024-000-01 | Sally |
2 | 2024-000-01 | James |
3 | 2024-000-01 | Gerry |
4 | 2024-000-01 | Natalie |
5 | 2024-000-01 | Sam |
6 | 2024-000-02 | Annie |
7 | 2024-000-02 | Holly |
8 | 2024-000-02 | Jackson |
9 | 2024-000-02 | Ned |
10 | 2024-000-02 | Clark |
11 | 2024-000-03 | Camille |
12 | 2024-000-03 | Henry |
13 | 2024-000-04 | Alice |
14 | 2024-000-04 | Chris |
15 | 2024-000-05 | Maria |
16 | 2024-000-05 | George |
17 | 2024-000-05 | Ronnie |
18 | 2024-000-05 | Brenda |
Hello @MHines,
You can transpose the list of persons for each row and remove the observations with missing values with a WHERE= option:
data have;
infile cards truncover;
input Row CaseID :$11. (Person1-Person5) ($);
cards;
1 2024-000-01 Sally James
2 2024-000-01 Gerry Natalie
3 2024-000-01 Sam
4 2024-000-02 Annie Holly Jaxson Ned Clark
5 2024-000-03 Camille Henry
6 2024-000-04 Alice Chris
7 2024-000-05 Maria George
8 2024-000-05 Ronnie Brenda
;
proc transpose data=have out=want(drop=_: row rename=(col1=Person) where=(Person));
by Row CaseID;
var Person:;
run;
If variable Row is not contained in your dataset HAVE or should be contained in dataset WANT (with new values), it can be created easily.
Could you please post the example dataset as code, i.e. a DATA step with CARDS statement?
Also, can you please post the PROC TRANPOSE code you have tried, running on the example data, and describe what went wrong?
Hello,
Thanks for following up. I tried this:
proc transpose data=have out=want;
by CaseID;
var person1-5;
run;
The code is working for the cases that are not split across multiple rows, but only partially working for cases that are split across multiple rows. It looks like the table below. My goal is that all of the names are in the same column.
CaseID | _NAME_ | col1 | col2 | col3 |
2024-000-01 | Person1 | Sally | Gerry | Sam |
2024-000-01 | Person2 | James | Natalie | |
2024-000-01 | Person3 | |||
2024-000-01 | Person4 | |||
2024-000-01 | Person5 | |||
2024-000-02 | Person1 | Annie | ||
2024-000-02 | Person2 | Holly | ||
2024-000-02 | Person3 | Jaxson | ||
2024-000-02 | Person4 | Ned | ||
2024-000-02 | Person5 | Clark | ||
2024-000-03 | Person1 | Camille | ||
2024-000-03 | Person2 | Henry | ||
2024-000-03 | Person3 | |||
2024-000-03 | Person4 | |||
2024-000-03 | Person5 | |||
2024-000-04 | Person1 | Alice | ||
2024-000-04 | Person2 | Chris | ||
2024-000-04 | Person3 | |||
2024-000-04 | Person4 | |||
2024-000-04 | Person5 | |||
2024-000-05 | Person1 | Maria | Ronnie | |
2024-000-05 | Person2 | George | Brenda | |
2024-000-05 | Person3 | |||
2024-000-05 | Person4 | |||
2024-000-05 | Person5 |
I suppose I could always create a unique CaseID for each row, something like 2024-000-01-a, 2024-000-01-b, 2024-000-01-c etc. to transpose them that way, then convert the IDs back to the original and append the rows, but it seems like there are faster alternatives.
@MHines wrote:
I suppose I could always create a unique CaseID for each row, something like 2024-000-01-a, 2024-000-01-b, 2024-000-01-c etc. to transpose them that way, then convert the IDs back to the original and append the rows, but it seems like there are faster alternatives.
See @FreelanceReinh 's transpose answer. You can use ROW on your BY statement, you don't need to change the values of your CaseID. If you don't have ROW in your data, you can add it.
So assuming you do not actually have the ROW variable you displayed in your first listing you can easily add it. It does not need to depend on the CASEID to work. So something as simple as this should work.
data add_row;
row+1;
set have;
run;
proc transpose data=add_row out=want(rename=(col1=person)) name=place;
by row CaseID;
var person:;
run;
If the dataset is really large then make the ADD_ROW table a view instead.
data add_row / view=add_row;
row+1;
set have;
run;
Best would be to provide the example data in the form of a working data step.
And show the code you attempted that failed.
Duplicates of BY variables means that transpose will combine the first values into in one row, the second values into second row and so on.
So a data step may be the appropriate approach:
data have; input Row CaseID :$12. Person1:$10. Person2:$10. Person3 :$10. Person4:$10. Person5:$10.; datalines; 1 2024-000-01 Sally James . . . 2 2024-000-01 Gerry Natalie . . . 3 2024-000-01 Sam . . . . 4 2024-000-02 Annie Holly Jaxson Ned Clark 5 2024-000-03 Camille Henry . . . 6 2024-000-04 Alice Chris . . . 7 2024-000-05 Maria George . . . 8 2024-000-05 Ronnie Brenda . . . ; data want; set have; array p(*) Person1 - Person5; do i=1 to dim(p); Person= p[i]; if not missing(person) then output; end; keep caseid person; run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.