data have;
input ID timepoint age education$ gender$ survey1 survey2 survey3;
datalines;
1 1 23 BA M 3 8 9
1 2 23 BA M 8 9 6
1 3 23 BA M 6 8 0
2 1 24 MA F 4 5 9
2 2 24 MA F 0 7 7
2 3 24 MA F 2 0 7
;
run;
Tranposed to following:
ID | age | education | gender | survey1_1 | survey1_2 | survey1_3 | survey2_1 | survey2_2 | survey2_3 | survey3_1 | survey3_2 | survey3_3 |
1 | 23 | BA | M | 30 | 8 | 6 | 8 | 9 | 8 | 9 | 6 | 0 |
2 | 24 | MA | F | 4 | 5 | 9 | 0 | 7 | 7 | 2 | 0 | 7 |
Creating this type of wide data set is easy, but programming with it will be a nightmare. Long data sets are much easier to handle, and you would be wise not to go down this path.
Please explain what analyses you are planning to do with this data, this will indicate whether or not long is a better format (a common situation), or wide is a better format (very uncommon)
@PaigeMiller For this I need it to be one ID per row. How do I tackle this?
Why do you need this to be one row per id?
@Nrjn7 wrote:
@PaigeMiller For this I need it to be one ID per row. How do I tackle this?
I think my point is that you may think you need one ID per row, but more experience SAS programmers would not go down this route, and would keep the data set long, and in the end, do much less programming to get to the end goal. Again, we really need to know what analysis you are going to do from this data in order to be sure, you need to state the end goal, and not intermediate goals such as one data set per row.
Also, just to be clear, there are indeed situations where you need one ID per row, these are relatively rare and usually involve fitting some form of statistical model. Which is why we want to know what your end goal is. What analysis will you do with this data?
@PaigeMiller I understand that the experienced user may not go this route. My case is indeed different.
The goal is to look at my repeated instruments of surveys and look at how the participants are making improvements. It would make much sense to look at this longitudinal data as a single row that have them as the way it was collected. I see the the posts are helpful but still struggling with how I'd want them to display.
Any way thanks for the help.
In my experience, following the development of values is much easier done in a longitudinal dataset.
Do all participants always have the same number of observations, and is that number known beforehand?
@Nrjn7 wrote:
@PaigeMiller I understand that the experienced user may not go this route. My case is indeed different.
The goal is to look at my repeated instruments of surveys and look at how the participants are making improvements. It would make much sense to look at this longitudinal data as a single row that have them as the way it was collected. I see the the posts are helpful but still struggling with how I'd want them to display.
Any way thanks for the help.
There's absolutely no need to have one row per ID to calculate if participants are making improvements. Of course, you don't say exactly what "making improvements" means, but procedures such as PROC SUMMARY (and many other PROCs) can calculate means and standard deviations (and many other statistics) for different participants and time points, from a LONG data set. By doing the work to convert this to a wide data set, you are doing unnecessary work (because PROC SUMMARY and many other PROCs work on LONG data sets) and making your programming more difficult.
Converting is simple and there are many examples on the Web and this site.
On complication with your data is you are starting in a middle state between truly LONG and truly WIDE.
For a simple case like this you could use a DATA step and ARRAY statements to collapse the observations. Looks like TIMEPOINT is the index into your arrays. The dimensions on the arrays should match the maximum value of TIMEPOINT.
data want ;
do until (last.gender) ;
set have;
by ID age education gender;
array survey1_ [3] ;
array survey2_ [3] ;
array survey3_ [3] ;
survey1_[timepoint] = survey1;
survey2_[timepoint] = survey2;
survey3_[timepoint] = survey3;
end;
drop timepoint survey1-survey3;
run;
You could potentially add more arrays and loops to handle survey1 to 3, but with N = 3 it is not really that really worth the effort.
For a more general solution first convert your semi-long structure to LONG. And then convert that to WIDE. The SORT step is just to try to set the order the variables are created in the final dataset.
proc transpose data=have out=tall name=question;
by ID age education gender timepoint ;
var survey1-survey3 ;
run;
proc sort data=tall;
by ID age education gender question timepoint ;
run;
proc transpose data=tall out=want2(drop=_name_) delim=_ ;
by ID age education gender;
id question timepoint;
var col1;
run;
Both methods produce the same result.
The COMPARE Procedure Comparison of WORK.WANT with WORK.WANT2 (Method=EXACT) Data Set Summary Dataset Created Modified NVar NObs WORK.WANT 22MAY21:10:55:00 22MAY21:10:55:00 13 2 WORK.WANT2 22MAY21:10:55:00 22MAY21:10:55:00 13 2 Variables Summary Number of Variables in Common: 13. Observation Summary Observation Base Compare First Obs 1 1 Last Obs 2 2 Number of Observations in Common: 2. Total Number of Observations Read from WORK.WANT: 2. Total Number of Observations Read from WORK.WANT2: 2. Number of Observations with Some Compared Variables Unequal: 0. Number of Observations with All Compared Variables Equal: 2. NOTE: No unequal values were found. All values compared are exactly equal.
%macro transp;
%do i=1 %to 3;
proc transpose prefix=survey&i._ data=have out=want&i;
by id age education gender;
var survey&i;
run;
%end;
data want (drop=_name_);
merge
%do i=1 %to 3;
want&i
%end;
;
by id age education gender;
run;
proc print;
id id;
run;
%mend;
%transp;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.