I have a number of different tables containing information on college students, such as their GPA at the end of each year, their major each year, amounts of different kinds of financial aid they took, cumulative credit hours, etc. Right now this exists in separate tables by topic (major and credits, financial aid, background info, etc.), with one row for each academic year.
I'd like to transpose every variable in some of these tables so that instead of having anywhere from 1-8 rows per student in each table (depending on how long they were in school), we have a set of columns for each academic year 1 through 8.
So, for example, the data I'm starting with looks like this:
StudentID  AcademicYear  CumGPA  EdLevel    Major            AidAmount
1000       1             3.20    Freshman   BasketWeaving     $1,800
1002       1             2.77    Freshman   SpaceAccounting   $3,600
1002       2             2.93    Sophomore  SpaceAccounting     $933
1002       3             3.11    Junior     SpaceAccounting      $0
1002       4             3.03    Senior     SpaceAccounting   $1,200
1003       3             2.46    Junior     Journalism          $645
1003       4             2.75    Senior     Journalism        $4,300
1004       1             3.30    Freshman   CryptoZoology    $10,500
1004       2             2.73    Sophomore  CryptoZoology     $4,000
1004       3             2.34    Sophomore  Biology           $1,175
I want my output to look like this:
StudentID CumGPA_1 EdLevel_1 Major_1 AidAmount_1 CumGPA_2 EdLevel_2 Major_2 AidAmount_2 CumGPA_3 EdLevel_3 Major_3 AidAmount_3 CumGPA_4 EdLevel_4 Major_4 AidAmount_4
1000 3.20 Freshman BasketWeaving $1,800 . . . . . . . . . . . .
1002 2.77 Freshman SpaceAccounting $3,600 2.93 Sophomore SpaceAccounting $933 3.11 Junior SpaceAccounting $0 3.03 Senior SpaceAccounting $1,200
1003 . . . . . . . . 2.46 Junior Journalism $645 2.75 Senior Journalism $4,300
1004 3.30 Freshman CryptoZoology $10,500 2.73 Sophomore CryptoZoology $4,000 2.34 Sophomore Biology $1,175 . . . .
The only way I can think of to do this with proc transpose involves rotating each of the above variables into new datasets individually (by StudentID with id AcademicYear), and then merging them all afterwards. I don't find that to be very practical given that I would have to do this for dozens of variables, and that we're expecting to receive additional data later in the year that will require everything to be done a second time.
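To give a concrete idea of what I mean, it would be roughly something like this for each variable (HAVE standing in for one of my long tables), followed by a merge of all the outputs by StudentID:

proc transpose data=have out=gpa_wide (drop=_name_) prefix=CumGPA_;
   by StudentID;
   id AcademicYear;
   var CumGPA;
run;
/* ...then the same again for EdLevel, Major, AidAmount, and dozens of other variables */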
I presume there are ways to do this with arrays in a data step, but the examples of array code I have found usually make the opposite transformation (from wide to long) and don't explain the steps very well. Would someone be able to help me with this, or point me to resources that outline the steps and functions I would need to do this in a concise way?
First, what exactly will you do with the transposed data?
The reason I ask is partially because of your statement:
" and that we're expecting to receive additional data later in the year that will require everything to be done a second time."
Any code you write to handle 8 sets of variables will likely have multiple places that need changing when you then have to handle 9 (or 10, or whatever).
And if you are doing this for the current year what about next year when you add yet more variables?
If the purpose is to have a one-line-per-StudentID report, then Proc Report or Tabulate will do that directly with the current data format and would not require any additional coding when data is added.
The reason I want to avoid transposing each variable individually is that it sounds like it would create a great deal of additional code to revise if we get additional variables to add in later. I suppose I could build a macro that would loop transpose over a list of variables and merge the results, but I was hoping there would be a more straightforward solution.
If Proc Report or Proc Tabulate can produce an output dataset that looks like the above, then I'd be interested to hear more.
@cghost wrote:
The transposed data will be cleaned and then passed on to academic researchers for secondary use. We're transposing because flat-file, single-row-per-student datasets are the norm for us.
Talk to your researchers. This doesn't follow the tidy principles of data management and I HIGHLY suspect one of the first steps they do is normalize the data.
Unfortunately, the researchers are actually the people who asked me to do this.
I'm screaming on the inside*, but I get where you're coming from.
If it were me and I had to do this, I'd:
1. Create a macro that flips one variable. This would essentially be a PROC TRANSPOSE.
2. Create a data set listing each variable that needs to be included.
3. Call #1 once per row of #2, using CALL EXECUTE.
4. Because you have #2, you can also use it to drive the final merge so it's fully automated.
The next time you need to do this (and since you likely provide data to researchers multiple times, there will be a next time), you would change #2 and the rest of the program would stay the same (see the sketch below). I'm not sure that's concise, but it would be dynamic and automated.
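A rough, hypothetical sketch of those four steps (FLIP, VARLIST, HAVE and WIDE are all made-up names here; adjust to your actual data):

%macro flip(var);                                    /* step 1: flip one variable */
   proc transpose data=have out=_&var (drop=_name_) prefix=&var._;
      by StudentID;
      id AcademicYear;
      var &var;
   run;
%mend flip;

data varlist;                                        /* step 2: variables to include */
   input varname :$32.;
   datalines;
CumGPA
EdLevel
Major
AidAmount
;
run;

data _null_;                                         /* step 3: one macro call per row of VARLIST */
   set varlist;
   call execute(cats('%flip(', varname, ')'));
run;

proc sql noprint;                                    /* step 4: VARLIST also drives the merge */
   select cats('_', varname) into :mergelist separated by ' '
   from varlist;
quit;

data wide;
   merge &mergelist;
   by StudentID;
run;

When the variable list changes, you only edit step 2.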
*Rant: Researchers who can't handle data management are more likely to make assumptions that simplify the analysis and end up doing bad research; I've seen enough of this that it frustrates me a lot. Yes, I'm generalizing, and a lot of them are fine once they get the data into SPSS...but a lot aren't. /Rant
http://support.sas.com/resources/papers/proceedings13/538-2013.pdf
In my experience, this isn't a good storage mechanism for data. It makes it harder to work with in the long run.
Here's a tutorial on how to do this via a data step. I'm not sure it's concise but again, this isn't something I'd recommend in the first place. Long data is easier to work with.
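The core of the data step approach is a set of arrays indexed by AcademicYear inside a DOW loop. A minimal sketch, assuming the long table is called HAVE, it's sorted by StudentID and AcademicYear, and AcademicYear never exceeds 8:

data want;
   /* arrays indexed by AcademicYear; lengths follow your example data */
   array gpa {8}      CumGPA_1    - CumGPA_8;
   array lvl {8} $ 10 EdLevel_1   - EdLevel_8;
   array maj {8} $ 25 Major_1     - Major_8;
   array aid {8}      AidAmount_1 - AidAmount_8;

   /* read every row for one student, then write a single wide row */
   do until (last.StudentID);
      set have;
      by StudentID;
      gpa{AcademicYear} = CumGPA;
      lvl{AcademicYear} = EdLevel;
      maj{AcademicYear} = Major;
      aid{AcademicYear} = AidAmount;
   end;
   drop AcademicYear CumGPA EdLevel Major AidAmount;
run;

You'd still have to list every variable somewhere, which is part of why the long layout is easier to maintain.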
The dataset won't be growing (it's capped after 2017) but the number of variables included may.
I don't see the tutorial you mention. Am I missing a link somewhere?
I might be tempted to see if this is acceptable:
data have;
   informat StudentID $5. AcademicYear best5. CumGPA best5.
            EdLevel $10. Major $25. AidAmount dollar8.;
   input StudentID AcademicYear CumGPA EdLevel Major AidAmount;
datalines;
1000 1 3.20 Freshman BasketWeaving $1,800
1002 1 2.77 Freshman SpaceAccounting $3,600
1002 2 2.93 Sophomore SpaceAccounting $933
1002 3 3.11 Junior SpaceAccounting $0
1002 4 3.03 Senior SpaceAccounting $1,200
1003 3 2.46 Junior Journalism $645
1003 4 2.75 Senior Journalism $4,300
1004 1 3.30 Freshman CryptoZoology $10,500
1004 2 2.73 Sophomore CryptoZoology $4,000
1004 3 2.34 Sophomore Biology $1,175
;
run;

ods csv file='c:\path\junk.csv';
options missing=' ';

proc report data=have;
   columns studentid academicyear,(cumgpa edlevel major aidamount);
   define studentid / group;
   define academicyear / across;
run;

options missing='.';
ods csv close;
The csv file will have the data in one row per student, though the actual data will start on row 3. This lets the analysts pick the variable names when they read the data and simplifies creation on your end. Since the number of columns will change, they will have to adjust their import each time, unless you are supposed to provide a SAS data set (not explicitly stated). If you are sending them a data set, I second @Reeza's concerns about an analyst who can't reshape data or use it in its current form.
Try the MERGE technique:

data have;
   input StudentID AcademicYear CumGPA (EdLevel Major) (: $20.) AidAmount : comma32.;
cards;
1000 1 3.20 Freshman BasketWeaving $1,800
1002 1 2.77 Freshman SpaceAccounting $3,600
1002 2 2.93 Sophomore SpaceAccounting $933
1002 3 3.11 Junior SpaceAccounting $0
1002 4 3.03 Senior SpaceAccounting $1,200
1003 3 2.46 Junior Journalism $645
1003 4 2.75 Senior Journalism $4,300
1004 1 3.30 Freshman CryptoZoology $10,500
1004 2 2.73 Sophomore CryptoZoology $4,000
1004 3 2.34 Sophomore Biology $1,175
;
run;

proc sql;
   create table x as
   select distinct AcademicYear from have;
quit;

data _null_;
   set x end=last;
   if _n_=1 then call execute('data want;merge ');
   call execute(catt('have(where=(AcademicYear=',AcademicYear,
                     ') rename=(CumGPA=CumGPA',AcademicYear,
                     ' EdLevel=EdLevel',AcademicYear,
                     ' Major=Major',AcademicYear,
                     ' AidAmount=AidAmount',AcademicYear,' ))'));
   if last then call execute(';by StudentID;drop AcademicYear;run;');
run;
I see a number of choices in your problem statement.
My perspective is shown in
Database Vocabulary: Is Your Data Set a Dimension (LookUp) Table, a Fact Table, or a Report?
My commentary has two parts: an example, and a critique of your problem.
* example of database transaction table, snapshots: periodic(monthly) and accumulating
1. transaction table, of credit card purchases
DATA library.transactions_data_structure;
   attrib id       length = 8
          datetime length = 8
          vendor   length = $16
          amount   length = 8;
   stop;
run;
DATA transactions_2017;
   if 0 then set library.transactions_data_structure;
   infile datalines;
   input id datetime vendor amount;
   datalines;
...
;
run;
2. summary to grain = monthly
macro or %include?
PROC summary data = transactions_&year
             (where = (&date_start <= datetime <= &date_end));
   /* output out = library.snapshot_accum_&year._month ... */
run;
The paper Macro_Loops_with_Dates has code to fetch the date_* information into macro variables.
3. accumulating snapshot, which I suspect is kind of what you want.
DATA library.snapshot_accum_data_structure;
   if 0 then set library.transactions_data_structure (keep = id);
   attrib sum_01 length = 8 ... sum_12 length = 8;
   stop;
run;

DATA library.snapshot_accum_&year;
   if 0 then set library.snapshot_accum_data_structure;
   update library.snapshot_accum_&year
          library.snapshot_accum_&year._&month;
   by id;
run;
summary: database theory is always about data structure and what it is used for
* transaction history is only appended to.
* snapshot-periodic (grain in(daily, weekly, monthly, quarter, annual))
is a summation of transaction history, between two date(-times)
run once after grain is closed
* snapshot-accumulating, like the annual budget summary shown above,
is run after the periodic snapshot is complete,
and may be (probably is) in the same program.
** my commentary on your problem.
your description kinda/sorta fits the model of snapshot-accumulating.
The choices I see for you are:
* I suggest two fields year and AcademicYear
I know you have several examples of people who were sophomores for many years.
* naming conventions of variables:
cum_GPA_year_2016
or cum_GPA_year_1
* text fields EdLevel and Major ought to be reduced to integers (sketch below);
the last thing you want to be doing is editing, or worse, parsing, text to get at information.
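example: an input format (an untested sketch; the codes are arbitrary) to store EdLevel as an integer

proc format;
   invalue edlevel_code
      'Freshman'  = 1
      'Sophomore' = 2
      'Junior'    = 3
      'Senior'    = 4;
run;

data have_coded;
   set have;
   EdLevel_n = input(EdLevel, edlevel_code.);
run;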
last advice:
design the data structure of your last report first
and move back up the line, with the example theory shown above.
hth
Ron Fehd recipe maven
Sorry to be so late to the party, but I just noticed @Reeza's post mentioning the macro that a group of us wrote in 2013.
I just updated that macro this afternoon. The latest version can be found at: http://www.sascommunity.org/wiki/A_Better_Way_to_Flip_(Transpose)_a_SAS_Dataset
I strongly recommend downloading the file from that site because trying to copy code from a pdf file is fraught with all kinds of problems.
The code runs 50 times or more faster than PROC TRANSPOSE, requires one to type a lot less, and does a lot of things that PROC TRANSPOSE doesn't do. It was designed, specifically, to make a wide file wider. What it does, additionally, is create transposed variables that maintain the original variables' types, lengths, formats and labels.
Art, CEO, AnalystFinder.com
Try PROC TRANSPOSE twice.
For example:
proc transpose data=aa out=bb;
   by studentid academicyear;
   var cumgpa edlevel major aidamount;
run;

proc transpose data=bb out=cc(drop=_name_) delimiter=_;
   by studentid;
   var col1;
   id _name_ academicyear;
run;