cghost
Obsidian | Level 7

I have a number of different tables containing information on college students, such as their GPA at the end of each year, their major each year, amounts of different kinds of financial aid they took, cumulative credit hours, etc. Right now this exists in separate tables by topic (major and credits, financial aid, background info, etc.), with one row for each academic year.

 

I'd like to transpose every variable in some of these tables so that instead of having anywhere from 1-8 rows per student in each table (depending on how long they were in school), we have a set of columns for each academic year 1 through 8. 

 

So, for example, the data I'm starting with looks like this:

 

 

StudentID    AcademicYear    CumGPA    EdLevel      Major              AidAmount
1000         1               3.20      Freshman     BasketWeaving      $1,800
1002         1               2.77      Freshman     SpaceAccounting    $3,600
1002         2               2.93      Sophomore    SpaceAccounting    $933
1002         3               3.11      Junior       SpaceAccounting    $0
1002         4               3.03      Senior       SpaceAccounting    $1,200
1003         3               2.46      Junior       Journalism         $645
1003         4               2.75      Senior       Journalism         $4,300
1004         1               3.30      Freshman     CryptoZoology      $10,500
1004         2               2.73      Sophomore    CryptoZoology      $4,000
1004         3               2.34      Sophomore    Biology            $1,175

 

 

I want my output to look like this:

 

 

StudentID  CumGPA_1  EdLevel_1  Major_1          AidAmount_1  CumGPA_2  EdLevel_2  Major_2          AidAmount_2  CumGPA_3  EdLevel_3  Major_3          AidAmount_3  CumGPA_4  EdLevel_4  Major_4          AidAmount_4
1000       3.20      Freshman   BasketWeaving    $1,800
1002       2.77      Freshman   SpaceAccounting  $3,600       2.93      Sophomore  SpaceAccounting  $933         3.11      Junior     SpaceAccounting  $0           3.03      Senior     SpaceAccounting  $1,200
1003                                                                                                             2.46      Junior     Journalism       $645         2.75      Senior     Journalism       $4,300
1004       3.30      Freshman   CryptoZoology    $10,500      2.73      Sophomore  CryptoZoology    $4,000       2.34      Sophomore  Biology          $1,175

 

The only way I can think of to do this with proc transpose involves rotating each of the above variables into new datasets individually (by StudentID with id AcademicYear), and then merging them all afterwards. I don't find that to be very practical given that I would have to do this for dozens of variables, and that we're expecting to receive additional data later in the year that will require everything to be done a second time.

 

I presume there are ways to do this with arrays in a data step, but the examples of array code I have found usually make the opposite transformation (from wide to long) and don't explain the steps very well. Would someone be able to help me with this, or point me to resources that outline the steps and functions I would need, in a concise way?

 

14 REPLIES
ballardw
Super User

First, what exactly will you do with the transposed data?

The reason I ask is partially because of your statement:

" and that we're expecting to receive additional data later in the year that will require everything to be done a second time."

 

Any code you write to handle 8 sets of variables will likely have multiple places where you have to then handle 9 (or 10 or whatever).

And if you are doing this for the current year what about next year when you add yet more variables?

 

If the purpose is to have a one-line per StudentId report then Proc Report or Tabulate will do that directly with the current data format and would not require any additional coding with added data.

 

 

cghost
Obsidian | Level 7
The transposed data will be cleaned and then passed on to academic researchers for secondary use. We need the transposition because flat-file, single-row-per-observation datasets are the norm for us.

The reason I want to avoid transposing each variable individually is that it sounds like it would create a great deal of additional code to revise if we get additional variables to add in later. I suppose I could build a macro that would loop transpose over a list of variables and merge the results, but I was hoping there would be a more straightforward solution.

If Proc Report or Proc Tabulate can produce an output dataset that looks like the above, then I'd be interested to hear more.
Reeza
Super User

@cghost wrote:
The transposed data will be cleaned and then passed on to academic researchers for secondary use. We need the transposition because flat-file, single-row-per-observation datasets are the norm for us.



 

Talk to your researchers. This doesn't follow the tidy principles of data management and I HIGHLY suspect one of the first steps they do is normalize the data.

cghost
Obsidian | Level 7

Unfortunately, the researchers are actually the people who asked me to do this.

Reeza
Super User

I'm screaming on the inside*, but I get where you're coming from.

 

If it was me and I had to do this, I'd 

1. create a macro that flips each variable. This would be a PROC TRANSPOSE essentially.

2. Create a data set with each variable that needs to be included 

3. Call #1 based on data from #2 - Call Execute

4. Because you have #2 you can also use that to drive the final merge so it's fully automated

 

The next time you need to do this, since you likely provide data to researchers multiple times you would change #2 and the rest of the program would be the same. I'm not sure that's concise, but it would be dynamic and automated.
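For illustration only, here is the same four-step idea sketched in pandas rather than SAS (the real version would be a PROC TRANSPOSE macro driven by CALL EXECUTE; the data frame and the variable list below are made-up stand-ins):

```python
# Sketch of steps 1-4 above in pandas, for illustration only; the real
# version would be a PROC TRANSPOSE macro driven by CALL EXECUTE. The
# data frame and the variable list are made-up stand-ins.
import pandas as pd

long_df = pd.DataFrame({
    "StudentID":    [1002, 1002, 1003],
    "AcademicYear": [1, 2, 1],
    "CumGPA":       [2.77, 2.93, 2.46],
    "Major":        ["SpaceAccounting", "SpaceAccounting", "Journalism"],
})

# Step 2: a data set (here, a plain list) of the variables to flip.
vars_to_flip = ["CumGPA", "Major"]

# Steps 1 and 3: flip each variable into its own wide table.
wide_pieces = []
for var in vars_to_flip:
    piece = long_df.pivot(index="StudentID", columns="AcademicYear", values=var)
    piece.columns = [f"{var}_{yr}" for yr in piece.columns]
    wide_pieces.append(piece)

# Step 4: the same list drives the final merge, so adding a variable
# later only means changing vars_to_flip.
wide_df = pd.concat(wide_pieces, axis=1).reset_index()
print(wide_df.columns.tolist())
# -> ['StudentID', 'CumGPA_1', 'CumGPA_2', 'Major_1', 'Major_2']
```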

 

*Rant: Researchers who can't handle data management are more likely to make assumptions to simplify the analysis and end up doing bad research, and I've seen so much of this that it frustrates me a lot. Yes, I'm generalizing, and a lot are good once they get the data into SPSS...but a lot aren't. /Rant.

Reeza
Super User
There's also this macro from some users on here that may do exactly what you want already:
http://support.sas.com/resources/papers/proceedings13/538-2013.pdf
Reeza
Super User

In my experience, this isn't a good storage mechanism for data. It makes it harder to work with in the long run.

 

Here's a tutorial on how to do this via a data step. I'm not sure it's concise but again, this isn't something I'd recommend in the first place. Long data is easier to work with.

cghost
Obsidian | Level 7

The dataset won't be growing (it's capped after 2017) but the number of variables included may. 

 

I don't see the tutorial you mention. Am I missing a link somewhere?

ballardw
Super User

I might be tempted to see if this is acceptable:

data have;
informat StudentID $5.   AcademicYear best5.    CumGPA best5.    EdLevel $10.     Major  $25.   AidAmount Dollar8.;
input StudentID    AcademicYear    CumGPA    EdLevel      Major    AidAmount;
datalines;
1000            1             3.20      Freshman     BasketWeaving      $1,800
1002            1             2.77      Freshman     SpaceAccounting    $3,600
1002            2             2.93      Sophomore    SpaceAccounting    $933
1002            3             3.11      Junior       SpaceAccounting    $0
1002            4             3.03      Senior       SpaceAccounting    $1,200
1003            3             2.46      Junior       Journalism         $645
1003            4             2.75      Senior       Journalism         $4,300
1004            1             3.30      Freshman     CryptoZoology      $10,500
1004            2             2.73      Sophomore    CryptoZoology      $4,000
1004            3             2.34      Sophomore    Biology            $1,175
;
run;

ods csv file='c:\path\junk.csv';
options missing=' ';
proc report data = have;
   columns studentid academicyear,( cumgpa edlevel major aidamount);
   define studentid / group;
   define academicyear / across;
run;
options missing='.';
ods csv close; 
   

The csv file will have the data in one row per student, though the actual data will start on row 3. This lets the analysts pick the variable names when they read the data and simplifies the creation; since the number of columns will change, they would have to adjust the import anyway, unless you are supposed to provide a SAS data set (not explicitly stated). If you are sending them a data set, I second @Reeza's concerns about an analyst that can't reshape data or use it in the current form.

 

Ksharp
Super User
Try MERGE skill.



data have;
input StudentID    AcademicYear    CumGPA   ( EdLevel      Major   ) (: $20.)          AidAmount : comma32.;
cards;
1000            1             3.20      Freshman     BasketWeaving      $1,800
1002            1             2.77      Freshman     SpaceAccounting    $3,600
1002            2             2.93      Sophomore    SpaceAccounting    $933
1002            3             3.11      Junior       SpaceAccounting    $0
1002            4             3.03      Senior       SpaceAccounting    $1,200
1003            3             2.46      Junior       Journalism         $645
1003            4             2.75      Senior       Journalism         $4,300
1004            1             3.30      Freshman     CryptoZoology      $10,500
1004            2             2.73      Sophomore    CryptoZoology      $4,000
1004            3             2.34      Sophomore    Biology            $1,175
;
run;
proc sql;
create table x as
select distinct AcademicYear from have;
quit;

data _null_;
 set x end=last;
 if _n_=1 then call execute('data want;merge ');
 call execute(catt('have(where=(AcademicYear=',AcademicYear,')
rename=(CumGPA=CumGPA',AcademicYear
,' EdLevel=EdLevel',AcademicYear
,' Major=Major',AcademicYear,' AidAmount=AidAmount',AcademicYear,' ))'));
if last then call execute(';by StudentID;drop AcademicYear;run;');
run;
            



Ron_MacroMaven
Lapis Lazuli | Level 10

I see a number of choices in your problem statement.

 

My perspective is shown in

Database Vocabulary: Is Your Data Set a Dimension (LookUp) Table, a Fact Table or a Report?

 

My commentary has two parts: example, and critique of your problem

 

* example of database transaction table, snapshots: periodic(monthly) and accumulating

 

1. transaction table, of credit card purchases

 

DATA library.transactions_data_structure;
     attrib id       length = 8
            datetime length = 8
            vendor   length = $16
            amount   length = 8;
     stop;

DATA transactions_2017;
     if 0 then set library.transactions_data_structure;
     infile datalines;
     datalines;
...
;

2. summary to grain = monthly

macro or %include?

 

 

PROC summary data = transactions_&year
                    (where = (&date_start <= datetime <= &date_end));
     *out = library.snapshot_accum_&year._month;

This paper

Macro_Loops_with_Dates has code to fetch the date_* information into macro variables.

 

3. accumulating snapshot, which I suspect is kind of what you want.

DATA library.snapshot_accum_data_structure;
     if 0 then set library.transaction_data_structure(keep = id);
     attrib sum_01 length = 8
            ...
            sum_12 length = 8;
     stop;

DATA library.snapshot_accum_&year;
     if 0 then set library.snapshot_accum_data_structure;
     update library.snapshot_accum_&year
            library.snapshot_accum_&year._&month;
     by id;

summary: database theory is always about data structure and what it is used for

 

* transaction history is only appended to.

 

* snapshot-periodic (grain in(daily, weekly, monthly, quarter, annual))

is a summation of transaction history, between two date(-times)

run once after grain is closed

 

* snapshot-accumulating for the annual budget summary shown above

is run after the periodic-snapshot is complete

may be, probably is, in same program.

 

** my commentary on your problem.

your description kinda/sorta fits the model of snapshot-accumulating.

 

The choices I see for you are:

* I suggest two fields: year and AcademicYear

I know you have several examples of people who were sophomores for many years.

 

* naming conventions of variables:

cum_GPA_year_2016

or cum_GPA_year_1

 

* text fields EdLevel and Major ought to be reduced to integers

the last thing you want to be doing is editing or, worse, parsing text for information

 

last advice:

design the data structure of your last report first

and move back up the line, with the example theory shown above.

 

hth

Ron Fehd  recipe maven

 

art297
Opal | Level 21

Sorry to be so late to the party, but I just noticed @Reeza's post mentioning the macro that a group of us wrote in 2013.

 

I just updated that macro this afternoon. The latest version can be found at: http://www.sascommunity.org/wiki/A_Better_Way_to_Flip_(Transpose)_a_SAS_Dataset

 

I strongly recommend downloading the file from that site because trying to copy code from a pdf file is fraught with all kinds of problems.

 

The code runs 50 times or more faster than PROC TRANSPOSE, requires one to type a lot less, and does a lot of things that PROC TRANSPOSE doesn't do. It was designed, specifically, to make a wide file wider. What it does, additionally, is create transposed variables that maintain the original variables' types, lengths, formats and labels.

 

Art, CEO, AnalystFinder.com

 

deng47
Calcite | Level 5

Try PROC TRANSPOSE twice.

For example:

 

proc transpose data=aa out=bb;
   by studentid academicyear;
   var cumgpa edlevel major aidamount;
run;

proc transpose data=bb out=cc(drop=_name_) delimiter=_;
   by studentid;
   var col1;
   id _name_ academicyear;
run;
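For comparison only (in pandas, not SAS, and with made-up toy data), the same wide result can be sketched with a single pivot, which builds the variable-by-year columns that the second transpose's ID statement produces:

```python
# Toy long-format data, same shape as dataset aa above (made up).
import pandas as pd

aa = pd.DataFrame({
    "studentid":    [1003, 1003],
    "academicyear": [3, 4],
    "cumgpa":       [2.46, 2.75],
    "major":        ["Journalism", "Journalism"],
})

# One pivot creates a (variable, year) column pair for every combination,
# which we then flatten into names like cumgpa_3.
cc = aa.pivot(index="studentid", columns="academicyear",
              values=["cumgpa", "major"])
cc.columns = [f"{var}_{yr}" for var, yr in cc.columns]
cc = cc.reset_index()
print(cc.columns.tolist())
# -> ['studentid', 'cumgpa_3', 'cumgpa_4', 'major_3', 'major_4']
```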

 


Discussion stats
  • 14 replies
  • 7763 views
  • 11 likes
  • 7 in conversation