I have a number of different tables containing information on college students, such as their GPA at the end of each year, their major each year, amounts of different kinds of financial aid they took, cumulative credit hours, etc. Right now this exists in separate tables by topic (major and credits, financial aid, background info, etc.), with one row for each academic year.
I'd like to transpose every variable in some of these tables so that instead of having anywhere from 1-8 rows per student in each table (depending on how long they were in school), we have a set of columns for each academic year 1 through 8.
So, for example, the data I'm starting with looks like this:
StudentID  AcademicYear  CumGPA  EdLevel    Major            AidAmount
1000       1             3.20    Freshman   BasketWeaving     $1,800
1002       1             2.77    Freshman   SpaceAccounting   $3,600
1002       2             2.93    Sophomore  SpaceAccounting     $933
1002       3             3.11    Junior     SpaceAccounting      $0
1002       4             3.03    Senior     SpaceAccounting   $1,200
1003       3             2.46    Junior     Journalism          $645
1003       4             2.75    Senior     Journalism        $4,300
1004       1             3.30    Freshman   CryptoZoology    $10,500
1004       2             2.73    Sophomore  CryptoZoology     $4,000
1004       3             2.34    Sophomore  Biology           $1,175
I want my output to look like this:
StudentID CumGPA_1 EdLevel_1 Major_1 AidAmount_1 CumGPA_2 EdLevel_2 Major_2 AidAmount_2 CumGPA_3 EdLevel_3 Major_3 AidAmount_3 CumGPA_4 EdLevel_4 Major_4 AidAmount_4
1000 3.20 Freshman BasketWeaving $1,800 . . . . . . . . . . . .
1002 2.77 Freshman SpaceAccounting $3,600 2.93 Sophomore SpaceAccounting $933 3.11 Junior SpaceAccounting $0 3.03 Senior SpaceAccounting $1,200
1003 . . . . . . . . 2.46 Junior Journalism $645 2.75 Senior Journalism $4,300
1004 3.30 Freshman CryptoZoology $10,500 2.73 Sophomore CryptoZoology $4,000 2.34 Sophomore Biology $1,175 . . . .
The only way I can think of to do this with proc transpose involves rotating each of the above variables into new datasets individually (by StudentID with id AcademicYear), and then merging them all afterwards. I don't find that to be very practical given that I would have to do this for dozens of variables, and that we're expecting to receive additional data later in the year that will require everything to be done a second time.
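To give a concrete idea of what I mean, it would be roughly something like this for each variable (HAVE standing in for one of my long tables), followed by a merge of all the outputs by StudentID:

proc transpose data=have out=gpa_wide (drop=_name_) prefix=CumGPA_;
   by StudentID;
   id AcademicYear;
   var CumGPA;
run;
/* ...then the same again for EdLevel, Major, AidAmount, and dozens of other variables */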
I presume there are ways to do this with arrays in a data step, but the examples of array code I have found usually make the opposite transformation (from wide to long) and don't explain the steps very well. Would someone be able to help me with this, or point me to resources that outline the steps and functions I would need to do this in a concise way?
First, what exactly will you do with the transposed data?
The reason I ask is partially because of your statement:
" and that we're expecting to receive additional data later in the year that will require everything to be done a second time."
Any code you write to handle 8 sets of variables will likely have multiple places that need changing when you then have to handle 9 (or 10, or whatever).
And if you are doing this for the current year what about next year when you add yet more variables?
If the purpose is to have a one-line-per-StudentID report, then Proc Report or Tabulate will do that directly with the current data format and would not require any additional coding when data is added.
The reason I want to avoid transposing each variable individually is that it sounds like it would create a great deal of additional code to revise if we get additional variables to add in later. I suppose I could build a macro that would loop transpose over a list of variables and merge the results, but I was hoping there would be a more straightforward solution.
If Proc Report or Proc Tabulate can produce an output dataset that looks like the above, then I'd be interested to hear more.
@cghost wrote:
The transposed data will be cleaned and then passed on to academic researchers for secondary use. We're transposing because flat-file, single-row-per-student datasets are the norm for us.
Talk to your researchers. This doesn't follow the tidy principles of data management and I HIGHLY suspect one of the first steps they do is normalize the data.
Unfortunately, the researchers are actually the people who asked me to do this.
I'm screaming on the inside*, but I get where you're coming from.
If it were me and I had to do this, I'd:
1. Create a macro that flips one variable. This would essentially be a PROC TRANSPOSE.
2. Create a data set listing each variable that needs to be included.
3. Call #1 once per row of #2, using CALL EXECUTE.
4. Because you have #2, you can also use it to drive the final merge so it's fully automated.
The next time you need to do this (and since you likely provide data to researchers multiple times, there will be a next time), you would change #2 and the rest of the program would stay the same (see the sketch below). I'm not sure that's concise, but it would be dynamic and automated.
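A rough, hypothetical sketch of those four steps (FLIP, VARLIST, HAVE and WIDE are all made-up names here; adjust to your actual data):

%macro flip(var);                                    /* step 1: flip one variable */
   proc transpose data=have out=_&var (drop=_name_) prefix=&var._;
      by StudentID;
      id AcademicYear;
      var &var;
   run;
%mend flip;

data varlist;                                        /* step 2: variables to include */
   input varname :$32.;
   datalines;
CumGPA
EdLevel
Major
AidAmount
;
run;

data _null_;                                         /* step 3: one macro call per row of VARLIST */
   set varlist;
   call execute(cats('%flip(', varname, ')'));
run;

proc sql noprint;                                    /* step 4: VARLIST also drives the merge */
   select cats('_', varname) into :mergelist separated by ' '
   from varlist;
quit;

data wide;
   merge &mergelist;
   by StudentID;
run;

When the variable list changes, you only edit step 2.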
*Rant: Researchers who can't handle data management are more likely to make assumptions that simplify the analysis and end up doing bad research; I've seen enough of this that it frustrates me a lot. Yes, I'm generalizing, and a lot of them are fine once they get the data into SPSS...but a lot aren't. /Rant
http://support.sas.com/resources/papers/proceedings13/538-2013.pdf
In my experience, this isn't a good storage mechanism for data. It makes it harder to work with in the long run.
Here's a tutorial on how to do this via a data step. I'm not sure it's concise but again, this isn't something I'd recommend in the first place. Long data is easier to work with.
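The core of the data step approach is a set of arrays indexed by AcademicYear inside a DOW loop. A minimal sketch, assuming the long table is called HAVE, it's sorted by StudentID and AcademicYear, and AcademicYear never exceeds 8:

data want;
   /* arrays indexed by AcademicYear; lengths follow your example data */
   array gpa {8}      CumGPA_1    - CumGPA_8;
   array lvl {8} $ 10 EdLevel_1   - EdLevel_8;
   array maj {8} $ 25 Major_1     - Major_8;
   array aid {8}      AidAmount_1 - AidAmount_8;

   /* read every row for one student, then write a single wide row */
   do until (last.StudentID);
      set have;
      by StudentID;
      gpa{AcademicYear} = CumGPA;
      lvl{AcademicYear} = EdLevel;
      maj{AcademicYear} = Major;
      aid{AcademicYear} = AidAmount;
   end;
   drop AcademicYear CumGPA EdLevel Major AidAmount;
run;

You'd still have to list every variable somewhere, which is part of why the long layout is easier to maintain.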
The dataset won't be growing (it's capped after 2017) but the number of variables included may.
I don't see the tutorial you mention. Am I missing a link somewhere?
I might be tempted to see if this is acceptable:
data have;
   informat StudentID $5. AcademicYear best5. CumGPA best5.
            EdLevel $10. Major $25. AidAmount dollar8.;
   input StudentID AcademicYear CumGPA EdLevel Major AidAmount;
datalines;
1000 1 3.20 Freshman BasketWeaving $1,800
1002 1 2.77 Freshman SpaceAccounting $3,600
1002 2 2.93 Sophomore SpaceAccounting $933
1002 3 3.11 Junior SpaceAccounting $0
1002 4 3.03 Senior SpaceAccounting $1,200
1003 3 2.46 Junior Journalism $645
1003 4 2.75 Senior Journalism $4,300
1004 1 3.30 Freshman CryptoZoology $10,500
1004 2 2.73 Sophomore CryptoZoology $4,000
1004 3 2.34 Sophomore Biology $1,175
;
run;

ods csv file='c:\path\junk.csv';
options missing=' ';

proc report data=have;
   columns studentid academicyear,(cumgpa edlevel major aidamount);
   define studentid / group;
   define academicyear / across;
run;

options missing='.';
ods csv close;
The csv file will have the data in one row per student, though the actual data will start on row 3. This lets the analysts pick the variable names when they read the data and simplifies creation on your end. Since the number of columns will change, they will have to adjust their import each time, unless you are supposed to provide a SAS data set (not explicitly stated). If you are sending them a data set, I second @Reeza's concerns about an analyst who can't reshape data or use it in its current form.
Try the MERGE technique:

data have;
   input StudentID AcademicYear CumGPA (EdLevel Major) (: $20.) AidAmount : comma32.;
cards;
1000 1 3.20 Freshman BasketWeaving $1,800
1002 1 2.77 Freshman SpaceAccounting $3,600
1002 2 2.93 Sophomore SpaceAccounting $933
1002 3 3.11 Junior SpaceAccounting $0
1002 4 3.03 Senior SpaceAccounting $1,200
1003 3 2.46 Junior Journalism $645
1003 4 2.75 Senior Journalism $4,300
1004 1 3.30 Freshman CryptoZoology $10,500
1004 2 2.73 Sophomore CryptoZoology $4,000
1004 3 2.34 Sophomore Biology $1,175
;
run;

proc sql;
   create table x as
   select distinct AcademicYear from have;
quit;

data _null_;
   set x end=last;
   if _n_=1 then call execute('data want;merge ');
   call execute(catt('have(where=(AcademicYear=',AcademicYear,
                     ') rename=(CumGPA=CumGPA',AcademicYear,
                     ' EdLevel=EdLevel',AcademicYear,
                     ' Major=Major',AcademicYear,
                     ' AidAmount=AidAmount',AcademicYear,' ))'));
   if last then call execute(';by StudentID;drop AcademicYear;run;');
run;
I see a number of choices in your problem statement.
My perspective is shown in
Database Vocabulary: Is Your Data Set a Dimension (LookUp) Table, a Fact Table, or a Report?
My commentary has two parts: an example, and a critique of your problem.
* example of database transaction table, snapshots: periodic(monthly) and accumulating
1. transaction table, of credit card purchases
DATA library.transactions_data_structure;
   attrib id       length = 8
          datetime length = 8
          vendor   length = $16
          amount   length = 8;
   stop;
run;
DATA transactions_2017;
   if 0 then set library.transactions_data_structure;
   infile datalines;
   input id datetime vendor amount;
   datalines;
...
;
run;
2. summary to grain = monthly
macro or %include?
PROC summary data = transactions_&year
             (where = (&date_start <= datetime <= &date_end));
   /* output out = library.snapshot_accum_&year._month ... */
run;
The paper Macro_Loops_with_Dates has code to fetch the date_* information into macro variables.
3. accumulating snapshot, which I suspect is kind of what you want.
DATA library.snapshot_accum_data_structure;
   if 0 then set library.transactions_data_structure (keep = id);
   attrib sum_01 length = 8 ... sum_12 length = 8;
   stop;
run;

DATA library.snapshot_accum_&year;
   if 0 then set library.snapshot_accum_data_structure;
   update library.snapshot_accum_&year
          library.snapshot_accum_&year._&month;
   by id;
run;
summary: database theory is always about data structure and what it is used for
* transaction history is only appended to.
* snapshot-periodic (grain in(daily, weekly, monthly, quarter, annual))
is a summation of transaction history, between two date(-times)
run once after grain is closed
* snapshot-accumulating, like the annual budget summary shown above,
is run after the periodic snapshot is complete,
and may be (probably is) in the same program.
** my commentary on your problem.
your description kinda/sorta fits the model of snapshot-accumulating.
The choices I see for you are:
* I suggest two fields year and AcademicYear
I know you have several examples of people who were sophomores for many years.
* naming conventions of variables:
cum_GPA_year_2016
or cum_GPA_year_1
* text fields EdLevel and Major ought to be reduced to integers (sketch below);
the last thing you want to be doing is editing, or worse, parsing, text to get at information.
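example: an input format (an untested sketch; the codes are arbitrary) to store EdLevel as an integer

proc format;
   invalue edlevel_code
      'Freshman'  = 1
      'Sophomore' = 2
      'Junior'    = 3
      'Senior'    = 4;
run;

data have_coded;
   set have;
   EdLevel_n = input(EdLevel, edlevel_code.);
run;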
last advice:
design the data structure of your last report first
and move back up the line, with the example theory shown above.
hth
Ron Fehd recipe maven
Sorry to be so late to the party, but I just noticed @Reeza's post mentioning the macro that a group of us wrote in 2013.
I just updated that macro this afternoon. The latest version can be found at: http://www.sascommunity.org/wiki/A_Better_Way_to_Flip_(Transpose)_a_SAS_Dataset
I strongly recommend downloading the file from that site because trying to copy code from a pdf file is fraught with all kinds of problems.
The code runs 50 times or more faster than PROC TRANSPOSE, requires one to type a lot less, and does a lot of things that PROC TRANSPOSE doesn't do. It was designed, specifically, to make a wide file wider. What it does, additionally, is create transposed variables that maintain the original variables' types, lengths, formats and labels.
Art, CEO, AnalystFinder.com
Try PROC TRANSPOSE twice.
For example:
proc transpose data=aa out=bb;
   by studentid academicyear;
   var cumgpa edlevel major aidamount;
run;

proc transpose data=bb out=cc(drop=_name_) delimiter=_;
   by studentid;
   var col1;
   id _name_ academicyear;
run;