Hi All,
I have a table with student enrollment data. So in my table, each student id can have multiple rows, one row for each semester (we have three semesters per academic year). So a student can enroll in Fall 2015, and his enrollment ends in Summer 2019, so there will be 12 terms and potentially 12 rows in the table. However, if the student does not enroll in 5 terms in between, the table will have only 7 rows.
I need to create the 5 missing rows based on the data from the row above for that student id (UID), so I have all 12 rows for that student. I have a column orig_row in the table which has all values as 1. When I copy the row, I need to mark this particular column value as 0 so I know that this row was copied from the previous row..
I have a second table with Time_periods that I can use as a lookup for missing semesters.
So my data table looks like following:
If a student enrolled only for one semester as uid2, then it will have only one row.
Dept Residency_Type UID NewUId EndUId Uidchange Time_Period OrigRow
Health | 2-Out-of-State | U1 | 1 | 0 | New Uid | 15-Sep-15 | 1 |
Health | 2-Out-of-State | U1 | 0 | 0 | Enrolled | 15-Jan-16 | 1 |
Health | 2-Out-of-State | U1 | 0 | 0 | Enrolled | 15-Jan-17 | 1 |
Health | 2-Out-of-State | U1 | 0 | 1 | End Uid | 15-Sep-17 | 1 |
Science | 1-In-State | U2 | 1 | 1 | End Uid | 15-Jan-13 | 1 |
A&L | 1-In-State | U3 | 1 | 0 | New Uid | 15-May-10 | 1 |
A&L | 1-In-State | U3 | 0 | 0 | Enrolled | 15-Sep-10 | 1 |
A&L | 1-In-State | U3 | 0 | 0 | Enrolled | 15-May-11 | 1 |
A&L | 1-In-State | U3 | 0 | 0 | Enrolled | 15-Sep-12 | 1 |
A&L | 1-In-State | U3 | 0 | 0 | Enrolled | 15-Jan-13 | 1 |
A&L | 1-In-State | U3 | 0 | 1 | End Uid | 15-May-14 | 1 |
And I want it to look like below
Dept Residency_Type UID NewUId EndUId Uidchange Time_Period OrigRow
Health | 2-Out-of-State | U1 | 1 | 0 | New Uid | 15-Sep-15 | 1 |
Health | 2-Out-of-State | U1 | 0 | 0 | Enrolled | 15-Jan-16 | 1 |
Health | 2-Out-of-State | U1 | 0 | 0 | Enrolled | 15-May-16 | 0 |
Health | 2-Out-of-State | U1 | 0 | 0 | Enrolled | 15-Sep-16 | 0 |
Health | 2-Out-of-State | U1 | 0 | 0 | Enrolled | 15-Jan-17 | 1 |
Health | 2-Out-of-State | U1 | 0 | 0 | Enrolled | 15-May-17 | 0 |
Health | 2-Out-of-State | U1 | 0 | 1 | End Uid | 15-Sep-17 | 1 |
Science | 1-In-State | U2 | 1 | 1 | End Uid | 15-Jan-13 | 1 |
A&L | 1-In-State | U3 | 1 | 0 | New Uid | 15-May-10 | 1 |
A&L | 1-In-State | U3 | 0 | 0 | Enrolled | 15-Sep-10 | 1 |
A&L | 1-In-State | U3 | 0 | 0 | Enrolled | 15-Jan-11 | 0 |
A&L | 1-In-State | U3 | 0 | 0 | Enrolled | 15-May-11 | 1 |
A&L | 1-In-State | U3 | 0 | 0 | Enrolled | 15-Sep-11 | 0 |
A&L | 1-In-State | U3 | 0 | 0 | Enrolled | 15-Jan-12 | 0 |
A&L | 1-In-State | U3 | 0 | 0 | Enrolled | 15-May-12 | 0 |
A&L | 1-In-State | U3 | 0 | 0 | Enrolled | 15-Sep-12 | 1 |
A&L | 1-In-State | U3 | 0 | 0 | Enrolled | 15-Jan-13 | 1 |
A&L | 1-In-State | U3 | 0 | 0 | Enrolled | 15-May-13 | 0 |
A&L | 1-In-State | U3 | 0 | 0 | Enrolled | 15-Sep-13 | 0 |
A&L | 1-In-State | U3 | 0 | 0 | Enrolled | 15-Jan-14 | 0 |
A&L | 1-In-State | U3 | 0 | 1 | End Uid | 15-May-14 | 1 |
I am new to programming and I tried to use set, merge, crossjoin, outer join etc but not able to get this result.
Thank you.
Let's say you have these five variables properly constructed: uid semester origrow (1=yes, 0=no) name. It's the other variables you need to carry forward:
data want;
set have (keep=uid semester name origrow);
if origrow=1 then set have (where=(origrow=1));
run;
All the variables in the conditional set statement (except uid/semester/name/origrow) will not be reset to missing until the next origrow=1. I.e. those variables (actually all variables in a set/merge statement) are automatically retained until the next time they are read in.
This program assumes data are sorted by uid/semester, and that the first instance of ech uid has a complete set of valid variable values.
The second set only read origrow=1 observations.
I have some code below, but I will not be testing it on your data, which takes more work than I am willing to do, to get into a sas data step. If you can provide your data in the form of a working SAS data step, I will test.
So this code is untested (it assumes your data are sorted by uid/date):
data want (drop=_:);
set have;
by uid;
origrow=1; /* Mark as original record */
output; /* Output original record */
/* Peek ahead to get date of next record */
if eod2=0 then set have (firstobs=2 keep=date rename=(date=_nxt_date)) end=eod2;
else _nxt_date=.;
/* If not at end of this UID (last.uid=0), then */
/* fill any upcoming holes greater than 4 months*/
if last.uid=0 then do while (intck('month',date,_nxt_date)>4);
origrow=0;
uidchange='Enrolled';
date=intnx('month',date,4,'same');
output;
end;
run;
You've mentioned that you want a dummy variable, origrow=1 when the output record comes from on input record, and origrow=0 when the output record is filling a hole.
But you also need to pay attention to the UIDCHANGE variable which is "New Uid" for the first record and "Enrolled" for all the others. But if the hole in your sequence is for the second school term for the UID, you have to also modify the current value of "New Uid" to "Enrolled". That's what you see above.
The "trick" here is the "firstobs=2" option in the second (conditional) SET statement which assigns _NXT_DATE to the date of the next record (or a missing value when the end of the data set has been reached).
Edit: changed
if eod2=0 then set (firstobs=2 ...
to
if eod2=0 then set have (firstobs=2 ...
Hi,
Thanks for your help. I was trying something also, that if I have an ID associated with the time period series then may be there is a "Do Missing loop" that I can use to fill in the missing time period.
So, yes, here is the sample data which creates the two tables and then join the table : Please let me know if it makes sense. Thank you.
data TestData;
infile datalines truncover;
input uID:$8. Semester:ddmmyy10. Code:$32. residence:32. OrigRow:32.;
format Semester date9.;
datalines;
u1 150111 G 12 1
u1 150112 G 12 1
u1 150913 G 12 1
u2 150111 B 23 1
u3 150912 B 31 1
u3 150514 G 31 1
u4 150511 G 14 1
u4 150112 G 14 1
u4 150113 G 14 1
u4 150913 G 14 1
;;;;;;;;;;
Data Time_Period_series;
%let start_date=15May2010;
%let end_date=15Sep2020;
Time_Period="&start_date"d;
Timeid = 1;
do while (Time_Period<="&end_date"d);
output;
Time_Period=intnx('month', Time_Period,4 , 's');
Timeid = Timeid +1;
end;
format Time_Period date9.;
run;
proc sql;
create table append1 as
select d.uID, t.Timeid, d.Semester, d.code, d.residence, d.OrigRow from TestData d, Time_Period_series t where d.Semester = t.Time_Period;
quit;
This gives me a table like
uID | Timeid | Semester | Code | residence | OrigRow |
u1 | 3 | 15-Jan-11 | G | 12 | 1 |
u1 | 6 | 15-Jan-12 | G | 12 | 1 |
u1 | 11 | 15-Sep-13 | G | 12 | 1 |
u2 | 3 | 15-Jan-11 | B | 23 | 1 |
u3 | 8 | 15-Sep-12 | B | 31 | 1 |
u3 | 13 | 15-May-14 | G | 31 | 1 |
u4 | 4 | 15-May-11 | G | 14 | 1 |
u4 | 6 | 15-Jan-12 | G | 14 | 1 |
u4 | 9 | 15-Jan-13 | G | 14 | 1 |
u4 | 11 | 15-Sep-13 | G | 14 | 1 |
And I will like a table as:
uID | Timeid | Semester | Code | residence | OrigRow |
u1 | 3 | 15-Jan-11 | G | 12 | 1 | |||
u1 | 4 | 15-May-11 | G | 12 | 0 | New Row created | ||
u1 | 4 | 15-Sep-11 | G | 12 | 0 | New Row created | ||
u1 | 6 | 15-Jan-12 | G | 12 | 1 | |||
u1 | 7 | 15-May-12 | G | 12 | 0 | New Row created | ||
u1 | 8 | 15-Sep-12 | G | 12 | 0 | New Row created | ||
u1 | 9 | 15-Jan-13 | G | 12 | 0 | New Row created | ||
u1 | 10 | 15-May-13 | G | 12 | 0 | New Row created | ||
u1 | 11 | 15-Sep-13 | G | 12 | 1 | |||
u2 | 3 | 15-Jan-11 | B | 23 | 1 | |||
u3 | 8 | 15-Sep-12 | B | 31 | 1 | |||
u3 | 9 | 15-Jan-13 | B | 31 | 0 | New Row created | ||
u3 | 10 | 15-May-13 | B | 31 | 0 | New Row created | ||
u3 | 11 | 15-Sep-13 | B | 31 | 0 | New Row created | ||
u3 | 12 | 15-Jan-14 | B | 31 | 0 | New Row created | ||
u3 | 13 | 15-May-14 | B | 31 | 1 | |||
u4 | 4 | 15-May-11 | G | 14 | 1 | |||
u4 | 5 | 15-Sep-11 | G | 14 | 0 | New Row created | ||
u4 | 6 | 15-Jan-12 | G | 14 | 1 | |||
u4 | 7 | 15-May-12 | G | 14 | 0 | New Row created | ||
u4 | 8 | 15-Sep-12 | G | 14 | 0 | New Row created | ||
u4 | 9 | 15-Jan-13 | G | 14 | 1 | |||
u4 | 10 | 15-May-13 | G | 14 | 0 | New Row created | ||
u4 | 11 | 15-Sep-13 | G | 14 | 1 |
Sorry, for the third row for uid U1, timeid is 5 and not 4. and timeid 5 corresponds to 15-SEP-11.
Hi All, I was able to create the table (not the most elegant way) but somehow, where now I have all the "wanted" rows for each uid. I can also identify which rows are original and which ones are created by code (orig_row column value).
My dataset is large and have more than 60 columns. When I insert a new row for the missing timeperiod, I am only populating 5 columns, so rest of the 55 columns are blank and I have to copy the value from the previous row. I read about copying multiple values when the column names are written as arrays but my column names are all different (Race, Ethnicity, Gender etc..). I have to copy , say 30 columns from the row above. I can arrange the columns so all the columns that need the value are at the end of the table. Is there a way to do it or do I have to do Retain statement for each variable/column. Thanks
Let's say you have these five variables properly constructed: uid semester origrow (1=yes, 0=no) name. It's the other variables you need to carry forward:
data want;
set have (keep=uid semester name origrow);
if origrow=1 then set have (where=(origrow=1));
run;
All the variables in the conditional set statement (except uid/semester/name/origrow) will not be reset to missing until the next origrow=1. I.e. those variables (actually all variables in a set/merge statement) are automatically retained until the next time they are read in.
This program assumes data are sorted by uid/semester, and that the first instance of ech uid has a complete set of valid variable values.
The second set only read origrow=1 observations.
Hi, Thank you so much. This is working great. Appreciate you sharing knowledge and spending your precious time in helping me.
I have a question - for my insert the missing terms in between the given terms for any student uid, I went around the dirty way - meaning I created a time series and a counter. I did a cross join and merge to insert all the rows for all the time periods and then I deleted the rows which were above and beyond the first and last row (with the data). It took 4 to 5 data steps but it works fine.
I am sure there is a better way, but I don't have a clue as to what function/procedure call might be useful.
Again, thanks for your help.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.