BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Sangita1
Obsidian | Level 7

Hi All,

I have a table with student enrollment data.  So in my table, each student id can have multiple rows, one row for each semester (we have three semesters per academic year).  So a student can enroll in Fall 2015, and his enrollment ends in Summer 2019, so there will be 12 terms and potentially 12 rows in the table. However, if the student does not enroll in 5 terms in between, the table will have only 7 rows.

I need to create the 5 missing rows based on the data from the row above for that student id (UID), so I have all 12 rows for that student. I have a column orig_row in the table which has all values as 1. When I copy the row, I need to mark this particular column value as 0 so I know that this row was copied from the previous row..

I have a second table with Time_periods that I can use as a lookup for missing semesters. 

So my data table looks like following:

If a student enrolled only for one semester as uid2, then it will have only one row.

Dept          Residency_Type  UID           NewUId    EndUId      Uidchange Time_Period  OrigRow

Health 2-Out-of-StateU110New Uid15-Sep-151
Health 2-Out-of-StateU100Enrolled15-Jan-161
Health 2-Out-of-StateU100Enrolled15-Jan-171
Health 2-Out-of-StateU101End Uid15-Sep-171
Science1-In-StateU211End Uid15-Jan-131
A&L1-In-StateU310New Uid15-May-101
A&L1-In-StateU300Enrolled15-Sep-101
A&L1-In-StateU300Enrolled15-May-111
A&L1-In-StateU300Enrolled15-Sep-121
A&L1-In-StateU300Enrolled15-Jan-131
A&L1-In-StateU301End Uid15-May-141

 

And I want it to look like below

Dept        Residency_Type  UID      NewUId    EndUId   Uidchange Time_Period OrigRow

Health 2-Out-of-StateU110New Uid15-Sep-151
Health 2-Out-of-StateU100Enrolled15-Jan-161
Health 2-Out-of-StateU100Enrolled15-May-160
Health 2-Out-of-StateU100Enrolled15-Sep-160
Health 2-Out-of-StateU100Enrolled15-Jan-171
Health 2-Out-of-StateU100Enrolled15-May-170
Health 2-Out-of-StateU101End Uid15-Sep-171
Science1-In-StateU211End Uid15-Jan-131
A&L1-In-StateU310New Uid15-May-101
A&L1-In-StateU300Enrolled15-Sep-101
A&L1-In-StateU300Enrolled15-Jan-110
A&L1-In-StateU300Enrolled15-May-111
A&L1-In-StateU300Enrolled15-Sep-110
A&L1-In-StateU300Enrolled15-Jan-120
A&L1-In-StateU300Enrolled15-May-120
A&L1-In-StateU300Enrolled15-Sep-121
A&L1-In-StateU300Enrolled15-Jan-131
A&L1-In-StateU300Enrolled15-May-130
A&L1-In-StateU300Enrolled15-Sep-130
A&L1-In-StateU300Enrolled15-Jan-140
A&L1-In-StateU301End Uid15-May-141

 

I am new to programming and I tried to use set, merge, crossjoin, outer join etc but not able to get this result. 

Thank you.

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

Let's say you have these five variables properly constructed: uid semester origrow (1=yes, 0=no) name.  It's the other variables you need to carry forward:

 

data want;
   set have (keep=uid semester name origrow);
   if origrow=1 then set have (where=(origrow=1));
run;

All the variables in the conditional set statement (except uid/semester/name/origrow) will not be reset to missing until the next origrow=1.  I.e. those variables (actually all variables in a set/merge statement) are automatically retained until the next time they are read in.

 

This program assumes data are sorted by uid/semester, and that the first instance of ech uid has a complete set of valid variable values.

 

The second set only read origrow=1 observations.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

6 REPLIES 6
mkeintz
PROC Star

I have some code below, but I will not be testing it on your data, which takes more work than I am willing to do, to get into a sas data step.  If you can provide your data in the form of a working SAS data step, I will test.

 

So this code is untested (it assumes your data are sorted by uid/date):

 

data want (drop=_:);
  set have;
  by uid;
  origrow=1; /* Mark as original record */
  output;    /* Output original record */

  /* Peek ahead to get date of next record */
  if eod2=0 then set have (firstobs=2 keep=date rename=(date=_nxt_date)) end=eod2;
  else _nxt_date=.;

  /* If not at end of this UID (last.uid=0), then */
  /* fill any upcoming holes greater than 4 months*/
  if last.uid=0 then do while (intck('month',date,_nxt_date)>4);
    origrow=0;
    uidchange='Enrolled';
	date=intnx('month',date,4,'same');
	output;
  end;
run;

You've mentioned that you want a dummy variable, origrow=1 when the output record comes from on input record, and origrow=0 when the output record is filling a hole.

 

But you also need to pay attention to the UIDCHANGE variable which is "New Uid" for the first record and "Enrolled" for all the others.  But if the hole in your sequence is for the second school term for the UID, you have to also modify the current value of "New Uid" to "Enrolled".  That's what you see above.

 

The "trick" here is the "firstobs=2" option in the second (conditional) SET statement which assigns _NXT_DATE to the date of the next record (or a missing value when the end of the data set has been reached).

 

Edit:  changed

if eod2=0 then set (firstobs=2 ...

to

if eod2=0 then set have (firstobs=2 ...
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Sangita1
Obsidian | Level 7

Hi,

Thanks for your help.  I was trying something also, that if I have an ID associated with the time period series then may be there is a "Do Missing loop"  that I can use to fill in the missing time period. 

So, yes, here is the sample data which creates the two tables and then join the table :  Please let me know if it makes sense.  Thank you. 

 

data TestData;
infile datalines truncover;
input uID:$8. Semester:ddmmyy10. Code:$32. residence:32. OrigRow:32.;
format Semester date9.;
datalines;
u1 150111 G 12 1
u1 150112 G 12 1
u1 150913 G 12 1
u2 150111 B 23 1
u3 150912 B 31 1
u3 150514 G 31 1
u4 150511 G 14 1
u4 150112 G 14 1
u4 150113 G 14 1
u4 150913 G 14 1
;;;;;;;;;;


Data Time_Period_series;
%let start_date=15May2010;
%let end_date=15Sep2020;

Time_Period="&start_date"d;
Timeid = 1;
do while (Time_Period<="&end_date"d);
output;
Time_Period=intnx('month', Time_Period,4 , 's');
Timeid = Timeid +1;
end;
format Time_Period date9.;
run;

 

proc sql;
create table append1 as
select d.uID, t.Timeid, d.Semester, d.code, d.residence, d.OrigRow from TestData d, Time_Period_series t where d.Semester = t.Time_Period;

quit;

 

This gives me a table like

uIDTimeidSemesterCoderesidenceOrigRow

 

u1315-Jan-11G121
u1615-Jan-12G121
u11115-Sep-13G121
u2315-Jan-11B231
u3815-Sep-12B311
u31315-May-14G311
u4415-May-11G141
u4615-Jan-12G141
u4915-Jan-13G141
u41115-Sep-13G141

 

And I will like a table as:

uIDTimeidSemesterCoderesidenceOrigRow
u1315-Jan-11G121   
u1415-May-11G120 New Row created
u1415-Sep-11G120 New Row created
u1615-Jan-12G121   
u1715-May-12G120 New Row created
u1815-Sep-12G120 New Row created
u1915-Jan-13G120 New Row created
u11015-May-13G120 New Row created
u11115-Sep-13G121   
u2315-Jan-11B231   
u3815-Sep-12B311   
u3915-Jan-13B310 New Row created
u31015-May-13B310 New Row created
u31115-Sep-13B310 New Row created
u31215-Jan-14B310 New Row created
u31315-May-14B311   
u4415-May-11G141   
u4515-Sep-11G140 New Row created
u4615-Jan-12G141   
u4715-May-12G140 New Row created
u4815-Sep-12G140 New Row created
u4915-Jan-13G141   
u41015-May-13G140 New Row created
u41115-Sep-13G141   
Sangita1
Obsidian | Level 7

Sorry, for the third row for uid U1, timeid is 5 and not 4. and timeid 5 corresponds to 15-SEP-11.  

Sangita1
Obsidian | Level 7

Hi All,  I was able to create the table (not the most elegant way) but somehow, where now I have all the "wanted" rows for each uid.  I can also identify which rows are original and which ones are created by code (orig_row column value).

My dataset is large and have more than 60 columns.  When I insert a new row for the missing timeperiod, I am only populating 5 columns, so rest of the 55 columns are blank and I have to copy the value from the previous row.  I read about copying multiple values when the column names are written as arrays but my column names are all different (Race, Ethnicity, Gender etc..).  I have to copy , say 30 columns from the row above. I can arrange the columns so all the columns that need the value are at the end of the table.  Is there a way to do it or do I have to do Retain statement for each variable/column.  Thanks

 

 

mkeintz
PROC Star

Let's say you have these five variables properly constructed: uid semester origrow (1=yes, 0=no) name.  It's the other variables you need to carry forward:

 

data want;
   set have (keep=uid semester name origrow);
   if origrow=1 then set have (where=(origrow=1));
run;

All the variables in the conditional set statement (except uid/semester/name/origrow) will not be reset to missing until the next origrow=1.  I.e. those variables (actually all variables in a set/merge statement) are automatically retained until the next time they are read in.

 

This program assumes data are sorted by uid/semester, and that the first instance of ech uid has a complete set of valid variable values.

 

The second set only read origrow=1 observations.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Sangita1
Obsidian | Level 7

Hi, Thank you so much.  This is working great.  Appreciate you sharing knowledge and spending your precious time in helping me.  
I have a question - for my insert the missing terms in between the given terms for any student uid, I went around the dirty way - meaning I created a time series and a counter. I did a cross join and merge to insert all the rows for all the time periods and then I deleted the rows which were above and beyond the first and last row (with the data).  It took 4 to 5 data steps but it works fine.

 

I am sure there is a better way, but I don't have a clue as to what function/procedure call might be useful. 
Again, thanks for your help. 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 2511 views
  • 1 like
  • 2 in conversation