Solved: Re: Inserting rows in a table based on missing column values.

Sangita1 · Posted 12-03-2020 09:15 PM

Hi All,

I have a table with student enrollment data. So in my table, each student id can have multiple rows, one row for each semester (we have three semesters per academic year). So a student can enroll in Fall 2015, and his enrollment ends in Summer 2019, so there will be 12 terms and potentially 12 rows in the table. However, if the student does not enroll in 5 terms in between, the table will have only 7 rows.

I need to create the 5 missing rows based on the data from the row above for that student id (UID), so I have all 12 rows for that student. I have a column orig_row in the table which has all values as 1. When I copy the row, I need to mark this particular column value as 0 so I know that this row was copied from the previous row..

I have a second table with Time_periods that I can use as a lookup for missing semesters.

So my data table looks like following:

If a student enrolled only for one semester as uid2, then it will have only one row.

Dept Residency_Type UID NewUId EndUId Uidchange Time_Period OrigRow

Health	2-Out-of-State	U1	1	0	New Uid	15-Sep-15	1
Health	2-Out-of-State	U1	0	0	Enrolled	15-Jan-16	1
Health	2-Out-of-State	U1	0	0	Enrolled	15-Jan-17	1
Health	2-Out-of-State	U1	0	1	End Uid	15-Sep-17	1
Science	1-In-State	U2	1	1	End Uid	15-Jan-13	1
A&L	1-In-State	U3	1	0	New Uid	15-May-10	1
A&L	1-In-State	U3	0	0	Enrolled	15-Sep-10	1
A&L	1-In-State	U3	0	0	Enrolled	15-May-11	1
A&L	1-In-State	U3	0	0	Enrolled	15-Sep-12	1
A&L	1-In-State	U3	0	0	Enrolled	15-Jan-13	1
A&L	1-In-State	U3	0	1	End Uid	15-May-14	1

And I want it to look like below

Dept Residency_Type UID NewUId EndUId Uidchange Time_Period OrigRow

Health	2-Out-of-State	U1	1	0	New Uid	15-Sep-15	1
Health	2-Out-of-State	U1	0	0	Enrolled	15-Jan-16	1
Health	2-Out-of-State	U1	0	0	Enrolled	15-May-16	0
Health	2-Out-of-State	U1	0	0	Enrolled	15-Sep-16	0
Health	2-Out-of-State	U1	0	0	Enrolled	15-Jan-17	1
Health	2-Out-of-State	U1	0	0	Enrolled	15-May-17	0
Health	2-Out-of-State	U1	0	1	End Uid	15-Sep-17	1
Science	1-In-State	U2	1	1	End Uid	15-Jan-13	1
A&L	1-In-State	U3	1	0	New Uid	15-May-10	1
A&L	1-In-State	U3	0	0	Enrolled	15-Sep-10	1
A&L	1-In-State	U3	0	0	Enrolled	15-Jan-11	0
A&L	1-In-State	U3	0	0	Enrolled	15-May-11	1
A&L	1-In-State	U3	0	0	Enrolled	15-Sep-11	0
A&L	1-In-State	U3	0	0	Enrolled	15-Jan-12	0
A&L	1-In-State	U3	0	0	Enrolled	15-May-12	0
A&L	1-In-State	U3	0	0	Enrolled	15-Sep-12	1
A&L	1-In-State	U3	0	0	Enrolled	15-Jan-13	1
A&L	1-In-State	U3	0	0	Enrolled	15-May-13	0
A&L	1-In-State	U3	0	0	Enrolled	15-Sep-13	0
A&L	1-In-State	U3	0	0	Enrolled	15-Jan-14	0
A&L	1-In-State	U3	0	1	End Uid	15-May-14	1

I am new to programming and I tried to use set, merge, crossjoin, outer join etc but not able to get this result.

Thank you.

mkeintz · Posted 12-04-2020 07:45 AM

Let's say you have these five variables properly constructed: uid semester origrow (1=yes, 0=no) name. It's the other variables you need to carry forward:

data want;
   set have (keep=uid semester name origrow);
   if origrow=1 then set have (where=(origrow=1));
run;

All the variables in the conditional set statement (except uid/semester/name/origrow) will not be reset to missing until the next origrow=1. I.e. those variables (actually all variables in a set/merge statement) are automatically retained until the next time they are read in.

This program assumes data are sorted by uid/semester, and that the first instance of ech uid has a complete set of valid variable values.

The second set only read origrow=1 observations.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

mkeintz · Posted 12-03-2020 11:28 PM

I have some code below, but I will not be testing it on your data, which takes more work than I am willing to do, to get into a sas data step. If you can provide your data in the form of a working SAS data step, I will test.

So this code is untested (it assumes your data are sorted by uid/date):

data want (drop=_:);
  set have;
  by uid;
  origrow=1; /* Mark as original record */
  output;    /* Output original record */

  /* Peek ahead to get date of next record */
  if eod2=0 then set have (firstobs=2 keep=date rename=(date=_nxt_date)) end=eod2;
  else _nxt_date=.;

  /* If not at end of this UID (last.uid=0), then */
  /* fill any upcoming holes greater than 4 months*/
  if last.uid=0 then do while (intck('month',date,_nxt_date)>4);
    origrow=0;
    uidchange='Enrolled';
	date=intnx('month',date,4,'same');
	output;
  end;
run;

You've mentioned that you want a dummy variable, origrow=1 when the output record comes from on input record, and origrow=0 when the output record is filling a hole.

But you also need to pay attention to the UIDCHANGE variable which is "New Uid" for the first record and "Enrolled" for all the others. But if the hole in your sequence is for the second school term for the UID, you have to also modify the current value of "New Uid" to "Enrolled". That's what you see above.

The "trick" here is the "firstobs=2" option in the second (conditional) SET statement which assigns _NXT_DATE to the date of the next record (or a missing value when the end of the data set has been reached).

Edit: changed

if eod2=0 then set (firstobs=2 ...

to

if eod2=0 then set have (firstobs=2 ...

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Sangita1 · Posted 12-04-2020 12:31 AM

Hi,

Thanks for your help. I was trying something also, that if I have an ID associated with the time period series then may be there is a "Do Missing loop" that I can use to fill in the missing time period.

So, yes, here is the sample data which creates the two tables and then join the table : Please let me know if it makes sense. Thank you.

data TestData;
infile datalines truncover;
input uID:$8. Semester:ddmmyy10. Code:$32. residence:32. OrigRow:32.;
format Semester date9.;
datalines;
u1 150111 G 12 1
u1 150112 G 12 1
u1 150913 G 12 1
u2 150111 B 23 1
u3 150912 B 31 1
u3 150514 G 31 1
u4 150511 G 14 1
u4 150112 G 14 1
u4 150113 G 14 1
u4 150913 G 14 1
;;;;;;;;;;

Data Time_Period_series;
%let start_date=15May2010;
%let end_date=15Sep2020;

Time_Period="&start_date"d;
Timeid = 1;
do while (Time_Period<="&end_date"d);
output;
Time_Period=intnx('month', Time_Period,4 , 's');
Timeid = Timeid +1;
end;
format Time_Period date9.;
run;

proc sql;
create table append1 as
select d.uID, t.Timeid, d.Semester, d.code, d.residence, d.OrigRow from TestData d, Time_Period_series t where d.Semester = t.Time_Period;

quit;

This gives me a table like

uID

Timeid

Semester

Code

residence

OrigRow

u1	3	15-Jan-11	G	12	1
u1	6	15-Jan-12	G	12	1
u1	11	15-Sep-13	G	12	1
u2	3	15-Jan-11	B	23	1
u3	8	15-Sep-12	B	31	1
u3	13	15-May-14	G	31	1
u4	4	15-May-11	G	14	1
u4	6	15-Jan-12	G	14	1
u4	9	15-Jan-13	G	14	1
u4	11	15-Sep-13	G	14	1

And I will like a table as:

uID

Timeid

Semester

Code

residence

OrigRow

u1	3	15-Jan-11	G	12	1
u1	4	15-May-11	G	12	0	New Row created
u1	4	15-Sep-11	G	12	0	New Row created
u1	6	15-Jan-12	G	12	1
u1	7	15-May-12	G	12	0	New Row created
u1	8	15-Sep-12	G	12	0	New Row created
u1	9	15-Jan-13	G	12	0	New Row created
u1	10	15-May-13	G	12	0	New Row created
u1	11	15-Sep-13	G	12	1
u2	3	15-Jan-11	B	23	1
u3	8	15-Sep-12	B	31	1
u3	9	15-Jan-13	B	31	0	New Row created
u3	10	15-May-13	B	31	0	New Row created
u3	11	15-Sep-13	B	31	0	New Row created
u3	12	15-Jan-14	B	31	0	New Row created
u3	13	15-May-14	B	31	1
u4	4	15-May-11	G	14	1
u4	5	15-Sep-11	G	14	0	New Row created
u4	6	15-Jan-12	G	14	1
u4	7	15-May-12	G	14	0	New Row created
u4	8	15-Sep-12	G	14	0	New Row created
u4	9	15-Jan-13	G	14	1
u4	10	15-May-13	G	14	0	New Row created
u4	11	15-Sep-13	G	14	1

Sangita1 · Posted 12-04-2020 12:36 AM

Sorry, for the third row for uid U1, timeid is 5 and not 4. and timeid 5 corresponds to 15-SEP-11.

Sangita1 · Posted 12-04-2020 07:21 AM

Hi All, I was able to create the table (not the most elegant way) but somehow, where now I have all the "wanted" rows for each uid. I can also identify which rows are original and which ones are created by code (orig_row column value).

My dataset is large and have more than 60 columns. When I insert a new row for the missing timeperiod, I am only populating 5 columns, so rest of the 55 columns are blank and I have to copy the value from the previous row. I read about copying multiple values when the column names are written as arrays but my column names are all different (Race, Ethnicity, Gender etc..). I have to copy , say 30 columns from the row above. I can arrange the columns so all the columns that need the value are at the end of the table. Is there a way to do it or do I have to do Retain statement for each variable/column. Thanks

mkeintz · Posted 12-04-2020 07:45 AM

Let's say you have these five variables properly constructed: uid semester origrow (1=yes, 0=no) name. It's the other variables you need to carry forward:

data want;
   set have (keep=uid semester name origrow);
   if origrow=1 then set have (where=(origrow=1));
run;

All the variables in the conditional set statement (except uid/semester/name/origrow) will not be reset to missing until the next origrow=1. I.e. those variables (actually all variables in a set/merge statement) are automatically retained until the next time they are read in.

This program assumes data are sorted by uid/semester, and that the first instance of ech uid has a complete set of valid variable values.

The second set only read origrow=1 observations.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Sangita1 · Posted 12-05-2020 02:42 AM

Hi, Thank you so much. This is working great. Appreciate you sharing knowledge and spending your precious time in helping me.
I have a question - for my insert the missing terms in between the given terms for any student uid, I went around the dirty way - meaning I created a time series and a counter. I did a cross join and merge to insert all the rows for all the time periods and then I deleted the rows which were above and beyond the first and last row (with the data). It took 4 to 5 data steps but it works fine.

I am sure there is a better way, but I don't have a clue as to what function/procedure call might be useful.
Again, thanks for your help.

Registration is open

SAS Training: Just a Click Away