BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
mvhoya
Obsidian | Level 7

Hi SAS Communities,

 

Prior to running any analyses on this project, I am working on cleaning up a very large dataset and would appreciate your guidance in regard to programming. Background information on my dataset: variables include an identifier (ID) for students enrolled in the university, demographic information, semester and year of student enrollment, and 1,300 variables of course names. The courses are listed in alphabetical order beginning with AFS100 and the last course in the dataset is UNIV 101. If a student is registered in a course, that is indicated with the letter "R". (i.e.  AFS100 would have an "R" for that student ID if a student took that course).

 

I need to create twelve new variables. The first variable named "Registered" will provide a count of how many courses the student was enrolled in that semester, in other words count the number of Rs for each student.  The remaining eleven variables called "Course1" - "Course11" will pull in all of the courses in which there was an "R" listed for the student. If a student was registered in AFS100 then it would go into Course1 and if the student took four more courses, then each course would need to go into columns Course2 - Course5. I am requesting suggestions/guidance in creating these twelve new variables.

 

I consider myself a novice in SAS programming as I am currently taking a SAS course and still learning. All suggestions will be greatly appreciated. Thank you in advance!

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

See Maxim 19 and Maxim 33 (first link of my footnotes).

The first thing you should do is this:

proc transpose
  data=have
  out=courses (
    rename=(_name_=course col1=registered)
    where=(registered = "R")
  )
;
by student_id;
var AFS1000--UNIV101;
run;

You will see that the new dataset is considerably smaller; in a subsequent step you can even remove the variable registered, as it is redundant.

The dataset after transpose would look like this:

data courses;
input student_id $ course $ registered :$1.;
datalines;
A AFS1000 R
A AFS2000 R
A UNIV100 R
;

A count of courses per student comes down to

proc sql;
create table want as
  select student_id, count(*) as registered
  from courses
  group by student_id
;
quit;

Now, if I add a nice "Course counter":

data report;
set courses;
by student_id;
if first.student_id
then count = 1;
else count + 1;
coursenum = catx(" ","Course",count);
drop count;
run;

PROC REPORT lets me create your intended result from the long dataset:

proc report data=report;
column student_id n coursenum,course dummy;
define student_id / "Student" group;
define n / "Registered";
define coursenum / across;
define course / "" display;
define dummy / computed noprint;
compute dummy;
  dummy = 1;
endcomp;
run;

The NOPRINT DUMMY is necessary to get around a peculiarity in REPORT which wants something to calculate when ACROSS is used.

You can also always use a transpose to get a wide dataset containing only the registered courses:

proc transpose
  data=courses
  out=wide (drop=_name_)
;
by student_id;
var course;
run;

and merge that with the "want" dataset to incorporate the counts.

 

Bottom line: always use a long layout for storage and work; transpose to a wide layout only for reporting, or where such is explicitly needed (e.g. regression analysis).

 

View solution in original post

4 REPLIES 4
RichardDeVen
Barite | Level 11

Such a wide data set structure is not a great situation to be in.  There will be mostly missing values because, as you say, a student can only take 11 of the possible 1,300 courses.

 

Given the structure of variable for first course is AFS100 and the last course is UNIV101.  You can use the double dash (--) naming list construct to define a variable based array that you can iterate, and populate the course1-course11 variables (also arrayed).  The VNAME function is used to retrieve the variable name of an indexed array reference.  Very useful because the course name is the same as the variable name containing those 'R' flags.

 

Example:

data want;
  set have;
  array offered_courses AFS1000--UNIV101;
  array course(11) $32;  * variables course1-course11 for up to 11 registered courses;

  registered_index = 0;
  do index = 1 to dim(offered_courses);
    if offered_courses(index) ne 'R' then continue; * check next index;
    registered_index+1;
    if registered_index > dim(course) then leave; * prevent errors if student registered for more than 11 courses;
    * store variable name (which is the course name) of registered course;
    course(registered_index) = vname(offered_courses(index));
  end;
run;
mvhoya
Obsidian | Level 7
Thank you for your help!
Kurt_Bremser
Super User

See Maxim 19 and Maxim 33 (first link of my footnotes).

The first thing you should do is this:

proc transpose
  data=have
  out=courses (
    rename=(_name_=course col1=registered)
    where=(registered = "R")
  )
;
by student_id;
var AFS1000--UNIV101;
run;

You will see that the new dataset is considerably smaller; in a subsequent step you can even remove the variable registered, as it is redundant.

The dataset after transpose would look like this:

data courses;
input student_id $ course $ registered :$1.;
datalines;
A AFS1000 R
A AFS2000 R
A UNIV100 R
;

A count of courses per student comes down to

proc sql;
create table want as
  select student_id, count(*) as registered
  from courses
  group by student_id
;
quit;

Now, if I add a nice "Course counter":

data report;
set courses;
by student_id;
if first.student_id
then count = 1;
else count + 1;
coursenum = catx(" ","Course",count);
drop count;
run;

PROC REPORT lets me create your intended result from the long dataset:

proc report data=report;
column student_id n coursenum,course dummy;
define student_id / "Student" group;
define n / "Registered";
define coursenum / across;
define course / "" display;
define dummy / computed noprint;
compute dummy;
  dummy = 1;
endcomp;
run;

The NOPRINT DUMMY is necessary to get around a peculiarity in REPORT which wants something to calculate when ACROSS is used.

You can also always use a transpose to get a wide dataset containing only the registered courses:

proc transpose
  data=courses
  out=wide (drop=_name_)
;
by student_id;
var course;
run;

and merge that with the "want" dataset to incorporate the counts.

 

Bottom line: always use a long layout for storage and work; transpose to a wide layout only for reporting, or where such is explicitly needed (e.g. regression analysis).

 

mvhoya
Obsidian | Level 7
Thank you very much for your detailed and helpful response!

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 947 views
  • 2 likes
  • 3 in conversation