Solved: Creating New Variables in a Large Dataset

mvhoya · Posted 11-20-2020 11:32 PM

Hi SAS Communities,

Prior to running any analyses on this project, I am working on cleaning up a very large dataset and would appreciate your guidance in regard to programming. Background information on my dataset: variables include an identifier (ID) for students enrolled in the university, demographic information, semester and year of student enrollment, and 1,300 variables of course names. The courses are listed in alphabetical order beginning with AFS100 and the last course in the dataset is UNIV 101. If a student is registered in a course, that is indicated with the letter "R". (i.e. AFS100 would have an "R" for that student ID if a student took that course).

I need to create twelve new variables. The first variable named "Registered" will provide a count of how many courses the student was enrolled in that semester, in other words count the number of Rs for each student. The remaining eleven variables called "Course1" - "Course11" will pull in all of the courses in which there was an "R" listed for the student. If a student was registered in AFS100 then it would go into Course1 and if the student took four more courses, then each course would need to go into columns Course2 - Course5. I am requesting suggestions/guidance in creating these twelve new variables.

I consider myself a novice in SAS programming as I am currently taking a SAS course and still learning. All suggestions will be greatly appreciated. Thank you in advance!

Kurt_Bremser · Posted 11-21-2020 01:04 AM

See Maxim 19 and Maxim 33 (first link of my footnotes).

The first thing you should do is this:

proc transpose
  data=have
  out=courses (
    rename=(_name_=course col1=registered)
    where=(registered = "R")
  )
;
by student_id;
var AFS1000--UNIV101;
run;

You will see that the new dataset is considerably smaller; in a subsequent step you can even remove the variable registered, as it is redundant.

The dataset after transpose would look like this:

data courses;
input student_id $ course $ registered :$1.;
datalines;
A AFS1000 R
A AFS2000 R
A UNIV100 R
;

A count of courses per student comes down to

proc sql;
create table want as
  select student_id, count(*) as registered
  from courses
  group by student_id
;
quit;

Now, if I add a nice "Course counter":

data report;
set courses;
by student_id;
if first.student_id
then count = 1;
else count + 1;
coursenum = catx(" ","Course",count);
drop count;
run;

PROC REPORT lets me create your intended result from the long dataset:

proc report data=report;
column student_id n coursenum,course dummy;
define student_id / "Student" group;
define n / "Registered";
define coursenum / across;
define course / "" display;
define dummy / computed noprint;
compute dummy;
  dummy = 1;
endcomp;
run;

The NOPRINT DUMMY is necessary to get around a peculiarity in REPORT which wants something to calculate when ACROSS is used.

You can also always use a transpose to get a wide dataset containing only the registered courses:

proc transpose
  data=courses
  out=wide (drop=_name_)
;
by student_id;
var course;
run;

and merge that with the "want" dataset to incorporate the counts.

Bottom line: always use a long layout for storage and work; transpose to a wide layout only for reporting, or where such is explicitly needed (e.g. regression analysis).

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

RichardDeVen · Posted 11-20-2020 11:55 PM

Such a wide data set structure is not a great situation to be in. There will be mostly missing values because, as you say, a student can only take 11 of the possible 1,300 courses.

Given the structure of variable for first course is AFS100 and the last course is UNIV101. You can use the double dash (--) naming list construct to define a variable based array that you can iterate, and populate the course1-course11 variables (also arrayed). The VNAME function is used to retrieve the variable name of an indexed array reference. Very useful because the course name is the same as the variable name containing those 'R' flags.

Example:

data want;
  set have;
  array offered_courses AFS1000--UNIV101;
  array course(11) $32;  * variables course1-course11 for up to 11 registered courses;

  registered_index = 0;
  do index = 1 to dim(offered_courses);
    if offered_courses(index) ne 'R' then continue; * check next index;
    registered_index+1;
    if registered_index > dim(course) then leave; * prevent errors if student registered for more than 11 courses;
    * store variable name (which is the course name) of registered course;
    course(registered_index) = vname(offered_courses(index));
  end;
run;

mvhoya · Posted 11-25-2020 12:41 PM

Thank you for your help!

Kurt_Bremser · Posted 11-21-2020 01:04 AM

See Maxim 19 and Maxim 33 (first link of my footnotes).

The first thing you should do is this:

proc transpose
  data=have
  out=courses (
    rename=(_name_=course col1=registered)
    where=(registered = "R")
  )
;
by student_id;
var AFS1000--UNIV101;
run;

You will see that the new dataset is considerably smaller; in a subsequent step you can even remove the variable registered, as it is redundant.

The dataset after transpose would look like this:

data courses;
input student_id $ course $ registered :$1.;
datalines;
A AFS1000 R
A AFS2000 R
A UNIV100 R
;

A count of courses per student comes down to

proc sql;
create table want as
  select student_id, count(*) as registered
  from courses
  group by student_id
;
quit;

Now, if I add a nice "Course counter":

data report;
set courses;
by student_id;
if first.student_id
then count = 1;
else count + 1;
coursenum = catx(" ","Course",count);
drop count;
run;

PROC REPORT lets me create your intended result from the long dataset:

proc report data=report;
column student_id n coursenum,course dummy;
define student_id / "Student" group;
define n / "Registered";
define coursenum / across;
define course / "" display;
define dummy / computed noprint;
compute dummy;
  dummy = 1;
endcomp;
run;

The NOPRINT DUMMY is necessary to get around a peculiarity in REPORT which wants something to calculate when ACROSS is used.

You can also always use a transpose to get a wide dataset containing only the registered courses:

proc transpose
  data=courses
  out=wide (drop=_name_)
;
by student_id;
var course;
run;

and merge that with the "want" dataset to incorporate the counts.

Bottom line: always use a long layout for storage and work; transpose to a wide layout only for reporting, or where such is explicitly needed (e.g. regression analysis).

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

mvhoya · Posted 11-25-2020 12:41 PM

Thank you very much for your detailed and helpful response!

Creating New Variables in a Large Dataset

Re: Creating New Variables in a Large Dataset

Re: Creating New Variables in a Large Dataset

Re: Creating New Variables in a Large Dataset

Re: Creating New Variables in a Large Dataset

Re: Creating New Variables in a Large Dataset

Creating New Variables in a Large Dataset

Re: Creating New Variables in a Large Dataset

Re: Creating New Variables in a Large Dataset

Re: Creating New Variables in a Large Dataset

Re: Creating New Variables in a Large Dataset

Re: Creating New Variables in a Large Dataset

SAS Innovate 2025: Call for Content

Click image to register for webinar

Classroom Training Available!