Hello, in question 3 I am trying to left join the data on student ID. In the Data.Grades table students can have more then 1 grade listed separately in several rows. When I join it to the Data.master_After2001 data set I am looking for the sum of the grades for each student. How can I accomplish this? My code right now is erroring on "(SUM(D.Grade)) AS Sum_of_Grades". I have provided all of my code for reference on how I got to question 3.
/*Question 1:*/
PROC SQL;
CREATE TABLE Data.master_WithSchoolInfo AS
SELECT A.studentID, A.Birthdate, A.Gender, B.SchoolID, B.SchoolName
FROM Data.master A LEFT JOIN Data.school B
ON A.SchoolID = B.SchoolID;
RUN;
PROC CONTENTS DATA = Data.master_WithSchoolInfo;
RUN;
/*Question 2:*/
PROC SQL;
CREATE TABLE Data.master_After2001 AS
SELECT *
FROM Data.master_WithSchoolInfo
WHERE Birthdate > '01Jan2001'd;
RUN;
PROC CONTENTS DATA = Data.master_After2001;
RUN;
/*Question 3:*/
PROC SQL;
CREATE TABLE Data.master_After2001WithGrades AS
SELECT C.studentID, C.Birthdate, C.Gender, C.SchoolID, C.SchoolName, D.studentID, (SUM(D.Grade)) AS Sum_of_Grades
FROM Data.master_After2001 C Left Join Data.Grades D
ON C.studentID = D.studentID
GROUP BY D.Grade;
RUN;
PROC CONTENTS DATA = Data.master_After2001WithGrades;
RUN;
Still showing the error "the SUM summary function requires a numeric argument." I am trying to join the data.grades into the data.master_After2001 table, however, a student in the data file could have more then 1 grade. I am just looking for the amount of grades a student has to be entered into a new row in the Data.master_After2001 file.
Data.master_After2001:
Data.grades:
It would be helpful if you showed us the LOG that gives the error (the code as seen in the log AND the errors as seen in the log).
It would be helpful if you showed us a portion of your data.
As a wild guess, if you have SUM(D.GRADE) and GROUP BY (D.GRADE), I'm thinking that might be a problem.
Here Is what the error log is showing:
Also, here is a picture of the Data.grades table.
Should be
PROC SQL;
CREATE TABLE Data.master_After2001WithGrades AS
SELECT
C.studentID,
C.Birthdate,
C.Gender,
C.SchoolID,
C.SchoolName,
SUM(D.Grade) AS Sum_of_Grades
FROM
Data.master_After2001 as C Left Join
Data.Grades as D
ON C.studentID = D.studentID
GROUP BY C.studentID, C.Birthdate, C.Gender, C.SchoolID, C.SchoolName;
QUIT;
(unntested)
Still showing the error "the SUM summary function requires a numeric argument." I am trying to join the data.grades into the data.master_After2001 table, however, a student in the data file could have more then 1 grade. I am just looking for the amount of grades a student has to be entered into a new row in the Data.master_After2001 file.
Data.master_After2001:
Data.grades:
Aha!
You are trying to sum letters A B C D F. Do you see why it doesn't work?
As a simple remedy, you could just insert DISTINCT between SELECT and C.studentID, and you would not have the dupes. The query offered by @PGStats results in the same (in his query, though, SUM should be replaced by COUNT, or else it won't work). However, neither is efficient. It makes more sense to aggregate the Grades file first, which will collapse it to unique StudentID key-values, and then join the result with Master:
proc sql ;
create table data.master_after2001withgrades as
select m.studentid
, m.birthdate
, m.gender
, m.schoolid
, m.schoolname
, g.grade_count
from data.master_after2001 c
left join
(select studentid
, count (grade) as grade_count
from grades group studentid
) g
on m.studentid = g.studentid
;
quit ;
Alternatively, it can be achieved by applying the same concept by first aggregating Grades via a hash table:
data want (drop = grade) ;
if _n_ = 1 then do ;
dcl hash h () ;
h.definekey ("studentid") ;
h.definedata ("grade_count") ;
h.definedone () ;
do until (z) ;
set grades end = z ;
if h.find() ne 0 then grade_count = 1 ;
else grade_count + 1 ;
h.replace() ;
end ;
end ;
set master (keep = studentid birthdate gender schoolid schoolname) ;
if h.find() ne 0 then call missing (grade_count) ;
run ;
Neither SQL nor hash require the input data sets to be sorted by StudentID. However, if they are already sorted, the easiest and fastest way of getting what you want is a simple merge:
data want (drop = grade) ;
do until (last.studentid) ;
merge master (keep = studentid birthdate gender schoolid schoolname) grades (in = g) ;
by studentid ;
if g then grade_count = sum (grade_count, 1) ;
end ;
run ;
Of course, there're other ways of achieving the same result using more than a single step.
Kind regards
Paul D.
The query offered by @PGStats results in the same (in his query, though, SUM should be replaced by COUNT, or else it won't work)
I don't see how COUNT works any better than SUM, as it does not compute frequency of A and frequency of B, etc. if that's what is being requested — clearly SUM makes no sense here. But @cneed really needs to state clearly the output he wants.
Grade is a character column here, hence SUM would generate:
ERROR: The SUM summary function requires a numeric argument.
SUM (grade ne "") could be used but why when COUNT will do. On the other note, it does not follow from the OP's description that the count distinct is requested rather than the straight count. If so, it's easy to change, whether it's SQL, hash, or merge.
Kind regards
Paul D.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.