I came across this question from someone which I'm not sure how to address it in data step. Question is I have a dataset which has variables like - Student_Name, Subject, Marks. Now I want to identify the
student name who scored high mark in each subject.
Excepted Result:
James Maths 95
David English 98
Kiran Social 97
data class;
set sashelp.class;
run;
proc sort data=class;
by sex descending age;
run;
*Does not deal with ties correctly;
data oldest_by_sex;
set class;
by sex descending age;
if first.sex ;
run;
*accounts for ties;
proc sql;
create table want as
select *
from class
group by sex
having age=max(age);
quit;
Two basic solutions above, if you rephrase question to oldest of each gender so we can use sashelp.class (too lazy to make fake data). Note the different results if you use data step because ties are not accounted for...
The word "Aggregate" is used in the title, but nothing in your question is finding an aggregate (at least the way I understand the word). So, I'm confused.
Also, why does this have to be done in a DATA step? There are plenty of other ways to do this that are not DATA step that might be more efficient and easier to program. It is certainly my philosophy that I choose to provide answers using the tool that does the job the best, and I do not feel bound by a user who says this has to be done in a DATA step.
But there is nothing in your question about the data that could be called "aggregate".
As I implied earlier, doing this in a DATA step is probably not the best way to achieve this, there are better/easier ways, so why test the user's knowledge of doing this in a DATA step?
Now you mention SQL. What have you tried so far?
@David_Billa wrote:
This is the interview question to my team member to test the knowledge in
data step.
I said aggregate as we want to use max and group by in SQL to achieve this.
Which is specifically stating use of something other than a data step, so still not getting what "aggregate in data step" is related.
Three example values, each with a separate value for Subject and saying the requirement is to identify the Student with the highest score for each subject is a pretty lame example data set.
And from the extremely limited information it looks like a SORT should be involved, which is not really a data step concept.
Given a realistic example of data with multiple students, multiple subjects and multiple scores here is a "data step" solution:
Proc sort data=have; by subject descending score; run; data want; set have; by subject; if first.subject; run;
data class;
set sashelp.class;
run;
proc sort data=class;
by sex descending age;
run;
*Does not deal with ties correctly;
data oldest_by_sex;
set class;
by sex descending age;
if first.sex ;
run;
*accounts for ties;
proc sql;
create table want as
select *
from class
group by sex
having age=max(age);
quit;
Two basic solutions above, if you rephrase question to oldest of each gender so we can use sashelp.class (too lazy to make fake data). Note the different results if you use data step because ties are not accounted for...
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.