Hi,
I have a dataset from which I am trying to remove specific duplicates. I am creating an email list from an appended dataset, and because certain people are labeled as both staff and students, we have duplicate records. Here is one such case:
This is not the case for everyone, but I want to remove the duplicate records that are labeled as student, since the staff label takes precedence.
Any help? Thanks!
This assumes that the email address is held in a variable called email, and that the staff/student label is held in a column called staff_student.
PROC SQL;
CREATE TABLE part1 AS SELECT
*, count(distinct(staff_student)) as nrc
FROM have
GROUP BY email;
QUIT;
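DATA part2;
/* Lengthen the variable first so "Staff/Student" is not truncated */
LENGTH staff_student $13;
SET part1;
/* nrc=2 means this email appears with both labels */
IF nrc=2 THEN staff_student="Staff/Student";
DROP nrc;
RUN;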
PROC SQL;
CREATE TABLE want AS SELECT
distinct *
FROM part2;
QUIT;
You would then end up with a single record labeled Staff/Student for each of these people, assuming the label is the only thing that differs between their records.
You're right that the emails are under a variable called email. However, staff and student are part of a variable called group that has 4 options (Student, Staff, Faculty, and Lib_Faculty).
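DATA part1;
SET have;
/* Flag only the two groups that can clash */
IF group in("Staff" "Student") THEN cnt=1;
ELSE cnt=0;
RUN;
PROC SQL;
/* Count the Staff/Student rows per email and remerge onto every row */
CREATE TABLE part2 AS SELECT
*, sum(cnt) as nrc
FROM part1
GROUP BY email;
QUIT;
DATA part3;
/* Lengthen group first so "Staff/Student" is not truncated */
LENGTH group $13;
SET part2;
/* Relabel only the Staff and Student rows of emails that have both */
IF nrc=2 AND group in("Staff" "Student") THEN group="Staff/Student";
DROP nrc cnt;
RUN;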
PROC SQL;
CREATE TABLE want AS SELECT
distinct *
FROM part3;
QUIT;
Does this fix it?
Another option would be to transpose the data.
DATA part1;
SET have;
x=1;
RUN;
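/* BY processing needs sorted input; use the same BY list in both steps */
PROC SORT data=part1;
BY email;
RUN;
PROC TRANSPOSE data=part1 out=want;
/* Add all other identifying variables here, except group and x */
BY email;
ID group;
VAR x;
RUN;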
This gives a dataset with one flag column for each group, and is probably the better way.
Yes! Thank you. Now, I can easily sort out based off these flags.
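For anyone following along, the flag-based cleanup mentioned above might look something like the sketch below. It assumes the transposed dataset is called want and that the ID statement produced numeric flag columns named Staff and Student. The output name email_list is made up for illustration, and this is only one way to apply the "staff takes precedence" rule from the original question.

```sas
/* Sketch only: assumes WANT has one row per email and numeric flag
   columns Staff and Student created by PROC TRANSPOSE. */
DATA email_list; /* hypothetical output name */
SET want;
/* Staff takes precedence, so clear the Student flag when both are set */
IF Staff=1 AND Student=1 THEN Student=.;
DROP _NAME_; /* holds the name of the transposed variable (x) */
RUN;
```

After this step, each email still has a single row, and anyone who was both staff and student is flagged as staff only.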