Solved: How to Sort Out Certain Duplicates

dpachorek · Posted 09-29-2020 10:27 AM

Hi,

I have a dataset where I am trying to sort out specific duplicates. I am creating an email list from an appended dataset and because certain people are labeled as both staff and students, we have duplicate records. Here is one such case:

This is not the case for everyone, but I am trying to filter out the duplicate records that are labeled as student, since the staff label takes precedence.

Any help? Thanks!

SwissC · Posted 09-29-2020 10:38 AM

Assuming that the email address is held in a variable called email, and the staff/student label is held in a column called staff_student:

PROC SQL;
  CREATE TABLE part1 AS SELECT
  *, count(distinct(staff_student)) as nrc
  FROM have
  GROUP BY email;
QUIT;
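DATA part2;
  LENGTH staff_student $13;  /* room for "Staff/Student" in case the column is shorter */
  SET part1;
  /* nrc=2 means this email has two distinct labels */
  IF nrc=2 THEN staff_student="Staff/Student";
  DROP nrc;
RUN;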
PROC SQL;
  CREATE TABLE want AS SELECT
  distinct *
  FROM part2;
QUIT;

You would then end up with a single record labeled Staff/Student for each of these people, assuming the label is the only thing that differs between their records.

dpachorek · Posted 09-29-2020 11:01 AM

You're right that the emails are under a variable called email. However, staff and student are part of a variable called group that has 4 options (Student, Staff, Faculty, and Lib_Faculty).

SwissC · Posted 09-29-2020 11:11 AM

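DATA part1;
  SET have;
  /* Flag the Staff and Student records */
  IF group in("Staff" "Student") THEN cnt=1;
    ELSE cnt=0;
RUN;

PROC SQL;
  CREATE TABLE part2 AS SELECT
  *, sum(cnt) as nrc
  FROM part1
  GROUP BY email;
QUIT;
DATA part3;
  LENGTH group $13;  /* room for "Staff/Student" in case group is shorter */
  SET part2;
  /* nrc=2 means both a Staff and a Student record, assuming at most one record per group for each email */
  IF nrc=2 AND cnt=1 THEN group="Staff/Student";
  DROP nrc cnt;
RUN;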
PROC SQL;
  CREATE TABLE want AS SELECT
  distinct *
  FROM part3;
QUIT;

Does this fix it?

SwissC · Posted 09-29-2020 11:17 AM

Another option would be to transpose the data.

DATA part1;
  SET have;
  x=1;
RUN;

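/* PROC TRANSPOSE with a BY statement expects the data sorted by the BY variables */
PROC SORT data=part1;
  BY email; /* list the same BY variables as in PROC TRANSPOSE below */
RUN;

PROC TRANSPOSE data=part1 out=want;
  BY email /* add all other variables except group and x */;
  ID group;
  VAR x;
RUN;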

This would then give a dataset with one flag column for each group, which is probably a better way.

dpachorek · Posted 09-29-2020 11:40 AM

Yes! Thank you. Now, I can easily sort out based off these flags.
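
For readers who want that last step spelled out, here is a minimal sketch of how the flags could be collapsed back into a single label per email. It assumes want is the output of the PROC TRANSPOSE step, with one row per email and flag columns named after the four group values (1 when the person has that label, missing otherwise). The dataset name email_list is hypothetical, and only the Staff-over-Student precedence comes from the question; where Faculty and Lib_Faculty sit in the order is an assumption.

```sas
/* Sketch only: assumes WANT has one row per email and the flag columns
   Staff, Student, Faculty and Lib_Faculty created by PROC TRANSPOSE. */
DATA email_list;              /* hypothetical output name */
  SET want;
  LENGTH group $11;           /* long enough for "Lib_Faculty" */
  /* Staff beats Student (from the question). The placement of Faculty
     and Lib_Faculty in this chain is an assumption. */
  IF Staff=1 THEN group="Staff";
  ELSE IF Faculty=1 THEN group="Faculty";
  ELSE IF Lib_Faculty=1 THEN group="Lib_Faculty";
  ELSE IF Student=1 THEN group="Student";
  /* _NAME_ is the column PROC TRANSPOSE adds to name the transposed variable */
  DROP Staff Student Faculty Lib_Faculty _NAME_;
RUN;
```

Because the transposed data already has one row per email, no further de-duplication should be needed after this step.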
