DATA Step, Macro, Functions and more

Finding discrepancies in multiple entries

Reply
Frequent Contributor
Posts: 77

Finding discrepancies in multiple entries

I'm doing some data cleaning on a dataset that includes dates, ID, and gender. For certain subsequent dates, the gender sometimes is miscoded. 1 is male and 2 is female. How do I find in the entire dataset all the IDs that are associated with this gender coding discrepancy? Thank you

 

data MYDATA.SAMPLEGENDER;
  infile datalines dsd truncover;
  input Date:MMDDYY10. ID:BEST. Gender:BEST.;
datalines4;
11/01/2016,1,1
11/01/2016,2,2
11/01/2016,3,1
11/02/2016,1,2
11/04/2016,5,2
11/03/2016,6,2
11/04/2016,3,2
11/04/2016,8,1
11/01/2016,9,2
11/01/2016,10,2
11/01/2016,11,1
11/01/2016,12,2
11/01/2016,13,1
11/01/2016,14,2
11/10/2016,14,1
11/11/2016,14,2
;;;;
Super User
Super User
Posts: 7,977

Re: Finding discrepancies in multiple entries

What exactly do you want the output to look like?  You can pull up discrepancies quite simply with a proc freq if you want to know how many of each type, or if you just want a list of subjects and the coding then proc sort nodupkey by id sex.  Or is the first record the right one, and anything different to that should be flagged?  Need to show what the output should look like.

Frequent Contributor
Posts: 77

Re: Finding discrepancies in multiple entries

Something that looks like this, but also a code that will count how many times by unique ID this has happened in the dataset.

 

Gender.png

Super User
Posts: 7,833

Re: Finding discrepancies in multiple entries

data have;
  infile datalines dsd truncover;
  input Date:MMDDYY10. ID:BEST. Gender:BEST.;
  format date mmddyy10.;
datalines4;
11/01/2016,1,1
11/01/2016,2,2
11/01/2016,3,1
11/02/2016,1,2
11/04/2016,5,2
11/03/2016,6,2
11/04/2016,3,2
11/04/2016,8,1
11/01/2016,9,2
11/01/2016,10,2
11/01/2016,11,1
11/01/2016,12,2
11/01/2016,13,1
11/01/2016,14,2
11/10/2016,14,1
11/11/2016,14,2
;;;;
run;

proc sort data=have;
by id date;
run;

data lookup (keep=id);;
set have;
by id;
retain checkgen flag;
if first.id
then do;
  checkgen = gender;
  flag = 0;
end;
if gender ne checkgen then flag = 1;
if last.id and flag then output;
run;

data want;
merge have lookup (in=check);
by id;
if check;
run;

proc print data=want noobs;
by id;
run;

proc sql;
select count(*) from lookup;
quit;
---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Frequent Contributor
Posts: 77

Re: Finding discrepancies in multiple entries

Posted in reply to KurtBremser

Thank you. I'm trying to apply it to my main dataset. I noticed that the proc print is printing everything associated with the dataset. What is the line to limit it only to dcdeathdate and the gender?

 

proc print data=mydata.ODgenderlook noobs;
by nationalid;
run;

 

 

Super User
Posts: 7,833

Re: Finding discrepancies in multiple entries


byeh2017 wrote:

Thank you. I'm trying to apply it to my main dataset. I noticed that the proc print is printing everything associated with the dataset. What is the line to limit it only to dcdeathdate and the gender?

 

proc print data=mydata.ODgenderlook noobs;
by nationalid;
run;

 

 


Use the var statement. var is used in many procedures to select which variables are used in the procedure.

proc print data=mydata.ODgenderlook noobs;
by nationalid;
var dcdeathdate gender;
run;
---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Frequent Contributor
Posts: 77

Re: Finding discrepancies in multiple entries

Posted in reply to KurtBremser

Is there a way I can do this that looks like this output? It is simply just a listing out of the entries that have gender discrepancies.

 

gendercheck.png

 

Here's the sample dataset again. Thank you:

 

data MYDATA.SAMPLEGENDER;
  infile datalines dsd truncover;
  input Date:MMDDYY10. ID:BEST. Gender:BEST.;
datalines4;11/01/2016,1,1
11/01/2016,2,2
11/01/2016,3,1
11/02/2016,1,2
11/04/2016,5,2
11/03/2016,6,2
11/04/2016,3,2
11/04/2016,8,1
11/01/2016,9,2
11/01/2016,10,2
11/01/2016,11,1
11/01/2016,12,2
11/01/2016,13,1
11/01/2016,14,2
11/10/2016,14,1
11/11/2016,14,2;;;;

 

Super User
Posts: 7,833

Re: Finding discrepancies in multiple entries

There is no entry for ID=2 and date=12/1/2016 in your sample dataset.

This SQL finds multiple gender values per id:

proc sql;
create table result as
select date, id, gender
from samplegender
where id in (
  select id
  from samplegender
  group by id
  having count(distinct gender) > 1
)
order by id
;
quit;
---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super User
Super User
Posts: 7,977

Re: Finding discrepancies in multiple entries

Well for counts, proc freq is simple enough.  For your output below, sorry, its not clear.  That just looks like a proc print of the data you have by ID?  What is the logic, do you take the first record as being correct, and then output any that don't match that?  Do you just want an output of distinct id/sex, then proc sort would work.

Super User
Posts: 10,041

Re: Finding discrepancies in multiple entries


data have;
  infile datalines dsd truncover;
  input Date:MMDDYY10. ID:BEST. Gender:BEST.;
  format date mmddyy10.;
datalines4;
11/01/2016,1,1
11/01/2016,2,2
11/01/2016,3,1
11/02/2016,1,2
11/04/2016,5,2
11/03/2016,6,2
11/04/2016,3,2
11/04/2016,8,1
11/01/2016,9,2
11/01/2016,10,2
11/01/2016,11,1
11/01/2016,12,2
11/01/2016,13,1
11/01/2016,14,2
11/10/2016,14,1
11/11/2016,14,2
;;;;
run;
proc sql;
create table want as
 select id,gender,count(*) as count
  from (
select * from have group by id having count(distinct gender) ne 1 
)
group by id,gender;
quit;

Ask a Question
Discussion stats
  • 9 replies
  • 201 views
  • 0 likes
  • 4 in conversation