topic Re: Finding discrepancies in multiple entries in SAS Programming

Finding discrepancies in multiple entries

byeh2017 — Thu, 16 Mar 2017 08:59:23 GMT

I'm doing some data cleaning on a dataset that includes dates, ID, and gender. For certain subsequent dates, the gender sometimes is miscoded. 1 is male and 2 is female. How do I find in the entire dataset all the IDs that are associated with this gender coding discrepancy? Thank you

data MYDATA.SAMPLEGENDER;
  infile datalines dsd truncover;
  input Date:MMDDYY10. ID:BEST. Gender:BEST.;
datalines4;
11/01/2016,1,1
11/01/2016,2,2
11/01/2016,3,1
11/02/2016,1,2
11/04/2016,5,2
11/03/2016,6,2
11/04/2016,3,2
11/04/2016,8,1
11/01/2016,9,2
11/01/2016,10,2
11/01/2016,11,1
11/01/2016,12,2
11/01/2016,13,1
11/01/2016,14,2
11/10/2016,14,1
11/11/2016,14,2
;;;;

Re: Finding discrepancies in multiple entries

RW9 — Thu, 16 Mar 2017 09:14:17 GMT

What exactly do you want the output to look like? You can pull up discrepancies quite simply with a proc freq if you want to know how many of each type, or if you just want a list of subjects and the coding then proc sort nodupkey by id sex. Or is the first record the right one, and anything different to that should be flagged? Need to show what the output should look like.

Re: Finding discrepancies in multiple entries

byeh2017 — Thu, 16 Mar 2017 09:29:02 GMT

Something that looks like this, but also a code that will count how many times by unique ID this has happened in the dataset.

Re: Finding discrepancies in multiple entries

Kurt_Bremser — Thu, 16 Mar 2017 09:38:35 GMT

data have;
  infile datalines dsd truncover;
  input Date:MMDDYY10. ID:BEST. Gender:BEST.;
  format date mmddyy10.;
datalines4;
11/01/2016,1,1
11/01/2016,2,2
11/01/2016,3,1
11/02/2016,1,2
11/04/2016,5,2
11/03/2016,6,2
11/04/2016,3,2
11/04/2016,8,1
11/01/2016,9,2
11/01/2016,10,2
11/01/2016,11,1
11/01/2016,12,2
11/01/2016,13,1
11/01/2016,14,2
11/10/2016,14,1
11/11/2016,14,2
;;;;
run;

proc sort data=have;
by id date;
run;

data lookup (keep=id);;
set have;
by id;
retain checkgen flag;
if first.id
then do;
  checkgen = gender;
  flag = 0;
end;
if gender ne checkgen then flag = 1;
if last.id and flag then output;
run;

data want;
merge have lookup (in=check);
by id;
if check;
run;

proc print data=want noobs;
by id;
run;

proc sql;
select count(*) from lookup;
quit;

Re: Finding discrepancies in multiple entries

RW9 — Thu, 16 Mar 2017 09:48:35 GMT

Well for counts, proc freq is simple enough. For your output below, sorry, its not clear. That just looks like a proc print of the data you have by ID? What is the logic, do you take the first record as being correct, and then output any that don't match that? Do you just want an output of distinct id/sex, then proc sort would work.

Re: Finding discrepancies in multiple entries

byeh2017 — Thu, 16 Mar 2017 10:15:59 GMT

Thank you. I'm trying to apply it to my main dataset. I noticed that the proc print is printing everything associated with the dataset. What is the line to limit it only to dcdeathdate and the gender?

proc print data=mydata.ODgenderlook noobs;
by nationalid;
run;

Re: Finding discrepancies in multiple entries

Ksharp — Thu, 16 Mar 2017 10:42:52 GMT


data have;
  infile datalines dsd truncover;
  input Date:MMDDYY10. ID:BEST. Gender:BEST.;
  format date mmddyy10.;
datalines4;
11/01/2016,1,1
11/01/2016,2,2
11/01/2016,3,1
11/02/2016,1,2
11/04/2016,5,2
11/03/2016,6,2
11/04/2016,3,2
11/04/2016,8,1
11/01/2016,9,2
11/01/2016,10,2
11/01/2016,11,1
11/01/2016,12,2
11/01/2016,13,1
11/01/2016,14,2
11/10/2016,14,1
11/11/2016,14,2
;;;;
run;
proc sql;
create table want as
 select id,gender,count(*) as count
  from (
select * from have group by id having count(distinct gender) ne 1 
)
group by id,gender;
quit;

Re: Finding discrepancies in multiple entries

Kurt_Bremser — Thu, 16 Mar 2017 14:14:32 GMT

@byeh2017 wrote:

Thank you. I'm trying to apply it to my main dataset. I noticed that the proc print is printing everything associated with the dataset. What is the line to limit it only to dcdeathdate and the gender?
proc print data=mydata.ODgenderlook noobs;
by nationalid;
run;

Use the var statement. var is used in many procedures to select which variables are used in the procedure.

proc print data=mydata.ODgenderlook noobs;
by nationalid;
var dcdeathdate gender;
run;

Re: Finding discrepancies in multiple entries

byeh2017 — Mon, 03 Apr 2017 11:18:58 GMT

Is there a way I can do this that looks like this output? It is simply just a listing out of the entries that have gender discrepancies.

Here's the sample dataset again. Thank you:

data MYDATA.SAMPLEGENDER;
  infile datalines dsd truncover;
  input Date:MMDDYY10. ID:BEST. Gender:BEST.;
datalines4;11/01/2016,1,1
11/01/2016,2,2
11/01/2016,3,1
11/02/2016,1,2
11/04/2016,5,2
11/03/2016,6,2
11/04/2016,3,2
11/04/2016,8,1
11/01/2016,9,2
11/01/2016,10,2
11/01/2016,11,1
11/01/2016,12,2
11/01/2016,13,1
11/01/2016,14,2
11/10/2016,14,1
11/11/2016,14,2;;;;

Re: Finding discrepancies in multiple entries

Kurt_Bremser — Thu, 06 Apr 2017 09:12:48 GMT

There is no entry for ID=2 and date=12/1/2016 in your sample dataset.

This SQL finds multiple gender values per id:

proc sql;
create table result as
select date, id, gender
from samplegender
where id in (
  select id
  from samplegender
  group by id
  having count(distinct gender) > 1
)
order by id
;
quit;