Re: Finding discrepancies in multiple entries

byeh2017 · Posted 03-16-2017 04:59 AM

I'm doing some data cleaning on a dataset that includes dates, ID, and gender. For certain subsequent dates, the gender sometimes is miscoded. 1 is male and 2 is female. How do I find in the entire dataset all the IDs that are associated with this gender coding discrepancy? Thank you

data MYDATA.SAMPLEGENDER;
  infile datalines dsd truncover;
  input Date:MMDDYY10. ID:BEST. Gender:BEST.;
datalines4;
11/01/2016,1,1
11/01/2016,2,2
11/01/2016,3,1
11/02/2016,1,2
11/04/2016,5,2
11/03/2016,6,2
11/04/2016,3,2
11/04/2016,8,1
11/01/2016,9,2
11/01/2016,10,2
11/01/2016,11,1
11/01/2016,12,2
11/01/2016,13,1
11/01/2016,14,2
11/10/2016,14,1
11/11/2016,14,2
;;;;

RW9 · Posted 03-16-2017 05:14 AM

What exactly do you want the output to look like? You can pull up discrepancies quite simply with proc freq if you want to know how many of each type; if you just want a list of subjects and their coding, use proc sort with nodupkey, by id and gender. Or is the first record the right one, and anything different from it should be flagged? You need to show what the output should look like.
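
As a rough sketch of those two options, assuming the sample data have been read into WORK.HAVE with the variables ID and Gender (the dataset names distinct_pairs and discrepant_ids are made up for illustration):

```sas
/* Assumption: sample data are in WORK.HAVE with variables ID and Gender */

/* Option 1: cross-tabulate ID by Gender; an ID with counts in both
   gender columns is coded inconsistently */
proc freq data=have;
  tables id*gender / norow nocol nopercent;
run;

/* Option 2: keep one record per distinct ID/Gender pair */
proc sort data=have out=distinct_pairs (keep=id gender) nodupkey;
  by id gender;
run;

/* After NODUPKEY, a second row for an ID means a second gender value,
   so output one row for each ID that has more than one row */
data discrepant_ids (keep=id);
  set distinct_pairs;
  by id;
  if last.id and not first.id;
run;
```

With the sample data posted above, discrepant_ids should contain IDs 1, 3 and 14. Neither option says which gender value is the correct one; they only show that an ID carries more than one.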

byeh2017 · Posted 03-16-2017 05:29 AM

Something that looks like this, but also a code that will count how many times by unique ID this has happened in the dataset.

Kurt_Bremser · Posted 03-16-2017 05:38 AM

data have;
  infile datalines dsd truncover;
  input Date:MMDDYY10. ID:BEST. Gender:BEST.;
  format date mmddyy10.;
datalines4;
11/01/2016,1,1
11/01/2016,2,2
11/01/2016,3,1
11/02/2016,1,2
11/04/2016,5,2
11/03/2016,6,2
11/04/2016,3,2
11/04/2016,8,1
11/01/2016,9,2
11/01/2016,10,2
11/01/2016,11,1
11/01/2016,12,2
11/01/2016,13,1
11/01/2016,14,2
11/10/2016,14,1
11/11/2016,14,2
;;;;
run;

proc sort data=have;
by id date;
run;

data lookup (keep=id);
set have;
by id;
retain checkgen flag;
if first.id
then do;
  checkgen = gender;
  flag = 0;
end;
if gender ne checkgen then flag = 1;
if last.id and flag then output;
run;

data want;
merge have lookup (in=check);
by id;
if check;
run;

proc print data=want noobs;
by id;
run;

proc sql;
select count(*) from lookup;
quit;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

byeh2017 · Posted 03-16-2017 06:15 AM

Thank you. I'm trying to apply it to my main dataset. I noticed that the proc print is printing everything associated with the dataset. What is the line to limit it only to dcdeathdate and the gender?

proc print data=mydata.ODgenderlook noobs;
by nationalid;
run;

Kurt_Bremser · Posted 03-16-2017 10:14 AM

Use the var statement. var is used in many procedures to select which variables are used in the procedure.

proc print data=mydata.ODgenderlook noobs;
by nationalid;
var dcdeathdate gender;
run;

byeh2017 · Posted 04-03-2017 07:18 AM

Is there a way I can do this that looks like this output? It is simply just a listing out of the entries that have gender discrepancies.

Here's the sample dataset again. Thank you:

data MYDATA.SAMPLEGENDER;
  infile datalines dsd truncover;
  input Date:MMDDYY10. ID:BEST. Gender:BEST.;
datalines4;
11/01/2016,1,1
11/01/2016,2,2
11/01/2016,3,1
11/02/2016,1,2
11/04/2016,5,2
11/03/2016,6,2
11/04/2016,3,2
11/04/2016,8,1
11/01/2016,9,2
11/01/2016,10,2
11/01/2016,11,1
11/01/2016,12,2
11/01/2016,13,1
11/01/2016,14,2
11/10/2016,14,1
11/11/2016,14,2
;;;;

Kurt_Bremser · Posted 04-06-2017 05:12 AM

There is no entry for ID=2 and date=12/1/2016 in your sample dataset.

This SQL finds multiple gender values per id:

proc sql;
create table result as
select date, id, gender
from mydata.samplegender
where id in (
  /* IDs that carry more than one distinct gender value */
  select id
  from mydata.samplegender
  group by id
  having count(distinct gender) > 1
)
order by id
;
quit;

RW9 · Posted 03-16-2017 05:48 AM

Well, for counts, proc freq is simple enough. As for your output, sorry, it's not clear: it just looks like a proc print of the data you have, by ID. What is the logic? Do you take the first record as being correct and then output any that don't match it? If you just want an output of the distinct id/gender pairs, then proc sort would work.
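
If the rule is "the first record per ID is correct", one possible sketch follows. It assumes the sample data are in WORK.HAVE with Date, ID and Gender; have_sorted, mismatches and ref_gender are hypothetical names used only for illustration:

```sas
/* Assumption: WORK.HAVE holds Date, ID and Gender as in the sample data */
proc sort data=have out=have_sorted;
  by id date;
run;

data mismatches;
  set have_sorted;
  by id;
  retain ref_gender;                     /* gender on the first record of the ID */
  if first.id then ref_gender = gender;
  if gender ne ref_gender;               /* keep only records that differ from it */
run;

/* number of mismatching records per ID */
proc freq data=mismatches;
  tables id / nocum nopercent;
run;
```

Treating the earliest record as the reference is only an assumption; if the first entry is the miscoded one, the flagged records will be the correct ones instead.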

Ksharp · Posted 03-16-2017 06:42 AM


data have;
  infile datalines dsd truncover;
  input Date:MMDDYY10. ID:BEST. Gender:BEST.;
  format date mmddyy10.;
datalines4;
11/01/2016,1,1
11/01/2016,2,2
11/01/2016,3,1
11/02/2016,1,2
11/04/2016,5,2
11/03/2016,6,2
11/04/2016,3,2
11/04/2016,8,1
11/01/2016,9,2
11/01/2016,10,2
11/01/2016,11,1
11/01/2016,12,2
11/01/2016,13,1
11/01/2016,14,2
11/10/2016,14,1
11/11/2016,14,2
;;;;
run;
proc sql;
create table want as
  select id, gender, count(*) as count
  from (
    select * from have group by id having count(distinct gender) ne 1
  )
  group by id, gender;
quit;
