BookmarkSubscribeRSS Feed
byeh2017
Quartz | Level 8

I'm doing some data cleaning on a dataset that includes dates, ID, and gender. For certain subsequent dates, the gender sometimes is miscoded. 1 is male and 2 is female. How do I find in the entire dataset all the IDs that are associated with this gender coding discrepancy? Thank you

 

data MYDATA.SAMPLEGENDER;
  infile datalines dsd truncover;
  input Date:MMDDYY10. ID:BEST. Gender:BEST.;
datalines4;
11/01/2016,1,1
11/01/2016,2,2
11/01/2016,3,1
11/02/2016,1,2
11/04/2016,5,2
11/03/2016,6,2
11/04/2016,3,2
11/04/2016,8,1
11/01/2016,9,2
11/01/2016,10,2
11/01/2016,11,1
11/01/2016,12,2
11/01/2016,13,1
11/01/2016,14,2
11/10/2016,14,1
11/11/2016,14,2
;;;;
9 REPLIES 9
RW9
Diamond | Level 26 RW9
Diamond | Level 26

What exactly do you want the output to look like?  You can pull up discrepancies quite simply with a proc freq if you want to know how many of each type, or if you just want a list of subjects and the coding then proc sort nodupkey by id sex.  Or is the first record the right one, and anything different to that should be flagged?  Need to show what the output should look like.

byeh2017
Quartz | Level 8

Something that looks like this, but also a code that will count how many times by unique ID this has happened in the dataset.

 

Gender.png

Kurt_Bremser
Super User
data have;
  infile datalines dsd truncover;
  input Date:MMDDYY10. ID:BEST. Gender:BEST.;
  format date mmddyy10.;
datalines4;
11/01/2016,1,1
11/01/2016,2,2
11/01/2016,3,1
11/02/2016,1,2
11/04/2016,5,2
11/03/2016,6,2
11/04/2016,3,2
11/04/2016,8,1
11/01/2016,9,2
11/01/2016,10,2
11/01/2016,11,1
11/01/2016,12,2
11/01/2016,13,1
11/01/2016,14,2
11/10/2016,14,1
11/11/2016,14,2
;;;;
run;

proc sort data=have;
by id date;
run;

data lookup (keep=id);;
set have;
by id;
retain checkgen flag;
if first.id
then do;
  checkgen = gender;
  flag = 0;
end;
if gender ne checkgen then flag = 1;
if last.id and flag then output;
run;

data want;
merge have lookup (in=check);
by id;
if check;
run;

proc print data=want noobs;
by id;
run;

proc sql;
select count(*) from lookup;
quit;
byeh2017
Quartz | Level 8

Thank you. I'm trying to apply it to my main dataset. I noticed that the proc print is printing everything associated with the dataset. What is the line to limit it only to dcdeathdate and the gender?

 

proc print data=mydata.ODgenderlook noobs;
by nationalid;
run;

 

 

Kurt_Bremser
Super User

@byeh2017 wrote:

Thank you. I'm trying to apply it to my main dataset. I noticed that the proc print is printing everything associated with the dataset. What is the line to limit it only to dcdeathdate and the gender?

 

proc print data=mydata.ODgenderlook noobs;
by nationalid;
run;

 

 


Use the var statement. var is used in many procedures to select which variables are used in the procedure.

proc print data=mydata.ODgenderlook noobs;
by nationalid;
var dcdeathdate gender;
run;
byeh2017
Quartz | Level 8

Is there a way I can do this that looks like this output? It is simply just a listing out of the entries that have gender discrepancies.

 

gendercheck.png

 

Here's the sample dataset again. Thank you:

 

data MYDATA.SAMPLEGENDER;
  infile datalines dsd truncover;
  input Date:MMDDYY10. ID:BEST. Gender:BEST.;
datalines4;11/01/2016,1,1
11/01/2016,2,2
11/01/2016,3,1
11/02/2016,1,2
11/04/2016,5,2
11/03/2016,6,2
11/04/2016,3,2
11/04/2016,8,1
11/01/2016,9,2
11/01/2016,10,2
11/01/2016,11,1
11/01/2016,12,2
11/01/2016,13,1
11/01/2016,14,2
11/10/2016,14,1
11/11/2016,14,2;;;;

 

Kurt_Bremser
Super User

There is no entry for ID=2 and date=12/1/2016 in your sample dataset.

This SQL finds multiple gender values per id:

proc sql;
create table result as
select date, id, gender
from samplegender
where id in (
  select id
  from samplegender
  group by id
  having count(distinct gender) > 1
)
order by id
;
quit;
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Well for counts, proc freq is simple enough.  For your output below, sorry, its not clear.  That just looks like a proc print of the data you have by ID?  What is the logic, do you take the first record as being correct, and then output any that don't match that?  Do you just want an output of distinct id/sex, then proc sort would work.

Ksharp
Super User

data have;
  infile datalines dsd truncover;
  input Date:MMDDYY10. ID:BEST. Gender:BEST.;
  format date mmddyy10.;
datalines4;
11/01/2016,1,1
11/01/2016,2,2
11/01/2016,3,1
11/02/2016,1,2
11/04/2016,5,2
11/03/2016,6,2
11/04/2016,3,2
11/04/2016,8,1
11/01/2016,9,2
11/01/2016,10,2
11/01/2016,11,1
11/01/2016,12,2
11/01/2016,13,1
11/01/2016,14,2
11/10/2016,14,1
11/11/2016,14,2
;;;;
run;
proc sql;
create table want as
 select id,gender,count(*) as count
  from (
select * from have group by id having count(distinct gender) ne 1 
)
group by id,gender;
quit;

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 2190 views
  • 0 likes
  • 4 in conversation