I'm doing some data cleaning on a dataset that includes dates, ID, and gender. For certain subsequent dates, the gender sometimes is miscoded. 1 is male and 2 is female. How do I find in the entire dataset all the IDs that are associated with this gender coding discrepancy? Thank you
data MYDATA.SAMPLEGENDER;
infile datalines dsd truncover;
input Date:MMDDYY10. ID:BEST. Gender:BEST.;
datalines4;
11/01/2016,1,1
11/01/2016,2,2
11/01/2016,3,1
11/02/2016,1,2
11/04/2016,5,2
11/03/2016,6,2
11/04/2016,3,2
11/04/2016,8,1
11/01/2016,9,2
11/01/2016,10,2
11/01/2016,11,1
11/01/2016,12,2
11/01/2016,13,1
11/01/2016,14,2
11/10/2016,14,1
11/11/2016,14,2
;;;;
What exactly do you want the output to look like? You can pull up discrepancies quite simply with a proc freq if you want to know how many of each type, or if you just want a list of subjects and the coding then proc sort nodupkey by id sex. Or is the first record the right one, and anything different to that should be flagged? Need to show what the output should look like.
Something that looks like this, but also a code that will count how many times by unique ID this has happened in the dataset.
data have;
infile datalines dsd truncover;
input Date:MMDDYY10. ID:BEST. Gender:BEST.;
format date mmddyy10.;
datalines4;
11/01/2016,1,1
11/01/2016,2,2
11/01/2016,3,1
11/02/2016,1,2
11/04/2016,5,2
11/03/2016,6,2
11/04/2016,3,2
11/04/2016,8,1
11/01/2016,9,2
11/01/2016,10,2
11/01/2016,11,1
11/01/2016,12,2
11/01/2016,13,1
11/01/2016,14,2
11/10/2016,14,1
11/11/2016,14,2
;;;;
run;
proc sort data=have;
by id date;
run;
data lookup (keep=id);;
set have;
by id;
retain checkgen flag;
if first.id
then do;
checkgen = gender;
flag = 0;
end;
if gender ne checkgen then flag = 1;
if last.id and flag then output;
run;
data want;
merge have lookup (in=check);
by id;
if check;
run;
proc print data=want noobs;
by id;
run;
proc sql;
select count(*) from lookup;
quit;
Thank you. I'm trying to apply it to my main dataset. I noticed that the proc print is printing everything associated with the dataset. What is the line to limit it only to dcdeathdate and the gender?
proc print data=mydata.ODgenderlook noobs;
by nationalid;
run;
@byeh2017 wrote:
Thank you. I'm trying to apply it to my main dataset. I noticed that the proc print is printing everything associated with the dataset. What is the line to limit it only to dcdeathdate and the gender?
proc print data=mydata.ODgenderlook noobs; by nationalid; run;
Use the var statement. var is used in many procedures to select which variables are used in the procedure.
proc print data=mydata.ODgenderlook noobs;
by nationalid;
var dcdeathdate gender;
run;
Is there a way I can do this that looks like this output? It is simply just a listing out of the entries that have gender discrepancies.
Here's the sample dataset again. Thank you:
data MYDATA.SAMPLEGENDER;
infile datalines dsd truncover;
input Date:MMDDYY10. ID:BEST. Gender:BEST.;
datalines4;11/01/2016,1,1
11/01/2016,2,2
11/01/2016,3,1
11/02/2016,1,2
11/04/2016,5,2
11/03/2016,6,2
11/04/2016,3,2
11/04/2016,8,1
11/01/2016,9,2
11/01/2016,10,2
11/01/2016,11,1
11/01/2016,12,2
11/01/2016,13,1
11/01/2016,14,2
11/10/2016,14,1
11/11/2016,14,2;;;;
There is no entry for ID=2 and date=12/1/2016 in your sample dataset.
This SQL finds multiple gender values per id:
proc sql;
create table result as
select date, id, gender
from samplegender
where id in (
select id
from samplegender
group by id
having count(distinct gender) > 1
)
order by id
;
quit;
Well for counts, proc freq is simple enough. For your output below, sorry, its not clear. That just looks like a proc print of the data you have by ID? What is the logic, do you take the first record as being correct, and then output any that don't match that? Do you just want an output of distinct id/sex, then proc sort would work.
data have; infile datalines dsd truncover; input Date:MMDDYY10. ID:BEST. Gender:BEST.; format date mmddyy10.; datalines4; 11/01/2016,1,1 11/01/2016,2,2 11/01/2016,3,1 11/02/2016,1,2 11/04/2016,5,2 11/03/2016,6,2 11/04/2016,3,2 11/04/2016,8,1 11/01/2016,9,2 11/01/2016,10,2 11/01/2016,11,1 11/01/2016,12,2 11/01/2016,13,1 11/01/2016,14,2 11/10/2016,14,1 11/11/2016,14,2 ;;;; run; proc sql; create table want as select id,gender,count(*) as count from ( select * from have group by id having count(distinct gender) ne 1 ) group by id,gender; quit;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.