- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello,
I am working with a dataset that contains demographics data that does not always align (i.e. someone can have multiple records and they do not always match). I am trying to code my program so that I can count how many times someone is reported as sex=female or sex=male and then I need to make a new variable for sex based on which outcome occurs the most.
For example, this is kind of like what I have
SSN | Sex |
111111 | M |
111111111 | M |
111111111 | M |
111111111 | F |
And then I would want something like this (could look different, but this gist):
SSN | Sex1 | N1 | Sex2 | N2 | NewSex |
111111111 | M | 3 | F | 1 | M |
Does anyone have any ideas on how I can achieve this?
TIA
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
proc freq data=sashelp.heart noprint;
table BP_Status*sex/out=have list;
run;
proc sql noprint;
select max(n) into : n
from (select count(*) as n from have group by BP_Status);
quit;
proc summary data=have;
by BP_Status ;
output out=temp(drop=_:) idgroup(out[&n] (sex count)=);
run;
data want;
set temp;
array x{*} sex:;
array y{*} count:;
new=x{ whichn(max(of y{*}),of y{*}) };
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
proc freq data=sashelp.heart noprint;
table BP_Status*sex/out=have list;
run;
proc sql noprint;
select max(n) into : n
from (select count(*) as n from have group by BP_Status);
quit;
proc summary data=have;
by BP_Status ;
output out=temp(drop=_:) idgroup(out[&n] (sex count)=);
run;
data want;
set temp;
array x{*} sex:;
array y{*} count:;
new=x{ whichn(max(of y{*}),of y{*}) };
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- What if there is a tie?
- Your sample seems to assume that SSN 111111 (first row) is the same person as SSN 1111111111 rows 2 through 4). Is that just a typo?
- Is your data already sorted by ssn?
- Do you want to reduce multiple records to one as in your example? This would be a problem if your original records have variables besides SEX that vary over a set of matching SSN's.
Assuming the data are sorted, and there are no other variables that vary per #4, then
data want;
do until (last.ssn);
set have;
by ssn;
if sex='M' then nmales=sum(nmales,1); else
if sex='F' then nfemales=sum(nfemales,1);
end;
run;
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- What if there is a tie? If there is a tie I need to check another database
- Your sample seems to assume that SSN 111111 (first row) is the same person as SSN 1111111111 rows 2 through 4). Is that just a typo? Yes this was a typo - it should be all the same SSN for the example
- Is your data already sorted by ssn? Yes
- Do you want to reduce multiple records to one as in your example? This would be a problem if your original records have variables besides SEX that vary over a set of matching SSN's. -- I need to check multiple variables that have the same issue (take the most recurring value for each variable I need). My idea was to do this in segments to make it less complicated and then create a new dataset with one line per person with all of the necessary demographic information
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
data have;
input SSN $ Sex :$1.;
datalines;
111111111 M
111111111 M
111111111 M
111111111 F
222222222 F
222222222 F
333333333 M
333333333 F
;
proc sql;
create table max as
select
ssn,
sex
from sum
group by ssn
having count = max(count);
quit;
data
want (drop=ct)
check
;
do until (last.ssn);
set max;
by ssn;
ct = sum(ct,1);
end;
do until (last.ssn);
set max;
by ssn;
if ct > 1
then output check;
else output want;
end;
run;
I avoided to create a wide dataset, as you will do this for several variables; you can always create a wide report with PROC REPORT if needed.