Hi
I have a dataset1 where columns are basically managers' names attached to various dates. This is my master/main dataset. The other dataset2 has a gender column of these managers, so one column is names of these managers and parallel column is the gender (converted in numeric i.e. 0 and 1).
I want to create a "New_dataset" which should be similar to Dataset1, but the names (character variable) must be replaced with the gender variable from Dataset2. Please see a sample below:
Data Dataset1;
infile datalines
dlm=","
missover
DSD;
input ID : $10.
Date_Month : mmddyy10.
Manager1 : $60.
Manager2 : $60.
Manager3 : $60. ;
format Date_Month mmddyy10. ;
datalines;
AB00046,06-30-2016,Ronald Baron,Bill F. Baron,
AB00046,07-31-2016,Ronald Baron,Bill F. Baron,
AB00046,08-31-2016,Ronald Baron,Bill F. Baron,
AB00046,09-30-2016,Ronald Baron,Bill F. Baron,
AB00046,10-31-2016,Ronald Baron,Bill F. Baron,Tim S.
AB00046,11-30-2016,Ronald Baron,Bill F. Baron,Tim S.
AB00046,12-31-2016,Ronald Baron,Bill F. Baron,Tim S.
AB00046,01-31-2017,Ronald Baron,Bill F. Baron,Tim S.
AB00046,02-28-2017,Ronald Baron,Bill F. Baron,Tim S.
AB00046,03-31-2017,Ronald Baron,Bill F. Baron,Tim S.
AB00046,04-30-2017,Ronald Baron,Bill F. Baron,Tim S.
AB00046,05-31-2017,Ronald Baron,Bill F. Baron,Tim S.
AB00046,06-30-2017,Ronald Baron,Bill F. Baron,Tim S.
AB00046,07-31-2017,Ronald Baron,Bill F. Baron,
AB00046,08-31-2017,Ronald Baron,Bill F. Baron,
AB00050,04-30-2016,
AB00050,05-31-2016,
AB00050,06-30-2016,Sharon
AB00050,07-31-2016,Sharon,Tim S.
AB00050,08-31-2016,Sharon,Tim S.
AB00050,09-30-2016,Sharon,Tim S.
; run;
Data Dataset2;
infile datalines
dlm=","
missover
DSD;
input ID : $10.
Name_Manager : $60.
Gender ;
datalines;
AB00046,Ronald Baron,0
AB00046,Bill F. Baron,0
AB00046,Tim S.,0
AB00050,Sharon,1
AB00050,Tim S.,0
;
run;
Data Dataset_Want;
infile datalines
dlm=","
missover
DSD;
input ID : $10.
Date_Month : mmddyy10.
Manager1
Manager2
Manager3 ;
format Date_Month mmddyy10. ;
datalines;
AB00046,06-30-2016,0,0,
AB00046,07-31-2016,0,0,
AB00046,08-31-2016,0,0,
AB00046,09-30-2016,0,0,
AB00046,10-31-2016,0,0,0
AB00046,11-30-2016,0,0,0
AB00046,12-31-2016,0,0,0
AB00046,01-31-2017,0,0,0
AB00046,02-28-2017,0,0,0
AB00046,03-31-2017,0,0,0
AB00046,04-30-2017,0,0,0
AB00046,05-31-2017,0,0,0
AB00046,06-30-2017,0,0,0
AB00046,07-31-2017,0,0,
AB00046,08-31-2017,0,0,
AB00050,04-30-2016,
AB00050,05-31-2016,
AB00050,06-30-2016,1
AB00050,07-31-2016,1,0
AB00050,08-31-2016,1,0
AB00050,09-30-2016,1,0
;
run;
Please Guide me in this regard. Thanks.
Here a hash approach which also looks for a matching ID column and not only names.
data want(drop=_:);
set Dataset1;
array Managers {*} Manager:;
if _n_=1 then
do;
if 0 then set dataset2(keep=Name_Manager);
drop Name_Manager Gender;
dcl hash h1(dataset:'dataset2');
h1.defineKey('id','Name_Manager');
h1.defineData('Gender');
h1.defineDone();
end;
do _i=1 to dim(managers);
if h1.find(key:id, key:managers[_i]) = 0 then managers[_i]=put(Gender,f1.);
else call missing(managers[_i]);
end;
run;
Create a format from dataset2, and apply it:
data cntlin;
set dataset2 end=eof;
fmtname = 'mygender';
type = 'C';
start = name_manager;
label = put(gender,best.);
output;
if eof
then do;
name_manager = 'other';
hlo = 'O';
label = '';
output;
end;
drop id name_manager gender;
run;
proc format cntlin=cntlin;
run;
data want;
set dataset1;
array manager {*} manager:;
do i = 1 to dim(manager);
manager {i} = put(manager{i},$mygender.);
end;
drop i;
run;
@Saba1 wrote:
@Kurt_Bremser : Thanks for your reply. But when I run "proc format cntlin=cntlin;" statement, the Error says: "ERROR: For format $MYGENDER, this range is repeated, or values overlap". Therefore, for "Want" dataset when I run data step, the Error is as follows: " The format $MYGENDER was not found or could not be loaded".
Actually, several managers' names are repeated in dataset2. So there can be cases with same manager name but different ID or in few cases even with the same ID the name is repeated twice. Is there a way to solve this issue?
Please post test-data that matches your real data. With the data posted creating the format does not throw any error message.
@andreas_ldsPlease see the updated data sample. Thanks
@Saba1 wrote:
@Kurt_Bremser : Thanks for your reply. But when I run "proc format cntlin=cntlin;" statement, the Error says: "ERROR: For format $MYGENDER, this range is repeated, or values overlap". Therefore, for "Want" dataset when I run data step, the Error is as follows: " The format $MYGENDER was not found or could not be loaded".
Actually, several managers' names are repeated in dataset2. So there can be cases with same manager name but different ID or in few cases even with the same ID the name is repeated twice. Is there a way to solve this issue?
With the data presented, the code works. Please post complete example data that illustrates your issues sufficiently.
Ok, so we need to expand the format solution.
One way would be to combine the two values that determine the outcome into a single variable and create the format for this.
Or we create multiple formats for the ID's, which I have done here, and use the putc() instead of the put() function (putc() and putn() accept expressions as formats, while put() only accepts a literal format name):
Data Dataset1;
infile datalines
dlm=","
missover
DSD;
input ID : $10.
Date_Month : mmddyy10.
Manager1 : $60.
Manager2 : $60.
Manager3 : $60. ;
format Date_Month mmddyy10. ;
datalines;
AB00046,06-30-2016,Ronald Baron,Bill F. Baron,
AB00046,07-31-2016,Ronald Baron,Bill F. Baron,
AB00046,08-31-2016,Ronald Baron,Bill F. Baron,
AB00046,09-30-2016,Ronald Baron,Bill F. Baron,
AB00046,10-31-2016,Ronald Baron,Bill F. Baron,Tim S.
AB00046,11-30-2016,Ronald Baron,Bill F. Baron,Tim S.
AB00046,12-31-2016,Ronald Baron,Bill F. Baron,Tim S.
AB00046,01-31-2017,Ronald Baron,Bill F. Baron,Tim S.
AB00046,02-28-2017,Ronald Baron,Bill F. Baron,Tim S.
AB00046,03-31-2017,Ronald Baron,Bill F. Baron,Tim S.
AB00046,04-30-2017,Ronald Baron,Bill F. Baron,Tim S.
AB00046,05-31-2017,Ronald Baron,Bill F. Baron,Tim S.
AB00046,06-30-2017,Ronald Baron,Bill F. Baron,Tim S.
AB00046,07-31-2017,Ronald Baron,Bill F. Baron,
AB00046,08-31-2017,Ronald Baron,Bill F. Baron,
AB00050,04-30-2016,
AB00050,05-31-2016,
AB00050,06-30-2016,Sharon
AB00050,07-31-2016,Sharon,Tim S.
AB00050,08-31-2016,Sharon,Tim S.
AB00050,09-30-2016,Sharon,Tim S.
;
run;
Data Dataset2;
infile datalines
dlm=","
missover
DSD;
input ID : $10.
Name_Manager : $60.
Gender ;
datalines;
AB00046,Ronald Baron,0
AB00046,Bill F. Baron,0
AB00046,Tim S.,0
AB00050,Sharon,1
AB00050,Tim S.,0
;
run;
proc sort data=dataset2;
by id;
run;
data cntlin;
set dataset2;
by id;
fmtname = strip(id) !! '_';
type = 'C';
start = name_manager;
label = put(gender,best.);
output;
if last.id
then do;
name_manager = 'other';
hlo = 'O';
label = '';
output;
end;
drop id name_manager gender;
run;
proc format cntlin=cntlin;
run;
data want;
set dataset1;
array manager {*} manager:;
do i = 1 to dim(manager);
manager {i} = putc(manager{i},strip(id) !! '_');
end;
drop i;
run;
The appended underline is necessary because your ID values end with numbers. Similarly, you would neet to prepend an underline or a character if ID started with a digit.
There are many way to solve a problem like this. Here is a PROC FORMAT Approach
data fmt;
set Dataset2(rename=(Name_Manager=start Gender=label)) end=eof;
retain fmtname "MngFmt" type "c";
output;
if eof then do;
start=""; Label="";
HLO="O"; output;
end;
run;
proc format library=work cntlin=fmt;
run;
data want;
set Dataset1;
Manager1=input(put(Manager1, $MngFmt.), 8.);
Manager2=input(put(Manager2, $MngFmt.), 8.);
Manager3=input(put(Manager3, $MngFmt.), 8.);
run;
Here a hash approach which also looks for a matching ID column and not only names.
data want(drop=_:);
set Dataset1;
array Managers {*} Manager:;
if _n_=1 then
do;
if 0 then set dataset2(keep=Name_Manager);
drop Name_Manager Gender;
dcl hash h1(dataset:'dataset2');
h1.defineKey('id','Name_Manager');
h1.defineData('Gender');
h1.defineDone();
end;
do _i=1 to dim(managers);
if h1.find(key:id, key:managers[_i]) = 0 then managers[_i]=put(Gender,f1.);
else call missing(managers[_i]);
end;
run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.