Solved: Re: Replacing observations of one dataset with the values of another dataset

Saba1 · Posted 01-08-2019 03:17 AM

Hi

I have a dataset1 whose columns are basically managers' names attached to various dates. This is my master/main dataset. The other dataset, dataset2, holds the gender of these managers: one column has the managers' names and the column next to it has the gender (converted to numeric, i.e. 0 and 1).

I want to create a "New_dataset" which should be similar to Dataset1, but the names (character variable) must be replaced with the gender variable from Dataset2. Please see a sample below:

Data Dataset1;
infile datalines
dlm=","
missover
DSD;
input ID : $10.
	Date_Month : mmddyy10.
	Manager1 : $60.
	Manager2 : $60.
	Manager3 : $60. ;
format Date_Month mmddyy10. ;
datalines;
AB00046,06-30-2016,Ronald Baron,Bill F. Baron,
AB00046,07-31-2016,Ronald Baron,Bill F. Baron,
AB00046,08-31-2016,Ronald Baron,Bill F. Baron,
AB00046,09-30-2016,Ronald Baron,Bill F. Baron,
AB00046,10-31-2016,Ronald Baron,Bill F. Baron,Tim S.
AB00046,11-30-2016,Ronald Baron,Bill F. Baron,Tim S.
AB00046,12-31-2016,Ronald Baron,Bill F. Baron,Tim S.
AB00046,01-31-2017,Ronald Baron,Bill F. Baron,Tim S.
AB00046,02-28-2017,Ronald Baron,Bill F. Baron,Tim S.
AB00046,03-31-2017,Ronald Baron,Bill F. Baron,Tim S.
AB00046,04-30-2017,Ronald Baron,Bill F. Baron,Tim S.
AB00046,05-31-2017,Ronald Baron,Bill F. Baron,Tim S.
AB00046,06-30-2017,Ronald Baron,Bill F. Baron,Tim S.
AB00046,07-31-2017,Ronald Baron,Bill F. Baron,
AB00046,08-31-2017,Ronald Baron,Bill F. Baron,
AB00050,04-30-2016,
AB00050,05-31-2016,
AB00050,06-30-2016,Sharon
AB00050,07-31-2016,Sharon,Tim S.
AB00050,08-31-2016,Sharon,Tim S.
AB00050,09-30-2016,Sharon,Tim S. 
;
run;

Data Dataset2;
infile datalines
dlm=","
missover
DSD;
input ID : $10.
	Name_Manager : $60.
	Gender ;
datalines;
AB00046,Ronald Baron,0
AB00046,Bill F. Baron,0
AB00046,Tim S.,0
AB00050,Sharon,1
AB00050,Tim S.,0
;
run;

Data Dataset_Want;
infile datalines
dlm=","
missover
DSD;
input ID : $10.
	Date_Month : mmddyy10.
	Manager1
	Manager2
	Manager3 ;
format Date_Month mmddyy10. ;
datalines;
AB00046,06-30-2016,0,0,
AB00046,07-31-2016,0,0,
AB00046,08-31-2016,0,0,
AB00046,09-30-2016,0,0,
AB00046,10-31-2016,0,0,0
AB00046,11-30-2016,0,0,0
AB00046,12-31-2016,0,0,0
AB00046,01-31-2017,0,0,0
AB00046,02-28-2017,0,0,0
AB00046,03-31-2017,0,0,0
AB00046,04-30-2017,0,0,0
AB00046,05-31-2017,0,0,0
AB00046,06-30-2017,0,0,0
AB00046,07-31-2017,0,0,
AB00046,08-31-2017,0,0,
AB00050,04-30-2016,
AB00050,05-31-2016,
AB00050,06-30-2016,1
AB00050,07-31-2016,1,0
AB00050,08-31-2016,1,0
AB00050,09-30-2016,1,0
;
run;

Please guide me in this regard. Thanks.

Saba1 · Posted 01-08-2019 03:18 AM

@Ksharp: your help will be really appreciated.

Kurt_Bremser · Posted 01-08-2019 04:10 AM

Create a format from dataset2, and apply it:

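data cntlin;
set dataset2 end=eof;
fmtname = 'mygender';
type = 'C'; /* character format, referenced as $mygender. */
start = name_manager;
/* strip() removes the leading blanks that best. (default width 12) leaves in the label */
label = strip(put(gender,best.));
output;
if eof
then do;
  /* catch-all range: with hlo = 'O' the start value is ignored */
  /* and every unmatched name is mapped to a blank label        */
  start = 'other';
  hlo = 'O';
  label = '';
  output;
end;
drop id name_manager gender;
run;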

proc format cntlin=cntlin;
run;

data want;
set dataset1;
array manager {*} manager:;
do i = 1 to dim(manager);
  manager {i} = put(manager{i},$mygender.);
end;
drop i;
run;

Saba1 · Posted 01-08-2019 05:48 PM

@Kurt_Bremser : Thanks for your reply. But when I run "proc format cntlin=cntlin;" statement, the Error says: "ERROR: For format $MYGENDER, this range is repeated, or values overlap". Therefore, for "Want" dataset when I run data step, the Error is as follows: " The format $MYGENDER was not found or could not be loaded".
Actually, several managers' names are repeated in dataset2. So there can be cases with same manager name but different ID or in few cases even with the same ID the name is repeated twice. Is there a way to solve this issue?

andreas_lds · Posted 01-09-2019 02:10 AM

@Saba1 wrote:
@Kurt_Bremser : Thanks for your reply. But when I run "proc format cntlin=cntlin;" statement, the Error says: "ERROR: For format $MYGENDER, this range is repeated, or values overlap". Therefore, for "Want" dataset when I run data step, the Error is as follows: " The format $MYGENDER was not found or could not be loaded".
Actually, several managers' names are repeated in dataset2. So there can be cases with same manager name but different ID or in few cases even with the same ID the name is repeated twice. Is there a way to solve this issue?

Please post test data that matches your real data. With the data posted, creating the format does not throw any error message.

Saba1 · Posted 01-09-2019 09:00 PM

@andreas_lds Please see the updated data sample. Thanks.

Kurt_Bremser · Posted 01-09-2019 04:52 AM

@Saba1 wrote:
@Kurt_Bremser : Thanks for your reply. But when I run "proc format cntlin=cntlin;" statement, the Error says: "ERROR: For format $MYGENDER, this range is repeated, or values overlap". Therefore, for "Want" dataset when I run data step, the Error is as follows: " The format $MYGENDER was not found or could not be loaded".
Actually, several managers' names are repeated in dataset2. So there can be cases with same manager name but different ID or in few cases even with the same ID the name is repeated twice. Is there a way to solve this issue?

With the data presented, the code works. Please post complete example data that illustrates your issues sufficiently.

Saba1 · Posted 01-09-2019 09:01 PM

@Kurt_Bremser: I have updated the sample data. You can see now. Thanks

Kurt_Bremser · Posted 01-10-2019 02:42 AM

Ok, so we need to expand the format solution.

One way would be to combine the two values that determine the outcome into a single variable and create the format for this.
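A minimal sketch of that first option, in case it helps. It assumes the character '|' never occurs in ID or Name_Manager and that dataset2 has no repeated ID/name pairs; the names cntlin_combined, idgender and want_combined are made up for illustration.

```sas
/* Build one format whose key is ID and manager name joined by '|' */
data cntlin_combined;
  set dataset2 end=eof;
  length fmtname $32 type $1 start $80 label $1 hlo $1;
  fmtname = 'idgender';                    /* hypothetical format name */
  type    = 'C';
  start   = catx('|', id, name_manager);   /* combined lookup key      */
  label   = put(gender, 1.);
  output;
  if eof then do;
    hlo   = 'O';                           /* catch-all for unmatched keys */
    label = ' ';
    output;
  end;
  keep fmtname type start label hlo;
run;

proc format cntlin=cntlin_combined;
run;

data want_combined;
  set dataset1;
  array manager {*} manager:;
  do i = 1 to dim(manager);
    /* an empty manager cell yields just the ID, which falls into 'other' */
    manager{i} = put(catx('|', id, manager{i}), $idgender.);
  end;
  drop i;
run;
```

Because the format name is a literal here, the plain put() function is enough.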

Or we create one format per ID, which I have done here, and use the putc() function instead of put() (putc() and putn() accept expressions as formats, while put() only accepts a literal format name):

Data Dataset1;
infile datalines
dlm=","
missover
DSD;
input ID : $10.
	Date_Month : mmddyy10.
	Manager1 : $60.
	Manager2 : $60.
	Manager3 : $60. ;
format Date_Month mmddyy10. ;
datalines;
AB00046,06-30-2016,Ronald Baron,Bill F. Baron,
AB00046,07-31-2016,Ronald Baron,Bill F. Baron,
AB00046,08-31-2016,Ronald Baron,Bill F. Baron,
AB00046,09-30-2016,Ronald Baron,Bill F. Baron,
AB00046,10-31-2016,Ronald Baron,Bill F. Baron,Tim S.
AB00046,11-30-2016,Ronald Baron,Bill F. Baron,Tim S.
AB00046,12-31-2016,Ronald Baron,Bill F. Baron,Tim S.
AB00046,01-31-2017,Ronald Baron,Bill F. Baron,Tim S.
AB00046,02-28-2017,Ronald Baron,Bill F. Baron,Tim S.
AB00046,03-31-2017,Ronald Baron,Bill F. Baron,Tim S.
AB00046,04-30-2017,Ronald Baron,Bill F. Baron,Tim S.
AB00046,05-31-2017,Ronald Baron,Bill F. Baron,Tim S.
AB00046,06-30-2017,Ronald Baron,Bill F. Baron,Tim S.
AB00046,07-31-2017,Ronald Baron,Bill F. Baron,
AB00046,08-31-2017,Ronald Baron,Bill F. Baron,
AB00050,04-30-2016,
AB00050,05-31-2016,
AB00050,06-30-2016,Sharon
AB00050,07-31-2016,Sharon,Tim S.
AB00050,08-31-2016,Sharon,Tim S.
AB00050,09-30-2016,Sharon,Tim S. 
;
run;

Data Dataset2;
infile datalines
dlm=","
missover
DSD;
input ID : $10.
	Name_Manager : $60.
	Gender ;
datalines;
AB00046,Ronald Baron,0
AB00046,Bill F. Baron,0
AB00046,Tim S.,0
AB00050,Sharon,1
AB00050,Tim S.,0
;
run;


proc sort data=dataset2;
by id;
run;

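data cntlin;
set dataset2;
by id;
/* one character format per ID, e.g. $AB00046_ */
fmtname = strip(id) !! '_';
type = 'C';
start = name_manager;
/* strip() removes the leading blanks that best. (default width 12) leaves in the label */
label = strip(put(gender,best.));
output;
if last.id
then do;
  /* close each ID's format with a catch-all range; start is ignored when hlo = 'O' */
  start = 'other';
  hlo = 'O';
  label = '';
  output;
end;
drop id name_manager gender;
run;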

proc format cntlin=cntlin;
run;

data want;
set dataset1;
array manager {*} manager:;
do i = 1 to dim(manager);
  manager {i} = putc(manager{i},strip(id) !! '_');
end;
drop i;
run;

The appended underscore is necessary because your ID values end with digits, and a format name must not end with a digit. Similarly, you would need to prepend an underscore or a letter if ID started with a digit.
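One caveat, going by the description that a name can appear twice under the same ID: such repeats would still trigger the "range is repeated" error within that ID's format. A possible sketch for removing them first follows. The output name dataset2_unique is hypothetical, and the dedup assumes that repeated ID/name pairs always carry the same Gender, which the first query checks.

```sas
/* List ID/name pairs that carry conflicting Gender values (ideally none) */
proc sql;
  select id, name_manager, count(distinct gender) as n_genders
  from dataset2
  group by id, name_manager
  having count(distinct gender) > 1;
quit;

/* Keep one row per ID/name pair; the result is also sorted by ID */
proc sort data=dataset2 out=dataset2_unique nodupkey;
  by id name_manager;
run;
```

The cntlin step would then read dataset2_unique instead of dataset2.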

PeterClemmensen · Posted 01-08-2019 04:16 AM

There are many ways to solve a problem like this. Here is a PROC FORMAT approach:

data fmt;
   set Dataset2(rename=(Name_Manager=start Gender=label)) end=eof;
   retain fmtname "MngFmt" type "c";
   output;
   if eof then do;
      start=""; Label="";
      HLO="O"; output;
   end;
run;

proc format library=work cntlin=fmt;
run;

data want;
   set Dataset1;
   Manager1=input(put(Manager1, $MngFmt.), 8.);
   Manager2=input(put(Manager2, $MngFmt.), 8.);
   Manager3=input(put(Manager3, $MngFmt.), 8.);
run;

Saba1 · Posted 01-08-2019 05:55 PM

@PeterClemmensen: Thanks but in my dataset2 managers' names are repeating so when I run "proc format" statement, I receive the following Error: "ERROR: For format $MNGFMT, this range is repeated, or values overlap:". Also is there a way to replace all the statements like "Manager1=input(put(Manager1, $MngFmt.), 8.);" with something else, because I have around 100 columns in Dataset1 i.e. Manager1 - Manager100.

Patrick · Posted 01-08-2019 06:55 PM

@Saba1

Here is a hash approach that also matches on the ID column, not only on the names.

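data want(drop=_:);
  set Dataset1;
  array Managers {*} Manager:;   /* all Manager1 ... ManagerN columns */

  if _n_=1 then 
    do;
      /* never executed; only defines the lookup variables at compile time */
      if 0 then set dataset2(keep=Name_Manager Gender);
      drop Name_Manager Gender;
      /* load dataset2 once into a hash keyed on ID + manager name */
      dcl hash h1(dataset:'dataset2');
      h1.defineKey('id','Name_Manager');
      h1.defineData('Gender');
      h1.defineDone();
    end;

  do _i=1 to dim(managers);
    /* find() returns 0 on a match and fills Gender from the hash */
    if h1.find(key:id, key:managers[_i]) = 0 then managers[_i]=put(Gender,f1.);
    else call missing(managers[_i]);
  end;
run;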
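
Note that the step above overwrites the character Manager columns, so the result holds '0'/'1' as text. If numeric columns are wanted, as in Dataset_Want, a variant along these lines might work. It is only a sketch: want_numeric and Gender1-Gender3 are made-up names, and the array bounds would need to match the real number of Manager columns (for example 100).

```sas
data want_numeric(drop=_: Name_Manager Gender);
  set Dataset1;
  array managers {*} Manager1-Manager3;   /* existing character columns        */
  array genders  {*} Gender1-Gender3;     /* new numeric columns (hypothetical) */

  if _n_ = 1 then do;
    /* never executed; only defines the lookup variables at compile time */
    if 0 then set dataset2(keep=Name_Manager Gender);
    dcl hash h1(dataset:'dataset2');
    h1.defineKey('id', 'Name_Manager');
    h1.defineData('Gender');
    h1.defineDone();
  end;

  do _i = 1 to dim(managers);
    /* copy the looked-up value on a match, otherwise set missing */
    if h1.find(key:id, key:managers[_i]) = 0 then genders[_i] = Gender;
    else genders[_i] = .;
  end;
run;
```

The name columns are kept here for checking; adding Manager: to the drop= list would remove them once the result looks right.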
