Create a new variable from three other variables

Accepted Solution Solved
Reply
New Contributor
Posts: 4
Accepted Solution

Create a new variable from three other variables

Hej SAS Experts, I hope someone can help me on this problem... I use SAS 9.4 

 

I want to create a new variable from other variables in an existing data-set.

In this data-set I have three IDs, the ID of an individual, and its parents (father, mother). The problem is, the data-set gives no information about the sex of my individual, but the individual appears again e.g. as a father. I need to check the sex of my individuals...

 

If I have an example data-set like this:

ID_I   ID_F   ID_M 

11      21      31     

12      22      32    

13      23      33  

14      11      12      

15      11      13

16      11      12

17      14      12

18      14      15

19      14      16

 

 What I want to achieve with a code is an additional variable (1 for Mother, 2 for Father and 0 for unknown): 

ID_I    ID_F    ID_M    Sex

11        21        31         2

12        22        32        1

13        23        33        1

14        11        12         2

15        11        13        1

16        11        12         1

17        14        12        0

18        14        15        0

19        14        16       0

 

Does SAS have any function which can do this? 


My idea was to do it with the IN() Function but then I would need to have arrays not variables? Anyway this was my failed try:

Data want;

Set Data Have;

 

Sex= ID_I in ID_F;

put Sex=2;

 

Sex=ID_I in ID_D;

put Sex=1;

RUN;

 

I have more than 10.000 observations for every variable, so I don´t know if it is possible to create arrays from my variables? Is there any other smart solution to this?

 

I would appreciate your help a lot, I am still new to SAS and this is just too complicated for me.

Thank you in advance Anni


Accepted Solutions
Solution
‎05-15-2017 07:37 AM
PROC Star
Posts: 7,428

Re: Create a new variable from three other variables

[ Edited ]

10,000 is really small for a SAS array. You can use them for many millions.

 

If I correctly understand what you're trying to do, try the following:

data want (drop=n);
array fathers(100000) _temporary_;
array mothers(100000) _temporary_;
do until (last);
set have end=last;
n+1;
fathers(n)=ID_F;
mothers(n)=ID_M;
end;
do until (last2);
set have end=last2;
if id_i in fathers then sex=2;
else if id_i in mothers then sex=1;
else sex=0;
output;
end;
run;

Art, CEO, AnalystFinder.com

 

View solution in original post


All Replies
Solution
‎05-15-2017 07:37 AM
PROC Star
Posts: 7,428

Re: Create a new variable from three other variables

[ Edited ]

10,000 is really small for a SAS array. You can use them for many millions.

 

If I correctly understand what you're trying to do, try the following:

data want (drop=n);
array fathers(100000) _temporary_;
array mothers(100000) _temporary_;
do until (last);
set have end=last;
n+1;
fathers(n)=ID_F;
mothers(n)=ID_M;
end;
do until (last2);
set have end=last2;
if id_i in fathers then sex=2;
else if id_i in mothers then sex=1;
else sex=0;
output;
end;
run;

Art, CEO, AnalystFinder.com

 

Valued Guide
Posts: 947

Re: Create a new variable from three other variables

@art297

 

The "do until (last);" won't increment _N_, so you need to change the code inside the loop.

 

Mark

New Contributor
Posts: 4

Re: Create a new variable from three other variables

Thank you Mark the code first gave the wrong result, but now it works perfectly!!

Anni

New Contributor
Posts: 4

Re: Create a new variable from three other variables

Dear Art, yes that is exactly what I wanted to do.

The results seem to be correct. Thank you very much for your help!!

Kindest regards Anni

Super User
Posts: 18,997

Re: Create a new variable from three other variables

This is a bit of a longer approach, but more introductory level solution that you should be able to modify as necessary.

 

1. Create sample data - not required in your actual code. Use the data you already have.

2. Create master list of gender data from original data by appending the two datasets and assigning F/M as indicated.

3. Ensure list above has one record per ID and gender - see comments in code below.

4. Merge results back in and keep only records from the original file. 

 

*input sample data;

data have;
	input ID_I ID_F ID_M;
	cards;
11      21      31     
12      22      32    
13      23      33  
14      11      12      
15      11      13
16      11      12
17      14      12
18      14      15
19      14      16
;
run;

*Get gender from file originally - should make sure this is unique
one record per ID;

data gender;
	set have (rename=ID_F=ID in=Males) have (rename=ID_M=ID in=Females);

	if Males then
		sex=2;

	if Females then
		sex=1;
	keep ID SEX;
run;

*Sort for merging;

proc sort data=gender nodupkey;
	by id sex;
run;

*CHECK FOR duplicates - if the log shows duplicates you have a problem with your data
mainly ID identified as both male and female; 
proc sort data=gender out=checkme nodupkey;
by id;
run;

proc sort data=have;
	by ID_I;
run;


*merge data;

data want;
	merge have (in=original) gender (rename=ID=ID_I);
	by ID_I;

	if original;
run;
Trusted Advisor
Posts: 1,459

Re: Create a new variable from three other variables

[ Edited ]

Another solution similar logic to @Reeza's, but with a format:

 

/* step 1 - create a format */

data ID_sex(rename=(sex=label));

  set have;

        retain fmtname 'sexf';

        start = id_f; sex='2'; output;

        start = id_m; sex='1'; output;

       keep fmtname start sex;

run;

proc sort data=id_sex out=ID_sex nodupkey; by start; run;

proc format cntlin=ID_sex; run;

 

data want;

  set have;

        length sex $1;

        sex = put(id_i,sexf.);

run;

 

 

       

  

PROC Star
Posts: 7,428

Re: Create a new variable from three other variables

[ Edited ]

I revised my code based on @mkeintz's comment. He was totally correct and here is the new code:

data have;
  input id_I ID_F ID_M;
  cards;
11      21      31     
12      22      32    
13      23      33  
14      11      12      
15      11      13
16      11      12
17      14      12
18      14      15
19      14      16
;

data want (drop=n);
  array fathers(100000) _temporary_;
  array mothers(100000) _temporary_;
  do until (last);
    set have end=last;
    n+1;
    fathers(n)=ID_F;
    mothers(n)=ID_M;
  end;
  do until (last2);
    set have end=last2;
    if id_i in fathers then sex=2;
    else if id_i in mothers then sex=1;
    else sex=0;
    output;
  end;
run;
PROC Star
Posts: 276

Re: Create a new variable from three other variables

Another easy solution:

 

data have;

     input ID_I ID_F ID_M;

     cards;

11     21      31    

12     22      32   

13     23      33 

14     11      12     

15     11      13

16     11      12

17     14      12

18     14      15

19     14      16

;

run;

 

 

 

data want;

if _N_ = 1 then do;

     if 0 then set have;

     declare hash myhash(dataset:'have(keep=ID_F)',multidata:'YES');

      myhash.defineKey('ID_F');

      myhash.defineDone( );

     declare hash myhash2(dataset:'have(keep=ID_M)',multidata:'YES');

      myhash2.defineKey('ID_M');

      myhash2.defineDone( );

end;

set have;

if myhash.check(key:ID_I)=0 then Sex=2;

else if myhash2.check(key:ID_I)=0 then Sex=1;

else Sex=0;

run;

 

Regards,

Naveen Srinivasan

Super User
Posts: 9,854

Re: Create a new variable from three other variables

I should not jump into this hive.
Since I have some time, I would like to introduce SQL solution.



data have;
	input ID_I ID_F ID_M;
	cards;
11      21      31     
12      22      32    
13      23      33  
14      11      12      
15      11      13
16      11      12
17      14      12
18      14      15
19      14      16
;
run;

proc sql;
create table want as
 select *,case when id_i in (select id_f from have) then 2
               when id_i in (select id_m from have) then 1
               else 0 end as sex 
  from have;
quit;


PROC Star
Posts: 7,428

Re: Create a new variable from three other variables

[ Edited ]

@Ksharp: What hive? Alternative and potentially better solutions, in my opinion, are always welcome. And, in this case, using sql subquerries, is far more efficient than the data step/array solution I had suggested.

@Anni2017: I would change the accepted solution to @Ksharp's solution. While both work, the SQL approach is far more efficient and will work with much larger data sets.

 

Anyone who might be interested: The main reason sql out performed the data step approach was that the size declared for the two arrays has a big impact on performance.  That can be adjusted by first running a sql step to create macro variables containing the needed array sizes. I'd love to compare the three approaches on an actual large file, but did a pseudo comparison by just expanding the size of the example data.

 

In my example, the revised data step approach out performed the sql approach, and the sql approach out performed the hash approach:

data have;
	input ID_I ID_F ID_M;
	do i=1 to 500000;
	  output;
	end;
	cards;
11      21      31     
12      22      32    
13      23      33  
14      11      12      
15      11      13
16      11      12
17      14      12
18      14      15
19      14      16
;
run;

proc sql noprint;
  select count(distinct id_m),count(distinct id_f)
    into :m_count, :f_count
      from have
  ;
quit;

data want1 (drop=n:);
  array fathers(&f_count) _temporary_;
  array mothers(&m_count) _temporary_;
  do until (last);
    set have end=last;
    if ID_F not in fathers then do;
      nF+1;
      fathers(nF)=ID_F;
    end;
    if ID_M not in mothers then do;
      nM+1;
      mothers(nM)=ID_M;
    end;
  end;
  do until (last2);
    set have end=last2;
    if id_i in fathers then sex=2;
    else if id_i in mothers then sex=1;
    else sex=0;
    output;
  end;
run;

proc sql;
create table want2 as
 select *,case when id_i in (select id_f from have) then 2
               when id_i in (select id_m from have) then 1
               else 0 end as sex 
  from have;
quit;

data want3;
if _N_ = 1 then do;
     if 0 then set have;
     declare hash myhash(dataset:'have(keep=ID_F)',multidata:'YES');
      myhash.defineKey('ID_F');
      myhash.defineDone( );
     declare hash myhash2(dataset:'have(keep=ID_M)',multidata:'YES');
      myhash2.defineKey('ID_M');
      myhash2.defineDone( );
end;
set have;
if myhash.check(key:ID_I)=0 then Sex=2;
else if myhash2.check(key:ID_I)=0 then Sex=1;
else Sex=0;
run;

 

Art, CEO, AnalystFinder.com

 

PROC Star
Posts: 276

Re: Create a new variable from three other variables

Thank you @art297 , @Ksharp for the very valuable insights and extending the knowledge for somebody like me to learn. I can't appreciate enough, not just this post, many of yours. I have shamelessly used your codes as a verbatim to many of my work. Thank you so much indeed.

 

Regards,

Naveen Srinivasan

Super User
Posts: 9,854

Re: Create a new variable from three other variables

Arthur.T ,
We are not competitor , we are partner .


PROC Star
Posts: 7,428

Re: Create a new variable from three other variables

Update for anyone who might be interested: In the code, below, I tried to create a somewhat realistic looking fairly large (n=~100000 records) file. Using it, both the SQL and hash approaches were almost equivalent (about 0.1 seconds cpu time), while the datastep/array solution took almost 80 times longer to run (about 8 seconds) than either the SQL or hash approaches:

/* Create some sample data */
  /* SAS macro that duplicates the Excel RANDBETWEEN function */
  %macro RandBetween(min, max);
     (&min + floor((1+&max-&min)*rand("uniform")))
  %mend;

  data Mothers(drop=i rename=ID_F=ID_M) Fathers(drop=i);
    /* initialize random seed to ensure results are reproducible */
    call streaminit(123);
    do i = 1 to 100000;
      ID_F = %RandBetween(1, 50000);
      if mod(ID_F,2) then output Mothers;
      else output fathers;
    end;
  run;

  data mothers;
    set mothers;
    if _n_ le 49447;
  run;

  data have;
    call streaminit(123);
    set mothers mothers;
    set fathers fathers;
    array parents(*) ID_F ID_M;
    do until (ID_I not in parents);
      ID_I=%RandBetween(1, 50000);
    end;
  run;

proc sql noprint;
  select count(distinct id_m),count(distinct id_f)
    into :m_count, :f_count
      from have
  ;
quit;

data want1 (drop=n:);
  array fathers(&f_count) _temporary_;
  array mothers(&m_count) _temporary_;
  do until (last);
    set have end=last;
    if ID_F not in fathers then do;
      nF+1;
      fathers(nF)=ID_F;
    end;
    if ID_M not in mothers then do;
      nM+1;
      mothers(nM)=ID_M;
    end;
  end;
  do until (last2);
    set have end=last2;
    if id_i in fathers then sex=2;
    else if id_i in mothers then sex=1;
    else sex=0;
    output;
  end;
run;

proc sql;
create table want2 as
 select *,case when id_i in (select id_f from have) then 2
               when id_i in (select id_m from have) then 1
               else 0 end as sex 
  from have;
quit;

data want3;
  if _N_ = 1 then do;
     if 0 then set have;
     declare hash myhash(dataset:'have(keep=ID_F)',multidata:'YES');
      myhash.defineKey('ID_F');
      myhash.defineDone( );
     declare hash myhash2(dataset:'have(keep=ID_M)',multidata:'YES');
      myhash2.defineKey('ID_M');
      myhash2.defineDone( );
  end;
  set have;
  if myhash.check(key:ID_I)=0 then Sex=2;
  else if myhash2.check(key:ID_I)=0 then Sex=1;
  else Sex=0;
run;

Art, CEO, AnalystFinder.com

 

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 13 replies
  • 272 views
  • 1 like
  • 7 in conversation