BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Anni2017
Calcite | Level 5

Hej SAS Experts, I hope someone can help me on this problem... I use SAS 9.4 

 

I want to create a new variable from other variables in an existing data-set.

In this data-set I have three IDs, the ID of an individual, and its parents (father, mother). The problem is, the data-set gives no information about the sex of my individual, but the individual appears again e.g. as a father. I need to check the sex of my individuals...

 

If I have an example data-set like this:

ID_I   ID_F   ID_M 

11      21      31     

12      22      32    

13      23      33  

14      11      12      

15      11      13

16      11      12

17      14      12

18      14      15

19      14      16

 

 What I want to achieve with a code is an additional variable (1 for Mother, 2 for Father and 0 for unknown): 

ID_I    ID_F    ID_M    Sex

11        21        31         2

12        22        32        1

13        23        33        1

14        11        12         2

15        11        13        1

16        11        12         1

17        14        12        0

18        14        15        0

19        14        16       0

 

Does SAS have any function which can do this? 


My idea was to do it with the IN() Function but then I would need to have arrays not variables? Anyway this was my failed try:

Data want;

Set Data Have;

 

Sex= ID_I in ID_F;

put Sex=2;

 

Sex=ID_I in ID_D;

put Sex=1;

RUN;

 

I have more than 10.000 observations for every variable, so I don´t know if it is possible to create arrays from my variables? Is there any other smart solution to this?

 

I would appreciate your help a lot, I am still new to SAS and this is just too complicated for me.

Thank you in advance Anni

1 ACCEPTED SOLUTION

Accepted Solutions
art297
Opal | Level 21

10,000 is really small for a SAS array. You can use them for many millions.

 

If I correctly understand what you're trying to do, try the following:

data want (drop=n);
array fathers(100000) _temporary_;
array mothers(100000) _temporary_;
do until (last);
set have end=last;
n+1;
fathers(n)=ID_F;
mothers(n)=ID_M;
end;
do until (last2);
set have end=last2;
if id_i in fathers then sex=2;
else if id_i in mothers then sex=1;
else sex=0;
output;
end;
run;

Art, CEO, AnalystFinder.com

 

View solution in original post

13 REPLIES 13
art297
Opal | Level 21

10,000 is really small for a SAS array. You can use them for many millions.

 

If I correctly understand what you're trying to do, try the following:

data want (drop=n);
array fathers(100000) _temporary_;
array mothers(100000) _temporary_;
do until (last);
set have end=last;
n+1;
fathers(n)=ID_F;
mothers(n)=ID_M;
end;
do until (last2);
set have end=last2;
if id_i in fathers then sex=2;
else if id_i in mothers then sex=1;
else sex=0;
output;
end;
run;

Art, CEO, AnalystFinder.com

 

mkeintz
PROC Star

@art297

 

The "do until (last);" won't increment _N_, so you need to change the code inside the loop.

 

Mark

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Anni2017
Calcite | Level 5

Thank you Mark the code first gave the wrong result, but now it works perfectly!!

Anni

Anni2017
Calcite | Level 5

Dear Art, yes that is exactly what I wanted to do.

The results seem to be correct. Thank you very much for your help!!

Kindest regards Anni

Reeza
Super User

This is a bit of a longer approach, but more introductory level solution that you should be able to modify as necessary.

 

1. Create sample data - not required in your actual code. Use the data you already have.

2. Create master list of gender data from original data by appending the two datasets and assigning F/M as indicated.

3. Ensure list above has one record per ID and gender - see comments in code below.

4. Merge results back in and keep only records from the original file. 

 

*input sample data;

data have;
	input ID_I ID_F ID_M;
	cards;
11      21      31     
12      22      32    
13      23      33  
14      11      12      
15      11      13
16      11      12
17      14      12
18      14      15
19      14      16
;
run;

*Get gender from file originally - should make sure this is unique
one record per ID;

data gender;
	set have (rename=ID_F=ID in=Males) have (rename=ID_M=ID in=Females);

	if Males then
		sex=2;

	if Females then
		sex=1;
	keep ID SEX;
run;

*Sort for merging;

proc sort data=gender nodupkey;
	by id sex;
run;

*CHECK FOR duplicates - if the log shows duplicates you have a problem with your data
mainly ID identified as both male and female; 
proc sort data=gender out=checkme nodupkey;
by id;
run;

proc sort data=have;
	by ID_I;
run;


*merge data;

data want;
	merge have (in=original) gender (rename=ID=ID_I);
	by ID_I;

	if original;
run;
Shmuel
Garnet | Level 18

Another solution similar logic to @Reeza's, but with a format:

 

/* step 1 - create a format */

data ID_sex(rename=(sex=label));

  set have;

        retain fmtname 'sexf';

        start = id_f; sex='2'; output;

        start = id_m; sex='1'; output;

       keep fmtname start sex;

run;

proc sort data=id_sex out=ID_sex nodupkey; by start; run;

proc format cntlin=ID_sex; run;

 

data want;

  set have;

        length sex $1;

        sex = put(id_i,sexf.);

run;

 

 

       

  

art297
Opal | Level 21

I revised my code based on @mkeintz's comment. He was totally correct and here is the new code:

data have;
  input id_I ID_F ID_M;
  cards;
11      21      31     
12      22      32    
13      23      33  
14      11      12      
15      11      13
16      11      12
17      14      12
18      14      15
19      14      16
;

data want (drop=n);
  array fathers(100000) _temporary_;
  array mothers(100000) _temporary_;
  do until (last);
    set have end=last;
    n+1;
    fathers(n)=ID_F;
    mothers(n)=ID_M;
  end;
  do until (last2);
    set have end=last2;
    if id_i in fathers then sex=2;
    else if id_i in mothers then sex=1;
    else sex=0;
    output;
  end;
run;
novinosrin
Tourmaline | Level 20

Another easy solution:

 

data have;

     input ID_I ID_F ID_M;

     cards;

11     21      31    

12     22      32   

13     23      33 

14     11      12     

15     11      13

16     11      12

17     14      12

18     14      15

19     14      16

;

run;

 

 

 

data want;

if _N_ = 1 then do;

     if 0 then set have;

     declare hash myhash(dataset:'have(keep=ID_F)',multidata:'YES');

      myhash.defineKey('ID_F');

      myhash.defineDone( );

     declare hash myhash2(dataset:'have(keep=ID_M)',multidata:'YES');

      myhash2.defineKey('ID_M');

      myhash2.defineDone( );

end;

set have;

if myhash.check(key:ID_I)=0 then Sex=2;

else if myhash2.check(key:ID_I)=0 then Sex=1;

else Sex=0;

run;

 

Regards,

Naveen Srinivasan

Ksharp
Super User
I should not jump into this hive.
Since I have some time, I would like to introduce SQL solution.



data have;
	input ID_I ID_F ID_M;
	cards;
11      21      31     
12      22      32    
13      23      33  
14      11      12      
15      11      13
16      11      12
17      14      12
18      14      15
19      14      16
;
run;

proc sql;
create table want as
 select *,case when id_i in (select id_f from have) then 2
               when id_i in (select id_m from have) then 1
               else 0 end as sex 
  from have;
quit;


art297
Opal | Level 21

@Ksharp: What hive? Alternative and potentially better solutions, in my opinion, are always welcome. And, in this case, using sql subquerries, is far more efficient than the data step/array solution I had suggested.

@Anni2017: I would change the accepted solution to @Ksharp's solution. While both work, the SQL approach is far more efficient and will work with much larger data sets.

 

Anyone who might be interested: The main reason sql out performed the data step approach was that the size declared for the two arrays has a big impact on performance.  That can be adjusted by first running a sql step to create macro variables containing the needed array sizes. I'd love to compare the three approaches on an actual large file, but did a pseudo comparison by just expanding the size of the example data.

 

In my example, the revised data step approach out performed the sql approach, and the sql approach out performed the hash approach:

data have;
	input ID_I ID_F ID_M;
	do i=1 to 500000;
	  output;
	end;
	cards;
11      21      31     
12      22      32    
13      23      33  
14      11      12      
15      11      13
16      11      12
17      14      12
18      14      15
19      14      16
;
run;

proc sql noprint;
  select count(distinct id_m),count(distinct id_f)
    into :m_count, :f_count
      from have
  ;
quit;

data want1 (drop=n:);
  array fathers(&f_count) _temporary_;
  array mothers(&m_count) _temporary_;
  do until (last);
    set have end=last;
    if ID_F not in fathers then do;
      nF+1;
      fathers(nF)=ID_F;
    end;
    if ID_M not in mothers then do;
      nM+1;
      mothers(nM)=ID_M;
    end;
  end;
  do until (last2);
    set have end=last2;
    if id_i in fathers then sex=2;
    else if id_i in mothers then sex=1;
    else sex=0;
    output;
  end;
run;

proc sql;
create table want2 as
 select *,case when id_i in (select id_f from have) then 2
               when id_i in (select id_m from have) then 1
               else 0 end as sex 
  from have;
quit;

data want3;
if _N_ = 1 then do;
     if 0 then set have;
     declare hash myhash(dataset:'have(keep=ID_F)',multidata:'YES');
      myhash.defineKey('ID_F');
      myhash.defineDone( );
     declare hash myhash2(dataset:'have(keep=ID_M)',multidata:'YES');
      myhash2.defineKey('ID_M');
      myhash2.defineDone( );
end;
set have;
if myhash.check(key:ID_I)=0 then Sex=2;
else if myhash2.check(key:ID_I)=0 then Sex=1;
else Sex=0;
run;

 

Art, CEO, AnalystFinder.com

 

novinosrin
Tourmaline | Level 20

Thank you @art297 , @Ksharp for the very valuable insights and extending the knowledge for somebody like me to learn. I can't appreciate enough, not just this post, many of yours. I have shamelessly used your codes as a verbatim to many of my work. Thank you so much indeed.

 

Regards,

Naveen Srinivasan

Ksharp
Super User
Arthur.T ,
We are not competitor , we are partner .


art297
Opal | Level 21

Update for anyone who might be interested: In the code, below, I tried to create a somewhat realistic looking fairly large (n=~100000 records) file. Using it, both the SQL and hash approaches were almost equivalent (about 0.1 seconds cpu time), while the datastep/array solution took almost 80 times longer to run (about 8 seconds) than either the SQL or hash approaches:

/* Create some sample data */
  /* SAS macro that duplicates the Excel RANDBETWEEN function */
  %macro RandBetween(min, max);
     (&min + floor((1+&max-&min)*rand("uniform")))
  %mend;

  data Mothers(drop=i rename=ID_F=ID_M) Fathers(drop=i);
    /* initialize random seed to ensure results are reproducible */
    call streaminit(123);
    do i = 1 to 100000;
      ID_F = %RandBetween(1, 50000);
      if mod(ID_F,2) then output Mothers;
      else output fathers;
    end;
  run;

  data mothers;
    set mothers;
    if _n_ le 49447;
  run;

  data have;
    call streaminit(123);
    set mothers mothers;
    set fathers fathers;
    array parents(*) ID_F ID_M;
    do until (ID_I not in parents);
      ID_I=%RandBetween(1, 50000);
    end;
  run;

proc sql noprint;
  select count(distinct id_m),count(distinct id_f)
    into :m_count, :f_count
      from have
  ;
quit;

data want1 (drop=n:);
  array fathers(&f_count) _temporary_;
  array mothers(&m_count) _temporary_;
  do until (last);
    set have end=last;
    if ID_F not in fathers then do;
      nF+1;
      fathers(nF)=ID_F;
    end;
    if ID_M not in mothers then do;
      nM+1;
      mothers(nM)=ID_M;
    end;
  end;
  do until (last2);
    set have end=last2;
    if id_i in fathers then sex=2;
    else if id_i in mothers then sex=1;
    else sex=0;
    output;
  end;
run;

proc sql;
create table want2 as
 select *,case when id_i in (select id_f from have) then 2
               when id_i in (select id_m from have) then 1
               else 0 end as sex 
  from have;
quit;

data want3;
  if _N_ = 1 then do;
     if 0 then set have;
     declare hash myhash(dataset:'have(keep=ID_F)',multidata:'YES');
      myhash.defineKey('ID_F');
      myhash.defineDone( );
     declare hash myhash2(dataset:'have(keep=ID_M)',multidata:'YES');
      myhash2.defineKey('ID_M');
      myhash2.defineDone( );
  end;
  set have;
  if myhash.check(key:ID_I)=0 then Sex=2;
  else if myhash2.check(key:ID_I)=0 then Sex=1;
  else Sex=0;
run;

Art, CEO, AnalystFinder.com

 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 13 replies
  • 2961 views
  • 1 like
  • 7 in conversation