Solved: Re: create a new column based on another column's value

glz · Posted 10-24-2019 02:10 PM

Hello,

I have an example data in below, the column status (highlight with yellow) is what I want.

For each subject, once "Positive" occur in Test, then the status for this subject is "Positive";

If no Positive, there is "Negative" then the status for this subject is "Negative";

If no positive, there are "Unknown" and one other character string such as "<5" then the status is "<5";

if no positive, there are "Unknown" and many other character string, such as "<5" and "<10", then the status is "NA";

Can anyone help me with this problem?

Thanks.

mkeintz · Posted 10-24-2019 03:28 PM

Here are your rules as I understand them;

For each ID

If there is any test='Positive' then status='Positive'.
Otherwise if there is any test='Negative' then status='Negative'
Otherwise if all of the tests='Unknown' then status='Unknown'
Otherwise if there is a mix of 'Unknown' and a single test value, then status=that test value
Otherwise if there is more than one test value, the status='NA'

The program below has not been tested, because your picture of the data is an image, not text. I'm not going to manually type in your data.

data want (drop=any_:  n_unknown n_tests status_1);
  set have (in=first_pass)
      have (in=second_pass);
  by id;
  retain status;
  if 0 then status_1=status;  /* Edited addition */
  retain any_pos any_neg n_unknown n_tests status_1;
  if first.id then call missing(status,any_pos,any_neg,n_unknown,n_tests,status_1);
  if first_pass=1 then do;
    n_tests+1;
    if test='Positive' then any_pos=1; else
    if test='Negative' then any_net=1; else
    if test='Unknown'  then n_unknown+1; else
    do;
      if status_1=' ' then status_1=test;
      else if status_1^=test then status_1='NA';
    end;
    if any_pos=1         then status='Positive'; else
    if any_neg=1         then status='Negative'; else
    if n_unknown=n_tests then status='Unknown';  else
    status=status_1;
  end;
  if second_pass=1;
run;

This works by passing through each ID twice

The first pass reads each record once. It determines the presence of any Positive, or Negative tests, and it counts the number of 'Unknown' results, and maintains a STATUS_1 variable. STATUS_1 will get the first measured result (i.e. not Negative, Positive, or Unknown), and it will be compared to all subsequent measured results. If any of then are different then STATUS_1 will be set to 'NA'.

The desired STATUS variable is updated with each first pass record.

The second pass does nothing but inherit the value of STATUS determined in the first pass.

Edited change: I added the statement

if 0 then status_1=status;

so that STATUS_1 would be a character variable matching the length of STATUS.

The condition "if 0" is never true, so the actual value of STATUS_1 is not changed.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

mkeintz · Posted 10-24-2019 03:28 PM

Here are your rules as I understand them;

For each ID

If there is any test='Positive' then status='Positive'.
Otherwise if there is any test='Negative' then status='Negative'
Otherwise if all of the tests='Unknown' then status='Unknown'
Otherwise if there is a mix of 'Unknown' and a single test value, then status=that test value
Otherwise if there is more than one test value, the status='NA'

The program below has not been tested, because your picture of the data is an image, not text. I'm not going to manually type in your data.

data want (drop=any_:  n_unknown n_tests status_1);
  set have (in=first_pass)
      have (in=second_pass);
  by id;
  retain status;
  if 0 then status_1=status;  /* Edited addition */
  retain any_pos any_neg n_unknown n_tests status_1;
  if first.id then call missing(status,any_pos,any_neg,n_unknown,n_tests,status_1);
  if first_pass=1 then do;
    n_tests+1;
    if test='Positive' then any_pos=1; else
    if test='Negative' then any_net=1; else
    if test='Unknown'  then n_unknown+1; else
    do;
      if status_1=' ' then status_1=test;
      else if status_1^=test then status_1='NA';
    end;
    if any_pos=1         then status='Positive'; else
    if any_neg=1         then status='Negative'; else
    if n_unknown=n_tests then status='Unknown';  else
    status=status_1;
  end;
  if second_pass=1;
run;

This works by passing through each ID twice

The first pass reads each record once. It determines the presence of any Positive, or Negative tests, and it counts the number of 'Unknown' results, and maintains a STATUS_1 variable. STATUS_1 will get the first measured result (i.e. not Negative, Positive, or Unknown), and it will be compared to all subsequent measured results. If any of then are different then STATUS_1 will be set to 'NA'.

The desired STATUS variable is updated with each first pass record.

The second pass does nothing but inherit the value of STATUS determined in the first pass.

Edited change: I added the statement

if 0 then status_1=status;

so that STATUS_1 would be a character variable matching the length of STATUS.

The condition "if 0" is never true, so the actual value of STATUS_1 is not changed.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

glz · Posted 10-25-2019 10:51 AM

Hello mkeintz,

Thanks very much, your solution and idea is very valuable.

If we define status_1 and status as character value first, then it solved my question! otherwise it will cause error such as "invalid numeric data, "Unknown" ".

Thanks again.

hashman · Posted 10-24-2019 11:27 PM

@glz: As @mkeintz has mentioned, it would be nice of you if you provided a step for creating you sample input data, thus ridding those kind enough to help you from the need to reconcoct it themselves (this what the first step below does). Having said that, try this:

data have (keep = id test) ;                                                                                                            
  array v [5  ] $8 _temporary_ ("Positive","Negative","Unknown","<5","<10") ;                                                           
  array t [8,4]  8 _temporary_                                                                                                          
  (3 3 1 1  3 3 2 2  1 1 2 2  1 1 4 4  3 3 4 4  3 3 4 5  3 3 3 3  1 2 3 1) ;                                                            
  do i = 1 to 8 ;                                                                                                                       
    ID = "A" || put (i,z3.) ;                                                                                                           
    do j = 1 to 4 ;                                                                                                                     
      Test = v [ t[i,j]] ;                                                                                                              
      output ;                                                                                                                          
    end ;                                                                                                                               
  end ;                                                                                                                                 
run ;                                                                                                                                   
                                                                                                                                        
data want (drop = _:) ;                                                                                                                 
  do _n_ = 1 by 1 until (last.id) ;                                                                                                     
    set have ;                                                                                                                          
    by id ;                                                                                                                             
    array f[0:4] _f0-_f4 ;                                                                                                              
    f [find ("PNU5", char (compress (test, "<"), 1))] = 1 ;                                                                             
  end ;                                                                                                                                 
  do _n_ = 1 to _n_ ;                                                                                                                   
    set have ;                                                                                                                          
    if      _f1 then status = "Positive" ;                                                                                              
    else if _f2 then status = "Negative" ;                                                                                              
    else if _f3 then do ;                                                                                                               
      if      _f0 then status = "N/A" ;                                                                                                 
      else if _f4 then status = "<5" ;                                                                                                  
      else status = "Unknown" ;                                                                                                         
    end ;                                                                                                                               
    output ;                                                                                                                            
  end ;                                                                                                                                 
run ;

create a new column based on another column's value

Re: create a new column based on another column's value

Re: create a new column based on another column's value

Re: create a new column based on another column's value

Re: create a new column based on another column's value

Registration is open

SAS Training: Just a Click Away