Updating variables

Reply
Contributor
Posts: 62

Updating variables

Hello,

Example:

Data ONE:

ID Diag1 Diag2 Diag3 Overall

1   O1                           2

2   O2                           2

3   O1     O2                  4

4                                   0

Data Two:

ID Diag1 Diag2 Diag3 Overall

1                                 0

2   O2                           2

3   O1     O2       O3       6

4    O4                           2

I want to have just one data set with only those observations that has the most information, in example case, I want to have this:

ID Diag1 Diag2 Diag3 Overall

1    O1                          2

2   O2                           2

3   O1     O2       O3       6

4   O4                           2

Super User
Super User
Posts: 7,948

Re: Updating variables

Hi,

Well, guessing your rules:

proc sql;

     create table WANT as

     select     COALESCE(A.ID,B.ID) as ID,

                    COALESCE(A.DIAG1,B.DIAG1) as DIAG1,

                    COALESCE(A.DIAG2,B.DIAG2) as DIAG2,

                    COALESCE(A.DIAG3,B.DIAG3) as DIAG3,

                    case     when A.OVERALL > B.OVERALL then A.OVERALL

                                  else B.OVERALL as OVERALL

     from     DATA_ONE A

     full join DATA_TWO B

     on         A.ID=B.ID;

quit;

Contributor
Posts: 62

Re: Updating variables

My example is very very simple as my data is much larger and my table ONE has much more variables in it, and i want to update diag in ONE selecting from table TWO only those that has more info

Super User
Super User
Posts: 7,948

Re: Updating variables

Can you clarify what you mean by "more info", so from your example we can acertain that if one contains a missing and th other a non missing then use the non-missing.  What about Overall, is this the largest value from either, the sum???  For any other columns will the logic be the same, for instance if you have text fields (comments maybe) would they get concatenated. 

Contributor
Posts: 62

Re: Updating variables

More info means more variables and table TWO consist just ID and list of Diagnosis (20 variables). Variable overall was created in this way: added all diagnosis into one string and calculated the length of this new string (this is overall).

I want to find which observations (using ID) from table ONE have more information in diagnosis variables in table TWO and to update that information in table ONE.

Super User
Super User
Posts: 7,948

Re: Updating variables

Mmm, so:

proc sql;

     create table WANT as

     select     COALESCE(A.ID,B.ID) as ID,

                    COALESCE(A.DIAG1,B.DIAG1) as DIAG1,

                    COALESCE(A.DIAG2,B.DIAG2) as DIAG2,

                    COALESCE(A.DIAG3,B.DIAG3) as DIAG3,

                    case     when A.DIAGL > B.DIAGL then A.DIAGL

                                  else B.DIAGL end as OVERALL

     from     (select *,length(compress(DIAG1||DIAG2||DIAG3)) as DIAGL from ADATA_ONE) A

     full join (select *,length(compress(DIAG1||DIAG2||DIAG3)) as DIAGL from DATA_TWO) B

     on         A.ID=B.ID;

quit;

Super User
Super User
Posts: 7,948

Re: Updating variables

Oh, and you mention that you have multiple DIAGs, so maybe (slight change to the above SQL to allow for any number of variables) (untested):
data _null_;

     num_diags=3; /* update per your number */

     call execute('proc sql;

                               create table WANT as

                               select COALESCE(A.ID,B.ID) as ID');

     do I=1 to num_diags;

          call execute(',COALESCE(A.DIAG'||strip(put(I,best.))||',B.DIAG'||strip(put(I,best.))||') as DIAG'||strip(put(I,best.)));

     end;

     call execute('case when A.DIAGL > B.DIAGL then A.DIAGL else B.DIAGL as OVERALL');

     call execute(' from (select *,length(cats(of DIAG1-DIAG'||strip(put(num_diags,best.))||')) as DIAGL from DATA_ONE) A

                          full join (select *,length(cats(of DIAG1-DIAG'||strip(put(num_diags,best.))||')) as DIAGL from DATA_TWO) B

                         on A.ID=B.ID;

                         quit;');

run;

Respected Advisor
Posts: 3,156

Re: Updating variables

If you have tons of variables, you probably want to convert your '0's to missing first, then leverage Update statement. Below is an example using custom informat, but you could also use Array achieving the same goal.

proc format;

invalue $ cvert

'0'=' '

      other=[$8.];

run;

Data ONE;

input (ID Diag1 Diag2 Diag3 Overall) (:$cvert.);

cards;

1   O1  . .                          2

2   O2 . .                     2

3   O1 O2 .                  4

4              . . .                     0

;

Data Two;

input (ID Diag1 Diag2 Diag3 Overall) (:$cvert.);

cards;

1     . . .                            0

2   O2 . .                      2

3   O1 O2       O3       6

4    O4 . .                    2

;

data want;

update one two;

by id;

run;

Regards,

Haikuo

Super User
Super User
Posts: 7,948

Re: Updating variables

Nice idea Hai.kuo.  Would this however work with the assumed rules from the post.  Prinicipally the overall column as it looks like whichever is greater for first or second table should go into the output, maybe I misunderstand that?

Respected Advisor
Posts: 3,156

Re: Updating variables

I have assumed 0s would be treated as missing values. Agreed that OP's rule is too vague Smiley Happy

Super User
Posts: 10,023

Re: Updating variables

I don't think so. OP want the most info , I guess she need the maximum value . if 4 in two ,6 in one ,then HaiKuo will get wrong result. Dictionary table is a good choice to wrap them all into a macro variable and rename them all.

Data ONE;
input ID Diag1 $ Diag2 $ Diag3 $ Overall;
cards;
1   O1  . .                          2
2   O2 . .                     2
3   O1 O2 .                  6
4              . . .                     0
;
 
Data Two;
input ID Diag1 $ Diag2 $ Diag3 $ Overall;
cards;
1     . . .                            0
2   O2 . .                      2
3   O1 O2       O3       4
4    O4 . .                    2
;
run;
data want;
 merge one two(rename=(overall=_overall diag1-diag3=_diag1-_diag3));
 by id;
 array x{*} $ diag1-diag3 ;
 array _x{*} $ _diag1-_diag3 ;
 do i=1 to dim(x);
  x{i}=coalescec(x{i},_x{i});
 end;
 overall=max(overall,_overall);
 drop _: i;
run;


Xia Keshan

Respected Advisor
Posts: 3,156

Re: Updating variables

Well, I have an updated version while the different opinion remains. And you have a lots of "rename" to do when OP has 20+ diags. Smiley Wink

Haikuo

Hint: using Hash may save you the trouble of renaming that much.

Super User
Posts: 10,023

Re: Updating variables

Actually, if the variables like OP described ,that would be very easy to rename.

have ( rename=( d1-d20= _d1-_d20) ) ;

If there are too many different variable name, dictionary table is good helper I believe.

Ksharp

Trusted Advisor
Posts: 1,228

Re: Updating variables

data all;
merge one two (rename=(diag1=diag1_ diag2=diag2_ diag3=diag3_ overall=overall_));
by id;
run;

data want (keep=diag1 diag2 diag3 overall);
set all;
array d {3} diag1 diag2 diag3;
array d_ {3} diag1_ diag2_ diag3_;
do i=1 to dim(d);
if missing(d{i}) then d{i}=d_{i};
if overall<overall_ then overall=overall_;
end;
run;

Respected Advisor
Posts: 3,156

Re: Updating variables

Let me rephrase OP's rule, OP please correct me if I am wrong: The only thing need to check is the 'overall'. You want to keep the obs of which the 'overall' has greater value, indicating a longer length, then indicating more info in term of Diags.

If that is the case, then the task is a fairly straightforward SQL union:

Data ONE;

input (ID Diag1 Diag2 Diag3)(:$) Overall ;

cards;

1   O1  . .                          2

2   O2 . .                     2

3   O1 O2 .                  4

4              . . .                     0

;

Data Two;

input (ID Diag1 Diag2 Diag3)(:$) Overall ;

cards;

1     . . .                            0

2   O2 . .                      2

3   O1 O2       O3       6

4    O4 . .                    2

;

proc sql;

create table want as 

select one.* from one,two where one.id=two.id and one.overall ge two.overall

       union all corr

select two.* from one,two where one.id=two.id and one.overall lt two.overall

;

quit;

Same task can be done using Data step with or without Hash, or inefficiently, using Dosubl().

Good Luck,

Haikuo

Ask a Question
Discussion stats
  • 19 replies
  • 444 views
  • 0 likes
  • 7 in conversation