Re: Merge Statement

Nasya · Posted 08-22-2018 04:23 PM

Dataset a: Dataset b:

ID Recpos AE ID Rec AE_adj

001 1 fever,chills 001 1 rejection,fever,chills

001 2 artery rupture 002 1 rejection

002 0

ID-Subject ID-Character ID-Subject ID-Character

Recpos-Record-Numeric Rec-Record-Numeric

AE-AEterm-character AE_adj-AEterm-character

Output Expected:

ID Recpos Rec AE AE_adj

001 1 1 fever,chills rejection,fever,chills

001 2 artery rupture

002 0

002 1 rejection

Output that I got after running the below code:

ID Recpos AE AE_adj

001 1 fever,chills rejection,fever,chills

001 2 artery rupture

002 0

002 1 rejection

Can you please suggest what is going wrong with in the merging? Is it the renaming thing that is causing this issue?

Thank you,

data step_01_a;
  set a;
  keep ID Recpos AE;
run;

proc sort data = step_01_a;
by ID Recpos;
run;

data step_01_b;
  set b;
  keep ID Rec AE_adj;
run;

proc sort data=step_01_b out=step_01_b(rename=(Rec=Recpos));
by ID Recpos;
run;

data final;
   merge step_01_a(in=a) step_01_b(in=b);
   if a or b;
run;

ballardw · Posted 08-22-2018 04:56 PM

You post your example data as dataset a and dataset b, and not as usable data in the form of data step, and then do not use those names in the code. So there is some possibility of confusion right there.

Second, we do not have any actual surety of the contents of any of your data sets, names of variables or such.

You renamed REC to RECPOSE in the second sort so REC went away and unless there is a REC variable in the A set then you shifted the value to the RECPOSE column.

Did you actually intend to match the records by sequence or by the value of ID?

Nasya · Posted 08-22-2018 05:05 PM

Hi,

Sorting is done by both ID and Record which has different labels in dataset a and b. Dataset a has Record as Recpos and b as Rec and hence I have done the renaming.

Is this causing any issue of Record information being over written or something?

Is there a way we can merge two different dataset and still have all the values in the output, even the missing values?

Thank you.

ballardw · Posted 08-22-2018 05:29 PM

@Nasya wrote:

Hi,

Sorting is done by both ID and Record which has different labels in dataset a and b. Dataset a has Record as Recpos and b as Rec and hence I have done the renaming.

Is this causing any issue of Record information being over written or something?

Is there a way we can merge two different dataset and still have all the values in the output, even the missing values?

Thank you.

From your wanted and actual code the difference is that REC does not appear as a variable and the value appears in the wrong variable:

proc sort data=step_01_b 
     out=step_01_b(rename=(Rec=Recpos));
by ID Recpos;
run;

which is why the variable REC is not in the output. There is no variable. How can it "have a missing value"?

I think you need to expand on the logic of rules involving the role of ID RECPOS and REC in the merge process.

Your merge did not nave any BY statement, which was somewhat odd in that you sorted two data sets before merging, made a common variable (recpos). Without a BY statement MERGE will match the first record from one set with the first of the second, only using the order of the records in the two sets.

Without matching to ensure a common sort order there may be other issues that can arise. And since missing comes before 0 in sort orders you will have some difficulty getting the order you show with 0 before missing for recpos on id 002

Reeza · Posted 08-22-2018 05:12 PM

Yes, its a rename issue. Don't rename it if you don't want it merged.

Your MERGE is also missing a BY statement which you likely want.

See the coloured text below. If you want it merged by REC and ID then you will need a different approach than outlined here.

One thing that can help avoid these issues is if you comment your code. Before each step add a single comment that explains the logic, ie:

1. create a smaller data set to merge and test

2. Sort before a merge

3. join data from adverse events to the main trial data

data step_01_a;
	set a;
	keep ID Recpos AE;
run;

proc sort data = step_01_a;
	by ID Recpos;
run;

data step_01_b;
	set b;
	keep ID Rec AE_adj;
run;

proc sort data=step_01_b out=step_01_b (rename=(Rec=Recpos));
	by ID Recpos;
run;

data final;
	merge step_01_a(in=a) step_01_b(in=b);
    BY ID;
	if a or b;
run;

Nasya · Posted 08-22-2018 05:16 PM

Yes, merging is expected with both ID and REC(from a and b).

Can you please suggest/guide me so that I can try that approach?

Thank you

Reeza · Posted 08-22-2018 05:51 PM

This gives what you want, but I'd double check that logic, it seems a bit weird, specifically the REC field as indicated by @ballardw

*create sample data1;
data tableA;
infile cards dlm=',' truncover dsd;
informat ID $4. RecPos 8. ae $50.;
input ID  $    Recpos    AE  $;
cards;
001, 1, "fever,chills" 
001, 2, "artery rupture"  
002, 0,
;;;;
run; 

*create sample data2;
data tableB;
infile cards dlm=',' truncover dsd;
informat ID $4. Rec 8. ae_adj $50.;
input ID  $    Rec    AE_adj  $;
cards;
001, 1, "rejection, fever,chills" 
002, 1,   "rejection"
;;;;
run; 


*create a new join ID for table A;
data tableA2;
set tableA;
RecJoin=RecPOs;
run;

*create a new join ID for table B;
data tableB2;
set tableB;
RecJoin=Rec;
run;

*sort to merge;
proc sort data=tableA2;
by ID recJoin;
run;

proc sort data=TableB2;
by ID RecJoin;
run;

*merge;
data want;
merge tableA2 (in= a) 
      tableB2 (in= b) ;
by ID recJoin;
run;

PS I really really hate typing out data, so if you could provide data as a sample, similar to what I did at the top of my code in future questions that's helpful.

Merge Statement