Solved: Re: How to prevent overwriting on a merge

Mscarboncopy · Posted 12-20-2018 05:24 PM

Hi

I have 2 data files that I am trying to merge by id.

Variables from one are not in the other and it is overwriting or populating the variables.

Example:

Data X;

Merge Y Z;

by ID;

Run;

Variables a b and c are only in file Y

d e and f are only in file Z.

when I merge them, variables a b and c repeats the values that were in file Y as if they were in file Z.

Why aren't they missing ? I thought they would be missing.

Is there a command that I can use to make sure they remain missing if NOT IN file Z ?

Thank you

Tom · Posted 12-21-2018 10:06 AM

If you want to merge then setup your data so that it can be merged.

data want;
  merge a(rename=(age_a=age)) b(rename=(age_b=age)) ;
  by id age;
run;

If you want more precise help then post a workable example of data that demonstrates your problem.

View solution in original post

Reeza · Posted 12-20-2018 05:27 PM

You may need to change your BY logic, if you aren't joining by ID. If the ID is the same, the same information is copied over, it is not left blank.

Mscarboncopy · Posted 12-20-2018 10:16 PM

Thank you. I changed to merge by id and age and it worked.

Now I have another issue:

I need to be able to merge the data when the age is the same.

I am not sure if I can explain but I will try.

When I merge by 2 variables if prevents the overwrite.

But if I have data for the same age for example I have measurements and diagnoses with age 10.

They are in 2 different lines when they should be in one line only.

so ID A B C D E F AGE

1 10 5 5 . . . 14

ID

1 . . . 1 1 1 14

How do I tell sas to make this entry only one entry ?

This happens in several instances in the data.

Thank you so much

Reeza · Posted 12-20-2018 10:19 PM

Post this as a new question and post your sample data as legible as possible. As shown I wouldn't be able to tell you anything.

It's likely a two step process, first consolidate your data and then merge.

Instructions on providing sample data is here, it can be fake data but needs to be legible and representative of your problem.

https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat...

ChrisNZ · Posted 12-20-2018 05:36 PM

This makes no sense.

data Y; 
  input ID A B C;
cards;
1 12 13 14
;
data Z; 
  input ID D E F;
cards;
1 22 23 24
1 32 33 34
2 1 2 3
;
data X;
  merge Y Z;
  by ID;
run;

ID	A	B	C	D	E	F
1	12	13	14	22	23	24
1	12	13	14	32	33	34
2	.	.	.	1	2	3

No value for ABC when ID=2, since that ID is not on table Y.

High-Performance SAS Coding - Third Edition

Mscarboncopy · Posted 12-20-2018 08:41 PM

In the example you gave (thank you) it is ID 1 that does not have ABC not ID2 so

so 12 13 14 on the second line of ID1 is the overwrite. because I am pulling 32 33 34 from another data file, and when I do so

ID 1 (line 2) gets 12 13 14 when it needed to be missing (like ID2).

ChrisNZ · Posted 12-20-2018 08:51 PM

It looks like you want

data X;
  merge Y Z;
  by ID;
  if ^first.ID then call missing(A,B,C);
run;

ID	A	B	C	D	E	F
1	12	13	14	22	23	24
1	.	.	.	32	33	34
2	.	.	.	1	2	3

The original join works as any join is supposed to though: If the key is present the data for that key is used.

Clearing the data when the key is present is NOT a standard join.

High-Performance SAS Coding - Third Edition

mkeintz · Posted 12-20-2018 07:40 PM

Show us, in a data step:

What Y and Z look like
What X should look like

Help us help you.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Mscarboncopy · Posted 12-20-2018 08:54 PM

Y has height Body Weight variables

Z has diagnoses (URI ASTHMA etc) and drugs variables

X needs to have all variables from Y and Z.

It is also longitudinal so one ID may have 5 entries in File Y and 10 in File Z.

another ID might have 2 entries in File Y and 5 entries in file Z. All different ages.

The variable that they both have in common is AGE and I found a way to prevent the ages from overwriting. So if I have one entry from Z and 10 from Y for ID 1- I have 11 entries with different ages for ID 1 but then all of the diagnosis that were 1 in 1 ENTRY are repeated as 1 in all other 10 entries even though it should have been 1 only in ONE ENTRY and 0 in all the other 10 (from Y). I thought they would be missing and then I could tell sas to do if dx =. then dx=0. for example.

But with all diagnoses = 1 I don't have a way to give a command to transform it to 0. It will transform others (that must remain 1 to 0) - it will overwrite my correct data in Z. Every ID is different: different ages and different amount of entries and diagnoses.

ChrisNZ · Posted 12-20-2018 09:16 PM

See @mkeintz's reply.

Provide BEFORE and AFTER sample data.

High-Performance SAS Coding - Third Edition

Reeza · Posted 12-20-2018 09:53 PM

@Mscarboncopy wrote:

Y has height Body Weight variables

Z has diagnoses (URI ASTHMA etc) and drugs variables

X needs to have all variables from Y and Z.

It is also longitudinal so one ID may have 5 entries in File Y and 10 in File Z.

another ID might have 2 entries in File Y and 5 entries in file Z. All different ages.

The variable that they both have in common is AGE and I found a way to prevent the ages from overwriting. So if I have one entry from Z and 10 from Y for ID 1- I have 11 entries with different ages for ID 1 but then all of the diagnosis that were 1 in 1 ENTRY are repeated as 1 in all other 10 entries even though it should have been 1 only in ONE ENTRY and 0 in all the other 10 (from Y). I thought they would be missing and then I could tell sas to do if dx =. then dx=0. for example.

But with all diagnoses = 1 I don't have a way to give a command to transform it to 0. It will transform others (that must remain 1 to 0) - it will overwrite my correct data in Z. Every ID is different: different ages and different amount of entries and diagnoses.

The purpler portion is what you expect, but your expectations are incorrect. If you want this, you need to be able to join by more than ID, ie ID + Event or another variable to uniquely identify each event. If you cannot, a merge links each record by the identified variable and repeats.

So show us a better example of what you want and we can maybe offer some suggestions to get there, but a MERGE/JOIN as shown will not get you there because your logic is more complicated.

Mscarboncopy · Posted 12-20-2018 10:19 PM

yes.

Thank you. I changed the merge by id and age and it worked.

it is not overwriting.

Now I have another issue:

I need to be able to merge the data when the age is the same.

I am not sure if I can explain but I will try.

When I merge by 2 variables if prevents the overwrite and this was possible because I was talking about it with you here 🙂

But if I have data for the same age for example say I have measurements (from the one file) and diagnoses (from the other) with same age 10.

They are in 2 different lines when they should be in one line only.

so ID A B C D E F AGE

1 10 5 5 . . . 14

1 . . . 1 1 1 14

How do I tell sas to make this entry only one entry for those cases only?

This happens in several instances in the data.

Thank you so much

Kurt_Bremser · Posted 12-21-2018 03:16 AM

PROVIDE.EXAMPLE.DATA.IN.A.DATA.STEP

and your code.

@ChrisNZ has given you a very beautiful example how to do it, and the macro linked to by @Reeza even automates it. Help us to help you.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Mscarboncopy · Posted 12-21-2018 08:32 AM

Data want;

Merge Y Z;

by id visit_type;

Run;

This gives me a data set with 13.000 observations. Multiple observations for each ID.

Different variables, different ages. Only a couple of ages match from one file to the other.

I need to make sure the ones that do match in age are only in one line instead of 2.

Variables in Y are not the same as in Z aside from id visit_type and age.

but age var has two different names (depending on the file) so it would not overwrite.

one is age_yr the other is visit_age.

I am trying to do a comparison to see which ones match but then what ?

So this way I can merge them but prevent the overwriting of the other variables... so what is not in Y has to be missing and what is not in Z has to be missing.

I don't have a data step for that, it is what I am trying to figure out.

I hope that helps.

Thanks

Mari

Kurt_Bremser · Posted 12-21-2018 10:05 AM

@Mscarboncopy wrote:

Data want;

Merge Y Z;

by id visit_type;

Run;

This gives me a data set with 13.000 observations. Multiple observations for each ID.

Different variables, different ages. Only a couple of ages match from one file to the other.

I need to make sure the ones that do match in age are only in one line instead of 2.

Variables in Y are not the same as in Z aside from id visit_type and age.

but age var has two different names (depending on the file) so it would not overwrite.

one is age_yr the other is visit_age.

I am trying to do a comparison to see which ones match but then what ?

So this way I can merge them but prevent the overwriting of the other variables... so what is not in Y has to be missing and what is not in Z has to be missing.

I don't have a data step for that, it is what I am trying to figure out.

I hope that helps.

Thanks

Mari

I do not see any usable example data, against which we could test code.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!