Re: Proc Merge on Existing Datasets

finans_sas · Posted 01-12-2013 08:11 AM

Dear Community Members,

I have a quick question on proc merge. Suppose, I have the following datasets: state.a and state.b with a common id. When I apply data step merge, I generally create a new dataset, but recently I have been doing something like:

data state.a;

merge state.a(in=tag) state.b

by id;

if tag=1;

run;

The reason for not creating a new dataset was to reduce the number of sets in my sas library, but later I become concerned about such a procedure leading errors that I did not notice before. In small datasets, this procedure seems to work, but in I am not sure how it really performs in large datasets (in millions). I was wondering if you could offer some guidance on this issue. Another related question is: will the answer to my question change if these datasets were in the WORK library?

Thank you for your time and help in advance.

AncaTilea · Posted 01-12-2013 10:40 AM

Hi.

I will try to answer to some/all of your questions, but first there is no PROC Merge, rather a DATA step. MERGE is the statement.

When you run the code

data state.a;

merge state.a(in=tag) state.b

by id;

if tag=1;

run;

two things will happen: 1. you only keep the observations that are in A (I assume that is what you want) and 2. for any observation in both A & B, if your two data sets have same variables, then state.b will overwrite the values in state.a

For example, state.a has

ID Age Height

1 23 5.3

2 . 5.3

And state.b has

ID Age Height

1 23 5.3

2 34 5.3

then when you merge these two data sets, the final data set

state.a will have

ID Age Height

1 23 5.3

2 34 5.3

Which may or may not be right.

I would suggest renaming all of your common variables to make sure that you do get the results you want and not overwrite values.

Working in WORK library or not has no impact on your data step.

Also, I assume that you are merging 1-to-1 or 1-to-many, right?

Anyway, I hope this helps a bit?

Good luck and let me know if you have more questions.

Anca.

finans_sas · Posted 01-12-2013 11:12 AM

Thank you Anca tilea for your answer. Actually, these two datasets are composed of different variables except for the common id. Yes, I am merging one-to-many. My concern was that when I merged state.a and state.b and called the new dataset state.a again, there could be some errors in the matching process. In other words, in the presence of two different datasets with a common id (state.a and state.b), am I allowed to call their merger state.a if state.a is all what I care about? Or should I always create a new dataset as a result of this merge operation such as:

data state.c;

merge state.a(in=tag) state.b

by id;

if tag=1;

run;

Thank you for your help.

art297 · Posted 01-12-2013 11:38 AM

My 2 cents: The only problem that I'm aware of in reusing filenames in such cases is that you rewrite the original file. Thus, if the merge doesn't work as expected you no longer have the original file unless you had maintained a backup of it prior to running the code.

Tom · Posted 01-12-2013 12:00 PM

There is a potential problem in the type or merge you describe.

There is a problem for One to Many merge when the extra variables from the dataset with a single observation (your lookup table) already exist on the dataset with the many observations. In that case only the first observation will have the new variables corrected.

If your process will always re-create or start with a version of A that does not have the variables from B that you want to merge onto it then you could do this. For example if your process always was:

1) Create A from raw data or query.

2) Overwrite A with Merge with B so that new variables are populated.

3) Use A for its purpose (with out ever needing to merge those variables in from B again).

Astounding · Posted 01-12-2013 03:06 PM

I think the real question you are asking about is the effects of switching from this:

data state.c;

merge state.a state.b;

to this:

data state.a;

merge state.a state.b;

So if we assume that the first MERGE is working correctly, what difference does it make to switch to the second MERGE? Virtually none. There is no difference in speed or accuracy. (That doesn't mean the result is correct, it merely means that if the first result was correct then the second result will be correct as well.) And no difference if you are talking about WORK data sets instead of permanent ones. Art's point should be noted ... if you notice problems later you won't have the original state.a to utilize. And your first point is also correct: you will end up being able to re-use disk space without cleaning up via PROC DATASETS, because the original state.a will become reusable space once your second DATA step completes.

There is no impact on the resources needed to complete the DATA step. The only impact is on the resources available once the DATA step is complete.

Good luck.

Merge on Existing Datasets