Solved: Re: question about merge

comeon2012 · Posted 07-26-2012 10:58 PM

Hi,

I have several datasets, each containing the data for one variable. The file names and their numbers of observations are as follows:

var1(2900 obs), var2(2800 obs), var3(2500 obs).

Note that the IDs in var1 (2900 of them) may or may not cover the IDs in var2/var3.

What I want is to merge them into a single file, using a program like the one below:

data whole;
  merge var2
        var1
        var3
  ;
  by ID;
run;

My questions:

Is the length (number of rows) of whole determined by the length of the FIRST input dataset, by the LONGEST one, or by the number of distinct IDs across the three datasets?

Thanks.

Haikuo · Posted 07-26-2012 11:43 PM

Hi,

Assumption: ID is unique (no duplicates) within each of the 3 data sets.

Answer: 'the total number of non-duplicate IDs', i.e. one row per distinct ID across the three data sets.
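
A minimal sketch of this case, assuming three small made-up datasets (a, b and c are hypothetical names and values, not the poster's files) that are already sorted by id and have no duplicate ids within any one of them:

```sas
/* Hypothetical toy data: id is unique within each dataset */
data a;
  input id var1 @@;
  datalines;
1 10 2 20 3 30
;
data b;
  input id var2 @@;
  datalines;
2 200 3 300 4 400
;
data c;
  input id var3 @@;
  datalines;
3 3000 5 5000
;

/* Match-merge: one output row per distinct id (1, 2, 3, 4, 5) */
data whole;
  merge a b c;
  by id;
run;

/* Compare the row count with the number of distinct ids */
proc sql;
  select count(*) as n_rows, count(distinct id) as n_ids
    from whole;
quit;
```

With this toy data both counts should be 5, even though the longest input has only 3 rows. Variables from a dataset that does not contain a given id are left missing on that row.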

Haikuo

art297 · Posted 07-26-2012 11:05 PM

You didn't provide enough info. Do all three files only contain one variable, each, and do they all share the same variable name?

What are you trying to do? What do you hope to accomplish? At a minimum, answers to the first two questions are needed for anyone to answer your question without making a number of assumptions.

comeon2012 · Posted 07-26-2012 11:35 PM

Hi Arthur,

Thanks for your reply and questions.

File var1 includes variables ID and var1, and there are 2900 obs.

File var2 includes variables ID and var2, and there are 2800 obs.

File var3 includes variables ID and var3, and there are 2500 obs.

I want to merge them together into one file with variables ID, var1, var2, and var3.

How is the length of the output dataset determined?

art297 · Posted 07-27-2012 08:59 AM

I would clarify Haikuo's response a bit. No assumption is needed as long as there aren't duplicate ids within any of the three datasets (Note: this statement was clarified based on Linlin's subsequent post)! You will obtain one record for each unique ID across the 3 datasets.

One thing to be concerned about, however, is the definition of "unique". If the ID field has different lengths across the three files, IDs that appear to be the same may not be treated as the same value when the files are merged.
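
One way to check this before merging is to look up the type and length of ID in each file. This is only a sketch: it assumes the three datasets are named var1, var2 and var3 as in the question and that they live in the WORK library, which the thread does not state.

```sas
/* Assumption: the datasets are WORK.VAR1, WORK.VAR2 and WORK.VAR3 */
proc sql;
  select memname, name, type, length
    from dictionary.columns
    where libname = 'WORK'
      and memname in ('VAR1', 'VAR2', 'VAR3')
      and upcase(name) = 'ID';
quit;
```

If the three rows do not show the same type and length, it is probably worth making the ID definitions consistent before running the merge.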

Linlin · Posted 07-27-2012 09:20 AM

Hi Art,

Can I disagree with your statement "No assumption is needed!" ?

data have1;
  input id @@;
cards;
1 2 3 3 4 5
;

data have2;
  input id @@;
cards;
1 6 6
;

data have3;
  input id @@;
cards;
1 2 3
;

data want;
  merge have1 have2 have3;
  by id;
run;

title with dupkey;
proc print;run;

proc sort nodupkey;
  by id;
title without dupkey;
proc print;run;
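
To see the difference as numbers rather than in the printed listings, a check along these lines could be run right after the DATA step that creates want and before PROC SORT NODUPKEY overwrites it. This is a sketch, not part of the original post, and the column aliases are made up for illustration.

```sas
/* Run after "data want" and before "proc sort nodupkey" */
proc sql;
  select count(*)           as n_rows,
         count(distinct id) as n_distinct_ids
    from want;
quit;
```

For the data above this should report 8 rows but only 6 distinct ids. Within a BY group, a match-merge writes as many rows as the largest number of observations any one input has for that id, so id 3 (twice in have1) and id 6 (twice in have2) each contribute two rows.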

art297 · Posted 07-27-2012 10:08 AM

@Linlin: Since you are correct, of course you can disagree! The possibility of one-to-many or many-to-many relationships could easily mean that a merge within a DATA step couldn't even be used without extra coding.
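
A simple way to find out whether any input has repeated ids before merging is to list the ids that occur more than once. The sketch below uses Linlin's have1 as the example and would need to be repeated for each of the other inputs; the alias n_obs is made up for illustration.

```sas
/* List every id that appears more than once in have1 */
proc sql;
  select id, count(*) as n_obs
    from have1
    group by id
    having count(*) > 1;
quit;
```

For have1 this should return a single row, id 3 with n_obs 2. An empty result for every input means the ids are unique within each dataset, and the "one row per distinct ID" answer applies.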
