turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- SAS Programming
- /
- SAS Procedures
- /
- Tough one: Merge 2 datasets, many to many Match Me...

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-27-2011 12:40 AM

Hello,

I know I can do a one-to-many match merge, but can I do a many to many match merge?

Essentially, i have 3 variables to match by, but here's where things get tricky: some of the observations from one set will not have corresponding matches in the second, while other observations will have multiple matches to the second set. The following three sets represent what i'm referring to, with the first two the sets to merge and the last what i want the final set to look like. I think i can do this with proc sql, but am not sure how.

With this first set, the first 2 variables don't appear in the second set. Then come the 3 variables to match by.

The next set starts with the 3 variables to match by, then the variables to add to the final set.

The final set shows how the observations in the first set, where PRP='Out' have no values from fbp and Percent from the second. The other observations, where PRP='Hit' then duplicate themselves to account for the 3 situations which occur when PRP is a 'Hit'.

I know this is a touch confusing, so please feel free to ask questions.

[pre]

data set1;

input Hit Z zn BSi$ PRP$;

datalines;

5 100 23 L Out

5 100 23 L Hit

5 100 23 L Out

5 100 23 L Hit

5 100 23 L Out

14 87.5 23 R Out

14 87.5 23 R Hit

14 87.5 23 R Hit

14 87.5 23 R Out

run;

data set2;

input zn BPT BBV DP$ BSi$ PRP$ fbp Percent;

datalines;

23 L Hit 3 .8

23 L Hit 4 .1

23 L Hit 9 .1

23 R Hit 3 .6

23 R Hit 4 .3

23 R Hit 9 .1

run;

data set3;

input Hit Z zn BSi$ PRP$ fbp Percent;

datalines;

5 100 23 L Out . .

5 100 23 L Hit 3 .8

5 100 23 L Hit 4 .1

5 100 23 L Hit 9 .1

5 100 23 L Out . .

5 100 23 L Hit 3 .8

5 100 23 L Hit 4 .1

5 100 23 L Hit 9 .1

5 100 23 L Out . .

14 87.5 23 R Out . .

14 87.5 23 R Hit 3 .6

14 87.5 23 R Hit 4 .3

14 87.5 23 R Hit 9 .1

14 87.5 23 R Hit 3 .6

14 87.5 23 R Hit 4 .3

14 87.5 23 R Hit 9 .1

14 87.5 23 R Out . .

run;

[/pre]

Thanks,

C

I know I can do a one-to-many match merge, but can I do a many to many match merge?

Essentially, i have 3 variables to match by, but here's where things get tricky: some of the observations from one set will not have corresponding matches in the second, while other observations will have multiple matches to the second set. The following three sets represent what i'm referring to, with the first two the sets to merge and the last what i want the final set to look like. I think i can do this with proc sql, but am not sure how.

With this first set, the first 2 variables don't appear in the second set. Then come the 3 variables to match by.

The next set starts with the 3 variables to match by, then the variables to add to the final set.

The final set shows how the observations in the first set, where PRP='Out' have no values from fbp and Percent from the second. The other observations, where PRP='Hit' then duplicate themselves to account for the 3 situations which occur when PRP is a 'Hit'.

I know this is a touch confusing, so please feel free to ask questions.

[pre]

data set1;

input Hit Z zn BSi$ PRP$;

datalines;

5 100 23 L Out

5 100 23 L Hit

5 100 23 L Out

5 100 23 L Hit

5 100 23 L Out

14 87.5 23 R Out

14 87.5 23 R Hit

14 87.5 23 R Hit

14 87.5 23 R Out

run;

data set2;

input zn BPT BBV DP$ BSi$ PRP$ fbp Percent;

datalines;

23 L Hit 3 .8

23 L Hit 4 .1

23 L Hit 9 .1

23 R Hit 3 .6

23 R Hit 4 .3

23 R Hit 9 .1

run;

data set3;

input Hit Z zn BSi$ PRP$ fbp Percent;

datalines;

5 100 23 L Out . .

5 100 23 L Hit 3 .8

5 100 23 L Hit 4 .1

5 100 23 L Hit 9 .1

5 100 23 L Out . .

5 100 23 L Hit 3 .8

5 100 23 L Hit 4 .1

5 100 23 L Hit 9 .1

5 100 23 L Out . .

14 87.5 23 R Out . .

14 87.5 23 R Hit 3 .6

14 87.5 23 R Hit 4 .3

14 87.5 23 R Hit 9 .1

14 87.5 23 R Hit 3 .6

14 87.5 23 R Hit 4 .3

14 87.5 23 R Hit 9 .1

14 87.5 23 R Out . .

run;

[/pre]

Thanks,

C

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to CharlesR

04-27-2011 07:06 AM

I am curious that in dataset set2 there is only 5 columns data but your dataset set2 have 8 variables?

OK.

The biggest problem you faced is the order of data you want keep,So it have to force me to

use data step to make a Cartesian Product.

And I also think sql can not do it ,because sql will sort it automatically.

To see the following result is whether you want.

[pre]

data set1;

infile datalines expandtabs dlm=' ';

input Hit Z zn BSi$ PRP$;

datalines;

5 100 23 L Out

5 100 23 L Hit

5 100 23 L Out

5 100 23 L Hit

5 100 23 L Out

14 87.5 23 R Out

14 87.5 23 R Hit

14 87.5 23 R Hit

14 87.5 23 R Out

;

run;

data set2;

infile datalines expandtabs dlm=' ';

input _zn _BSi$ _PRP$ fbp Percent;

datalines;

23 L Hit 3 .8

23 L Hit 4 .1

23 L Hit 9 .1

23 R Hit 3 .6

23 R Hit 4 .3

23 R Hit 9 .1

;

run;

data want(drop=_: found);

set set1;

do i=1 to nobs;

set set2 point=i nobs=nobs;

if zn=_zn and bsi=_bsi and prp=_prp then do; output; found=1; end;

if i=nobs and not found then do;

call missing(fbp,percent);

output;

end;

end;

run;

[/pre]

I have to leave now, It is the biggest challenge I face in recent days.

Ksharp

OK.

The biggest problem you faced is the order of data you want keep,So it have to force me to

use data step to make a Cartesian Product.

And I also think sql can not do it ,because sql will sort it automatically.

To see the following result is whether you want.

[pre]

data set1;

infile datalines expandtabs dlm=' ';

input Hit Z zn BSi$ PRP$;

datalines;

5 100 23 L Out

5 100 23 L Hit

5 100 23 L Out

5 100 23 L Hit

5 100 23 L Out

14 87.5 23 R Out

14 87.5 23 R Hit

14 87.5 23 R Hit

14 87.5 23 R Out

;

run;

data set2;

infile datalines expandtabs dlm=' ';

input _zn _BSi$ _PRP$ fbp Percent;

datalines;

23 L Hit 3 .8

23 L Hit 4 .1

23 L Hit 9 .1

23 R Hit 3 .6

23 R Hit 4 .3

23 R Hit 9 .1

;

run;

data want(drop=_: found);

set set1;

do i=1 to nobs;

set set2 point=i nobs=nobs;

if zn=_zn and bsi=_bsi and prp=_prp then do; output; found=1; end;

if i=nobs and not found then do;

call missing(fbp,percent);

output;

end;

end;

run;

[/pre]

I have to leave now, It is the biggest challenge I face in recent days.

Ksharp

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to CharlesR

04-27-2011 11:55 AM

As KSharp pointed out, the biggest problem with SQL is maintaining the sort order. If this is important, and SET2 is small enough to load into memory, consider using a multidata hash object solution.

[pre]

data set1;

input Hit Z zn BSi$ PRP$;

datalines;

5 100 23 L Out

5 100 23 L Hit

5 100 23 L Out

5 100 23 L Hit

5 100 23 L Out

14 87.5 23 R Out

14 87.5 23 R Hit

14 87.5 23 R Hit

14 87.5 23 R Out

run;

data set2;

input zn BSi$ PRP$ fbp Percent;

datalines;

23 L Hit 3 .8

23 L Hit 4 .1

23 L Hit 9 .1

23 R Hit 3 .6

23 R Hit 4 .3

23 R Hit 9 .1

run;

data set3(drop=rc next);

set set1;

if _n_=1 then do;

if 0 then set set2;

declare hash set2(dataset:'set2', multidata:'YES');

set2.definekey('zn', 'BSi', 'PRP');

set2.definedata('fbp', 'Percent');

set2.definedone();

end;

rc=set2.find();

if rc ne 0 then do;

call missing(fbp, Percent);

output;

end;

else do;

output;

set2.has_next(result:next);

do while (next ne 0 );

set2.find_next();

output;

set2.has_next(result:next);

end;

end;

run;

[/pre]

[pre]

data set1;

input Hit Z zn BSi$ PRP$;

datalines;

5 100 23 L Out

5 100 23 L Hit

5 100 23 L Out

5 100 23 L Hit

5 100 23 L Out

14 87.5 23 R Out

14 87.5 23 R Hit

14 87.5 23 R Hit

14 87.5 23 R Out

run;

data set2;

input zn BSi$ PRP$ fbp Percent;

datalines;

23 L Hit 3 .8

23 L Hit 4 .1

23 L Hit 9 .1

23 R Hit 3 .6

23 R Hit 4 .3

23 R Hit 9 .1

run;

data set3(drop=rc next);

set set1;

if _n_=1 then do;

if 0 then set set2;

declare hash set2(dataset:'set2', multidata:'YES');

set2.definekey('zn', 'BSi', 'PRP');

set2.definedata('fbp', 'Percent');

set2.definedone();

end;

rc=set2.find();

if rc ne 0 then do;

call missing(fbp, Percent);

output;

end;

else do;

output;

set2.has_next(result:next);

do while (next ne 0 );

set2.find_next();

output;

set2.has_next(result:next);

end;

end;

run;

[/pre]