Solved: Merging based on more conditions

rmrsr · Posted 08-19-2012 06:49 AM

Hi,

I am working on a dataset that joins a series of questionnaires to a database. I would therefore like to know how I could join the datasets on a common variable while preserving the initial data in the database, e.g. merging by a common variable while keeping all observations related to another variable that is not shared between them.

Obs  A       B     C          Obs  X     Z
1    Yellow  Bear  Why        1    Bear  100
2    Yellow  Bear  When       2    Bear  250
3    Yellow  Bear  Where      3    Wolf  50
4    Blue    Wolf  Why        4    Lion  1000
5    Blue    Wolf  When       5    Lion  5000
6    Blue    Wolf  Where
7    Green   Lion  Where
8    Green   Lion  How many

The common variable here is obviously B and X, which I have merged using the normal merge statement (data-merge-by-run). However, I now want to join these two datasets so that the rows of dataset ABC are repeated for each value of Z, i.e.

1 Yellow Bear Why 100

2 Yellow Bear When 100

3 Yellow Bear Where 100

4 Yellow Bear Why 250

5 Yellow Bear When 250

6 Yellow Bear Where 250

7 Blue Wolf Why 50

8 Blue Wolf When 50

9 Blue Wolf Where 50

10 Green Lion Where 1000

11 Green Lion How many 1000

12 Green Lion Where 5000

13 Green Lion How many 5000

I hope my description was not too confusing, hoping for a quick reply.

-R


art297 · Posted 08-19-2012 09:38 AM

You could do it by using proc sql. e.g.,

proc sql;

create table want as

select one.*,two.Z

from one,two

having one.B=two.X

order by B,Z

;

quit;

rmrsr · Posted 08-19-2012 11:38 AM

Hi Arthur and thanks for your reply.

This was insightful; however, I have some further questions, as this did not work out for me:
- The problem I have now is that not all observations make it into the new dataset (50% are lost!). I would like to point out that my data file has many more variables than specified above (500+, and about 12.5 million observations), which may increase the possibilities for mess-ups, especially since I am new to SAS.

I see you are using a "having" clause. How does this compare to the normal "where" clause?

art297 · Posted 08-19-2012 01:46 PM

In your test case where and having would have the same effect. However, there are differences between the two. Take a look at:

http://support.sas.com/documentation/onlinedoc/91pdf/sasdoc_91/base_sqlproc_6992.pdf

If you go to the index (at the end), and click on having, it will bring up the page that describes the basic differences between the two.
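As a small illustration of that difference (a sketch on a hypothetical toy table named prices, not taken from the linked manual): WHERE filters individual rows before any grouping takes place, while HAVING filters after the summary values have been computed. The threshold of 150 is an arbitrary assumption chosen so that the two queries disagree.

```sas
/* Hypothetical toy table, reusing the X/Z values from the first post */
data prices;
  input X $ Z;
  datalines;
Bear 100
Bear 250
Wolf 50
Lion 1000
Lion 5000
;
run;

proc sql;
  /* WHERE drops single rows first: Bear 100 and Wolf 50 never reach
     the average, so Bear's mean is computed from 250 alone */
  select X, avg(Z) as mean_z
    from prices
    where Z > 150
    group by X;

  /* HAVING sees the finished group averages: Bear's mean uses both
     100 and 250 (175), and only the Wolf group (mean 50) is dropped */
  select X, avg(Z) as mean_z
    from prices
    group by X
    having avg(Z) > 150;
quit;
```

When no summary function or GROUP BY is involved, as in the join above, both clauses end up keeping the same rows, which is why they behave alike in the test case.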

Haikuo · Posted 08-19-2012 11:21 AM

Of course Art's SQL approach is intuitive, direct and efficient. This is just to show that, with the help of hash(), the data step can mimic the Cartesian products that SQL has traditionally had the edge on:

data h1;

input A :$8. B :$8. C & $8.; /* the & modifier lets C keep the embedded blank in "How many" */

cards;

Yellow Bear Why

Yellow Bear When

Yellow Bear Where

Blue Wolf Why

Blue Wolf When

Blue Wolf Where

Green Lion Where

Green Lion How many

;

data h2;

input X$ Z;

cards;

Bear 100

Bear 250

Wolf 50

Lion 1000

Lion 5000

;

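data want;

if _n_=1 then do;

if 0 then set h1 ; /* make the variables of h1 known to this step without reading any rows */

dcl hash h1(dataset:'h1',multidata:'y'); /* multidata keeps every row that shares a key */

h1.definekey('b');

h1.definedata(all:'y');

h1.definedone();

end;

set h2;

rc=h1.find(key:x); /* first h1 row whose b equals x */

do while (rc=0); /* loop only when a match exists, so unmatched x values output nothing */

output;

rc=h1.find_next(key:x); /* next h1 row with the same key */

end;

drop x rc;

run;

proc print;run;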

Haikuo

Tom · Posted 08-19-2012 11:54 AM

Can you explain more what you are doing?

Which of the datasets is the answers to the questions? What is the other dataset? Who is answering the questions and how do you uniquely identify individual respondents?

Why is it that you do not want all of the values of Z? How do you pick which value to use? For example, when the answer is "Bear", when would you use 100 and when would you use 250 for the value of "Z"? Would you like an average of 100 and 250? A sum?

rmrsr · Posted 08-19-2012 12:26 PM

Hi Tom,

I do want to keep all of Z. This is the essence of my problem. In the example from the first post I have two Bear observations, each a price for a bear hide if you want. What I want is the questionnaire repeated for each individual animal observation (e.g. for each bear hide price), and as the hide price is not a common variable I am finding this hard to achieve.

I am sorry if my explanation of the problem was bad. I'll use another example, more closely related to my problem below:

I have 2 datasets, one (B) with data on customers, and another (A) with a questionnaire about our services. These two may be linked through a common variable, customer number. Each question on the questionnaire is represented by one observation (one line in SAS), hence each individual customer questionnaire may have multiple observations in dataset A. Similarly, the customer dataset (B) may have one or more observations for each customer number. Luckily this is no problem using the MERGE statement if the datasets are properly sorted. However, one of the datasets has another variable, not shared with the other, and it is important that it is not lost.

For now, my problem with the MERGE statement is that the data are not properly merged together, as SAS just moves on to the next observation with the same customer ID (in the example below, customer name).

SET A                        SET B
       x           y                r     s
Obs#1  Question 1  Jake      Obs#1  Jake  100
Obs#2  Question 2  Jake      Obs#2  Jake  250
Obs#3  Question 3  Jake

yielding

obs # 1 question 1 Jake 100

obs # 2 question 2 Jake 250

Obs# 3 question 3 Jake 250

Whereas I want to have

Obs#1 Question 1 Jake 100

Obs#2 Question 2 Jake 100

Obs#3 Question 3 Jake 100

Obs#4 Question 1 Jake 250

Obs#5 Question 2 Jake 250

Obs#6 Question 3 Jake 250
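The difference between the two results above can be reproduced with a short sketch. The dataset names a and b and the variable names question, name and s are assumptions chosen for illustration. MERGE pairs rows by position within each BY group, whereas an SQL join pairs every row of one table with every matching row of the other.

```sas
/* Toy data mirroring the Jake example; all names are hypothetical */
data a;
  input question name $;
  datalines;
1 Jake
2 Jake
3 Jake
;
run;

data b;
  input name $ s;
  datalines;
Jake 100
Jake 250
;
run;

/* MERGE walks both BY groups in parallel: row 1 gets 100, row 2 gets 250,
   and row 3 keeps the last value read (250), giving 3 rows in total */
data merged;
  merge a b;
  by name;
run;

/* The join combines each of the 3 questions with each of the 2 values
   of s, giving 3 x 2 = 6 rows */
proc sql;
  create table joined as
  select a.question, a.name, b.s
    from a inner join b
      on a.name = b.name
    order by b.s, a.question;
quit;
```

With real data the inputs to MERGE must be sorted by the BY variable; the toy data already is.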

------

Arthur's procedure above did partially accomplish this. However, when I check the number of observations by filtering away the non-shared variables, leaving only the common variables, and using NODUPLICATES to remove repeated customer numbers, the resulting number of observations deviates severely (50% lost!) from what I started out with. So something goes severely wrong using this procedure.

(When I use the MERGE BY statement I am able to retrieve the same number of customer# as I started out with.)
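One possible explanation, offered only as a guess since the real data is not shown, is that some customer numbers have no partner in the other dataset: a join that requires a match drops those rows, while MERGE keeps them. A minimal sketch for checking this, using Arthur's table names one and two with hypothetical toy rows:

```sas
/* Hypothetical toy data: Tiger has no partner in TWO */
data one;
  input A $ B $;
  datalines;
Yellow Bear
Blue Wolf
Red Tiger
;
run;

data two;
  input X $ Z;
  datalines;
Bear 100
Bear 250
Wolf 50
;
run;

proc sql;
  /* A join that requires a match, like the comma join with a condition */
  create table want as
  select one.*, two.Z
    from one inner join two
      on one.B = two.X;

  /* Compare the number of distinct keys before and after the join */
  select count(distinct B) as keys_in_one from one;
  select count(distinct B) as keys_in_want from want;

  /* Keys in ONE without a partner in TWO; an inner join drops these */
  create table unmatched as
  select distinct one.B
    from one left join two
      on one.B = two.X
    where two.X is missing;
quit;
```

If the unmatched table is not empty, those keys account for at least part of the loss. Keys that look identical but differ in case or in leading blanks would show up here as well.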

Tom · Posted 08-19-2012 02:29 PM

Seems like you just want a simple join based on the common variables? SQL is very good at this, DATA STEP is not.

If you have variables A, B, C in DS1 and want to merge on the values of Z from DS2 where the values of B match:

proc sql noprint;

create table want as

select * from ds1 left join ds2

on ds1.b = ds2.b /* a LEFT JOIN takes its matching condition in ON */

;

quit;
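Applied to the example from the first post, where the key is called B in one dataset and X in the other, the same join could be sketched as follows. The names ds1 and ds2 follow the answer above, and the input statements are an assumption about how the data might be read.

```sas
/* Data from the first post */
data ds1;
  input A :$8. B :$8. C & $8.;   /* & keeps the blank inside "How many" */
  datalines;
Yellow Bear Why
Yellow Bear When
Yellow Bear Where
Blue Wolf Why
Blue Wolf When
Blue Wolf Where
Green Lion Where
Green Lion How many
;
run;

data ds2;
  input X :$8. Z;
  datalines;
Bear 100
Bear 250
Wolf 50
Lion 1000
Lion 5000
;
run;

proc sql;
  /* Every row of ds1 is paired with every ds2 row whose X equals B;
     rows of ds1 without a partner would be kept with a missing Z */
  create table want as
  select ds1.*, ds2.Z
    from ds1 left join ds2
      on ds1.B = ds2.X
    order by ds1.B, ds2.Z;
quit;
```

For this data the result has 13 rows (3 x 2 for Bear, 3 x 1 for Wolf, 2 x 2 for Lion), the same count as the desired output in the first post.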

rmrsr · Posted 08-19-2012 03:09 PM

Thank you
