
Merging based on more conditions

Occasional Contributor
Posts: 7

Hi,

I am working on joining a series of questionnaires to a database. I would like to know how to join the datasets on a common variable while preserving the original data in the database, i.e. merging by the common variable while keeping every observation associated with another variable that the two datasets do not share.

First dataset (A, B, C):

Obs   A        B      C
1     Yellow   Bear   Why
2     Yellow   Bear   When
3     Yellow   Bear   Where
4     Blue     Wolf   Why
5     Blue     Wolf   When
6     Blue     Wolf   Where
7     Green    Lion   Where
8     Green    Lion   How many

Second dataset (X, Z):

Obs   X      Z
1     Bear   100
2     Bear   250
3     Wolf   50
4     Lion   1000
5     Lion   5000

The common variable here is obviously B (called X in the second dataset), which I have used in a normal merge (DATA, MERGE, BY, RUN). However, I now want to join the two datasets so that the rows of dataset ABC are repeated for each value of Z, i.e.:

Obs   A        B      C          Z
1     Yellow   Bear   Why        100
2     Yellow   Bear   When       100
3     Yellow   Bear   Where      100
4     Yellow   Bear   Why        250
5     Yellow   Bear   When       250
6     Yellow   Bear   Where      250
7     Blue     Wolf   Why        50
8     Blue     Wolf   When       50
9     Blue     Wolf   Where      50
10    Green    Lion   Where      1000
11    Green    Lion   How many   1000
12    Green    Lion   Where      5000
13    Green    Lion   How many   5000

I hope my description was not too confusing, hoping for a quick reply.

-R


Accepted Solutions
Solution
‎08-19-2012 02:29 PM
Super User
Posts: 6,502

Re: Merging based on more conditions

Seems like you just want a simple join based on the common variables?  SQL is very good at this, DATA STEP is not.

Suppose you have variables A, B and C in DS1 and want to merge on the values of Z from DS2 wherever the values of B match:

proc sql noprint;
create table want as
   select *
   from ds1 left join ds2
   /* an outer join needs an ON clause; PROC SQL rejects WHERE in its place */
   on ds1.b = ds2.b
;
quit;



All Replies
PROC Star
Posts: 7,363

Re: Merging based on more conditions

You could do it by using proc sql.  e.g.,

proc sql;
  create table want as
    select one.*, two.Z
      from one, two
        having one.B = two.X
          order by B, Z
  ;
quit;

Occasional Contributor
Posts: 7

Re: Merging based on more conditions

Hi Arthur and thanks for your reply.

This was insightful, but it did not quite work out for me, so I have some further questions:
- The problem I have now is that not all observations make it into the new dataset (50% are lost!). I should point out that my data file has many more variables than shown above (500+, and about 12.5 million observations), which increases the chance of mistakes, especially since I am new to SAS.

I see you are using a "having" clause. How does this compare to the normal "where" clause?

PROC Star
Posts: 7,363

Re: Merging based on more conditions

In your test case where and having would have the same effect.  However, there are differences between the two.  Take a look at:

http://support.sas.com/documentation/onlinedoc/91pdf/sasdoc_91/base_sqlproc_6992.pdf

If you go to the index (at the end), and click on having, it will bring up the page that describes the basic differences between the two.
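As a rough illustration of that difference (a sketch only, not taken from the linked document): WHERE is applied to individual rows before any grouping, while HAVING is applied afterwards and can therefore test a summary value. The dataset name two_demo and the column name total_z are made up for this example; the values mirror the second table in the first post.

```sas
/* Hypothetical demo data, copied from the X/Z table in the first post */
data two_demo;
   input X $ Z;
   datalines;
Bear 100
Bear 250
Wolf 50
Lion 1000
Lion 5000
;
run;

proc sql;
   /* WHERE drops rows with Z below 100 before the groups are formed */
   select X, sum(Z) as total_z
      from two_demo
      where Z >= 100
      group by X;

   /* HAVING keeps or drops whole groups based on the summed value */
   select X, sum(Z) as total_z
      from two_demo
      group by X
      having sum(Z) >= 1000;
quit;
```

With this data the first query should keep Bear and Lion (the single Wolf row is filtered out before grouping), while the second should keep only Lion, because only its total reaches 1000.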

Respected Advisor
Posts: 3,124

Re: Merging based on more conditions

Of course Art's SQL approach is intuitive, direct and efficient. This is just to show that, with the help of a hash object, the DATA step can mimic the Cartesian products that used to be SQL's edge:

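data h1;
   /* the & modifier lets C hold a value with one embedded blank ("How many") */
   input A :$8. B :$8. C & $8.;
   cards;
Yellow          Bear       Why
Yellow          Bear       When
Yellow          Bear       Where
Blue             Wolf       Why
Blue             Wolf       When
Blue             Wolf       Where
Green          Lion        Where
Green          Lion       How many
;

data h2;
   input X $ Z;
   cards;
Bear   100
Bear   250
Wolf   50
Lion   1000
Lion   5000
;

data want;
   if _n_=1 then do;
      if 0 then set h1;   /* defines A, B, C without reading any rows */
      dcl hash h1(dataset:'h1', multidata:'y');   /* keep every row per key */
      h1.definekey('b');
      h1.definedata(all:'y');
      h1.definedone();
   end;
   set h2;
   rc=h1.find(key:x);          /* first h1 row whose B equals X */
   do while (rc=0);            /* loop only when a match was actually found */
      output;
      rc=h1.find_next(key:x);  /* next h1 row with the same key */
   end;
   drop x rc;
run;

proc print; run;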

Haikuo

Super User
Posts: 6,502

Re: Merging based on more conditions

Can you explain more what you are doing?

Which of the datasets holds the answers to the questions?  What is the other dataset?  Who is answering the questions, and how do you uniquely identify individual respondents?

Why is it that you do not want all of the values of Z?  How do you pick which value to use?  For example, when the answer is "Bear", when would you use 100 and when 250 for the value of Z?  Would you like an average of 100 and 250? A sum?

Occasional Contributor
Posts: 7

Re: Merging based on more conditions

Hi Tom,

I do want to keep all of Z. This is the essence of my problem. In the example from the first post I have two Bear observations, say the price of a bear hide. What I want is the questionnaire repeated for each individual animal observation (e.g. for each bear hide price), and as the hide price is not a common variable I am finding this hard to achieve.

I am sorry if my explanation of the problem was bad. I'll use another example below, more closely related to my problem:

I have two datasets: one (A) with a questionnaire about our services, and another (B) with data on customers. The two can be linked through a common variable, the customer number. Each question on the questionnaire is one observation (one line in SAS), so each customer's questionnaire may span multiple observations in dataset A. Similarly, the customer dataset B may have one or more observations for each customer number. Luckily this is no problem for the MERGE statement if the datasets are properly sorted. However, one of the datasets (B) has another variable, not shared with the other dataset, and it is important that it is not lost.

As of now, my problem with the MERGE statement is that the data are not properly combined: SAS just steps to the next observation with the same customer ID (in the example below, the customer name).

Set A:

          x            y
Obs#1     Question 1   Jake
Obs#2     Question 2   Jake
Obs#3     Question 3   Jake

Set B:

          r      s
Obs#1     Jake   100
Obs#2     Jake   250

yielding

Obs#1     Question 1   Jake   100
Obs#2     Question 2   Jake   250
Obs#3     Question 3   Jake   250

whereas I want to have

Obs#1     Question 1   Jake   100
Obs#2     Question 2   Jake   100
Obs#3     Question 3   Jake   100
Obs#4     Question 1   Jake   250
Obs#5     Question 2   Jake   250
Obs#6     Question 3   Jake   250
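The two results above can be reproduced with a small sketch. It assumes the datasets are called a and b, shortens the question text to Q1, Q2, Q3 so that plain list input works, and uses merged and joined as made-up output names. In a DATA step MERGE, rows within a BY group are paired by position, and once the shorter side runs out its last values are carried forward; a SQL join instead pairs every row of one side with every matching row of the other.

```sas
data a;                 /* questionnaire rows: x = question, y = customer */
   input x $ y $;
   datalines;
Q1 Jake
Q2 Jake
Q3 Jake
;
run;

data b;                 /* customer rows: r = customer, s = the extra value */
   input r $ s;
   datalines;
Jake 100
Jake 250
;
run;

/* Positional pairing within the BY group: gives 100, 250, 250 */
data merged;
   merge a b(rename=(r=y));
   by y;
run;

/* Every a row paired with every matching b row: 3 x 2 = 6 rows */
proc sql;
   create table joined as
      select a.x, a.y, b.s
         from a inner join b
            on a.y = b.r
         order by b.s, a.x;
quit;
```

When both inputs repeat a BY value, the log for the MERGE step should also carry a note saying that more than one data set has repeats of BY values, which is a useful warning sign for this situation.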

------

Arthur's procedure above did partially accomplish this. However, when I check the number of observations by dropping the uncommon variables, leaving only the common variables, and using NODUPLICATES to remove repeated customer numbers, the resulting number of observations deviates severely (50% lost!) from what I started out with. In other words, something goes badly wrong with this procedure.

(When I use MERGE with BY, I retrieve the same number of customer numbers as I started out with.)
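One possible cause of the missing customers, which nothing in this thread confirms, is that some customer numbers occur in only one of the two datasets: a join that requires a match on both sides drops those rows, whereas MERGE keeps them. The sketch below shows one way to check. The dataset and variable names follow the small example above, the customer Anna is invented to stand in for an unmatched key, and unmatched, joined, customers_in_a and customers_in_joined are made-up names.

```sas
data a;
   input x $ y $;
   datalines;
Q1 Jake
Q2 Jake
Q1 Anna
;
run;

data b;
   input r $ s;
   datalines;
Jake 100
Jake 250
;
run;

proc sql;
   /* customers in a that have no row at all in b */
   create table unmatched as
      select distinct y
         from a
         where y not in (select r from b);

   /* a left join keeps those customers, with s left missing */
   create table joined as
      select a.x, a.y, b.s
         from a left join b
            on a.y = b.r;

   /* the two distinct counts should agree when nothing is lost */
   select count(distinct y) as customers_in_a from a;
   select count(distinct y) as customers_in_joined from joined;
quit;
```

If unmatched comes back empty on the real data, unmatched keys are not the explanation and the cause lies elsewhere, for example in key values that differ in case or trailing characters.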


Occasional Contributor
Posts: 7

Re: Merging based on more conditions

Thank you

☑ This topic is SOLVED.
