rmrsr
Calcite | Level 5

Hi,

I am working on a dataset that involves joining a series of questionnaires to a database. I would therefore like to know how I can join the datasets on a common variable while preserving the initial data in the database, i.e. merging by a common variable while keeping all observations related to another, uncommon, variable.

Datafile ABC:

Obs   A        B      C
1     Yellow   Bear   Why
2     Yellow   Bear   When
3     Yellow   Bear   Where
4     Blue     Wolf   Why
5     Blue     Wolf   When
6     Blue     Wolf   Where
7     Green    Lion   Where
8     Green    Lion   How many

Datafile XZ:

Obs   X      Z
1     Bear   100
2     Bear   250
3     Wolf   50
4     Lion   1000
5     Lion   5000

The common variables here are obviously B and X, which I have merged using the normal MERGE statement (data-merge-by-run). However, I now want to join these two datasets so that datafile ABC is repeated for each value of Z, i.e.:

Obs   A        B      C          Z
1     Yellow   Bear   Why        100
2     Yellow   Bear   When       100
3     Yellow   Bear   Where      100
4     Yellow   Bear   Why        250
5     Yellow   Bear   When       250
6     Yellow   Bear   Where      250
7     Blue     Wolf   Why        50
8     Blue     Wolf   When       50
9     Blue     Wolf   Where      50
10    Green    Lion   Where      1000
11    Green    Lion   How many   1000
12    Green    Lion   Where      5000
13    Green    Lion   How many   5000

I hope my description was not too confusing, hoping for a quick reply.

-R


8 REPLIES
art297
Opal | Level 21

You could do it by using proc sql.  e.g.,

proc sql;

  create table want as

    select one.*,two.Z

      from one,two

        having one.B=two.X

          order by B,Z

  ;

quit;
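
For anyone who wants to try the reply above on the sample data from the question, here is a minimal sketch. The dataset names ONE and TWO follow the reply; the informats and lengths ($8.) are assumptions chosen to fit the sample values, and the filter is written with WHERE, which has the same effect as HAVING in this case.

```sas
/* Sample data from the question. Lengths of 8 are an assumption. */
data one;
   /* & lets C contain a single embedded blank ("How many") */
   input A :$8. B :$8. C & $8.;
   datalines;
Yellow  Bear  Why
Yellow  Bear  When
Yellow  Bear  Where
Blue  Wolf  Why
Blue  Wolf  When
Blue  Wolf  Where
Green  Lion  Where
Green  Lion  How many
;

data two;
   input X :$8. Z;
   datalines;
Bear 100
Bear 250
Wolf 50
Lion 1000
Lion 5000
;

proc sql;
   /* Every row of ONE is paired with every row of TWO that has B = X */
   create table want as
      select one.*, two.Z
      from one, two
      where one.B = two.X
      order by B, Z;
quit;
```

With this sample data WANT should hold 13 rows: 3 x 2 for Bear, 3 x 1 for Wolf and 2 x 2 for Lion. Note that ORDER BY B, Z sorts alphabetically, so the Lion rows come before the Wolf rows.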

rmrsr
Calcite | Level 5

Hi Arthur and thanks for your reply.

This was insightful; however, I have some further questions, as it did not work out for me:
- The problem I have now is that not all observations end up in the new dataset (50% are lost!). I would like to point out that my datafile has a lot more variables than specified above (500+, and about 12.5 million observations), which may increase the possibility of mess-ups, especially since I am new to SAS.

I see you are using a HAVING clause. How does this compare to the normal WHERE clause?

art297
Opal | Level 21

In your test case where and having would have the same effect.  However, there are differences between the two.  Take a look at:

http://support.sas.com/documentation/onlinedoc/91pdf/sasdoc_91/base_sqlproc_6992.pdf

If you go to the index (at the end), and click on having, it will bring up the page that describes the basic differences between the two.
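
As a small illustration of the difference (a sketch, not taken from the linked document): WHERE filters individual rows before any grouping, while HAVING filters after GROUP BY and can test summary values. The dataset TWO below reuses the X and Z values from the question.

```sas
data two;
   input X :$8. Z;
   datalines;
Bear 100
Bear 250
Wolf 50
Lion 1000
Lion 5000
;

proc sql;
   /* WHERE: keeps single rows with Z above 100 */
   select X, Z
      from two
      where Z > 100;

   /* HAVING: keeps groups whose total Z is above 100 */
   select X, sum(Z) as total_Z
      from two
      group by X
      having sum(Z) > 100;
quit;
```

The first query keeps three rows (Bear 250, Lion 1000, Lion 5000). The second returns one row per animal whose total exceeds 100 (Bear 350, Lion 6000); Wolf drops out in both.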

Haikuo
Onyx | Level 15

Of course Art's SQL approach is intuitive, direct and efficient. This is just to show that, with the help of a hash object, the DATA step can mimic the Cartesian products on which SQL used to have the edge:

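data h1;
   /* & lets C hold a value with one embedded blank ("How many") */
   input A :$8. B :$8. C & $8.;
   cards;
Yellow  Bear  Why
Yellow  Bear  When
Yellow  Bear  Where
Blue  Wolf  Why
Blue  Wolf  When
Blue  Wolf  Where
Green  Lion  Where
Green  Lion  How many
;

data h2;
   input X $ Z;
   cards;
Bear 100
Bear 250
Wolf 50
Lion 1000
Lion 5000
;

data want;
   if _n_=1 then do;
      if 0 then set h1;   /* define A, B, C without reading any data */
      dcl hash h1(dataset:'h1', multidata:'y');   /* keep duplicate keys */
      h1.definekey('b');
      h1.definedata(all:'y');
      h1.definedone();
   end;
   set h2;
   /* write one row for every H1 entry whose B matches the current X */
   rc = h1.find(key:x);
   do while (rc = 0);   /* DO RC=0 BY 0 would reset rc and hide a failed FIND */
      output;
      rc = h1.find_next(key:x);
   end;
   drop x rc;
run;

proc print; run;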

Haikuo

Tom
Super User

Can you explain more what you are doing?

Which of the datasets contains the answers to the questions?  What is the other dataset?  Who is answering the questions, and how do you uniquely identify individual respondents?

Why is it that you do not want all of the values of Z?  How do you pick which value to use?  For example, when the answer is "Bear", when would you use 100 and when would you use 250 for the value of Z?  Would you like an average of 100 and 250? A sum?

rmrsr
Calcite | Level 5

Hi Tom,

I do want to keep all of Z. This is the essence of my problem. In the example in the first post I have two Bear observations, each a sum of money for a bear hide, if you like. What I want is the questionnaire repeated for each individual animal observation (e.g. for each bear hide price), and as the hide price is not a common variable I am finding this hard to achieve.

I am sorry if my explanation of the problem was bad. I'll use another example, more closely related to my problem below:

I have 2 datasets, one (I) with data on customers, and another (A) with a questionnaire about our services. These two may be linked through a common variable, customer number. Each question on the questionnaire is represented by one observation (one line in SAS), hence each individual customer questionnaire may have multiple observations in dataset A. Similarly, the customer data set (B) may have one or more observations for each customer number. Luckily this is no problem using the MERGE statement if the datasets are properly sorted. However, I have one other, uncommon, variable in one of the datasets (II), and it is important that it is not lost.

As of now, my problem with the MERGE statement is that the data are not properly merged together, as SAS just jumps to the next observation with the same customer ID (in the example below: customer name).

SET A

         x            y                r      s
Obs#1    Question 1   Jake    Obs#1    Jake   100
Obs#2    Question 2   Jake    Obs#2    Jake   250
Obs#3    Question 3   Jake

yielding

Obs#1    Question 1   Jake   100
Obs#2    Question 2   Jake   250
Obs#3    Question 3   Jake   250

Whereas I want to have

Obs#1    Question 1   Jake   100
Obs#2    Question 2   Jake   100
Obs#3    Question 3   Jake   100
Obs#4    Question 1   Jake   250
Obs#5    Question 2   Jake   250
Obs#6    Question 3   Jake   250

------

Arthur's procedure above did partially accomplish this. However, when I check the number of observations by dropping the uncommon variables, leaving only the common variables, and then using NODUPLICATES to remove repeated customer numbers, the resulting number of observations deviates severely (50% lost!) from what I started out with. In other words, something goes severely wrong using this procedure.

(When I use the MERGE BY statement I am able to retrieve the same number of customer# as I started out with.)
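
The behaviour described in this post can be reproduced on the Jake example. The sketch below uses hypothetical dataset names (QUESTIONS and PRICES) for the two sets, with the variable names x, y, r and s taken from the example; the lengths are assumptions.

```sas
data questions;
   /* & lets x contain one embedded blank ("Question 1") */
   input x & $10. y :$8.;
   datalines;
Question 1  Jake
Question 2  Jake
Question 3  Jake
;

data prices;
   input r :$8. s;
   datalines;
Jake 100
Jake 250
;

/* MERGE pairs rows by position within each BY group, so the last */
/* price is simply carried forward once PRICES runs out of rows.  */
data merged;
   merge questions prices(rename=(r=y));
   by y;
run;

/* A SQL join pairs every question with every price for the customer. */
proc sql;
   create table joined as
      select q.x, q.y, p.s
      from questions as q inner join prices as p
         on q.y = p.r
      order by p.s, q.x;
quit;
```

MERGED has 3 rows with s = 100, 250, 250, matching the unwanted result shown above, and the log carries a note that more than one dataset has repeats of BY values. JOINED has the 6 rows that were asked for.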

Tom
Super User

Seems like you just want a simple join based on the common variables?  SQL is very good at this, DATA STEP is not.

If you have variables A, B, C in DS1 and want to merge on the values of Z from DS2 based on the values of B matching:

proc sql noprint;
   create table want as
   select * from ds1 left join ds2
   on ds1.b = ds2.b   /* PROC SQL requires an ON clause for a LEFT JOIN */
   ;
quit;
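
One possible reason for the lost observations reported earlier in the thread (an assumption, since the real data are not shown) is that some customers have no match in the second dataset. An inner join, or a comma join with WHERE or HAVING, drops those rows, while a LEFT JOIN keeps them with a missing Z. A minimal sketch with made-up rows, where the Fox row is hypothetical and has no match:

```sas
data ds1;
   input A :$8. B :$8. C & $8.;
   datalines;
Yellow  Bear  Why
Yellow  Bear  When
Red  Fox  Why
;

data ds2;
   input B :$8. Z;
   datalines;
Bear 100
Bear 250
;

proc sql;
   /* LEFT JOIN: unmatched DS1 rows survive with Z missing */
   create table want_left as
      select ds1.*, ds2.Z
      from ds1 left join ds2
         on ds1.b = ds2.b;

   /* INNER JOIN: unmatched DS1 rows are dropped */
   create table want_inner as
      select ds1.*, ds2.Z
      from ds1 inner join ds2
         on ds1.b = ds2.b;

   /* Compare the number of distinct keys that survive each join */
   select count(distinct b) as keys_left from want_left;
   select count(distinct b) as keys_inner from want_inner;
quit;
```

Here WANT_LEFT has 5 rows and 2 distinct keys, while WANT_INNER has 4 rows and 1 distinct key, so counting distinct keys before and after the join is a quick way to see whether unmatched customers are being dropped.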

rmrsr
Calcite | Level 5

Thank you
