Hi,
I am working on a dataset that joins a series of questionnaires to a database. I would therefore like to know how to join the datasets on a common variable while preserving the initial data in the database, e.g. merging by a common variable while keeping every observation associated with another, uncommon, variable.
Obs  A       B     C           Obs  X     Z
1    Yellow  Bear  Why         1    Bear  100
2    Yellow  Bear  When        2    Bear  250
3    Yellow  Bear  Where       3    Wolf  50
4    Blue    Wolf  Why         4    Lion  1000
5    Blue    Wolf  When        5    Lion  5000
6    Blue    Wolf  Where
7    Green   Lion  Where
8    Green   Lion  How many
The common variables here are obviously B and X, which I have merged using the normal merge statement (data-merge-by-run). However, I now want to join these two datasets so that the rows of datafile ABC are repeated for each value of Z, i.e.
1 Yellow Bear Why 100
2 Yellow Bear When 100
3 Yellow Bear Where 100
4 Yellow Bear Why 250
5 Yellow Bear When 250
6 Yellow Bear Where 250
7 Blue Wolf Why 50
8 Blue Wolf When 50
9 Blue Wolf Where 50
10 Green Lion Where 1000
11 Green Lion How many 1000
12 Green Lion Where 5000
13 Green Lion How many 5000
I hope my description was not too confusing, hoping for a quick reply.
-R
Seems like you just want a simple join based on the common variables? SQL is very good at this, DATA STEP is not.
If you have variables A, B, C in DS1 and want to bring in the values of Z from DS2 wherever the key values match (B in DS1, X in DS2 in your example), then:
proc sql noprint;
create table want as
select * from ds1 left join ds2
/* a LEFT JOIN takes its condition in ON, not WHERE */
on ds1.b = ds2.x
;
quit;
You could do it by using proc sql. e.g.,
proc sql;
create table want as
select one.*,two.Z
from one,two
having one.B=two.X
order by B,Z
;
quit;
Hi Arthur and thanks for your reply.
This was insightful; however, I have some further questions, as it did not work out for me:
- The problem I have now is that not all observations make it into the new dataset (50% are lost!). I would like to point out that my datafile has many more variables than specified above (500+, and about 12.5 million observations), which may increase the possibilities for mess-ups, especially since I am new to SAS.
I see you are using a "having" clause; how does this compare to the normal "where" clause?
In your test case where and having would have the same effect. However, there are differences between the two. Take a look at:
http://support.sas.com/documentation/onlinedoc/91pdf/sasdoc_91/base_sqlproc_6992.pdf
If you go to the index (at the end), and click on having, it will bring up the page that describes the basic differences between the two.
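To make the difference concrete, here is a minimal sketch (not taken from the linked document). It rebuilds the small table `two` from the example above; the column name `total` is just an illustration. WHERE filters individual rows before they are summarized, while HAVING filters the groups after the summary has been computed.

```sas
/* Rebuild the small lookup table from the example */
data two;
  input X $ Z;
  datalines;
Bear 100
Bear 250
Wolf 50
Lion 1000
Lion 5000
;

proc sql;
  /* WHERE: rows with Z <= 100 are dropped first, then the rest are summed.
     Result: Bear 250, Lion 6000 */
  select X, sum(Z) as total
  from two
  where Z > 100
  group by X;

  /* HAVING: all rows are summed per group, then groups are filtered.
     Result: Bear 350, Lion 6000 */
  select X, sum(Z) as total
  from two
  group by X
  having sum(Z) > 100;
quit;
```

In the join shown earlier there is no GROUP BY and no summary function, which is why swapping HAVING for WHERE should return the same rows there.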
Of course Art's SQL approach is intuitive, direct and efficient. This is just to show that, with the help of a hash object, the DATA step can also produce the many-to-many (Cartesian-style) join where SQL usually has the edge:
data h1;
input (A B) (:$8.) C & $8.; /* & lets C contain a single embedded blank, e.g. "How many" */
cards;
Yellow Bear Why
Yellow Bear When
Yellow Bear Where
Blue Wolf Why
Blue Wolf When
Blue Wolf Where
Green Lion Where
Green Lion How many
;
data h2;
input X$ Z;
cards;
Bear 100
Bear 250
Wolf 50
Lion 1000
Lion 5000
;
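data want;
/* Load H1 into a hash object once, keyed on B; multidata keeps every row per key */
if _n_=1 then do;
if 0 then set h1 ; /* adds A, B, C to the PDV without reading any rows */
dcl hash h1(dataset:'h1',multidata:'y');
h1.definekey('b');
h1.definedata(all:'y');
h1.definedone();
end;
set h2;
/* Find the first H1 row whose B equals X, then walk through the remaining matches */
rc=h1.find(key:x);
do while (rc=0); /* do not reset rc here, or unmatched keys would still output a row */
output;
rc=h1.find_next(key:x);
end;
drop x rc;
run;
proc print;run;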
Haikuo
Can you explain more what you are doing?
Which of the datasets is the answers to the questions? What is the other dataset? Who is answering the questions and how do you uniquely identify individual respondents?
Why is it that you do not want all of the values of Z? How do you pick which value to use? For example, when the answer is "Bear", when would you use 100 and when would you use 250 for the value of "Z"? Would you like an average of 100 and 250? A sum?
Hi Tom,
I do want to keep all of Z. This is the essence of my problem. In the example in the first post I have two Bear observations, each with a sum (the price of a bear hide, if you want). What I want is the questionnaire repeated for each individual animal observation (e.g. for each bear hide price), and as the hide price is not a common variable I am finding this hard to achieve.
I am sorry if my explanation of the problem was bad. I'll use another example, more closely related to my problem below:
I have 2 datasets, one (B) with data on customers, and another (A) with a questionnaire about our services. These two may be linked through a common variable, customer number. Each question on the questionnaire is represented by one observation (one line in SAS), hence each individual customer questionnaire may have multiple observations in dataset A. Similarly, the customer dataset (B) may have one or more observations for each customer number. Luckily this is no problem using the MERGE statement if the datasets are properly sorted. However, I have one other, uncommon, variable in one of the datasets (B), and it is important that it is not lost.
As of now, my problem with the MERGE statement is that the data are not properly merged together, as SAS just jumps to the next observation with the same customer ID (in the example below: customer name).
SET A                          SET B
        x           y                  r     s
Obs#1   Question 1  Jake       Obs#1   Jake  100
Obs#2   Question 2  Jake       Obs#2   Jake  250
Obs#3   Question 3  Jake
yielding
obs # 1 question 1 Jake 100
obs # 2 question 2 Jake 250
Obs# 3 question 3 Jake 250
Whereas I want to have
Obs#1 Question 1 Jake 100
Obs#2 Question 2 Jake 100
Obs#3 Question 3 Jake 100
Obs#4 Question 1 Jake 250
Obs#5 Question 2 Jake 250
Obs#6 Question 3 Jake 250
------
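For anyone wanting to reproduce the behaviour described above, here is a minimal sketch. It assumes miniature stand-ins for the two datasets; the names `a`, `b`, `question`, `customer`, and `amount` are hypothetical. It contrasts the DATA step match-merge, which pairs rows by position within each BY group, with an SQL join, which pairs every matching row with every other.

```sas
/* Hypothetical miniature questionnaire data: three questions for one customer */
data a;
  input question customer $;
  datalines;
1 Jake
2 Jake
3 Jake
;

/* Hypothetical miniature customer data: two amounts for the same customer */
data b;
  input customer $ amount;
  datalines;
Jake 100
Jake 250
;

/* Match-merge: 3 rows (100, 250, 250); the last value of AMOUNT is carried
   forward once B runs out of rows for the BY group */
data merged;
  merge a b;
  by customer;
run;

/* SQL join: every row of A is paired with every matching row of B,
   giving 3 x 2 = 6 rows */
proc sql;
  create table joined as
  select a.question, a.customer, b.amount
  from a inner join b
    on a.customer = b.customer
  order by b.amount, a.question;
quit;
```

The `joined` table should have the six-row layout asked for above, while `merged` reproduces the three-row result from MERGE.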
Arthur's procedure above did partially accomplish this. However, when I check the number of observations by filtering away the uncommon variables (leaving only the common variables) and using NODUPLICATES to remove repeated customer numbers, the resulting number of observations deviates severely (50% lost!) from what I started out with. I.e. something goes severely wrong using this procedure.
(When I use the MERGE BY statement I am able to retrieve the same number of customer# as I started out with.)
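One possible explanation, not confirmed anywhere in this thread, is that the comma join with a matching condition is an inner join: keys that appear in only one table are dropped, whereas MERGE BY keeps them. The sketch below shows one way to check for that. It reuses the names `one`, `two`, and `want` from the SQL reply, and the `Red Fox` row is a hypothetical unmatched record added for illustration.

```sas
/* Small stand-in tables; "Fox" has no counterpart in TWO (hypothetical row) */
data one;
  input A $ B $ C $;
  datalines;
Yellow Bear Why
Blue Wolf When
Red Fox Where
;

data two;
  input X $ Z;
  datalines;
Bear 100
Bear 250
Wolf 50
;

proc sql;
  /* Inner join in the style of the reply above */
  create table want as
  select one.*, two.Z
  from one, two
  where one.B = two.X;

  /* Compare the number of distinct keys before and after the join */
  select count(distinct B) as keys_in_one from one;    /* 3 */
  select count(distinct B) as keys_in_want from want;  /* 2 */

  /* List the keys that the inner join dropped */
  create table unmatched as
  select distinct B
  from one
  where B not in (select X from two);                  /* Fox */
quit;
```

If `unmatched` turns out to be non-empty on the real data, a left join like the one suggested earlier in the thread would keep those rows, with Z set to missing.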
Thank you