- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Posted 10-06-2008 02:52 PM
(2341 views)
Hi, I am trying to combine two data sets with the same variables conditionally. I only want to combine observations from the second data set that have the same value as a variablex (famtrac below) in the primary data set. Variablex has the same name & same format in both data sets. I tried using the set statement with in= and an if statement, but that didn't work:
data all65plus08;
set selena.cps65plus08a (in=eldfam) selena.cps65under08;
by famtrac;
if eldfam;
run;
There are duplicate observations in the secondary data set that meet the conditions I specified and I *do* want these duplicates to be added to the new combined data set.
Any help would be appreciated!
data all65plus08;
set selena.cps65plus08a (in=eldfam) selena.cps65under08;
by famtrac;
if eldfam;
run;
There are duplicate observations in the secondary data set that meet the conditions I specified and I *do* want these duplicates to be added to the new combined data set.
Any help would be appreciated!
7 REPLIES 7
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You are using SET (interleave observations) with a BY. Do you mean to use a MERGE with a BY, in order to "combine" your two files? Do consider the impact of the MERGE operation with same-named variables in the two (or more) files on the MERGE statement. Have a look at the SAS support website http://support.sas.com/ and the Product Documentation for details about the Data step's MERGE statement processing.
Scott Barry
SBBWorks, Inc.
Scott Barry
SBBWorks, Inc.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I don't want to merge the data sets, I need to interleave them (the end result would be a data set with the observations from the primary data set plus all observations from the secondary data set that have a value for famtrac that occurs for some observation in the primary data set).
I want to interleave observations, but I want to specify which observations from the secondary data set will be included in the interleaved data set...
I want to interleave observations, but I want to specify which observations from the secondary data set will be included in the interleaved data set...
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Then you need to add an IN= for the second dataset on the SET (not the first one), and use IF/THEN logic to conditionally output obs contributed from the second dataset.
Scott Barry
SBBWorks, Inc.
Scott Barry
SBBWorks, Inc.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
And it appears that you will also need to keep track of the prior first DSN contribution, using a RETAIN and assigning a temporary (different named) variable.
Scott Barry
SBBWorks, Inc.
Scott Barry
SBBWorks, Inc.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi:
You are on the right track for interleaving, however, your IN= variable and your subsetting IF are limiting the new dataset to only those observations that come from selena.cps65plus08a -- which doesn't sound like what you want.
You might need to put an IN= on the secondary dataset as well, and then tailor your IF statement accordingly. If I understand you correctly, you want EVERY obs from the first dataset, but only SOME obs from the second dataset??
Consider this data (Name, Present_type, Amtspend are my variables. I'm also making a variable called FROMFILE so that the final output shows which file contributed the interleaved observation):
[pre]
data one;
length name present_type $12;
infile datalines;
input name $ present_type $ amtspend;
fromfile = 'one';
return;
datalines;
alan anniversary 15
alan birthday 10
carl anniversary 15
edna birthday 20
;
run;
data two;
length name present_type $12;
infile datalines;
input name $ present_type $ amtspend;
fromfile = 'two';
return;
datalines;
alan christmas 25
bob anniversary 10
bob birthday 15
carl birthday 10
edna christmas 15
;
run;
[/pre]
For this data, I want ALL the obs from dataset ONE and everybody from dataset TWO except poor Carl -- he only gets an anniversary present. So this is the output I want:
[pre]
present_
Obs name type amtspend fromfile
1 alan anniversary 15 one
2 alan birthday 10 one
3 alan christmas 25 two
4 bob anniversary 10 two
5 bob birthday 15 two
6 carl anniversary 15 one
7 edna birthday 20 one
8 edna christmas 15 two
[/pre]
Created by THIS program:
[pre]
data three;
set one(in=inone)
two(in=intwo);
by name;
if inone or
(intwo and name ne 'carl') then output;
run;
ods listing;
proc print data=three;
title 'three';
run;
title;
[/pre]
If you have more complex conditions, then you'll have to code for those as well. This was just a simple test. But it should give you an idea of how interleaving works. Remember that the BY in interleaving performs differently than the BY in merging.
To learn more about the interleaving process, I recommend this site,
http://support.sas.com/documentation/cdl/en/basess/58133/HTML/default/a001318366.htm
which has a section entitled: Understanding the Interleaving Process.
cynthia
You are on the right track for interleaving, however, your IN= variable and your subsetting IF are limiting the new dataset to only those observations that come from selena.cps65plus08a -- which doesn't sound like what you want.
You might need to put an IN= on the secondary dataset as well, and then tailor your IF statement accordingly. If I understand you correctly, you want EVERY obs from the first dataset, but only SOME obs from the second dataset??
Consider this data (Name, Present_type, Amtspend are my variables. I'm also making a variable called FROMFILE so that the final output shows which file contributed the interleaved observation):
[pre]
data one;
length name present_type $12;
infile datalines;
input name $ present_type $ amtspend;
fromfile = 'one';
return;
datalines;
alan anniversary 15
alan birthday 10
carl anniversary 15
edna birthday 20
;
run;
data two;
length name present_type $12;
infile datalines;
input name $ present_type $ amtspend;
fromfile = 'two';
return;
datalines;
alan christmas 25
bob anniversary 10
bob birthday 15
carl birthday 10
edna christmas 15
;
run;
[/pre]
For this data, I want ALL the obs from dataset ONE and everybody from dataset TWO except poor Carl -- he only gets an anniversary present. So this is the output I want:
[pre]
present_
Obs name type amtspend fromfile
1 alan anniversary 15 one
2 alan birthday 10 one
3 alan christmas 25 two
4 bob anniversary 10 two
5 bob birthday 15 two
6 carl anniversary 15 one
7 edna birthday 20 one
8 edna christmas 15 two
[/pre]
Created by THIS program:
[pre]
data three;
set one(in=inone)
two(in=intwo);
by name;
if inone or
(intwo and name ne 'carl') then output;
run;
ods listing;
proc print data=three;
title 'three';
run;
title;
[/pre]
If you have more complex conditions, then you'll have to code for those as well. This was just a simple test. But it should give you an idea of how interleaving works. Remember that the BY in interleaving performs differently than the BY in merging.
To learn more about the interleaving process, I recommend this site,
http://support.sas.com/documentation/cdl/en/basess/58133/HTML/default/a001318366.htm
which has a section entitled: Understanding the Interleaving Process.
cynthia
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for the help Cynthia. Yes, I want only some obs from the second dataset, but the obs that I want do not have a consistent value for the variable I am selecting on. I only want the obs from the second dataset where variablex (famtrac) in the second dataset matches any value of variablex (famtrac) in the first dataset...
In case anyone is also trying to figure out how to do this, here's what I finally came up with:
proc sql;
create table sql.fam6508
like sql.cps65plus08a;
proc sql;
insert into sql.fam6508
select * from sql.cps65plus08a **the primary data set**
union all
select * from sql.cps65under08 **secondary data set**
where famtrac in
(select famtrac from sql.cps65plus08a);
In case anyone is also trying to figure out how to do this, here's what I finally came up with:
proc sql;
create table sql.fam6508
like sql.cps65plus08a;
proc sql;
insert into sql.fam6508
select * from sql.cps65plus08a **the primary data set**
union all
select * from sql.cps65under08 **secondary data set**
where famtrac in
(select famtrac from sql.cps65plus08a);
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
An (untested) alternative is the following, assuming the input data sets are sorted on famtrac:
data all65plus08 (drop=v1);
set selena.cps65plus08a (in=eldfam) selena.cps65under08 (in=b);
by famtrac;
retain v1;
if eldfam then v1=famtrac;
if eldfam or (b and v1=famtrac);
run;
which should give the same result, except for the sort order in the result data set.
data all65plus08 (drop=v1);
set selena.cps65plus08a (in=eldfam) selena.cps65under08 (in=b);
by famtrac;
retain v1;
if eldfam then v1=famtrac;
if eldfam or (b and v1=famtrac);
run;
which should give the same result, except for the sort order in the result data set.