- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have two datasets. My original with 150,000 observations and a second one I made from this dataset with exclusions and has a total of 145,000 observations. I'm trying to figure out how to write a code where I can just subtracts these two datasets to find the total amount excluded / missing. I know the total excluded is 5,000 from simple math but I'm a bit stuck on how we can do this in a SAS code.
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
As long as these two datasets have the same variable name ,type,order , you could get it by using SQL.
data A;
set sashelp.class;
run;
data B;
set sashelp.class;
if _n_ in (1:4) then delete;
run;
proc sql;
create table want as
select * from A
except
select * from B ;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Matching observations I guess would just be the unique ID?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Please provide example data to illustrate your issue. I can't make code suggestions without knowing your data in structure and content.
Post example data as working DATA steps with DATALINES in a code box.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
As long as these two datasets have the same variable name ,type,order , you could get it by using SQL.
data A;
set sashelp.class;
run;
data B;
set sashelp.class;
if _n_ in (1:4) then delete;
run;
proc sql;
create table want as
select * from A
except
select * from B ;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
This worked! Ty so much !
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I know the total excluded is 5,000 from simple math but I'm a bit stuck on how we can do this in a SAS code.
You don't know that. Maybe there are duplicates, or unique observations in both data sets.
Never make assumptions about your data quality, and always check for all possible defects.
Assuming you won't rely on the observation order, here is one way:
data A;
X=1; output;
X=1; output;
X=2; output;
X=2; output;
run;
data B;
X=2; output;
X=3; output;
run;
proc sql;
create table WANT as
select 'In A only' as SRC, *
from (select * from A
except all
select * from B )
union all
select 'In B only' as SRC, *
from (select * from B
except all
select * from A );
quit;
This results in:
SRC | X |
---|---|
In A only | 1 |
In A only | 1 |
In A only | 2 * |
In B only | 3 |
* The all
keyword in except
does not suppress duplicate rows, hence why the second X=2 observation is flagged by the code above.
Same thing for keeping both X=1 observations, thanks to union all
.
See https://documentation.sas.com/doc/en/sqlproc/3.2/n0vo2lglyrnexwn14emi8m0jqvrj.htm
Depending on what you want, you could also use a variation of
proc sql;
create table WANT as
select *
from A
union all
select *
from B
except
(select * from A
intersect
select * from B );
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If you want to use the observation order and don't care about duplicates, here is another way to flag non-common observations:
proc sort data=A out=ASORTED;
by _ALL_;
run;
proc sort data=B out=BSORTED;
by _ALL_;
run;
data WANT;
merge ASORTED(in=A)
BSORTED(in=B);
by _ALL_;
if A & ^B then src='In A only';
else if ^A & B then src='In B only';
else delete;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
An example of the data would be
DATA exclusions;
set original;
if CVD = . THN DELETE;
RUN;
The original dataset would have this kind of data ;
DATA original;
input ID BLD STROKE CVD ;datalines ;
1 A 1 .
2 B 0 1
3 O 0 0
4 A 1 . ;
RUN;
The excluded dataset would result in this data ;
DATA excluded ;
input ID BLD STROKE CVD ;datalines ;
2 B 0 1
3 O 0 0 ;
RUN;
So my goal is to figure out by code the difference in observations which should be 2.