Dear community,
I have 2 huge datasets with millions of records. I want to simulate that in this small example.
I have dataset AA & BB. I want to get unique records (no duplicates) when I merge these two datasets with ID field. and I also want get the records whose age is greater than or equal to 60 years (even with duplicates whose age is greater than or equal to 60). If there are any other lower age with 60 years then I don't want to see them (like BB dataset has ID 'D' which has age of 60 & 59).
Data AA;
Input ID Name$ Height age;
cards;
1 A 1 30
3 B 2 40
5 C 2 50
7 D 2 60
9 E 2 70
;
run;
Data BB;
Input ID Name1$ height1 age1;
cards;
1 A 1 30
3 B 4 40
3 B 2 40
5 C 2 50
5 C 2 50
5 C 2 50
7 D 2 60
7 D 2 60
7 D 2 59
9 E 2 70
9 E 2 70
;
run;
PROC SQL;
CREATE TABLE LOOK AS
SELECT A.*, B.*
FROM AA AS A INNER JOIN BB AS B ON A.ID=B.ID
;
QUIT;
DATA ONE TWO;
SET LOOK ;
BY ID;
IF AGE >= 60 and age1 >=60 THEN OUTPUT ONE;
ELSE IF FIRST.ID AND LAST.ID THEN OUTPUT ONE;
ELSE OUTPUT TWO;
RUN;
The desired output should look like below.
ID Name Height age Name1 Height1 age1
1 A 1 30 A 1 30
9 E 2 70 E 2 70
9 E 2 70 E 2 70
Any help is much appreciated. Thanks in advance.
This does it:
Data AA;
Input ID Name$ Height age;
cards;
1 A 1 30
3 B 2 40
5 C 2 50
7 D 2 60
9 E 2 70
;
Data BB;
Input ID Name1$ height1 age1;
cards;
1 A 1 30
3 B 4 40
3 B 2 40
5 C 2 50
5 C 2 50
5 C 2 50
7 D 2 60
7 D 2 60
7 D 2 59
9 E 2 70
9 E 2 70
;
data want;
merge
aa
bb
;
by id;
if _n_ = 1
then do;
declare hash b (dataset:"bb (where=(age1 le 60))");
b.definekey("id");
b.definedone();
end;
if (first.id and last.id) or b.check() ne 0;
run;
proc print data=want noobs;
run;
Result:
D Name Height age Name1 height1 age1 1 A 1 30 A 1 30 9 E 2 70 E 2 70 9 E 2 70 E 2 70
I don't follow this logic. Why should ID = 1 be in the result but ID = 2 should not?
Please post the exact result that you expect from the posted sample data. Makes it much easier to provide a usable code answer.
ID Name Height age Name1 Height1 age1
1 A 1 30 A 1 30
9 E 2 70 E 2 70
9 E 2 70 E 2 70
Again, since this is your desired output, you must be able to explain why ID = 2 is not present and ID = 1 is.
You asked for: "unique records (no duplicates) when I merge these two datasets with ID field. and I also want get the records whose age is greater than or equal to 60 years ".
1) Why have you added next lines to desired output?
ID Name Height age Name1 Height1 age1
1 A 1 30 A 1 30 - Age is less than 60 ?
9 E 2 70 E 2 70
9 E 2 70 E 2 70 - This is a duplicate ID ?!
You will get what you asked by next code:
PROC SQL;
CREATE TABLE look AS
SELECT distinct A.*, B.*
FROM AA AS A INNER JOIN BB AS B
ON A.ID=B.ID and min(A.age,B.age1) >= 60
order by A.ID
;
quit;
Result:
ID Name
ID Name Height Age Name1 Height1 Age1
7 | D | 2 | 60 | D | 2 | 60 |
9 | E | 2 | 70 | E | 2 | 70 |
Shmuel,
I have a condition where I want to have unique records (which can have no age limit restrictions) and second condition is if the age is >= 60 then I have a condition where I don't want to see any age that is below 60. since ID "D" has one record (age 59 below 60), I don't want to see that output at all.
Thanks !!
@buddha_d wrote:
Shmuel,
I have a condition where I want to have unique records (which can have no age limit restrictions) and second condition is if the age is >= 60 then I have a condition where I don't want to see any age that is below 60. since ID "D" has one record (age 59 below 60), I don't want to see that output at all.
Thanks !!
Next code fits your conditions:
/* first codition - age>= 60 and age2 >= 60 */
PROC SQL;
CREATE TABLE tmp1 AS
SELECT A.*, B.*
FROM AA AS A INNER JOIN BB AS B
ON A.ID=B.ID and A.age>= 60 and B.age1 >= 60
order by A.ID
;
create table tmp1a as
select distinct ID
from BB(where=(age1 < 60))
group by ID;
quit;
/* data tmp1; set tmp1; if age1 < 60 then delete; run; */
/* second condition - uniqe ID */
proc sort data=AA; by ID; run;
proc sort data=BB; by ID; run;
data tmp2;
merge AA(in=inA) bb(in=inB);
by ID;
if first.ID and last.ID;
run;
data look;
merge tmp1(in=in1) tmp1a(in=in1a) tmp2(in=in2) ;
by ID;
if (in1 and not in1a) or in2;
run;
proc sort data=look; by ID; run;
proc print data=look; run;
This does it:
Data AA;
Input ID Name$ Height age;
cards;
1 A 1 30
3 B 2 40
5 C 2 50
7 D 2 60
9 E 2 70
;
Data BB;
Input ID Name1$ height1 age1;
cards;
1 A 1 30
3 B 4 40
3 B 2 40
5 C 2 50
5 C 2 50
5 C 2 50
7 D 2 60
7 D 2 60
7 D 2 59
9 E 2 70
9 E 2 70
;
data want;
merge
aa
bb
;
by id;
if _n_ = 1
then do;
declare hash b (dataset:"bb (where=(age1 le 60))");
b.definekey("id");
b.definedone();
end;
if (first.id and last.id) or b.check() ne 0;
run;
proc print data=want noobs;
run;
Result:
D Name Height age Name1 height1 age1 1 A 1 30 A 1 30 9 E 2 70 E 2 70 9 E 2 70 E 2 70
Thanks Kurt
It can be done in a single data step, like this:
data want;
merge aa(in=in1) bb(in=in2) bb(where=(age1<60) in=young);
by ID;
if young then
if first.id and last.id;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.