Solved: Re: join or merge and remove unwanted records

buddha_d · Posted 12-17-2020 12:59 AM

Dear community,

I have 2 huge datasets with millions of records. I want to simulate that in this small example.

I have dataset AA & BB. I want to get unique records (no duplicates) when I merge these two datasets with ID field. and I also want get the records whose age is greater than or equal to 60 years (even with duplicates whose age is greater than or equal to 60). If there are any other lower age with 60 years then I don't want to see them (like BB dataset has ID 'D' which has age of 60 & 59).

Data AA;
Input ID Name$ Height age;
cards;
1 A 1 30
3 B 2 40
5 C 2 50
7 D 2 60
9 E 2 70
;
run;

Data BB;
Input ID Name1$ height1 age1;
cards;
1 A 1 30
3 B 4 40
3 B 2 40
5 C 2 50
5 C 2 50
5 C 2 50
7 D 2 60
7 D 2 60
7 D 2 59
9 E 2 70
9 E 2 70
;
run;

PROC SQL;
	CREATE TABLE LOOK AS 
	SELECT A.*, B.* 
	FROM AA AS A INNER JOIN BB AS B ON A.ID=B.ID 
	;
QUIT;

DATA ONE TWO;
	SET LOOK ;
	BY ID; 
	IF AGE >= 60 and age1 >=60 THEN OUTPUT ONE;
	ELSE IF FIRST.ID AND LAST.ID THEN OUTPUT ONE;
	ELSE OUTPUT TWO;
RUN;

The desired output should look like below.

ID Name Height age Name1 Height1 age1
1 A 1 30 A 1 30
9 E 2 70 E 2 70
9 E 2 70 E 2 70

Any help is much appreciated. Thanks in advance.

Kurt_Bremser · Posted 12-17-2020 04:15 AM

This does it:

Data AA;
Input ID Name$ Height age;
cards;
1 A 1 30
3 B 2 40
5 C 2 50
7 D 2 60
9 E 2 70
;

Data BB;
Input ID Name1$ height1 age1;
cards;
1 A 1 30
3 B 4 40
3 B 2 40
5 C 2 50
5 C 2 50
5 C 2 50
7 D 2 60
7 D 2 60
7 D 2 59
9 E 2 70
9 E 2 70
;

data want;
merge
  aa
  bb
;
by id;
if _n_ = 1
then do;
  declare hash b (dataset:"bb (where=(age1 le 60))");
  b.definekey("id");
  b.definedone();
end;
if (first.id and last.id) or b.check() ne 0;
run;

proc print data=want noobs;
run;

Result:

D	Name	Height	age	Name1	height1	age1
1	A	1	30	A	1	30
9	E	2	70	E	2	70
9	E	2	70	E	2	70

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

PeterClemmensen · Posted 12-17-2020 01:03 AM

I don't follow this logic. Why should ID = 1 be in the result but ID = 2 should not?

The DATA to DATA Step Macro
Blog: SASnrd

buddha_d · Posted 12-17-2020 01:10 AM

This is just made up data and I picked odd numbers as IDs. There is no logic why ID=2 is not there. But, the point here is I want unique numbers (when merged both datasets) as well as any records which has some duplicates with age is equal or greater than 60 . At the same time, if there are any ages below 60 along with age of 60 then I don't want to see it.
Please let me know if you need more clarification

PeterClemmensen · Posted 12-17-2020 01:26 AM

Please post the exact result that you expect from the posted sample data. Makes it much easier to provide a usable code answer.

The DATA to DATA Step Macro
Blog: SASnrd

buddha_d · Posted 12-17-2020 01:31 AM

ID Name Height age Name1 Height1 age1
1 A 1 30 A 1 30
9 E 2 70 E 2 70
9 E 2 70 E 2 70

PeterClemmensen · Posted 12-17-2020 01:43 AM

Again, since this is your desired output, you must be able to explain why ID = 2 is not present and ID = 1 is.

The DATA to DATA Step Macro
Blog: SASnrd

buddha_d · Posted 12-17-2020 01:44 AM

The above dataset is the desired result.

Shmuel · Posted 12-17-2020 03:37 AM

You asked for: "unique records (no duplicates) when I merge these two datasets with ID field. and I also want get the records whose age is greater than or equal to 60 years ".

1) Why have you added next lines to desired output?

ID Name Height age Name1 Height1 age1
1 A 1 30 A 1 30 - Age is less than 60 ?
9 E 2 70 E 2 70
9 E 2 70 E 2 70 - This is a duplicate ID ?!

You will get what you asked by next code:

PROC SQL;
	CREATE TABLE look AS 
	SELECT distinct A.*, B.* 
	FROM AA AS A INNER JOIN BB AS B
	ON A.ID=B.ID and min(A.age,B.age1) >= 60
	order by A.ID
	;
quit;

Result:

ID Name

ID Name Height Age Name1 Height1 Age1

7

D

2

60

D

2

60

9

E

2

70

E

2

70

buddha_d · Posted 12-17-2020 07:37 AM

Shmuel,

I have a condition where I want to have unique records (which can have no age limit restrictions) and second condition is if the age is >= 60 then I have a condition where I don't want to see any age that is below 60. since ID "D" has one record (age 59 below 60), I don't want to see that output at all.
Thanks !!

Shmuel · Posted 12-17-2020 10:19 AM

@buddha_d wrote:

Shmuel,

I have a condition where I want to have unique records (which can have no age limit restrictions) and second condition is if the age is >= 60 then I have a condition where I don't want to see any age that is below 60. since ID "D" has one record (age 59 below 60), I don't want to see that output at all.
Thanks !!

Next code fits your conditions:

/* first codition - age>= 60 and age2 >= 60 */
PROC SQL;
	CREATE TABLE tmp1 AS 
	SELECT A.*, B.*
	FROM AA AS A INNER JOIN BB AS B
	ON A.ID=B.ID and A.age>= 60 and B.age1 >= 60
	order by A.ID
	;
	create table tmp1a as 
	select distinct ID	 
	from BB(where=(age1 < 60))
	group by ID;
quit;
/* data tmp1; set tmp1; if age1 < 60 then delete; run; */

/* second condition - uniqe ID */
proc sort data=AA; by ID; run;
proc sort data=BB; by ID; run;
data tmp2;
 merge AA(in=inA) bb(in=inB);
  by ID;
     if first.ID and last.ID;    
run;
data look;
 merge tmp1(in=in1) tmp1a(in=in1a) tmp2(in=in2) ;
   by ID;
      if (in1 and not in1a) or in2;
run;
proc sort data=look; by ID; run;
proc print data=look; run;

Kurt_Bremser · Posted 12-17-2020 04:15 AM

This does it:

Data AA;
Input ID Name$ Height age;
cards;
1 A 1 30
3 B 2 40
5 C 2 50
7 D 2 60
9 E 2 70
;

Data BB;
Input ID Name1$ height1 age1;
cards;
1 A 1 30
3 B 4 40
3 B 2 40
5 C 2 50
5 C 2 50
5 C 2 50
7 D 2 60
7 D 2 60
7 D 2 59
9 E 2 70
9 E 2 70
;

data want;
merge
  aa
  bb
;
by id;
if _n_ = 1
then do;
  declare hash b (dataset:"bb (where=(age1 le 60))");
  b.definekey("id");
  b.definedone();
end;
if (first.id and last.id) or b.check() ne 0;
run;

proc print data=want noobs;
run;

Result:

D	Name	Height	age	Name1	height1	age1
1	A	1	30	A	1	30
9	E	2	70	E	2	70
9	E	2	70	E	2	70

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

buddha_d · Posted 12-17-2020 05:44 PM

Thanks Kurt

s_lassen · Posted 12-17-2020 06:16 AM

It can be done in a single data step, like this:

data want;
  merge aa(in=in1) bb(in=in2) bb(where=(age1<60) in=young);
  by ID;
  if young then
    if first.id and last.id;
run;

Registration is open

SAS Training: Just a Click Away