Re: Deleted and take the difference

hk2013 · Posted 06-29-2018 01:03 PM

I have two data sets with IDs and Cust_id data_1 is my main data set its a snapshot of our customer list at the start of the year and i want to compare to data_2 which is updated monthly.

Data_2 might have new cust_id added over the time but I dont want to look at those.

Data_1 : data at start of year

ID Cust_id

1 111

1 222

1 333

1 444

2 555

2 666

2 777

2 888

3 999

3 1000

3 1001

3 1002

Data_2

ID Cust_id

1 111

1 222

1 1110

1 5786

2 555

2 8989

2 0986

3 999

Want:

Compare Data_1 to Data_2 and only keep cust_ids in data_2 that are also in Data_1 and than get the diff

New_Data_2

ID Cust_id difference

1 111 2

1 222 2

2 555 3

3 999 3

difference = count(data_1.Cust_id) -count(data_2.Cust_id)

Kurt_Bremser · Posted 06-29-2018 01:19 PM

Are you sure you want sum() and not count()? Because I can't see where you'd get just 2 from those numbers.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

hk2013 · Posted 06-29-2018 01:39 PM

sorry yeah i meant count of the cust_id

novinosrin · Posted 06-29-2018 02:16 PM

data one;
input ID      Cust_id;
cards;
1          111
1          222
1          333
1         444
2          555
2          666
2          777
2          888
3         999
3        1000
3        1001
3        1002
;
data two;
input ID      Cust_id;
cards;
1          111
1          222
1          1110
1         5786
2          555
2          8989
2          0986
3         999
;
proc sql;
create table want as
select a.*,b.count-count(a.cust_id) as difference
from two a, (select *,count(cust_id) as count from one group by id) b
where a.cust_id=b.cust_id and a.id=b.id
group by a.id;
quit;

novinosrin · Posted 06-29-2018 02:37 PM

data one;
input ID      Cust_id;
cards;
1          111
1          222
1          333
1         444
2          555
2          666
2          777
2          888
3         999
3        1000
3        1001
3        1002
;
data two;
input ID      Cust_id;
cards;
1          111
1          222
1          1110
1         5786
2          555
2          8989
2          0986
3         999
;

data want;
if _n_=1 then do;
   dcl hash h(dataset:'two', multidata: 'y');
   h.definekey('id','cust_id');
    h.definedone();
end;
_c=0;
do _n=1 by 1 until(last.id);
set one;
by ID    Cust_id;
if h.check()=0 then  _c+1;
end;
do until(last.id);
set one;
by ID    Cust_id;
if h.check()=0 then do;count=_n-_c;output;end;
end;
drop _:;
run;

PGStats · Posted 06-30-2018 12:54 AM

Can also be done this way with a left join:

proc sql;
select 
    one.*,
    sum(missing(two.cust_id)) as difference
from 
    one left join two on one.id=two.id and one.cust_id=two.cust_id
group by one.id
having not missing(two.cust_id);
quit;

PG

Deleted and take the difference