In search of an efficient way to deal with ID<>ID records

eric_lecoutre_businessdecision_com · Posted 08-29-2011 09:34 AM

Hey SAS folks,

I will have to work with Telco CDR data,that is Call Data Record ; basically, one phone number associated with one phone number and details on communication (such as operator).

The table is sorted according to time (as one can expect -- this is tracking).

From those data, I have to extract a list of distinct phone numbers and their latests status (say phone operator).

I can think of reading data, outputing per CDR two records (phone numbers) with the date and perform a PROC SORT nodedupkey on it.

I was wondering wether there is a way to perform the task reading only once the source data.

In particular, is there the place where hashkey could help?

Sample code and what needed:

DATA CDR;
INPUT phone1 phone2 date:date9. phone1_status:$1. phone2_status:$1.;
format date date9.;
CARDS;
1 2 01JAN2007 A B
1 3 02JAN2007 A B
2 3 03JAN2007 A B
2 4 04JAN2007 C A
4 1 05JAN2007 C B
RUN;

* should already be sorted;
proc sort data=cdr; by date; run;
* ;

data phones_dup(keep=phone status date);
set cdr;
phone=phone1; status=phone1_status; output;
phone=phone2; status=phone2_status; output;
run;

proc sort data=phones_dup out=phones_dup;
by phone date;
run;
data phones;
set phones_dup; by phone; if last.phone;
run;
proc print; run;

Thank you for any insight,

Eric

DLing · Posted 08-29-2011 10:51 AM

Your code is: input -> sort -> split -> sort -> find last

The split operation is not dependent on the sort. Maybe this will work better:

input+split all in one step -> sort -> find last

DATA CDR1(keep=phone1 phone2 status1 status2 date)

CDR2(keep=phone status date)

;

INPUT phone1 phone2 date:date9. status1:$1. status2:$1.;
format date date9.;

output cdr1;

phone=phone1; status=status1; output cdr2;

phone=phone2; status=status2; output cdr2;

CARDS;
1 2 01JAN2007 A B
1 3 02JAN2007 A B
2 3 03JAN2007 A B
2 4 04JAN2007 C A
4 1 05JAN2007 C B
RUN;

You don't have to create cdr1 if it is no longer required.

Then you can sort CDR2 and find last, or use PROC SQL.

Ksharp · Posted 08-29-2011 10:32 PM

Yes. there is a way for Hash Table. But why do you want to do this so.

Your code is very good.

DATA CDR;
INPUT phone1 phone2 date:date9. phone1_status:$1. phone2_status:$1.;
format date date9.;
CARDS;
1 2 01JAN2007 A B
1 3 02JAN2007 A B
2 3 03JAN2007 A B
2 4 04JAN2007 C A
4 1 05JAN2007 C B
;
RUN;
data want(keep=phone status date);
 declare hash ha(hashexp: 16,ordered: 'D',multidata: 'Y');
 declare hiter hi('ha');
  ha.definekey('phone','date');
  ha.definedata('phone','date','status');
  ha.definedone();

 do until(last);
  set cdr end=last;
  phone=phone1;status=phone1_status;ha.add();
  phone=phone2;status=phone2_status;ha.add();
 end;

 rc=hi.first(); pre_phone=.;
 do while(rc=0);
  if phone ne pre_phone then output;
  pre_phone=phone;
  rc=hi.next();
 end;
stop;
run;

Ksharp

MikeZdeb · Posted 08-29-2011 11:46 PM

Hi ... here's another idea ...

DATA CDR;

INPUT phone1 phone2 date:date9. phone1_status:$1. phone2_status:$1.;

format date date9.;

CARDS;

1 2 01JAN2007 A B

1 3 02JAN2007 A B

2 3 03JAN2007 A B

2 4 04JAN2007 C A

4 1 05JAN2007 C B

RUN;

proc sql;

create table new as

select *

from (select phone1 as phone, phone1_status as status, date

from cdr

union

select phone2 as phone, phone2_status as status, date

from cdr)

group phone

having date eq max(date);

quit;

proc print data=new noobs;

run;

phone status date

1 B 05JAN2007

2 C 04JAN2007

3 B 03JAN2007

4 C 05JAN2007

eric_lecoutre_businessdecision_com · Posted 08-30-2011 04:05 AM

Thanks for all your input;

I will try the different versions to test performance.

In particular I was thinking (hoping) hashkey could be more performant.

With regards to the high number of records, I fear a big SQL may not be the most performant but there again I have to test.

Eric