Solved: By Group - dominant occurances

Reddi · Posted 03-12-2017 04:41 AM

Hi i have a situation where i need to select only the transactions from each ID where the no.of transactions made to a perticular country is more if ties then latest transaction date decides.

data have;
infile cards;
informat trxn_date ddmmyy10.;
format trxn_date ddmmyy10.;
input ID Country_from $ country_to $ trxn_date amount ;
cards;
1 US IND 01-03-2017 387
1 US IND 02-03-2017 388
1 US IND 03-03-2017 514
1 US UK 04-03-2017 889
1 US UK 05-03-2017 987
1 US UK 06-03-2017 690
1 US UK 07-03-2017 158
1 US AUS 08-03-2017 341
1 US AUS 09-03-2017 247
1 US CAN 10-03-2017 699
1 US CAN 11-03-2017 623
2 UK IND 04-03-2017 627
2 UK IND 05-03-2017 822
2 UK IND 06-03-2017 724
2 UK US 07-03-2017 231
2 UK US 08-03-2017 437
2 UK US 04-03-2017 193
2 UK AUS 05-03-2017 922
2 UK AUS 06-03-2017 750
2 UK AUS 07-03-2017 123
2 UK CAN 08-03-2017 920
2 UK CAN 08-03-2017 587
;;
run;

Expected Output :

1 US UK 04-03-2017 889
1 US UK 05-03-2017 987
1 US UK 06-03-2017 690
1 US UK 07-03-2017 158

2 UK US 07-03-2017 231
2 UK US 08-03-2017 437
2 UK US 04-03-2017 193

mkeintz · Posted 03-12-2017 02:05 PM

The appearance of your output suggests you want all the transactions for a given id, that come from the most frequent COUNTRY_FROM/COUNTRY_TO pair. But it also appears, though not stated, that country_from is constant for each id. If so, then you really want a list of transactions that have the most frequent COUNTRY_TO.

And in the case of multiple country_to's with the same frequency choose the country with the latest transaction date. You provide no rules for when the tied countries also have ties in latest transaction date, so we'll ignore that for now.

From the look of your datasets, it seems that

HAVE is sorted by ID.
Within ID, it is grouped (but not sorted) by country_to.
Within country_to, your data is NOT sorted by date, but you wish to preserve original order

data have;
infile cards;
informat trxn_date ddmmyy10.;
format trxn_date yymmddn8.;
input ID Country_from $ country_to $ trxn_date amount ;
cards;
1 US IND 01-03-2017 387
1 US IND 02-03-2017 388
1 US IND 03-03-2017 514
1 US UK 04-03-2017 889
1 US UK 05-03-2017 987
1 US UK 06-03-2017 690
1 US UK 07-03-2017 158
1 US AUS 08-03-2017 341
1 US AUS 09-03-2017 247
1 US CAN 10-03-2017 699
1 US CAN 11-03-2017 623
2 UK IND 04-03-2017 627
2 UK IND 05-03-2017 822
2 UK IND 06-03-2017 724
2 UK US 07-03-2017 231
2 UK US 08-03-2017 437
2 UK US 04-03-2017 193
2 UK AUS 05-03-2017 922
2 UK AUS 06-03-2017 750
2 UK AUS 07-03-2017 123
2 UK CAN 08-03-2017 920
2 UK CAN 08-03-2017 587
;;
run;


data want (drop=_:);

  do until (last.id);
    do _nt=1 by 1 until (last.country_to);
      set have;
      by id country_to notsorted;
	  if _nt=1 then _to_maxdate=trxn_date;
	  else _to_maxdate=max(_to_maxdate,trxn_date);
    end;

    if (_nt>_maxnt) or (_nt=_maxnt and _to_maxdate>_id_maxdate) then do;
      _maxnt=_nt;
      _max_to=country_to;
      _id_maxdate=_to_maxdate;
    end;
  end;

  do until (last.id);
    do until(last.country_to);
      set have;
      by id country_to notsorted;
      if country_to=_max_to then output;
    end;
  end;
run;

Notes:

This program has two DO .... until(last.id) loops.

EAch has a "set have; by id country_to notsorted;" pair of statements telling SAS to expect the data to be grouped by id/country_to, but not necessarily in ascending order. If that is not a secure assumption, then presort your data by id/country_to.
The first loop reads all the records for and ID, one country_to at a time. It identifies the country_to (variable _MAX_TO) that satisfies the criteria (highest _MAXNT, etc.)
The second loop re-reads the very same records and output those records whose country_to matches the _max_to variable.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

Ksharp · Posted 03-12-2017 06:25 AM

Assuming I understood what you mean.




 
data have;
infile cards;
informat trxn_date ddmmyy10.;
format trxn_date ddmmyy10.;
input ID Country_from $ country_to $ trxn_date amount ;
cards;
1 US IND 01-03-2017 387
1 US IND 02-03-2017 388
1 US IND 03-03-2017 514
1 US UK 04-03-2017 889
1 US UK 05-03-2017 987
1 US UK 06-03-2017 690
1 US UK 07-03-2017 158
1 US AUS 08-03-2017 341
1 US AUS 09-03-2017 247
1 US CAN 10-03-2017 699
1 US CAN 11-03-2017 623
2 UK IND 04-03-2017 627
2 UK IND 05-03-2017 822
2 UK IND 06-03-2017 724
2 UK US 07-03-2017 231
2 UK US 08-03-2017 437
2 UK US 04-03-2017 193
2 UK AUS 05-03-2017 922
2 UK AUS 06-03-2017 750
2 UK AUS 07-03-2017 123
2 UK CAN 08-03-2017 920
2 UK CAN 08-03-2017 587
;;
run;
 data temp;
  set have;
  from=country_from;
  to=country_to;
  call sortc(from,to);
run;
proc sql;
select *
 from temp
  group by from,to
   having count(distinct id) ne 1;
quit;

Reddi · Posted 03-12-2017 07:02 AM

Hi @Ksharp,

Thanks for the reply i think you got the list of countires where more than one ID's had the transactions. But my issue is i need to select a list of all the transactions made by each ID to a unique country and the unique country must be selected based on the no.of transactions(more) , if thereare more than one country having the maximum count then the latest transaction date would decide(country having the latest transaction).

Hope it make sense.

atzamis · Posted 03-12-2017 01:16 PM

Hi,

you can try this (I wish I could find something that is not as lengthy though)

proc summary data=have missing nway;
class ID Country_from  country_to;
var trxn_date ;
output out=temp (drop=_type_ rename=(_freq_=no_trans)) max(trxn_date)=max_date;
run;

proc sort data=temp ;
by id  descending no_trans descending  max_date;
run;

data temp (drop=max_date no_trans );
set temp ;
      by id  descending no_trans descending  max_date;
                  if first.id ;
run;

proc sort data=temp;
     by id Country_from  country_to  ;
run;
proc sort data=have;
    by id Country_from  country_to ;
run;

data have;
merge have (in=A) temp (in=b);
  by id Country_from  country_to ;
        if a and b;
run;

proc datasets lib=work nolist;
     delete temp;
quit;
proc print data=have noobs;
run;

mkeintz · Posted 03-12-2017 02:05 PM

The appearance of your output suggests you want all the transactions for a given id, that come from the most frequent COUNTRY_FROM/COUNTRY_TO pair. But it also appears, though not stated, that country_from is constant for each id. If so, then you really want a list of transactions that have the most frequent COUNTRY_TO.

And in the case of multiple country_to's with the same frequency choose the country with the latest transaction date. You provide no rules for when the tied countries also have ties in latest transaction date, so we'll ignore that for now.

From the look of your datasets, it seems that

HAVE is sorted by ID.
Within ID, it is grouped (but not sorted) by country_to.
Within country_to, your data is NOT sorted by date, but you wish to preserve original order

data have;
infile cards;
informat trxn_date ddmmyy10.;
format trxn_date yymmddn8.;
input ID Country_from $ country_to $ trxn_date amount ;
cards;
1 US IND 01-03-2017 387
1 US IND 02-03-2017 388
1 US IND 03-03-2017 514
1 US UK 04-03-2017 889
1 US UK 05-03-2017 987
1 US UK 06-03-2017 690
1 US UK 07-03-2017 158
1 US AUS 08-03-2017 341
1 US AUS 09-03-2017 247
1 US CAN 10-03-2017 699
1 US CAN 11-03-2017 623
2 UK IND 04-03-2017 627
2 UK IND 05-03-2017 822
2 UK IND 06-03-2017 724
2 UK US 07-03-2017 231
2 UK US 08-03-2017 437
2 UK US 04-03-2017 193
2 UK AUS 05-03-2017 922
2 UK AUS 06-03-2017 750
2 UK AUS 07-03-2017 123
2 UK CAN 08-03-2017 920
2 UK CAN 08-03-2017 587
;;
run;


data want (drop=_:);

  do until (last.id);
    do _nt=1 by 1 until (last.country_to);
      set have;
      by id country_to notsorted;
	  if _nt=1 then _to_maxdate=trxn_date;
	  else _to_maxdate=max(_to_maxdate,trxn_date);
    end;

    if (_nt>_maxnt) or (_nt=_maxnt and _to_maxdate>_id_maxdate) then do;
      _maxnt=_nt;
      _max_to=country_to;
      _id_maxdate=_to_maxdate;
    end;
  end;

  do until (last.id);
    do until(last.country_to);
      set have;
      by id country_to notsorted;
      if country_to=_max_to then output;
    end;
  end;
run;

Notes:

This program has two DO .... until(last.id) loops.

EAch has a "set have; by id country_to notsorted;" pair of statements telling SAS to expect the data to be grouped by id/country_to, but not necessarily in ascending order. If that is not a secure assumption, then presort your data by id/country_to.
The first loop reads all the records for and ID, one country_to at a time. It identifies the country_to (variable _MAX_TO) that satisfies the criteria (highest _MAXNT, etc.)
The second loop re-reads the very same records and output those records whose country_to matches the _max_to variable.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

By Group - dominant occurances

Re: By Group - dominant occurances

Re: By Group - dominant occurances

Re: By Group - dominant occurances

Re: By Group - dominant occurances

Re: By Group - dominant occurances

By Group - dominant occurances

Re: By Group - dominant occurances

Re: By Group - dominant occurances

Re: By Group - dominant occurances

Re: By Group - dominant occurances

Re: By Group - dominant occurances

SAS Innovate 2025: Call for Content

Click image to register for webinar

Classroom Training Available!