I have posted a similar query in a different thread and have received some great responses. My requirements are now slightly different than what I initially had.
I have a sample dataset (below) that links an IP to an Account login attempt. I want to identify any IP (alongside the Account and Date) that has attempted to log into three or more accounts within any 7 day window.
data have;
infile cards dsd ;
input IP :$30. Account $ Date :datetime20.;
format date datetime20.;
cards;
108.950.45, 12345, 01MAY2020:11:39:55
108.950.45, 54321, 09MAY2020:12:55:22
108.950.45, 67890, 10MAY2020:03:00:02
108.950.45, 09876, 11MAY2020:06:14:03
108.950.45, 45879, 11MAY2020:12:37:35
108.950.45, 78562, 13MAY2020:13:13:09
108.950.45, 54321, 17MAY2020:07:12:03
108.950.45, 54686, 30MAY2020:10:00:02
220.540.40, 45879, 05MAY2020:09:09:15
220.540.40, 94532, 11MAY2020:03:41:16
220.540.40, 45879, 11MAY2020:14:00:58
110.640.35, 54873, 05MAY2020:17:22:01
110.640.35, 56454, 05MAY2020:19:22:05
110.640.35, 87654, 20MAY2020:05:14:12
;
My output would hence ideally read:
108.950.45, 54321, 09MAY2020:12:55:22
108.950.45, 54321, 10MAY2020:03:00:02
108.950.45, 09876, 11MAY2020:06:14:03
108.950.45, 45879, 11MAY2020:12:37:35
108.950.45, 78562, 13MAY2020:13:13:09
108.950.45, 67890, 17MAY2020:07:12:03
In the above output the following criteria is satisfied: (i) three or more unique Account IDs to the IP; (ii) login attempts occurred within a 7 day window. Even though rows with dates 09MAY2020 and 17MAY2020 are more than seven days a part, they are still associated to a 7 day streak.
Thank you!
I've coded up a little something; see code below. Here is what it ultimately produces in the last Data step:
The code is below. It's getting late, and I'm tired, so I won't try to walk you through the code right now. Please take a look at the code, and if you have any questions or encounter any problems, just post a response here in this thread, and I will reply tomorrow.
Jim
DATA Have;
INFILE Cards DSD DLM=',';
FORMAT IP $30.;
FORMAT _DateTm Datetime20.;
INPUT IP : $30.
_Account $
_DateTm : Datetime20.
;
Cards;
108.950.45, 12345, 01MAY2020:11:39:55
108.950.45, 54321, 09MAY2020:12:55:22
108.950.45, 67890, 10MAY2020:03:00:02
108.950.45, 09876, 11MAY2020:06:14:03
108.950.45, 45879, 11MAY2020:12:37:35
108.950.45, 78562, 13MAY2020:13:13:09
108.950.45, 54321, 17MAY2020:07:12:03
108.950.45, 54686, 30MAY2020:10:00:02
220.540.40, 45879, 05MAY2020:09:09:15
220.540.40, 94532, 11MAY2020:03:41:16
220.540.40, 45879, 11MAY2020:14:00:58
110.640.35, 54873, 05MAY2020:17:22:01
110.640.35, 56454, 05MAY2020:19:22:05
110.640.35, 87654, 20MAY2020:05:14:12
;
RUN;
**------------------------------------------------------------------------------**;
PROC SORT DATA= Have
OUT = Have_Sort
;
BY IP _DateTm;
RUN;
**------------------------------------------------------------------------------**;
DATA Identified_IPs;
DROP _:;
LENGTH IP $30;
RETAIN _Work_Count;
RETAIN Access_Count;
RETAIN _Save_Date;
RETAIN _Prior_Date;
RETAIN First_Date;
FORMAT _Save_Date YYMMDDD10.;
FORMAT _Prior_Date YYMMDDD10.;
FORMAT _Current_Date YYMMDDD10.;
FORMAT First_Date YYMMDDD10.;
SET Have_Sort;
BY IP;
_Current_Date = DATEPART(_DateTm);
IF _N_ = 1 THEN
DO;
DECLARE Hash Hsh_Accounts();
Hsh_Accounts.DEFINEKEY('_Account');
Hsh_Accounts.DEFINEDONE();
END;
IF FIRST.IP THEN
DO;
_Work_Count = 1;
Access_Count = 0;
_RC = Hsh_Accounts.CLEAR();
_RC = Hsh_Accounts.ADD();
_3_Or_More = 0;
_Prior_Date = _Current_Date;
_Save_Date = _Current_Date;
END;
ELSE
DO;
_Days_Apart = (_Current_Date - _Prior_Date);
IF _Days_Apart <= 7 THEN
DO;
_RC = Hsh_Accounts.ADD();
IF _RC = 0 THEN
_Work_Count + 1;
END;
ELSE
DO;
_Work_Count = 1;
_Save_Date = _Current_Date;
END;
_Prior_Date = _Current_Date;
END;
IF _Work_Count >= 3 THEN
DO;
_3_Or_More = 1;
First_Date = _Save_Date;
IF _Work_Count > Access_Count THEN
Access_Count = _Work_Count;
END;
IF LAST.IP AND
Access_Count >= 3 THEN
DO;
OUTPUT;
END;
ELSE
DO;
DELETE;
END;
RUN;
**------------------------------------------------------------------------------**;
DATA Identified_Records;
DROP _:;
DROP First_Date;
DROP Access_Count;
LENGTH IP $30;
RETAIN _Prior_Date;
FORMAT First_Date YYMMDDD10.;
FORMAT _Prior_Date YYMMDDD10.;
FORMAT _Current_Date YYMMDDD10.;
IF _N_ = 1 THEN
DO;
IF 0 THEN
SET WORK.Identified_IPs;
DECLARE Hash Hsh_IP_Look_Up(dataset: 'WORK.Identified_IPs');
Hsh_IP_Look_Up.DEFINEKEY('IP');
Hsh_IP_Look_Up.DEFINEDATA('First_Date');
Hsh_IP_Look_Up.DEFINEDONE();
END;
SET Have_Sort(RENAME=(_Account=Account _DateTm=DateTm));
BY IP;
_Current_Date = DATEPART(DateTm);
IF FIRST.IP THEN
DO;
_RC = Hsh_IP_Look_Up.FIND();
IF _RC = 0 THEN
IF _Current_Date = First_Date THEN
DO;
OUTPUT;
END;
ELSE
DO;
END;
ELSE
DO;
END;
_Prior_Date = _Current_Date;
END;
ELSE
DO;
_RC = Hsh_IP_Look_Up.FIND();
IF _RC = 0 THEN
IF _Current_Date = First_Date THEN
DO;
OUTPUT;
END;
ELSE
DO;
_Days_Apart = (_Current_Date - _Prior_Date);
IF _Days_Apart <= 7 THEN
DO;
_RC = Hsh_IP_Look_Up.FIND();
IF _RC = 0 THEN
DO;
OUTPUT;
END;
ELSE
DO;
END;
END;
ELSE
DO;
END;
END;
ELSE
DO;
END;
_Prior_Date = _Current_Date;
END;
DELETE;
RUN;
Something like this?
data have;
infile cards dsd ;
input IP :$30. Account $ Date :datetime20.;
format date datetime20.;
cards;
108.950.45, 12345, 01MAY2020:11:39:55
108.950.45, 54321, 09MAY2020:12:55:22
108.950.45, 67890, 10MAY2020:03:00:02
108.950.45, 09876, 11MAY2020:06:14:03
108.950.45, 45879, 11MAY2020:12:37:35
108.950.45, 78562, 13MAY2020:13:13:09
108.950.45, 54321, 27MAY2020:07:12:03
108.950.45, 54686, 30MAY2020:10:00:02
220.540.40, 45879, 05MAY2020:09:09:15
220.540.40, 94532, 11MAY2020:03:41:16
220.540.40, 45879, 11MAY2020:14:00:58
110.640.35, 54873, 05MAY2020:17:22:01
110.640.35, 56454, 05MAY2020:19:22:05
110.640.35, 87654, 20MAY2020:05:14:12
run;
proc sql;
create table SEVEN_DAY_FILTER as
select unique a.*
from HAVE a, HAVE b
where a.IP=b.IP
and a.DATE ne b.DATE
and -8 < intck('dtday',a.DATE,b.DATE) < 8
order by a.IP, a.DATE;
quit;
data DATE_GROUPS;
set SEVEN_DAY_FILTER;
by IP;
if intck('dtday',lag(DATE),DATE) > 7 then GROUP+1;
else if first.IP then GROUP+1;
run;
proc sql;
create table THREE_ACCT_FILTER as
select *
from DATE_GROUPS
group by GROUP
having count(distinct ACCOUNT) > 2
order by IP, DATE;
quit;
IP | Account | Date |
---|---|---|
108.950.45 | 54321 | 09MAY2020:12:55:22 |
108.950.45 | 67890 | 10MAY2020:03:00:02 |
108.950.45 | 09876 | 11MAY2020:06:14:03 |
108.950.45 | 45879 | 11MAY2020:12:37:35 |
108.950.45 | 78562 | 13MAY2020:13:13:09 |
Note: I changed one date to validate the steps.
I was thinking the same approach, but you beat me to it.
We often spend so much effort preventing an unwanted Cartesian product in a multi-table join, that we forget that a Cartesian product, usually with a table self-join, can be our friend.
Having every row within a key combined with every other row within that key, then filtering as required, can be very powerful.
Edit: as long as the table isn't, say, 100M rows 😉
Another (probably better) way:
data GROUPS;
do GROUP='01jan2020'd to '24dec2020'd;
START=dhms(GROUP,0,0,0);
END =dhms(GROUP+7,0,0,0)-1;
output;
end;
format START END datetime.;
run;
proc sql;
create table WANT as
select unique HAVE.*
from HAVE
, GROUPS
where DATE between START and END
group by IP, GROUP
having count(distinct ACCOUNT) > 2
order by IP, DATE;
@wylie_ma Does this answer your question?
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.