I have posted a similar query in a different thread and have received some great responses. My requirements are now slightly different than what I initially had.
I have a sample dataset (below) that links an IP to an Account login attempt. I want to identify any IP (alongside the Account and Date) that has attempted to log into three or more accounts within any 7 day window.
data have;
infile cards dsd ;
input IP :$30. Account $ Date :datetime20.;
format date datetime20.;
cards;
108.950.45, 12345, 01MAY2020:11:39:55
108.950.45, 54321, 09MAY2020:12:55:22
108.950.45, 67890, 10MAY2020:03:00:02
108.950.45, 09876, 11MAY2020:06:14:03
108.950.45, 45879, 11MAY2020:12:37:35
108.950.45, 78562, 13MAY2020:13:13:09
108.950.45, 54321, 17MAY2020:07:12:03
108.950.45, 54686, 30MAY2020:10:00:02
220.540.40, 45879, 05MAY2020:09:09:15
220.540.40, 94532, 11MAY2020:03:41:16
220.540.40, 45879, 11MAY2020:14:00:58
110.640.35, 54873, 05MAY2020:17:22:01
110.640.35, 56454, 05MAY2020:19:22:05
110.640.35, 87654, 20MAY2020:05:14:12
;
My output would hence ideally read:
108.950.45, 54321, 09MAY2020:12:55:22
108.950.45, 54321, 10MAY2020:03:00:02
108.950.45, 09876, 11MAY2020:06:14:03
108.950.45, 45879, 11MAY2020:12:37:35
108.950.45, 78562, 13MAY2020:13:13:09
108.950.45, 67890, 17MAY2020:07:12:03
In the above output the following criteria is satisfied: (i) three or more unique Account IDs to the IP; (ii) login attempts occurred within a 7 day window. Even though rows with dates 09MAY2020 and 17MAY2020 are more than seven days a part, they are still associated to a 7 day streak.
Thank you!
I've coded up a little something; see code below. Here is what it ultimately produces in the last Data step:
The code is below. It's getting late, and I'm tired, so I won't try to walk you through the code right now. Please take a look at the code, and if you have any questions or encounter any problems, just post a response here in this thread, and I will reply tomorrow.
Jim
DATA Have;
INFILE Cards DSD DLM=',';
FORMAT IP $30.;
FORMAT _DateTm Datetime20.;
INPUT IP : $30.
_Account $
_DateTm : Datetime20.
;
Cards;
108.950.45, 12345, 01MAY2020:11:39:55
108.950.45, 54321, 09MAY2020:12:55:22
108.950.45, 67890, 10MAY2020:03:00:02
108.950.45, 09876, 11MAY2020:06:14:03
108.950.45, 45879, 11MAY2020:12:37:35
108.950.45, 78562, 13MAY2020:13:13:09
108.950.45, 54321, 17MAY2020:07:12:03
108.950.45, 54686, 30MAY2020:10:00:02
220.540.40, 45879, 05MAY2020:09:09:15
220.540.40, 94532, 11MAY2020:03:41:16
220.540.40, 45879, 11MAY2020:14:00:58
110.640.35, 54873, 05MAY2020:17:22:01
110.640.35, 56454, 05MAY2020:19:22:05
110.640.35, 87654, 20MAY2020:05:14:12
;
RUN;
**------------------------------------------------------------------------------**;
PROC SORT DATA= Have
OUT = Have_Sort
;
BY IP _DateTm;
RUN;
**------------------------------------------------------------------------------**;
DATA Identified_IPs;
DROP _:;
LENGTH IP $30;
RETAIN _Work_Count;
RETAIN Access_Count;
RETAIN _Save_Date;
RETAIN _Prior_Date;
RETAIN First_Date;
FORMAT _Save_Date YYMMDDD10.;
FORMAT _Prior_Date YYMMDDD10.;
FORMAT _Current_Date YYMMDDD10.;
FORMAT First_Date YYMMDDD10.;
SET Have_Sort;
BY IP;
_Current_Date = DATEPART(_DateTm);
IF _N_ = 1 THEN
DO;
DECLARE Hash Hsh_Accounts();
Hsh_Accounts.DEFINEKEY('_Account');
Hsh_Accounts.DEFINEDONE();
END;
IF FIRST.IP THEN
DO;
_Work_Count = 1;
Access_Count = 0;
_RC = Hsh_Accounts.CLEAR();
_RC = Hsh_Accounts.ADD();
_3_Or_More = 0;
_Prior_Date = _Current_Date;
_Save_Date = _Current_Date;
END;
ELSE
DO;
_Days_Apart = (_Current_Date - _Prior_Date);
IF _Days_Apart <= 7 THEN
DO;
_RC = Hsh_Accounts.ADD();
IF _RC = 0 THEN
_Work_Count + 1;
END;
ELSE
DO;
_Work_Count = 1;
_Save_Date = _Current_Date;
END;
_Prior_Date = _Current_Date;
END;
IF _Work_Count >= 3 THEN
DO;
_3_Or_More = 1;
First_Date = _Save_Date;
IF _Work_Count > Access_Count THEN
Access_Count = _Work_Count;
END;
IF LAST.IP AND
Access_Count >= 3 THEN
DO;
OUTPUT;
END;
ELSE
DO;
DELETE;
END;
RUN;
**------------------------------------------------------------------------------**;
DATA Identified_Records;
DROP _:;
DROP First_Date;
DROP Access_Count;
LENGTH IP $30;
RETAIN _Prior_Date;
FORMAT First_Date YYMMDDD10.;
FORMAT _Prior_Date YYMMDDD10.;
FORMAT _Current_Date YYMMDDD10.;
IF _N_ = 1 THEN
DO;
IF 0 THEN
SET WORK.Identified_IPs;
DECLARE Hash Hsh_IP_Look_Up(dataset: 'WORK.Identified_IPs');
Hsh_IP_Look_Up.DEFINEKEY('IP');
Hsh_IP_Look_Up.DEFINEDATA('First_Date');
Hsh_IP_Look_Up.DEFINEDONE();
END;
SET Have_Sort(RENAME=(_Account=Account _DateTm=DateTm));
BY IP;
_Current_Date = DATEPART(DateTm);
IF FIRST.IP THEN
DO;
_RC = Hsh_IP_Look_Up.FIND();
IF _RC = 0 THEN
IF _Current_Date = First_Date THEN
DO;
OUTPUT;
END;
ELSE
DO;
END;
ELSE
DO;
END;
_Prior_Date = _Current_Date;
END;
ELSE
DO;
_RC = Hsh_IP_Look_Up.FIND();
IF _RC = 0 THEN
IF _Current_Date = First_Date THEN
DO;
OUTPUT;
END;
ELSE
DO;
_Days_Apart = (_Current_Date - _Prior_Date);
IF _Days_Apart <= 7 THEN
DO;
_RC = Hsh_IP_Look_Up.FIND();
IF _RC = 0 THEN
DO;
OUTPUT;
END;
ELSE
DO;
END;
END;
ELSE
DO;
END;
END;
ELSE
DO;
END;
_Prior_Date = _Current_Date;
END;
DELETE;
RUN;
Something like this?
data have;
infile cards dsd ;
input IP :$30. Account $ Date :datetime20.;
format date datetime20.;
cards;
108.950.45, 12345, 01MAY2020:11:39:55
108.950.45, 54321, 09MAY2020:12:55:22
108.950.45, 67890, 10MAY2020:03:00:02
108.950.45, 09876, 11MAY2020:06:14:03
108.950.45, 45879, 11MAY2020:12:37:35
108.950.45, 78562, 13MAY2020:13:13:09
108.950.45, 54321, 27MAY2020:07:12:03
108.950.45, 54686, 30MAY2020:10:00:02
220.540.40, 45879, 05MAY2020:09:09:15
220.540.40, 94532, 11MAY2020:03:41:16
220.540.40, 45879, 11MAY2020:14:00:58
110.640.35, 54873, 05MAY2020:17:22:01
110.640.35, 56454, 05MAY2020:19:22:05
110.640.35, 87654, 20MAY2020:05:14:12
run;
proc sql;
create table SEVEN_DAY_FILTER as
select unique a.*
from HAVE a, HAVE b
where a.IP=b.IP
and a.DATE ne b.DATE
and -8 < intck('dtday',a.DATE,b.DATE) < 8
order by a.IP, a.DATE;
quit;
data DATE_GROUPS;
set SEVEN_DAY_FILTER;
by IP;
if intck('dtday',lag(DATE),DATE) > 7 then GROUP+1;
else if first.IP then GROUP+1;
run;
proc sql;
create table THREE_ACCT_FILTER as
select *
from DATE_GROUPS
group by GROUP
having count(distinct ACCOUNT) > 2
order by IP, DATE;
quit;
IP | Account | Date |
---|---|---|
108.950.45 | 54321 | 09MAY2020:12:55:22 |
108.950.45 | 67890 | 10MAY2020:03:00:02 |
108.950.45 | 09876 | 11MAY2020:06:14:03 |
108.950.45 | 45879 | 11MAY2020:12:37:35 |
108.950.45 | 78562 | 13MAY2020:13:13:09 |
Note: I changed one date to validate the steps.
I was thinking the same approach, but you beat me to it.
We often spend so much effort preventing an unwanted Cartesian product in a multi-table join, that we forget that a Cartesian product, usually with a table self-join, can be our friend.
Having every row within a key combined with every other row within that key, then filtering as required, can be very powerful.
Edit: as long as the table isn't, say, 100M rows 😉
Another (probably better) way:
data GROUPS;
do GROUP='01jan2020'd to '24dec2020'd;
START=dhms(GROUP,0,0,0);
END =dhms(GROUP+7,0,0,0)-1;
output;
end;
format START END datetime.;
run;
proc sql;
create table WANT as
select unique HAVE.*
from HAVE
, GROUPS
where DATE between START and END
group by IP, GROUP
having count(distinct ACCOUNT) > 2
order by IP, DATE;
@wylie_ma Does this answer your question?
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.