BookmarkSubscribeRSS Feed
wylie_ma
Fluorite | Level 6

I have posted a similar query in a different thread and have received some great responses. My requirements are now slightly different than what I initially had.

 

I have a sample dataset (below) that links an IP to an Account login attempt. I want to identify any IP (alongside the Account and Date) that has attempted to log into three or more accounts within any 7 day window. 

 

data have;
infile cards dsd ;
input IP :$30. Account $ Date :datetime20.;
format date datetime20.;
cards;
108.950.45, 12345, 01MAY2020:11:39:55
108.950.45, 54321, 09MAY2020:12:55:22
108.950.45, 67890, 10MAY2020:03:00:02
108.950.45, 09876, 11MAY2020:06:14:03
108.950.45, 45879, 11MAY2020:12:37:35
108.950.45, 78562, 13MAY2020:13:13:09
108.950.45, 54321, 17MAY2020:07:12:03
108.950.45, 54686, 30MAY2020:10:00:02
220.540.40, 45879, 05MAY2020:09:09:15
220.540.40, 94532, 11MAY2020:03:41:16
220.540.40, 45879, 11MAY2020:14:00:58
110.640.35, 54873, 05MAY2020:17:22:01
110.640.35, 56454, 05MAY2020:19:22:05
110.640.35, 87654, 20MAY2020:05:14:12
;

 

My output would hence ideally read:

 

108.950.45, 54321, 09MAY2020:12:55:22
108.950.45, 54321, 10MAY2020:03:00:02
108.950.45, 09876, 11MAY2020:06:14:03
108.950.45, 45879, 11MAY2020:12:37:35
108.950.45, 78562, 13MAY2020:13:13:09
108.950.45, 67890, 17MAY2020:07:12:03

 

In the above output the following criteria is satisfied: (i) three or more unique Account IDs to the IP; (ii) login attempts occurred within a 7 day window. Even though rows with dates 09MAY2020 and 17MAY2020 are more than seven days a part, they are still associated to 7 day streak.

 

Thank you!

5 REPLIES 5
jimbarbour
Meteorite | Level 14

@wylie_ma,

 

I've coded up a little something; see code below.  Here is what it ultimately produces in the last Data step:

jimbarbour_0-1603343150247.png

 

The code is below.  It's getting late, and I'm tired, so I won't try to walk you through the code right now.  Please take a look at the code, and if you have any questions or encounter any problems, just post a response here in this thread, and I will reply tomorrow.

 

Jim

 

DATA	Have;
	INFILE	Cards		DSD	DLM=',';
	FORMAT	IP			$30.;
	FORMAT	_DateTm		Datetime20.;

	INPUT	IP		:	$30. 
			_Account	$
			_DateTm	:	Datetime20.
			;

Cards;
108.950.45, 12345, 01MAY2020:11:39:55
108.950.45, 54321, 09MAY2020:12:55:22
108.950.45, 67890, 10MAY2020:03:00:02
108.950.45, 09876, 11MAY2020:06:14:03
108.950.45, 45879, 11MAY2020:12:37:35
108.950.45, 78562, 13MAY2020:13:13:09
108.950.45, 54321, 17MAY2020:07:12:03
108.950.45, 54686, 30MAY2020:10:00:02
220.540.40, 45879, 05MAY2020:09:09:15
220.540.40, 94532, 11MAY2020:03:41:16
220.540.40, 45879, 11MAY2020:14:00:58
110.640.35, 54873, 05MAY2020:17:22:01
110.640.35, 56454, 05MAY2020:19:22:05
110.640.35, 87654, 20MAY2020:05:14:12
;
RUN;

**------------------------------------------------------------------------------**;

PROC	SORT	DATA=	Have
				OUT	=	Have_Sort
				;
	BY	IP	_DateTm;
RUN;

**------------------------------------------------------------------------------**;

DATA	Identified_IPs;
	DROP	_:;

	LENGTH	IP				$30;

	RETAIN	_Work_Count;
	RETAIN	Access_Count;
	RETAIN	_Save_Date;
	RETAIN	_Prior_Date;
	RETAIN	First_Date;

	FORMAT	_Save_Date		YYMMDDD10.;
	FORMAT	_Prior_Date		YYMMDDD10.;
	FORMAT	_Current_Date	YYMMDDD10.;
	FORMAT	First_Date		YYMMDDD10.;

	SET	Have_Sort;
		BY	IP;

	_Current_Date						=	DATEPART(_DateTm);

	IF	_N_								=	1						THEN
		DO;
			DECLARE		Hash		Hsh_Accounts();
				Hsh_Accounts.DEFINEKEY('_Account');
				Hsh_Accounts.DEFINEDONE();
		END;

	IF	FIRST.IP													THEN
		DO;
			_Work_Count					=	1;
			Access_Count				=	0;
			_RC							=	Hsh_Accounts.CLEAR();
			_RC							=	Hsh_Accounts.ADD();
			_3_Or_More					=	0;
			_Prior_Date					=	_Current_Date;
			_Save_Date					=	_Current_Date;
		END;
	ELSE
		DO;
			_Days_Apart					=	(_Current_Date	-	_Prior_Date);
			IF	_Days_Apart				<=	7						THEN
				DO;
					_RC					=	Hsh_Accounts.ADD();
					IF	_RC				=	0						THEN
						_Work_Count		+	1;
				END;
			ELSE
				DO;
					_Work_Count			=	1;
					_Save_Date			=	_Current_Date;
				END;
			_Prior_Date					=	_Current_Date;
		END;

	IF	_Work_Count						>=	3						THEN
		DO;
			_3_Or_More					=	1;
			First_Date					=	_Save_Date;
			IF	_Work_Count				>	Access_Count			THEN
				Access_Count			=	_Work_Count;
		END;

	IF	LAST.IP														AND
		Access_Count					>=	3						THEN
		DO;
			OUTPUT;
		END;
	ELSE
		DO;
			DELETE;
		END;
RUN;

**------------------------------------------------------------------------------**;

DATA	Identified_Records;
	DROP	_:;
	DROP	First_Date;
	DROP	Access_Count;

	LENGTH	IP				$30;

	RETAIN	_Prior_Date;

	FORMAT	First_Date		YYMMDDD10.;
	FORMAT	_Prior_Date		YYMMDDD10.;
	FORMAT	_Current_Date	YYMMDDD10.;

	IF	_N_								=	1						THEN
		DO;
			IF	0	THEN
				SET	WORK.Identified_IPs;
			DECLARE		Hash		Hsh_IP_Look_Up(dataset: 'WORK.Identified_IPs');
				Hsh_IP_Look_Up.DEFINEKEY('IP');
				Hsh_IP_Look_Up.DEFINEDATA('First_Date');
				Hsh_IP_Look_Up.DEFINEDONE();
		END;

	SET	Have_Sort(RENAME=(_Account=Account _DateTm=DateTm));
		BY	IP;

	_Current_Date						=	DATEPART(DateTm);

	IF	FIRST.IP													THEN
		DO;
			_RC							=	Hsh_IP_Look_Up.FIND();
			IF	_RC						=	0						THEN
				IF	_Current_Date		=	First_Date				THEN
					DO;
						OUTPUT;
					END;
				ELSE
					DO;
					END;
			ELSE
				DO;
				END;
			_Prior_Date					=	_Current_Date;
		END;
	ELSE
		DO;
			_RC							=	Hsh_IP_Look_Up.FIND();
			IF	_RC						=	0						THEN
				IF	_Current_Date		=	First_Date				THEN
					DO;
						OUTPUT;
					END;
				ELSE
					DO;
						_Days_Apart		=	(_Current_Date	-	_Prior_Date);
						IF	_Days_Apart	<=	7						THEN
							DO;
								_RC		=	Hsh_IP_Look_Up.FIND();
								IF	_RC	=	0						THEN
									DO;
										OUTPUT;
									END;
								ELSE
									DO;
									END;
							END;
						ELSE
							DO;
							END;
					END;
			ELSE
				DO;
				END;
			_Prior_Date					=	_Current_Date;
		END;

	DELETE;
RUN;
ChrisNZ
Tourmaline | Level 20

Something like this?

 

data have;
infile cards dsd ;
input IP :$30. Account $ Date :datetime20.;
format date datetime20.;
cards;
108.950.45, 12345, 01MAY2020:11:39:55
108.950.45, 54321, 09MAY2020:12:55:22
108.950.45, 67890, 10MAY2020:03:00:02
108.950.45, 09876, 11MAY2020:06:14:03
108.950.45, 45879, 11MAY2020:12:37:35
108.950.45, 78562, 13MAY2020:13:13:09
108.950.45, 54321, 27MAY2020:07:12:03
108.950.45, 54686, 30MAY2020:10:00:02
220.540.40, 45879, 05MAY2020:09:09:15
220.540.40, 94532, 11MAY2020:03:41:16
220.540.40, 45879, 11MAY2020:14:00:58
110.640.35, 54873, 05MAY2020:17:22:01
110.640.35, 56454, 05MAY2020:19:22:05
110.640.35, 87654, 20MAY2020:05:14:12
run;

proc sql;
  create table SEVEN_DAY_FILTER as
  select unique a.*   
  from HAVE a, HAVE b  
  where a.IP=b.IP
    and a.DATE ne b.DATE
    and -8 < intck('dtday',a.DATE,b.DATE) < 8
  order by a.IP, a.DATE;
quit;

data DATE_GROUPS;
  set SEVEN_DAY_FILTER;
  by IP;
  if intck('dtday',lag(DATE),DATE) > 7 then GROUP+1;
  else if first.IP then GROUP+1;
run;

proc sql;
  create table THREE_ACCT_FILTER as
  select *
  from DATE_GROUPS
  group by GROUP
  having count(distinct ACCOUNT) > 2
  order by IP, DATE;
quit;

IP Account Date
108.950.45 54321 09MAY2020:12:55:22
108.950.45 67890 10MAY2020:03:00:02
108.950.45 09876 11MAY2020:06:14:03
108.950.45 45879 11MAY2020:12:37:35
108.950.45 78562 13MAY2020:13:13:09

Note: I changed one date to validate the steps.

 

 

ScottBass
Rhodochrosite | Level 12

I was thinking the same approach, but you beat me to it.

 

We often spend so much effort preventing an unwanted Cartesian product in a multi-table join, that we forget that a Cartesian product, usually with a table self-join, can be our friend.

 

Having every row within a key combined with every other row within that key, then filtering as required, can be very powerful.

 

Edit:  as long as the table isn't, say, 100M rows 😉


Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.
ChrisNZ
Tourmaline | Level 20

Another (probably better) way:

data GROUPS;
 do GROUP='01jan2020'd to '24dec2020'd;
   START=dhms(GROUP,0,0,0);
   END  =dhms(GROUP+7,0,0,0)-1;
   output;
 end;
 format START END datetime.;
run;

proc sql;
  create table WANT as
  select unique HAVE.*
  from HAVE
     , GROUPS
  where DATE between START and END 
  group by IP, GROUP
  having count(distinct ACCOUNT) > 2
  order by IP, DATE;

 

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 1641 views
  • 7 likes
  • 4 in conversation