BookmarkSubscribeRSS Feed
wylie_ma
Fluorite | Level 6

I have posted a similar query in a different thread and have received some great responses. My requirements are now slightly different than what I initially had.

 

I have a sample dataset (below) that links an IP to an Account login attempt. I want to identify any IP (alongside the Account and Date) that has attempted to log into three or more accounts within any 7 day window. 

 

data have;
infile cards dsd ;
input IP :$30. Account $ Date :datetime20.;
format date datetime20.;
cards;
108.950.45, 12345, 01MAY2020:11:39:55
108.950.45, 54321, 09MAY2020:12:55:22
108.950.45, 67890, 10MAY2020:03:00:02
108.950.45, 09876, 11MAY2020:06:14:03
108.950.45, 45879, 11MAY2020:12:37:35
108.950.45, 78562, 13MAY2020:13:13:09
108.950.45, 54321, 17MAY2020:07:12:03
108.950.45, 54686, 30MAY2020:10:00:02
220.540.40, 45879, 05MAY2020:09:09:15
220.540.40, 94532, 11MAY2020:03:41:16
220.540.40, 45879, 11MAY2020:14:00:58
110.640.35, 54873, 05MAY2020:17:22:01
110.640.35, 56454, 05MAY2020:19:22:05
110.640.35, 87654, 20MAY2020:05:14:12
;

 

My output would hence ideally read:

 

108.950.45, 54321, 09MAY2020:12:55:22
108.950.45, 54321, 10MAY2020:03:00:02
108.950.45, 09876, 11MAY2020:06:14:03
108.950.45, 45879, 11MAY2020:12:37:35
108.950.45, 78562, 13MAY2020:13:13:09
108.950.45, 67890, 17MAY2020:07:12:03

 

In the above output the following criteria is satisfied: (i) three or more unique Account IDs to the IP; (ii) login attempts occurred within a 7 day window. Even though rows with dates 09MAY2020 and 17MAY2020 are more than seven days a part, they are still associated to 7 day streak.

 

Thank you!

5 REPLIES 5
jimbarbour
Meteorite | Level 14

@wylie_ma,

 

I've coded up a little something; see code below.  Here is what it ultimately produces in the last Data step:

jimbarbour_0-1603343150247.png

 

The code is below.  It's getting late, and I'm tired, so I won't try to walk you through the code right now.  Please take a look at the code, and if you have any questions or encounter any problems, just post a response here in this thread, and I will reply tomorrow.

 

Jim

 

DATA	Have;
	INFILE	Cards		DSD	DLM=',';
	FORMAT	IP			$30.;
	FORMAT	_DateTm		Datetime20.;

	INPUT	IP		:	$30. 
			_Account	$
			_DateTm	:	Datetime20.
			;

Cards;
108.950.45, 12345, 01MAY2020:11:39:55
108.950.45, 54321, 09MAY2020:12:55:22
108.950.45, 67890, 10MAY2020:03:00:02
108.950.45, 09876, 11MAY2020:06:14:03
108.950.45, 45879, 11MAY2020:12:37:35
108.950.45, 78562, 13MAY2020:13:13:09
108.950.45, 54321, 17MAY2020:07:12:03
108.950.45, 54686, 30MAY2020:10:00:02
220.540.40, 45879, 05MAY2020:09:09:15
220.540.40, 94532, 11MAY2020:03:41:16
220.540.40, 45879, 11MAY2020:14:00:58
110.640.35, 54873, 05MAY2020:17:22:01
110.640.35, 56454, 05MAY2020:19:22:05
110.640.35, 87654, 20MAY2020:05:14:12
;
RUN;

**------------------------------------------------------------------------------**;

PROC	SORT	DATA=	Have
				OUT	=	Have_Sort
				;
	BY	IP	_DateTm;
RUN;

**------------------------------------------------------------------------------**;

DATA	Identified_IPs;
	DROP	_:;

	LENGTH	IP				$30;

	RETAIN	_Work_Count;
	RETAIN	Access_Count;
	RETAIN	_Save_Date;
	RETAIN	_Prior_Date;
	RETAIN	First_Date;

	FORMAT	_Save_Date		YYMMDDD10.;
	FORMAT	_Prior_Date		YYMMDDD10.;
	FORMAT	_Current_Date	YYMMDDD10.;
	FORMAT	First_Date		YYMMDDD10.;

	SET	Have_Sort;
		BY	IP;

	_Current_Date						=	DATEPART(_DateTm);

	IF	_N_								=	1						THEN
		DO;
			DECLARE		Hash		Hsh_Accounts();
				Hsh_Accounts.DEFINEKEY('_Account');
				Hsh_Accounts.DEFINEDONE();
		END;

	IF	FIRST.IP													THEN
		DO;
			_Work_Count					=	1;
			Access_Count				=	0;
			_RC							=	Hsh_Accounts.CLEAR();
			_RC							=	Hsh_Accounts.ADD();
			_3_Or_More					=	0;
			_Prior_Date					=	_Current_Date;
			_Save_Date					=	_Current_Date;
		END;
	ELSE
		DO;
			_Days_Apart					=	(_Current_Date	-	_Prior_Date);
			IF	_Days_Apart				<=	7						THEN
				DO;
					_RC					=	Hsh_Accounts.ADD();
					IF	_RC				=	0						THEN
						_Work_Count		+	1;
				END;
			ELSE
				DO;
					_Work_Count			=	1;
					_Save_Date			=	_Current_Date;
				END;
			_Prior_Date					=	_Current_Date;
		END;

	IF	_Work_Count						>=	3						THEN
		DO;
			_3_Or_More					=	1;
			First_Date					=	_Save_Date;
			IF	_Work_Count				>	Access_Count			THEN
				Access_Count			=	_Work_Count;
		END;

	IF	LAST.IP														AND
		Access_Count					>=	3						THEN
		DO;
			OUTPUT;
		END;
	ELSE
		DO;
			DELETE;
		END;
RUN;

**------------------------------------------------------------------------------**;

DATA	Identified_Records;
	DROP	_:;
	DROP	First_Date;
	DROP	Access_Count;

	LENGTH	IP				$30;

	RETAIN	_Prior_Date;

	FORMAT	First_Date		YYMMDDD10.;
	FORMAT	_Prior_Date		YYMMDDD10.;
	FORMAT	_Current_Date	YYMMDDD10.;

	IF	_N_								=	1						THEN
		DO;
			IF	0	THEN
				SET	WORK.Identified_IPs;
			DECLARE		Hash		Hsh_IP_Look_Up(dataset: 'WORK.Identified_IPs');
				Hsh_IP_Look_Up.DEFINEKEY('IP');
				Hsh_IP_Look_Up.DEFINEDATA('First_Date');
				Hsh_IP_Look_Up.DEFINEDONE();
		END;

	SET	Have_Sort(RENAME=(_Account=Account _DateTm=DateTm));
		BY	IP;

	_Current_Date						=	DATEPART(DateTm);

	IF	FIRST.IP													THEN
		DO;
			_RC							=	Hsh_IP_Look_Up.FIND();
			IF	_RC						=	0						THEN
				IF	_Current_Date		=	First_Date				THEN
					DO;
						OUTPUT;
					END;
				ELSE
					DO;
					END;
			ELSE
				DO;
				END;
			_Prior_Date					=	_Current_Date;
		END;
	ELSE
		DO;
			_RC							=	Hsh_IP_Look_Up.FIND();
			IF	_RC						=	0						THEN
				IF	_Current_Date		=	First_Date				THEN
					DO;
						OUTPUT;
					END;
				ELSE
					DO;
						_Days_Apart		=	(_Current_Date	-	_Prior_Date);
						IF	_Days_Apart	<=	7						THEN
							DO;
								_RC		=	Hsh_IP_Look_Up.FIND();
								IF	_RC	=	0						THEN
									DO;
										OUTPUT;
									END;
								ELSE
									DO;
									END;
							END;
						ELSE
							DO;
							END;
					END;
			ELSE
				DO;
				END;
			_Prior_Date					=	_Current_Date;
		END;

	DELETE;
RUN;
ChrisNZ
Tourmaline | Level 20

Something like this?

 

data have;
infile cards dsd ;
input IP :$30. Account $ Date :datetime20.;
format date datetime20.;
cards;
108.950.45, 12345, 01MAY2020:11:39:55
108.950.45, 54321, 09MAY2020:12:55:22
108.950.45, 67890, 10MAY2020:03:00:02
108.950.45, 09876, 11MAY2020:06:14:03
108.950.45, 45879, 11MAY2020:12:37:35
108.950.45, 78562, 13MAY2020:13:13:09
108.950.45, 54321, 27MAY2020:07:12:03
108.950.45, 54686, 30MAY2020:10:00:02
220.540.40, 45879, 05MAY2020:09:09:15
220.540.40, 94532, 11MAY2020:03:41:16
220.540.40, 45879, 11MAY2020:14:00:58
110.640.35, 54873, 05MAY2020:17:22:01
110.640.35, 56454, 05MAY2020:19:22:05
110.640.35, 87654, 20MAY2020:05:14:12
run;

proc sql;
  create table SEVEN_DAY_FILTER as
  select unique a.*   
  from HAVE a, HAVE b  
  where a.IP=b.IP
    and a.DATE ne b.DATE
    and -8 < intck('dtday',a.DATE,b.DATE) < 8
  order by a.IP, a.DATE;
quit;

data DATE_GROUPS;
  set SEVEN_DAY_FILTER;
  by IP;
  if intck('dtday',lag(DATE),DATE) > 7 then GROUP+1;
  else if first.IP then GROUP+1;
run;

proc sql;
  create table THREE_ACCT_FILTER as
  select *
  from DATE_GROUPS
  group by GROUP
  having count(distinct ACCOUNT) > 2
  order by IP, DATE;
quit;

IP Account Date
108.950.45 54321 09MAY2020:12:55:22
108.950.45 67890 10MAY2020:03:00:02
108.950.45 09876 11MAY2020:06:14:03
108.950.45 45879 11MAY2020:12:37:35
108.950.45 78562 13MAY2020:13:13:09

Note: I changed one date to validate the steps.

 

 

ScottBass
Rhodochrosite | Level 12

I was thinking the same approach, but you beat me to it.

 

We often spend so much effort preventing an unwanted Cartesian product in a multi-table join, that we forget that a Cartesian product, usually with a table self-join, can be our friend.

 

Having every row within a key combined with every other row within that key, then filtering as required, can be very powerful.

 

Edit:  as long as the table isn't, say, 100M rows 😉


Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.
ChrisNZ
Tourmaline | Level 20

Another (probably better) way:

data GROUPS;
 do GROUP='01jan2020'd to '24dec2020'd;
   START=dhms(GROUP,0,0,0);
   END  =dhms(GROUP+7,0,0,0)-1;
   output;
 end;
 format START END datetime.;
run;

proc sql;
  create table WANT as
  select unique HAVE.*
  from HAVE
     , GROUPS
  where DATE between START and END 
  group by IP, GROUP
  having count(distinct ACCOUNT) > 2
  order by IP, DATE;

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 1682 views
  • 7 likes
  • 4 in conversation