BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
wylie_ma
Fluorite | Level 6

This might be a tough one.

 

I have a dataset showing the date an IP address accessed an Account:

 

IP, Account, Date

108.950.45, 12345, 5/May/2020

108.950.45, 54321, 6/May/2020

108.950.45, 67890, 10/May/2020

108.950.45, 09876, 11/May/2020

108.950.45, 45879, 11/May/2020

108.950.45, 54686, 30/May/2020

220.540.40, 45879, 05/May/2020

220.540.40, 94532, 11/May/2020

220.540.40, 45879, 11/May/2020

110.640.35, 54873, 05/May/2020

110.640.35, 56454, 05/May/2020

110.640.35, 87654, 20/May/2020

 

I would like to create a dataset that shortlists any IP address that touched three or more unique accounts within a seven day period. My output from above dataset would hence read:

 

108.950.45, 12345, 5/May/2020

108.950.45, 54321, 6/May/2020

108.950.45, 67890, 10/May/2020

108.950.45, 09876, 11/May/2020

108.950.45, 45879, 11/May/2020

 

In the above output there are three more accounts linked to same IP within a seven day window.

 

220.540.40 would not be in final output as it interacted with two unique Accounts. 110.640.35 would also not appear as it did not interact with three accounts in less than 7 days (it took 15 days).

 

Any suggestions how to tackle this problem?

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
novinosrin
Tourmaline | Level 20

Hi @wylie_ma  My apologies as I haven't tested as I'm awefully tired after a long day. The following should be close. If not, sure you should have the much needed solution while we folks at eastern time US are asleep. Nonetheless, Please try, test and let me know should you need any further.

 


data have;
infile cards dsd ;
input IP :$30. Account $ Date	:date9.;
format date date9.;
cards;
108.950.45, 12345, 5/May/2020
108.950.45, 54321, 6/May/2020
108.950.45, 67890, 10/May/2020
108.950.45, 09876, 11/May/2020
108.950.45, 45879, 11/May/2020
108.950.45, 54686, 30/May/2020
220.540.40, 45879, 05/May/2020
220.540.40, 94532, 11/May/2020
220.540.40, 45879, 11/May/2020
110.640.35, 54873, 05/May/2020
110.640.35, 56454, 05/May/2020
110.640.35, 87654, 20/May/2020
;

data want;
 if _n_=1 then do;
   dcl hash H () ;
   h.definekey  ("account") ;
   h.definedata ("account","ip", "date") ;
   h.definedone () ;
   dcl hiter hi('h');
 end;
 do until(last.ip);
  set have;
  by ip notsorted;
  if date<=_n then if h.check() ne 0 then do;
   _c+1;
   h.add();
  end;
  if date>_n or last.ip  then do;
   	if _c>=3 then do while(hi.next()=0);
	 output;
	end;
	_c=1;
    _n=date+7;
	h.clear();
	h.add();
  end;
 end;
 drop _:;
run;

results.PNG

View solution in original post

7 REPLIES 7
novinosrin
Tourmaline | Level 20

Hi @wylie_ma  My apologies as I haven't tested as I'm awefully tired after a long day. The following should be close. If not, sure you should have the much needed solution while we folks at eastern time US are asleep. Nonetheless, Please try, test and let me know should you need any further.

 


data have;
infile cards dsd ;
input IP :$30. Account $ Date	:date9.;
format date date9.;
cards;
108.950.45, 12345, 5/May/2020
108.950.45, 54321, 6/May/2020
108.950.45, 67890, 10/May/2020
108.950.45, 09876, 11/May/2020
108.950.45, 45879, 11/May/2020
108.950.45, 54686, 30/May/2020
220.540.40, 45879, 05/May/2020
220.540.40, 94532, 11/May/2020
220.540.40, 45879, 11/May/2020
110.640.35, 54873, 05/May/2020
110.640.35, 56454, 05/May/2020
110.640.35, 87654, 20/May/2020
;

data want;
 if _n_=1 then do;
   dcl hash H () ;
   h.definekey  ("account") ;
   h.definedata ("account","ip", "date") ;
   h.definedone () ;
   dcl hiter hi('h');
 end;
 do until(last.ip);
  set have;
  by ip notsorted;
  if date<=_n then if h.check() ne 0 then do;
   _c+1;
   h.add();
  end;
  if date>_n or last.ip  then do;
   	if _c>=3 then do while(hi.next()=0);
	 output;
	end;
	_c=1;
    _n=date+7;
	h.clear();
	h.add();
  end;
 end;
 drop _:;
run;

results.PNG

wylie_ma
Fluorite | Level 6

Thanks so much again! I wish there was a button I could use on this page to tip you

PeterClemmensen
Tourmaline | Level 20

What if the pattern for an IP address was like this? There are more than 7 days from 05/May/2020 to 15/May/2020, but both can be a part of a 'within 7 days' streak.

 

110.640.35, 12345, 05/May/2020
110.640.35, 23456, 09/May/2020
110.640.35, 34567, 11/May/2020
110.640.35, 45678, 15/May/2020
wylie_ma
Fluorite | Level 6

Hypothetically, if I run add an additional row to dataset (see row 3 below), where IP address accesses the same account within 7 days, the code currently will only pickup the first entry within a seven day range.


data have;
infile cards dsd ;
input IP :$30. Account $ Date :date9.;
format date date9.;
cards;
108.950.45, 12345, 5/May/2020
108.950.45, 54321, 6/May/2020
108.950.45, 54321, 13/May/2020
108.950.45, 67890, 10/May/2020
108.950.45, 09876, 11/May/2020
108.950.45, 45879, 11/May/2020
108.950.45, 54686, 30/May/2020
220.540.40, 45879, 05/May/2020
220.540.40, 94532, 11/May/2020
220.540.40, 45879, 11/May/2020
110.640.35, 54873, 05/May/2020
110.640.35, 56454, 05/May/2020
110.640.35, 87654, 20/May/2020
;

 

Is there anyway to adjust code so that it will pickup all entries within seven day range, even if it is the same account? See below for desired output. Also, if my date format is datetime20., will code be compatible with this format?

 

108.950.45 54321 06MAY2020

108.950.45 45879 11MAY2020
108.950.45 67890 10MAY2020
108.950.45 12345 05MAY2020
108.950.45 09876 11MAY2020

108.950.45 54321 13MAY2020

novinosrin
Tourmaline | Level 20

Hi @wylie_ma  First off, a repeat occurrence of the same account within a particular interval changes your original stated requirement/objective. However, sure it's possible. Also, if you have datetime values instead of date values, we could extract the date value from a datetime value using a datepart function.  It's not a big deal.

 

Ideally, new requirement is better addressed as independent threads instead all in one. Anyways, Please review your requirement thoroughly and let me know as 13may2020 is outside of of the 1st seven interval i.e. 

interval 1: 5may2020- 12may2020   (5+7)

and the next goes like anything higher 12may2020 is found, increment the interval by +7 from that date.  

wylie_ma
Fluorite | Level 6

Thank you Novinisrin, Will do!

RichardDeVen
Barite | Level 11

A SQL query with a counting sub-select can create the desired result set.

 

Example:

proc sql;
  create index IP on have;

  create table want as 
  select *
  from have as each
  where 3 <= (
    select count(distinct lookup.account)
    from have as lookup
    where lookup.IP = each.IP
      and lookup.date between each.date-7 and each.date+7
    )    
  ;
quit;

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1160 views
  • 1 like
  • 4 in conversation