08-18-2014 01:02 PM
Has anyone written efficient code for something similar to this?
I have a dataset containing timestamp ( SAS datetime) , ID1 and ID2. Would like to calculate the number of distinct ID2's observed for ID1 for each observation within a certain timeframe. So for each observation, I have to count the distinct ID2's ( not including the current observation ) that occurred within say 7 days ( to the second ) for the ID1 on the transaction.
Thanks in advance for your suggestions.
08-20-2014 10:12 AM
What is "efficient code"? Is this about not using a lot of lines of code, or code which is easy to understand and maintain or is this code which performs well?
Please give us always the SAS version and the OS.
If it is about performance then please tell us also what volumes we're dealing with, how often this needs to run etc. - and if it is "big data" then we would also need to understand your environment (memory available, I/0, is this a grid environment, ....).
08-20-2014 10:38 AM
I am dealing with big data ( and sensitive data ) , so I can't post any actual data here, running 9.3 on Solaris. I think I have come up with a solution that works, but I wanted to see if this was a problem that someone had already solved previously, and a more compact solution was available. Sample data would be something like
ID1 ID2 DTTIME 7D_COUNT
1 1 03aug2014:11:52:57 0
1 2 03aug2014:13:48:47 1
1 2 10aug2014:11:44:58 2
1 2 10aug2014:13:20:51 1
08-20-2014 06:00 PM
Is your source data in SAS tables or a data base. And if it's a data base: Which one?
If it's SAS tables: Can we assume that you're data is pre-sorted like in your data sample? Or at least that for the same ID dates are in sequence (even if there are records in-between from other ID's)?
08-20-2014 06:03 PM
Sounds like a readmission problem, though the exact business context may different.
Hopefully the appropriate term may at least help you narrow down solutions.
08-21-2014 12:59 AM
Same question as Patrick as you set some requirements mentioning "big data".
- How many id1's do you have. Within that, how many id2's
- How big is this dataset.
- Is it ordered id1 id2 date? can it get ordered (sorted) seen as a stream
- What is you machine capacity memory processing dasd.
- Which SAS version are you using (9.3 9,4 or an old one like 9.1)
Those are the things for choosing memory based processing (hash) or batch sequential or indexed data.
How fast should the processing be. order of seconds-minutes hours or days?
The first impression would be using two record pointers with a dataset when this dataset is sorted like this and is a SAS dataset.
In that case it is one pass sequential access.