I'm trying to write a formula that would do the following:
1) Identify which points start new clusters under given conditions: the Euclidean distance is sufficiently small (<600) and the time difference is sufficiently large (>15). The conditional statements are not the current issue, but renaming the BaseClusterID variable is.
2) I want to be able to create a new cluster ID name for each cluster for which this condition holds (i.e. ID points 3, 223, 10344, and 16078 all satisfy the above conditions, so I'd want each of them given a different cluster ID: Cluster 1, 2, 3, and 4).
3) Every sequential ID point which satisfies this condition falls in the same cluster (so points 4 and 5 are in the same cluster as point 3).
I wanted to know if it was possible to achieve this renaming of clusters with Do loops and Arrays. Any assistance or direction would be much appreciated.
Here is a sample of what I have and what I am looking for:
Dataset One (what I have):
ID BaseClusterID DeltaTime EucDistance
1 cluster_0 3 70
2 cluster_0 1 4000
3 cluster_0 22 25
4 cluster_0 2 80
5 cluster_0 2 200
...
Dataset Two (what I'm looking for):
ID BaseClusterID DeltaTime EucDistance
1 cluster_0 3 70
2 cluster_0 1 4000
3 cluster_1 22 25
4 cluster_1 2 80
5 cluster_1 2 200
...
I think you'll need to provide a bit more information about the input data. From your dataset one I have no way/reason to tell that ID value 4 should be a different cluster than ID 1. I have to assume there are some groups of coordinates that are used as the base and another set compared with those and possibly there is a rule about which base(?) coordinates are considered when deciding which cluster value assignment is considered.
I understand these concerns and appreciate the prompt response. So for this project, I want these data to be grouped based on proximity in time and space.
So imagine we're talking about points 1 - 9. Points 3 - 9 form a cluster. But if there was missing data between points 3 and 4, there might be a larger gap in time. It is still evident there is a cluster there, however, as all points are sufficiently close to one another. I'm looking to detect that Point 3 is the first point in this cluster and to recognize that all other points are sufficiently close in time and space to say they are also points in this cluster.
Without explicit data for coordinates I think I would approach this using Proc Fastclus. Possibly looking at creating the potential geographic clusters first and then applying the time element afterwards.
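As a rough sketch of that first step, something like the following could create the geographic clusters. The data set POINTS, the coordinate variables X and Y, the sample values, and MAXCLUSTERS=2 are all made up for illustration; they are not from the original post.

```sas
/* Hypothetical coordinates, invented only so the step runs */
data points;
   input ID x y;
   datalines;
1 10 12
2 11 13
3 12 11
4 900 950
5 905 940
6 910 955
;
run;

/* Group the points on coordinates only; time is not used here.  */
/* OUT= keeps the input variables and adds CLUSTER and DISTANCE. */
proc fastclus data=points maxclusters=2 out=geo_clusters noprint;
   var x y;
run;
```

The time element could then be applied to GEO_CLUSTERS in a later DATA step, for example by processing each value of CLUSTER in time order.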
I'll definitely look into Proc Fastclus, thank you very much. I should also mention that I do have explicit data for the coordinates, and these steps take place after running an ST-DBSCAN analysis. I'm just wondering if it is possible to rename points 3 - 9 using Do loops/Arrays.
If you have a single column like Cluster, assigning values is easy. You should note that FASTCLUS by default will generate values like CLUSTER1, CLUSTER2 etc. to identify the groups of coordinates it recommends as a cluster. So your loop/array may not be needed.
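For the renaming itself, here is a minimal DATA step sketch that uses RETAIN instead of an explicit DO loop or array, since the DATA step already processes the rows one at a time. It assumes the rows are already sorted by ID. The helper variables cluster_num and in_cluster are hypothetical names. The rule that a row with EucDistance >= 600 ends the current cluster is my assumption, because the thread does not say what closes a cluster.

```sas
/* Sample rows from Dataset One in the question */
data have;
   input ID BaseClusterID :$20. DeltaTime EucDistance;
   datalines;
1 cluster_0 3 70
2 cluster_0 1 4000
3 cluster_0 22 25
4 cluster_0 2 80
5 cluster_0 2 200
;
run;

data want;
   set have;
   /* Hypothetical helpers, carried from one row to the next */
   retain cluster_num 0 in_cluster 0;

   /* A close point after a long time gap starts a new cluster */
   if EucDistance < 600 and DeltaTime > 15 then do;
      cluster_num = cluster_num + 1;
      in_cluster = 1;
   end;
   /* ASSUMPTION: a distant point ends the current cluster */
   else if EucDistance >= 600 then in_cluster = 0;

   /* Rows inside a detected cluster get the new label; others keep theirs */
   if in_cluster then BaseClusterID = cats('cluster_', cluster_num);

   drop cluster_num in_cluster;
run;
```

On the five sample rows this should leave IDs 1 and 2 as cluster_0 and relabel IDs 3 to 5 as cluster_1, as in Dataset Two. If a different rule should end a cluster, only the ELSE IF line would need to change.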
I understand automating the function is a quick and simple way to identify the groups of coordinates. However, this is post-cluster scan analysis for any datum that might've fallen through the cracks, an aspect that I feel is best handled through manual detection.