Hello,
I have got three spatial datasets.
I have been trying to merge these datasets using SAS; however, I have got problem in merging the population data (DHS) with health facility data. Both the sampled health facility and population data were collected all over the country.
The DHS data has 622 clusters (each cluster has one coordinate information) and an average of 23 individuals (14,300 people in total) were interviewed per cluster. On the other hand, in the second dataset, 1,020 health facilities were interviewed along with their geographic coordinates.
The merge should be done using both geographic coordinates (1,020 health facility vs 622 clusters) (using the nearest health facility distance to each DHS cluster) and regions (11 regions where both the data were collected).
The merge that I want is not only by minimum distance between clusters and health facilities, but it should also consider the regional administration boundary. In other words, all nearest distance merges must not cross regional admin boundary.
During merge, how do I manage the multiple observations per cluster in the population dataset; 14,300 people in 622 clusters? There are more location coordinates of health facility (1,020) as compared to 622 clusters.
How can I merge the DHS data with the health facility data? How do I manage the attribute data (an average of 23 individuals information per one cluster) while combining it with a single health facility data?
Data x; *Health facility dataset;
Set a;
Keep LAT LONG REGION FACTYPE Q102_04 GR1 GR2 FA1 FA2 FR1 FR2 FR3
Run;
Data y; *Population dataset;
Set b;
Keep V001 LAT_DHS LONG_DHS V002 V012 REGION V190 V218 M14_1 V501 V313M;
Run;
Kind regards
Teketo
GeoDIST will calculate distances and GINSIDE will help you with the boundary issue. There’s an example in the docs.
This may be easier if you provide sample data. If your data is confidential then make fake data.
Hi Reeza,
Many thanks.
I have made some fictitious data set; DHS and Health facility, in excel.
Here attached to get some demonstration how I can manage the data sets using minimum distance and administrative boundary.
In addition to this, I love to see how I can manage the multiple observations per cluster which also impact the merging process and analysis in later stage.
Regards
Teketo
This isn't a merge problem.
Hi Reeza,
Once the two data sets are merged properly, I have to do descriptive and regression analysis. My main concern is how do these data sets be merged to run my statistical analysis?
Kind regards
Teketo
1. Use GINSIDE to determine which facilities are in which boundaries. The documentation has an example of that.
2. Once you've assigned each to the proper region, create a distance matrix that calculates the distance between the facility and the clusters, using GEODIST() and a cross join.
3. Once you have the distance matrix you can find the nearest cluster and use that as needed for your analysis.
4. Then join the data needed.
Unfortunately this is not a trivial problem and not something I can answer in a few minutes. I would suggest you break down the problem into smaller portions and try to solve each one first and then pull it all together. You can then ask more targeted questions that are likely to be answered. Asking a fairly complex question that requires a significant time investment is likely to be answered.
Also, you can prep your sample data for this or other questions using the techniques outlined here:
If you provide sample data and code as data steps you're much much more likely to get help compared to when some has to download files, import them, figure out if its correct and try and interpret the problem and then solve it.
Hi Reeza,
Many thanks for your unreserved support.
How do the GEODIST function work in calculating the distance between data set ‘x’ and data set ‘y’?
Do the two data sets merged first to do the distance calculation? If that is the case, how can I merge them?
The two data sets should be merged considering the minimum distance between cluster and health facility, and each cluster and health facility merge using the minimum distance should be within the respective regions.
For instance, the nearest health facility to a cluster might be in other region, in this case, the merging is not appropriate. Each merge shouldn’t extend beyond region boundary.
I used the below SAS code:
*Distance calculation using GEODIST function;
Data dist;
Set x y;
distance = geodist (lat1, lon1, lat2, lon2, 'DK');
put 'Distance = ' distance 'kilometres';
run;
But this brought nothing, I haven’t got the distance matrix. It gives me a merged data set, which is inappropraite.
Here attached the SAS code for the fictitious data set.
*Population data;
Data x;
Input Cluster Lat1 Lon1 region age FP;
datalines;
1 14.0 38.7 1 19 0
1 14.0 38.7 1 21 1
2 11.6 41.5 2 30 0
2 11.6 41.5 2 48 0
2 11.6 41.5 2 25 1
3 12.7 37.5 3 32 1
3 12.7 37.5 3 21 1
4 7.5 39.1 4 40 0
4 7.5 39.1 4 33 1
4 7.5 39.1 4 29 0
5 12.7 37.6 1 40 0
5 12.7 37.6 1 31 1
6 10.4 40.1 2 27 1
6 10.4 40.1 2 29 0
7 12.1 40.0 2 26 0
7 12.1 40.0 2 24 1
8 12.2 37.0 3 32 0
8 12.2 37.0 3 37 1
9 9.9 39.9 3 42 0
9 9.9 39.9 3 37 1
10 9.1 40.7 4 33 0
10 9.1 40.7 4 35 1
10 9.1 40.7 4 36 0
;
*Health facility data;
Data y;
Input ID Lat2 Lon2 region HFMA FPRS;
Datalines;
1 13.6 39.1 1 1 0
2 13.3 38.9 1 2 1
3 10.7 40.1 2 1 1
4 10.3 40.2 2 1 1
5 11.4 40.7 2 1 1
6 10.5 39.5 3 2 0
7 11.6 39.6 3 1 1
8 11.3 37.9 3 2 1
9 9.1 35.6 4 2 0
10 9.5 34.8 4 1 1
11 14.3 37.8 1 2 0
12 13.5 39.4 1 1 1
13 10.4 40.2 2 2 0
14 9.2 40.3 2 1 1
15 9.9 39.8 3 2 1
16 11.0 37.1 3 1 1
17 10.6 39.4 3 2 0
18 8.5 34.5 4 2 1
19 7.2 38.2 4 1 1
20 8.8 35.8 4 2 1
;
run;
Kind regards
Teketo
Your sample data doesn't make sense. You mentioned that the clusters only had one location, so why do you have multiple entries and if you do how do you want the facility location matched to the clusters? This type of multiplicity can cause your table size to get big fast.
Here's how to calculate the distance:
Data clusters;
Input Cluster Lat1 Lon1 region age FP;
datalines;
1 14.0 38.7 1 19 0
1 14.0 38.7 1 21 1
2 11.6 41.5 2 30 0
2 11.6 41.5 2 48 0
2 11.6 41.5 2 25 1
3 12.7 37.5 3 32 1
3 12.7 37.5 3 21 1
4 7.5 39.1 4 40 0
4 7.5 39.1 4 33 1
4 7.5 39.1 4 29 0
5 12.7 37.6 1 40 0
5 12.7 37.6 1 31 1
6 10.4 40.1 2 27 1
6 10.4 40.1 2 29 0
7 12.1 40.0 2 26 0
7 12.1 40.0 2 24 1
8 12.2 37.0 3 32 0
8 12.2 37.0 3 37 1
9 9.9 39.9 3 42 0
9 9.9 39.9 3 37 1
10 9.1 40.7 4 33 0
10 9.1 40.7 4 35 1
10 9.1 40.7 4 36 0
;
*Health facility data;
Data health_facility;
Input ID Lat2 Lon2 region HFMA FPRS;
Datalines;
1 13.6 39.1 1 1 0
2 13.3 38.9 1 2 1
3 10.7 40.1 2 1 1
4 10.3 40.2 2 1 1
5 11.4 40.7 2 1 1
6 10.5 39.5 3 2 0
7 11.6 39.6 3 1 1
8 11.3 37.9 3 2 1
9 9.1 35.6 4 2 0
10 9.5 34.8 4 1 1
11 14.3 37.8 1 2 0
12 13.5 39.4 1 1 1
13 10.4 40.2 2 2 0
14 9.2 40.3 2 1 1
15 9.9 39.8 3 2 1
16 11.0 37.1 3 1 1
17 10.6 39.4 3 2 0
18 8.5 34.5 4 2 1
19 7.2 38.2 4 1 1
20 8.8 35.8 4 2 1
;
run;
proc sql;
create table dist_table as
select h.id, h.lat2 as lat_facility, h.lon2 as long_facility,
h.region,
c.lat1 as lat_cluster, c.lon1 as long_cluster, c.cluster,
geodist(h.lat2, h.lon2, c.lat1, c.lon1, 'k' ) as dist_km
from clusters as c
cross join health_facility as h
where c.region=h.region
order by h.id, dist_km;
quit;
Hi Reeza,
Many thanks.
Yes, each cluster has one location information; however, there are a number of people interviewed per each cluster. For example, in the fictitious data set, two women were interviewed in cluster 1. Since they were in the same cluster, they have the same location information; that is, lat = 14.0 and long = 38.7 is the same for both women.
In my actual data set, an average of 23 women were interviewed per one cluster; that is, a single coordinate or location information works for all the 23 women. Therefore, one location information is duplicated for 23 times for each individual woman in that particular cluster.
What is the importance of cross join or Cartesian product here? I don’t need a duplicate merge. I just need each cluster to link to one or more health facilities based on the minimum distance that each facility is located from a cluster. Therefore, one health facility will be linked to only one cluster; a single health facility can’t be linked to more than one cluster. But a single cluster can be linked to one or more health facilities depending on each facility proximity taking into consideration that both the health facility and the cluster are within the same region.
Therefore, the cross join, in this case, can’t work; it gives me a cross product.
Kind regards
Teketo
@Teketo wrote:
I just need each cluster to link to one or more health facilities based on the minimum distance that each facility is located from a cluster. Therefore, one health facility will be linked to only one cluster; a single health facility can’t be linked to more than one cluster. But a single cluster can be linked to one or more health facilities depending on each facility proximity taking into consideration that both the health facility and the cluster are within the same region.
Therefore, the cross join, in this case, can’t work; it gives me a cross product.
Kind regards
Teketo
How do you know what the minimum distance is unless you calculate them all?
Hi Reeza,
Many thanks for your unreserved kindly support.
Oh, I have got the main idea of the cross join. However, after the cross join, I am still worried about how to select health facilities that are linked to respective clusters based on the minimum distance and administrative boundary limit. Therefore, at the end, the joined dataset should fulfil the following conditions:
Once these conditions are fulfilled, I can do statistical analysis. Otherwise, the analysis will not give the correct figure and information about the problem I am studying.
Kind regards
Teketo
I understand, but I assumed region was your administrative boundaries, which is why the cross join does have that WHERE condition. If that's not the case, you missed the GINSIDE step I mentioned at the beginning and you need the SHP file and GINSIDE to first determine your regions and then in the lookup table add that condition to your WHERE statement.
I *think* all you need to do know is first to pick the minimum based on the code I provided. I assumed you can select the minimum by your particular lookup rules from the big table.
Hi Reeza,
Thank you indeed!
Yes, region is my administrative boundary and the WHERE condition works well for it.
Now, my main question is: how to pick a single minimum distance for each health facility having in mind the following criteria?
Kind regards
Teketo
Sort by health facility and distance
Take the first record using BY group and FIRST.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.