BookmarkSubscribeRSS Feed
Teketo
Calcite | Level 5

Dear Reeza,

 

Many thanks for your invaluable support.

 

The cross join works well; however, when I use the BY group and FIRST, I am losing most of the respondents (R_ID) and more than half of the clusters.

 

For example, respondents (R_ID) 8, 9, and from 11 to 23 and clusters from 5 to 10 are not included in the final dataset.  

 

I used the following dataset and code to do the cross join and pick the appropriate health facilities, respondents and clusters along with the other variables.

 

*To calculate the distance;

Data clusters;

Input R_ID Cluster Lat1 Lon1 region age FP;

datalines;

1 1 14.0 38.7 1 19 0

2 1 14.0 38.7 1 21 1

3 2 11.6 41.5 2 30 0

4 2 11.6 41.5 2 48 0

5 2 11.6 41.5 2 25 1

6 3 12.7 37.5 3 32 1

7 3 12.7 37.5 3 21 1

8 4 7.5 39.1 4 40 0

9 4 7.5 39.1 4 33 1

10 4 7.5 39.1 4 29 0

11 5 12.7 37.6 1 40 0

12 5 12.7 37.6 1 31 1

13 6 10.4 40.1 2 27 1

14 6 10.4 40.1 2 29 0

15 7 12.1 40.0 2 26 0

16 7 12.1 40.0 2 24 1

17 8 12.2 37.0 3 32 0

18 8 12.2 37.0 3 37 1

19 9 9.9 39.9 3 42 0

20 9 9.9 39.9 3 37 1

21 10 9.1 40.7 4 33 0

22 10 9.1 40.7 4 35 1

23 10 9.1 40.7 4 36 0

;

 

*Health facility data;

Data health_facility;

Input ID Lat2 Lon2 region HFMA FPRS;

Datalines;

1 13.6 39.1 1 1 0

2 13.3 38.9 1 2 1

3 10.7 40.1 2 1 1

4 10.3 40.2 2 1 1

5 11.4 40.7 2 1 1

6 10.5 39.5 3 2 0

7 11.6 39.6 3 1 1

8 11.3 37.9 3 2 1

9 9.1 35.6 4 2 0

10 9.5 34.8 4 1 1

11 14.3 37.8 1 2 0

12 13.5 39.4 1 1 1

13 10.4 40.2 2 2 0

14 9.2 40.3 2 1 1

15 9.9 39.8 3 2 1

16 11.0 37.1 3 1 1

17 10.6 39.4 3 2 0

18 8.5 34.5 4 2 1

19 7.2 38.2 4 1 1

20 8.8 35.8 4 2 1

;

run;

 

*Proc sql: Using cross join to merge to the two datasets;

proc sql;

create table dist_table as

select h.id, h.lat2 as lat_facility, h.lon2 as long_facility,

     h.region,

     c.r_id, c.cluster, c.lat1 as lat_cluster, c.lon1 as long_cluster,

     geodist(h.lat2, h.lon2, c.lat1, c.lon1, 'k' ) as dist_km

from clusters as c cross join health_facility as h

where c.region=h.region

order by h.id, c.cluster, dist_km;

quit;

 

*To pick a single minimum distance for each health facility having in mind the following criteria

  1. One health facility should be linked to one cluster in a region
  2. One cluster can be linked to one or more health facilities in a region;

 

*BY group and FIRST;

Data merged;

Set dist_table;

By id cluster dist_km;

If first.id;

Run;

 

The following one also doesn’t give the appropriate result.

 

proc sort data = dist_table;

by id dist_km;

run;

 

 

Data merged;

Set dist_table;

By id dist_km;

If first.id;

Run;

 

 

 

Kind regards

Teketo

 

Reeza
Super User

Data merged;

Set dist_table;

By id cluster dist_km; <- This is wrong;

If first.id;

Run;

 

You're currently sorting by ID, CLUSTER and then distance. Thats not correct. You need to sort by ID and Distance only and then take the first of each ID. That may or may not include all clusters if you don't have facilities in each region. 

 

 

Teketo
Calcite | Level 5

Dear Reeza,

 

Thank you indeed.

The dataset has health facilities in each region; however, all the clusters are not included in the final dataset. Below in the table, I put the number of health facilities and clusters along with the respective regions given in the fictitious dataset.

 

Table showing the number of clusters and health facilities in each region

Region

Number of Health facilities

Number of Clusters

1

4

2

2

5

3

3

6

3

4

5

2

Total

20

10

 

I used the health facility id and distance to sort the dataset and in the BY group and FIRST data step too. However, it didn’t include all the clusters; for instance, clusters 3, 5, 7 and 10 were not included. I am not getting the desired final dataset.

 

Could there be any other possible option?

 

proc sort data = dist_table;

by id dist_km;

run;

 

*BY group and FIRST;

Data merged;

Set dist_table;

By id dist_km;

If first.id;

Run;

 

Kind regards

Teketo

Reeza
Super User

Find a specific example of where Cluster 3 should be included, recreate the sample data to show how it's not being included and I can take another look.

 

But remember the join was on region, if regions isn't the same as cluster, maybe you needed to join on cluster. To me if there are facilities in each region and each region has multiple clusters it is possible that a cluster may not be selected - think rural outlier hospitals. 

 

 

Teketo
Calcite | Level 5

Dear Reeza,

 

Thank you indeed.

The join can’t be made at the cluster level. We only have region as a common variable. Both datasets (the fictitious and the main datasets) have multiple clusters and health facilities per region. Every region has both clusters and health facilities; none of the regions missed either of the two.

 

Here are the datasets and the codes used.

 

*To calculate the distance;

Data clusters;

Input Cluster Lat1 Lon1 region;

datalines;

1 14.0 38.7 1

2 11.6 41.5 2

3 12.7 37.5 3

4 7.5 39.1 4

5 12.7 37.6 1

6 10.4 40.1 2

7 12.1 40.0 2

8 12.2 37.0 3

9 9.9 39.9 3

10 9.1 40.7 4

;

 

*Health facility data;

Data health_facility;

Input ID Lat2 Lon2 region HFMA FPRS;

Datalines;

1 13.6 39.1 1 1 0

2 13.3 38.9 1 2 1

3 10.7 40.1 2 1 1

4 10.3 40.2 2 1 1

5 11.4 40.7 2 1 1

6 10.5 39.5 3 2 0

7 11.6 39.6 3 1 1

8 11.3 37.9 3 2 1

9 9.1 35.6 4 2 0

10 9.5 34.8 4 1 1

11 14.3 37.8 1 2 0

12 13.5 39.4 1 1 1

13 10.4 40.2 2 2 0

14 9.2 40.3 2 1 1

15 9.9 39.8 3 2 1

16 11.0 37.1 3 1 1

17 10.6 39.4 3 2 0

18 8.5 34.5 4 2 1

19 7.2 38.2 4 1 1

20 8.8 35.8 4 2 1

;

run;

 

*Proc sql: Using cross join to merge to the two datasets;

proc sql;

create table dist_table as

select h.id, h.lat2 as lat_facility, h.lon2 as long_facility,

     h.region,

     c.cluster, c.lat1 as lat_cluster, c.lon1 as long_cluster,

     geodist(h.lat2, h.lon2, c.lat1, c.lon1, 'k' ) as dist_km

from clusters as c cross join health_facility as h

where c.region=h.region

order by h.id, c.cluster, dist_km;

quit;

 

 

proc sort data = dist_table;

by id dist_km;

run;

 

*BY group and FIRST;

Data merged;

Set dist_table;

By id dist_km;

If first.id;

Run;

 

 

 

Kind regards

Teketo

 

Reeza
Super User

Then your assumption that each cluster will be included is not correct. If you want the closest facility to hospital, that doesn't necessarily mean that each cluster will be included. The two criteria's don't align. 

 

Check Cluster 3 in your data, specifically. It's not included because it's never the closes facility so you need to revisit your assumptions and requirements. 

 

proc sql;
create table cluster3 as
select *
from dist_table 
where id in (select id from dist_table where cluster=3)
order by id, region, dist_km;
quit;

 


@Teketo wrote:

Dear Reeza,

 

Thank you indeed.

The join can’t be made at the cluster level. We only have region as a common variable. Both datasets (the fictitious and the main datasets) have multiple clusters and health facilities per region. Every region has both clusters and health facilities; none of the regions missed either of the two.

 

Here are the datasets and the codes used.

 

*To calculate the distance;

Data clusters;

Input Cluster Lat1 Lon1 region;

datalines;

1 14.0 38.7 1

2 11.6 41.5 2

3 12.7 37.5 3

4 7.5 39.1 4

5 12.7 37.6 1

6 10.4 40.1 2

7 12.1 40.0 2

8 12.2 37.0 3

9 9.9 39.9 3

10 9.1 40.7 4

;

 

*Health facility data;

Data health_facility;

Input ID Lat2 Lon2 region HFMA FPRS;

Datalines;

1 13.6 39.1 1 1 0

2 13.3 38.9 1 2 1

3 10.7 40.1 2 1 1

4 10.3 40.2 2 1 1

5 11.4 40.7 2 1 1

6 10.5 39.5 3 2 0

7 11.6 39.6 3 1 1

8 11.3 37.9 3 2 1

9 9.1 35.6 4 2 0

10 9.5 34.8 4 1 1

11 14.3 37.8 1 2 0

12 13.5 39.4 1 1 1

13 10.4 40.2 2 2 0

14 9.2 40.3 2 1 1

15 9.9 39.8 3 2 1

16 11.0 37.1 3 1 1

17 10.6 39.4 3 2 0

18 8.5 34.5 4 2 1

19 7.2 38.2 4 1 1

20 8.8 35.8 4 2 1

;

run;

 

*Proc sql: Using cross join to merge to the two datasets;

proc sql;

create table dist_table as

select h.id, h.lat2 as lat_facility, h.lon2 as long_facility,

     h.region,

     c.cluster, c.lat1 as lat_cluster, c.lon1 as long_cluster,

     geodist(h.lat2, h.lon2, c.lat1, c.lon1, 'k' ) as dist_km

from clusters as c cross join health_facility as h

where c.region=h.region

order by h.id, c.cluster, dist_km;

quit;

 

 

proc sort data = dist_table;

by id dist_km;

run;

 

*BY group and FIRST;

Data merged;

Set dist_table;

By id dist_km;

If first.id;

Run;

 

 

 

Kind regards

Teketo

 


 

Reeza
Super User

Right now, rereading your requirements, maybe you want to get the cluster/facility match, so you should sort by cluster instead and take the closest?

 

I think at this point you have all the tools you need to solve this problem and I've provided enough assistance to solve it. The logic is up to you to determine, first what you need and you've learned how to find the minimum by group, so you should be able to switch it to do a slightly different variation. 

 

Good Luck. 

 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 21 replies
  • 3591 views
  • 0 likes
  • 2 in conversation