Re: How to combine spatial and attribute data set? - Page 2

Teketo · Posted 02-23-2018 01:25 AM

Dear Reeza,

Many thanks for your invaluable support.

The cross join works well; however, when I use the BY group and FIRST, I am losing most of the respondents (R_ID) and more than half of the clusters.

For example, respondents (R_ID) 8, 9, and from 11 to 23 and clusters from 5 to 10 are not included in the final dataset.

I used the following dataset and code to do the cross join and pick the appropriate health facilities, respondents and clusters along with the other variables.

*To calculate the distance;

Data clusters;

Input R_ID Cluster Lat1 Lon1 region age FP;

datalines;

1 1 14.0 38.7 1 19 0

2 1 14.0 38.7 1 21 1

3 2 11.6 41.5 2 30 0

4 2 11.6 41.5 2 48 0

5 2 11.6 41.5 2 25 1

6 3 12.7 37.5 3 32 1

7 3 12.7 37.5 3 21 1

8 4 7.5 39.1 4 40 0

9 4 7.5 39.1 4 33 1

10 4 7.5 39.1 4 29 0

11 5 12.7 37.6 1 40 0

12 5 12.7 37.6 1 31 1

13 6 10.4 40.1 2 27 1

14 6 10.4 40.1 2 29 0

15 7 12.1 40.0 2 26 0

16 7 12.1 40.0 2 24 1

17 8 12.2 37.0 3 32 0

18 8 12.2 37.0 3 37 1

19 9 9.9 39.9 3 42 0

20 9 9.9 39.9 3 37 1

21 10 9.1 40.7 4 33 0

22 10 9.1 40.7 4 35 1

23 10 9.1 40.7 4 36 0

;

*Health facility data;

Data health_facility;

Input ID Lat2 Lon2 region HFMA FPRS;

Datalines;

1 13.6 39.1 1 1 0

2 13.3 38.9 1 2 1

3 10.7 40.1 2 1 1

4 10.3 40.2 2 1 1

5 11.4 40.7 2 1 1

6 10.5 39.5 3 2 0

7 11.6 39.6 3 1 1

8 11.3 37.9 3 2 1

9 9.1 35.6 4 2 0

10 9.5 34.8 4 1 1

11 14.3 37.8 1 2 0

12 13.5 39.4 1 1 1

13 10.4 40.2 2 2 0

14 9.2 40.3 2 1 1

15 9.9 39.8 3 2 1

16 11.0 37.1 3 1 1

17 10.6 39.4 3 2 0

18 8.5 34.5 4 2 1

19 7.2 38.2 4 1 1

20 8.8 35.8 4 2 1

;

run;

*Proc sql: Using cross join to merge to the two datasets;

proc sql;

create table dist_table as

select h.id, h.lat2 as lat_facility, h.lon2 as long_facility,

h.region,

c.r_id, c.cluster, c.lat1 as lat_cluster, c.lon1 as long_cluster,

geodist(h.lat2, h.lon2, c.lat1, c.lon1, 'k' ) as dist_km

from clusters as c cross join health_facility as h

where c.region=h.region

order by h.id, c.cluster, dist_km;

quit;

*To pick a single minimum distance for each health facility having in mind the following criteria

One health facility should be linked to one cluster in a region
One cluster can be linked to one or more health facilities in a region;

*BY group and FIRST;

Data merged;

Set dist_table;

By id cluster dist_km;

If first.id;

Run;

The following one also doesn’t give the appropriate result.

proc sort data = dist_table;

by id dist_km;

run;

Data merged;

Set dist_table;

By id dist_km;

If first.id;

Run;

Kind regards

Teketo

Reeza · Posted 02-23-2018 02:45 PM

Data merged;

Set dist_table;

By id cluster dist_km; <- This is wrong;

If first.id;

Run;

You're currently sorting by ID, CLUSTER and then distance. Thats not correct. You need to sort by ID and Distance only and then take the first of each ID. That may or may not include all clusters if you don't have facilities in each region.

Teketo · Posted 02-24-2018 07:39 AM

Dear Reeza,

Thank you indeed.

The dataset has health facilities in each region; however, all the clusters are not included in the final dataset. Below in the table, I put the number of health facilities and clusters along with the respective regions given in the fictitious dataset.

Table showing the number of clusters and health facilities in each region

Region	Number of Health facilities	Number of Clusters
1	4	2
2	5	3
3	6	3
4	5	2
Total	20	10

I used the health facility id and distance to sort the dataset and in the BY group and FIRST data step too. However, it didn’t include all the clusters; for instance, clusters 3, 5, 7 and 10 were not included. I am not getting the desired final dataset.

Could there be any other possible option?

proc sort data = dist_table;

by id dist_km;

run;

*BY group and FIRST;

Data merged;

Set dist_table;

By id dist_km;

If first.id;

Run;

Kind regards

Teketo

Reeza · Posted 02-24-2018 12:24 PM

Find a specific example of where Cluster 3 should be included, recreate the sample data to show how it's not being included and I can take another look.

But remember the join was on region, if regions isn't the same as cluster, maybe you needed to join on cluster. To me if there are facilities in each region and each region has multiple clusters it is possible that a cluster may not be selected - think rural outlier hospitals.

Teketo · Posted 02-24-2018 03:49 PM

Dear Reeza,

Thank you indeed.

The join can’t be made at the cluster level. We only have region as a common variable. Both datasets (the fictitious and the main datasets) have multiple clusters and health facilities per region. Every region has both clusters and health facilities; none of the regions missed either of the two.

Here are the datasets and the codes used.

*To calculate the distance;

Data clusters;

Input Cluster Lat1 Lon1 region;

datalines;

1 14.0 38.7 1

2 11.6 41.5 2

3 12.7 37.5 3

4 7.5 39.1 4

5 12.7 37.6 1

6 10.4 40.1 2

7 12.1 40.0 2

8 12.2 37.0 3

9 9.9 39.9 3

10 9.1 40.7 4

;

*Health facility data;

Data health_facility;

Input ID Lat2 Lon2 region HFMA FPRS;

Datalines;

1 13.6 39.1 1 1 0

2 13.3 38.9 1 2 1

3 10.7 40.1 2 1 1

4 10.3 40.2 2 1 1

5 11.4 40.7 2 1 1

6 10.5 39.5 3 2 0

7 11.6 39.6 3 1 1

8 11.3 37.9 3 2 1

9 9.1 35.6 4 2 0

10 9.5 34.8 4 1 1

11 14.3 37.8 1 2 0

12 13.5 39.4 1 1 1

13 10.4 40.2 2 2 0

14 9.2 40.3 2 1 1

15 9.9 39.8 3 2 1

16 11.0 37.1 3 1 1

17 10.6 39.4 3 2 0

18 8.5 34.5 4 2 1

19 7.2 38.2 4 1 1

20 8.8 35.8 4 2 1

;

run;

*Proc sql: Using cross join to merge to the two datasets;

proc sql;

create table dist_table as

select h.id, h.lat2 as lat_facility, h.lon2 as long_facility,

h.region,

c.cluster, c.lat1 as lat_cluster, c.lon1 as long_cluster,

geodist(h.lat2, h.lon2, c.lat1, c.lon1, 'k' ) as dist_km

from clusters as c cross join health_facility as h

where c.region=h.region

order by h.id, c.cluster, dist_km;

quit;

proc sort data = dist_table;

by id dist_km;

run;

*BY group and FIRST;

Data merged;

Set dist_table;

By id dist_km;

If first.id;

Run;

Kind regards

Teketo

Reeza · Posted 02-24-2018 04:19 PM

Then your assumption that each cluster will be included is not correct. If you want the closest facility to hospital, that doesn't necessarily mean that each cluster will be included. The two criteria's don't align.

Check Cluster 3 in your data, specifically. It's not included because it's never the closes facility so you need to revisit your assumptions and requirements.

proc sql;
create table cluster3 as
select *
from dist_table 
where id in (select id from dist_table where cluster=3)
order by id, region, dist_km;
quit;

@Teketo wrote:

Dear Reeza,

Thank you indeed.

The join can’t be made at the cluster level. We only have region as a common variable. Both datasets (the fictitious and the main datasets) have multiple clusters and health facilities per region. Every region has both clusters and health facilities; none of the regions missed either of the two.

Here are the datasets and the codes used.

*To calculate the distance;

Data clusters;

Input Cluster Lat1 Lon1 region;

datalines;

1 14.0 38.7 1

2 11.6 41.5 2

3 12.7 37.5 3

4 7.5 39.1 4

5 12.7 37.6 1

6 10.4 40.1 2

7 12.1 40.0 2

8 12.2 37.0 3

9 9.9 39.9 3

10 9.1 40.7 4

;

*Health facility data;

Data health_facility;

Input ID Lat2 Lon2 region HFMA FPRS;

Datalines;

1 13.6 39.1 1 1 0

2 13.3 38.9 1 2 1

3 10.7 40.1 2 1 1

4 10.3 40.2 2 1 1

5 11.4 40.7 2 1 1

6 10.5 39.5 3 2 0

7 11.6 39.6 3 1 1

8 11.3 37.9 3 2 1

9 9.1 35.6 4 2 0

10 9.5 34.8 4 1 1

11 14.3 37.8 1 2 0

12 13.5 39.4 1 1 1

13 10.4 40.2 2 2 0

14 9.2 40.3 2 1 1

15 9.9 39.8 3 2 1

16 11.0 37.1 3 1 1

17 10.6 39.4 3 2 0

18 8.5 34.5 4 2 1

19 7.2 38.2 4 1 1

20 8.8 35.8 4 2 1

;

run;

*Proc sql: Using cross join to merge to the two datasets;

proc sql;

create table dist_table as

select h.id, h.lat2 as lat_facility, h.lon2 as long_facility,

     h.region,

     c.cluster, c.lat1 as lat_cluster, c.lon1 as long_cluster,

     geodist(h.lat2, h.lon2, c.lat1, c.lon1, 'k' ) as dist_km

from clusters as c cross join health_facility as h

where c.region=h.region

order by h.id, c.cluster, dist_km;

quit;

proc sort data = dist_table;

by id dist_km;

run;

*BY group and FIRST;

Data merged;

Set dist_table;

By id dist_km;

If first.id;

Run;

Kind regards

Teketo

Reeza · Posted 02-24-2018 04:23 PM

Right now, rereading your requirements, maybe you want to get the cluster/facility match, so you should sort by cluster instead and take the closest?

I think at this point you have all the tools you need to solve this problem and I've provided enough assistance to solve it. The logic is up to you to determine, first what you need and you've learned how to find the minimum by group, so you should be able to switch it to do a slightly different variation.

Good Luck.