I am seeking help with an unusual merge/join.
I have two datasets/tables in a Teradata environment, which I am communicating with using EG 7.11.
One table (let's call it CensusBlkMiss) contains about 2m records with location ID numbers, lat values, lon values, and is missing census block values:
ID Lat Lon
1 37 -120.7
2 37.2 -123.1
3 38 -122.2
4 36.4 -119.8
5 35.9 -122.1
The other table (let's call it CensusBlkExist) contains about 10m records with lat values, lon values, and census block values:
Lat Lon CensusBlk
36.3 -121.7 C2
35.5 -120.9 A8
38.2 -122 A3
37.9 -122.4 C7
36.3 -121.9 B6
37 -120.3 A1
36.6 -122.1 B6
What I would like to do is assign census block values to the IDs in the first table based on the closest lat/long location from the second dataset, and end up with something like (let's call it CensusBlkCombo)
ID CensusBlk
1 A1
2 B6
3 C7
4 A1
5 B6
("A1" is selected for ID #1 because 37/-120.3 is the closest location with a census block to 37/-120.7, the location of ID #1)
I have not been able to think of a way to do this merge using normal SAS code.
The SQL code that I have tried has been something like this
proc sql
create table CensusBlkCombo as
(select A.SP_ID,
(select CENSUS_BLK_GRP from
(select top 1 CENSUS_BLK_GRP
from CensusBlkExist B
order by (abs(B.sp_lat - A.sp_lat2) + abs(B.sp_long - A.sp_long2))
as CENSUS_BLK_GRP)
from CensusBlkMiss A));
That particular code gives me an error message, "TOP N Syntax error: Top N option is not supported in subquery".
I am not wedded to that setup, though - any suggestions are welcome, including a radically different code structure welcome too.
What country are you in? US? If so, look at PROC GINSIDE.
The example maps to county/state but if you have the shapefiles for the blocks you can do a similar project. SAS may already have the US census shapes in the MAPS folder.
@Adam_Coutts wrote:
I am seeking help with an unusual merge/join.
I have two datasets/tables in a Teradata environment, which I am communicating with using EG 7.11.
One table (let's call it CensusBlkMiss) contains about 2m records with location ID numbers, lat values, lon values, and is missing census block values:
ID Lat Lon
1 37 -120.7
2 37.2 -123.1
3 38 -122.2
4 36.4 -119.8
5 35.9 -122.1
The other table (let's call it CensusBlkExist) contains about 10m records with lat values, lon values, and census block values:
Lat Lon CensusBlk
36.3 -121.7 C2
35.5 -120.9 A8
38.2 -122 A3
37.9 -122.4 C7
36.3 -121.9 B6
37 -120.3 A1
36.6 -122.1 B6
What I would like to do is assign census block values to the IDs in the first table based on the closest lat/long location from the second dataset, and end up with something like (let's call it CensusBlkCombo)
ID CensusBlk
1 A1
2 B6
3 C7
4 A1
5 B6
("A1" is selected for ID #1 because 37/-120.3 is the closest location with a census block to 37/-120.7, the location of ID #1)
I have not been able to think of a way to do this merge using normal SAS code.
The SQL code that I have tried has been something like thisproc sql
create table CensusBlkCombo as
(select A.SP_ID,
(select CENSUS_BLK_GRP from
(select top 1 CENSUS_BLK_GRP
from CensusBlkExist B
order by (abs(B.sp_lat - A.sp_lat2) + abs(B.sp_long - A.sp_long2))
as CENSUS_BLK_GRP)
from CensusBlkMiss A));
That particular code gives me an error message, "TOP N Syntax error: Top N option is not supported in subquery".
I am not wedded to that setup, though - any suggestions are welcome, including a radically different code structure welcome too.
What country are you in? US? If so, look at PROC GINSIDE.
The example maps to county/state but if you have the shapefiles for the blocks you can do a similar project. SAS may already have the US census shapes in the MAPS folder.
@Adam_Coutts wrote:
I am seeking help with an unusual merge/join.
I have two datasets/tables in a Teradata environment, which I am communicating with using EG 7.11.
One table (let's call it CensusBlkMiss) contains about 2m records with location ID numbers, lat values, lon values, and is missing census block values:
ID Lat Lon
1 37 -120.7
2 37.2 -123.1
3 38 -122.2
4 36.4 -119.8
5 35.9 -122.1
The other table (let's call it CensusBlkExist) contains about 10m records with lat values, lon values, and census block values:
Lat Lon CensusBlk
36.3 -121.7 C2
35.5 -120.9 A8
38.2 -122 A3
37.9 -122.4 C7
36.3 -121.9 B6
37 -120.3 A1
36.6 -122.1 B6
What I would like to do is assign census block values to the IDs in the first table based on the closest lat/long location from the second dataset, and end up with something like (let's call it CensusBlkCombo)
ID CensusBlk
1 A1
2 B6
3 C7
4 A1
5 B6
("A1" is selected for ID #1 because 37/-120.3 is the closest location with a census block to 37/-120.7, the location of ID #1)
I have not been able to think of a way to do this merge using normal SAS code.
The SQL code that I have tried has been something like thisproc sql
create table CensusBlkCombo as
(select A.SP_ID,
(select CENSUS_BLK_GRP from
(select top 1 CENSUS_BLK_GRP
from CensusBlkExist B
order by (abs(B.sp_lat - A.sp_lat2) + abs(B.sp_long - A.sp_long2))
as CENSUS_BLK_GRP)
from CensusBlkMiss A));
That particular code gives me an error message, "TOP N Syntax error: Top N option is not supported in subquery".
I am not wedded to that setup, though - any suggestions are welcome, including a radically different code structure welcome too.
You might be able to do something with the geodist function although I've never used it and with such a large number of records to process i don't think any method is likely to be quick. The documentation for geodist is here
ChrisBrooks - thank you for your reply.
It looks to me like the geodist function would replace me ad hoc distance calculation (abs(B.sp_lat - A.sp_lat2) + abs(B.sp_long - A.sp_long2)) but is not set up to translate lat and long to census block.
Reeza - thank you for the reply as well. PROC GINSIDE is indeed able to generate the census blocks that I was seeking. And I was able to find census block maps for California at ftp://ftp2.census.gov/geo/tiger/TIGER2010BLKPOPHU/.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.