BookmarkSubscribeRSS Feed

Using SAS to Merge IPv4 and IPv6 Addresses to Geolocation CIDR Range Data

Started ‎09-04-2020 by
Modified ‎09-04-2020 by
Views 3,332

IP Addresses are all around us these days - they identify every device connected to the Internet or a computer network and are vital to our day-to-day Internet connections.

 

Originally, IP addresses were designed in the IPv4 standard. This is the format most people associate with IP addresses which accounts for most of the internet’s traffic to date. IPv4 addresses follow a dotted decimal format with four octet segments, where each segment contains 8 bits. IPv4 Example:

 

10.4.29.100

 

This IPv4 standard allows for 2^32 possible combinations (more than 4.29 billion) of unique IP addresses. This was all well and good… until the development and popularization of more and more smart devices - from fitness trackers and speakers to even water bottles and egg trays (though the reviews are mixed on the latter).

 

The founders of the Internet did not anticipate such an explosion of IP-reliant devices and a problem quickly loomed: what if we run out of unique IP addresses?

 

Enter IPv6 – the new IP address format that re-imagines and expands the capability of identifying every internet connection in the world.

 

IPv6 was first deployed in 1999 and stores individual addresses as a 128-bit number. This allows for the possibility of 2^128 unique combinations (or 340,282,366,920,938,463,463,374,607,431,768,211,456… a virtually inexhaustible amount). The format shifted from a dotted decimal to hexadecimal notation. Each IPv6 address consists of 8 hexadecimal segments, separated by semicolons. IPv6 Example:

 

2001:0db8:85a3:0000:0000:8a2e:0370:7334

 

The challenge that we run into now is that both IPv4 and IPv6 addresses are in use today. Therefore, any analytics that leverage IP address information should be developed with both IP formats in mind.

 

This article explores a use case of leveraging SAS code to perform a geolocation analysis using both IPv4 and IPv6 formats.

 

Geolocation

Geolocation.png

 

Now that we know how these devices are identified, say we want to know where they are located?

 

Leveraging IP addresses to estimate a device location is known as geolocation, and is utilized across many industries including cybersecurity, banking, travel, hospitality, and law enforcement. Most geolocation analyses rely on geolocation services or databases that contain a collection of IP addresses already mapped to geographical information.

 

There are a variety of geolocation services, APIs, and databases available across the Internet. Many of these require paid subscriptions or accounts, but some offer a limited number of records or API calls for free. This article uses the free Max Mind geolocation City GeoLite2 files - GeoLite2-City-Blocks-IPv4.csv, GeoLite2-City-Blocks-IPv6.csv, and GeoLite2-City-Blocks-Locations-en.csv. A free membership is required to download for non-commercial users. See the MaxMind site for details and restrictions.

 

SAS Code Solution

SAS_Code_Solution.jpg

 

The Max Mind IPv4 and IPv6 files are a mapping of network CIDR prefixes to geographical information. In order to merge CIDR IP address ranges to a list of specific IP addresses, you must first convert the IPv4 and IPv6 CIDR notation network codes into a common numeric or character representation that can be programmatically compared against the list of IP addresses with unknown locations.

 

To convert IP addresses into a usable and comparable format, we first convert the IPv4/IPv6 formats to their binary representation. There is a CIDR prefix code designated for each of the network IP addresses in the Max Mind data. The CIDR range prefix for IP addresses corresponds to the number of significant bits that are used to group a block of IP addresses together.

 

Therefore, we translate the IP network address to its corresponding binary bits and truncate the string based on the number of significant digits specified in the Max Mind City CSV file.

 

For example, the IPv4 CIDR range 143.44.64.0/19 which corresponds to Appleton, Wisconsin has a CIDR prefix of 19. The IP address is translated to its binary representation:

 

10001111001011000100000000000000

 

And the first 19 bits are kept in the ipv4_match column, which will define this specific range of IP addresses:

 

1000111100101100010

 

The code creates two datasets, with a ipv4_match/ipv6_match variable that contains these significant bits to define the specific geographical range of IP addresses.

 

*IPv4 conversion of City data to binary match;
data out.GEOIP_RANGE_CITY_IPV4;
	set out.GEOIP_RANGE_CITY_IPV4;

	*Parse network field into octets;
	array ip_part {4} 8. ;
  	do i = 1 to 4;
    	ip_part{i} = scan(network,i,'./');
  	end;

	*Store number following '/' as CIDR prefix;
  	prefix = scan(network,2,"/");

	b1 = put(ip_part1, binary8.);
	b2 = put(ip_part2, binary8.);
	b3 = put(ip_part3, binary8.);
	b4 = put(ip_part4, binary8.);

	*Concactenante all together;
	ipv4_binary_long = cats(b1,b2,b3,b4);

	*Keep only the prefix number of characters from the left;
	ipv4_match = substr(ipv4_binary_long,1,prefix);

drop b1-b4;
run;

*IPv6 conversion of City data from hexidecimal to binary match;
data out.GEOIP_RANGE_CITY_IPV6;
	set out.GEOIP_RANGE_CITY_IPV6;

	*Store number following '/' as CIDR prefix;
  	prefix = scan(network,2,"/");

	*Separate out IPv6 octets;
	ipv6 = substrn(network,1,(find(network,"/")-1));
	length ipv6_1-ipv6_8 $4.;
	ipv6_1 = scan(ipv6,1,":",'m');
	ipv6_2 = scan(ipv6,2,":",'m');
	ipv6_3 = scan(ipv6,3,":",'m');
	ipv6_4 = scan(ipv6,4,":",'m');
	ipv6_5 = scan(ipv6,5,":",'m');
	ipv6_6 = scan(ipv6,6,":",'m');
	ipv6_7 = scan(ipv6,7,":",'m');
	ipv6_8 = scan(ipv6,8,":",'m');

	*Pad with leading 0's;
	length ipv6_h_1-ipv6_h_8 $4.;
	ipv6_h_1 = translate(right(ipv6_1),"0"," ");
	ipv6_h_2 = translate(right(ipv6_2),"0"," ");
	ipv6_h_3 = translate(right(ipv6_3),"0"," ");
	ipv6_h_4 = translate(right(ipv6_4),"0"," ");
	ipv6_h_5 = translate(right(ipv6_5),"0"," ");
	ipv6_h_6 = translate(right(ipv6_6),"0"," ");
	ipv6_h_7 = translate(right(ipv6_7),"0"," ");
	ipv6_h_8 = translate(right(ipv6_8),"0"," ");

	temp1 = put(ipv6_h_1, $ascii8.);
	temp2 = put(ipv6_h_2, $ascii8.);
	temp3 = put(ipv6_h_3, $ascii8.);
	temp4 = put(ipv6_h_4, $ascii8.);
	temp5 = put(ipv6_h_5, $ascii8.);
	temp6 = put(ipv6_h_6, $ascii8.);
	temp7 = put(ipv6_h_7, $ascii8.);
	temp8 = put(ipv6_h_8, $ascii8.);

	hex1 = input(temp1, $hex8.);
	hex2 = input(temp2, $hex8.);
	hex3 = input(temp3, $hex8.);
	hex4 = input(temp4, $hex8.);
	hex5 = input(temp5, $hex8.);
	hex6 = input(temp6, $hex8.);
	hex7 = input(temp7, $hex8.);
	hex8 = input(temp8, $hex8.);

	bin1 = put(hex1, $binary16.);
	bin2 = put(hex2, $binary16.);
	bin3 = put(hex3, $binary16.);
	bin4 = put(hex4, $binary16.);
	bin5 = put(hex5, $binary16.);
	bin6 = put(hex6, $binary16.);
	bin7 = put(hex7, $binary16.);
	bin8 = put(hex8, $binary16.);

	*Concactenante all binary bits together;
	ipv6_binary_long = cats(bin1,bin2,bin3,bin4,bin5,bin6,bin7,bin8);

	*Keep only the prefix number of characters from the left;
	ipv6_match = substr(ipv6_binary_long,1,prefix);

drop ipv6_1-ipv6_8 temp1-temp8 hex1-hex8 bin1-bin8;
run;

 

Test The Approach on a Random Subset of IP Addresses

Now that we have the Max Mind City file information translated to the IP address binary forms, how do we merge a list of IPs to estimate the geolocation information?

You can perform the same conversion of IPv4/IPv6 address to binary, then utilize PROC SQL code to return the records where the IP address matches the network grouping.

 

Step 3 below conducts the merge to the CIDR range binary representation datasets, where the IP address matches the start of the CIDR IPv4/IPv6 match column.

 

*(1);
*Create list of random IP addresses - both IPv4 and IPv6;
data out.random_ips;
length ip_address $50;
input ip_address;
datalines;
51.192.190.95
185.135.172.90
5.32.186.62
134.64.43.233
220.74.53.146
154.51.193.135
219.81.63.176
111.185.25.30
111.154.246.127
191.228.165.2
2001:4645:800:9ff5:d4e6:cd83:4654:0cb4
2600:1702:2a10:e334:d802:2234:2464:ecef
2601:680:cb00:b349:791b:47da:480d:f208
2a02:2898:e161:1019:547a:78ff:8e37:74e5
2a00:23c6:3100:3cf6:6334:0b4d:1b4d:a66f
2a04:bdc4:6b65:ee4d:bcfb:ba96:427c:c521
240f:97:8a00:3e3a:e44e:38f4:6fc4:d31f
2607:f2c0:ea80:17a9:282c:bed8:0308:5d51
2601:342:2c0:9983:1f6a:614d:8973:12a9
2409:4070:2e20:08c9:fc8a:3313:7eba:47f8
;
run;


*(2);
*Convert list of IPs to binary format;
* - For both IPv4 and IPv6         ;
data out.IP_list_converted;
	set out.random_ips;

ipv6_flag = 0;
if find(ip_address,":")>0 then ipv6_flag = 1;

if ipv6_flag = 0 then do;
	*Parse network field into octets;
	array ip_part {4} 8. ;
  	do i = 1 to 4;
    	ip_part{i} = scan(ip_address,i,'./');
  	end;

	ipv4_1 = put(ip_part1, binary8.);
	ipv4_2 = put(ip_part2, binary8.);
	ipv4_3 = put(ip_part3, binary8.);
	ipv4_4 = put(ip_part4, binary8.);

	*Concactenante all together;
	ipv4_binary = cats(ipv4_1,ipv4_2,ipv4_3,ipv4_4);
end;
else if ipv6_flag = 1 then do;
	*Seperate out octets;
	length ipv6_1-ipv6_8 $4.;
	ipv6_1 = scan(ip_address,1,":",'m');
	ipv6_2 = scan(ip_address,2,":",'m');
	ipv6_3 = scan(ip_address,3,":",'m');
	ipv6_4 = scan(ip_address,4,":",'m');
	ipv6_5 = scan(ip_address,5,":",'m');
	ipv6_6 = scan(ip_address,6,":",'m');
	ipv6_7 = scan(ip_address,7,":",'m');
	ipv6_8 = scan(ip_address,8,":",'m');

	*Pad with leading 0's;
	length ipv6_h_1-ipv6_h_8 $4.;
	ipv6_h_1 = translate(right(ipv6_1),"0"," ");
	ipv6_h_2 = translate(right(ipv6_2),"0"," ");
	ipv6_h_3 = translate(right(ipv6_3),"0"," ");
	ipv6_h_4 = translate(right(ipv6_4),"0"," ");
	ipv6_h_5 = translate(right(ipv6_5),"0"," ");
	ipv6_h_6 = translate(right(ipv6_6),"0"," ");
	ipv6_h_7 = translate(right(ipv6_7),"0"," ");
	ipv6_h_8 = translate(right(ipv6_8),"0"," ");

	temp1 = put(ipv6_h_1, $ascii8.);
	temp2 = put(ipv6_h_2, $ascii8.);
	temp3 = put(ipv6_h_3, $ascii8.);
	temp4 = put(ipv6_h_4, $ascii8.);
	temp5 = put(ipv6_h_5, $ascii8.);
	temp6 = put(ipv6_h_6, $ascii8.);
	temp7 = put(ipv6_h_7, $ascii8.);
	temp8 = put(ipv6_h_8, $ascii8.);

	hex1 = input(temp1, $hex8.);
	hex2 = input(temp2, $hex8.);
	hex3 = input(temp3, $hex8.);
	hex4 = input(temp4, $hex8.);
	hex5 = input(temp5, $hex8.);
	hex6 = input(temp6, $hex8.);
	hex7 = input(temp7, $hex8.);
	hex8 = input(temp8, $hex8.);

	bin1 = put(hex1, $binary16.);
	bin2 = put(hex2, $binary16.);
	bin3 = put(hex3, $binary16.);
	bin4 = put(hex4, $binary16.);
	bin5 = put(hex5, $binary16.);
	bin6 = put(hex6, $binary16.);
	bin7 = put(hex7, $binary16.);
	bin8 = put(hex8, $binary16.);

	*Concactenante all together;
	ipv6_binary = cats(bin1,bin2,bin3,bin4,bin5,bin6,bin7,bin8);
end;

drop i ipv4_1-ipv4_4 ip_part1-ip_part4 ipv6_1-ipv6_8 ipv6_h_1-ipv6_h_8 temp1-temp8 hex1-hex8 bin1-bin8;
run;

* (3) ;
* Merge IPv4 and IPv6 geolocation lookups to IP list;
proc sql;
create table out.IP_LIST_MERGED as
	select IP.*,
		coalescec(GEO.network, GEO2.network) as network,
		coalescec(GEO.postal_code, GEO2.postal_code) as postal_code,
		coalesce(GEO.latitude, GEO2.latitude) as latitude,
		coalesce(GEO.longitude, GEO2.longitude) as longitude,
		coalesce(GEO.geoname_id, GEO2.geoname_id) as geoname_id,
		coalescec(GEO.locale_code, GEO2.locale_code) as locale_code,
		coalescec(GEO.continent_code, GEO2.continent_code) as continent_code,
		coalescec(GEO.continent_name, GEO2.continent_name) as continent_name,
		coalescec(GEO.country_iso_code, GEO2.country_iso_code) as country_iso_code,
		coalescec(GEO.country_name, GEO2.country_name) as country_name,
		coalescec(GEO.subdivision_1_iso_code, GEO2.subdivision_1_iso_code) as subdivision_1_iso_code,
		coalescec(GEO.subdivision_1_name, GEO2.subdivision_1_name) as subdivision_1_name,
		coalescec(GEO.subdivision_2_iso_code, GEO2.subdivision_2_iso_code) as subdivision_2_iso_code,
		coalescec(GEO.subdivision_2_name, GEO2.subdivision_2_name) as subdivision_2_name,
		coalescec(GEO.city_name, GEO2.city_name) as city_name,
		coalesce(GEO.metro_code, GEO2.metro_code) as metro_code,
		coalescec(GEO.time_zone, GEO2.time_zone) as time_zone,		
		coalesce(GEO.is_in_european_union, GEO2.is_in_european_union) as is_in_european_union
	from out.IP_list_converted as IP
		left join out.GEOIP_RANGE_CITY_IPV4 as GEO on (strip(IP.ipv4_binary) LIKE catt(ipv4_match,'%'))
		left join out.GEOIP_RANGE_CITY_IPV6 as GEO2 on (strip(IP.ipv6_binary) LIKE catt(ipv6_match,'%'));
quit;

 

Note on Efficiency:

When running this code in SAS 9.4, you will receive the following note in your SAS Log:

 

NOTE: The execution of this query involves performing one or more Cartesian product joins that cannot be optimized.


If running this code against a large list of IP addresses, it may take some time to complete. CAS multi-threaded processing could be used instead to increase processing efficiency.

 

Comments

Hello,

Thanks so much for sharing the SAS code.

When I try to run your code, I don't know where I can find the input SAS data sets named "out.GEOIP_RANGE_CITY_IPV4" and "out.GEOIP_RANGE_CITY_IPV6".

 

Could you shed some lights on it?

 

I'm really appreciative of your help.

Regards,

Ethan

 

data out.GEOIP_RANGE_CITY_IPV4;
set out.GEOIP_RANGE_CITY_IPV4;

….

 

data out.GEOIP_RANGE_CITY_IPV6;
set out.GEOIP_RANGE_CITY_IPV6;

...

 

Hello Ethan @t75wez1 ,

 

Once you have the MaxMind data saved as CSV files and in a single folder location, you can use SAS code to import the CSVs into SAS data sets then merge the city location details into both the IPv4 and IPv6 SAS data sets.

 

Import and merge code is below, you will need to update the paths listed before running.

 

 


libname geo "<insert input data path here>";
libname out "<insert output data path here>";

*City IPv4 data;
DATA geo.CITY_BLOCKS_IPV4;
    LENGTH
        network          $ 18
        geoname_id         8
        registered_country_geoname_id   8
        represented_country_geoname_id   8
        is_anonymous_proxy   8
        is_satellite_provider   8
        postal_code      $ 8
        latitude           8
        longitude          8
        accuracy_radius    8 ;
    FORMAT
        network          $CHAR18.
        geoname_id       BEST8.
        registered_country_geoname_id BEST7.
        represented_country_geoname_id BEST7.
        is_anonymous_proxy BEST1.
        is_satellite_provider BEST1.
        postal_code      $CHAR8.
        latitude         BEST8.
        longitude        BEST9.
        accuracy_radius  BEST4. ;
    INFORMAT
        network          $CHAR18.
        geoname_id       BEST8.
        registered_country_geoname_id BEST7.
        represented_country_geoname_id BEST7.
        is_anonymous_proxy BEST1.
        is_satellite_provider BEST1.
        postal_code      $CHAR8.
        latitude         BEST8.
        longitude        BEST9.
        accuracy_radius  BEST4. ;
    INFILE '<insert input data path here>\GeoLite2-City-Blocks-IPv4.csv'
		FIRSTOBS=2
        DLM=','
        MISSOVER
        DSD ;
    INPUT
        network          : $CHAR18.
        geoname_id       : ?? BEST8.
        registered_country_geoname_id : ?? BEST7.
        represented_country_geoname_id : ?? BEST7.
        is_anonymous_proxy : ?? BEST1.
        is_satellite_provider : ?? BEST1.
        postal_code      : $CHAR8.
        latitude         : ?? COMMA8.
        longitude        : ?? COMMA9.
        accuracy_radius  : ?? BEST4. ;
RUN;

*City IPv6 Data;
DATA geo.CITY_BLOCKS_IPV6;
    LENGTH
        network          $ 43
        geoname_id         8
        registered_country_geoname_id   8
        represented_country_geoname_id $ 1
        is_anonymous_proxy   8
        is_satellite_provider   8
        postal_code      $ 8
        latitude           8
        longitude          8
        accuracy_radius    8 ;
    FORMAT
        network          $CHAR43.
        geoname_id       BEST8.
        registered_country_geoname_id BEST7.
        represented_country_geoname_id $CHAR1.
        is_anonymous_proxy BEST1.
        is_satellite_provider BEST1.
        postal_code      $CHAR8.
        latitude         BEST8.
        longitude        BEST9.
        accuracy_radius  BEST4. ;
    INFORMAT
        network          $CHAR43.
        geoname_id       BEST8.
        registered_country_geoname_id BEST7.
        represented_country_geoname_id $CHAR1.
        is_anonymous_proxy BEST1.
        is_satellite_provider BEST1.
        postal_code      $CHAR8.
        latitude         BEST8.
        longitude        BEST9.
        accuracy_radius  BEST4. ;
    INFILE '<insert input data path here>\GeoLite2-City-Blocks-IPv6.csv'
		FIRSTOBS=2
        DLM=','
        MISSOVER
        DSD ;
    INPUT
        network          : $CHAR43.
        geoname_id       : ?? BEST8.
        registered_country_geoname_id : ?? BEST7.
        represented_country_geoname_id : $CHAR1.
        is_anonymous_proxy : ?? BEST1.
        is_satellite_provider : ?? BEST1.
        postal_code      : $CHAR8.
        latitude         : ?? COMMA8.
        longitude        : ?? COMMA9.
        accuracy_radius  : ?? BEST4. ;
RUN;

*City Locations data;
DATA geo.CITY_LOCATIONS_EN;
    LENGTH
        geoname_id         8
        locale_code      $ 2
        continent_code   $ 2
        continent_name   $ 13
        country_iso_code $ 2
        country_name     $ 44
        subdivision_1_iso_code $ 3
        subdivision_1_name $ 52
        subdivision_2_iso_code $ 3
        subdivision_2_name $ 39
        city_name        $ 72
        metro_code         8
        time_zone        $ 30
        is_in_european_union   8 ;
    FORMAT
        geoname_id       BEST8.
        locale_code      $CHAR2.
        continent_code   $CHAR2.
        continent_name   $CHAR13.
        country_iso_code $CHAR2.
        country_name     $CHAR44.
        subdivision_1_iso_code $CHAR3.
        subdivision_1_name $CHAR52.
        subdivision_2_iso_code $CHAR3.
        subdivision_2_name $CHAR39.
        city_name        $CHAR72.
        metro_code       BEST3.
        time_zone        $CHAR30.
        is_in_european_union BEST1. ;
    INFORMAT
        geoname_id       BEST8.
        locale_code      $CHAR2.
        continent_code   $CHAR2.
        continent_name   $CHAR13.
        country_iso_code $CHAR2.
        country_name     $CHAR44.
        subdivision_1_iso_code $CHAR3.
        subdivision_1_name $CHAR52.
        subdivision_2_iso_code $CHAR3.
        subdivision_2_name $CHAR39.
        city_name        $CHAR72.
        metro_code       BEST3.
        time_zone        $CHAR30.
        is_in_european_union BEST1. ;
    INFILE '<insert input data path here>\GeoLite2-City-Locations-en.csv'
		FIRSTOBS=2
        DLM=','
        MISSOVER
        DSD ;
    INPUT
        geoname_id       : ?? BEST8.
        locale_code      : $CHAR2.
        continent_code   : $CHAR2.
        continent_name   : $CHAR13.
        country_iso_code : $CHAR2.
        country_name     : $CHAR44.
        subdivision_1_iso_code : $CHAR3.
        subdivision_1_name : $CHAR52.
        subdivision_2_iso_code : $CHAR3.
        subdivision_2_name : $CHAR39.
        city_name        : $CHAR72.
        metro_code       : ?? BEST3.
        time_zone        : $CHAR30.
        is_in_european_union : ?? BEST1. ;
RUN;

*Merge in City information before further processing;
proc sql;
create table out.geoip_range_city_ipv4 as
	select 	ip.*,
		    locale_code,
	        continent_code,  
	        continent_name,  
	        country_iso_code, 
	        country_name, 
	        subdivision_1_iso_code,
	        subdivision_1_name,
	        subdivision_2_iso_code,
	        subdivision_2_name,
	        city_name,
	        metro_code,  
	        time_zone,
	        is_in_european_union 
	from geo.city_blocks_ipv4 as ip
	left join geo.city_locations_en as city on (ip.geoname_id = city.geoname_id)
;
create table out.geoip_range_city_ipv6 as
	select 	ip.*,
		    locale_code,
	        continent_code,  
	        continent_name,  
	        country_iso_code, 
	        country_name, 
	        subdivision_1_iso_code,
	        subdivision_1_name,
	        subdivision_2_iso_code,
	        subdivision_2_name,
	        city_name,
	        metro_code,  
	        time_zone,
	        is_in_european_union 
	from geo.city_blocks_ipv6 as ip
	left join geo.city_locations_en as city on (ip.geoname_id = city.geoname_id)
;
quit;

 

Hope that helps!

Agata

 

Version history
Last update:
‎09-04-2020 01:06 PM
Updated by:
Contributors

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Labels
Article Tags