Solved: Combining observations

kmaths · Posted 02-26-2015 05:10 PM

I have a dataset where observations are identified by two factors: geographic area, and grid. Geographic area is represented by three variables: 1. A 3-letter code for that geographic area; 2. A 4-digit numerical ID for that geographic area; and 3. The full name of the geographic area (i.e. “Summer County”). The grid is represented by a 4 or 5-digit number. The grid number is really the crucial ID variable, with all other information (case count, relative risk, etc) being linked to the grid number.

Most of the time, a grid falls cleanly within a geographic area. However, occasionally, the grid overlaps two adjacent geographic areas. When this occurs, in the current dataset, the information (case count, etc) simply repeats, so that there are two identical rows, differing only in that one row specifies one geographic area, and the other row specifies the other geographic area. For example, if grid #54321 falls into geographic area ABC *and* geographic area XYZ, while grid #11111 fits in only 1 geographic area, it looks like:

Grid # Geographic Code Geographic ID Geographic Name Case count RR etc

54321 ABC 1000 Summer County 3 1.5

54321 XYZ 1001 Winter County 3 1.5

11111 DEF 1002 Fall County 2 1

Note that the critical information, such as case count and RR, is the same for a particular grid, even if that grid covers two geographic areas. I created this file by merging a SAS dataset listing the case count, RR, etc per grid number (after using SAS to calculate all that) with a file specifying the geographic code(s), ID(s), and name(s) that correspond with each grid number.

The problem is that it is confusing for the people who will receive this data to recognize that a grid includes two different geographic areas if the information is on two rows. What we would like to do is combine the information into one row, for example:

Grid # Geographic Code Geographic ID Geographic Name Case count RR etc

54321 ABC/XYZ 1000/1001 Summer County/Winter County 3 1.5

11111 DEF 1002 Fall County 2 1

Does anyone have any idea on how I would go about accomplishing this? If needed, I could probably afford to drop the geographic ID variable and the geographic name variables, and somehow reintroduce those later – it is most imperative that the geographic codes get combined.

Currently, my only thought is to try to find out of all the grids (there are approximately 2400), which ones fall into two geographic areas, create a master file in Excel that lists these with a combined geographic name, geographic ID, and geographic code, and then merge that with the file containing RR, case count, etc per grid number, instead of using the original file, but I am wondering if there is some way to do it all within SAS instead.

Please note (in case it is relevant) that this is a very simplified version - in reality, most geographic areas have multiple grids assigned to them.

Thank you very much for any help – I sincerely appreciate it.

PGStats · Posted 02-26-2015 11:07 PM

It is fairly simple to do with the CATX function and a DO UNTIL() loop. It also extends to the case where more than two geographic areas are involved :

data have;

length GeographicCode GeographicID GeographicName $16;

input GridId GeographicCode GeographicID GeographicName & CaseCount RR;

datalines;

54321 ABC 1000 Summer County 3 1.5

54321 XYZ 1001 Winter County 3 1.5

54321 RST 1002 Spring County 3 1.5

11111 DEF 1002 Fall County 2 1

;

proc sort data=have; by gridId GeographicCode; run;

data want;

length NewGeographicCode NewGeographicID NewGeographicName $50;

do until(last.GridId);

set have; by GridId;

NewGeographicCode = catx("/", NewGeographicCode, GeographicCode);

NewGeographicID = catx("/", NewGeographicID, GeographicID);

NewGeographicName = catx("/", NewGeographicName, GeographicName);

end;

drop GeographicCode GeographicID GeographicName;

rename

NewGeographicCode=GeographicCode

NewGeographicID=GeographicID

NewGeographicName=GeographicName;

run;

proc print data=want noobs;

var GridId GeographicCode GeographicID GeographicName CaseCount RR;

run;

PG

View solution in original post

sbb · Posted 02-26-2015 05:24 PM

Consider using PROC TRANSPOSE based on unique GRID value(s) and horizontal-ize your variable data-values -- then in a subsequent DATA step, concatenate the SAS-generated variables combining text-data as needed.

Scott Barry

SBBWorks, Inc.

billfish · Posted 02-26-2015 07:51 PM

A point of view.

This reminds me of postal zip codes spanning 2 or more counties.
At some companies, a set of pricing models are functions of zipcodes and another set of pricing models are functions of counties. Much confusion.
Even some zip9 (xxxxx-yyyy) span more than 1 county.

In the following, let (i1,j1,k1) be 3 different sets of geographical locations (say, zipcode, counties, FIPS).

/**************************/
/*** simulate some data ***/
/**************************/
data t_a(keep=i1 j1 k1);
do i = 1 to 125;
     i1 = ceil(20*ranuni(3));
     j1 = ceil(20*ranuni(3));
     k1 = ceil(20*ranuni(3));
output;
end;
run;

data t_b;
do i1 = 1 to 20;
sales = round(100*ranuni(3),2);
output;
end;
run;

proc sort data=t_a; by i1; run;

data t_a;
   merge t_a(in=a) t_b(in=b);
   by i1;
   if a then output;
run;

part of table t_a looks like:

=================================
i1 j1 k1 sales

1     1    11      56
2    10    10      92
2    19    18      92
2    10    20      92
2    16     7      92
. . .
. . .
=================================

/***************************/
/*** a possible solution ***/
/***************************/
data t_b(keep=i1 all_id sales);
set t_a;
by i1;
length all_id $150.;
retain all_id;

if first.i1 then do; all_id=''; end;
c1 = '('||put(j1,z2.)||'-'||put(k1,z2.)||')';
all_id = catx('+',all_id, c1);
if last.i1 then do; output; end;
run;

part of table t_b looks like:
========================================================
i1 sales all_id

1      56     (01-11)
2      92     (10-10)+(19-18)+(10-20)+(16-07)
3      20     (17-07)+(12-09)+(13-05)+(10-18)+(12-02)+(08-04)
4      78     (10-12)+(11-16)+(09-01)+(01-06)+(15-03)+(03-18)+(19-06)+(18-14)+(08-01)
5      76     (14-18)+(07-03)+(03-19)+(07-01)+(18-06)+(07-19)+(18-01)
. . .
. . .
========================================================

PGStats · Posted 02-26-2015 11:07 PM

It is fairly simple to do with the CATX function and a DO UNTIL() loop. It also extends to the case where more than two geographic areas are involved :

data have;

length GeographicCode GeographicID GeographicName $16;

input GridId GeographicCode GeographicID GeographicName & CaseCount RR;

datalines;

54321 ABC 1000 Summer County 3 1.5

54321 XYZ 1001 Winter County 3 1.5

54321 RST 1002 Spring County 3 1.5

11111 DEF 1002 Fall County 2 1

;

proc sort data=have; by gridId GeographicCode; run;

data want;

length NewGeographicCode NewGeographicID NewGeographicName $50;

do until(last.GridId);

set have; by GridId;

NewGeographicCode = catx("/", NewGeographicCode, GeographicCode);

NewGeographicID = catx("/", NewGeographicID, GeographicID);

NewGeographicName = catx("/", NewGeographicName, GeographicName);

end;

drop GeographicCode GeographicID GeographicName;

rename

NewGeographicCode=GeographicCode

NewGeographicID=GeographicID

NewGeographicName=GeographicName;

run;

proc print data=want noobs;

var GridId GeographicCode GeographicID GeographicName CaseCount RR;

run;

PG

kmaths · Posted 02-27-2015 10:36 AM

Thank you for the input everyone. The solution from PGStats worked really simply and gave the result I was looking for - thank you so much for your help!

Combining observations

Re: Combining observations

Re: Combining observations

Re: Combining observations

Re: Combining observations

Re: Combining observations

Combining observations

Re: Combining observations

Re: Combining observations

Re: Combining observations

Re: Combining observations

Re: Combining observations

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away