I am working with a data set of water samples that were tested for the presence of legionella using PCR. I most cases multiple samples were collected from different sites in the same building and all of the rows of the table contain a building ID indicating which building they are from. I am looking to subset the data by building ID, build a new row of data for each subset consisting of the combined stats from all the rows of the subset, and then merge the newly created rows into a new data set. I know I can subset the data by using "by Building_ID" but I am totally lost on creating and merging a new data set from those subsets.
For reference my data looks something like this:
Sample ID | Building ID | Facility Type | Legionella Present | Legionella Load | Building Floors |
1 | 1 | 1 | 1 | 500 | 1 |
2 | 1 | 1 | 0 | 0 | 1 |
3 | 2 | 2 | 0 | 0 | 3 |
4 | 3 | 2 | 1 | 6000 | 5 |
5 | 3 | 2 | 0 | 0 | 5 |
6 | 3 | 2 | 0 | 0 | 5 |
7 | 4 | 1 | 1 | 200 | 2 |
8 | 4 | 1 | 1 | 400 | 2 |
And I need my output to look something like this:
Building ID | Legionella Present | Legionella Load Max | Legionella Load Min | Legionella Load Average | Building Floors |
1 | 1 | 500 | 0 | 250 | 1 |
2 | 0 | 0 | 0 | 0 | 3 |
3 | 1 | 6000 | 0 | 2000 | 5 |
4 | 1 | 400 | 200 | 300 | 2 |
You can also try simple proc sql query.
/* Created sample data here */
proc import datafile='C:\Users\Dharmu\Desktop\Minu SAS Training\Water.xlsx' out=have dbms=xlsx replace;run;
proc sort data=have;by building_id Legionella_Load;run;
proc sql;
create table want as
select Building_ID,
max(Legionella_Present) as Legionella_Present,
min(Legionella_Load) as Min_Legionella_Load,
max(Legionella_Load) as Max_Legionella_Load,
Avg(Legionella_Load) as Avg_Legionella_Load,
max(building_floors) as building_floors
from have group by Building_ID;
quit;
proc print data=want;run;
Assuming that Legionella_Present and Legionella_Load are numeric:
proc summary data=work.have nway;
class Building_ID;
id Building_Floors;
var Legionella_Present Legionella_Load;
output out= work.calc(drop= _type_ _freq_) max= min(Legionella_Load)= mean(Legionella_Load)= /autoname;
run;
You can also try simple proc sql query.
/* Created sample data here */
proc import datafile='C:\Users\Dharmu\Desktop\Minu SAS Training\Water.xlsx' out=have dbms=xlsx replace;run;
proc sort data=have;by building_id Legionella_Load;run;
proc sql;
create table want as
select Building_ID,
max(Legionella_Present) as Legionella_Present,
min(Legionella_Load) as Min_Legionella_Load,
max(Legionella_Load) as Max_Legionella_Load,
Avg(Legionella_Load) as Avg_Legionella_Load,
max(building_floors) as building_floors
from have group by Building_ID;
quit;
proc print data=want;run;
You could potentially use a PROC REPORT function to get the data pretty close to what you want.
DATA WORK.QA;
INFILE DATALINES;
INPUT SampleID
BuildingID
FacilityType
LegionellaPresent
LegioellaLoad
Buildingfloors
;
DATALINES;
1 1 1 1 500 1
2 1 1 0 0 1
3 2 2 0 0 3
4 3 2 1 6000 5
5 3 2 0 0 5
6 3 2 0 0 5
7 4 1 1 200 2
8 4 1 1 400 2
;
So based off the data that you gave us, I read the data in by using datalines, so the names of the variables are listed above. After that I used a PROC REPORT function to get the table.
PROC REPORT DATA = WORK.QA
SPLIT = "/";
COLUMN BuildingID LegionellaPresent LegioellaLoad,(MAX MIN MEAN) Buildingfloors;
DEFINE BuildingID / GROUP "Building/ID";
DEFINE LegionellaPresent / "Legionella/Present";
DEFINE Buildingfloors / GROUP "Building/floors";
DEFINE LegioellaLoad / "Legioella Load";
RUN;
So here I am creating a table with the columns as the variable names for BuildingID LegionellaPresent LegioellaLoad, Buildingfloors. For LegioellaLoad I asked SAS to find the max, min, and mean.
Then I grouped the data by BuildingID and Buildingfloors, and renamed all the variables, so that they somewhat matched the description above.
At the end, the table should look something like this.
1 | 1 | 500 | 0 | 250 | 1 |
2 | 0 | 0 | 0 | 0 | 3 |
3 | 1 | 6000 | 0 | 2000 | 5 |
4 | 2 | 400 | 200 | 300 | 2 |
Hope this helps!
data have;
infile datalines dlm="09"x;
input Sample_ID Building_ID Facility_Type Legionella_Present Legionella_Load
Building_Floors;
datalines;
1 1 1 1 500 1
2 1 1 0 0 1
3 2 2 0 0 3
4 3 2 1 6000 5
5 3 2 0 0 5
6 3 2 0 0 5
7 4 1 1 200 2
8 4 1 1 400 2
;
run;
data want(keep=Building_ID
Legionella_Present_max
Legionella_load_max
Legionella_load_min
Legionella_load_Average
Building_Floors rename=(Legionella_Present_max=Legionella_Present)
);
*keep variables ordered using length statement;
length Building_ID
Legionella_Present_max
Legionella_load_max
Legionella_load_min
Legionella_load_Average
Building_Floors 8;
*define arrays to calculate min and max;
array l_[10];
array p_[10];
do _N_=1 by 1 until(last.building_id);
set have;
by building_id;
*input Legionella_Load and Legionella_Present into array by obs;
l_[_N_]=Legionella_Load;
p_[_N_]=Legionella_Present;
*calculate max and min using of keyword;
Legionella_load_max=max(of l_[*]);
Legionella_load_min=min(of l_[*]);
Legionella_Present_max=max(of p_[*]);
if first.building_id
then
sum=.;
*sum statement;
sum+Legionella_Load;
Legionella_load_Average=sum/_N_;
end;
*drop Legionella_Present variable and rename it in data step rename option;
drop Legionella_Present;
run;
Thank you very much for your help everyone I ended up using the "proc sql" since it seemed like it would be the easiest to repurpose and add extra data to for my other analyses.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.