I am working with a data set of water samples that were tested for the presence of legionella using PCR. I most cases multiple samples were collected from different sites in the same building and all of the rows of the table contain a building ID indicating which building they are from. I am looking to subset the data by building ID, build a new row of data for each subset consisting of the combined stats from all the rows of the subset, and then merge the newly created rows into a new data set. I know I can subset the data by using "by Building_ID" but I am totally lost on creating and merging a new data set from those subsets.
For reference my data looks something like this:
| Sample ID | Building ID | Facility Type | Legionella Present | Legionella Load | Building Floors |
| 1 | 1 | 1 | 1 | 500 | 1 |
| 2 | 1 | 1 | 0 | 0 | 1 |
| 3 | 2 | 2 | 0 | 0 | 3 |
| 4 | 3 | 2 | 1 | 6000 | 5 |
| 5 | 3 | 2 | 0 | 0 | 5 |
| 6 | 3 | 2 | 0 | 0 | 5 |
| 7 | 4 | 1 | 1 | 200 | 2 |
| 8 | 4 | 1 | 1 | 400 | 2 |
And I need my output to look something like this:
| Building ID | Legionella Present | Legionella Load Max | Legionella Load Min | Legionella Load Average | Building Floors |
| 1 | 1 | 500 | 0 | 250 | 1 |
| 2 | 0 | 0 | 0 | 0 | 3 |
| 3 | 1 | 6000 | 0 | 2000 | 5 |
| 4 | 1 | 400 | 200 | 300 | 2 |
You can also try simple proc sql query.
/* Created sample data here */
proc import datafile='C:\Users\Dharmu\Desktop\Minu SAS Training\Water.xlsx' out=have dbms=xlsx replace;run;
proc sort data=have;by building_id Legionella_Load;run;
proc sql;
create table want as
select Building_ID,
max(Legionella_Present) as Legionella_Present,
min(Legionella_Load) as Min_Legionella_Load,
max(Legionella_Load) as Max_Legionella_Load,
Avg(Legionella_Load) as Avg_Legionella_Load,
max(building_floors) as building_floors
from have group by Building_ID;
quit;
proc print data=want;run;
Assuming that Legionella_Present and Legionella_Load are numeric:
proc summary data=work.have nway;
class Building_ID;
id Building_Floors;
var Legionella_Present Legionella_Load;
output out= work.calc(drop= _type_ _freq_) max= min(Legionella_Load)= mean(Legionella_Load)= /autoname;
run;
You can also try simple proc sql query.
/* Created sample data here */
proc import datafile='C:\Users\Dharmu\Desktop\Minu SAS Training\Water.xlsx' out=have dbms=xlsx replace;run;
proc sort data=have;by building_id Legionella_Load;run;
proc sql;
create table want as
select Building_ID,
max(Legionella_Present) as Legionella_Present,
min(Legionella_Load) as Min_Legionella_Load,
max(Legionella_Load) as Max_Legionella_Load,
Avg(Legionella_Load) as Avg_Legionella_Load,
max(building_floors) as building_floors
from have group by Building_ID;
quit;
proc print data=want;run;
You could potentially use a PROC REPORT function to get the data pretty close to what you want.
DATA WORK.QA;
INFILE DATALINES;
INPUT SampleID
BuildingID
FacilityType
LegionellaPresent
LegioellaLoad
Buildingfloors
;
DATALINES;
1 1 1 1 500 1
2 1 1 0 0 1
3 2 2 0 0 3
4 3 2 1 6000 5
5 3 2 0 0 5
6 3 2 0 0 5
7 4 1 1 200 2
8 4 1 1 400 2
;
So based off the data that you gave us, I read the data in by using datalines, so the names of the variables are listed above. After that I used a PROC REPORT function to get the table.
PROC REPORT DATA = WORK.QA
SPLIT = "/";
COLUMN BuildingID LegionellaPresent LegioellaLoad,(MAX MIN MEAN) Buildingfloors;
DEFINE BuildingID / GROUP "Building/ID";
DEFINE LegionellaPresent / "Legionella/Present";
DEFINE Buildingfloors / GROUP "Building/floors";
DEFINE LegioellaLoad / "Legioella Load";
RUN;
So here I am creating a table with the columns as the variable names for BuildingID LegionellaPresent LegioellaLoad, Buildingfloors. For LegioellaLoad I asked SAS to find the max, min, and mean.
Then I grouped the data by BuildingID and Buildingfloors, and renamed all the variables, so that they somewhat matched the description above.
At the end, the table should look something like this.
| 1 | 1 | 500 | 0 | 250 | 1 |
| 2 | 0 | 0 | 0 | 0 | 3 |
| 3 | 1 | 6000 | 0 | 2000 | 5 |
| 4 | 2 | 400 | 200 | 300 | 2 |
Hope this helps!
data have;
infile datalines dlm="09"x;
input Sample_ID Building_ID Facility_Type Legionella_Present Legionella_Load
Building_Floors;
datalines;
1 1 1 1 500 1
2 1 1 0 0 1
3 2 2 0 0 3
4 3 2 1 6000 5
5 3 2 0 0 5
6 3 2 0 0 5
7 4 1 1 200 2
8 4 1 1 400 2
;
run;
data want(keep=Building_ID
Legionella_Present_max
Legionella_load_max
Legionella_load_min
Legionella_load_Average
Building_Floors rename=(Legionella_Present_max=Legionella_Present)
);
*keep variables ordered using length statement;
length Building_ID
Legionella_Present_max
Legionella_load_max
Legionella_load_min
Legionella_load_Average
Building_Floors 8;
*define arrays to calculate min and max;
array l_[10];
array p_[10];
do _N_=1 by 1 until(last.building_id);
set have;
by building_id;
*input Legionella_Load and Legionella_Present into array by obs;
l_[_N_]=Legionella_Load;
p_[_N_]=Legionella_Present;
*calculate max and min using of keyword;
Legionella_load_max=max(of l_[*]);
Legionella_load_min=min(of l_[*]);
Legionella_Present_max=max(of p_[*]);
if first.building_id
then
sum=.;
*sum statement;
sum+Legionella_Load;
Legionella_load_Average=sum/_N_;
end;
*drop Legionella_Present variable and rename it in data step rename option;
drop Legionella_Present;
run;
Thank you very much for your help everyone I ended up using the "proc sql" since it seemed like it would be the easiest to repurpose and add extra data to for my other analyses.
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
Still thinking about your presentation idea? The submission deadline has been extended to Friday, Nov. 14, at 11:59 p.m. ET.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.