Hello all,
I am about intermediate level with SAS programming and beginner level with SQL. I am having difficulty subsetting my dataset by certain criteria. The dataset contains patients who have had cancer either once or multiple times. However, if the patients FIRST cancer was not pancreatic, then they need to be deleted. This part I have done. The second step is to get rid of the remaining patients who are in the dataset with any type of cancer, but never had a first cancer listed as pancreatic cancer. One goal of this project is to take a look at patients who had pancreatic cancer as their first cancer, and for those who have had multiple cancers, find out what other types of cancers they got after their pancreatic cancer diagnosis.
One way I was trying to do this was by Patient_ID, but I am not aware of any DISTINCT equivalency statements in SAS and I don't know SQL that well.
Below is the code I used to get rid of those whose first cancer was not pancreatic. It is also subset by histology type as I do not want patients with certain histological types of pancreatic cancer:
DATA pancreas_temp1; SET pancreas;
IF Histologic_Type_ICD_O_3 not in (8150, 8151, 8152, 8153, 8155, 8240, 8241, 8246)
AND
Sequence_number IN ("One primary only", "1st of 2 or more primaries")
then DELETE;
IF Primary_Site___labeled not =: 'C25'
AND
Sequence_number IN ("One primary only", "1st of 2 or more primaries")
then DELETE;
RUN;
Here is a sample of my dataset. I have sorted it by Patient_ID which is what revealed the problem - patients with no "Only one primary" or "1st of 2 or more primaries" are in this dataset:
Patient_ID | Sequence_number | Primary_site__labeled | Histologic_Type_ICD_O_3 |
16 | 2nd of 2 or more primaries | C20.9 - Rectum, NOS | 8140 |
18 | 2nd of 2 or more primaries | C38.1-Anterior mediastinum | 8801 |
22 | 2nd of 2 or more primaries | C42.1-Bone marrow | 9732 |
22 | 3rd of 3 or more primaries | C50.9-Breast, NOS | 8500 |
25 | One primary only | C25.2-Tail of pancreas | 8246 |
31 | 4th of 4 or more primaries | C50.4-Upper-outer quadrant of breast | 8500 |
31 | 5th of 5 or more primaries | C34.1-Upper lobe, lung | 8070 |
32 | 1st of 2 or more primaries | C25.0-Head of pancreas | 8155 |
Here is what I want:
Patient_ID | Sequence_number | Primary_site__labeled | Histologic_Type_ICD_O_3 |
25 | One primary only | C25.2-Tail of pancreas | 8246 |
32 | 1st of 2 or more primaries | C25.0-Head of pancreas | 8155 |
32 | 2nd of 2 or more primaries | C64.9-Kidney, NOS | 8312 |
Patients 16, 18, 22, and 31 need to go because it is apparent that their first cancer is not listed so we can't tell if it was pancreatic or not.
Any help is appreciated. Thank you!
You can do it all in one go:
data HAVE;
input ID $ SEQ &: $50. SITE &: $50. TYPE;
cards;
16 1st of 2 or more primaries C20.9 - Rectum, NOS 8140
16 2nd of 2 or more primaries C20.9 - Rectum, NOS 8140
18 1st of 2 or more primaries C20.9 - Rectum, NOS 8140
18 2nd of 2 or more primaries C38.1-Anterior mediastinum 8801
22 1st of 2 or more primaries C20.9 - Rectum, NOS 8140
22 2nd of 2 or more primaries C42.1-Bone marrow 9732
22 3rd of 3 or more primaries C50.9-Breast, NOS 8500
25 One primary only C25.2-Tail of pancreas 8246
31 4th of 4 or more primaries C50.4-Upper-outer quadrant of breast 8500
31 5th of 5 or more primaries C34.1-Upper lobe, lung 8070
32 1st of 2 or more primaries C25.0-Head of pancreas 8155
32 2nd of 2 or more primaries C64.9-Kidney, NOS 8312
run;
data WANT;
retain KEEP;
set HAVE;
by ID;
if first.ID then KEEP=( SEQ in:('One', '1st') and index(SITE ,'ancreas'));
if KEEP;
run;
ID | SEQ | SITE |
---|---|---|
25 | One primary only | C25.2-Tail of pancreas |
32 | 1st of 2 or more primaries | C25.0-Head of pancreas |
32 | 2nd of 2 or more primaries | C64.9-Kidney, NOS |
You write you're already able to select the patient_id's you want. Below sample code shows you two ways how you then can select all rows in a table matching a list of such patient_id's.
data have;
do pat_id=1 to 100;
do row=1 to 5;
if ranuni(1)>0.2 then diag='Other';
else diag='Panc';
output;
end;
end;
run;
/**
option 1: Using SAS Data Step
**/
data pat_want(keep=pat_id);
set have;
by pat_id;
/* selection criteria */
if row=1 and diag='Panc';
run;
data want1;
if _n_=1 then
do;
/* lookup table with list of unique pat_id's matching selection criteria */
dcl hash h1(dataset:'pat_want');
h1.defineKey('pat_id');
h1.defineDone();
end;
set have;
/* select all pat_id's from have with match to lookup table */
if h1.check()=0 then output;
run;
/*
Option 2: SQL
*/
proc sql;
create table want2 as
select h.*
from
have h
inner join
(
/* list of unique pat_id's matching selection criteria */
select distinct pat_id
from have
where row=1 and diag='Panc'
) s
on h.pat_id=s.pat_id
;
quit;
You can do it all in one go:
data HAVE;
input ID $ SEQ &: $50. SITE &: $50. TYPE;
cards;
16 1st of 2 or more primaries C20.9 - Rectum, NOS 8140
16 2nd of 2 or more primaries C20.9 - Rectum, NOS 8140
18 1st of 2 or more primaries C20.9 - Rectum, NOS 8140
18 2nd of 2 or more primaries C38.1-Anterior mediastinum 8801
22 1st of 2 or more primaries C20.9 - Rectum, NOS 8140
22 2nd of 2 or more primaries C42.1-Bone marrow 9732
22 3rd of 3 or more primaries C50.9-Breast, NOS 8500
25 One primary only C25.2-Tail of pancreas 8246
31 4th of 4 or more primaries C50.4-Upper-outer quadrant of breast 8500
31 5th of 5 or more primaries C34.1-Upper lobe, lung 8070
32 1st of 2 or more primaries C25.0-Head of pancreas 8155
32 2nd of 2 or more primaries C64.9-Kidney, NOS 8312
run;
data WANT;
retain KEEP;
set HAVE;
by ID;
if first.ID then KEEP=( SEQ in:('One', '1st') and index(SITE ,'ancreas'));
if KEEP;
run;
ID | SEQ | SITE |
---|---|---|
25 | One primary only | C25.2-Tail of pancreas |
32 | 1st of 2 or more primaries | C25.0-Head of pancreas |
32 | 2nd of 2 or more primaries | C64.9-Kidney, NOS |
This is fantastic! I am unfamiliar with some of the syntax you used, but I will figure it out. For now, it looks like it worked. My original dataset is over 1 million observations long and now, after implementing your code, it's only 6500+ long. That seems about right as another study was done drawing data from SEER (same place I got my data) and they ended up with only 3300+ pancreatic cancer patients, but they also weren't including subsequent cancers of other non-pancreatic types like I did. I just need to check it over and make sure before I accept as final solution.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.