Solved: Re: Subsetting by unique patient IDs and other criteria

lady8506 · Posted 03-13-2019 11:33 AM

Hello all,

I am about intermediate level with SAS programming and beginner level with SQL. I am having difficulty subsetting my dataset by certain criteria. The dataset contains patients who have had cancer either once or multiple times. However, if the patients FIRST cancer was not pancreatic, then they need to be deleted. This part I have done. The second step is to get rid of the remaining patients who are in the dataset with any type of cancer, but never had a first cancer listed as pancreatic cancer. One goal of this project is to take a look at patients who had pancreatic cancer as their first cancer, and for those who have had multiple cancers, find out what other types of cancers they got after their pancreatic cancer diagnosis.

One way I was trying to do this was by Patient_ID, but I am not aware of any DISTINCT equivalency statements in SAS and I don't know SQL that well.

Below is the code I used to get rid of those whose first cancer was not pancreatic. It is also subset by histology type as I do not want patients with certain histological types of pancreatic cancer:

DATA pancreas_temp1; SET pancreas;
	IF Histologic_Type_ICD_O_3 not in (8150, 8151, 8152, 8153, 8155, 8240, 8241, 8246)
	AND
	Sequence_number IN ("One primary only", "1st of 2 or more primaries") 
	then DELETE;

	IF Primary_Site___labeled not =: 'C25' 
	AND 
	Sequence_number IN ("One primary only", "1st of 2 or more primaries") 
	then DELETE;
RUN;

Here is a sample of my dataset. I have sorted it by Patient_ID which is what revealed the problem - patients with no "Only one primary" or "1st of 2 or more primaries" are in this dataset:

Patient_ID	Sequence_number	Primary_site__labeled	Histologic_Type_ICD_O_3
16	2nd of 2 or more primaries	C20.9 - Rectum, NOS	8140
18	2nd of 2 or more primaries	C38.1-Anterior mediastinum	8801
22	2nd of 2 or more primaries	C42.1-Bone marrow	9732
22	3rd of 3 or more primaries	C50.9-Breast, NOS	8500
25	One primary only	C25.2-Tail of pancreas	8246
31	4th of 4 or more primaries	C50.4-Upper-outer quadrant of breast	8500
31	5th of 5 or more primaries	C34.1-Upper lobe, lung	8070
32	1st of 2 or more primaries	C25.0-Head of pancreas	8155

Here is what I want:

Patient_ID	Sequence_number	Primary_site__labeled	Histologic_Type_ICD_O_3
25	One primary only	C25.2-Tail of pancreas	8246
32	1st of 2 or more primaries	C25.0-Head of pancreas	8155
32	2nd of 2 or more primaries	C64.9-Kidney, NOS	8312

Patients 16, 18, 22, and 31 need to go because it is apparent that their first cancer is not listed so we can't tell if it was pancreatic or not.

Any help is appreciated. Thank you!

ChrisNZ · Posted 03-14-2019 05:25 PM

You can do it all in one go:

data HAVE;
  input ID $ SEQ &: $50. SITE  &: $50. TYPE;
 cards;
16  1st of 2 or more primaries  C20.9 - Rectum, NOS  8140
16  2nd of 2 or more primaries  C20.9 - Rectum, NOS  8140
18  1st of 2 or more primaries  C20.9 - Rectum, NOS  8140
18  2nd of 2 or more primaries  C38.1-Anterior mediastinum  8801
22  1st of 2 or more primaries  C20.9 - Rectum, NOS  8140
22  2nd of 2 or more primaries  C42.1-Bone marrow  9732
22  3rd of 3 or more primaries  C50.9-Breast, NOS  8500
25  One primary only  C25.2-Tail of pancreas  8246
31  4th of 4 or more primaries  C50.4-Upper-outer quadrant of breast  8500
31  5th of 5 or more primaries  C34.1-Upper lobe, lung  8070
32  1st of 2 or more primaries  C25.0-Head of pancreas  8155
32  2nd of 2 or more primaries  C64.9-Kidney, NOS  8312
run;

data WANT;
  retain KEEP;
  set HAVE;
  by ID;
  if first.ID then KEEP=( SEQ in:('One', '1st') and index(SITE ,'ancreas'));
  if KEEP;
run;

ID	SEQ	SITE
25	One primary only	C25.2-Tail of pancreas
32	1st of 2 or more primaries	C25.0-Head of pancreas
32	2nd of 2 or more primaries	C64.9-Kidney, NOS

High-Performance SAS Coding - Third Edition

View solution in original post

Patrick · Posted 03-14-2019 02:04 PM

@lady8506

You write you're already able to select the patient_id's you want. Below sample code shows you two ways how you then can select all rows in a table matching a list of such patient_id's.

data have;
  do pat_id=1 to 100;
    do row=1 to 5;
      if ranuni(1)>0.2 then diag='Other';
      else diag='Panc';
      output;
    end;
  end;
run;
  
/**
  option 1: Using SAS Data Step 
**/
data pat_want(keep=pat_id);
  set have;
  by pat_id;
  /* selection criteria */
  if row=1 and diag='Panc';
run;

data want1;
  if _n_=1 then 
    do;
      /* lookup table with list of unique pat_id's matching selection criteria */
      dcl hash h1(dataset:'pat_want');
      h1.defineKey('pat_id');
      h1.defineDone();
    end;
  set have;
  /* select all pat_id's from have with match to lookup table */
  if h1.check()=0 then output;
run;

/*
  Option 2: SQL
*/
proc sql;
  create table want2 as
    select h.*
    from 
      have h
      inner join
      (
        /* list of unique pat_id's matching selection criteria */
        select distinct pat_id
        from have 
        where row=1 and diag='Panc'
      ) s
      on h.pat_id=s.pat_id
    ;
quit;

ChrisNZ · Posted 03-14-2019 05:25 PM

You can do it all in one go:

data HAVE;
  input ID $ SEQ &: $50. SITE  &: $50. TYPE;
 cards;
16  1st of 2 or more primaries  C20.9 - Rectum, NOS  8140
16  2nd of 2 or more primaries  C20.9 - Rectum, NOS  8140
18  1st of 2 or more primaries  C20.9 - Rectum, NOS  8140
18  2nd of 2 or more primaries  C38.1-Anterior mediastinum  8801
22  1st of 2 or more primaries  C20.9 - Rectum, NOS  8140
22  2nd of 2 or more primaries  C42.1-Bone marrow  9732
22  3rd of 3 or more primaries  C50.9-Breast, NOS  8500
25  One primary only  C25.2-Tail of pancreas  8246
31  4th of 4 or more primaries  C50.4-Upper-outer quadrant of breast  8500
31  5th of 5 or more primaries  C34.1-Upper lobe, lung  8070
32  1st of 2 or more primaries  C25.0-Head of pancreas  8155
32  2nd of 2 or more primaries  C64.9-Kidney, NOS  8312
run;

data WANT;
  retain KEEP;
  set HAVE;
  by ID;
  if first.ID then KEEP=( SEQ in:('One', '1st') and index(SITE ,'ancreas'));
  if KEEP;
run;

ID	SEQ	SITE
25	One primary only	C25.2-Tail of pancreas
32	1st of 2 or more primaries	C25.0-Head of pancreas
32	2nd of 2 or more primaries	C64.9-Kidney, NOS

High-Performance SAS Coding - Third Edition

lady8506 · Posted 03-15-2019 09:34 AM

This is fantastic! I am unfamiliar with some of the syntax you used, but I will figure it out. For now, it looks like it worked. My original dataset is over 1 million observations long and now, after implementing your code, it's only 6500+ long. That seems about right as another study was done drawing data from SEER (same place I got my data) and they ended up with only 3300+ pancreatic cancer patients, but they also weren't including subsequent cancers of other non-pancreatic types like I did. I just need to check it over and make sure before I accept as final solution.

ChrisNZ · Posted 03-15-2019 05:18 PM

Ask if you get stuck on some of the syntax. Note that the expression giving its value to KEEP is just a test. So it yields a Boolean result.

High-Performance SAS Coding - Third Edition

Subsetting by unique patient IDs and other criteria

Re: Subsetting by unique patient IDs and other criteria

Re: Subsetting by unique patient IDs and other criteria

Re: Subsetting by unique patient IDs and other criteria

Re: Subsetting by unique patient IDs and other criteria

Re: Subsetting by unique patient IDs and other criteria

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away