BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
mayasak
Quartz | Level 8

Hello,

 

I have a dataset about cancer. SEER has some conditions in defining brain tumors as benign.

 

For other tumors such as breast, cervix, and other cancers I used: 

if Seer_site_group in ( 26000 ) and beh=2
then CASite67= 1901 ; *Breast, In Situ;
if Seer_site_group in ( 26000 ) and beh=3
then CASite67= 1902 ; *Breast, Malignant;

if Seer_site_group in ( 27010 )
then CASite67= 2000 ; *Cervix;
if Seer_site_group in ( 27020)
then CASite67= 2101 ; *Corpus Uteri;
if Seer_site_group in ( 27030 )
then CASite67= 2102 ; *Uteri, NOS;
if Seer_site_group in ( 27040 )
then CASite67= 2200 ; *Ovary;

 

For brain cancer I used:

if Seer_site_group in ( 31010 ) and beh=2
then CASite67= 3101 ; *Brain, Benign;
if p in ( 31010 ) and beh=3
then CASite67= 3102 ; *Brain, Malignant;

 

Then I realized that brain tumor in the "Seer_site_group" with code (32020) captures more than just C751-C753.  So the code should involve something like that:

 

IF "beh" = 0 or 1; AND Primary_site = C751, C752, C753; AND histology_icdo3 NOT = 9050-9055, 9140, 9590-9992;  THEN = "Benign Brain"

but I'm not sure how to integrate that in one SAS doable statement along with the other cancer statements.

 

Thank you.


brain.benign .JPG
1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

@mayasak wrote:

Thank you ote that histology_icdo3 is a character value too with a length of exactly 4 for all of them.


This can be dealt with easily:

 

For example, if you know for sure that this character variable always ("always" in the relevant observations, not necessarily in general!) contains 4 digits (from '0' to '9'), you could replace the corresponding condition by

 

input(histology_icdo3, 4.) not in (9050:9055 9140 9590:9992)

If, however, there is a possibility that some of the codes are alphanumeric and, e.g., '959A' is regarded as belonging to the range "9590-9992," you could use inequalities with regard to alphabetical order:

 

 

not ('9050'<=histology_icdo3<='9055' | histology_icdo3='9140' | '9590'<=histology_icdo3<='9992')

This is equivalent to ballardw's format approach.

 

View solution in original post

14 REPLIES 14
mayasak
Quartz | Level 8

Sorry I forgot that I had this statement too regarding the question.

 

if Seer_site_group in ( 32020 )
then CASite67= 3400 ; *Other Endocrine including Thymus;

 

Thanks

FreelanceReinh
Jade | Level 19

Hello @mayasak,

 

Syntactically correct would be this:

if beh in (0 1) and primary_site in ('C751' 'C752' 'C753') 
  and histology_icdo3 not in (9050:9055 9140 9590:9992) then CASite67 = 3101; *Brain, Benign;

(assuming that primary_site is a character variable with values such as those listed and histology_icdo3 is numeric, so that the abbreviations for ranges of integers like 1:5 for 1 2 3 4 5 are valid).

 

Please adapt and combine this with a criterion about Seer_site_group if necessary and appropriate. It seems confusing to me that you write 'brain tumor in the "Seer_site_group" with code (32020)' while code 32020 is rather subordinate to the endocrine system.

mayasak
Quartz | Level 8

Thank you ote that histology_icdo3 is a character value too with a length of exactly 4 for all of them.

FreelanceReinh
Jade | Level 19

@mayasak wrote:

Thank you ote that histology_icdo3 is a character value too with a length of exactly 4 for all of them.


This can be dealt with easily:

 

For example, if you know for sure that this character variable always ("always" in the relevant observations, not necessarily in general!) contains 4 digits (from '0' to '9'), you could replace the corresponding condition by

 

input(histology_icdo3, 4.) not in (9050:9055 9140 9590:9992)

If, however, there is a possibility that some of the codes are alphanumeric and, e.g., '959A' is regarded as belonging to the range "9590-9992," you could use inequalities with regard to alphabetical order:

 

 

not ('9050'<=histology_icdo3<='9055' | histology_icdo3='9140' | '9590'<=histology_icdo3<='9992')

This is equivalent to ballardw's format approach.

 

mayasak
Quartz | Level 8

I ran this as you've stated:

 

if Seer_site_group in ( 31010 32020 ) and beh in (0 1) and primary_site in ('C751' 'C752' 'C753') and input(histology_icdo3, 4.) not in (9050:9055 9140 9590:9992)
then CASite67= 3101 ; *Brain, Benign;
if Seer_site_group in ( 31010 ) and beh=3
then CASite67= 3102 ; *Brain, Malignant;

 

there was no error in the log however, when I ran proc freq for casite67,  3101 was missing in the table. 

2900 97
3000 338
3102 1109
3200 2477

 

I also ran the proc freq with: 

 

if Seer_site_group in ( 31010 32020 ) and beh in (0 1) and primary_site in ('C751' 'C752' 'C753')

then CASite67= 3101 ; *Brain, Benign;

 

and had the same problem.

 

It worked only with:

 

if Seer_site_group in ( 31010 32020 ) and beh in (0 1)

then CASite67= 3101 ; *Brain, Benign;

 

2900 97
3000 338
3101 285
3102 1109
3200 2477

 

Thank you

 

ballardw
Super User

If you have a condition that involves multiple comparisons then you may be looking at using the IN operator.

Basically you provide the list of values inside paratheses:

 

If X in (23 24 25 26) then (do something)

or for character comparisons:

If y in ('23' '24' '25' '26') then ...

You can use NOT as well

If y not in ('23' '24' '25' '26') then ...

If you want a sequence of values that are numeric (actual numbers not character values that look like numbers) you can use

 

if X in (3:6  9 15) to match values of 3 to 6 and 9 and 15.

So IF "beh" = 0 or 1; AND Primary_site = C751, C752, C753"

would become

 

If beh in (0,1) and Primary_site in ('C751' 'C752' 'C753') ...

 

I suspect you 9050-9055, 9140, 9590-9992 values are likely to be character values and even writing an IN statement gets long and the range indicator : is not allowed for character comparisons.

 

You may be able to use a custom format. Please look at the following code, run it and examine the log.

proc format library=work;
value $code
'9050'-'9055', '9140', '9590'-'9992' = 'Match'
other ='No match';
run;

data _null_;
   length x $ 8.;
   do x= '9050','9051','9166','9593','95936','960';
      if put(x,$code.) = 'Match' then put "Found" x=;
   end;
run;

Note that you test a character value. But character value ranges basically will accept anything that matches the first 4 characters if that is all that are used in the defining Format. So 95936 is considered to be between 9590 and 9992 because the 95936 part is. Also the 3 character string matches the first 3 that appear in that range.

 

So depending on your actual range of values of your histology variable (basically all of them exactly 4 characters) you may be able to include a condition of

and put(histology_icd03,$code.)='No match' with the previous.

 

Or make a very long IN clause.

mayasak
Quartz | Level 8

Thank you 

proc format library=work; value $code '9050'-'9055', '9140', '9590'-'9992' = 'Match' other ='No match'; run; data _null_;length x $ 8.; do x= '9050','9051','9166','9593','95936','960'; if put(x,$code.) = 'Match' then put "Found" x=; end; run;

as it is after I read the data as follows:

 

libname temp "I:\cancer";

data CA_test;
set cancer_data;

run;

 

and then I added 

 

if Seer_site_group in (32020) and beh in (0,1) and Primary_site in ('C751' 'C752' 'C753') and (histology_icd03,$code.)='No match';
then CASite67= 3101 ; *Brain, Benign;

 

I have the following error:

 

if Seer_site_group in ( 32020 ) and beh in (0,1) and Primary_site in ('C751' 'C752' 'C753') and
2688! (histology_icd03 $code)='No match'
-
388
200
ERROR 388-185: Expecting an arithmetic operator.

ERROR 200-322: The symbol is not recognized and will be ignored.

 

I'm not sure what is done with the format statement and what has gone wrong. Do I have to run as it is or customize it somewhere according to my dataset.  

 

Thank you

 

FreelanceReinh
Jade | Level 19

You forgot to type the name of the PUT function:

put(histology_icd03, $code.)

 

Edit: ... among other things.

ballardw
Super User

Your error:

if Seer_site_group in ( 32020 ) and beh in (0,1) and Primary_site in ('C751' 'C752' 'C753') and
 (histology_icd03 $code)='No match'

Does not match the posted code:

 

if Seer_site_group in (32020) and beh in (0,1) and Primary_site in ('C751' 'C752' 'C753') and (histology_icd03,$code.)='No match';

Posted code

 

Missing Put(histology_ (etc.)) AND includes a ; before the then.

The posted error is missing Put, comma separating variable and format, and period at end of format .

mayasak
Quartz | Level 8

I rerun it with the correction. It went through with no errors but as I mentioned to  FreelanceReinhard in my last reply that 3101 was missing when I ran the proc freq for casite67:

 

2900   97

3000   338

3102   1109

3200   2477

3300   4882

 

It worked when only beh condition added.

Thank you a lot

FreelanceReinh
Jade | Level 19

We don't have your data. So, you'll have to check why the two IF conditions for casite67=3101 are never met. For example, you can apply the weak criterion that produces observations with casite67=3101, let's call the resulting dataset RESULT, and then apply PROC FREQ to variables primary_site and histology_icdo3. Example:

proc freq data=result;
where casite67=3101;
format _all_;
tables primary_site;
run;
mayasak
Quartz | Level 8

Freq results from:

 

proc freq data=result;
where casite67=3101;
format _all_;
tables beh primary_site histology_icdo3 /nocol nocum nopercent norow;
run;


Beh Frequency

0 197
1 215


primary_site

primary_site Frequency

C710  19
C711  38
C712  36
C713  21
C714  16
C715  35
C716  73
C717  32
C718 9
C719 133


histology_icdo3

histology_icdo3 Frequency

8000  132
8001  1
8680  1
8824  1
8850  2
9084  3
9120  31
9121  42
9122  1
9150  4
9161  53
9350  4
9351  4
9361  1
9383  20
9384  3
9390  18
9394  3
9400  2
9413  4
9440  2
9444  2
9490  1
9493  2
9505  23
9506  4
9540  5
9560  43

FreelanceReinh
Jade | Level 19

@mayasak wrote:

(...)
primary_site

primary_site Frequency

C710  19
C711  38
C712  36
C713  21
C714  16
C715  35
C716  73
C717  32
C718 9
C719 133


histology_icdo3

(...)


But you are no longer wondering why a criterion stronger than

primary_site in ('C751' 'C752' 'C753')

resulted in zero observations, are you?

mayasak
Quartz | Level 8

Sure I'm not Smiley Happy

Thank you

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 14 replies
  • 2746 views
  • 8 likes
  • 3 in conversation