Re: Information Check with IN LIST

TC_ · Posted 02-19-2022 10:34 PM

Hello,

I have the following datasets "INPUT" and "IN2"

data INPUT;
	input hist $;
	cards;
	100
	105
	110
	110
;
run;

data IN2;
    length hist $15.;
	input hist $ label;
	cards;
	100-105,108 1
	110 3
;
run;

My goal is to get “label” of IN2 by matching "hist". "hist" of IN2 is given in a compact format with ranges and individual values. 100-105 means 100,101 ... 105.

Inspired by another post, I was trying to use a DATA NULL step within a DATA step and convert "hist" into a list so I can check whether a value is in that list. Below is what I was trying to do after all things get resolved:

if hist in (100-105,108) then label = IN2.label
if hist in (110) then label = IN2.label

Here is my attempt:

data test;
	set input;
	RC=dosubl(cats('data _null_;                                     '
                ,'  set in2; '
				,'call symput("hist_list",hist);'
				,'if ', hist,  ' in (&hist_list) then do;'
				,'call symputx("lab",label);' 
				,'stop;'
				,'call symputx("lab"," ");'
				,'run;'));
	label=symget('lab');
run;

Here's one piece (out of 4) of error message:

ERROR 22-322: Syntax error, expecting one of the following: a name, a quoted string,
a numeric constant, a datetime constant, a missing value, INPUT, PUT.
ERROR 22-322: Syntax error, expecting one of the following: +, =.
ERROR 202-322: The option or parameter is not recognized and will be ignored.
ERROR: Undeclared array referenced: if110in.
WARNING: Apparent symbolic reference HIST_LIST not resolved.
ERROR: Variable if110in has not been declared as an array.

Can someone help me troubleshoot this?

Thanks!

SASKiwi · Posted 02-20-2022 01:10 AM

Looks like you don't know the correct syntax of the IN operator or the IF statement:

if hist in (100-105,108) then label = IN2.label
if hist in (110) then label = IN2.label

This is correct syntax but is nonsense code because LABEL is being assigned to itself:

if hist in ('100','101','102','103','104','105','108') then label = label;
if hist in ('110') then label = label;

Note HIST is a character variable so when comparing it to a constant like 100, it needs to be enclosed in quotes.

I have no idea why you are trying to code this via DOSUBL. Code a solution that works without DOSUBL before burying it in that function.

TC_ · Posted 02-20-2022 01:49 AM

I know my attempt might look convoluted and unnecessary. It's because the datasets I have are more complicated than what I have shown. The sample data are simplified to show just enough details so people understand what type of data I am working with. For IN2, I wrote "100-105,108" and I actually have something like "9500-9580,9599,9610-9650,9652,9655" and on. And I have over 100 rows and 3 columns of data that are of a similar format.

Thank you for the input, I get all your point but it's not the solution I am looking for. If you have a smarter way than my attempt, I'd love to hear it.

Kurt_Bremser · Posted 02-20-2022 06:04 AM

Spread your IN2 dataset out vertically, so you can use it as a CNTLIN dataset for PROC FORMAT. Then apply that format, either to the variable or in a PUT function.

data input;
input hist $;
cards;
100
105
110
110
;

data  in2;
length hist $15.;
input hist label;
cards;
100-105,108 1
110 3
;

data cntlin;
set in2 end=done;
fmtname = "$check";
do i1 = 1 to countw(hist,",");
  substring = scan(hist,i1,",");
  if indexc(substring,"-")
  then do i2 = input(scan(substring,1,"-"),3.) to input(scan(substring,2,"-"),3.);
    start = put(i2,3.);
    output;
  end;
  else do;
    start = substring;
    output;
  end;
end;
if done
then do;
  label = .;
  start = "OTHER";
  hlo = "O";
  output;
end;
keep start label fmtname hlo;
run;

proc format cntlin=cntlin;
run;

data want;
set input;
label = put(hist,$check.);
run;

Similarly, you can use the vertical dataset for a JOIN/MERGE or as input to a hash object. As usual, see Maxim 19: Long Beats Wide.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

TC_ · Posted 02-20-2022 12:19 PM

Thank you for the input, Kurt. I did try to expand my IN2 dataset. But my actual IN2 dataset is actually a lot more complicated than the sample I have provided. After expanding all the ranges as you did, my IN2 dataset has over 400 million rows which is really hard to work with. And that's why I try to work with what's in IN2 as is.

Kurt_Bremser · Posted 02-20-2022 03:09 PM

If you have much wider ranges in your in2, you can use the end values as start and end in the CNTLIN dataset to define ranges. Be careful with ranges with character values, as "1500" will fall in the range "150" to "200". This can't happen with numeric values.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

SASKiwi · Posted 02-20-2022 03:44 PM

@Kurt_Bremser is definitely pointing you in the right direction regarding the use of ranges in PROC FORMAT's CNTLIN option for creating user-defined formats. Please check out the documentation that I've linked to so you can understand how it works better. You will need to split ranges and single values into separate rows for example.

TC_ · Posted 02-20-2022 04:27 PM

Thank you both for the responses. Creating a format does work for the purpose of labeling the data when a single column of data is used to determine the label. And unfortunately the data I work with needs 3 columns to determine the corresponding label (Sorry I should have made this clear enough in my question). For example

if Col1=10 & Col2=11 & Col3=10 then label =1;
if Col1=10 & Col2=12 & Col3=10 then label =2;

So applying a format as you two suggested does not work that well for this situation. But I agree that, when dealing with a single column, a format is a more straightforward and efficient approach.

SASKiwi · Posted 02-20-2022 05:20 PM

You can still use a PROC FORMAT lookup with multiple columns by concatenating:

label = put(strip(col1) !! strip(col2) !! strip(col3), $Lookup.);

Do col1, col2 and col3 all have ranges or just col1? How many unique values are there for col1, col2, col3 combined?

TC_ · Posted 02-20-2022 08:23 PM

Hmm wow this opens up new possibilities... They all have ranges and individual values, separated by commas. Such as col1 may have 9800,9810-9820,9830-9835,9840,9850-9900 and col2 may have 0,2-3,5-10 etc.

SASKiwi · Posted 02-20-2022 09:45 PM

I would help if you posted more representative data for your three columns so we can get a clearer picture.

TC_ · Posted 02-20-2022 10:32 PM

My bad.

This is some actual data I have:

data input;
	length ICD_O_3_Site $20.;
	length ICD_O_3_Histology $20.;
	length ICD_O_3_Behavior $10.;
	input ICD_O_3_Site $ ICD_O_3_Histology $ ICD_O_3_Behavior $;
	cards;
	C000 9835 3
	C420 9811 3
	C001 9831 3
	C300 9383 0
	C723 9380 1
;
run;

The IN2 dataset is attached in Excel format, you may directly import it. I wanted to get the "Extended Classification" for the cases listed above.

In case you wonder, ICD_O_3_Site all starts with a "C" and it's safe to be removed when matching.

SASKiwi · Posted 02-20-2022 10:57 PM

Can you supply the IN2 as DATALINES? Opening unknown spreadsheets is a security risk.

TC_ · Posted 02-20-2022 11:06 PM

OK here is a small portion:

data IN2;
	length ICD_O_3_Site $20.;
	length ICD_O_3_Histology $100.;
	length ICD_O_3_Behavior $10.;
	input ICD_O_3_Site $ ICD_O_3_Histology $ ICD_O_3_Behavior $ Extended_Classification $;
	cards;
	C000-C809 9835-9836 3 001
	C420-C421,C424 9811-9818,9837 3 001
	C001 9831 3 003
	C000-C809 9383,9391-9394 0,1,3 017
	C723 9380,9384,9400-9411,9420-9424 0,1,3 019
;
run;

Tom · Posted 02-20-2022 11:45 PM

I highly recommend taking the time to build a categorized list of valid codes and use that.

The abbreviated strings you currently using as the IN2 dataset could be saved and used as a DESCRIPTIONS you could print for HUMANS to understand what the categories are. That is what the XSLX file you posted is intended for.

That said to split that data into something you could program from turn each range into LOW/HIGH pair.

Then you task of categorization because a simple join or you could create a FORMAT.

data IN2;
  length row 8 category $32 item 8 low high $5 ;
  row+1;
  do category='ICD_O_3_Site','ICD_O_3_Histology','ICD_O_3_Behavior' ;
    input list :$200. @;
    do item=1 to countw(list,',');
      range=scan(list,item,',');
      low=scan(range,1,'-');
      high=scan(range,-1,'-');
      output;
    end;
  end;
  drop list range;
cards;
C000-C809      9835-9836                     3     001
C420-C421,C424 9811-9818,9837                3     001
C001           9831                          3     003
C000-C809      9383,9391-9394                0,1,3 017
C723           9380,9384,9400-9411,9420-9424 0,1,3 019
;

Results

Obs    row        category         item    low     high

  1     1     ICD_O_3_Site           1     C000    C809
  2     1     ICD_O_3_Histology      1     9835    9836
  3     1     ICD_O_3_Behavior       1     3       3
  4     2     ICD_O_3_Site           1     C420    C421
  5     2     ICD_O_3_Site           2     C424    C424
  6     2     ICD_O_3_Histology      1     9811    9818
  7     2     ICD_O_3_Histology      2     9837    9837
  8     2     ICD_O_3_Behavior       1     3       3
  9     3     ICD_O_3_Site           1     C001    C001
 10     3     ICD_O_3_Histology      1     9831    9831
 11     3     ICD_O_3_Behavior       1     3       3
 12     4     ICD_O_3_Site           1     C000    C809
 13     4     ICD_O_3_Histology      1     9383    9383
 14     4     ICD_O_3_Histology      2     9391    9394
 15     4     ICD_O_3_Behavior       1     0       0
 16     4     ICD_O_3_Behavior       2     1       1
 17     4     ICD_O_3_Behavior       3     3       3
 18     5     ICD_O_3_Site           1     C723    C723
 19     5     ICD_O_3_Histology      1     9380    9380
 20     5     ICD_O_3_Histology      2     9384    9384
 21     5     ICD_O_3_Histology      3     9400    9411
 22     5     ICD_O_3_Histology      4     9420    9424
 23     5     ICD_O_3_Behavior       1     0       0
 24     5     ICD_O_3_Behavior       2     1       1
 25     5     ICD_O_3_Behavior       3     3       3

SAS Innovate 2025: Call for Content

Classroom Training Available!