BookmarkSubscribeRSS Feed
TC_
Obsidian | Level 7 TC_
Obsidian | Level 7

Hello,

I have the following datasets "INPUT" and "IN2" 

data INPUT;
	input hist $;
	cards;
	100
	105
	110
	110
;
run;

data IN2;
length hist $15.; input hist $ label; cards; 100-105,108 1 110 3 ; run;

My goal is to get “label” of IN2 by matching "hist". "hist" of IN2 is given in a compact format with ranges and individual values. 100-105 means 100,101 ... 105. 

Inspired by another post, I was trying to use a DATA NULL step within a DATA step and convert "hist" into a list so I can check whether a value is in that list. Below is what I was trying to do after all things get resolved:

if hist in (100-105,108) then label = IN2.label
if hist in (110) then label = IN2.label

Here is my attempt:

data test;
	set input;
	RC=dosubl(cats('data _null_;                                     '
                ,'  set in2; '
				,'call symput("hist_list",hist);'
				,'if ', hist,  ' in (&hist_list) then do;'
				,'call symputx("lab",label);' 
				,'stop;'
				,'call symputx("lab"," ");'
				,'run;'));
	label=symget('lab');
run;

 

Here's one piece (out of 4) of error message:

ERROR 22-322: Syntax error, expecting one of the following: a name, a quoted string,
a numeric constant, a datetime constant, a missing value, INPUT, PUT.
ERROR 22-322: Syntax error, expecting one of the following: +, =.
ERROR 202-322: The option or parameter is not recognized and will be ignored.
ERROR: Undeclared array referenced: if110in.
WARNING: Apparent symbolic reference HIST_LIST not resolved.
ERROR: Variable if110in has not been declared as an array.

 

Can someone help me troubleshoot this?

Thanks!

 

16 REPLIES 16
SASKiwi
PROC Star

Looks like you don't know the correct syntax of the IN operator or the IF statement:

if hist in (100-105,108) then label = IN2.label
if hist in (110) then label = IN2.label

This is correct syntax but is nonsense code because LABEL is being assigned to itself:

if hist in ('100','101','102','103','104','105','108') then label = label;
if hist in ('110') then label = label;

Note HIST is a character variable so when comparing it to a constant like 100, it needs to be enclosed in quotes.

 

I have no idea why you are trying to code this via DOSUBL. Code a solution that works without DOSUBL before burying it in that function.

TC_
Obsidian | Level 7 TC_
Obsidian | Level 7

I know my attempt might look convoluted and unnecessary. It's because the datasets I have are more complicated than what I have shown. The sample data are simplified to show just enough details so people understand what type of data I am working with. For IN2, I wrote "100-105,108" and I actually have something like "9500-9580,9599,9610-9650,9652,9655" and on. And I have over 100 rows and 3 columns of data that are of a similar format.

Thank you for the input, I get all your point but it's not the solution I am looking for. If you have a smarter way than my attempt, I'd love to hear it.

Kurt_Bremser
Super User

Spread your IN2 dataset out vertically, so you can use it as a CNTLIN dataset for PROC FORMAT. Then apply that format, either to the variable or in a PUT function.

data input;
input hist $;
cards;
100
105
110
110
;

data  in2;
length hist $15.;
input hist label;
cards;
100-105,108 1
110 3
;

data cntlin;
set in2 end=done;
fmtname = "$check";
do i1 = 1 to countw(hist,",");
  substring = scan(hist,i1,",");
  if indexc(substring,"-")
  then do i2 = input(scan(substring,1,"-"),3.) to input(scan(substring,2,"-"),3.);
    start = put(i2,3.);
    output;
  end;
  else do;
    start = substring;
    output;
  end;
end;
if done
then do;
  label = .;
  start = "OTHER";
  hlo = "O";
  output;
end;
keep start label fmtname hlo;
run;

proc format cntlin=cntlin;
run;

data want;
set input;
label = put(hist,$check.);
run;

Similarly, you can use the vertical dataset for a JOIN/MERGE or as input to a hash object. As usual, see Maxim 19: Long Beats Wide.

 

 

TC_
Obsidian | Level 7 TC_
Obsidian | Level 7

Thank you for the input, Kurt. I did try to expand my IN2 dataset. But my actual IN2 dataset is actually a lot more complicated than the sample I have provided. After expanding all the ranges as you did, my IN2 dataset has over 400 million rows which is really hard to work with. And that's why I try to work with what's in IN2 as is.

Kurt_Bremser
Super User

If you have much wider ranges in your in2, you can use the end values as start and end in the CNTLIN dataset to define ranges. Be careful with ranges with character values, as "1500" will fall in the range "150" to "200". This can't happen with numeric values.

SASKiwi
PROC Star

@Kurt_Bremser is definitely pointing you in the right direction regarding the use of ranges in PROC FORMAT's CNTLIN option for creating user-defined formats. Please check out the documentation that I've linked to so you can understand how it works better. You will need to split ranges and single values into separate rows for example. 

TC_
Obsidian | Level 7 TC_
Obsidian | Level 7

Thank you both for the responses. Creating a format does work for the purpose of labeling the data when a single column of data is used to determine the label. And unfortunately the data I work with needs 3 columns to determine the corresponding label (Sorry I should have made this clear enough in my question). For example 

if Col1=10 & Col2=11 & Col3=10 then label =1;
if Col1=10 & Col2=12 & Col3=10 then label =2;

So applying a format as you two suggested does not work that well for this situation. But I agree that, when dealing with a single column, a format is a more straightforward and efficient approach.

 

SASKiwi
PROC Star

You can still use a PROC FORMAT lookup with multiple columns by concatenating:

label = put(strip(col1) !! strip(col2) !! strip(col3), $Lookup.);

 Do col1, col2 and col3 all have ranges or just col1? How many unique values are there for col1, col2, col3 combined?

TC_
Obsidian | Level 7 TC_
Obsidian | Level 7
Hmm wow this opens up new possibilities... They all have ranges and individual values, separated by commas. Such as col1 may have 9800,9810-9820,9830-9835,9840,9850-9900 and col2 may have 0,2-3,5-10 etc.
SASKiwi
PROC Star

I would help if you posted more representative data for your three columns so we can get a clearer picture.

TC_
Obsidian | Level 7 TC_
Obsidian | Level 7

My bad. 

This is some actual data I have:

 

data input;
	length ICD_O_3_Site $20.;
	length ICD_O_3_Histology $20.;
	length ICD_O_3_Behavior $10.;
	input ICD_O_3_Site $ ICD_O_3_Histology $ ICD_O_3_Behavior $;
	cards;
	C000 9835 3
	C420 9811 3
	C001 9831 3
	C300 9383 0
	C723 9380 1
;
run;

The IN2 dataset is attached in Excel format, you may directly import it. I wanted to get the "Extended Classification" for the cases listed above.

 

In case you wonder,  ICD_O_3_Site all starts with a "C" and it's safe to be removed when matching.

SASKiwi
PROC Star

Can you supply the IN2 as DATALINES? Opening unknown spreadsheets is a security risk.

TC_
Obsidian | Level 7 TC_
Obsidian | Level 7

OK here is a small portion:

data IN2;
	length ICD_O_3_Site $20.;
	length ICD_O_3_Histology $100.;
	length ICD_O_3_Behavior $10.;
	input ICD_O_3_Site $ ICD_O_3_Histology $ ICD_O_3_Behavior $ Extended_Classification $;
	cards;
	C000-C809 9835-9836 3 001
	C420-C421,C424 9811-9818,9837 3 001
	C001 9831 3 003
	C000-C809 9383,9391-9394 0,1,3 017
	C723 9380,9384,9400-9411,9420-9424 0,1,3 019
;
run;
Tom
Super User Tom
Super User

I highly recommend taking the time to build a categorized list of valid codes and use that.

The abbreviated strings you currently using as the IN2 dataset could be saved and used as a DESCRIPTIONS you could print for HUMANS to understand what the categories are.  That is what the XSLX file you posted is intended for.

 

That said to split that data into something you could program from turn each range into LOW/HIGH pair.

Then you task of categorization because a simple join or you could create a FORMAT.

 

data IN2;
  length row 8 category $32 item 8 low high $5 ;
  row+1;
  do category='ICD_O_3_Site','ICD_O_3_Histology','ICD_O_3_Behavior' ;
    input list :$200. @;
    do item=1 to countw(list,',');
      range=scan(list,item,',');
      low=scan(range,1,'-');
      high=scan(range,-1,'-');
      output;
    end;
  end;
  drop list range;
cards;
C000-C809      9835-9836                     3     001
C420-C421,C424 9811-9818,9837                3     001
C001           9831                          3     003
C000-C809      9383,9391-9394                0,1,3 017
C723           9380,9384,9400-9411,9420-9424 0,1,3 019
;

Results

Obs    row        category         item    low     high

  1     1     ICD_O_3_Site           1     C000    C809
  2     1     ICD_O_3_Histology      1     9835    9836
  3     1     ICD_O_3_Behavior       1     3       3
  4     2     ICD_O_3_Site           1     C420    C421
  5     2     ICD_O_3_Site           2     C424    C424
  6     2     ICD_O_3_Histology      1     9811    9818
  7     2     ICD_O_3_Histology      2     9837    9837
  8     2     ICD_O_3_Behavior       1     3       3
  9     3     ICD_O_3_Site           1     C001    C001
 10     3     ICD_O_3_Histology      1     9831    9831
 11     3     ICD_O_3_Behavior       1     3       3
 12     4     ICD_O_3_Site           1     C000    C809
 13     4     ICD_O_3_Histology      1     9383    9383
 14     4     ICD_O_3_Histology      2     9391    9394
 15     4     ICD_O_3_Behavior       1     0       0
 16     4     ICD_O_3_Behavior       2     1       1
 17     4     ICD_O_3_Behavior       3     3       3
 18     5     ICD_O_3_Site           1     C723    C723
 19     5     ICD_O_3_Histology      1     9380    9380
 20     5     ICD_O_3_Histology      2     9384    9384
 21     5     ICD_O_3_Histology      3     9400    9411
 22     5     ICD_O_3_Histology      4     9420    9424
 23     5     ICD_O_3_Behavior       1     0       0
 24     5     ICD_O_3_Behavior       2     1       1
 25     5     ICD_O_3_Behavior       3     3       3

 

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 16 replies
  • 2224 views
  • 5 likes
  • 4 in conversation