Hello,
I have the following datasets "INPUT" and "IN2"
data INPUT;
input hist $;
cards;
100
105
110
110
;
run;
data IN2;
length hist $15.;
input hist $ label;
cards;
100-105,108 1
110 3
;
run;
My goal is to get “label” of IN2 by matching "hist". "hist" of IN2 is given in a compact format with ranges and individual values. 100-105 means 100,101 ... 105.
Inspired by another post, I was trying to use a DATA NULL step within a DATA step and convert "hist" into a list so I can check whether a value is in that list. Below is what I was trying to do after all things get resolved:
if hist in (100-105,108) then label = IN2.label
if hist in (110) then label = IN2.label
Here is my attempt:
data test;
set input;
RC=dosubl(cats('data _null_; '
,' set in2; '
,'call symput("hist_list",hist);'
,'if ', hist, ' in (&hist_list) then do;'
,'call symputx("lab",label);'
,'stop;'
,'call symputx("lab"," ");'
,'run;'));
label=symget('lab');
run;
Here's one piece (out of 4) of error message:
ERROR 22-322: Syntax error, expecting one of the following: a name, a quoted string,
a numeric constant, a datetime constant, a missing value, INPUT, PUT.
ERROR 22-322: Syntax error, expecting one of the following: +, =.
ERROR 202-322: The option or parameter is not recognized and will be ignored.
ERROR: Undeclared array referenced: if110in.
WARNING: Apparent symbolic reference HIST_LIST not resolved.
ERROR: Variable if110in has not been declared as an array.
Can someone help me troubleshoot this?
Thanks!
Looks like you don't know the correct syntax of the IN operator or the IF statement:
if hist in (100-105,108) then label = IN2.label
if hist in (110) then label = IN2.label
This is correct syntax but is nonsense code because LABEL is being assigned to itself:
if hist in ('100','101','102','103','104','105','108') then label = label;
if hist in ('110') then label = label;
Note HIST is a character variable so when comparing it to a constant like 100, it needs to be enclosed in quotes.
I have no idea why you are trying to code this via DOSUBL. Code a solution that works without DOSUBL before burying it in that function.
I know my attempt might look convoluted and unnecessary. It's because the datasets I have are more complicated than what I have shown. The sample data are simplified to show just enough details so people understand what type of data I am working with. For IN2, I wrote "100-105,108" and I actually have something like "9500-9580,9599,9610-9650,9652,9655" and on. And I have over 100 rows and 3 columns of data that are of a similar format.
Thank you for the input, I get all your point but it's not the solution I am looking for. If you have a smarter way than my attempt, I'd love to hear it.
Spread your IN2 dataset out vertically, so you can use it as a CNTLIN dataset for PROC FORMAT. Then apply that format, either to the variable or in a PUT function.
data input;
input hist $;
cards;
100
105
110
110
;
data in2;
length hist $15.;
input hist label;
cards;
100-105,108 1
110 3
;
data cntlin;
set in2 end=done;
fmtname = "$check";
do i1 = 1 to countw(hist,",");
substring = scan(hist,i1,",");
if indexc(substring,"-")
then do i2 = input(scan(substring,1,"-"),3.) to input(scan(substring,2,"-"),3.);
start = put(i2,3.);
output;
end;
else do;
start = substring;
output;
end;
end;
if done
then do;
label = .;
start = "OTHER";
hlo = "O";
output;
end;
keep start label fmtname hlo;
run;
proc format cntlin=cntlin;
run;
data want;
set input;
label = put(hist,$check.);
run;
Similarly, you can use the vertical dataset for a JOIN/MERGE or as input to a hash object. As usual, see Maxim 19: Long Beats Wide.
Thank you for the input, Kurt. I did try to expand my IN2 dataset. But my actual IN2 dataset is actually a lot more complicated than the sample I have provided. After expanding all the ranges as you did, my IN2 dataset has over 400 million rows which is really hard to work with. And that's why I try to work with what's in IN2 as is.
If you have much wider ranges in your in2, you can use the end values as start and end in the CNTLIN dataset to define ranges. Be careful with ranges with character values, as "1500" will fall in the range "150" to "200". This can't happen with numeric values.
@Kurt_Bremser is definitely pointing you in the right direction regarding the use of ranges in PROC FORMAT's CNTLIN option for creating user-defined formats. Please check out the documentation that I've linked to so you can understand how it works better. You will need to split ranges and single values into separate rows for example.
Thank you both for the responses. Creating a format does work for the purpose of labeling the data when a single column of data is used to determine the label. And unfortunately the data I work with needs 3 columns to determine the corresponding label (Sorry I should have made this clear enough in my question). For example
if Col1=10 & Col2=11 & Col3=10 then label =1;
if Col1=10 & Col2=12 & Col3=10 then label =2;
So applying a format as you two suggested does not work that well for this situation. But I agree that, when dealing with a single column, a format is a more straightforward and efficient approach.
You can still use a PROC FORMAT lookup with multiple columns by concatenating:
label = put(strip(col1) !! strip(col2) !! strip(col3), $Lookup.);
Do col1, col2 and col3 all have ranges or just col1? How many unique values are there for col1, col2, col3 combined?
I would help if you posted more representative data for your three columns so we can get a clearer picture.
My bad.
This is some actual data I have:
data input;
length ICD_O_3_Site $20.;
length ICD_O_3_Histology $20.;
length ICD_O_3_Behavior $10.;
input ICD_O_3_Site $ ICD_O_3_Histology $ ICD_O_3_Behavior $;
cards;
C000 9835 3
C420 9811 3
C001 9831 3
C300 9383 0
C723 9380 1
;
run;
The IN2 dataset is attached in Excel format, you may directly import it. I wanted to get the "Extended Classification" for the cases listed above.
In case you wonder, ICD_O_3_Site
all starts with a "C" and it's safe to be removed when matching.
Can you supply the IN2 as DATALINES? Opening unknown spreadsheets is a security risk.
OK here is a small portion:
data IN2;
length ICD_O_3_Site $20.;
length ICD_O_3_Histology $100.;
length ICD_O_3_Behavior $10.;
input ICD_O_3_Site $ ICD_O_3_Histology $ ICD_O_3_Behavior $ Extended_Classification $;
cards;
C000-C809 9835-9836 3 001
C420-C421,C424 9811-9818,9837 3 001
C001 9831 3 003
C000-C809 9383,9391-9394 0,1,3 017
C723 9380,9384,9400-9411,9420-9424 0,1,3 019
;
run;
I highly recommend taking the time to build a categorized list of valid codes and use that.
The abbreviated strings you currently using as the IN2 dataset could be saved and used as a DESCRIPTIONS you could print for HUMANS to understand what the categories are. That is what the XSLX file you posted is intended for.
That said to split that data into something you could program from turn each range into LOW/HIGH pair.
Then you task of categorization because a simple join or you could create a FORMAT.
data IN2;
length row 8 category $32 item 8 low high $5 ;
row+1;
do category='ICD_O_3_Site','ICD_O_3_Histology','ICD_O_3_Behavior' ;
input list :$200. @;
do item=1 to countw(list,',');
range=scan(list,item,',');
low=scan(range,1,'-');
high=scan(range,-1,'-');
output;
end;
end;
drop list range;
cards;
C000-C809 9835-9836 3 001
C420-C421,C424 9811-9818,9837 3 001
C001 9831 3 003
C000-C809 9383,9391-9394 0,1,3 017
C723 9380,9384,9400-9411,9420-9424 0,1,3 019
;
Results
Obs row category item low high 1 1 ICD_O_3_Site 1 C000 C809 2 1 ICD_O_3_Histology 1 9835 9836 3 1 ICD_O_3_Behavior 1 3 3 4 2 ICD_O_3_Site 1 C420 C421 5 2 ICD_O_3_Site 2 C424 C424 6 2 ICD_O_3_Histology 1 9811 9818 7 2 ICD_O_3_Histology 2 9837 9837 8 2 ICD_O_3_Behavior 1 3 3 9 3 ICD_O_3_Site 1 C001 C001 10 3 ICD_O_3_Histology 1 9831 9831 11 3 ICD_O_3_Behavior 1 3 3 12 4 ICD_O_3_Site 1 C000 C809 13 4 ICD_O_3_Histology 1 9383 9383 14 4 ICD_O_3_Histology 2 9391 9394 15 4 ICD_O_3_Behavior 1 0 0 16 4 ICD_O_3_Behavior 2 1 1 17 4 ICD_O_3_Behavior 3 3 3 18 5 ICD_O_3_Site 1 C723 C723 19 5 ICD_O_3_Histology 1 9380 9380 20 5 ICD_O_3_Histology 2 9384 9384 21 5 ICD_O_3_Histology 3 9400 9411 22 5 ICD_O_3_Histology 4 9420 9424 23 5 ICD_O_3_Behavior 1 0 0 24 5 ICD_O_3_Behavior 2 1 1 25 5 ICD_O_3_Behavior 3 3 3
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.