Hello experts,
Suppose I have a table with the names of a few high schools and the average SAT scores of the students from each of these schools:
data test;
input school $ avg_sat;
datalines;
a1 1513
a2 1320
a3 1476
a4 1280
a5 1520
b1 1378
b2 1410
;
run;
Now, suppose I want to subset to only include only a1 - a5, I would write something like:
data want;
set test;
where school in ('a1', 'a2', 'a3', 'a4', 'a5');
run;
This is totally OK in this scenario since I only have 5 elements within the in operator (a1, a2, a3, a4, a5). Nonetheless, this can quickly become impractical if I am doing this for a1 - a100 instead, for example, for which I will have to type 100 elements within the in operator.
I know that if the variable school is a numeric variable instead, I can use something like:
where school in (1:5);
but since it is a character variable in this case, I cannot use the ':' sign to indicate a range. I know I can still use a substr() function to separate the numbers from the school variable and go from there, but i am just wondering if there's a shortcut that allows something like:
where school in ('a1' : 'a5');
Thanks,
If you have a pattern with regard to the values you want to include in your search - for example, if they all start with a specific prefix or form a prefixed numbered list - you already have your answers.
However, I can't help but point out that even if you have a pattern, let alone if the value list is arbitrary, using the IN operator in this manner still amounts to hard coding. Such practice is okay for a one-time ad hoc program, but it's surely not a good thing if you (and especially someone else) should revisit the program in the future, if it's part of production, and/or is under change control.
A much saner and more scalable way of program organization is to store the search-for values in an editable control file outside the change control encumbrance. It's specific format is of secondary importance, as long as your program can read it properly to input the values into the SAS system (though since it's SAS, storing a search list in a SAS data set rather than, say, in a flat file seems more than reasonable). This way, if you need to augment or update the list of search values, you only have to edit the control file and never touch the program itself.
Once you've got the values on the file, there're a myriad of ways to let SAS tell what they are in order to search for them in the WHERE clause, such as:
For example, in your case, it could look like:
data values ;
input value $3. ;
cards ;
a1
a2
a3
a4
a5
run ;
proc sql noprint ;
select quote (strip (value)) into :inlist separated by " " from values ;
quit ;
data want ;
set test ;
where school in (&inlist) ;
run ;
You'd store the file VALUES on the side (outside change control) and simply edit it as you wish if you needed to add, delete, or change any search-for values in the future. And if it's a list shared across a team, department, etc., other programmers would be able to simply read it instead of copying a hard coded list into their programs from yours.
Kind regards
Paul D.
Interesting question. How about a work around with a colon modifier ( : )
data test;
input school $ avg_sat;
datalines;
a1 1513
a2 1320
a3 1476
a4 1280
a5 1520
b1 1378
b2 1410
;
run;
data want;
set test;
where school in :('a');
run;
or Something fancy
%let n=5;
data want;
set test;
array a(&n)$ ;
retain a;
if _n_=1 then do _n_=1 to dim(a);
a(_n_)=vname(a(_n_));
end;
if school in a;
drop a:;
run;
Can you change N to 80 ?
%let n=80;
and execute the previous array method?
How about
data want;
set test;
where input(compress(school,,'kd'),3.) le 80;
run;
If you know all of the names involved then create a numeric code and use that if you really need the list type of description.
If your "list" is supposed to represent a group, such as members of a school district, then you may be better off with an additional variable.
You may also be implying that you want only the A, only the B etc.
That could be
where school=: 'a';
I have tended to create separate grouping variables and then done things with By group processing on the group and/or the group and school , region, county, school district, school type actually. With a whole lot more things to summarize than a single value like SAT.
If you have a pattern with regard to the values you want to include in your search - for example, if they all start with a specific prefix or form a prefixed numbered list - you already have your answers.
However, I can't help but point out that even if you have a pattern, let alone if the value list is arbitrary, using the IN operator in this manner still amounts to hard coding. Such practice is okay for a one-time ad hoc program, but it's surely not a good thing if you (and especially someone else) should revisit the program in the future, if it's part of production, and/or is under change control.
A much saner and more scalable way of program organization is to store the search-for values in an editable control file outside the change control encumbrance. It's specific format is of secondary importance, as long as your program can read it properly to input the values into the SAS system (though since it's SAS, storing a search list in a SAS data set rather than, say, in a flat file seems more than reasonable). This way, if you need to augment or update the list of search values, you only have to edit the control file and never touch the program itself.
Once you've got the values on the file, there're a myriad of ways to let SAS tell what they are in order to search for them in the WHERE clause, such as:
For example, in your case, it could look like:
data values ;
input value $3. ;
cards ;
a1
a2
a3
a4
a5
run ;
proc sql noprint ;
select quote (strip (value)) into :inlist separated by " " from values ;
quit ;
data want ;
set test ;
where school in (&inlist) ;
run ;
You'd store the file VALUES on the side (outside change control) and simply edit it as you wish if you needed to add, delete, or change any search-for values in the future. And if it's a list shared across a team, department, etc., other programmers would be able to simply read it instead of copying a hard coded list into their programs from yours.
Kind regards
Paul D.
Guru @hashman Not that this is important. I just recalled what you said in Proc SQL: No LOG Note for Missing values? thread "I'd rather say it's a system of tools, each with its own language sharing certain main syntactic features with the others. "
Pretty impeccable statement.
While it irks me as it confuses me trying to familiarizing certain functionalities, it's great to note the above being so real as we can see that in proc iml, the char continuous sequence works 🙂
proc iml;
want='a1':'a10';
print want;
quit;
want a1 a2 a3 a4 a5 a6 a7 a8 a9 a10
A good plan for next time: use different data values: a001, a002, a003, etc. Then subsetting becomes much easier:
where ('a001' <= school <= 'a080');
@Astounding wrote:
A good plan for next time: use different data values: a001, a002, a003, etc. Then subsetting becomes much easier:
where ('a001' <= school <= 'a080');
If the OP school data is anything like the school data I have worked with the "next time" first step is identifying which school identifiers changed and provide a mapping from last year to this year, or similar. But perhaps he doesn't have an Education department that intermixes any or 3 different "identification" schemes.
@aaronh wrote:
Thanks a lot!
where('a1' <= school <= 'a5') also worked just fine.
Danger Will Robinson: If you have school named 'a10' or 'a20' or 'a3456' these will be included in the range.
data example; input school $; if ('a1' le school le 'a5') then put School= 'is in the range specified'; datalines; a1 a2 a3 a4 a5 a6 a10 a300 ;
Read the log.
Character values are compared character by character with the range values. As soon as one character in the same position is out of range it is not inside the range. So since some of these examples are longer than the range values the comparison stops at two characters.
Perhaps your actual data does not go to 'a10'. But be very wary of any character comparison involving > or < whether in combination of an equal or not.
I want to thank everyone for your input.
Apparently my question was very open-ended, as many of you have provided some solution that'd solve my question. I personally like the 'control file' idea the best, as it seems to be more flexible and generalizable, in addition to being helpful in project management.
Thanks again!
Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.
Register today!Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Select SAS Training centers are offering in-person courses. View upcoming courses for: