Hi, I'm having trouble making this work. The data has a claim ID and an HCC. Each claim ID can have multiple HCCs. I want to keep records where the claim ID only has HCC 11. Here's an example dataset and the non-functioning code:
DATA in_data ;
input edg_Claim_ID $13. hcc 2. ;
datalines ;
1234567890123 11
1234567890123 12
1234567890123 13
3210987654321 11
3210987654321 12
5678901234567 11
;
RUN ;
%macro clm_rev(clm) ;
proc sql noprint ;
select distinct hcc
into :hccs separated by ','
from in_data
where edg_Claim_ID = '&clm.' ;
quit ;
%if &hccs. = 11 %then output ;
%mend clm_rev ;
DATA out_data ;
SET in_data ;
call execute('%nrstr(%clm_rev('||edg_Claim_ID||'))');
RUN ;
The only claim id in out_data would be 5678901234567 since the only hcc affiliated with that claim id is 11.
Here is the log:
1
2 DATA in_data ;
3 input edg_Claim_ID $13. hcc 2. ;
4 datalines ;
NOTE: Compression was disabled for data set WORK.IN_DATA because compression overhead would increase the size of the data set.
NOTE: The data set WORK.IN_DATA has 6 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.06 seconds
cpu time 0.00 seconds
11 ;
12 RUN ;
13
14 %macro clm_rev(clm) ;
15
16 proc sql noprint ;
17 select distinct hcc
18 into :hccs separated by ','
19 from in_data
20 where edg_Claim_ID = '&clm.' ;
21 quit ;
22
23 %if &hccs. = 11 %then output ;
24
25 %mend clm_rev ;
26
27 DATA out_data ;
28 SET in_data ;
29 call execute('%nrstr(%clm_rev('||edg_Claim_ID||'))');
30 RUN ;
NOTE: Compression was disabled for data set WORK.OUT_DATA because compression overhead would increase the size of the data set.
NOTE: There were 6 observations read from the data set WORK.IN_DATA.
NOTE: The data set WORK.OUT_DATA has 6 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.07 seconds
cpu time 0.01 seconds
NOTE: CALL EXECUTE generated line.
1 + %clm_rev(1234567890123)
NOTE: The macro CLM_REV is executing from memory.
16 instructions 432 bytes.
MPRINT(CLM_REV): proc sql noprint ;
MPRINT(CLM_REV): select distinct hcc into :hccs separated by ',' from in_data where edg_Claim_ID = '&clm.' ;
NOTE: No rows were selected.
MPRINT(CLM_REV): quit ;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.04 seconds
cpu time 0.02 seconds
WARNING: Apparent symbolic reference HCCS not resolved.
ERROR: A character operand was found in the %EVAL function or %IF condition where a numeric operand is required. The condition was: &hccs. = 11
ERROR: The macro CLM_REV will stop executing.
2 + %clm_rev(1234567890123)
NOTE: The macro CLM_REV is executing from memory.
16 instructions 432 bytes.
MPRINT(CLM_REV): proc sql noprint ;
MPRINT(CLM_REV): select distinct hcc into :hccs separated by ',' from in_data where edg_Claim_ID = '&clm.' ;
NOTE: No rows were selected.
MPRINT(CLM_REV): quit ;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.04 seconds
cpu time 0.01 seconds
WARNING: Apparent symbolic reference HCCS not resolved.
ERROR: A character operand was found in the %EVAL function or %IF condition where a numeric operand is required. The condition was: &hccs. = 11
ERROR: The macro CLM_REV will stop executing.
3 + %clm_rev(1234567890123)
NOTE: The macro CLM_REV is executing from memory.
16 instructions 432 bytes.
MPRINT(CLM_REV): proc sql noprint ;
MPRINT(CLM_REV): select distinct hcc into :hccs separated by ',' from in_data where edg_Claim_ID = '&clm.' ;
NOTE: No rows were selected.
MPRINT(CLM_REV): quit ;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.04 seconds
cpu time 0.02 seconds
WARNING: Apparent symbolic reference HCCS not resolved.
ERROR: A character operand was found in the %EVAL function or %IF condition where a numeric operand is required. The condition was: &hccs. = 11
ERROR: The macro CLM_REV will stop executing.
4 + %clm_rev(3210987654321)
NOTE: The macro CLM_REV is executing from memory.
16 instructions 432 bytes.
MPRINT(CLM_REV): proc sql noprint ;
MPRINT(CLM_REV): select distinct hcc into :hccs separated by ',' from in_data where edg_Claim_ID = '&clm.' ;
NOTE: No rows were selected.
MPRINT(CLM_REV): quit ;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.04 seconds
cpu time 0.00 seconds
WARNING: Apparent symbolic reference HCCS not resolved.
ERROR: A character operand was found in the %EVAL function or %IF condition where a numeric operand is required. The condition was: &hccs. = 11
ERROR: The macro CLM_REV will stop executing.
5 + %clm_rev(3210987654321)
NOTE: The macro CLM_REV is executing from memory.
16 instructions 432 bytes.
MPRINT(CLM_REV): proc sql noprint ;
MPRINT(CLM_REV): select distinct hcc into :hccs separated by ',' from in_data where edg_Claim_ID = '&clm.' ;
NOTE: No rows were selected.
MPRINT(CLM_REV): quit ;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.04 seconds
cpu time 0.01 seconds
WARNING: Apparent symbolic reference HCCS not resolved.
ERROR: A character operand was found in the %EVAL function or %IF condition where a numeric operand is required. The condition was: &hccs. = 11
ERROR: The macro CLM_REV will stop executing.
6 + %clm_rev(5678901234567)
NOTE: The macro CLM_REV is executing from memory.
16 instructions 432 bytes.
MPRINT(CLM_REV): proc sql noprint ;
MPRINT(CLM_REV): select distinct hcc into :hccs separated by ',' from in_data where edg_Claim_ID = '&clm.' ;
NOTE: No rows were selected.
MPRINT(CLM_REV): quit ;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.04 seconds
cpu time 0.02 seconds
WARNING: Apparent symbolic reference HCCS not resolved.
ERROR: A character operand was found in the %EVAL function or %IF condition where a numeric operand is required. The condition was: &hccs. = 11
ERROR: The macro CLM_REV will stop executing.
Macro variables don't resolve inside single quotes:
where edg_Claim_ID = '&clm.' ; * Current;
where edg_Claim_ID = "&clm." ; * Fixed;
There are definitely a few things to clean up about this program.
First, the program you have supplied doesn't read the data correctly. It takes HCC from columns 14 and 15, and you need it to come from columns 15 and 16. Perhaps it is not the actual program you ran, just a way to illustrate the problem you are encountering.
Second, this line will not work:
where edg_Claim_ID = '&clm.' ;
Macro variable references within single quotes do not get resolved. You would need to replace the single quotes with double quotes.
Third, CALL EXECUTE runs macro language statements right away. This statement runs while the DATA step is executing:
%if &hccs. = 11 %then output ;
At this point, PROC SQL has not run yet. So &HCCS has not been created.
Finally, macro language is not capable of executing an OUTPUT statement. It is capable of generating the word OUTPUT, but that would be a line by itself following PROC SQL. So to make the results useful, you might want to clarify what you would like your code to accomplish.
So you do have multiple issues to attend to. Get at least a few of them fixed, and we can revisit what remains.
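For the first issue, a minimal fix (a sketch, assuming a single blank always separates the two fields) is to advance the column pointer before reading HCC:
input edg_Claim_ID $13. +1 hcc 2. ; /* +1 skips the blank in column 14, so hcc is read from columns 15-16 */
List input with the colon modifier (input edg_Claim_ID : $13. hcc ;) would also work.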
proc sql;
create table new as
select distinct edg_Claim_ID
from in_data
where edg_Claim_ID in (select edg_Claim_ID from in_data where hcc=11);
quit;
Hi
I think you have chosen an unnecessarily complicated approach with macro + CALL EXECUTE. It is difficult to get to work - as demonstrated - and also very inefficient, because the whole input data set is read in the macro for every input observation.
So the suggestion by @xxformat_com is a better solution, except for the little problem that it selects all observations with hcc = 11, not those with hcc = 11 only. Try this:
proc sql;
create table out_data as
select edg_Claim_ID
from in_data
where edg_Claim_ID not in (
/* claims that have at least one hcc other than 11 */
select edg_Claim_ID
from in_data
where hcc ne 11
);
quit;
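If you want to avoid the subquery, an equivalent GROUP BY/HAVING version should give the same result (an untested sketch; the table name only_11 is just an example):
proc sql;
create table only_11 as
select edg_Claim_ID
from in_data
group by edg_Claim_ID
having min(hcc) = 11 and max(hcc) = 11; /* every hcc in the group is 11 */
quit;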
Seeing your answer, Erik, now I realise that I misunderstood the question.
I was wondering why I had to add an extra entry in the test data to test my code. Now it's clear 😉 Cheers
Thanks for everyone's input.
Given that your dataset is already sorted by edg_claim_id, a single data step does it:
data want;
merge
in_data
in_data (
in=exclude
keep=edg_claim_id hcc
rename=(hcc=_hcc)
where=(_hcc ne 11) /* the second copy contributes only rows with an hcc other than 11 */
)
;
by edg_claim_id;
if not exclude; /* exclude stays 1 for the whole BY group when the claim has any non-11 hcc */
drop _hcc;
run;
Untested, posted from my tablet.
You will find that, with larger data, the data step outperforms the SQL sub-select.
Edit: tested now, removed quotes around 11.
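If in_data were not sorted already, a PROC SORT in front would be needed for the BY-merge (a minimal sketch):
proc sort data=in_data;
by edg_claim_id;
run;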
You are, as always, right (an annoying habit of yours - it should be included as Maxim 53 😊): the data step outperforms SQL. I made a test to see how much faster it is, and the results showed that the data step is about three times as fast.
In principle one should always seek the most efficient solution, and in an ideal world your elegant solution is the right choice; besides, it is so much more fun to code data steps.
But there is a trade-off between efficiency and maintainability, and the world isn't ideal, not mine at least. It is infested with point-and-click tools like SAS DI Studio, where data steps aren't supported, and while most people (at least those indulging in coding) understand SQL, very few really know the ins and outs of the data step. We have a SAS Data Warehouse with about 20 people using DI Studio to build production jobs, and I think that very few - if any - would be able to read your code and understand what it was doing, not to mention getting the idea in the first place and figuring out how to do it in the point-and-click tool.
So if the data set were really large and the production job a performance issue, I would try to optimize it (and hang on to it afterwards), but otherwise I would leave it as coded (which would be SQL) for ease of maintenance. I would also recommend SQL if any of our developers cared to ask me, because I would be able to explain it without a 3-day course.
respectfully
Erik
You raise some interesting points here.
One is that point-and-click tools can only go a certain distance, and there are always things that require you to think and design a solution that is "out of the box".
Being a "classic" programmer who started with a procedural, imperative language (PASCAL) and concepts like structural programming, this "designing of a solution where there was none" always lies at the heart of my thinking. With SAS, the tool for this is the DATA step, and anybody who wants to step up from being a "SAS illiterate" who communicates with the system solely through ideograms (think: paintings on cave walls) has to sooner or later learn it. And as you say, they will discover the pure fun of wrapping your head around the concepts built into the data step language.
On top of that, certain issues involving order of data and sequential processing (think of FIRST./LAST., LAG, RETAIN) are very hard to do in SQL. So I find that data steps like the one I posted here are (in the end) much easier to understand and maintain than the SQL solution.
Tools like DI Studio would do well to enable a SAS developer to seamlessly integrate the data step into the concept of metadata-driven development (user-written code nodes with well-documented data trails that do not prevent solid impact analysis).
You say you have 20 people working on your data warehouse. How many production jobs do you maintain?
The answer to your question is 4,795. And now something about my non-ideal world, to elaborate on my argument in the previous post.
I work in a large municipal administration (about 20,000 employees) with a Mayor’s office and 5 main departments. There is a central IT department where administration of the data warehouse is placed, but all development is done independently in the 5 main departments, and the 20 developers are not IT professionals but educated in economics or public administration of some kind, and they are not all assigned to ETL work as a full-time task. They have analytical skills but a poor understanding of data modelling and algorithms, which is also considered unnecessary, because ETL work in DI Studio is seen as something like assembling cupboards from IKEA.
Needless to say, the result is not a proper data warehouse built on Kimball’s rules. The developers think in spreadsheets, and they build data marts as denormalized tables, which results in a huge number of tables developed for specific uses. And with all those tables come a lot of jobs to produce them.
The number of production jobs today is 4,795, producing 12,019 SAS tables, and more jobs are coming, with a daily average of at least 10 new or changed jobs put into production. The number of new jobs in development today is 297, not counting newer versions of existing jobs, so there is a lot in the pipeline.
The production jobs are all developed in DI Studio and have a total of 23,070 transformations, of which 1,242 contain user-written code. We use LSF, and the production jobs are contained in 661 flows, each writing to one libname (a rule). The number of external data sources (not tables, but ODBC connections) is 160, on top of which come FTP connections, HTTP service calls, local files, etc.
We are 5 people trying to keep a certain order in the chaos and get it all running in the daily batch, besides installing client programs, helping with user-written code and transformations, etc. And I am the resident dinosaur with 40 years of SAS experience, beginning with SAS V79. It works amazingly well, all things considered, but that is mainly because we have automated a lot of manual processes over the years. The two most useful are:
1) A lot of rules for building jobs and an automated process to analyze SPK packages submitted for promotion, so the requests are returned to sender with error messages and not forwarded to us before everything is in order.
2) An automated scheduler that decides when to run a flow or omit it, based on the dependencies between flows extracted from metadata plus the results of previous flows in the current batch, so we never have to bother with scheduling rules or dependencies in LSF; all flows are defined as “run manually”.
Ouch. That explains a lot.
Thank $DEITY that our data warehouse is run from the IT department, and what you call "developers" are "end users" here who do their work through Enterprise Guide.
When one of their regularly run programs turns static (not much changes anymore) and business-critical, we take it over, optimize it, and have it run by the data center's scheduler.
And if your dataset is not sorted, use a hash object:
data want2;
set in_data;
if _n_ = 1
then do;
/* load the claim ids that have at least one hcc other than 11 */
declare hash ex (dataset:"in_data (where=(hcc ne 11))");
ex.definekey("edg_claim_id");
ex.definedone();
end;
/* check() returns 0 when the key is found; keep only claims not in the exclusion list */
if ex.check() ne 0;
run;