Hi,
I need help with subsetting a data using ARRAY and WHERE function
DATA Inpatient_2013_new; SET Inpatient_2013; ARRAY ICD {12} ICD_DGNS_CD1 ICD_DGNS_CD2 ICD_DGNS_CD3 ICD_DGNS_CD4 ICD_DGNS_CD5 ICD_DGNS_CD6 ICD_DGNS_CD7 ICD_DGNS_CD8 ICD_DGNS_CD9 ICD_DGNS_CD10 ICD_DGNS_CD11 ICD_DGNS_CD12; DO i = 1 to 12; WHERE ICD {12} IN ("45341", "45340", "45342", "45350", "45351", "45352", "45181", "45111", "45119", "41511", "41512", "41513", "41519"); RUN;
In the above code, I'm trying to use array and where function to subset a huge data set. I want to look at 25 variables out of 250 for some values of interest and subset those matching value rows with all 250 variables in a new data set. When I run this code I'm getting the following error
ERROR: Syntax error while parsing WHERE clause. ERROR 22-322: Syntax error, expecting one of the following: ;, !, !!, &, (, *, **, +, -, /, <, <=, <>, =, >, >=, ?, AND, BETWEEN, CONTAINS, EQ, GE, GT, IN, IS, LE, LIKE, LT, NE, NOT, NOTIN, OR, ^, ^=, |, ||, ~, ~=. ERROR 200-322: The symbol is not recognized and will be ignored. 429 RUN; NOTE: The SAS System stopped processing this step because of errors. WARNING: The data set WORK.INPATIENT_2013_NEW may be incomplete. When this step was stopped there were 0 observations and 257 variables. WARNING: Data set WORK.INPATIENT_2013_NEW was not replaced because this step was stopped. NOTE: DATA statement used (Total process time): real time 0.04 seconds cpu time 0.06 seconds
Can some one help me with this problem?
Thanks!
1. Do loop doesn't have matching END
2. Incorrect usage of WHERE - you'll need to create a variable and then drop the records you're not interested in.
3. When looping through array variables, you usually want to use the counter as your array index - not 12 in this case.
See the code below.
You may be able to simplify this using WHICHC.
DATA Inpatient_2013_new;
SET Inpatient_2013;
ARRAY ICD {12} ICD_DGNS_CD1 ICD_DGNS_CD2 ICD_DGNS_CD3 ICD_DGNS_CD4 ICD_DGNS_CD5 ICD_DGNS_CD6 ICD_DGNS_CD7 ICD_DGNS_CD8 ICD_DGNS_CD9 ICD_DGNS_CD10 ICD_DGNS_CD11 ICD_DGNS_CD12;
keep_record = 0;
DO i = 1 to 12;
if ICD (i) IN ("45341", "45340", "45342", "45350", "45351", "45352", "45181", "45111", "45119", "41511", "41512", "41513", "41519") then keep_record=1;
END;
if keep_record=1;
RUN;
Hello Reeza,
I tried but I'm getting errors
DATA Inpatient_2013_new; SET Inpatient_2013; ARRAY ICD {12} ICD_DGNS_CD1 $ ICD_DGNS_CD2 $ ICD_DGNS_CD3 $ ICD_DGNS_CD4 $ ICD_DGNS_CD5 $ ICD_DGNS_CD6 $ ICD_DGNS_CD7 $ ICD_DGNS_CD8 $ ICD_DGNS_CD9 $ ICD_DGNS_CD10 $ ICD_DGNS_CD11 $ ICD_DGNS_CD12 $; Keep_record = 0; DO i = 1 to 12; IF ICD {i} in ("45341", "45340", "45342", "45350", "45351", "45352", "45181", "45111", "45119", "41511", "41512", "41513", "41519"); END; IF keep_record = 1; RUN;
I'm getting following error!!
500 DATA Inpatient_2013_new; 501 SET Inpatient_2013; 502 ARRAY ICD {12} ICD_DGNS_CD1 $ ICD_DGNS_CD2 $ ICD_DGNS_CD3 $ ICD_DGNS_CD4 $ ICD_DGNS_CD5 $ - - - - - - - - - - - - 22 22 22 22 22 22 22 22 22 22 22 22 200 200 200 200 200 200 200 200 200 200 200 200 502! ICD_DGNS_CD6 $ ICD_DGNS_CD7 $ ICD_DGNS_CD8 $ ICD_DGNS_CD9 $ ICD_DGNS_CD10 $ ICD_DGNS_CD11 $ 502! ICD_DGNS_CD12 $; ERROR 22-322: Syntax error, expecting one of the following: a name, (, -, :, ;, _ALL_, _CHARACTER_, _CHAR_, _NUMERIC_. ERROR 200-322: The symbol is not recognized and will be ignored. 503 Keep_record = 0; 504 DO i = 1 to 12; 505 IF ICD {i} in ("45341", "45340", "45342", "45350", "45351", "45352", "45181", 505! "45111", "45119", "41511", "41512", "41513", "41519"); 506 END; 507 IF keep_record = 1; 508 RUN; NOTE: The SAS System stopped processing this step because of errors. WARNING: The data set WORK.INPATIENT_2013_NEW may be incomplete. When this step was stopped there were 0 observations and 258 variables. WARNING: Data set WORK.INPATIENT_2013_NEW was not replaced because this step was stopped. NOTE: DATA statement used (Total process time): real time 0.08 seconds cpu time 0.09 seconds
You cannot put those dollar signs into the middle of your ARRAY statement.
Also save yourself some typing by using a variable list. Also you don't need to tell SAS how variables are in an array when you are listing the variables. SAS can count them for itself.
ARRAY ICD ICD_DGNS_CD1-ICD_DGNS_CD12;
1. I had to write manually because ICD_DGNS_CD1 to 12 aren't in a order in the dataset, when I say
ICD_DGNS_CD1-ICD_DGNS_CD12
It includes other variables in between this range.
2. I kept $ because those variables are charachters, now I tried without $ sign, code ran successfuly but 0 observations are in the output
DATA Inpatient_2013_new; SET Inpatient_2013; ARRAY ICD {12} ICD_DGNS_CD1 ICD_DGNS_CD2 ICD_DGNS_CD3 ICD_DGNS_CD4 ICD_DGNS_CD5 ICD_DGNS_CD6 ICD_DGNS_CD7 ICD_DGNS_CD8 ICD_DGNS_CD9 ICD_DGNS_CD10 ICD_DGNS_CD11 ICD_DGNS_CD12; Keep_record = 0; DO i = 1 to 12; IF ICD {i} in ("45341", "45340", "45342", "45350", "45351", "45352", "45181", "45111", "45119", "41511", "41512", "41513", "41519"); END; IF keep_record = 1; RUN;
3. Can you please explain me the importance of Keep_record in the code?
509 DATA Inpatient_2013_new; 510 SET Inpatient_2013; 511 ARRAY ICD {12} ICD_DGNS_CD1 ICD_DGNS_CD2 ICD_DGNS_CD3 ICD_DGNS_CD4 ICD_DGNS_CD5 511! ICD_DGNS_CD6 ICD_DGNS_CD7 ICD_DGNS_CD8 ICD_DGNS_CD9 ICD_DGNS_CD10 ICD_DGNS_CD11 ICD_DGNS_CD12 511! ; 512 Keep_record = 0; 513 DO i = 1 to 12; 514 IF ICD {i} in ("45341", "45340", "45342", "45350", "45351", "45352", "45181", 514! "45111", "45119", "41511", "41512", "41513", "41519"); 515 END; 516 IF keep_record = 1; 517 RUN; NOTE: There were 582422 observations read from the data set WORK.INPATIENT_2013. NOTE: The data set WORK.INPATIENT_2013_NEW has 0 observations and 258 variables. NOTE: DATA statement used (Total process time): real time 1.15 seconds cpu time 1.09 seconds
Thank you!!
A variable ilist of the format A1-A5 is asking for the sequence of variables formed by changing the numeric suffix of the name. So the position of the variables in tha current data vector , or even their existance, does not matter. To make a variable list based on position you need to use two hyphens lke NAME -- HEIGHT
When you are referencing existing variables in an ARRAY statement you don't need to tell SAS wheher they are numeric or character, SAS already knows how the variables are defined and it is too late to change the variables definitions.
But if you did want to use an ARRAY statement to define new character variables then can indication that by placing the type/length declaration BEFORE the variable list. Such as
array newvar (12) $5 ;
Which would define 12 variables named NEWVAR1-NEWVAR12 where each one is character of length 5.
I assume you are trying to check if any of the variables match any of the values in your list? If so then set the condition FALSE be starting to look and set it to TRUE when you find one. Then you can decide whether you want to keep based on the condition flag variable.
data want;
set have;
array codes icd_dgns_cd1-icd_dgns_cd12;
has_condition = 0;
do i = 1 to dim(codes) ;
if codes(i) in ("45341" "45340" "45342" "45350" "45351" "45352"
"45181" "45111" "45119" "41511" "41512" "41513" "41519")
then has_condition=1;
end;
if has_condition then output ;
run;
Hello,
I've 257 variables and millions of rows, my interest is to extract the rows that match the value of my interest in 12 variables out of 257.
But I want the extracted rows to have all 250 variables data. Is there any function that would do this? I tried using ARRAY and DO loop together but I'm getting frustrated with the errors. Can someone help me out? Thanks
DATA Inpatient_2013_new;
SET Inpatient_2013; *Inpatient_2013 contains 257 variables and millions of rows;
ARRAY ICD {12} ICD_DGNS_CD1 ICD_DGNS_CD2 ICD_DGNS_CD3 ICD_DGNS_CD4 ICD_DGNS_CD5 ICD_DGNS_CD6 ICD_DGNS_CD7 ICD_DGNS_CD8 ICD_DGNS_CD9 ICD_DGNS_CD10 ICD_DGNS_CD11 ICD_DGNS_CD12;
DO i = 1 to 12;
IF ICD {i} in ("45341", "45340", "45342", "45350", "45351", "45352", "45181", "45111", "45119", "41511", "41512", "41513", "41519"); *If any of the variables have the following values, I want the entire row with all 250 varible info;
END;
RUN;
I'm getting the following error
563 DATA Inpatient_2013_new; 564 SET Inpatient_2013; *Inpatient_2013 contains 250 variables and millions of rows; 565 ARRAY ICD {12} ICD_DGNS_CD1 ICD_DGNS_CD2 ICD_DGNS_CD3 ICD_DGNS_CD4 ICD_DGNS_CD5 565! ICD_DGNS_CD6 ICD_DGNS_CD7 ICD_DGNS_CD8 ICD_DGNS_CD9 ICD_DGNS_CD10 ICD_DGNS_CD11 ICD_DGNS_CD12 565! ; 566 DO i = 1 to 12; 567 IF ICD {i} in ("45341", "45340", "45342", "45350", "45351", "45352", "45181", 567! "45111", "45119", "41511", "41512", "41513", "41519"); *If any of the variables have the 567! following values, I want the entire row with all 250 varible info; 568 END; 569 RUN; NOTE: There were 582422 observations read from the data set WORK.INPATIENT_2013. NOTE: The data set WORK.INPATIENT_2013_NEW has 0 observations and 257 variables. NOTE: DATA statement used (Total process time): real time 1.08 seconds cpu time 1.07 seconds
DATA Inpatient_2013_new; SET Inpatient_2013; *Inpatient_2013 contains 257 variables and millions of rows; ARRAY ICD {12} ICD_DGNS_CD1 ICD_DGNS_CD2 ICD_DGNS_CD3 ICD_DGNS_CD4 ICD_DGNS_CD5 ICD_DGNS_CD6 ICD_DGNS_CD7 ICD_DGNS_CD8 ICD_DGNS_CD9 ICD_DGNS_CD10 ICD_DGNS_CD11 ICD_DGNS_CD12; DO i = 1 to 12; IF ICD {i} in ("45341", "45340", "45342", "45350", "45351", "45352", "45181", "45111", "45119", "41511", "41512", "41513", "41519") then do; output;leave;end; END; RUN;
@Ksharp That could potentially output the same row multiple times.
Answer is at the other version of this question.
https://communities.sas.com/t5/Base-SAS-Programming/Need-help-with-DO-loop/m-p/299440#M63154
@Tom I think the LEAVE prevents that?
@Sujithpeta Please don't post the same questions multiple times.
@Ksharp Didn't see the LEAVE command.
I ran a sample code like this
DATA Inpatient_2011_new2;
SET Inpatient_2011;
KEEP ICD_DGNS_CD1 -- ICD_DGNS_CD12;
RUN;
Instead of giving 12 variables it included other variables in between those 12 variables
676 DATA Inpatient_2011_new2; 677 SET Inpatient_2011; 678 KEEP ICD_DGNS_CD1 -- ICD_DGNS_CD12; 679 RUN; NOTE: There were 5000 observations read from the data set WORK.INPATIENT_2011. NOTE: The data set WORK.INPATIENT_2011_NEW2 has 5000 observations and 34 variables. NOTE: DATA statement used (Total process time): real time 0.02 seconds cpu time 0.01 second
--Sujith
@Sujithpeta That is correct behaviour. 2 dashes include all variables between the two. A single dash includes only variables with a consecutive suffix.
Thanks to all of you guys for helping me out
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.