e dSAS folks-
I’ve been picking at this awhile, but I'm sure there is a better way.
I have a text field that is composed of concatenated numeric codes that indicate the quality of data. The individual codes are separated by asterisks, and there is no order. I am trying to pick out the highest code, and keep or eliminate rows based on it (<220).
Any insight you can give me would be much appreciated.
WendyT
DATA FINAL ; SET HAVE ;
HICODE = 1 ;
NCODES=COUNTW(CAT_CODES,'*')+1 ;
DO I = 1 TO NCODES ;
NEWCODE = (SCAN(CAT_CODES, I, '*')) ;
HICODE = MAX(NEWCODE, HICODE) ;
END ;
IF HICODE GE 220 THEN DELETE ;
DROP I NCODES NEWCODE HICODE CAT_CODES ;
RUN ;
I have something like this
LOC | DATE | VAL | CAT_CODES |
AAA | 1-Jul-15 | 17.3 | 1*2 |
AAA | 2-Jul-15 | 2.1 | 2*150*7 |
AAA | 3-Jul-15 | 6.8 | |
AAA | 4-Jul-15 | 2.3 | 211 |
BBB | 1-Jul-15 | 0.1 | 255*100 |
BBB | 2-Jul-15 | 3.2 | 2*1*151 |
picking out the highest code
LOC | DATE | VAL | HI_CODE |
AAA | 1-Jul-15 | 17.3 | 2 |
AAA | 2-Jul-15 | 2.1 | 150 |
AAA | 3-Jul-15 | 6.8 | 1 |
AAA | 4-Jul-15 | 2.3 | 211 |
BBB | 1-Jul-15 | 0.1 | 255 |
BBB | 2-Jul-15 | 3.2 | 151 |
final dataset
LOC | DATE | VAL |
AAA | 1-Jul-15 | 17.3 |
AAA | 2-Jul-15 | 2.1 |
AAA | 3-Jul-15 | 6.8 |
AAA | 4-Jul-15 | 2.3 |
BBB | 2-Jul-15 | 3.2 |
Given your "final" dataset you're actually not really interested in the highest value but you want simply to delete all records where you find something higher than 220. If so then below code should work:
data have;
infile datalines truncover dsd dlm=' ';
input LOC $ DATE:anydtdte. VAL CAT_CODES:$20.;
format date date9.;
datalines;
AAA 1-Jul-15 17.3 1*2
AAA 2-Jul-15 2.1 2*150*7
AAA 3-Jul-15 6.8
AAA 4-Jul-15 2.3 211
BBB 1-Jul-15 0.1 255*100
BBB 2-Jul-15 3.2 2*1*151
;
run;
data want(drop=_:);
set have;
do _i=1 by 1;
_code_val=input(scan(CAT_CODES,_i,'*'), best32.);
if _code_val>=220 then delete;
if missing(_code_val) then leave;
end;
run;
I think you may need to go into some more detail about HOW you are using that "highest" code value.
It appears that LOC makes no difference, correct?
Your attempted code of:
IF HICODE GE 200 THEN DELETE ;
doesn't seem to quite match the results you indicate are desired because the AAA 4-Jul-15 record has a code of 211 but is kept.
Ballardw-
You're absolutely right... I tried to make the example simple, and didn't get it right (now edited to be 220). The LOC and DATE variables were stand-ins for a group of identification variables.
The general idea is that the lower the code, the better the data. The dataset I'm working with is pre-summarized, and all codes that apply during the summarization period are shown in the CAT_CODES variable.
Thanks!
WendyT
data want;
set have;
if not missing(cat_code)then do;
num=count(cat_code,'*');
do i=1 to num;
hi_code=max(hi_code,scan(cat_code,i,'*'));
end;
end;
if index(cat_code,'*')=0 then hi_code=cat_code;
if hi_code>200 then delete;
drop i num cat_code hi_code;
run;
Hi Wendy,
There are many ways to skin your cat. I think the excellent answer provided by SLCHEN is going to give you what you need, with one small amendment.
The statement:
do i=1 to num;
should be changed to:
do i=1 to num+1;
Without this minor change, any instances like your first test record, will not necessarily pick out the highest value from the CAT_CODE variable.
e.g CAT_CODE=1*2
The original code will return 1 as the highest value, which is not what you want.
Why is this ? The NUM variable contains the number of <asterisk> characters in CAT_CODE, not the number of codes concatenated together. When NUM=1, you have 2 codes to work through. By using the amended version, you will be able to examine each code in the concatenated CAT_CODE value, irrespective of the order that they appear.
This problem will occur when you have any number of codes greater than 1 in CAT_CODE.
I hope that this helps.
Downunder Dave.
Wellington
Slightly more compact.
data HAVE;
infile cards pad;
input @15 VAL @21 CAT_CODES $20.;
cards;
AAA 1-Jul-15 17.3 1*2
AAA 2-Jul-15 2.1 2*150*7
AAA 3-Jul-15 6.8
AAA 4-Jul-15 2.3 211
BBB 1-Jul-15 0.1 255*100
BBB 2-Jul-15 3.2 2*1*151
run;
data WANT ;
set HAVE ;
do until(missing(CODE));
I = sum(I,1);
CODE = input(scan(CAT_CODES, I, '*'),?? 32.) ;
HICODE = max (0, CODE, HICODE) ;
end ;
if HICODE lt 220;
keep VAL;
run ;
Given your "final" dataset you're actually not really interested in the highest value but you want simply to delete all records where you find something higher than 220. If so then below code should work:
data have;
infile datalines truncover dsd dlm=' ';
input LOC $ DATE:anydtdte. VAL CAT_CODES:$20.;
format date date9.;
datalines;
AAA 1-Jul-15 17.3 1*2
AAA 2-Jul-15 2.1 2*150*7
AAA 3-Jul-15 6.8
AAA 4-Jul-15 2.3 211
BBB 1-Jul-15 0.1 255*100
BBB 2-Jul-15 3.2 2*1*151
;
run;
data want(drop=_:);
set have;
do _i=1 by 1;
_code_val=input(scan(CAT_CODES,_i,'*'), best32.);
if _code_val>=220 then delete;
if missing(_code_val) then leave;
end;
run;
data want;
set have;
cnt = countc(cat_codes,'*');
do i = 1 to cnt+1;
code = input(scan(cat_codes,i),8.);
if code gt 220 then delete;
end;
keep Loc Date Val;
run;
Thanks so much to everyone - I learned something new from each one of your answers.
I awarded Patrick as 'correct' because his solution ran the fastest on my system.
Wendy
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.