Solved: Re: help with picking out numeric codes from a text field

WendyT · Posted 07-22-2015 03:35 PM

e dSAS folks-

I’ve been picking at this awhile, but I'm sure there is a better way.

I have a text field that is composed of concatenated numeric codes that indicate the quality of data. The individual codes are separated by asterisks, and there is no order. I am trying to pick out the highest code, and keep or eliminate rows based on it (<220).

Any insight you can give me would be much appreciated.

WendyT

DATA FINAL ; SET HAVE ;

HICODE = 1 ;

NCODES=COUNTW(CAT_CODES,'*')+1 ;

DO I = 1 TO NCODES ;

NEWCODE = (SCAN(CAT_CODES, I, '*')) ;

HICODE = MAX(NEWCODE, HICODE) ;

END ;

IF HICODE GE 220 THEN DELETE ;

DROP I NCODES NEWCODE HICODE CAT_CODES ;

RUN ;

I have something like this

LOC	DATE	VAL	CAT_CODES
AAA	1-Jul-15	17.3	1*2
AAA	2-Jul-15	2.1	21507
AAA	3-Jul-15	6.8
AAA	4-Jul-15	2.3	211
BBB	1-Jul-15	0.1	255*100
BBB	2-Jul-15	3.2	21151

picking out the highest code

LOC	DATE	VAL	HI_CODE
AAA	1-Jul-15	17.3	2
AAA	2-Jul-15	2.1	150
AAA	3-Jul-15	6.8	1
AAA	4-Jul-15	2.3	211
BBB	1-Jul-15	0.1	255
BBB	2-Jul-15	3.2	151

final dataset

LOC	DATE	VAL
AAA	1-Jul-15	17.3
AAA	2-Jul-15	2.1
AAA	3-Jul-15	6.8
AAA	4-Jul-15	2.3
BBB	2-Jul-15	3.2

Patrick · Posted 07-23-2015 07:11 AM

Given your "final" dataset you're actually not really interested in the highest value but you want simply to delete all records where you find something higher than 220. If so then below code should work:

data have;

infile datalines truncover dsd dlm=' ';

input LOC $ DATE:anydtdte. VAL CAT_CODES:$20.;

format date date9.;

datalines;

AAA 1-Jul-15 17.3 1*2

AAA 2-Jul-15 2.1 2*150*7

AAA 3-Jul-15 6.8

AAA 4-Jul-15 2.3 211

BBB 1-Jul-15 0.1 255*100

BBB 2-Jul-15 3.2 2*1*151

;

run;

data want(drop=_:);

set have;

do _i=1 by 1;

_code_val=input(scan(CAT_CODES,_i,'*'), best32.);

if _code_val>=220 then delete;

if missing(_code_val) then leave;

end;

run;

View solution in original post

ballardw · Posted 07-22-2015 04:12 PM

I think you may need to go into some more detail about HOW you are using that "highest" code value.

It appears that LOC makes no difference, correct?

Your attempted code of:

IF HICODE GE 200 THEN DELETE ;

doesn't seem to quite match the results you indicate are desired because the AAA 4-Jul-15 record has a code of 211 but is kept.

WendyT · Posted 07-22-2015 04:30 PM

Ballardw-

You're absolutely right... I tried to make the example simple, and didn't get it right (now edited to be 220). The LOC and DATE variables were stand-ins for a group of identification variables.

The general idea is that the lower the code, the better the data. The dataset I'm working with is pre-summarized, and all codes that apply during the summarization period are shown in the CAT_CODES variable.

Thanks!

WendyT

slchen · Posted 07-22-2015 04:16 PM

data want;

set have;

if not missing(cat_code)then do;

num=count(cat_code,'*');

do i=1 to num;

hi_code=max(hi_code,scan(cat_code,i,'*'));

end;

if index(cat_code,'*')=0 then hi_code=cat_code;

if hi_code>200 then delete;

drop i num cat_code hi_code;

run;

DaveShea · Posted 07-23-2015 04:37 AM

Hi Wendy,

There are many ways to skin your cat. I think the excellent answer provided by SLCHEN is going to give you what you need, with one small amendment.

The statement:

do i=1 to num;

should be changed to:

do i=1 to num+1;

Without this minor change, any instances like your first test record, will not necessarily pick out the highest value from the CAT_CODE variable.

e.g CAT_CODE=1*2

The original code will return 1 as the highest value, which is not what you want.

Why is this ? The NUM variable contains the number of <asterisk> characters in CAT_CODE, not the number of codes concatenated together. When NUM=1, you have 2 codes to work through. By using the amended version, you will be able to examine each code in the concatenated CAT_CODE value, irrespective of the order that they appear.

This problem will occur when you have any number of codes greater than 1 in CAT_CODE.

I hope that this helps.

Downunder Dave.

Wellington

ChrisNZ · Posted 07-23-2015 07:12 AM

Slightly more compact.

data HAVE;

infile cards pad;

input @15 VAL @21 CAT_CODES $20.;

cards;

AAA 1-Jul-15 17.3 1*2

AAA 2-Jul-15 2.1 2*150*7

AAA 3-Jul-15 6.8

AAA 4-Jul-15 2.3 211

BBB 1-Jul-15 0.1 255*100

BBB 2-Jul-15 3.2 2*1*151

run;

data WANT ;

set HAVE ;

do until(missing(CODE));

I = sum(I,1);

CODE = input(scan(CAT_CODES, I, '*'),?? 32.) ;

HICODE = max (0, CODE, HICODE) ;

end ;

if HICODE lt 220;

keep VAL;

run ;

High-Performance SAS Coding - Third Edition

Patrick · Posted 07-23-2015 07:11 AM

Given your "final" dataset you're actually not really interested in the highest value but you want simply to delete all records where you find something higher than 220. If so then below code should work:

data have;

infile datalines truncover dsd dlm=' ';

input LOC $ DATE:anydtdte. VAL CAT_CODES:$20.;

format date date9.;

datalines;

AAA 1-Jul-15 17.3 1*2

AAA 2-Jul-15 2.1 2*150*7

AAA 3-Jul-15 6.8

AAA 4-Jul-15 2.3 211

BBB 1-Jul-15 0.1 255*100

BBB 2-Jul-15 3.2 2*1*151

;

run;

data want(drop=_:);

set have;

do _i=1 by 1;

_code_val=input(scan(CAT_CODES,_i,'*'), best32.);

if _code_val>=220 then delete;

if missing(_code_val) then leave;

end;

run;

MadhuKorni · Posted 07-23-2015 08:38 AM

data want;

set have;

cnt = countc(cat_codes,'*');

do i = 1 to cnt+1;

code = input(scan(cat_codes,i),8.);

if code gt 220 then delete;

end;

keep Loc Date Val;

run;

WendyT · Posted 07-24-2015 11:19 AM

Thanks so much to everyone - I learned something new from each one of your answers.

I awarded Patrick as 'correct' because his solution ran the fastest on my system.

Wendy

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away