Re: How to extract desired numericals in irregular cells?

agodba · Posted 06-02-2015 07:53 AM

Hi all,

I have an excel data file. In that file there are two columns I need to use and one column of it is irregular. As you guess I need to extract some numerical variables from that irregular contained column. For example I need to find the zipcodes which starts with "60". I have managed to write the formula of it in excel which is

"=IFERROR(IF(AND(LEN(TRIM(SUBSTITUTE(MID(K2,FIND(TEXT($L$1,0),K2,1)-1,1)&MID(K2,FIND(TEXT($L$1,0),K2,1)+5,1),CHAR(160),"")))=0,MID(K2,FIND(TEXT($L$1,0),K2,1),2)<>RIGHT(K2,2)),$L$1,"X"),"X")"

*Zip codes are 5 digits.

What does this formula do? : It finds the starting position of desired 2 digit then

1- Check if the letter before that 2 zip codes is empty and
2- Check if the 3 letter after that 2 zip code is empty and
3- Check if there is any letter after 2 zip code just in case those digits are the last letter of that cell.

You can see some lines of the source excel file. If you are not able to download it I added some lines as well.

Irregular Column	Desired zip codes Starting with
a-haUKxUt 60000 WawawxwK tIhwahwTw 60	60
KxIwaww 66000 yawtIh yInbaTwTw 66-60	60
wtwwwxhh 00666 WaIwxw twIawahwwTw 6	60
wUwt Vth 060006 00000 * *	60
awaKx 06060 IhtttwTxwT tawtttIhtaw wTw 666	60
KxIwaww 60066 yawtIh YtwnbKwTw 00	60
wUwt Vth 660666 00000 * *	60
wUwt Vth 600660 00000 * *	60
awaKx twUyta 00666 yxw atwTaw yxtawwTw 60	60
awaKx twtwwaw 06600 xUtwyUwt ytyIhtaw wTw 66	60
TatUT-hxy+tUT 00006 wUaytyxUwah/TyUawIh txhtahwxtnbxaw tx	60
Iwwx-wxwKT 66660 ttwahyUwt yttyawtatwaw wTw 600	60

Any help or idea is appreciated. Thanks in advance, have a nice day.

RW9 · Posted 06-02-2015 08:04 AM

Hi,

You seem to be asking for advice on Excel issues, maybe post it on an Excel forum? To do this in SAS then you could use several functions, or perl regular expressions.

agodba · Posted 06-02-2015 09:02 AM

Hi,

Sorry if I caused any missunderstanding. I don't ask for excel advices. I need to use that excel source file in SAS as it is.

I have written the excel solution that i used in order to give some ideas about my questions, also I thought that some SAS codes that i don't know yet can be written by understanding the rules that i used in excel formulas.

So all i want to learn is those several functions, or perl regular expressions that you mentioned.

Kurt_Bremser · Posted 06-02-2015 09:20 AM

Use the scan() function to extract the "words" from the long string (do it iteratively from 1 to countw()).

Then check each "word" for a length of 5 and that it is numeric (use the notdigit() function).

Since you now know that you have 5 digits, you only need to compare the substr(string_var,1,2) to your reference value ('60').

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Astounding · Posted 06-02-2015 01:15 PM

To add a bit of context to KurtBremser's answer:

data want;

set have;

length nextword $ 6; /* don't need to extract any more than 6 characters */

length zip_code $ 5;

if irregular_string > ' ' then do i=1 to countw(irregular_string, ' ');

nextword = scan(irregular_string, i, ' ');

if length(nextword)=5 and notdigit(next_word) = 6 then do;

if next_word =: '60' then zip_code = next_word;

end;

drop next_word i;

run;

The logic might be a bit trickier than it looks. Once the length of a word is established as 5, the 6th character must be a blank. So the NOTDIGIT function must return 6 (since blanks are not digits) to identify a 5-digit number. Also, by defining the length of zip_code as $ 5, that trailing blank is automatically dropped from its value. Also note that COUNTW uses a set of characters as default delimiters, so specify a blank as the only allowable delimiter. Using =: is a faster way to examine the beginning of a character string (instead of substr).

Good luck.

How to extract desired numericals in irregular cells?