06-02-2015 07:53 AM
I have an excel data file. In that file there are two columns I need to use and one column of it is irregular. As you guess I need to extract some numerical variables from that irregular contained column. For example I need to find the zipcodes which starts with "60". I have managed to write the formula of it in excel which is
*Zip codes are 5 digits.
What does this formula do? : It finds the starting position of desired 2 digit then
1- Check if the letter before that 2 zip codes is empty and
2- Check if the 3 letter after that 2 zip code is empty and
3- Check if there is any letter after 2 zip code just in case those digits are the last letter of that cell.
You can see some lines of the source excel file. If you are not able to download it I added some lines as well.
|Irregular Column||Desired zip codes Starting with|
|a-haUKxUt 60000 WawawxwK tIhwahwTw 60||60|
|KxIwaww 66000 yawtIh yInbaTwTw 66-60||60|
|wtwwwxhh 00666 WaIwxw twIawahwwTw 6||60|
|wUwt Vth 060006 00000 *** ***||60|
|awaKx 06060 IhtttwTxwT tawtttIhtaw wTw|
|KxIwaww 60066 yawtIh YtwnbKwTw 00||60|
|wUwt Vth 660666 00000 *** ***||60|
|wUwt Vth 600660 00000 *** ***||60|
|awaKx twUyta 00666 yxw atwTaw yxtawwTw 60||60|
|awaKx twtwwaw 06600 xUtwyUwt ytyIhtaw wTw|
|TatUT-hxy+tUT 00006 wUaytyxUwah/TyUawIh|
|Iwwx-wxwKT 66660 ttwahyUwt yttyawtatwaw|
Any help or idea is appreciated. Thanks in advance, have a nice day.
06-02-2015 08:04 AM
You seem to be asking for advice on Excel issues, maybe post it on an Excel forum? To do this in SAS then you could use several functions, or perl regular expressions.
06-02-2015 09:02 AM
Sorry if I caused any missunderstanding. I don't ask for excel advices. I need to use that excel source file in SAS as it is.
I have written the excel solution that i used in order to give some ideas about my questions, also I thought that some SAS codes that i don't know yet can be written by understanding the rules that i used in excel formulas.
So all i want to learn is those several functions, or perl regular expressions that you mentioned.
06-02-2015 09:20 AM
Use the scan() function to extract the "words" from the long string (do it iteratively from 1 to countw()).
Then check each "word" for a length of 5 and that it is numeric (use the notdigit() function).
Since you now know that you have 5 digits, you only need to compare the substr(string_var,1,2) to your reference value ('60').
06-02-2015 01:15 PM
To add a bit of context to KurtBremser's answer:
length nextword $ 6; /* don't need to extract any more than 6 characters */
length zip_code $ 5;
if irregular_string > ' ' then do i=1 to countw(irregular_string, ' ');
nextword = scan(irregular_string, i, ' ');
if length(nextword)=5 and notdigit(next_word) = 6 then do;
if next_word =: '60' then zip_code = next_word;
drop next_word i;
The logic might be a bit trickier than it looks. Once the length of a word is established as 5, the 6th character must be a blank. So the NOTDIGIT function must return 6 (since blanks are not digits) to identify a 5-digit number. Also, by defining the length of zip_code as $ 5, that trailing blank is automatically dropped from its value. Also note that COUNTW uses a set of characters as default delimiters, so specify a blank as the only allowable delimiter. Using =: is a faster way to examine the beginning of a character string (instead of substr).