Re: inporting character data

alanj · Posted 03-27-2008 07:26 PM

Greetings:

I am trying to import character with a field that has zeros on the left hand side instead of nulls. For example, the a value may have 000186TH instead of 186TH. What is the best way of handling this ? Should the entire field be parsed one character at a time, is there a string function that can access the left most fields or is there a flag that can be set which turns leading zeros to nulls ?

Thanks.

alan

deleted_user · Posted 03-28-2008 11:29 PM

Well a couple of options but it depends on the data.

Are you looking for the same number of left hand characters?
Or is it always a number followed by two characters?
Or, worst case, is it just any set of alphanumerics with leading zeros?

Case 1. There is a function for that
Case 2. There is a trick you can use
Case 3. I think you are stuck doing a character by character parsing.

Let me know more details and I'll try to help you out.

Ike Eisenhauer

alanj · Posted 03-31-2008 01:10 PM

Good morning:

Thanks, Ike.

The pattern is
zerosValidcharacrters

where zeros is anywhere from 1 to 5 zeros in a row.
Validcharacters is a street address and can be 186TH or Elm.

The following are possible in the file:
0186TH
00186TH
000186TH
0000186Th
00000186TH

Currently, I am using proc sql to process the data.

alan

Cynthia_sas · Posted 03-31-2008 02:06 PM

Hi:
The technique would vary. A simple translate function could get rid of zeroes, but would mess up on 00020TH or 000150TH depending on whether you'd ever have this:
zeroes part of ValidCharacters:
00020TH
00000150TH

(which it sounds like you could have the above).

But would you ever have this:

Elm00000 (zeroes after)
00000Apt10 (not ST, ND, RD, TH in the string)

cynthia

alanj · Posted 03-31-2008 02:17 PM

Good question, Cynthia. The apartment number is in a separate field and also has leading zeros.

You are absolutely right, that an address such as 1050 Elm would get lost if a translate were used. It's one of the reasons that I've found this problem to be such a sticky wicket.

Thanks for you reply.

Cynthia_sas · Posted 03-31-2008 02:52 PM

Ah, now we're getting closer...you've been showing just the "street" 000186TH or 00000Elm not the number and street. So I didn't consider that you could have this:

0001050 Elm

Is the Street NUMBER part of this field, too???? What's the max length of the field? Are there internal "spaces" so it is:
00001050 Elm??? or 000123 NE 1st St??? Is the field justified to the left, with leading zeroes, as you show -- what's trailing for short addresses??? spaces?

could you also have:
[pre]
123 Elm
0123 Elm
00123 Elm
000123 Elm
0000123 Elm
00000123 Elm
1050 Elm
001050 Elm
01050 Elm
00001050 Elm
1111 Twisted Turny Long Avenue, NE
[/pre]

That is a complication. In which case, looping through the field character by character might be your only choice. And, that means, you have to know when to set the bounds for stopping the loop -- because the loop will have the same issue with 0000150TH and 0001050 Elm if you don't stop it at the right place. And aren't these also possible valid street names:

0000NE 1st ST (so you'd stop at the alpha character, not at the 1)
000SE 10th Ave
000RR #3
0000SR 23
0000PO Box 61

I sort of expected that your leading zero data would look like this, so the number of zeroes in front of the address wouldn't vary the way you showed and that 123 Elm might have these variations:
[pre]
000000000000000000000000000123 Elm
000000000000000000000000123 Elm St
00000000000000000000000123 Elm St.
00000000000000000000123 Elm Street
000000000000000000000000001050 Elm
0000001234 NE 1st Street, Extended
1111 Twisted Turny Long Avenue, NE
[/pre]

I don't suppose you can go back to the folks who supply the data and have them change how they initialize the field -- with blanks instead of 0s??

cynthia

Tim_SAS · Posted 03-31-2008 03:30 PM

It's Regular Expressions to the rescue!

[pre]
data _null_;
infile datalines;
retain r;
if _n_ = 1 then
r = prxparse("/^0*(.*)$/");
input address $80.;
if prxmatch(r, address) then do;
call prxposn(r, 1, start, length);
address = substr(address, start, length);
end;
put address;
datalines;
000000000000000000000000000123 Elm
000000000000000000000000123 Elm St
00000000000000000000000123 Elm St.
00000000000000000000123 Elm Street
000000000000000000000000001050 Elm
0000001234 NE 1st Street, Extended
1111 Twisted Turny Long Avenue, NE
;;;;
[/pre]
produces
[pre]
42 data _null_;
43 infile datalines;
44 retain r;
45 if _n_ = 1 then
46 r = prxparse("/^0*(.*)$/");
47 input address $80.;
48 if prxmatch(r, address) then do;
49 call prxposn(r, 1, start, length);
50 address = substr(address, start, length);
51 end;
52 put address;
53 datalines;

123 Elm
123 Elm St
123 Elm St.
123 Elm Street
1050 Elm
1234 NE 1st Street, Extended
1111 Twisted Turny Long Avenue, NE
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds

61 ;;;;
[/pre]

alanj · Posted 03-31-2008 03:53 PM

Thanks, Tim.

Very interesting solution. I must admit that I hadn't heard of these functions before.

alan

alanj · Posted 03-31-2008 03:47 PM

Cynthia:

The text file includes a separate field for street suffix. Here is an example of the street number, suffix and name. The fields are fixed width. Leading zeros fill the field..

street number street suffix streetname
0000029221 W 000186TH

I am going to check with the vendor tomorrow about the format. Currently, the script hacks is way through each field with something like this.
set propstreetname = substr(propstreetname,3,length(propstreetname))
where find(propstreetname,'00') = 1;
update sasuser.propinfo

Getting data with leading zeros would make data loading much faster and easier.

Again, thanks for the reply.

alan

Peter_C · Posted 03-31-2008 04:00 PM

some people recommend the address cleaning services. ~ Perhaps that is, or should be, part of the Data Integration/Data Flux, or the "Q in ETL to the power of Q".

Although this is more of a language thing, it might be nice to have a proc that would help with address cleaning!

The immediate parsing of the numbers in the address is less of a problem than deciding for what that number is provided, e.g.floor/apartment/building/business-park-block number!

OK here is the first number [pre] number1 = input( scan( full_address,1, 'a' ), 32. );[/pre]
Just alter that 1 to N to get the N-th number out of the address.

I shall have to wait for SAS9.2 to see if the addition of 'a' modifier to SCAN() works the way that suits this challenge.

Here is something I've tested in SAS9.1.3, with the examples Cynthia provided[pre]data trial;
infile cards;
input full_address $char80. ;
number1 = input( substr( full_address, 1
, findc( trim(full_address) !!'0-'
, ' 1234567890', 'v' ) -1
)
, 32. );
list;cards;
0001050 Elm
00001050 Elm??? or
...[/pre]

It is not as attractive as the SAS9.2 solution, but for now, it delivers. ~~ as below
[pre]+FSVIEW: WORK.TRIAL (B)--------------------------------------+
| number1 full_address |
| |
| 1050 0001050 Elm |
| 1050 00001050 Elm??? or |
| 123 000123 NE 1st St??? Is the field justified|
| 123 123 Elm |
| 123 0123 Elm |
| 123 00123 Elm |
| 123 000123 Elm |
| 123 0000123 Elm |
| 123 00000123 Elm |
| 1050 1050 Elm |
| 1050 001050 Elm |
| 1050 01050 Elm |
| 1050 00001050 Elm |
| 1111 1111 Twisted Turny Long Avenue, NE |
| . |
| 150 0000150TH and |
| 1050 0001050 Elm if you don't stop it at the rig|
| 0 0000NE 1st ST (so you'd stop at the alpha c|
| 0 000SE 10th Ave |
| 0 000RR #3 |
| 0 0000SR 23 |
| 0 0000PO Box 61 |
| 123 123 Elm might have these variations: |
| . |
| 123 000000000000000000000000000123 Elm |
| 123 000000000000000000000000123 Elm St |
| 123 00000000000000000000000123 Elm St. |
| 123 00000000000000000000123 Elm Street |
| 1050 000000000000000000000000001050 Elm |
| 1234 0000001234 NE 1st Street, Extended |
| 1111 1111 Twisted Turny Long Avenue, NE |
| |
+-------------------------------------------------------------+[/pre]
missing values for "number1" were created for blank datalines.

PeterC

alanj · Posted 03-31-2008 04:03 PM

Thanks, Peter.

That's a very clever solution. Hadn't occurred to me at all.

Cynthia_sas · Posted 03-31-2008 06:52 PM

Hi:
I was going to recommend the PRX (Perl Regular Expressions) method, too -- but Tim beat me to it. Perl regular expressions are cool and were introduced in SAS 9. For more info, see here:
http://www2.sas.com/proceedings/sugi29/265-29.pdf
http://support.sas.com/rnd/base/datastep/perl_regexp/index.html
http://analytics.ncsu.edu/sesug/2006/AP09_06.PDF
http://www.pharmasug.org/2005/TU02.pdf

I think that otherwise, something with scan, substr, findc, anyalpha and/or notalpha functions could help you divide the address into the leading zeroes and the rest of the field -- but not as elegantly as PRX functions.

cynthia

Registration is open

SAS Training: Just a Click Away