Text mining and content categorization

Extracting 15-16 digit number from TEXT

Reply
Contributor
Posts: 40

Extracting 15-16 digit number from TEXT

I created an Excel VBA to extract Outlook Email to Excel file. I am using PC-SAS to upload this into SQL server.

Then I am using EG to connect to SQL server and manipulate the table, which will be used to automate another report.

 

One of the columns in that table includes a text field (Body of the email originally), which contains 15-16 digit account or card number. Since the email sent out by different people, the body text varies and even some writes card#, others write card number, or account number or nothing at all, but there will be a 15 or 16 digit number in there.

 

I was susccessful in getting rid of all the other junk from the body of the email except numbers using the compress function:

 

body1 = compress(body, , "ap")

 

However, the new column column - body1 looks like this now:

30 2016 1106                 1234567891011121     032916 1   032916 1   032916 1                          358650             76

 

Here, I wanted only the 16 digit number (1234567891011121).

Is there any function or option I can use to extract that? This 16 digit number is not always in this position, so I can't use substr, index, scan etc.

 

Please advise. Thanks!

 

 

Super Contributor
Posts: 409

Re: Extracting 15-16 digit number from TEXT

Have a look at the PRX functions. Seems like a job for PRXMATCH().

 

If needed I can help with code but I do not want to deprive you of a solid learning experience. The PRX functions are worth the effort.

 

Good luck,

- Jan.

Super User
Posts: 5,323

Re: Extracting 15-16 digit number from TEXT

Well, you could use scan() in a do until/while loop, together with the length() function.

Data never sleeps
Super User
Posts: 5,260

Re: Extracting 15-16 digit number from TEXT

Are you saying that the data line only contains digits, or might you find:

 

12345 1 Card#1234567891011121 135

 

I agree that PRX is the best way to go, but it can be done with simpler functions.  But it's important to know what the data might contain.

 

If "Card#" is part of the data, would there be a space after "Card#" never, always, or sometimes?

 

Could there ever be two 16-digit fields on the same line of data?

Contributor
Posts: 40

Re: Extracting 15-16 digit number from TEXT

Yes, after using this

body1 = compress(body, , "ap")

 

I created another variable body1, which contains only numbers, but extraction includes other numbers in the Outlook email body ..like date and all, but none of them are 15 or 16 digit...for example like this shown below. All of them are numbers in this variable. It happens to be only one 15 or 16 digit in the email body. I am wondering how to extract that 16 digit number from this column and remove the rest. Since this number could be in the beginning or end or middle, I can't think of the common functions to use to extract. I will try looking into PRX function and other regex stuff. Thanks!

 

12332    12   12  44   556                   9876543231123456    4564   23

 

Super User
Posts: 5,260

Re: Extracting 15-16 digit number from TEXT

To embellish on the LinusH suggestion, then, you could try:

 

data want;

set have;

if body1 > ' ' then do i=1 to countw(body1);

   test_len = scan(body1, i, ' ' );

   if length(test_len) in (15, 16) then do;

      card = test_len;

      output;

      return;

   end;

end;

output;

run;

 

Contributor
Posts: 40

Re: Extracting 15-16 digit number from TEXT

Thanks, this worked, but I am having one problem in a few instances. When I use compress(body, ,"ap")  I get some of the date numbers coalesce together and shows up as 15-16 digit numbers or merge with the actual account number. I will try to see if I can change the compress option to keep the space, but remove all a-z, special characters, so that the account number is always separated from others.

 

Thanks again!

Super User
Posts: 5,260

Re: Extracting 15-16 digit number from TEXT

Rather than COMPRESS these characters, you might change them to blanks.  The SAS function name to do this escapes me off the top of my head, but it should be a good alternative.  (Of course, PRX is still a very good tool and doesn't require changing the data.)

Super User
Posts: 9,780

Re: Extracting 15-16 digit number from TEXT

Yeah. It is PRX thing.



data have;
x='12332    12   12  44   556                   9876543231123456    4564   23';
pid=prxparse('/\b\d{16}\b/');
call prxsubstr(pid,x,p,l);
if p gt 0 then want=substr(x,p,l);
drop pid p l;
run;


Trusted Advisor
Posts: 1,137

Re: Extracting 15-16 digit number from TEXT

As Jan righly mentioned PRXMATCH function would be a good choice in this scenario.

 

Please try

 

data have;
input text $ 1-75;
substring=substr(text,prxmatch("/(\d){16}/",text),16);
cards;
12332    12   12  44   556                   9876543231123456    4564   23
;

proc print;
run;

 

Thanks,
Jag
Contributor
Posts: 40

Re: Extracting 15-16 digit number from TEXT

Since compress was not working fine in all the situations, I kept trying PRXPARSE. However, I could not get this working.

The VBA code works fine pulling all the emails and the body of the email into SAS dataset. Since I was using compress(body, , "ap") initially, it was not working in some situations and outputing additional 15/16 digit numbers sometimes.

 

I then wanted to go directly to the body of the email and find either 15 or 16 digit numbers. Also, I noticed sometimes there are more than 1 15-16 digit numbers in a single email body. I want all the numbers into another column or columns.

 

Can someone provide a piece of code that works?

I tried this below and it kept extracting "1" in body2 for all of them. I was readying Ron Cody's paper on this and could not make it to work even then.

 

body2 = prxparse('/\d{15,16}/');

 

Thanks a lot!

Super User
Posts: 9,780

Re: Extracting 15-16 digit number from TEXT

Could you post some data and the output you want to see ?
Contributor
Posts: 40

Re: Extracting 15-16 digit number from TEXT

Sender Subject Date Body NEED TO EXTRACT CARD1 CARD2
senderemail@email.com Subject 1 21Apr2016 Customer Name - xyz xyz Number - 1234567891234567 (card) Flags- sent 04/15/16, calls 04/19/16 This is some other text. Some more texts here. There might be numbers in date format or any other format here. 1234567891234567 1234567891234567  
senderemail1@email.com Subject 2 21Apr2016 Customer Name - abc, efg Card Number - 987654321123456   ABC - 1234567891234567 No flag The Suspicion id based on - Both Some other number here, not 15 or 16 digits. Some dates here Additional comments here. 987654321123456 1234567891234567 987654321123456 1234567891234567

 

Thanks, The email extraction looks similar to above. I have hundreds of emails. Email body looks like the one above under "Body".

I moved this excel file through proc import codes, so this is 99% automated from Outlook to SAS to Sending out reports. I need to extract via SAS the rest of the columns shown on the right side of Body column above. I wanted all the 15 and 16 digit card numbers from the body of the email. Most of the emails will only have 1 number, but some will have more than 1 up to 3-4 max, but I wanted to extract them and put them into card1, card2, card3 etc., columns. Please let me know if it is not clear. Thanks again.

Super User
Posts: 9,780

Re: Extracting 15-16 digit number from TEXT

OK. What if there are some digit great than 16 , What you gotta do ?

data have;
infile cards truncover;
input (Sender Subject Date) (:$40.) x $300.;
n+1;
cards;
senderemail@email.com Subject1 21Apr2016 Customer Name - xyz xyz Number - 1234567891234567 (card) Flags- sent 04/15/16, calls 04/19/16 This is some other text. Some more texts here. There might be numbers in date format or any other format here.
senderemail1@email.com Subject2 21Apr2016 Customer Name - abc, efg Card Number - 987654321123456   ABC - 1234567891234567 No flag The Suspicion id based on - Both Some other number here, not 15 or 16 digits. Some dates here Additional comments here.
;
run;
data temp;
 set have;
 pid=prxparse('/\d+/');
 s=1;
 e=length(x);
 call prxnext(pid,s,e,x,p,l);
 do while(p>0);
  found=substr(x,p,l);
  if length(found) in (15,16) then output;
  call prxnext(pid,s,e,x,p,l);
 end;
 drop s e p l pid;
run;
proc transpose data=temp out=want prefix=card;
by n Sender Subject Date x notsorted;
var found;
run;
Contributor
Posts: 40

Re: Extracting 15-16 digit number from TEXT

You are a life saver. This works wonderfully. Thanks a lot again.

 

There is not going to be any number more than 16 digits, if it is there, I don't actually need them to be extracted, so this works with all the emails I have so far. I need to validate the data once again, but all seems to be pretty good with the quick scan.

Ask a Question
Discussion stats
  • 16 replies
  • 1081 views
  • 6 likes
  • 7 in conversation