Hi,
I tryed to import some text based report as attach, my script to import is:
data store (drop=Key0);
infile "C:\Users\Administrator\Desktop\YULIA\sample.txt" delimiter = ' ' MISSOVER DSD LRECL=10000 pad;
retain account_number date time amount type response thestore;
/*input key1 $2-3 @;
if key1 in ('3','4','5')then input
@2 account_number2 $19.;*/
input
@2 account_number $19.
@22 Date ddmmyy8.
@31 time $4.
@36 amount $15.
@53 type $8.
@62 response $9. ;
input key0 $52-59 @;
if key0='THESTORE' then input
@61 thestore $35.;
format date date9.;
if date ne .;
run;
my problem is the output missing some observation like below :
account_number | date | time | amount | type | response | thestore |
8-Jul-85 | 1210 | 999,800 | RETAIL | MORE | WALLMART MMM | |
0000-8888-9999-7990 | 8-Jul-85 | 1210 | 790,000 | RETAIL | MORE | PHILLIPS MMM |
0000-8888-3333-7853 | 8-Jul-85 | 1240 | 969,955 | RETAIL | MORE | WE HAVE ALL PRODUCT MMM |
the output should be like below :
account_number | date | time | amount | type | response | thestore |
0000-8888-9999-8896 | 8-Jul-85 | 1210 | 590,000 | RETAIL | MORE | TIMEZONE MMM |
8-Jul-85 | 1210 | 999,800 | RETAIL | MORE | WALLMART MMM | |
0000-8888-9999-7990 | 8-Jul-85 | 1210 | 790,000 | RETAIL | MORE | PHILLIPS MMM |
0000-8888-3333-0697 | 8-Jul-85 | 1230 | 57,000 | RETAIL | MORE | YOUR STORE AVAILABLE MMM |
0000-8888-3333-7853 | 8-Jul-85 | 1240 | 969,955 | RETAIL | MORE | WE HAVE ALL PRODUCT MMM |
have anyone any idea whats wrong with my code that causing missing some observation?
thanks before.
OK . Assuming date would always be non-missing .
data want; infile 'c:\temp\sample.txt' ; input; length account_number date time amount type response $ 100; retain account_number date time amount type response ; if prxmatch('/\d\d\/\d\d\/\d\d/o',_infile_) then do; account_number=substr(_infile_,2,19); date=left(substr(_infile_,22,8)); time=left(substr(_infile_,31,4)); amount=left(substr(_infile_,36,15)); type=left(substr(_infile_,51,10)); response=left(substr(_infile_,61,9)); end; else if prxmatch('/THESTORE:/io',_infile_) then do; thestore=left(compbl(scan(_infile_,-1,':'))); output; end; run;
Xia Keshan
Looks like you are getting a lot of bits in the dataset which you are dropping with the if date ne .. Have a look without that line in and you will see that the data doesn't make much sense.
So you can see, in the above, account_number is read in from the file as blank in row 2, this value is retained for the next row which has a date, and thus comes out but with missing as account number.
From my side, I would say this "report" will take a fair bit of working to get into any readable manner. It would be easier if you could get the source document. Alternatively if that is not possible then read each line in as $2000. Then post process it with further datasteps.
thanks for reply, I use date ne . to cleansing unnecessary data, because all transaction have date value, my problem is why account number 0000-8888-9999-8896 & 0000-8888-3333-0697 is missing from my dataset.
Per my mail, when you look at what is imported the account number is missing for the first row where date is not missing, hence there is nothing to put in that variable, or retain for next records. Hence why I suggest to read in all the data as long lengths, and then post-process the information to separate out the sections. Ksharp has given a good example of the post-processing with perl regular expressions above.
OK . Assuming date would always be non-missing .
data want; infile 'c:\temp\sample.txt' ; input; length account_number date time amount type response $ 100; retain account_number date time amount type response ; if prxmatch('/\d\d\/\d\d\/\d\d/o',_infile_) then do; account_number=substr(_infile_,2,19); date=left(substr(_infile_,22,8)); time=left(substr(_infile_,31,4)); amount=left(substr(_infile_,36,15)); type=left(substr(_infile_,51,10)); response=left(substr(_infile_,61,9)); end; else if prxmatch('/THESTORE:/io',_infile_) then do; thestore=left(compbl(scan(_infile_,-1,':'))); output; end; run;
Xia Keshan
work like a charm!! amazing, thanks KSharp..
Ksharp, can you elaborate more what "/\d\d\/\d\d\/\d\d/o" is stand for? thanks.
"/\d\d\/\d\d\/\d\d/o" is a pattern matching string describing what you are looking for.
"/.../o" frames the pattern and instruct PRXMATCH to compile the pattern only once, instead of every time the function is called.
\d matches a single digit
\/ matches s slash
So the pattern reads Look for two digits followed by a slash, followed by two digits, followed by a slash, followed by two digits.
PG
Thanks for the answer PG, one more question, so the perl expressions is match two digits followed by a slash, followed by two digits, followed by a slash, followed by two digits, in my case, how come perl expression recognized blank account number for second observation like result below?
account_number | date | time | amount | type | response | thestore |
0000-8888-9999-8896 | 8-Jul-85 | 1210 | 590,000 | RETAIL | MORE | TIMEZONE MMM |
8-Jul-85 | 1210 | 999,800 | RETAIL | MORE | WALLMART MMM |
Thanks PG.
Oh how stupid of me, sorry PG, yes now I understand that KSharp look for any date contain in text that have 2 digit with slash. Thanks man, youre all awesome.
The pattern is not aimed at the account number, it is looking for a date. It matches 08/07/85. - PG
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.