BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
leeklammer
Fluorite | Level 6

Hello,

This is my situation:

  1. My company stores large datasets (20M+ records) as zipped txt files on a daily basis.
  2. They are stored as zipped txt files, rather than keeping data in a database, to conserve disk space; so I cannot simply unzip all of the files outside of SAS and focus on uploading simple txt files.
  3. All of our machines have 7zip software, which SAS can utilize (I believe).
  4. The txt files are tab delimited and include 10 variables
  5. Additionally, I need to import the data for each day of a 120 day date-range into SAS (which I plan to accomplish with a DO loop)

My code seems to be unzipping the files correctly and the variables are correctly formatted, but every single value is missing.  Any ideas?

Here is my code:

       filename ZIPFILE pipe '"C:\7z.exe" e

                           [file path]\20130226_BKV_imps.zip

                            20130226_BKV_imps.txt

                            -y -so';

     Data work.temp;

  format

  UniqueID1 BEST9. UniqueID2 BEST10. Tentative BEST1. IPAddress $CHAR13. Time DATETime18.

  PlacementID BEST9. AdID BEST9. Click BEST1. AdvertiserID BEST6. AudienceID BEST10.;

  infile ZIPFILE dsd dlm=' ' lrecl=300 truncover missover;

  input

  UniqueID1 : ?? BEST9. UniqueID2 : ?? BEST10. Tentative : ?? BEST1. IPAddress : $CHAR13.

  Time : ?? ANYDTDTM23. PlacementID : ?? BEST9. AdID : ?? BEST9. Click : ?? BEST1.

  AdvertiserID : ?? BEST6. AudienceID : ?? BEST10. ;

run;

Here is the Log:

NOTE: The infile ZIPFILE is:

      Unnamed Pipe Access Device,

      PROCESS="C:\7z.exe" e                          

     [file path]\20121225_BKV_imps.zip 

                               20121225_BKV_imps.txt                           -y -so,

      RECFM=V,LRECL=300

Stderr output:

7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18

Processing archive:

[file path]\20121225_BKV_imps.zip

Extracting  20121225_BKV_imps.txt

Everything is Ok

Size:       245463226

Compressed: 30965060

NOTE: 1323286 records were read from the infile ZIPFILE.

      The minimum record length was 2.

      The maximum record length was 203.

      One or more lines were truncated.

NOTE: The data set WORK.TEMP has 1323286 observations and 10 variables.

NOTE: DATA statement used (Total process time):

      real time           4.64 seconds

      cpu time            3.24 seconds

Here is the output:

Capture.PNG

I am using Enterprise Guide 5.1 with SAS 9.3

1 ACCEPTED SOLUTION

Accepted Solutions
leeklammer
Fluorite | Level 6

Hey Tom,

Thanks very much for your insight on this.  I ended up going with X-command to complete this task.

x "cd C:\Program Files\7-Zip\";

x '7z.exe e -o[file path]\TEMP\OUTPUT [file path]\TEMP\20130226_BKV_imps.zip';

Data work.temp;

  length UniqueID1 UniqueID2 Tentative 8 IPAddress $13. Time

  PlacementID AdID Click AdvertiserID AudienceID 8;

  format IPAddress $CHAR13. Time DATETime18.;

  informat IPAddress $CHAR13. Time ANYDTDTM23.;

  infile '[file path]\TEMP\OUTPUT\20130226_BKV_imps.txt'

  dsd dlm='09'x lrecl=300 truncover;

  input UniqueID1 -- AudienceID;

run;

x "cd [file path]\TEMP\OUTPUT\";

x "del 20130226_BKV_imps.txt";

View solution in original post

12 REPLIES 12
Tom
Super User Tom
Super User

A couple of things to try.

1) Specify the delimiter using hexadecimal literal syntax to prevent tab character in your code from being converted to spaces.

    DLM='09'x

2) Remove the MISSOVER option as it is overriding the TRUNCOVER option that you want to use because you have listed it last in the INFILE statement.

3) Convert your FORMAT statement to a length statement.  Most of the formats listed are of no value and it will be clearer to you and SAS what type/length the variables should be it you state it explicitly using LENGTH (or ATTRIB) statement instead of implying if by the format you attach to the variable.

  length UniqueID1 UniqueID2 Tentative 8 IPAddress $13

         Time PlacementID AdID Click AdvertiserID AudienceID 8

  ;

  format IPAddress $CHAR13. Time DATETime18. ;


4) Use an INFORMAT statement to tell SAS how to read the data that needs special processing.  Simplify the INPUT statement that just list the variables.  You could even use variable list since you will have explicitly defined the variable order with the LENGTH statement above.


  informat IPAddress $CHAR13. Time DATETime18. ;

  input UniqueID1 -- AudienceID ;


Also since you will have removed the ?? input operator you will see in the SAS LOG any values that cause conversion failures that are currently generating the missing values.

leeklammer
Fluorite | Level 6

Thanks for responding Tom!

I made the edits you suggested; however, I am still getting a dataset full of missing values.

Code:

filename ZIPFILE pipe '"C:\7z.exe" e

                          [file path]\20121225_BKV_imps.zip

                           20121225_BKV_imps.txt

                           -y -so';

Data work.temp;

  length UniqueID1 UniqueID2 Tentative 8 IPAddress $13. Time

  PlacementID AdID Click AdvertiserID AudienceID 8;

  format IPAddress $CHAR13. Time DATETime18.;

  informat IPAddress $CHAR13. Time DATETime18.;

  infile ZIPFILE dsd dlm='09'x lrecl=300 truncover;

  input UniqueID1 -- AudienceID;

run;

Log:

NOTE: The infile ZIPFILE is:

      Unnamed Pipe Access Device,

 

      PROCESS="C:\7z.exe" e                      

     [file path]\20121225_BKV_imps.zip

                               20121225_BKV_imps.txt                           -y -so,

      RECFM=V,LRECL=300

NOTE: Invalid data for UniqueID1 in line 1 1-20.

NOTE: Invalid data for UniqueID2 in line 1 22-42.

NOTE: Invalid data for Tentative in line 1 44-46.

NOTE: Invalid data for Time in line 1 74-120.

NOTE: Invalid data for PlacementID in line 1 122-140.

NOTE: Invalid data for AdID in line 1 142-160.

NOTE: Invalid data for Click in line 1 162-164.

NOTE: Invalid data for AdvertiserID in line 1 166-178.

NOTE: Invalid data for AudienceID in line 1 180-184.

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+---

1   CHAR  ÿþ8.1.6.3.6.1.1.7.2...2.1.3.6.9.8.9.9.4.2...0...[.......IP.......]...2.0.1.2.-.1.2.

    ZONE  FF30303030303030303000303030303030303030300030003030203030302030303020300030303030203030

    NUMR  FE80106030601010702090201030609080909040209000901020E0200050E0204020E0009020001020D01020

89  -.2.5. .0.4.:.1.8.:.0.3...0.0.0...3.9.7.9.4.1.2.5.1...3.0.1.3.5.3.1.5.0...0...1.6.5.6.5.

    ZONE  2030302030303030303030302030303000303030303030303030003030303030303030300030003030303030

    NUMR  D02050000040A01080A00030E000000090309070904010205010903000103050301050009000901060506050

177  6...0... 184

    ZONE  30003000

    NUMR  609000D0

UniqueID1=. UniqueID2=. Tentative=. IPAddress= 1 2 . 2 0 5  Time=. PlacementID=. AdID=. Click=.

AdvertiserID=. AudienceID=. _ERROR_=1 _N_=1

[...]

7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18

Processing archive:

[file path]\20121225_BKV_imps.zip

Extracting  20121225_BKV_imps.txt

Everything is Ok

Size:       245463226

Compressed: 30965060

NOTE: 1323286 records were read from the infile ZIPFILE.

      The minimum record length was 2.

      The maximum record length was 203.

      One or more lines were truncated.

NOTE: The data set WORK.TEMP has 1323286 observations and 10 variables.

NOTE: DATA statement used (Total process time):

      real time           5.23 seconds

      cpu time            3.91 seconds

Output:

Capture.PNG

Tom
Super User Tom
Super User

Looks like your data is encoded as UTF-16 or some other 2 byte coding sequence. Probably if you had expanded into a file  and then read from the file SAS or the operating system could have auto detected the file encoding and it would have adjusted that encoding as i read it into the input buffer.

The first two bytes 'FFFE'X is the BOM (Byte order mark - Wikipedia, the free encyclopedia) that indicates the encoding.  This indicates that it is UTF-16 (each character is 2 bytes) with the first byte being the least significant.  

Knowing this you can now see that the 9 digit UniqueID1 value is stored in 18 bytes with every other byte being binary zero ('00'x).

I do not know anything about 7z.exe, but perhaps the -scc or -scs option can change this and make it output normal characters?

You could also try adding ENCODING option to your INFILE or FILENAME statement.

leeklammer
Fluorite | Level 6

I get an error message when I try to use the ENCODING option in either INFILE or FILENAME (ERROR 23-2: Invalid option name ENCODING.) and the dataset has 0 observations when I try out the -scc and -scs options.

I had been using the example found on the bottom of page 4 of this paper (http://www2.sas.com/proceedings/sugi31/155-31.pdf) to model my code; but it may be that 7zip is not a good option for this.

Tom, if you were going to import a zipped txt file using SAS, how would you do it?

temp.jpg

Tom
Super User Tom
Super User

I use WinZIP on PC and unzip on Unix.

Did you try ENCODING=u16l ?

SAS(R) 9.2 National Language Support (NLS): Reference Guide

leeklammer
Fluorite | Level 6

I did try ENCODING="u16l", but I got errors in both the FILENAME and INFILE statements.

Try 1:

16         filename ZIPFILE pipe '"C:\7z.exe" e

17                                  [file path]\20121225_BKV_imps.zip

18                                    20121225_BKV_imps.txt

19                                    -y -so' encoding="u16l";

                                              ________

                                              23

ERROR: Error in the FILENAME statement.

ERROR 23-2: Invalid option name encoding.

Try 2:

21         Data work.temp;

22         length UniqueID1 UniqueID2 Tentative 8 IPAddress $13. Time

23         PlacementID AdID Click AdvertiserID AudienceID 8;

24         format IPAddress $CHAR13. Time DATETime18.;

25         informat IPAddress $CHAR13. Time ANYDTDTM23.;

26         infile ZIPFILE dsd dlm='09'x lrecl=300 truncover encoding="u16l";

                                                             ________

                                                             23

ERROR 23-2: Invalid option name ENCODING.

27         input UniqueID1 -- AudienceID;

28         run;

Tom
Super User Tom
Super User

What version of SAS are you running and what operating system?  Encoding option works for me using SAS 9.2 on Unix.  Try running the little program below to generate a file with UTF-16LE encoding and try to read it from a pipe.  (If you are using Windows then change the command used to dump the file to standard output from cat to appropriate Windows command.)  When I run it without the ENCODING option on the INFILE statement I get similar content as you had.  When I add ENCODING=U16L most of the data is now readable, but for some reason there is one extra character ('1A'x) in the beginning of the first line and a warning message in the log about a character not be transcoded.

WARNING: A character that could not be transcoded was encountered.

line="Alfred M 14 69 112.5"

RULE:     ----+----1----+----2----+----3----+----4

1   CHAR  .Alfred M 14 69 112.5 21

    ZONE  146676624233233233323

    NUMR  A1C62540D0140690112E5

line="Alice F 13 56.5 84"

2         Alice F 13 56.5 84 18

%let tmpfile=%sysfunc(pathname(work))/temp.txt;

options generic;

filename xx "&tmpfile";

filename yy pipe "cat &tmpfile";


* Create UTF-16LE file ;

data _null_;

  file xx encoding= U16L ;

  set sashelp.class (obs=2);

  put (_all_) (:) ;

run;

* Read from std input ;

data _null_;

  infile yy truncover ;

  input line $char80.;

  list;

  put line= $quote.;

run;

* Add ENCODING option ;

data _null_;

  infile yy encoding=U16L truncover;

  input line $char80.;

  list;

  put line= $quote.;

run;

* Skip extra strange character on first line ;

data _null_;

  infile yy encoding=U16L truncover ;

  if _n_=1 then input xxx $char1. @;

  input line $char80.;

  list;

  put line= $quote.;

run;


filename xx clear;

filename yy clear;

options nogeneric;

Tom
Super User Tom
Super User

So I played around with Windows SAS and apparently it only likes the ENCODING= option on an INFILE statement when referencing an actual physical file, not when using FILENAME or PIPE engine.

You might want to implement a two step process.

1) Extract the file.

filename zipfile temp;

data _null_;

  infile  %sysfunc(quote(

"C:\7z.exe" e "[file path]\20130226_BKV_imps.zip" 20130226_BKV_imps.txt -y -so > "%sysfunc(pathname(zipfile))"

          )) pipe ;

  input;

  put _infile_;

run;

2) Read it.

....

infile zipfile dlm='09'x dsd truncover ;

....

leeklammer
Fluorite | Level 6

Hey Tom,

Thanks very much for your insight on this.  I ended up going with X-command to complete this task.

x "cd C:\Program Files\7-Zip\";

x '7z.exe e -o[file path]\TEMP\OUTPUT [file path]\TEMP\20130226_BKV_imps.zip';

Data work.temp;

  length UniqueID1 UniqueID2 Tentative 8 IPAddress $13. Time

  PlacementID AdID Click AdvertiserID AudienceID 8;

  format IPAddress $CHAR13. Time DATETime18.;

  informat IPAddress $CHAR13. Time ANYDTDTM23.;

  infile '[file path]\TEMP\OUTPUT\20130226_BKV_imps.txt'

  dsd dlm='09'x lrecl=300 truncover;

  input UniqueID1 -- AudienceID;

run;

x "cd [file path]\TEMP\OUTPUT\";

x "del 20130226_BKV_imps.txt";

RAVINDERKUMAR
Calcite | Level 5

HI TOM,

I want to download a file from web which is in zip format then import to sas data set. I used your suggestion but my program not produce correct result.

Please help me to resolve the same. My program and log is below mention...

Program...........

FILENAME NIFDATA URL "%STR(http://nseindia.com/content/historical/DERIVATIVES/2014/JAN/fo21JAN2014bhav.csv.zip)" DEBUG lrecl=8192;

data _null_ ;

unzipcmd='"C:\Program Files\7-Zip\7z.exe" -min -e -o ' ;

zipfile= "NIFDATA";

whereto='C:\RT' ;

cmd=unzipcmd || ' ' || zipfile || ' ' || whereto ;

putlog "NOTE-Processing command " cmd ;

call system( cmd ) ;

run ;

PROC IMPORT DATAFILE='C:\RT\fo24JAN2014bhav.csv' OUT=NIFTY DBMS=CSV REPLACE ;

RUN;

LOG......

2083

2084  FILENAME NIFDATA URL "%STR(http://nseindia.com/content/historical/DERIVATIVES/2014/JAN/fo21JAN2014bhav.csv.zip)" DEBUG lrecl=8192;

2085

2086  data _null_ ;

2087  unzipcmd='"C:\Program Files\7-Zip\7z.exe" -min -e -o ' ;

2088  zipfile= "NIFDATA";

2089  whereto='C:\RT' ;

2090  cmd=unzipcmd || ' ' || zipfile || ' ' || whereto ;

2091  putlog "NOTE-Processing command " cmd ;

2092  call system( cmd ) ;

2093  run ;

     Processing command "C:\Program Files\7-Zip\7z.exe" -min -e -o  NIFDATA C:\RT

NOTE: DATA statement used (Total process time):

      real time           0.32 seconds

      cpu time            0.07 seconds

2094

2095

2096  PROC IMPORT DATAFILE='c:\rt\fo24JAN2014bhav.csv' OUT=NIFTY11 DBMS=csv REPLACE ;

2097

2098  RUN;

ERROR: Physical file does not exist, c:\rt\fo24JAN2014bhav.csv.

NOTE: Import Cancelled.

NOTE: The SAS System stopped processing this step because of errors.

NOTE: PROCEDURE IMPORT used (Total process time):

      real time           0.15 seconds

      cpu time            0.15 seconds

Thanks..

Ravinder Kumar

Tom
Super User Tom
Super User

7z.exe is not going to be able to see a fileref that you have defined in SAS. 

You will need to add a step to copy the file from the URL to a physical file on your computer that 7z.exe can then read.

econ
Quartz | Level 8

Hi Tom,

I'm also trying to download a zip file from the web and then read it into EG. The download works but the file size is slightly smaller than what is listed on the website. There are 2 text files in the download.

This is the code I'm using to download the file.

options mlogic sslcalistloc="/ma/scott/fhlmc/fhlmc.cert"; /* location of certificate file */

filename fhlmc url "https://freddiemac.embs.com/FLoan/Data/historical_data1_Q11999.zip" debug;

data _null_;

    infile fhlmc recfm=n;

    input;

    file "/ma/scott/fhlmc/data/historical_data1_Q11999.zip";

    put _infile_;

run;

I get an error if I manually try to unzip the file and when I try to read it directly I get this error.

SYMBOLGEN:  Macro variable LIB resolves to /ma/scott/fhlmc/data/

SYMBOLGEN:  Macro variable AS_OF_QTR_YEAR resolves to Q11999

MPRINT(ORIG):   FileName zip Saszipam "/ma/scott/fhlmc/data/historical_data1_Q11999.zip";

MPRINT(ORIG):   data origfile;

SYMBOLGEN:  Macro variable AS_OF_QTR_YEAR resolves to Q11999

MPRINT(ORIG):   infile zip(historical_data1_Q11999.txt) dsd dlm='|' missover firstobs=2 lrecl=32767;

MPRINT(ORIG):   input fico : 8. dt_first_pi : 8. flag_fthb : $1. dt_matr : 8. cd_msa : 8. mi_pct : 8. cnt_units : 8. occpy_sts :

$1. cltv : 8. dti : 8. orig_upb : 8. ltv : 8. int_rt : 8. channel : $1. ppmt_pnlty : $1. prod_type : $5. st : $2. prop_type : $2.

zipcode : 8. id_loan : $16. loan_purpose : $5. orig_loan_term : 8. cnt_borr : $2. seller_name : $30. servicer_name : $30. ;

MPRINT(ORIG):   run;

2                                                          The SAS System                                09:07 Monday, June 16, 2014

ERROR: The central directory is invalid in /ma/scott/fhlmc/data/historical_data1_Q11999.zip.

WARNING: End of file.

NOTE: The SAS System stopped processing this step because of errors.

WARNING: The data set WORK.ORIGFILE may be incomplete.  When this step was stopped there were 0 observations and 25 variables.

WARNING: Data set WORK.ORIGFILE was not replaced because this step was stopped.

NOTE: DATA statement used (Total process time):

      real time           0.00 seconds

      cpu time            0.00 seconds

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 12 replies
  • 10778 views
  • 10 likes
  • 4 in conversation