BookmarkSubscribeRSS Feed
jl2978
Fluorite | Level 6

I'm trying to import a TSV format data using the following code

 

proc import datafile = 'E:\path\file.tsv'
  out = out1
  dbms = tab
  replace;
guessingrows=32767;
run; 
 
I also tried this:
 
proc import datafile = 'E:\path\file.tsv'
  out = out1
  dbms = dlm
  replace;
guessingrows=32767;
delimiter='09'x;
run; 
 
However, when there's a pair of double quotes in a string for a variable, SAS reports errors, and the resulting lines are incorrect. 
 
e.g, the value of APPLE IPAD COMPUTER SYS APPLE IPAD COMPUTER SYS "AIR/TABLET PC "
will only be imported as APPLE IPAD COMPUTER SYS APPLE IPAD COMPUTER SYS "AIR/TABLET PC.
the value for the next variables will be .
 
The data:
4434454 APPLE IPAD COMPUTER SYS APPLE IPAD COMPUTER SYS "AIR/TABLET PC " 1330622 2467340 B311 356
 
[there's a space then double quotes after "PC"]
 
SAS import results:
4434454 APPLE IPAD COMPUTER SYS APPLE IPAD COMPUTER SYS "AIR/TABLET PC  . 1330622 2467 .
 
Any way to solve this? Thanks!  
 
 
7 REPLIES 7
Tom
Super User Tom
Super User

To make your testing easier take PROC IMPORT out of the loop.  Just read the data using a data step to check.  For example if your file has 15 columns then just try something like:

data test;
  infile  'E:\path\file.tsv' dsd dlm='09'x truncover firstobs=2;
  input (var1-var15) (:$50.);
run;

I think that you are describing a known issue, but please provide a workable example.  For example take your problem line and convert the tabs to some other visible character and try reading it using in-line data.

data test;
  infile cards dsd dlm='|' truncover ;
  input (var1-var15) (:$20.);
  put (var1-var15) (=/);
cards4;
4434454|APPLE|IPAD|COMPUTER|SYS|APPLE|IPAD|COMPUTER|SYS|"AIR/TABLET|PC "|1330622|2467340|B311|356
;;;;

Results:

var1=4434454
var2=APPLE
var3=IPAD
var4=COMPUTER
var5=SYS
var6=APPLE
var7=IPAD
var8=COMPUTER
var9=SYS
var10=AIR/TABLET|PC
var11=1330622
var12=2467340
var13=B311
var14=356
var15=

The known issue is that when you have one field that starts with a quote and a later field that ends with a quote that SAS will see that as indicating one field that has the delimiter as actual data.

To fix the source file so that it does not cause this trouble you need to quote values that have the quote character (in addition to quoting values that have the delimiter). 

4434454|APPLE|IPAD|COMPUTER|SYS|APPLE|IPAD|COMPUTER|SYS|"""AIR/TABLET"|"PC """|1330622|2467340|B311|356

Check whether the creator of this file can re-create it following that rule.

 

You can try reading it without the DSD option.

74   data test;
75     infile cards dlm='|' truncover ;
76     input (var1-var15) (:$20.);
77     put (var1-var15) (=/);
78   cards4;


var1=4434454
var2=APPLE
var3=IPAD
var4=COMPUTER
var5=SYS
var6=APPLE
var7=IPAD
var8=COMPUTER
var9=SYS
var10="AIR/TABLET
var11=PC "
var12=1330622
var13=2467340
var14=B311
var15=356

This will only work if you never have the situation where there is an empty field that results in two adjacent tabs in the data.  You could try fixing that on the fly by modifying the line of data by using the _INFILE_ automatic variable.

data test;
  infile cards dlm='|' truncover ;
  input @;
  _infile_=tranwrd(_infile_,'||','|.|');
  input (var1-var15) (:$20.);
  put (var1-var15) (=/);
cards4;
4434454|APPLE||COMPUTER|SYS||IPAD||SYS|"AIR/TABLET|PC "|1330622|2467340|B311|356
;;;;
var1=4434454
var2=APPLE
var3=
var4=COMPUTER
var5=SYS
var6=
var7=IPAD
var8=
var9=SYS
var10="AIR/TABLET
var11=PC "
var12=1330622
var13=2467340
var14=B311
var15=

If you want to use PROC IMPORT and allow it to guess how to read the data instead of writing your own data step to read the data you could code the fix step to create a new file and then pass that to PROC IMPORT.

filename fixed temp;
data _null_;
  infile 'E:\path\file.tsv' ;
  file temp;
  input ;
  _infile_=tranwrd(_infile_,'0909'x,'09'x||'.'||'09'x);
  put _infile_;
run;
proc import datafile=fixed ....

 

jl2978
Fluorite | Level 6
Thank you so much for such a detailed reply!
Sorry I wasn't clear about the variables before.
APPLE IPAD COMPUTER SYS "AIR/TABLET PC " is actually the value of one variable. But what you suggest to do looks like a good idea!
I have a new reply below that shows more information about the data structure.
ballardw
Super User

Copy the entire line of the data. On the forum open a code box with the </>, paste the line. Better would be to include a couple lines before and a couple after.

The message boxes on this forum remove/replace characters and so we cannot tell the actual content of your data otherwise.

 

You may also want to copy the generated data step that Proc import creates to read the data an paste that into a separate code box.

I have a suspicion that the "there's a space then double quotes after "PC" " may actually be another tab that only appears to be a single space.

The format that you pasted the 'SAS data' does not tell us where any variable starts or ends and is again likely corrupted by the forum reformatting. That period that appears after PC is indicative of a missing numeric variable, which could occur if there were a tab meaning the Quote character after that "space" was attempted to be read into a numeric and would be missing as invalid numeric data.

jl2978
Fluorite | Level 6
Thanks! I have a new post below to show the information.

I don't think there's a tab instead of space because if I first transfer the data into CSV format then SAS can import perfectly. But I have other datasets that are much larger I can't even easily open them and transfer as CSV format.
jl2978
Fluorite | Level 6

 

More information

 

The original data.

4434424	STANTON RENAISSANCE BUILDERS	STANTON RENAISSANCE BUILDERS 	2325574	2325574	B111	193
4434454	APPLE IPAD COMPUTER SYS	APPLE IPAD COMPUTER SYS "AIR/TABLET PC	"	1330622	2467340	B311	356
4434573	241 PIZZA RESTAURANT	241 PIZZA RESTAURANT FESTIVE FAVOURITE	1541372	2467232	G330	1253

I'm using "|" to indicate separate variables. 

4434424|STANTON RENAISSANCE BUILDERS|STANTON RENAISSANCE BUILDERS|2325574|2325574|B111|193
4434454|APPLE IPAD COMPUTER SYS|APPLE IPAD COMPUTER SYS "AIR/TABLET PC	"|1330622|2467340|B311|356
4434573|241 PIZZA RESTAURANT|241 PIZZA RESTAURANT FESTIVE FAVOURITE|1541372|2467232|G330|1253

The generated data step:

1070      data WORK.BRAND    ;
1071      %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
1072      infile 'E:\path\file.tsv' delimiter='09'x MISSOVER
1072! DSD lrecl=32767 firstobs=2 ;
1073         informat v1 best32. ;
1074         informat v2 $80. ;
1075         informat v3 $91. ;
1076         informat v4 best32. ;
1077         informat v5 best32. ;
1078         informat v6 $4. ;
1079         informat v7 best32. ;
1080         format v1 best12. ;
1081         format v2 $80. ;
1082         format v3 $91. ;
1083         format v4 best12. ;
1084         format v5 best12. ;
1085         format v6 $4. ;
1086         format v7 best12. ;
1087      input
1088                  v1
1089                  v2  $
1090                  v3  $
1091                  v4
1092                  v5
1093                  v6  $
1094                  v7
1095      ;
1096      if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection macro variable */
1097      run;

NOTE: The infile 'E:\path\file.tsv' is:
      Filename=E:\path\file.tsv,
      RECFM=V,LRECL=32767,
      File Size (bytes)=340331123,
      Last Modified=02Apr2020:11:41:49,
      Create Time=13Jul2020:09:45:36

NOTE: Invalid data for v4 in line 3064481 72-72.
NOTE: Invalid data for v7 in line 3064481 90-93.
RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9--

3064481   4434454.APPLE IPAD COMPUTER SYS.APPLE IPAD COMPUTER SYS "AIR/TABLET PC.".1330622.2467340.B31
    ZONE  33333330455442454424445554525550455442454424445554525552244525444452540203333333033333330433
    NUMR  44344549100C50901403FD0545203939100C50901403FD05452039302192F412C540039291330622924673409231

      93  1.356 97
    ZONE  30333
    NUMR  19356
v1=4434454 v2=APPLE IPAD COMPUTER SYS
v3=APPLE IPAD COMPUTER SYS "AIR/TABLET PC v4=. v5=1330622
v6=2467 v7=. _ERROR_=1 _N_=3064480

If I first transfer the TSV into CSV format data, then SAS can import the csv data without errors. 

 

ballardw
Super User

When you have diagnostics like

3064481   4434454.APPLE IPAD COMPUTER SYS.APPLE IPAD COMPUTER SYS "AIR/TABLET PC.".1330622.2467340.B31
    ZONE  33333330455442454424445554525550455442454424445554525552244525444452540203333333033333330433
    NUMR  44344549100C50901403FD0545203939100C50901403FD05452039302192F412C540039291330622924673409231

where you see the Zone = 0 over NUMR=9 there is a TAB character. So you have a tab between the PC and ", as I suspected.

 

Tom
Super User Tom
Super User

The character after PC is NOT a space, but another TAB.  So your attempt to replace the tabs with pipes is wrong since you didn't replace that one.

 

So the error with your file is a different error than what I said before.  The error is that when a value contains the delimiter then the value must be enclosed in quotes in the delimited file.  It does not matter whether the delimiter in the value already is quoted in the value itself like in your example (although any existing quote characters will need to be doubled up).

 

So the value

APPLE IPAD COMPUTER SYS "AIR/TABLET PC|"

should appear as 

"APPLE IPAD COMPUTER SYS ""AIR/TABLET PC|"""

or have the tab changed to a space

APPLE IPAD COMPUTER SYS "AIR/TABLET PC "

to work as a valid value in a delimited file.

 

Most likely the only reason that using CSV instead of tab delimited worked for you is that none of your actual data contains commas while at least this one value does contain a tab.

 

What is the source of this file? Is it being written by some software? Commercial/public software like a database system? Or custom code?  In either case request that they produce valid delimited files.  If the source system does not produce valid delimited files then they cannot be interpreted by anyone.  It you are using a modern software it should be able to know when to add the quotes around the values.  If not you might be able to ask it to add quotes around everything.  Although that will confusing PROC IMPORT into thinking that everything is a character variable.

 

You could try counting the quotes on the line and replacing any tabs inside the quotes with spaces, but I am not sure that you can be positive that every embedded tab in your data will actually appear inside of quotes in the value.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1017 views
  • 3 likes
  • 3 in conversation