I'm trying to import a TSV format data using the following code
To make your testing easier take PROC IMPORT out of the loop. Just read the data using a data step to check. For example if your file has 15 columns then just try something like:
data test;
infile 'E:\path\file.tsv' dsd dlm='09'x truncover firstobs=2;
input (var1-var15) (:$50.);
run;
I think that you are describing a known issue, but please provide a workable example. For example take your problem line and convert the tabs to some other visible character and try reading it using in-line data.
data test;
infile cards dsd dlm='|' truncover ;
input (var1-var15) (:$20.);
put (var1-var15) (=/);
cards4;
4434454|APPLE|IPAD|COMPUTER|SYS|APPLE|IPAD|COMPUTER|SYS|"AIR/TABLET|PC "|1330622|2467340|B311|356
;;;;
Results:
var1=4434454 var2=APPLE var3=IPAD var4=COMPUTER var5=SYS var6=APPLE var7=IPAD var8=COMPUTER var9=SYS var10=AIR/TABLET|PC var11=1330622 var12=2467340 var13=B311 var14=356 var15=
The known issue is that when you have one field that starts with a quote and a later field that ends with a quote that SAS will see that as indicating one field that has the delimiter as actual data.
To fix the source file so that it does not cause this trouble you need to quote values that have the quote character (in addition to quoting values that have the delimiter).
4434454|APPLE|IPAD|COMPUTER|SYS|APPLE|IPAD|COMPUTER|SYS|"""AIR/TABLET"|"PC """|1330622|2467340|B311|356
Check whether the creator of this file can re-create it following that rule.
You can try reading it without the DSD option.
74 data test; 75 infile cards dlm='|' truncover ; 76 input (var1-var15) (:$20.); 77 put (var1-var15) (=/); 78 cards4; var1=4434454 var2=APPLE var3=IPAD var4=COMPUTER var5=SYS var6=APPLE var7=IPAD var8=COMPUTER var9=SYS var10="AIR/TABLET var11=PC " var12=1330622 var13=2467340 var14=B311 var15=356
This will only work if you never have the situation where there is an empty field that results in two adjacent tabs in the data. You could try fixing that on the fly by modifying the line of data by using the _INFILE_ automatic variable.
data test;
infile cards dlm='|' truncover ;
input @;
_infile_=tranwrd(_infile_,'||','|.|');
input (var1-var15) (:$20.);
put (var1-var15) (=/);
cards4;
4434454|APPLE||COMPUTER|SYS||IPAD||SYS|"AIR/TABLET|PC "|1330622|2467340|B311|356
;;;;
var1=4434454 var2=APPLE var3= var4=COMPUTER var5=SYS var6= var7=IPAD var8= var9=SYS var10="AIR/TABLET var11=PC " var12=1330622 var13=2467340 var14=B311 var15=
If you want to use PROC IMPORT and allow it to guess how to read the data instead of writing your own data step to read the data you could code the fix step to create a new file and then pass that to PROC IMPORT.
filename fixed temp;
data _null_;
infile 'E:\path\file.tsv' ;
file temp;
input ;
_infile_=tranwrd(_infile_,'0909'x,'09'x||'.'||'09'x);
put _infile_;
run;
proc import datafile=fixed ....
Copy the entire line of the data. On the forum open a code box with the </>, paste the line. Better would be to include a couple lines before and a couple after.
The message boxes on this forum remove/replace characters and so we cannot tell the actual content of your data otherwise.
You may also want to copy the generated data step that Proc import creates to read the data an paste that into a separate code box.
I have a suspicion that the "there's a space then double quotes after "PC" " may actually be another tab that only appears to be a single space.
The format that you pasted the 'SAS data' does not tell us where any variable starts or ends and is again likely corrupted by the forum reformatting. That period that appears after PC is indicative of a missing numeric variable, which could occur if there were a tab meaning the Quote character after that "space" was attempted to be read into a numeric and would be missing as invalid numeric data.
More information
The original data.
4434424 STANTON RENAISSANCE BUILDERS STANTON RENAISSANCE BUILDERS 2325574 2325574 B111 193
4434454 APPLE IPAD COMPUTER SYS APPLE IPAD COMPUTER SYS "AIR/TABLET PC " 1330622 2467340 B311 356
4434573 241 PIZZA RESTAURANT 241 PIZZA RESTAURANT FESTIVE FAVOURITE 1541372 2467232 G330 1253
I'm using "|" to indicate separate variables.
4434424|STANTON RENAISSANCE BUILDERS|STANTON RENAISSANCE BUILDERS|2325574|2325574|B111|193
4434454|APPLE IPAD COMPUTER SYS|APPLE IPAD COMPUTER SYS "AIR/TABLET PC "|1330622|2467340|B311|356
4434573|241 PIZZA RESTAURANT|241 PIZZA RESTAURANT FESTIVE FAVOURITE|1541372|2467232|G330|1253
The generated data step:
1070 data WORK.BRAND ;
1071 %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
1072 infile 'E:\path\file.tsv' delimiter='09'x MISSOVER
1072! DSD lrecl=32767 firstobs=2 ;
1073 informat v1 best32. ;
1074 informat v2 $80. ;
1075 informat v3 $91. ;
1076 informat v4 best32. ;
1077 informat v5 best32. ;
1078 informat v6 $4. ;
1079 informat v7 best32. ;
1080 format v1 best12. ;
1081 format v2 $80. ;
1082 format v3 $91. ;
1083 format v4 best12. ;
1084 format v5 best12. ;
1085 format v6 $4. ;
1086 format v7 best12. ;
1087 input
1088 v1
1089 v2 $
1090 v3 $
1091 v4
1092 v5
1093 v6 $
1094 v7
1095 ;
1096 if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
1097 run;
NOTE: The infile 'E:\path\file.tsv' is:
Filename=E:\path\file.tsv,
RECFM=V,LRECL=32767,
File Size (bytes)=340331123,
Last Modified=02Apr2020:11:41:49,
Create Time=13Jul2020:09:45:36
NOTE: Invalid data for v4 in line 3064481 72-72.
NOTE: Invalid data for v7 in line 3064481 90-93.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9--
3064481 4434454.APPLE IPAD COMPUTER SYS.APPLE IPAD COMPUTER SYS "AIR/TABLET PC.".1330622.2467340.B31
ZONE 33333330455442454424445554525550455442454424445554525552244525444452540203333333033333330433
NUMR 44344549100C50901403FD0545203939100C50901403FD05452039302192F412C540039291330622924673409231
93 1.356 97
ZONE 30333
NUMR 19356
v1=4434454 v2=APPLE IPAD COMPUTER SYS
v3=APPLE IPAD COMPUTER SYS "AIR/TABLET PC v4=. v5=1330622
v6=2467 v7=. _ERROR_=1 _N_=3064480
If I first transfer the TSV into CSV format data, then SAS can import the csv data without errors.
When you have diagnostics like
3064481 4434454.APPLE IPAD COMPUTER SYS.APPLE IPAD COMPUTER SYS "AIR/TABLET PC.".1330622.2467340.B31 ZONE 33333330455442454424445554525550455442454424445554525552244525444452540203333333033333330433 NUMR 44344549100C50901403FD0545203939100C50901403FD05452039302192F412C540039291330622924673409231
where you see the Zone = 0 over NUMR=9 there is a TAB character. So you have a tab between the PC and ", as I suspected.
The character after PC is NOT a space, but another TAB. So your attempt to replace the tabs with pipes is wrong since you didn't replace that one.
So the error with your file is a different error than what I said before. The error is that when a value contains the delimiter then the value must be enclosed in quotes in the delimited file. It does not matter whether the delimiter in the value already is quoted in the value itself like in your example (although any existing quote characters will need to be doubled up).
So the value
APPLE IPAD COMPUTER SYS "AIR/TABLET PC|"
should appear as
"APPLE IPAD COMPUTER SYS ""AIR/TABLET PC|"""
or have the tab changed to a space
APPLE IPAD COMPUTER SYS "AIR/TABLET PC "
to work as a valid value in a delimited file.
Most likely the only reason that using CSV instead of tab delimited worked for you is that none of your actual data contains commas while at least this one value does contain a tab.
What is the source of this file? Is it being written by some software? Commercial/public software like a database system? Or custom code? In either case request that they produce valid delimited files. If the source system does not produce valid delimited files then they cannot be interpreted by anyone. It you are using a modern software it should be able to know when to add the quotes around the values. If not you might be able to ask it to add quotes around everything. Although that will confusing PROC IMPORT into thinking that everything is a character variable.
You could try counting the quotes on the line and replacing any tabs inside the quotes with spaces, but I am not sure that you can be positive that every embedded tab in your data will actually appear inside of quotes in the value.
Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.
Register today!Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.