Re: how to import a large .csv file into SAS if it includes the 'extra... - Page 3

France · Posted 03-27-2018 02:52 PM

the full email is following

the reason why you have the errors is because your loading procedure did not take into account that text embedded in double quotes should be considered as 1 data item.

If you don’t do that, then any comma in one of the text or string fields will create a new column and the lining is gone.

Here is an example from your table; observe that all “ have been removed --> import will not work correctly because appln_nr_original contained a commas.

appln_id	appln_auth	appln_nr	appln_kind	appln_filing_date	appln_filing_year	appln_nr_epodoc	appln_nr_original	ipr_type
3897853	CA	301	A	21-08-1970	1970	CA19700000301	091,301,	PI

The column appln_nr_original is defined as 100 characters, so it could as well be number as any other character. In this case there are 2 commas in the text. If your tool does not take double quotes into account, then it will create extra columns and alignment will be gone.

The original data on the delivered text files should look like this:

0521133,"DE","3062390","A ",1980-02-22,1980,"DE19803062390","80 80400253","PI",0,N,Y,Y,1979-03-02,1979

Observe the double quotes being used! Each string / text value is enclosed in double quotes, like: "Smith, John", “”900,00O” etc....

Within string values there are never double quotes; if by any reason, the original text contained double quotes (for example in the application abstracts), then they will have been replaced by a single quote.

thus, I think I may make some mistakes.

Tom · Posted 03-27-2018 03:13 PM

Did you process the file they sent in any way before trying to read it with SAS?

If not then they didn't send you what they thought they did.

If you did pre-process the file then attempt to read the original file with SAS and see if that fixes your issues.

JohnHoughton · Posted 03-27-2018 03:22 PM

To help things along I've registered for patstat and run the following query

Select * from tls201_appln where appln_id in (3574782,3897835)
UNION
SELECT TOP 10 * from tls201_appln where appln_nr_original like '%,%' and appln_nr_original not like '%,'
UNION
SELECT TOP 10 * from tls201_appln where appln_nr_original not like '%,%'

And attached the csv file ... and yes it does contain quotes! (and it's delimited with semi-colons)

Tom · Posted 03-27-2018 03:40 PM

I had to download that file to look at it because this **forum** displays it as if it was a spreadsheet instead of showing the actual text.

The file is delimited with semi-colons. It does have quotes.

11    options ls=132 ;
12    data _null_;
13      infile 'c:\\downloads\resulttable.csv' obs=3 ;
14      input;
15      list;
16    run;

NOTE: A byte-order mark in the file "c:\\downloads\resulttable.csv" (for fileref "#LN00045") indicates that the
      data is encoded in "utf-8".  This encoding will be used to process the file.
NOTE: The infile 'c:\downloads\resulttable.csv' is:
      Filename=c:\downloads\resulttable.csv,
      RECFM=V,LRECL=131068,File Size (bytes)=3803,
      Last Modified=27Mar2018:15:26:32,
      Create Time=27Mar2018:15:26:32

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1         appln_id;appln_auth;appln_nr;appln_kind;appln_filing_date;appln_filing_year;appln_nr_epodoc;appln_nr
     101  _original;ipr_type;internat_appln_id;int_phase;reg_phase;nat_phase;earliest_filing_date;earliest_fil
     201  ing_year;earliest_filing_id;earliest_publn_date;earliest_publn_year;earliest_pat_publn_id;granted;do
     301  cdb_family_id;inpadoc_family_id;docdb_family_size;nb_citing_docdb_fam;nb_applicants;nb_inventors 396
2         0;"XX";"";"D ";9999-12-31;9999;"";"";"PI";0;"N";"N";"N";9999-12-31;9999;0;9999-12-31;9999;0;0;0;0;1;
     101  0;0;0 105
3         1;"EP";"00103094";"A ";2000-02-15;2000;"EP20000103094";"00103094";"PI";0;"N";"Y";"N";1999-03-11;1999
     101  ;18546269;2000-09-20;2000;293253293;1;8554171;1;6;72;1;4 156
NOTE: 3 records were read from the infile 'c:\\downloads\resulttable.csv'.

It actually has more quotes than it needs. There is no need to put quotes around EP or 00103094. Perhaps the software that generated thinks the quotes are somehow an indication that the data value is supposed to be interpreted as text?

The extra quotes do not cause an issue.

Another strange thing is the inclusion of "" as the value of the third column on the first data row. Normally you would just represent an empty value as two delimiters next to each other.

Again that should not cause any trouble.

Another strange thing is the fourth column. Why is there a space after the D and A? The extra spaces will be ignored. But perhaps PROC IMPORT might use them to decide that the fourth column should be defined with a length of $2 instead of $1.

Again spaces should not cause an issue.

Tom · Posted 03-27-2018 03:50 PM

Here are the lines from that 22 record sample that contain commas.

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
12        3574782;"BR";"7600215";"A ";1976-04-08;1976;"BR19767600215";"760215,";"PI";0;"N";"N";"Y";1975-04-14;
     101  1975;41149045;1976-10-05;1976;312739797;0;19727905;327506;29;56;1;2 167
13        3897830;"CA";"133";"A ";1972-01-26;1972;"CA19720000133";"133,              *7";"PI";0;"N";"N";"Y";19
     101  72-01-26;1972;3897830;1974-02-12;1974;313964820;1;4083809;13098528;1;0;1;1 174
14        3897835;"CA";"182";"A ";1967-09-15;1967;"CA19670000182";"000,182";"PI";0;"N";"N";"Y";1966-09-16;1966
     101  ;22485460;1975-03-18;1975;315362359;1;10420177;1822558;14;0;1;1 163
15        3897838;"CA";"200";"A ";1974-05-16;1974;"CA19740000200";"200,10";"PI";0;"N";"N";"Y";1973-07-23;1973;
     101  50446315;1977-02-08;1977;313897066;1;23506530;1952392;1;0;1;1 161
16        3897856;"CA";"325";"A ";1967-09-18;1967;"CA19670000325";"000,325";"PI";0;"N";"N";"Y";1966-09-21;1966
     101  ;901804757;1976-08-24;1976;312974165;1;6939039;1822628;7;21;1;3 163
17        3897859;"CA";"651";"A ";1967-09-21;1967;"CA19670000651";"000,651";"PI";0;"N";"N";"Y";1966-10-03;1966
     101  ;905428887;1974-02-26;1974;314781130;1;24334948;3777769;5;6;1;1 163
18        3897860;"CA";"708";"A ";1967-09-22;1967;"CA19670000708";"000,708";"PI";0;"N";"N";"Y";1966-10-06;1966
     101  ;905433308;1974-05-14;1974;314549444;1;24338839;1952402;2;4;1;2 163
19        3897865;"CA";"1262";"A ";1967-09-29;1967;"CA19670001262";"001,262";"PI";0;"N";"N";"Y";1966-10-04;196
     101  6;905429980;1975-06-17;1975;313423709;1;24336185;1823098;3;0;1;1 164
20        3897866;"CA";"1390";"A ";1967-09-30;1967;"CA19670001390";"001,390";"PI";0;"N";"N";"Y";1966-10-01;196
     101  6;901833818;1975-01-21;1975;315245242;1;7103726;546654;15;1;1;2 163
21        3897867;"CA";"1463";"A ";1967-10-02;1967;"CA19670001463";"001,463";"PI";0;"N";"N";"Y";1966-12-05;196
     101  6;905453651;1977-01-25;1977;314775599;1;27083204;1825303;7;7;1;1 164
22        3897869;"CA";"1638";"A ";1967-10-04;1967;"CA19670001638";"001,638";"PI";0;"N";"N";"Y";1966-10-17;196
     101  6;905431461;1974-01-08;1974;314966616;1;24348073;1823471;12;20;1;1 166

Even though the use of semi-colons as the delimiter means the values with commas do not need the quotes.

Reeza · Posted 03-27-2018 03:18 PM

Can you attach a small sample of your file, the first 30 records. If it's confidential can you email it to one of us to test?

JohnHoughton · Posted 03-27-2018 04:20 PM

Hi @France.

Go back to the code you put in the very first post in this thread and change it to use a semi-colon delimiter, DLM=";".

This worked for me with the sample table I downloaded from patstat (and attached to earlier message)

Reeza · Posted 03-27-2018 06:30 PM

As someone on here helpfully pointed out to me at one point, CSV in Europe typically uses a semicolon whereas in North America it's comma.

France · Posted 03-29-2018 05:41 AM

thanks for your help, but the variable in my data is seperated by commas.

JohnHoughton · Posted 03-29-2018 08:09 AM

Ok , so you aren't getting it from Patstat online but on datasets that have been supplied possibly on a memory stick.

Try Using Encoding="unicode" option in Infile statement (as in this thread)

I've copied below from Patstat documentation which describes the data that I think you are using

2.1. Data files
You will find the data files in the folder Data. For every table there is one or more zipped data file(s). To keep all files (if unzipped) smaller than 2 GB, the data of the large tables are split into multiple files. Sometimes users report problems unzipping some files when using WinZip. In these cases it is advisable to try a different unpacking tool like the open source tool 7-zip. After unzipping, each file contains one header record comprising the column names of the table, followed by multiple data records.
 All characters are in Unicode with UTF-8 encoding
 The files are in MS-DOS format (i.e. each line ends with CR/LF)
 Each value is delimited by a comma ","
 Each string / text value is enclosed in double quotes, like: "Smith, John". Within string values there are never double quotes.
 All non-text values (numbers, dates) are not enclosed in quotes, like: 123, 2014-12-31  The decimal separator is the point ".", e.g. 0.125
 Line breaking characters (LF, CR) are replaced by " \n ". They occur most frequently in the abstract text and the NPL bibliographic text.
 Except "\n", no other characters are escaped.

JohnHoughton · Posted 05-12-2018 12:03 PM

Hi @France, So did this eventually work when you used the semicolon delimiter?

Re: how to import a large .csv file into SAS if it includes the 'extra' comma?

Re: how to import a large .csv file into SAS if it includes the 'extra' comma?

Re: how to import a large .csv file into SAS if it includes the 'extra' comma?

Re: how to import a large .csv file into SAS if it includes the 'extra' comma?

Re: how to import a large .csv file into SAS if it includes the 'extra' comma?

Re: how to import a large .csv file into SAS if it includes the 'extra' comma?

Re: how to import a large .csv file into SAS if it includes the 'extra' comma?

Re: how to import a large .csv file into SAS if it includes the 'extra' comma?

Re: how to import a large .csv file into SAS if it includes the 'extra' comma?

Re: how to import a large .csv file into SAS if it includes the 'extra' comma?

Re: how to import a large .csv file into SAS if it includes the 'extra' comma?

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away