BookmarkSubscribeRSS Feed
France
Quartz | Level 8

the full email is following 

 

the reason why you have the errors is because your loading procedure did not take into account that text embedded in double quotes should be considered as 1 data item.

If you don’t do that, then any comma in one of the text or string fields will create a new column and the lining is gone.

 

Here is an example from your table; observe that all “ have been removed --> import will not work correctly because appln_nr_original contained a commas.

 

appln_id

appln_auth

appln_nr

appln_kind

appln_filing_date

appln_filing_year

appln_nr_epodoc

appln_nr_original

ipr_type

3897853

CA

301

A

21-08-1970

1970

CA19700000301

091,301,

PI

 

 

The column appln_nr_original is defined as 100 characters, so it could as well be number as any other character.  In this case there are 2 commas in the text.  If your tool does not take double quotes into account, then it will create extra columns and alignment will be gone.

The original data on the delivered text files should look like this:

 

0521133,"DE","3062390","A ",1980-02-22,1980,"DE19803062390","80  80400253","PI",0,N,Y,Y,1979-03-02,1979

 

Observe the double quotes being used!  Each string / text value is enclosed in double quotes, like: "Smith, John", “”900,00O”  etc....

Within string values there are never double quotes; if by any reason, the original text contained double quotes (for example in the application abstracts), then they will have been replaced by a single quote.

 

thus, I think I may make some mistakes.

Tom
Super User Tom
Super User

Did you process the file they sent in any way before trying to read it with SAS?

If not then they didn't send you what they thought they did.

If you did pre-process the file then attempt to read the original file with SAS and see if that fixes your issues.

JohnHoughton
Quartz | Level 8

To help things along I've registered for patstat and run the following query

 

Select * from tls201_appln where appln_id in (3574782,3897835)
UNION
SELECT TOP 10 * from tls201_appln where appln_nr_original like '%,%' and appln_nr_original not like '%,'
UNION
SELECT TOP 10 * from tls201_appln where appln_nr_original not like '%,%'

 

And attached the csv file ... and yes it does contain quotes! (and it's delimited with semi-colons)

 

 

Tom
Super User Tom
Super User

I had to download that file to look at it because this **forum** displays it as if it was a spreadsheet instead of showing the actual text.

 

The file is delimited with semi-colons.  It does have quotes.

11    options ls=132 ;
12    data _null_;
13      infile 'c:\\downloads\resulttable.csv' obs=3 ;
14      input;
15      list;
16    run;

NOTE: A byte-order mark in the file "c:\\downloads\resulttable.csv" (for fileref "#LN00045") indicates that the
      data is encoded in "utf-8".  This encoding will be used to process the file.
NOTE: The infile 'c:\downloads\resulttable.csv' is:
      Filename=c:\downloads\resulttable.csv,
      RECFM=V,LRECL=131068,File Size (bytes)=3803,
      Last Modified=27Mar2018:15:26:32,
      Create Time=27Mar2018:15:26:32

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1         appln_id;appln_auth;appln_nr;appln_kind;appln_filing_date;appln_filing_year;appln_nr_epodoc;appln_nr
     101  _original;ipr_type;internat_appln_id;int_phase;reg_phase;nat_phase;earliest_filing_date;earliest_fil
     201  ing_year;earliest_filing_id;earliest_publn_date;earliest_publn_year;earliest_pat_publn_id;granted;do
     301  cdb_family_id;inpadoc_family_id;docdb_family_size;nb_citing_docdb_fam;nb_applicants;nb_inventors 396
2         0;"XX";"";"D ";9999-12-31;9999;"";"";"PI";0;"N";"N";"N";9999-12-31;9999;0;9999-12-31;9999;0;0;0;0;1;
     101  0;0;0 105
3         1;"EP";"00103094";"A ";2000-02-15;2000;"EP20000103094";"00103094";"PI";0;"N";"Y";"N";1999-03-11;1999
     101  ;18546269;2000-09-20;2000;293253293;1;8554171;1;6;72;1;4 156
NOTE: 3 records were read from the infile 'c:\\downloads\resulttable.csv'.

It actually has more quotes than it needs.  There is no need to put quotes around EP or 00103094.  Perhaps the software that generated thinks the quotes are somehow an indication that the data value is supposed to be interpreted as text? 

 

The extra quotes do not cause an issue.  

 

Another strange thing is the inclusion of "" as the value of the third column on the first data row.  Normally you would just represent an empty value as two delimiters next to each other.

 

Again that should not cause any trouble.

 

Another strange thing is the fourth column.  Why is there a space after the D and A?   The extra spaces will be ignored. But perhaps PROC IMPORT might use them to decide that the fourth column should be defined with a length of $2 instead of $1.

 

Again spaces should not cause an issue.

 

 

Tom
Super User Tom
Super User

Here are the lines from that 22 record sample that contain commas.

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
12        3574782;"BR";"7600215";"A ";1976-04-08;1976;"BR19767600215";"760215,";"PI";0;"N";"N";"Y";1975-04-14;
     101  1975;41149045;1976-10-05;1976;312739797;0;19727905;327506;29;56;1;2 167
13        3897830;"CA";"133";"A ";1972-01-26;1972;"CA19720000133";"133,              *7";"PI";0;"N";"N";"Y";19
     101  72-01-26;1972;3897830;1974-02-12;1974;313964820;1;4083809;13098528;1;0;1;1 174
14        3897835;"CA";"182";"A ";1967-09-15;1967;"CA19670000182";"000,182";"PI";0;"N";"N";"Y";1966-09-16;1966
     101  ;22485460;1975-03-18;1975;315362359;1;10420177;1822558;14;0;1;1 163
15        3897838;"CA";"200";"A ";1974-05-16;1974;"CA19740000200";"200,10";"PI";0;"N";"N";"Y";1973-07-23;1973;
     101  50446315;1977-02-08;1977;313897066;1;23506530;1952392;1;0;1;1 161
16        3897856;"CA";"325";"A ";1967-09-18;1967;"CA19670000325";"000,325";"PI";0;"N";"N";"Y";1966-09-21;1966
     101  ;901804757;1976-08-24;1976;312974165;1;6939039;1822628;7;21;1;3 163
17        3897859;"CA";"651";"A ";1967-09-21;1967;"CA19670000651";"000,651";"PI";0;"N";"N";"Y";1966-10-03;1966
     101  ;905428887;1974-02-26;1974;314781130;1;24334948;3777769;5;6;1;1 163
18        3897860;"CA";"708";"A ";1967-09-22;1967;"CA19670000708";"000,708";"PI";0;"N";"N";"Y";1966-10-06;1966
     101  ;905433308;1974-05-14;1974;314549444;1;24338839;1952402;2;4;1;2 163
19        3897865;"CA";"1262";"A ";1967-09-29;1967;"CA19670001262";"001,262";"PI";0;"N";"N";"Y";1966-10-04;196
     101  6;905429980;1975-06-17;1975;313423709;1;24336185;1823098;3;0;1;1 164
20        3897866;"CA";"1390";"A ";1967-09-30;1967;"CA19670001390";"001,390";"PI";0;"N";"N";"Y";1966-10-01;196
     101  6;901833818;1975-01-21;1975;315245242;1;7103726;546654;15;1;1;2 163
21        3897867;"CA";"1463";"A ";1967-10-02;1967;"CA19670001463";"001,463";"PI";0;"N";"N";"Y";1966-12-05;196
     101  6;905453651;1977-01-25;1977;314775599;1;27083204;1825303;7;7;1;1 164
22        3897869;"CA";"1638";"A ";1967-10-04;1967;"CA19670001638";"001,638";"PI";0;"N";"N";"Y";1966-10-17;196
     101  6;905431461;1974-01-08;1974;314966616;1;24348073;1823471;12;20;1;1 166

Even though the use of semi-colons as the delimiter means the values with commas do not need the quotes.

Reeza
Super User

Can you attach a small sample of your file, the first 30 records. If it's confidential can you email it to one of us to test?

JohnHoughton
Quartz | Level 8

Hi @France.

Go back to the code you put in the very first post in this thread and change it to use a semi-colon delimiter,  DLM=";".

This worked for me with the sample table I downloaded from patstat (and attached to earlier message)

Reeza
Super User

As someone on here helpfully pointed out to me at one point, CSV in Europe typically uses a semicolon whereas in North America it's comma. 

France
Quartz | Level 8

thanks for your help, but the variable in my data is seperated by commas. 

JohnHoughton
Quartz | Level 8

Ok , so you aren't getting it from Patstat online but on datasets that have been supplied possibly on a memory stick.

 

Try Using Encoding="unicode" option in Infile statement (as in this thread)

 

 

I've copied below from Patstat documentation which describes the data that I think you are using

2.1. Data files

You will find the data files in the folder Data. For every table there is one or more zipped data file(s). To keep all files (if unzipped) smaller than 2 GB, the data of the large tables are split into multiple files. Sometimes users report problems unzipping some files when using WinZip. In these cases it is advisable to try a different unpacking tool like the open source tool 7-zip. After unzipping, each file contains one header record comprising the column names of the table, followed by multiple data records.

 All characters are in Unicode with UTF-8 encoding

 The files are in MS-DOS format (i.e. each line ends with CR/LF)

 Each value is delimited by a comma ","

 Each string / text value is enclosed in double quotes, like: "Smith, John". Within string values there are never double quotes.

 All non-text values (numbers, dates) are not enclosed in quotes, like: 123, 2014-12-31  The decimal separator is the point ".", e.g. 0.125

 Line breaking characters (LF, CR) are replaced by " \n ". They occur most frequently in the abstract text and the NPL bibliographic text.

 Except "\n", no other characters are escaped.

JohnHoughton
Quartz | Level 8

Hi @France,  So did this eventually work when you used the semicolon delimiter? 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 40 replies
  • 3051 views
  • 8 likes
  • 5 in conversation