the full email is following
the reason why you have the errors is because your loading procedure did not take into account that text embedded in double quotes should be considered as 1 data item.
If you don’t do that, then any comma in one of the text or string fields will create a new column and the lining is gone.
Here is an example from your table; observe that all “ have been removed --> import will not work correctly because appln_nr_original contained a commas.
appln_id | appln_auth | appln_nr | appln_kind | appln_filing_date | appln_filing_year | appln_nr_epodoc | appln_nr_original | ipr_type |
3897853 | CA | 301 | A | 21-08-1970 | 1970 | CA19700000301 | 091,301, | PI |
The column appln_nr_original is defined as 100 characters, so it could as well be number as any other character. In this case there are 2 commas in the text. If your tool does not take double quotes into account, then it will create extra columns and alignment will be gone.
The original data on the delivered text files should look like this:
0521133,"DE","3062390","A ",1980-02-22,1980,"DE19803062390","80 80400253","PI",0,N,Y,Y,1979-03-02,1979
Observe the double quotes being used! Each string / text value is enclosed in double quotes, like: "Smith, John", “”900,00O” etc....
Within string values there are never double quotes; if by any reason, the original text contained double quotes (for example in the application abstracts), then they will have been replaced by a single quote.
thus, I think I may make some mistakes.
Did you process the file they sent in any way before trying to read it with SAS?
If not then they didn't send you what they thought they did.
If you did pre-process the file then attempt to read the original file with SAS and see if that fixes your issues.
To help things along I've registered for patstat and run the following query
Select * from tls201_appln where appln_id in (3574782,3897835)
UNION
SELECT TOP 10 * from tls201_appln where appln_nr_original like '%,%' and appln_nr_original not like '%,'
UNION
SELECT TOP 10 * from tls201_appln where appln_nr_original not like '%,%'
And attached the csv file ... and yes it does contain quotes! (and it's delimited with semi-colons)
I had to download that file to look at it because this **forum** displays it as if it was a spreadsheet instead of showing the actual text.
The file is delimited with semi-colons. It does have quotes.
11 options ls=132 ; 12 data _null_; 13 infile 'c:\\downloads\resulttable.csv' obs=3 ; 14 input; 15 list; 16 run; NOTE: A byte-order mark in the file "c:\\downloads\resulttable.csv" (for fileref "#LN00045") indicates that the data is encoded in "utf-8". This encoding will be used to process the file. NOTE: The infile 'c:\downloads\resulttable.csv' is: Filename=c:\downloads\resulttable.csv, RECFM=V,LRECL=131068,File Size (bytes)=3803, Last Modified=27Mar2018:15:26:32, Create Time=27Mar2018:15:26:32 RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0 1 appln_id;appln_auth;appln_nr;appln_kind;appln_filing_date;appln_filing_year;appln_nr_epodoc;appln_nr 101 _original;ipr_type;internat_appln_id;int_phase;reg_phase;nat_phase;earliest_filing_date;earliest_fil 201 ing_year;earliest_filing_id;earliest_publn_date;earliest_publn_year;earliest_pat_publn_id;granted;do 301 cdb_family_id;inpadoc_family_id;docdb_family_size;nb_citing_docdb_fam;nb_applicants;nb_inventors 396 2 0;"XX";"";"D ";9999-12-31;9999;"";"";"PI";0;"N";"N";"N";9999-12-31;9999;0;9999-12-31;9999;0;0;0;0;1; 101 0;0;0 105 3 1;"EP";"00103094";"A ";2000-02-15;2000;"EP20000103094";"00103094";"PI";0;"N";"Y";"N";1999-03-11;1999 101 ;18546269;2000-09-20;2000;293253293;1;8554171;1;6;72;1;4 156 NOTE: 3 records were read from the infile 'c:\\downloads\resulttable.csv'.
It actually has more quotes than it needs. There is no need to put quotes around EP or 00103094. Perhaps the software that generated thinks the quotes are somehow an indication that the data value is supposed to be interpreted as text?
The extra quotes do not cause an issue.
Another strange thing is the inclusion of "" as the value of the third column on the first data row. Normally you would just represent an empty value as two delimiters next to each other.
Again that should not cause any trouble.
Another strange thing is the fourth column. Why is there a space after the D and A? The extra spaces will be ignored. But perhaps PROC IMPORT might use them to decide that the fourth column should be defined with a length of $2 instead of $1.
Again spaces should not cause an issue.
Here are the lines from that 22 record sample that contain commas.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0 12 3574782;"BR";"7600215";"A ";1976-04-08;1976;"BR19767600215";"760215,";"PI";0;"N";"N";"Y";1975-04-14; 101 1975;41149045;1976-10-05;1976;312739797;0;19727905;327506;29;56;1;2 167 13 3897830;"CA";"133";"A ";1972-01-26;1972;"CA19720000133";"133, *7";"PI";0;"N";"N";"Y";19 101 72-01-26;1972;3897830;1974-02-12;1974;313964820;1;4083809;13098528;1;0;1;1 174 14 3897835;"CA";"182";"A ";1967-09-15;1967;"CA19670000182";"000,182";"PI";0;"N";"N";"Y";1966-09-16;1966 101 ;22485460;1975-03-18;1975;315362359;1;10420177;1822558;14;0;1;1 163 15 3897838;"CA";"200";"A ";1974-05-16;1974;"CA19740000200";"200,10";"PI";0;"N";"N";"Y";1973-07-23;1973; 101 50446315;1977-02-08;1977;313897066;1;23506530;1952392;1;0;1;1 161 16 3897856;"CA";"325";"A ";1967-09-18;1967;"CA19670000325";"000,325";"PI";0;"N";"N";"Y";1966-09-21;1966 101 ;901804757;1976-08-24;1976;312974165;1;6939039;1822628;7;21;1;3 163 17 3897859;"CA";"651";"A ";1967-09-21;1967;"CA19670000651";"000,651";"PI";0;"N";"N";"Y";1966-10-03;1966 101 ;905428887;1974-02-26;1974;314781130;1;24334948;3777769;5;6;1;1 163 18 3897860;"CA";"708";"A ";1967-09-22;1967;"CA19670000708";"000,708";"PI";0;"N";"N";"Y";1966-10-06;1966 101 ;905433308;1974-05-14;1974;314549444;1;24338839;1952402;2;4;1;2 163 19 3897865;"CA";"1262";"A ";1967-09-29;1967;"CA19670001262";"001,262";"PI";0;"N";"N";"Y";1966-10-04;196 101 6;905429980;1975-06-17;1975;313423709;1;24336185;1823098;3;0;1;1 164 20 3897866;"CA";"1390";"A ";1967-09-30;1967;"CA19670001390";"001,390";"PI";0;"N";"N";"Y";1966-10-01;196 101 6;901833818;1975-01-21;1975;315245242;1;7103726;546654;15;1;1;2 163 21 3897867;"CA";"1463";"A ";1967-10-02;1967;"CA19670001463";"001,463";"PI";0;"N";"N";"Y";1966-12-05;196 101 6;905453651;1977-01-25;1977;314775599;1;27083204;1825303;7;7;1;1 164 22 3897869;"CA";"1638";"A ";1967-10-04;1967;"CA19670001638";"001,638";"PI";0;"N";"N";"Y";1966-10-17;196 101 6;905431461;1974-01-08;1974;314966616;1;24348073;1823471;12;20;1;1 166
Even though the use of semi-colons as the delimiter means the values with commas do not need the quotes.
Can you attach a small sample of your file, the first 30 records. If it's confidential can you email it to one of us to test?
Hi @France.
Go back to the code you put in the very first post in this thread and change it to use a semi-colon delimiter, DLM=";".
This worked for me with the sample table I downloaded from patstat (and attached to earlier message)
As someone on here helpfully pointed out to me at one point, CSV in Europe typically uses a semicolon whereas in North America it's comma.
thanks for your help, but the variable in my data is seperated by commas.
Ok , so you aren't getting it from Patstat online but on datasets that have been supplied possibly on a memory stick.
Try Using Encoding="unicode" option in Infile statement (as in this thread)
I've copied below from Patstat documentation which describes the data that I think you are using
2.1. Data files
You will find the data files in the folder Data. For every table there is one or more zipped data file(s). To keep all files (if unzipped) smaller than 2 GB, the data of the large tables are split into multiple files. Sometimes users report problems unzipping some files when using WinZip. In these cases it is advisable to try a different unpacking tool like the open source tool 7-zip. After unzipping, each file contains one header record comprising the column names of the table, followed by multiple data records.
All characters are in Unicode with UTF-8 encoding
The files are in MS-DOS format (i.e. each line ends with CR/LF)
Each value is delimited by a comma ","
Each string / text value is enclosed in double quotes, like: "Smith, John". Within string values there are never double quotes.
All non-text values (numbers, dates) are not enclosed in quotes, like: 123, 2014-12-31 The decimal separator is the point ".", e.g. 0.125
Line breaking characters (LF, CR) are replaced by " \n ". They occur most frequently in the abstract text and the NPL bibliographic text.
Except "\n", no other characters are escaped.
Hi @France, So did this eventually work when you used the semicolon delimiter?
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.