BookmarkSubscribeRSS Feed
France
Quartz | Level 8

when I am importing a large .csv file into SAS by using following codes,

 

data SASDATA.Applications ;
infile 'R:/Li/PATSTAT/Applications.csv' DLM = ',' DSD missover lrecl=32767 firstobs = 3 ;
input
  appln_id :29.
  appln_auth :$29.
  appln_nr :$29.
  appln_kind :$29.
  appln_filing_date :YYMMDD10.
  appln_filing_year
  appln_nr_epodoc :$50.
  appln_nr_original :$150.
  ipr_type :$29.
  internat_appln_id :29.
  int_phase :$29.
  reg_phase :$29.
  nat_phase :$29.
  earliest_filing_date :YYMMDD10.
  earliest_filing_year
  earliest_filing_id :29.
  earliest_publn_date :YYMMDD10.
  earliest_publn_year
  earliest_pat_publn_id :29.
  granted :29.
  docdb_family_id :29.
  inpadoc_family_id :29.
  docdb_family_size :29.
  nb_citing_docdb_fam :29.
  nb_applicants :29.
  nb_inventors :29.
;
format
  appln_filing_date :YYMMDDd10.
  appln_filing_year :YEAR10.
  earliest_filing_date :YYMMDDd10.
  earliest_filing_year :YEAR10.
  earliest_publn_date :YYMMDDd10.
  earliest_publn_year :YEAR10.
;
run ;

 

the log shows information like following:

NOTE: Invalid data for internat_appln_id in line 3457514 61-62.

NOTE: Invalid data for earliest_filing_date in line 3457514 70-70.
NOTE: Invalid data for earliest_filing_year in line 3457514 72-81.
NOTE: Invalid data for earliest_publn_date in line 3457514 88-95.
NOTE: Invalid data for earliest_publn_year in line 3457514 97-106.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9
3457514 3574782,BR,7600215,A ,1976-04-08,1976,BR19767600215,760215,,PI,0,N,N,Y,1975-04-14,1975,411
91 49045,1976-10-05,1976,312739797,0,19727905,327504,29,56,1,2 149
appln_id=3574782 appln_auth=BR appln_nr=7600215 appln_kind=A appln_filing_date=1976-04-08
appln_filing_year=1965 appln_nr_epodoc=BR19767600215 appln_nr_original=760215 ipr_type=
internat_appln_id=. int_phase=0 reg_phase=N nat_phase=N earliest_filing_date=.
earliest_filing_year=. earliest_filing_id=1975 earliest_publn_date=. earliest_publn_year=.
earliest_pat_publn_id=1976 granted=312739797 docdb_family_id=0 inpadoc_family_id=19727905
docdb_family_size=327504 nb_citing_docdb_fam=29 nb_applicants=56 nb_inventors=1 _ERROR_=1
_N_=3457513
 
or like this
 
NOTE: Invalid data for internat_appln_id in line 3780547 57-58.
NOTE: Invalid data for earliest_filing_date in line 3780547 66-66.
NOTE: Invalid data for earliest_filing_year in line 3780547 68-77.
NOTE: Invalid data for earliest_publn_date in line 3780547 84-91.
NOTE: Invalid data for earliest_publn_year in line 3780547 93-102.
3780547 3897835,CA,182,A ,1967-09-15,1967,CA19670000182,000,182,PI,0,N,N,Y,1966-09-16,1966,2248546
91 0,1975-03-18,1975,315362359,1,10420177,1874780,14,0,1,1 145
appln_id=3897835 appln_auth=CA appln_nr=182 appln_kind=A appln_filing_date=1967-09-15
appln_filing_year=1965 appln_nr_epodoc=CA19670000182 appln_nr_original=000 ipr_type=182
internat_appln_id=. int_phase=0 reg_phase=N nat_phase=N earliest_filing_date=.
earliest_filing_year=. earliest_filing_id=1966 earliest_publn_date=. earliest_publn_year=.
earliest_pat_publn_id=1975 granted=315362359 docdb_family_id=1 inpadoc_family_id=10420177
docdb_family_size=1874780 nb_citing_docdb_fam=14 nb_applicants=0 nb_inventors=1 _ERROR_=1
 
as I confirm 'PI' should be recorded in the 'ipr_type' volume, the problem should be caused by 'extra' comma . besides the data set is too large to correct one by one.
what should I do for it?
 
thanks in advance.
40 REPLIES 40
Reeza
Super User

Doesn't PATSTAT also provide JSON files? Maybe that may be easier to parse out and won't have this issue. I'm surprised it does. 

Can you upload some records in a CSV file that we can work with.

France
Quartz | Level 8

I think there is no JOSN files. Besides, could you tell me how to split a csv file please? the all file is too big to upload.

Reeza
Super User

If you have powershell  (Windows will automatically) you can use this approach:

gc -path file_name.csv - head N > output.txt
France
Quartz | Level 8

1.Could you explain what is the meaning of this code please?

for example, should I use the following code if I want to extract volume from 3457514 to 3780578 ?

 

gc - R:/Li/PATSTAT/Applications.csv - head 3457514 to 3780578> output.txt

 

 

2. https://data.epo.org/access-control/welcome?lg=en this is the link to PATSTAT database. you can click the 'PATSTAT Online' on that page, and then you can register for a free trial of PATSTAT Online. Please note that PATSTAT Online free trials are valid for one month.

JohnHoughton
Quartz | Level 8

But the patstat manual says it can save as xml

Reeza
Super User
And does it occur in all records or only some? if only some make sure to include that variability in the sample data.
France
Quartz | Level 8

i think it is occurred in some records.This attribute is not created by PATSTAT but collect form different patent authorities. that might be the reason why these mistakes happened. 

Reeza
Super User

@France wrote:

i think it is occurred in some records.This attribute is not created by PATSTAT but collect form different patent authorities. that might be the reason why these mistakes happened. 


Is this one of their free files? If so, post the link then.

Tom
Super User Tom
Super User

@Reeza wrote:

@France wrote:

i think it is occurred in some records.This attribute is not created by PATSTAT but collect form different patent authorities. that might be the reason why these mistakes happened. 


Is this one of their free files? If so, post the link then.


If you PAID for this file then insist that they fix it.  They can either use a delimiter that does not appear in any field. Or add quotes around values that contain the delimiter. They could even use the goofy syntax that some database support of adding a backslash in front off delimiters that appear in the data.  At minimum they need to provide a file that can be parsed by a rule, which this file cannot.

France
Quartz | Level 8
thanks for your advice. I am not sure about it as it is given by my supervisor. but I will try to ask them to fix the data set.
ballardw
Super User

You don't show any value with imbedded commas. If there are none you might test a line to see if the number of expected is commas is too large or too small suppose that your data should have 27 variables with exactly 26 commas separating the values and NO commas inside any of the text field:

 

input @;

if countc(_infile_,',') ne 26 then do;

   put "WARNING: Unexpected number of variables on line " _n_;

   input;

end;

else input

<your variable list>

;

 

But this won't work if any of the text fields can have a comma in side a column. And really only cleans up the log as you would have to examine that data file closely to fix the text in whatever manner is needed.

Tom
Super User Tom
Super User

Didn't you ask this question (in different words) already?

Did you try adapting the solution from the other question to this data?

 

You really need to talk to the creators of the CSV file and ask that takes steps to insure they are creating valid CSV files. That is that any value that contains the delimiter is quoted.   

 

Or at least use some method that creates files that CAN be interpreted.

Reeza
Super User

Doesn't PATSTAT provide the SQL queries? With some small modifications those will run in SAS...

France
Quartz | Level 8

PATSTAT Online provide SQL queries. but I am using 2016 autumn edition while the PATSTAT online only provides 2017 spring edition and 2017 autumn edition. is that ok?

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 40 replies
  • 2154 views
  • 8 likes
  • 5 in conversation