DATA Step, Macro, Functions and more

how to import a large .csv file into SAS if it includes the 'extra' comma?

Reply
Contributor
Posts: 36

how to import a large .csv file into SAS if it includes the 'extra' comma?

[ Edited ]

when I am importing a large .csv file into SAS by using following codes,

 

data SASDATA.Applications ;
infile 'R:/Li/PATSTAT/Applications.csv' DLM = ',' DSD missover lrecl=32767 firstobs = 3 ;
input
  appln_id :29.
  appln_auth :$29.
  appln_nr :$29.
  appln_kind :$29.
  appln_filing_date :YYMMDD10.
  appln_filing_year
  appln_nr_epodoc :$50.
  appln_nr_original :$150.
  ipr_type :$29.
  internat_appln_id :29.
  int_phase :$29.
  reg_phase :$29.
  nat_phase :$29.
  earliest_filing_date :YYMMDD10.
  earliest_filing_year
  earliest_filing_id :29.
  earliest_publn_date :YYMMDD10.
  earliest_publn_year
  earliest_pat_publn_id :29.
  granted :29.
  docdb_family_id :29.
  inpadoc_family_id :29.
  docdb_family_size :29.
  nb_citing_docdb_fam :29.
  nb_applicants :29.
  nb_inventors :29.
;
format
  appln_filing_date :YYMMDDd10.
  appln_filing_year :YEAR10.
  earliest_filing_date :YYMMDDd10.
  earliest_filing_year :YEAR10.
  earliest_publn_date :YYMMDDd10.
  earliest_publn_year :YEAR10.
;
run ;

 

the log shows information like following:

NOTE: Invalid data for internat_appln_id in line 3457514 61-62.

NOTE: Invalid data for earliest_filing_date in line 3457514 70-70.
NOTE: Invalid data for earliest_filing_year in line 3457514 72-81.
NOTE: Invalid data for earliest_publn_date in line 3457514 88-95.
NOTE: Invalid data for earliest_publn_year in line 3457514 97-106.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9
3457514 3574782,BR,7600215,A ,1976-04-08,1976,BR19767600215,760215,,PI,0,N,N,Y,1975-04-14,1975,411
91 49045,1976-10-05,1976,312739797,0,19727905,327504,29,56,1,2 149
appln_id=3574782 appln_auth=BR appln_nr=7600215 appln_kind=A appln_filing_date=1976-04-08
appln_filing_year=1965 appln_nr_epodoc=BR19767600215 appln_nr_original=760215 ipr_type=
internat_appln_id=. int_phase=0 reg_phase=N nat_phase=N earliest_filing_date=.
earliest_filing_year=. earliest_filing_id=1975 earliest_publn_date=. earliest_publn_year=.
earliest_pat_publn_id=1976 granted=312739797 docdb_family_id=0 inpadoc_family_id=19727905
docdb_family_size=327504 nb_citing_docdb_fam=29 nb_applicants=56 nb_inventors=1 _ERROR_=1
_N_=3457513
 
or like this
 
NOTE: Invalid data for internat_appln_id in line 3780547 57-58.
NOTE: Invalid data for earliest_filing_date in line 3780547 66-66.
NOTE: Invalid data for earliest_filing_year in line 3780547 68-77.
NOTE: Invalid data for earliest_publn_date in line 3780547 84-91.
NOTE: Invalid data for earliest_publn_year in line 3780547 93-102.
3780547 3897835,CA,182,A ,1967-09-15,1967,CA19670000182,000,182,PI,0,N,N,Y,1966-09-16,1966,2248546
91 0,1975-03-18,1975,315362359,1,10420177,1874780,14,0,1,1 145
appln_id=3897835 appln_auth=CA appln_nr=182 appln_kind=A appln_filing_date=1967-09-15
appln_filing_year=1965 appln_nr_epodoc=CA19670000182 appln_nr_original=000 ipr_type=182
internat_appln_id=. int_phase=0 reg_phase=N nat_phase=N earliest_filing_date=.
earliest_filing_year=. earliest_filing_id=1966 earliest_publn_date=. earliest_publn_year=.
earliest_pat_publn_id=1975 granted=315362359 docdb_family_id=1 inpadoc_family_id=10420177
docdb_family_size=1874780 nb_citing_docdb_fam=14 nb_applicants=0 nb_inventors=1 _ERROR_=1
 
as I confirm 'PI' should be recorded in the 'ipr_type' volume, the problem should be caused by 'extra' comma . besides the data set is too large to correct one by one.
what should I do for it?
 
thanks in advance.
Super User
Posts: 22,874

Re: how to import a large .csv file into SAS if it includes the 'extra' comma?

Doesn't PATSTAT also provide JSON files? Maybe that may be easier to parse out and won't have this issue. I'm surprised it does. 

Can you upload some records in a CSV file that we can work with.

Contributor
Posts: 36

Re: how to import a large .csv file into SAS if it includes the 'extra' comma?

I think there is no JOSN files. Besides, could you tell me how to split a csv file please? the all file is too big to upload.

Super User
Posts: 22,874

Re: how to import a large .csv file into SAS if it includes the 'extra' comma?

If you have powershell  (Windows will automatically) you can use this approach:

gc -path file_name.csv - head N > output.txt
Contributor
Posts: 36

Re: how to import a large .csv file into SAS if it includes the 'extra' comma?

1.Could you explain what is the meaning of this code please?

for example, should I use the following code if I want to extract volume from 3457514 to 3780578 ?

 

gc - R:/Li/PATSTAT/Applications.csv - head 3457514 to 3780578> output.txt

 

 

2. https://data.epo.org/access-control/welcome?lg=en this is the link to PATSTAT database. you can click the 'PATSTAT Online' on that page, and then you can register for a free trial of PATSTAT Online. Please note that PATSTAT Online free trials are valid for one month.

Contributor
Posts: 62

Re: how to import a large .csv file into SAS if it includes the 'extra' comma?

But the patstat manual says it can save as xml

Super User
Posts: 22,874

Re: how to import a large .csv file into SAS if it includes the 'extra' comma?

And does it occur in all records or only some? if only some make sure to include that variability in the sample data.
Contributor
Posts: 36

Re: how to import a large .csv file into SAS if it includes the 'extra' comma?

i think it is occurred in some records.This attribute is not created by PATSTAT but collect form different patent authorities. that might be the reason why these mistakes happened. 

Super User
Posts: 22,874

Re: how to import a large .csv file into SAS if it includes the 'extra' comma?


@France wrote:

i think it is occurred in some records.This attribute is not created by PATSTAT but collect form different patent authorities. that might be the reason why these mistakes happened. 


Is this one of their free files? If so, post the link then.

Super User
Super User
Posts: 7,860

Re: how to import a large .csv file into SAS if it includes the 'extra' comma?

[ Edited ]

@Reeza wrote:

@France wrote:

i think it is occurred in some records.This attribute is not created by PATSTAT but collect form different patent authorities. that might be the reason why these mistakes happened. 


Is this one of their free files? If so, post the link then.


If you PAID for this file then insist that they fix it.  They can either use a delimiter that does not appear in any field. Or add quotes around values that contain the delimiter. They could even use the goofy syntax that some database support of adding a backslash in front off delimiters that appear in the data.  At minimum they need to provide a file that can be parsed by a rule, which this file cannot.

Contributor
Posts: 36

Re: how to import a large .csv file into SAS if it includes the 'extra' comma?

thanks for your advice. I am not sure about it as it is given by my supervisor. but I will try to ask them to fix the data set.
Super User
Posts: 13,084

Re: how to import a large .csv file into SAS if it includes the 'extra' comma?

You don't show any value with imbedded commas. If there are none you might test a line to see if the number of expected is commas is too large or too small suppose that your data should have 27 variables with exactly 26 commas separating the values and NO commas inside any of the text field:

 

input @;

if countc(_infile_,',') ne 26 then do;

   put "WARNING: Unexpected number of variables on line " _n_;

   input;

end;

else input

<your variable list>

;

 

But this won't work if any of the text fields can have a comma in side a column. And really only cleans up the log as you would have to examine that data file closely to fix the text in whatever manner is needed.

Super User
Super User
Posts: 7,860

Re: how to import a large .csv file into SAS if it includes the 'extra' comma?

Didn't you ask this question (in different words) already?

Did you try adapting the solution from the other question to this data?

 

You really need to talk to the creators of the CSV file and ask that takes steps to insure they are creating valid CSV files. That is that any value that contains the delimiter is quoted.   

 

Or at least use some method that creates files that CAN be interpreted.

Super User
Posts: 22,874

Re: how to import a large .csv file into SAS if it includes the 'extra' comma?

Doesn't PATSTAT provide the SQL queries? With some small modifications those will run in SAS...

Contributor
Posts: 36

Re: how to import a large .csv file into SAS if it includes the 'extra' comma?

PATSTAT Online provide SQL queries. but I am using 2016 autumn edition while the PATSTAT online only provides 2017 spring edition and 2017 autumn edition. is that ok?

Ask a Question
Discussion stats
  • 40 replies
  • 261 views
  • 8 likes
  • 5 in conversation