BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
jimbobob
Quartz | Level 8

I have a very large json in a single line, I have problems with illegal characters, turn out I have a lot of DC3, EM, FS, GS hidden ascii symbols, so I got rid of those and some other special characters that are making it hard to parse the json, then I ran into issues with the escape characters so I also just replaced those with a space. Now I'm stuck, it appears to be unbalanced quotes. normally I would use some code got from you super smart folks that fixes CL LF in the middle of a quoted string, but that with multiple lines. How do I do that with one giant line? 

 

Is there a way to split the json into multiple lines then see where I have the unbalanced qoute and replace it before importing?

Code i used to start cleaning up the data

       %let dsnnme="/mnt/SAS/Shared Services/SAS_Datasets/tempfiles/tempjson/&nextitem";
        data _null_;
            infile &dsnnme recfm=n sharebuffers;
            file &dsnnme recfm=n;
			input a $char1.;
			if a = '\' then a=&repA;
			else if a = '13'x then a=&repA;
			else if a = '1C'x then a=&repA;
			else if a = '19'x then a=&repA;
			else if a = '1D'x then a=&repA;
			else if a = 'F3'x then a=&repA;
			else if a = 'A0'x then a=&repA;
			else if a = 'E6'x then a=&repA;
			else if a = 'F7'x then a=&repA;
			else if a = 'F2'x then a=&repA;
			else if a = 'FD'x then a=&repA;
			else if a = 'F1'x then a=&repA;

			put a $char1.;
        run;

code I've used to fix cr Lf between qoutes:

	data _null_;
		infile &dsnnme recfm=n sharebuffers;
		file &dsnnme recfm=n;
		retain open 0;
		input a $char1.;
		if a = '"' then open = ^(open);
		if open then do;
			if a = '0D'x then put &repD;
			else if a = '0A'x then put &repA;
		end;
	run;

Not really sure this will work, but I'm out of ideas. I'll try to upload sample but its PII data.

1 ACCEPTED SOLUTION

Accepted Solutions
yabwon
Onyx | Level 15

Since JSON uses backslash (\) to escape double quotes, this line:

if a = '\' then a=&repA;

may be the reason of your "unbalanced quotes".

 

If your JSON have something like:

{"data" : "Dwayne \"the rock\" Johnson"}

inside it, after that line in data step (assuming &repA. is for example a space) it gets:

{"data" : "Dwayne  "the rock " Johnson"}

 

 

Just guessing.

 

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



View solution in original post

7 REPLIES 7
Patrick
Opal | Level 21

If confidentiality allows then ideally share your json with us as attachment. 

 

If that's a one off then I'd be using a tool like Notepad++ showing all characters to investigate and clean-up what's happening (selected with wrap text and show all characters).

There is also a json plugin for Notepad++ that allows you to add line breaks (pretty print).

 

I would also look at the encoding of your json file. Is it eventually in UTF-8 but your SAS session runs in single byte mode and these non-print ASCII characters you encounter are actually the 2nd and 3rd byte of multibyte characters?

Tom
Super User Tom
Super User

Is there a reason you are trying to read the JSON with a data step instead of asking the JSON libref engine to interpret it for you?

filename json "/mnt/SAS/Shared Services/SAS_Datasets/tempfiles/tempjson/&nextitem" recfm=n;
libname json json;
proc copy inlib=json outlib=json;
run;
jimbobob
Quartz | Level 8

Hey Tom, yes I'm using the libname statement, but it keeps erroring out, at first it was hidden characters which I used Notepadd++ to find and then used the data step replace with a blank, then was the escape characters. This file is UTF-8 I'm using SAS viya 3.5/sas studio on unix server.

 

In notepad++ I've been going to each illegal error and seeing what it looks like, finding the character causing the issue and adding it to my replace code, now it appears to be quotes. the code in the notepad++ (which I set the language to json) also recognizes and issue the code is yellow (apologize hard to see). If I then search for next quote I get to here (second image) where it should be parsing at the comma, "user": field but it has an extra quote, If I remove that quote, it continues parsing until I run into the same issue.

 

 Screenshot 2024-02-22 081633.png

Screenshot 2024-02-22 083116b.png

 

 

The json is 680mb which is the smaller of 100's of jsons I need to parse, some are as large as 10gb, so at some point won't even be able to try and view it with notepad++

 

error:

 

89 filename fixmap "/mnt/SAS/Shared Services/SAS_Datasets/tempfiles/bob.map";
90 filename bobby "/mnt/SAS/Shared Services/SAS_Datasets/tempfiles/tempjson/ALERT_JSON_20231214_1.json" LRECL=1000000000
90 ! encoding='utf-8';
91 libname myjson json fileref=bobby map=fixmap automap=create ordinalcount=all;
NOTE: JSON data is only read once. To read the JSON again, reassign the JSON LIBNAME.
ERROR: Invalid JSON in input near line 1 column 6709058: Encountered an illegal character.
ERROR: Error in the LIBNAME statement.
 
I've attached a sample of the json

 

 

jimbobob
Quartz | Level 8

Try this file instead, I think i messed up the json structure when I was trying to mask some data

Tom
Super User Tom
Super User

Your JSON is malformed.  Do you know what process created it? Can you fix that process to generate valid JSON files?

 

SAS it not the best tool for dealing with long character strings (like complete files) but you can use it to debug what is happening and perhaps fix some things.

 

First let's allow the JSON engine to tell us where to look.

1    filename x "C:\Downloads\ALERT_JSON_20231214_1 (1).json";
2    libname x json;
NOTE: JSON data is only read once.  To read the JSON again, reassign the JSON LIBNAME.
ERROR: Invalid JSON in input near line 1 column 12377: Encountered an illegal character.
ERROR: Error in the LIBNAME statement.

So something is wrong around byte 12,377 of that file.  So let's look into the file and see what is around there.  Let's just run a simple data step to read the file as 100 byte records and read a few of those 100 bytes chunks that are around that point.

3    data _null_;
4      infile x lrecl=100 recfm=f firstobs=122 obs=126 ;
5      input;
6      list;
7    run;

NOTE: The infile X is:
      (system-specific pathname),
      (system-specific file attributes)

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
122       ","user":"45553fx5-xxxx-4455-53a5-a35x64442f56"},{"xate":"2525-55-55T56:52:56.555","xoxxent":"xxsrex
123       arx xrxor note xatex 55 /55.","user":"55ee55a5-555x-455f-a5xa-52353f5xxea5"},{"xate":"2525-55-55T56:
124       52:55.355","xoxxent":"Alert xas trxaxex xn Hx xxth the folloxxnx Hx note  n "Thxs alert trxxxerex fo
125       r Exxessxve xOS axouxt /OFx totalxxx $55,665.43 xetxeex 55 /35 /2525 axx 55 /25 /2525. x xA revxex o
126       f the xustoxer s xrofxle axx axxouxt revealex xustoxer xs 35 years olx axx resxxes xx Roxlaxx Hexxht
NOTE: 5 records were read from the infile (system-specific pathname).
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

So we can see that there is a long string with mismatched quotes.  For that to be valid JSON text the embedded double quote needs to prefixed with a backslash.

,"xoxxent":"Alert xas trxaxex xn Hx xxth the folloxxnx Hx note  n \"Thxs al....

 

So how can we find all of these improperly embedded double quote characters?

One way is to look for those that do not follow a comma or semicolon or closing bracket or an object or array.  Or that do not come before a comma or semicolon or opening bracket of an object or array.

8    data test;
9      infile x lrecl=100000 column=cc;
10     input @'"' +(-2) before $char1. +1 after $char1. @@;
11     col = cc - 2;
12     test1 = before in (',' ':' '[' '{');
13     test2 = after in (',' ':' ']' '}');
14     if (test1 or test2) and not (test1 and test2) then delete;
15   run;

NOTE: The infile X is:
      (system-specific pathname),
      (system-specific file attributes)

NOTE: 1 record was read from the infile (system-specific pathname).
      The minimum record length was 50064.
      The maximum record length was 50064.
NOTE: SAS went to a new line when INPUT @'CHARACTER_STRING' scanned past the end of a line.
NOTE: The data set WORK.TEST has 2 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

So we found two places.

So let's make a new file by copying the old one and inserting the needed \ at those two places.

16   proc sql noprint;
17    select col into :collist separated by ' ' from test;
18   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


19
20   filename y temp;
21   data _null_;
22     infile x recfm=n;
23     file y recfm=n;
24     col+1;
25     input ch $char1. ;
26     if col in (&collist) then put '\';
27     put ch $char1. ;
28   run;

NOTE: UNBUFFERED is the default with RECFM=N.
NOTE: The infile X is:
      (system-specific pathname),
      (system-specific file attributes)

NOTE: UNBUFFERED is the default with RECFM=N.
NOTE: The file Y is:
      (system-specific pathname),
      (system-specific file attributes)

NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds

Now let's try to see if the new file works.

29   libname y json;
NOTE: JSON data is only read once.  To read the JSON again, reassign the JSON LIBNAME.
ERROR: Invalid JSON in input near line 1 column 13488: Encountered an illegal character.
ERROR: Error in the LIBNAME statement.

So we found something else now that is later in the file

So let's look into this file and see what is there.

30   data _null_;
31     infile y lrecl=100 recfm=f firstobs=133 obs=136 ;
32     input;
33     list;
34   run;

NOTE: The infile Y is:
      (system-specific pathname),
      (system-specific file attributes)

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
133       te":"2525-55-55T56:53:55.554","xoxxext":"Alert Resolutxox: Alert xxsxxssex xxth reasox 'xo susxxxxou
134       s axtxvxty xxextxfxex.'.","user":"55ee55a5-555x-455f-a5xa-52353f5xxea5"}],"xroxessexxy":{"_xx":"55ee
135       55a5-555x-455f-a5xa-52353f5xxea5"},"xroxessexxate":"2525-55-55T56:53:55.555","suxState":xull,"stateH
136       xstoryxata":[{"xate":"2525-55-55T55:55:55.545","state":"xxsxxssex","userxx":"45553fx5-xxxx-4455-53a5
NOTE: 4 records were read from the infile (system-specific pathname).
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

This time it looks like you have the word xull where you probably mean null.

Hopefully that is just an artifact of whatever anonymization you tried to do.

If not try to figure out how to replace all of the 

:xull

 strings with

:null

instead.

Seems to work for this file:

Spoiler
35   data test2;
36     infile y lrecl=100000 column=cc ;
37     input @':xull' @@;
38     col = cc-4;
39   run;

NOTE: The infile Y is:
      (system-specific pathname),
      (system-specific file attributes)

NOTE: 1 record was read from the infile (system-specific pathname).
      The minimum record length was 50066.
      The maximum record length was 50066.
NOTE: SAS went to a new line when INPUT @'CHARACTER_STRING' scanned past the end of a line.
NOTE: The data set WORK.TEST2 has 14 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds


40   proc sql noprint;
41    select col into :collist2 separated by ' ' from test2;
42   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


43
44   filename z temp;
45   data _null_;
46     infile y recfm=n;
47     file z recfm=n;
48     col+1;
49     input ch $char1. ;
50     if col in (&collist2) then ch='n';
51     put ch $char1. ;
52   run;

NOTE: UNBUFFERED is the default with RECFM=N.
NOTE: The infile Y is:
      (system-specific pathname),
      (system-specific file attributes)

NOTE: UNBUFFERED is the default with RECFM=N.
NOTE: The file Z is:
      (system-specific pathname),
      (system-specific file attributes)

NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds


53   libname z json;
NOTE: JSON data is only read once.  To read the JSON again, reassign the JSON LIBNAME.
NOTE: Libref Z was successfully assigned as follows:
      Engine:        JSON
      Physical Name: (system-specific file/path name)

54   proc copy inlib=z outlib=work; run;

NOTE: Copying Z.ALLDATA to WORK.ALLDATA (memtype=DATA).
NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used.
NOTE: The data set WORK.ALLDATA has 1917 observations and 9 variables.
NOTE: Copying Z.ASSXXNOR to WORK.ASSXXNOR (memtype=DATA).
NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used.
NOTE: The data set WORK.ASSXXNOR has 2 observations and 3 variables.
NOTE: Copying Z.ASSXXXOR to WORK.ASSXXXOR (memtype=DATA).
NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used.
NOTE: The data set WORK.ASSXXXOR has 1 observations and 3 variables.
NOTE: Copying Z.AXXOUNTS to WORK.AXXOUNTS (memtype=DATA).
NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used.
NOTE: The data set WORK.AXXOUNTS has 1 observations and 12 variables.
NOTE: Copying Z.AXXOUNTS_AXXXANAXERRELATXONSHX to WORK.AXXOUNTS_AXXXANAXERRELATXONSHX (memtype=DATA).
NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used.
NOTE: The data set WORK.AXXOUNTS_AXXXANAXERRELATXONSHX has 1 observations and 3 variables.
NOTE: Copying Z.AXXOUNTS_AXXXANAXERRELATXONSHX2 to WORK.AXXOUNTS_AXXXANAXERRELATXONSHX2 (memtype=DATA).
NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used.
NOTE: The data set WORK.AXXOUNTS_AXXXANAXERRELATXONSHX2 has 1 observations and 3 variables.
NOTE: Copying Z.AXXOUNTS_XNSTXTUTXON to WORK.AXXOUNTS_XNSTXTUTXON (memtype=DATA).
NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used.
NOTE: The data set WORK.AXXOUNTS_XNSTXTUTXON has 1 observations and 15 variables.
NOTE: Copying Z.AXXOUNTS_XNSTXTUTXON2 to WORK.AXXOUNTS_XNSTXTUTXON2 (memtype=DATA).
NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used.
NOTE: The data set WORK.AXXOUNTS_XNSTXTUTXON2 has 1 observations and 15 variables.
NOTE: Copying Z.AXXOUXTS to WORK.AXXOUXTS (memtype=DATA).
NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used.
NOTE: The data set WORK.AXXOUXTS has 1 observations and 12 variables.
NOTE: Copying Z.AXXOUXTS_AXXXAXAXERRELATXOXSHX to WORK.AXXOUXTS_AXXXAXAXERRELATXOXSHX (memtype=DATA).
NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used.
NOTE: The data set WORK.AXXOUXTS_AXXXAXAXERRELATXOXSHX has 1 observations and 3 variables.
NOTE: Copying Z.AXXOUXTS_AXXXAXAXERRELATXOXSHX2 to WORK.AXXOUXTS_AXXXAXAXERRELATXOXSHX2 (memtype=DATA).
NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used.
NOTE: The data set WORK.AXXOUXTS_AXXXAXAXERRELATXOXSHX2 has 4 observations and 3 variables.
NOTE: Copying Z.AXXXANAXERRELATXONSHX_XUSTOXER to WORK.AXXXANAXERRELATXONSHX_XUSTOXER (memtype=DATA).
NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used.
NOTE: The data set WORK.AXXXANAXERRELATXONSHX_XUSTOXER has 1 observations and 16 variables.
NOTE: Copying Z.AXXXANAXERRELATXONSHX_XUSTOXER2 to WORK.AXXXANAXERRELATXONSHX_XUSTOXER2 (memtype=DATA).
NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used.
NOTE: The data set WORK.AXXXANAXERRELATXONSHX_XUSTOXER2 has 1 observations and 16 variables.
NOTE: Copying Z.AXXXAXAXERRELATXOXSHX_XUSTOXER to WORK.AXXXAXAXERRELATXOXSHX_XUSTOXER (memtype=DATA).
NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used.
NOTE: The data set WORK.AXXXAXAXERRELATXOXSHX_XUSTOXER has 1 observations and 16 variables.
NOTE: Copying Z.AXXXAXAXERRELATXOXSHX_XUSTOXER2 to WORK.AXXXAXAXERRELATXOXSHX_XUSTOXER2 (memtype=DATA).
NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used.
NOTE: The data set WORK.AXXXAXAXERRELATXOXSHX_XUSTOXER2 has 4 observations and 16 variables.
NOTE: Copying Z.OTHERXARTY_XNSTXTUTXON to WORK.OTHERXARTY_XNSTXTUTXON (memtype=DATA).
NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used.
NOTE: The data set WORK.OTHERXARTY_XNSTXTUTXON has 2 observations and 4 variables.
NOTE: Copying Z.OTHERXARTY_XXSTXTUTXOX to WORK.OTHERXARTY_XXSTXTUTXOX (memtype=DATA).
NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used.
NOTE: The data set WORK.OTHERXARTY_XXSTXTUTXOX has 11 observations and 4 variables.
NOTE: Copying Z.ROOT to WORK.ROOT (memtype=DATA).
NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used.
NOTE: The data set WORK.ROOT has 3 observations and 29 variables.
NOTE: Copying Z.RXSKFAXTORXETAXLS to WORK.RXSKFAXTORXETAXLS (memtype=DATA).
NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used.
NOTE: The data set WORK.RXSKFAXTORXETAXLS has 3 observations and 4 variables.
NOTE: Copying Z.STATEHXSTORYXATA to WORK.STATEHXSTORYXATA (memtype=DATA).
NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used.
NOTE: The data set WORK.STATEHXSTORYXATA has 9 observations and 5 variables.
NOTE: Copying Z.SUXJEXTS_RELATXONSHXXS to WORK.SUXJEXTS_RELATXONSHXXS (memtype=DATA).
NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used.
NOTE: The data set WORK.SUXJEXTS_RELATXONSHXXS has 2 observations and 3 variables.
NOTE: Copying Z.SUXJEXTS_RELATXOXSHXXS to WORK.SUXJEXTS_RELATXOXSHXXS (memtype=DATA).
NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used.
NOTE: The data set WORK.SUXJEXTS_RELATXOXSHXXS has 1 observations and 3 variables.
NOTE: Copying Z.SUXJEXTS_XUSTOXER to WORK.SUXJEXTS_XUSTOXER (memtype=DATA).
....

Now al least the JSON is valid, even if it is not a very suitable structure for data analysis.

yabwon
Onyx | Level 15

Since JSON uses backslash (\) to escape double quotes, this line:

if a = '\' then a=&repA;

may be the reason of your "unbalanced quotes".

 

If your JSON have something like:

{"data" : "Dwayne \"the rock\" Johnson"}

inside it, after that line in data step (assuming &repA. is for example a space) it gets:

{"data" : "Dwayne  "the rock " Johnson"}

 

 

Just guessing.

 

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



jimbobob
Quartz | Level 8

Ok @yabwon you were spot on, I removed the escape from the code to make it a space, I thought these were special characters causing issues and it imported just fine. 

 

 

Thank you everyone for your help

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 1448 views
  • 1 like
  • 4 in conversation