I have a very large json in a single line, I have problems with illegal characters, turn out I have a lot of DC3, EM, FS, GS hidden ascii symbols, so I got rid of those and some other special characters that are making it hard to parse the json, then I ran into issues with the escape characters so I also just replaced those with a space. Now I'm stuck, it appears to be unbalanced quotes. normally I would use some code got from you super smart folks that fixes CL LF in the middle of a quoted string, but that with multiple lines. How do I do that with one giant line?
Is there a way to split the json into multiple lines then see where I have the unbalanced qoute and replace it before importing?
Code i used to start cleaning up the data
%let dsnnme="/mnt/SAS/Shared Services/SAS_Datasets/tempfiles/tempjson/&nextitem";
data _null_;
infile &dsnnme recfm=n sharebuffers;
file &dsnnme recfm=n;
input a $char1.;
if a = '\' then a=&repA;
else if a = '13'x then a=&repA;
else if a = '1C'x then a=&repA;
else if a = '19'x then a=&repA;
else if a = '1D'x then a=&repA;
else if a = 'F3'x then a=&repA;
else if a = 'A0'x then a=&repA;
else if a = 'E6'x then a=&repA;
else if a = 'F7'x then a=&repA;
else if a = 'F2'x then a=&repA;
else if a = 'FD'x then a=&repA;
else if a = 'F1'x then a=&repA;
put a $char1.;
run;
code I've used to fix cr Lf between qoutes:
data _null_;
infile &dsnnme recfm=n sharebuffers;
file &dsnnme recfm=n;
retain open 0;
input a $char1.;
if a = '"' then open = ^(open);
if open then do;
if a = '0D'x then put &repD;
else if a = '0A'x then put &repA;
end;
run;
Not really sure this will work, but I'm out of ideas. I'll try to upload sample but its PII data.
Since JSON uses backslash (\) to escape double quotes, this line:
if a = '\' then a=&repA;
may be the reason of your "unbalanced quotes".
If your JSON have something like:
{"data" : "Dwayne \"the rock\" Johnson"}
inside it, after that line in data step (assuming &repA. is for example a space) it gets:
{"data" : "Dwayne "the rock " Johnson"}
Just guessing.
Bart
If confidentiality allows then ideally share your json with us as attachment.
If that's a one off then I'd be using a tool like Notepad++ showing all characters to investigate and clean-up what's happening (selected with wrap text and show all characters).
There is also a json plugin for Notepad++ that allows you to add line breaks (pretty print).
I would also look at the encoding of your json file. Is it eventually in UTF-8 but your SAS session runs in single byte mode and these non-print ASCII characters you encounter are actually the 2nd and 3rd byte of multibyte characters?
Is there a reason you are trying to read the JSON with a data step instead of asking the JSON libref engine to interpret it for you?
filename json "/mnt/SAS/Shared Services/SAS_Datasets/tempfiles/tempjson/&nextitem" recfm=n;
libname json json;
proc copy inlib=json outlib=json;
run;
Hey Tom, yes I'm using the libname statement, but it keeps erroring out, at first it was hidden characters which I used Notepadd++ to find and then used the data step replace with a blank, then was the escape characters. This file is UTF-8 I'm using SAS viya 3.5/sas studio on unix server.
In notepad++ I've been going to each illegal error and seeing what it looks like, finding the character causing the issue and adding it to my replace code, now it appears to be quotes. the code in the notepad++ (which I set the language to json) also recognizes and issue the code is yellow (apologize hard to see). If I then search for next quote I get to here (second image) where it should be parsing at the comma, "user": field but it has an extra quote, If I remove that quote, it continues parsing until I run into the same issue.
The json is 680mb which is the smaller of 100's of jsons I need to parse, some are as large as 10gb, so at some point won't even be able to try and view it with notepad++
error:
Try this file instead, I think i messed up the json structure when I was trying to mask some data
Your JSON is malformed. Do you know what process created it? Can you fix that process to generate valid JSON files?
SAS it not the best tool for dealing with long character strings (like complete files) but you can use it to debug what is happening and perhaps fix some things.
First let's allow the JSON engine to tell us where to look.
1 filename x "C:\Downloads\ALERT_JSON_20231214_1 (1).json"; 2 libname x json; NOTE: JSON data is only read once. To read the JSON again, reassign the JSON LIBNAME. ERROR: Invalid JSON in input near line 1 column 12377: Encountered an illegal character. ERROR: Error in the LIBNAME statement.
So something is wrong around byte 12,377 of that file. So let's look into the file and see what is around there. Let's just run a simple data step to read the file as 100 byte records and read a few of those 100 bytes chunks that are around that point.
3 data _null_; 4 infile x lrecl=100 recfm=f firstobs=122 obs=126 ; 5 input; 6 list; 7 run; NOTE: The infile X is: (system-specific pathname), (system-specific file attributes) RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0 122 ","user":"45553fx5-xxxx-4455-53a5-a35x64442f56"},{"xate":"2525-55-55T56:52:56.555","xoxxent":"xxsrex 123 arx xrxor note xatex 55 /55.","user":"55ee55a5-555x-455f-a5xa-52353f5xxea5"},{"xate":"2525-55-55T56: 124 52:55.355","xoxxent":"Alert xas trxaxex xn Hx xxth the folloxxnx Hx note n "Thxs alert trxxxerex fo 125 r Exxessxve xOS axouxt /OFx totalxxx $55,665.43 xetxeex 55 /35 /2525 axx 55 /25 /2525. x xA revxex o 126 f the xustoxer s xrofxle axx axxouxt revealex xustoxer xs 35 years olx axx resxxes xx Roxlaxx Hexxht NOTE: 5 records were read from the infile (system-specific pathname). NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds
So we can see that there is a long string with mismatched quotes. For that to be valid JSON text the embedded double quote needs to prefixed with a backslash.
,"xoxxent":"Alert xas trxaxex xn Hx xxth the folloxxnx Hx note n \"Thxs al....
So how can we find all of these improperly embedded double quote characters?
One way is to look for those that do not follow a comma or semicolon or closing bracket or an object or array. Or that do not come before a comma or semicolon or opening bracket of an object or array.
8 data test; 9 infile x lrecl=100000 column=cc; 10 input @'"' +(-2) before $char1. +1 after $char1. @@; 11 col = cc - 2; 12 test1 = before in (',' ':' '[' '{'); 13 test2 = after in (',' ':' ']' '}'); 14 if (test1 or test2) and not (test1 and test2) then delete; 15 run; NOTE: The infile X is: (system-specific pathname), (system-specific file attributes) NOTE: 1 record was read from the infile (system-specific pathname). The minimum record length was 50064. The maximum record length was 50064. NOTE: SAS went to a new line when INPUT @'CHARACTER_STRING' scanned past the end of a line. NOTE: The data set WORK.TEST has 2 observations and 5 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.01 seconds
So we found two places.
So let's make a new file by copying the old one and inserting the needed \ at those two places.
16 proc sql noprint; 17 select col into :collist separated by ' ' from test; 18 quit; NOTE: PROCEDURE SQL used (Total process time): real time 0.00 seconds cpu time 0.00 seconds 19 20 filename y temp; 21 data _null_; 22 infile x recfm=n; 23 file y recfm=n; 24 col+1; 25 input ch $char1. ; 26 if col in (&collist) then put '\'; 27 put ch $char1. ; 28 run; NOTE: UNBUFFERED is the default with RECFM=N. NOTE: The infile X is: (system-specific pathname), (system-specific file attributes) NOTE: UNBUFFERED is the default with RECFM=N. NOTE: The file Y is: (system-specific pathname), (system-specific file attributes) NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.01 seconds
Now let's try to see if the new file works.
29 libname y json; NOTE: JSON data is only read once. To read the JSON again, reassign the JSON LIBNAME. ERROR: Invalid JSON in input near line 1 column 13488: Encountered an illegal character. ERROR: Error in the LIBNAME statement.
So we found something else now that is later in the file
So let's look into this file and see what is there.
30 data _null_; 31 infile y lrecl=100 recfm=f firstobs=133 obs=136 ; 32 input; 33 list; 34 run; NOTE: The infile Y is: (system-specific pathname), (system-specific file attributes) RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0 133 te":"2525-55-55T56:53:55.554","xoxxext":"Alert Resolutxox: Alert xxsxxssex xxth reasox 'xo susxxxxou 134 s axtxvxty xxextxfxex.'.","user":"55ee55a5-555x-455f-a5xa-52353f5xxea5"}],"xroxessexxy":{"_xx":"55ee 135 55a5-555x-455f-a5xa-52353f5xxea5"},"xroxessexxate":"2525-55-55T56:53:55.555","suxState":xull,"stateH 136 xstoryxata":[{"xate":"2525-55-55T55:55:55.545","state":"xxsxxssex","userxx":"45553fx5-xxxx-4455-53a5 NOTE: 4 records were read from the infile (system-specific pathname). NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds
This time it looks like you have the word xull where you probably mean null.
Hopefully that is just an artifact of whatever anonymization you tried to do.
If not try to figure out how to replace all of the
:xull
strings with
:null
instead.
Seems to work for this file:
35 data test2; 36 infile y lrecl=100000 column=cc ; 37 input @':xull' @@; 38 col = cc-4; 39 run; NOTE: The infile Y is: (system-specific pathname), (system-specific file attributes) NOTE: 1 record was read from the infile (system-specific pathname). The minimum record length was 50066. The maximum record length was 50066. NOTE: SAS went to a new line when INPUT @'CHARACTER_STRING' scanned past the end of a line. NOTE: The data set WORK.TEST2 has 14 observations and 1 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.00 seconds 40 proc sql noprint; 41 select col into :collist2 separated by ' ' from test2; 42 quit; NOTE: PROCEDURE SQL used (Total process time): real time 0.00 seconds cpu time 0.00 seconds 43 44 filename z temp; 45 data _null_; 46 infile y recfm=n; 47 file z recfm=n; 48 col+1; 49 input ch $char1. ; 50 if col in (&collist2) then ch='n'; 51 put ch $char1. ; 52 run; NOTE: UNBUFFERED is the default with RECFM=N. NOTE: The infile Y is: (system-specific pathname), (system-specific file attributes) NOTE: UNBUFFERED is the default with RECFM=N. NOTE: The file Z is: (system-specific pathname), (system-specific file attributes) NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.01 seconds 53 libname z json; NOTE: JSON data is only read once. To read the JSON again, reassign the JSON LIBNAME. NOTE: Libref Z was successfully assigned as follows: Engine: JSON Physical Name: (system-specific file/path name) 54 proc copy inlib=z outlib=work; run; NOTE: Copying Z.ALLDATA to WORK.ALLDATA (memtype=DATA). NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used. NOTE: The data set WORK.ALLDATA has 1917 observations and 9 variables. NOTE: Copying Z.ASSXXNOR to WORK.ASSXXNOR (memtype=DATA). NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used. NOTE: The data set WORK.ASSXXNOR has 2 observations and 3 variables. NOTE: Copying Z.ASSXXXOR to WORK.ASSXXXOR (memtype=DATA). NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used. NOTE: The data set WORK.ASSXXXOR has 1 observations and 3 variables. NOTE: Copying Z.AXXOUNTS to WORK.AXXOUNTS (memtype=DATA). NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used. NOTE: The data set WORK.AXXOUNTS has 1 observations and 12 variables. NOTE: Copying Z.AXXOUNTS_AXXXANAXERRELATXONSHX to WORK.AXXOUNTS_AXXXANAXERRELATXONSHX (memtype=DATA). NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used. NOTE: The data set WORK.AXXOUNTS_AXXXANAXERRELATXONSHX has 1 observations and 3 variables. NOTE: Copying Z.AXXOUNTS_AXXXANAXERRELATXONSHX2 to WORK.AXXOUNTS_AXXXANAXERRELATXONSHX2 (memtype=DATA). NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used. NOTE: The data set WORK.AXXOUNTS_AXXXANAXERRELATXONSHX2 has 1 observations and 3 variables. NOTE: Copying Z.AXXOUNTS_XNSTXTUTXON to WORK.AXXOUNTS_XNSTXTUTXON (memtype=DATA). NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used. NOTE: The data set WORK.AXXOUNTS_XNSTXTUTXON has 1 observations and 15 variables. NOTE: Copying Z.AXXOUNTS_XNSTXTUTXON2 to WORK.AXXOUNTS_XNSTXTUTXON2 (memtype=DATA). NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used. NOTE: The data set WORK.AXXOUNTS_XNSTXTUTXON2 has 1 observations and 15 variables. NOTE: Copying Z.AXXOUXTS to WORK.AXXOUXTS (memtype=DATA). NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used. NOTE: The data set WORK.AXXOUXTS has 1 observations and 12 variables. NOTE: Copying Z.AXXOUXTS_AXXXAXAXERRELATXOXSHX to WORK.AXXOUXTS_AXXXAXAXERRELATXOXSHX (memtype=DATA). NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used. NOTE: The data set WORK.AXXOUXTS_AXXXAXAXERRELATXOXSHX has 1 observations and 3 variables. NOTE: Copying Z.AXXOUXTS_AXXXAXAXERRELATXOXSHX2 to WORK.AXXOUXTS_AXXXAXAXERRELATXOXSHX2 (memtype=DATA). NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used. NOTE: The data set WORK.AXXOUXTS_AXXXAXAXERRELATXOXSHX2 has 4 observations and 3 variables. NOTE: Copying Z.AXXXANAXERRELATXONSHX_XUSTOXER to WORK.AXXXANAXERRELATXONSHX_XUSTOXER (memtype=DATA). NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used. NOTE: The data set WORK.AXXXANAXERRELATXONSHX_XUSTOXER has 1 observations and 16 variables. NOTE: Copying Z.AXXXANAXERRELATXONSHX_XUSTOXER2 to WORK.AXXXANAXERRELATXONSHX_XUSTOXER2 (memtype=DATA). NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used. NOTE: The data set WORK.AXXXANAXERRELATXONSHX_XUSTOXER2 has 1 observations and 16 variables. NOTE: Copying Z.AXXXAXAXERRELATXOXSHX_XUSTOXER to WORK.AXXXAXAXERRELATXOXSHX_XUSTOXER (memtype=DATA). NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used. NOTE: The data set WORK.AXXXAXAXERRELATXOXSHX_XUSTOXER has 1 observations and 16 variables. NOTE: Copying Z.AXXXAXAXERRELATXOXSHX_XUSTOXER2 to WORK.AXXXAXAXERRELATXOXSHX_XUSTOXER2 (memtype=DATA). NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used. NOTE: The data set WORK.AXXXAXAXERRELATXOXSHX_XUSTOXER2 has 4 observations and 16 variables. NOTE: Copying Z.OTHERXARTY_XNSTXTUTXON to WORK.OTHERXARTY_XNSTXTUTXON (memtype=DATA). NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used. NOTE: The data set WORK.OTHERXARTY_XNSTXTUTXON has 2 observations and 4 variables. NOTE: Copying Z.OTHERXARTY_XXSTXTUTXOX to WORK.OTHERXARTY_XXSTXTUTXOX (memtype=DATA). NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used. NOTE: The data set WORK.OTHERXARTY_XXSTXTUTXOX has 11 observations and 4 variables. NOTE: Copying Z.ROOT to WORK.ROOT (memtype=DATA). NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used. NOTE: The data set WORK.ROOT has 3 observations and 29 variables. NOTE: Copying Z.RXSKFAXTORXETAXLS to WORK.RXSKFAXTORXETAXLS (memtype=DATA). NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used. NOTE: The data set WORK.RXSKFAXTORXETAXLS has 3 observations and 4 variables. NOTE: Copying Z.STATEHXSTORYXATA to WORK.STATEHXSTORYXATA (memtype=DATA). NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used. NOTE: The data set WORK.STATEHXSTORYXATA has 9 observations and 5 variables. NOTE: Copying Z.SUXJEXTS_RELATXONSHXXS to WORK.SUXJEXTS_RELATXONSHXXS (memtype=DATA). NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used. NOTE: The data set WORK.SUXJEXTS_RELATXONSHXXS has 2 observations and 3 variables. NOTE: Copying Z.SUXJEXTS_RELATXOXSHXXS to WORK.SUXJEXTS_RELATXOXSHXXS (memtype=DATA). NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used. NOTE: The data set WORK.SUXJEXTS_RELATXOXSHXXS has 1 observations and 3 variables. NOTE: Copying Z.SUXJEXTS_XUSTOXER to WORK.SUXJEXTS_XUSTOXER (memtype=DATA). ....
Now al least the JSON is valid, even if it is not a very suitable structure for data analysis.
Since JSON uses backslash (\) to escape double quotes, this line:
if a = '\' then a=&repA;
may be the reason of your "unbalanced quotes".
If your JSON have something like:
{"data" : "Dwayne \"the rock\" Johnson"}
inside it, after that line in data step (assuming &repA. is for example a space) it gets:
{"data" : "Dwayne "the rock " Johnson"}
Just guessing.
Bart
Ok @yabwon you were spot on, I removed the escape from the code to make it a space, I thought these were special characters causing issues and it imported just fine.
Thank you everyone for your help
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.