Hello,
I have a .csv file with JSON data that looks as follows (sample of data below):
id | dep_date | dep_1_PA | date_2 | dep_2_PA |
1346 | 1/20/2019 | ['{"dep_count":4,"dep_duration":4,"happy_count":10,"qual":95,"end_datetime":"2019-01-25T06:44:30Z","date":"2019-01-25","mid":"2019-01-25T04:08:00Z","counter_report":0,"counter_nap":297,"counter_dep":15,"counter_to_eat":0}' '{"dep_count":0,"dep_duration":0,"happy_count":5,"qual":96,"end_datetime":"2019-01-26T07:15:30Z","date":"2019-01-26","mid":"2019-01-26T04:39:30Z","counter_report":0,"counter_nap":300,"counter_dep":13,"counter_to_eat":0}' '{"dep_count":1,"dep_duration":1,"happy_count":4,"qual":99,"end_datetime":"2019-01-27T06:46:00Z","date":"2019-01-27","mid":"2019-01-27T04:13:30Z","counter_report":6,"counter_nap":296,"counter_dep":2,"counter_to_eat":0}'] |
2/19/2019 | ['{"dep_count":0,"dep_duration":0,"happy_count":15,"qual":96,"end_datetime":"2019-02-04T12:25:30Z","date":"2019-02-04","mid":"2019-02-04T09:01:30Z","counter_report":0,"counter_nap":392,"counter_dep":17,"counter_to_eat":0} '{"dep_count":4,"dep_duration":7,"happy_count":13,"qual":93,"end_datetime":"2019-02-05T09:09:30Z","date":"2019-02-05","mid":"2019-02-05T06:13:00Z","counter_report":0,"counter_nap":327,"counter_dep":26,"counter_to_eat":0}' '{"dep_count":1,"dep_duration":3,"happy_count":12,"qual":95,"end_datetime":"2019-02-06T12:05:30Z","date":"2019-02-06","mid":"2019-02-06T08:12:00Z","counter_report":1,"counter_nap":444,"counter_dep":21,"counter_to_eat":0}'] |
1786 | 3/14/2019 | ['{"dep_count":2,"dep_duration":2,"happy_count":8,"qual":97,"end_datetime":"2019-03-10T05:39:00Z","date":"2019-03-10","mid":"2019-03-10T03:00:30Z","counter_report":0,"counter_nap":307,"counter_dep":9,"counter_to_eat":0}' '{"dep_count":0,"dep_duration":0,"happy_count":5,"qual":99,"end_datetime":"2019-03-11T04:38:30Z","date":"2019-03-11","mid":"2019-03-11T01:50:30Z","counter_report":0,"counter_nap":332,"counter_dep":5,"counter_to_eat":0}' '{"dep_count":1,"dep_duration":4,"happy_count":15,"qual":92,"end_datetime":"2019-03-12T05:49:00Z","date":"2019-03-12","mid":"2019-03-12T02:16:00Z","counter_report":0,"counter_nap":393,"counter_dep":34,"counter_to_eat":0}'] |
4/14/2019 | [] |
I need it to look like this:
ID | dep_date | dep_count | dep_duration | happy_count | qual | end_datetime | date | mid | counter_report | counter_nap | counter_dep | counter_to_eat |
1346 | 1/20/2019 | 4 | 4 | 10 | 95 | 2019-01-25T06:44:30Z | 1/25/2019 | 2019-01-25T04:08:00Z | 0 | 297 | 15 | 0 |
1346 | 1/20/2019 | 0 | 0 | 5 | 96 | 2019-01-26T07:15:30Z | 1/26/2019 | 2019-01-26T04:39:30Z | 0 | 300 | 13 | 0 |
1346 | 1/20/2019 | 1 | 1 | 4 | 99 | 2019-01-27T06:46:00Z | 1/27/2019 | 2019-01-27T04:13:30Z | 6 | 296 | 2 | 0 |
1346 | 2/19/2019 | 0 | 0 | 15 | 96 | 2019-02-04T12:25:30Z | 2/4/2019 | 2019-02-04T09:01:30Z | 0 | 392 | 17 | 0 |
1346 | 2/19/2019 | 4 | 7 | 13 | 93 | 2019-02-05T09:09:30Z | 2/5/2019 | 2019-02-05T06:13:00Z | 0 | 327 | 26 | 0 |
1346 | 2/19/2019 | 1 | 3 | 12 | 95 | 2019-02-06T12:05:30Z | 2/6/2019 | 2019-02-06T08:12:00Z | 1 | 444 | 21 | 0 |
1786 | 3/14/2019 | 2 | 2 | 8 | 97 | 2019-03-10T05:39:00Z | 3/10/2019 | 2019-03-10T03:00:30Z | 0 | 307 | 9 | 0 |
1786 | 3/14/2019 | 0 | 0 | 5 | 99 | 2019-03-11T04:38:30Z | 3/11/2019 | 2019-03-11T01:50:30Z | 0 | 332 | 5 | 0 |
1786 | 3/14/2019 | 1 | 4 | 15 | 92 | 2019-03-12T05:49:00Z | 3/12/2019 | 2019-03-12T02:16:00Z | 0 | 393 | 34 | 0 |
1786 | 4/14/2019 | |||||||||||
1786 | 4/14/2019 | |||||||||||
1786 | 4/14/2019 |
|
Is there a way to parse this JSON data in SAS? I can't do it in Excel. Thanks.
Once you have the JSON data properly extracted from the CSV file you can finally turn to trying to parse it.
data tall;
set json(keep=id type rep json);
where json ne ' ';
length line row 8 name $32 value $200 term $300 record $1500;
do line=1 to countw(json,'|');
record=left(scan(json,line,'|'));
do row=1 by 1 until(term=' ');
term=scan(record,row,"[{,'}]"||'A0'x);
term=translate(term,' ','A0'x);
name=dequote(scan(term,1,':','q'));
value=dequote(scan(term,2,':','q'));
if term ne ' ' then output;
end;
end;
drop json term record;
run;
proc sort;
by id type rep line row;
run;
proc transpose data=tall out=wide(drop=_name_) ;
where name ne 'id';
by id type rep line ;
id name;
var value;
run;
Looks like you also might need to do a little more cleaning. It appears some of the JSON strings have MID instead of MIDPOINT for one of the timestamp variables.
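If the two spellings need harmonizing before transposing, a small recode of the NAME values could do it. This is only a sketch; whether "mid" or "midpoint" is the canonical spelling is an assumption (the sample data shows "mid"):

```sas
data tall;
  set tall;
  /* harmonize key-name variants; mapping MIDPOINT down to mid is assumed */
  if upcase(name)='MIDPOINT' then name='mid';
run;
```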
Please attach a representative sample .csv. We need to understand in detail what your .csv looks like - especially where the line feeds are. Make sure the data you share really looks like the source the SAS code needs to process (i.e., avoid any intermediary step using Excel when providing the data).
I don't see how I can attach the .csv to a post, so I've uploaded it here
@confooseddesi89 wrote:
I don't see how I can attach the .csv to a post, so I've uploaded it here
@confooseddesi89 Just drag&drop. I've done it here now because many people won't download from some link.
The following code won't return the final result you're likely after, but it should take you one step further. Please don't just state what's still missing; tell us where you need further help.
One of the challenges with your data: the JSON needs some clean-up before it can be used with a JSON engine. The question is where all this additional quoting got added.
Could it be that the .csv you've shared is some export from a database where the json string is stored in a CLOB field? Would you in the end directly access the database? And if so: Which database?
%let source_path=C:\temp;
data step1;
infile "&source_path\dep_PA to paste sample v3 for sas forum.csv"
firstobs=2
lrecl=10000
encoding="utf-8"
termstr=crlf
dlm=','
truncover
dsd
;
length id dep 8;
format dep_date date9.;
input id dep_date:mmddyy10. dep_PA:$2000. @;
dep=1;
dep_pa=transtrn(dep_pa,"['",'[');
dep_pa=transtrn(dep_pa,"']",']');
dep_pa=transtrn(dep_pa,'""','"');
output;
input dep_date:mmddyy10. dep_PA:$2000.;
dep=2;
dep_pa=transtrn(dep_pa,"['",'[');
dep_pa=transtrn(dep_pa,"']",']');
dep_pa=transtrn(dep_pa,'""','"');
output;
run;
data step2;
set step1;
length name $32 value $40 ;
do i=1 by 1 until(name=' ');
name=dequote(scan(scan(dep_pa,i,"{{,'}]"),1,':'));
value=scan(scan(dep_pa,i,"{{,'}]"),2,':');
if name ne ' ' then output;
end;
run;
You have posted a listing, not a CSV file. A CSV file is a text file with only one line per observation.
The listing you posted appears to have line breaks in the middle of some of the cells. That will not work.
So the first step is to make sure you actually have just one line of text per observation in the CSV file.
Also, it would be a lot easier to use something other than a comma as the separator in your text file, since the comma is a common character in JSON text. So perhaps use a pipe character as the delimiter, or a TAB if you never have to actually look at the file with an editor. TAB characters are normally invisible, which makes files that contain them hard to view or modify.
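For example, with a pipe-delimited file only the DLM option changes. A sketch, where the file name and variable lengths are assumptions:

```sas
data have;
  /* pipe as delimiter avoids clashing with the commas inside the JSON */
  infile 'c:\temp\have.txt' dlm='|' dsd truncover firstobs=2 lrecl=10000;
  length id 8 dep_1_PA dep_2_PA $2000;
  input id dep_date :mmddyy10. dep_1_PA date_2 :mmddyy10. dep_2_PA;
  format dep_date date_2 mmddyy10.;
run;
```

For a TAB-delimited file, use dlm='09'x instead.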
The text you posted also does not look like JSON text to me. What is that extra single quote doing at the beginning?
['{"dep_count":4,"dep_duration":4,"happy_count":10
It is hard to tell from your post what the values in the DEP_1_PA and DEP_2_PA fields really contain, but if they are consistently structured as those NAME:VALUE pairs then perhaps you can just use SCAN() to help you.
Try doing something like this to see what you get.
data tall;
set have;
length variable name $32 value $40 ;
array x DEP_1_PA DEP_2_PA;
do over x;
variable = scan(vname(x),2,'_');
do i=1 by 1 until(name=' ');
name=dequote(scan(scan(x,i,"{{,'}]"),1,':'));
value=scan(scan(x,i,"{{,'}]"),2,':');
if name ne ' ' then output;
end;
end;
drop DEP_1_PA DEP_2_PA;
run;
So you should now have a dataset with ID, DEP_DATE, DATE_2, VARIABLE, NAME and VALUE.
You can then try re-ordering and using PROC TRANSPOSE.
proc sort data=tall;
by id dep_date date_2 name variable ;
run;
proc transpose data=tall out=want;
by id dep_date date_2;
id name variable;
var value;
run;
Hello,
Using the code you gave me, I get this error:
ERROR: The ID value "__1" occurs twice in the same BY group.
ERROR: The ID value "counter_dep1" occurs twice in the same BY group.
ERROR: The ID value "counter_dep1" occurs twice in the same BY group.
ERROR: The ID value "counter_dep2" occurs twice in the same BY group.
ERROR: The ID value "counter_dep2" occurs twice in the same BY group.
ERROR: The ID value "counter_nap1" occurs twice in the same BY group.
ERROR: The ID value "counter_nap1" occurs twice in the same BY group.
ERROR: The ID value "counter_nap2" occurs twice in the same BY group.
ERROR: The ID value "counter_nap2" occurs twice in the same BY group.
ERROR: The ID value "counter_report1" occurs twice in the same BY group.
ERROR: The ID value "counter_report1" occurs twice in the same BY group.
ERROR: The ID value "counter_report2" occurs twice in the same BY group.
ERROR: The ID value "counter_report2" occurs twice in the same BY group.
ERROR: The ID value "counter_to_eat1" occurs twice in the same BY group.
ERROR: The ID value "counter_to_eat1" occurs twice in the same BY group.
ERROR: The ID value "counter_to_eat2" occurs twice in the same BY group.
ERROR: The ID value "counter_to_eat2" occurs twice in the same BY group.
ERROR: The ID value "date1" occurs twice in the same BY group.
ERROR: The ID value "date1" occurs twice in the same BY group.
ERROR: The ID value "date2" occurs twice in the same BY group.
ERROR: The ID value "date2" occurs twice in the same BY group.
ERROR: The ID value "dep_count1" occurs twice in the same BY group.
ERROR: The ID value "dep_count1" occurs twice in the same BY group.
ERROR: The ID value "dep_count2" occurs twice in the same BY group.
ERROR: The ID value "dep_count2" occurs twice in the same BY group.
ERROR: The ID value "dep_duration1" occurs twice in the same BY group.
ERROR: The ID value "dep_duration1" occurs twice in the same BY group.
ERROR: The ID value "dep_duration2" occurs twice in the same BY group.
ERROR: The ID value "dep_duration2" occurs twice in the same BY group.
ERROR: The ID value "end_datetime1" occurs twice in the same BY group.
ERROR: The ID value "end_datetime1" occurs twice in the same BY group.
ERROR: The ID value "end_datetime2" occurs twice in the same BY group.
ERROR: The ID value "end_datetime2" occurs twice in the same BY group.
ERROR: The ID value "happy_count1" occurs twice in the same BY group.
ERROR: The ID value "happy_count1" occurs twice in the same BY group.
ERROR: The ID value "happy_count2" occurs twice in the same BY group.
ERROR: The ID value "happy_count2" occurs twice in the same BY group.
ERROR: The ID value "mid1" occurs twice in the same BY group.
ERROR: The ID value "mid1" occurs twice in the same BY group.
ERROR: The ID value "mid2" occurs twice in the same BY group.
ERROR: The ID value "mid2" occurs twice in the same BY group.
ERROR: The ID value "qual1" occurs twice in the same BY group.
ERROR: The ID value "qual1" occurs twice in the same BY group.
ERROR: The ID value "qual2" occurs twice in the same BY group.
ERROR: The ID value "qual2" occurs twice in the same BY group.
NOTE: The above message was for the following BY group:
id=1346 dep_date=01/20/2019 date_2=02/19/2019
ERROR: The ID value "__1" occurs twice in the same BY group.
ERROR: The ID value "counter_dep1" occurs twice in the same BY group.
ERROR: The ID value "counter_dep1" occurs twice in the same BY group.
ERROR: The ID value "counter_nap1" occurs twice in the same BY group.
ERROR: The ID value "counter_nap1" occurs twice in the same BY group.
ERROR: The ID value "counter_report1" occurs twice in the same BY group.
ERROR: The ID value "counter_report1" occurs twice in the same BY group.
ERROR: The ID value "counter_to_eat1" occurs twice in the same BY group.
ERROR: The ID value "counter_to_eat1" occurs twice in the same BY group.
ERROR: The ID value "date1" occurs twice in the same BY group.
ERROR: The ID value "date1" occurs twice in the same BY group.
ERROR: The ID value "dep_count1" occurs twice in the same BY group.
ERROR: The ID value "dep_count1" occurs twice in the same BY group.
ERROR: The ID value "dep_duration1" occurs twice in the same BY group.
ERROR: The ID value "dep_duration1" occurs twice in the same BY group.
ERROR: The ID value "end_datetime1" occurs twice in the same BY group.
ERROR: The ID value "end_datetime1" occurs twice in the same BY group.
ERROR: The ID value "happy_count1" occurs twice in the same BY group.
ERROR: The ID value "happy_count1" occurs twice in the same BY group.
ERROR: The ID value "mid1" occurs twice in the same BY group.
ERROR: The ID value "mid1" occurs twice in the same BY group.
ERROR: The ID value "qual1" occurs twice in the same BY group.
ERROR: The ID value "qual1" occurs twice in the same BY group.
NOTE: The above message was for the following BY group:
id=1786 dep_date=03/14/2019 date_2=04/14/2019
ERROR: All BY groups were bad.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 109 observations read from the data set WORK.TALL.
WARNING: The data set WORK.WANT may be incomplete. When this step was stopped there were 0
observations and 0 variables.
WARNING: Data set WORK.WANT was not replaced because this step was stopped.
NOTE: PROCEDURE TRANSPOSE used (Total process time):
real time 0.02 seconds
cpu time 0.01 seconds
Also, the "tall" dataset looks like below:
id | dep_date | date_2 | variable | name | value | i |
1346 | 1/20/2019 | 2/19/2019 | 1 | 13 | ||
1346 | 1/20/2019 | 2/19/2019 | 1 | 25 | ||
1346 | 1/20/2019 | 2/19/2019 | 2 | 25 | ||
1346 | 1/20/2019 | 2/19/2019 | 2 | 13 | ||
1346 | 1/20/2019 | 2/19/2019 | 1 | [ | 1 | |
1346 | 1/20/2019 | 2/19/2019 | 2 | [ | 1 | |
1346 | 1/20/2019 | 2/19/2019 | 1 | counter_dep | 15 | 11 |
1346 | 1/20/2019 | 2/19/2019 | 1 | counter_dep | 13 | 23 |
1346 | 1/20/2019 | 2/19/2019 | 1 | counter_dep | 2 | 35 |
1346 | 1/20/2019 | 2/19/2019 | 2 | counter_dep | 17 | 11 |
1346 | 1/20/2019 | 2/19/2019 | 2 | counter_dep | 26 | 23 |
1346 | 1/20/2019 | 2/19/2019 | 2 | counter_dep | 21 | 35 |
1346 | 1/20/2019 | 2/19/2019 | 1 | counter_nap | 297 | 10 |
1346 | 1/20/2019 | 2/19/2019 | 1 | counter_nap | 300 | 22 |
1346 | 1/20/2019 | 2/19/2019 | 1 | counter_nap | 296 | 34 |
1346 | 1/20/2019 | 2/19/2019 | 2 | counter_nap | 392 | 10 |
1346 | 1/20/2019 | 2/19/2019 | 2 | counter_nap | 327 | 22 |
1346 | 1/20/2019 | 2/19/2019 | 2 | counter_nap | 444 | 34 |
1346 | 1/20/2019 | 2/19/2019 | 1 | counter_report | 0 | 9 |
1346 | 1/20/2019 | 2/19/2019 | 1 | counter_report | 0 | 21 |
1346 | 1/20/2019 | 2/19/2019 | 1 | counter_report | 6 | 33 |
1346 | 1/20/2019 | 2/19/2019 | 2 | counter_report | 0 | 9 |
1346 | 1/20/2019 | 2/19/2019 | 2 | counter_report | 0 | 21 |
1346 | 1/20/2019 | 2/19/2019 | 2 | counter_report | 1 | 33 |
1346 | 1/20/2019 | 2/19/2019 | 1 | counter_to_eat | 0 | 12 |
1346 | 1/20/2019 | 2/19/2019 | 1 | counter_to_eat | 0 | 24 |
1346 | 1/20/2019 | 2/19/2019 | 1 | counter_to_eat | 0 | 36 |
1346 | 1/20/2019 | 2/19/2019 | 2 | counter_to_eat | 0 | 12 |
1346 | 1/20/2019 | 2/19/2019 | 2 | counter_to_eat | 0 | 24 |
1346 | 1/20/2019 | 2/19/2019 | 2 | counter_to_eat | 0 | 36 |
1346 | 1/20/2019 | 2/19/2019 | 1 | date | "2019-01-25" | 7 |
1346 | 1/20/2019 | 2/19/2019 | 1 | date | "2019-01-26" | 19 |
1346 | 1/20/2019 | 2/19/2019 | 1 | date | "2019-01-27" | 31 |
1346 | 1/20/2019 | 2/19/2019 | 2 | date | "2019-02-04" | 7 |
1346 | 1/20/2019 | 2/19/2019 | 2 | date | "2019-02-05" | 19 |
1346 | 1/20/2019 | 2/19/2019 | 2 | date | "2019-02-06" | 31 |
1346 | 1/20/2019 | 2/19/2019 | 1 | dep_count | 4 | 2 |
1346 | 1/20/2019 | 2/19/2019 | 1 | dep_count | 0 | 14 |
1346 | 1/20/2019 | 2/19/2019 | 1 | dep_count | 1 | 26 |
1346 | 1/20/2019 | 2/19/2019 | 2 | dep_count | 0 | 2 |
1346 | 1/20/2019 | 2/19/2019 | 2 | dep_count | 4 | 14 |
1346 | 1/20/2019 | 2/19/2019 | 2 | dep_count | 1 | 26 |
1346 | 1/20/2019 | 2/19/2019 | 1 | dep_duration | 4 | 3 |
1346 | 1/20/2019 | 2/19/2019 | 1 | dep_duration | 0 | 15 |
1346 | 1/20/2019 | 2/19/2019 | 1 | dep_duration | 1 | 27 |
1346 | 1/20/2019 | 2/19/2019 | 2 | dep_duration | 0 | 3 |
1346 | 1/20/2019 | 2/19/2019 | 2 | dep_duration | 7 | 15 |
1346 | 1/20/2019 | 2/19/2019 | 2 | dep_duration | 3 | 27 |
1346 | 1/20/2019 | 2/19/2019 | 1 | end_datetime | "2019-01-25T06 | 6 |
1346 | 1/20/2019 | 2/19/2019 | 1 | end_datetime | "2019-01-26T07 | 18 |
1346 | 1/20/2019 | 2/19/2019 | 1 | end_datetime | "2019-01-27T06 | 30 |
1346 | 1/20/2019 | 2/19/2019 | 2 | end_datetime | "2019-02-04T12 | 6 |
1346 | 1/20/2019 | 2/19/2019 | 2 | end_datetime | "2019-02-05T09 | 18 |
1346 | 1/20/2019 | 2/19/2019 | 2 | end_datetime | "2019-02-06T12 | 30 |
1346 | 1/20/2019 | 2/19/2019 | 1 | happy_count | 10 | 4 |
1346 | 1/20/2019 | 2/19/2019 | 1 | happy_count | 5 | 16 |
1346 | 1/20/2019 | 2/19/2019 | 1 | happy_count | 4 | 28 |
1346 | 1/20/2019 | 2/19/2019 | 2 | happy_count | 15 | 4 |
1346 | 1/20/2019 | 2/19/2019 | 2 | happy_count | 13 | 16 |
1346 | 1/20/2019 | 2/19/2019 | 2 | happy_count | 12 | 28 |
1346 | 1/20/2019 | 2/19/2019 | 1 | mid | "2019-01-25T04 | 8 |
1346 | 1/20/2019 | 2/19/2019 | 1 | mid | "2019-01-26T04 | 20 |
1346 | 1/20/2019 | 2/19/2019 | 1 | mid | "2019-01-27T04 | 32 |
1346 | 1/20/2019 | 2/19/2019 | 2 | mid | "2019-02-04T09 | 8 |
1346 | 1/20/2019 | 2/19/2019 | 2 | mid | "2019-02-05T06 | 20 |
1346 | 1/20/2019 | 2/19/2019 | 2 | mid | "2019-02-06T08 | 32 |
1346 | 1/20/2019 | 2/19/2019 | 1 | qual | 95 | 5 |
1346 | 1/20/2019 | 2/19/2019 | 1 | qual | 96 | 17 |
1346 | 1/20/2019 | 2/19/2019 | 1 | qual | 99 | 29 |
1346 | 1/20/2019 | 2/19/2019 | 2 | qual | 96 | 5 |
1346 | 1/20/2019 | 2/19/2019 | 2 | qual | 93 | 17 |
1346 | 1/20/2019 | 2/19/2019 | 2 | qual | 95 | 29 |
1786 | 3/14/2019 | 4/14/2019 | 1 | 13 | ||
1786 | 3/14/2019 | 4/14/2019 | 1 | 25 | ||
1786 | 3/14/2019 | 4/14/2019 | 1 | [ | 1 | |
1786 | 3/14/2019 | 4/14/2019 | 2 | [ | 1 | |
1786 | 3/14/2019 | 4/14/2019 | 1 | counter_dep | 9 | 11 |
1786 | 3/14/2019 | 4/14/2019 | 1 | counter_dep | 5 | 23 |
1786 | 3/14/2019 | 4/14/2019 | 1 | counter_dep | 34 | 35 |
1786 | 3/14/2019 | 4/14/2019 | 1 | counter_nap | 307 | 10 |
1786 | 3/14/2019 | 4/14/2019 | 1 | counter_nap | 332 | 22 |
1786 | 3/14/2019 | 4/14/2019 | 1 | counter_nap | 393 | 34 |
1786 | 3/14/2019 | 4/14/2019 | 1 | counter_report | 0 | 9 |
1786 | 3/14/2019 | 4/14/2019 | 1 | counter_report | 0 | 21 |
1786 | 3/14/2019 | 4/14/2019 | 1 | counter_report | 0 | 33 |
1786 | 3/14/2019 | 4/14/2019 | 1 | counter_to_eat | 0 | 12 |
1786 | 3/14/2019 | 4/14/2019 | 1 | counter_to_eat | 0 | 24 |
1786 | 3/14/2019 | 4/14/2019 | 1 | counter_to_eat | 0 | 36 |
1786 | 3/14/2019 | 4/14/2019 | 1 | date | "2019-03-10" | 7 |
1786 | 3/14/2019 | 4/14/2019 | 1 | date | "2019-03-11" | 19 |
1786 | 3/14/2019 | 4/14/2019 | 1 | date | "2019-03-12" | 31 |
1786 | 3/14/2019 | 4/14/2019 | 1 | dep_count | 2 | 2 |
1786 | 3/14/2019 | 4/14/2019 | 1 | dep_count | 0 | 14 |
1786 | 3/14/2019 | 4/14/2019 | 1 | dep_count | 1 | 26 |
1786 | 3/14/2019 | 4/14/2019 | 1 | dep_duration | 2 | 3 |
1786 | 3/14/2019 | 4/14/2019 | 1 | dep_duration | 0 | 15 |
1786 | 3/14/2019 | 4/14/2019 | 1 | dep_duration | 4 | 27 |
1786 | 3/14/2019 | 4/14/2019 | 1 | end_datetime | "2019-03-10T05 | 6 |
1786 | 3/14/2019 | 4/14/2019 | 1 | end_datetime | "2019-03-11T04 | 18 |
1786 | 3/14/2019 | 4/14/2019 | 1 | end_datetime | "2019-03-12T05 | 30 |
1786 | 3/14/2019 | 4/14/2019 | 1 | happy_count | 8 | 4 |
1786 | 3/14/2019 | 4/14/2019 | 1 | happy_count | 5 | 16 |
1786 | 3/14/2019 | 4/14/2019 | 1 | happy_count | 15 | 28 |
1786 | 3/14/2019 | 4/14/2019 | 1 | mid | "2019-03-10T03 | 8 |
1786 | 3/14/2019 | 4/14/2019 | 1 | mid | "2019-03-11T01 | 20 |
1786 | 3/14/2019 | 4/14/2019 | 1 | mid | "2019-03-12T02 | 32 |
1786 | 3/14/2019 | 4/14/2019 | 1 | qual | 97 | 5 |
1786 | 3/14/2019 | 4/14/2019 | 1 | qual | 99 | 17 |
1786 | 3/14/2019 | 4/14/2019 | 1 | qual | 92 | 29 |
The parsing by dep_date / date_2 is incorrect. For example, the values 15, 13, and 2 in the "value" column belong to date 1/20/2019, whereas the values 17, 26, and 21 belong to date 2/19/2019. None of these values belong to both of these dates (the second and third columns should be collapsed). I used the code below:
data tall;
set tall;
if variable=2 then dep_date=date_2;
drop date_2;
run;
To get this:
id | dep_date | variable | name | value | i |
1346 | 1/20/2019 | 1 | [ | 1 | |
1346 | 1/20/2019 | 1 | dep_count | 4 | 2 |
1346 | 1/20/2019 | 1 | dep_duration | 4 | 3 |
1346 | 1/20/2019 | 1 | happy_count | 10 | 4 |
1346 | 1/20/2019 | 1 | qual | 95 | 5 |
1346 | 1/20/2019 | 1 | end_datetime | "2019-01-25T06 | 6 |
1346 | 1/20/2019 | 1 | date | "2019-01-25" | 7 |
1346 | 1/20/2019 | 1 | mid | "2019-01-25T04 | 8 |
1346 | 1/20/2019 | 1 | counter_report | 0 | 9 |
1346 | 1/20/2019 | 1 | counter_nap | 297 | 10 |
1346 | 1/20/2019 | 1 | counter_dep | 15 | 11 |
1346 | 1/20/2019 | 1 | counter_to_eat | 0 | 12 |
1346 | 1/20/2019 | 1 | 13 | ||
1346 | 1/20/2019 | 1 | dep_count | 0 | 14 |
1346 | 1/20/2019 | 1 | dep_duration | 0 | 15 |
1346 | 1/20/2019 | 1 | happy_count | 5 | 16 |
1346 | 1/20/2019 | 1 | qual | 96 | 17 |
1346 | 1/20/2019 | 1 | end_datetime | "2019-01-26T07 | 18 |
1346 | 1/20/2019 | 1 | date | "2019-01-26" | 19 |
1346 | 1/20/2019 | 1 | mid | "2019-01-26T04 | 20 |
1346 | 1/20/2019 | 1 | counter_report | 0 | 21 |
1346 | 1/20/2019 | 1 | counter_nap | 300 | 22 |
1346 | 1/20/2019 | 1 | counter_dep | 13 | 23 |
1346 | 1/20/2019 | 1 | counter_to_eat | 0 | 24 |
1346 | 1/20/2019 | 1 | 25 | ||
1346 | 1/20/2019 | 1 | dep_count | 1 | 26 |
1346 | 1/20/2019 | 1 | dep_duration | 1 | 27 |
1346 | 1/20/2019 | 1 | happy_count | 4 | 28 |
1346 | 1/20/2019 | 1 | qual | 99 | 29 |
1346 | 1/20/2019 | 1 | end_datetime | "2019-01-27T06 | 30 |
1346 | 1/20/2019 | 1 | date | "2019-01-27" | 31 |
1346 | 1/20/2019 | 1 | mid | "2019-01-27T04 | 32 |
1346 | 1/20/2019 | 1 | counter_report | 6 | 33 |
1346 | 1/20/2019 | 1 | counter_nap | 296 | 34 |
1346 | 1/20/2019 | 1 | counter_dep | 2 | 35 |
1346 | 1/20/2019 | 1 | counter_to_eat | 0 | 36 |
1346 | 2/19/2019 | 2 | [ | 1 | |
1346 | 2/19/2019 | 2 | dep_count | 0 | 2 |
1346 | 2/19/2019 | 2 | dep_duration | 0 | 3 |
1346 | 2/19/2019 | 2 | happy_count | 15 | 4 |
1346 | 2/19/2019 | 2 | qual | 96 | 5 |
1346 | 2/19/2019 | 2 | end_datetime | "2019-02-04T12 | 6 |
1346 | 2/19/2019 | 2 | date | "2019-02-04" | 7 |
1346 | 2/19/2019 | 2 | mid | "2019-02-04T09 | 8 |
1346 | 2/19/2019 | 2 | counter_report | 0 | 9 |
1346 | 2/19/2019 | 2 | counter_nap | 392 | 10 |
1346 | 2/19/2019 | 2 | counter_dep | 17 | 11 |
1346 | 2/19/2019 | 2 | counter_to_eat | 0 | 12 |
1346 | 2/19/2019 | 2 | 13 | ||
1346 | 2/19/2019 | 2 | dep_count | 4 | 14 |
1346 | 2/19/2019 | 2 | dep_duration | 7 | 15 |
1346 | 2/19/2019 | 2 | happy_count | 13 | 16 |
1346 | 2/19/2019 | 2 | qual | 93 | 17 |
1346 | 2/19/2019 | 2 | end_datetime | "2019-02-05T09 | 18 |
1346 | 2/19/2019 | 2 | date | "2019-02-05" | 19 |
1346 | 2/19/2019 | 2 | mid | "2019-02-05T06 | 20 |
1346 | 2/19/2019 | 2 | counter_report | 0 | 21 |
1346 | 2/19/2019 | 2 | counter_nap | 327 | 22 |
1346 | 2/19/2019 | 2 | counter_dep | 26 | 23 |
1346 | 2/19/2019 | 2 | counter_to_eat | 0 | 24 |
1346 | 2/19/2019 | 2 | 25 | ||
1346 | 2/19/2019 | 2 | dep_count | 1 | 26 |
1346 | 2/19/2019 | 2 | dep_duration | 3 | 27 |
1346 | 2/19/2019 | 2 | happy_count | 12 | 28 |
1346 | 2/19/2019 | 2 | qual | 95 | 29 |
1346 | 2/19/2019 | 2 | end_datetime | "2019-02-06T12 | 30 |
1346 | 2/19/2019 | 2 | date | "2019-02-06" | 31 |
1346 | 2/19/2019 | 2 | mid | "2019-02-06T08 | 32 |
1346 | 2/19/2019 | 2 | counter_report | 1 | 33 |
1346 | 2/19/2019 | 2 | counter_nap | 444 | 34 |
1346 | 2/19/2019 | 2 | counter_dep | 21 | 35 |
1346 | 2/19/2019 | 2 | counter_to_eat | 0 | 36 |
1786 | 3/14/2019 | 1 | [ | 1 | |
1786 | 3/14/2019 | 1 | dep_count | 2 | 2 |
1786 | 3/14/2019 | 1 | dep_duration | 2 | 3 |
1786 | 3/14/2019 | 1 | happy_count | 8 | 4 |
1786 | 3/14/2019 | 1 | qual | 97 | 5 |
1786 | 3/14/2019 | 1 | end_datetime | "2019-03-10T05 | 6 |
1786 | 3/14/2019 | 1 | date | "2019-03-10" | 7 |
1786 | 3/14/2019 | 1 | mid | "2019-03-10T03 | 8 |
1786 | 3/14/2019 | 1 | counter_report | 0 | 9 |
1786 | 3/14/2019 | 1 | counter_nap | 307 | 10 |
1786 | 3/14/2019 | 1 | counter_dep | 9 | 11 |
1786 | 3/14/2019 | 1 | counter_to_eat | 0 | 12 |
1786 | 3/14/2019 | 1 | 13 | ||
1786 | 3/14/2019 | 1 | dep_count | 0 | 14 |
1786 | 3/14/2019 | 1 | dep_duration | 0 | 15 |
1786 | 3/14/2019 | 1 | happy_count | 5 | 16 |
1786 | 3/14/2019 | 1 | qual | 99 | 17 |
1786 | 3/14/2019 | 1 | end_datetime | "2019-03-11T04 | 18 |
1786 | 3/14/2019 | 1 | date | "2019-03-11" | 19 |
1786 | 3/14/2019 | 1 | mid | "2019-03-11T01 | 20 |
1786 | 3/14/2019 | 1 | counter_report | 0 | 21 |
1786 | 3/14/2019 | 1 | counter_nap | 332 | 22 |
1786 | 3/14/2019 | 1 | counter_dep | 5 | 23 |
1786 | 3/14/2019 | 1 | counter_to_eat | 0 | 24 |
1786 | 3/14/2019 | 1 | 25 | ||
1786 | 3/14/2019 | 1 | dep_count | 1 | 26 |
1786 | 3/14/2019 | 1 | dep_duration | 4 | 27 |
1786 | 3/14/2019 | 1 | happy_count | 15 | 28 |
1786 | 3/14/2019 | 1 | qual | 92 | 29 |
1786 | 3/14/2019 | 1 | end_datetime | "2019-03-12T05 | 30 |
1786 | 3/14/2019 | 1 | date | "2019-03-12" | 31 |
1786 | 3/14/2019 | 1 | mid | "2019-03-12T02 | 32 |
1786 | 3/14/2019 | 1 | counter_report | 0 | 33 |
1786 | 3/14/2019 | 1 | counter_nap | 393 | 34 |
1786 | 3/14/2019 | 1 | counter_dep | 34 | 35 |
1786 | 3/14/2019 | 1 | counter_to_eat | 0 | 36 |
1786 | 4/14/2019 | 2 | [ | 1 |
I then used the PROC TRANSPOSE code you gave me, deleting "date_2" and "variable" (which should not be in the ID statement, as the data should remain by dep_date/variable), and I still got the error "The ID value [name] occurs twice in the same BY group." See below:
ERROR: The ID value "__" occurs twice in the same BY group.
ERROR: The ID value "counter_dep" occurs twice in the same BY group.
ERROR: The ID value "counter_dep" occurs twice in the same BY group.
ERROR: The ID value "counter_nap" occurs twice in the same BY group.
ERROR: The ID value "counter_nap" occurs twice in the same BY group.
ERROR: The ID value "counter_report" occurs twice in the same BY group.
ERROR: The ID value "counter_report" occurs twice in the same BY group.
ERROR: The ID value "counter_to_eat" occurs twice in the same BY group.
ERROR: The ID value "counter_to_eat" occurs twice in the same BY group.
ERROR: The ID value "date" occurs twice in the same BY group.
ERROR: The ID value "date" occurs twice in the same BY group.
ERROR: The ID value "dep_count" occurs twice in the same BY group.
ERROR: The ID value "dep_count" occurs twice in the same BY group.
ERROR: The ID value "dep_duration" occurs twice in the same BY group.
ERROR: The ID value "dep_duration" occurs twice in the same BY group.
ERROR: The ID value "end_datetime" occurs twice in the same BY group.
ERROR: The ID value "end_datetime" occurs twice in the same BY group.
ERROR: The ID value "happy_count" occurs twice in the same BY group.
ERROR: The ID value "happy_count" occurs twice in the same BY group.
ERROR: The ID value "mid" occurs twice in the same BY group.
ERROR: The ID value "mid" occurs twice in the same BY group.
ERROR: The ID value "qual" occurs twice in the same BY group.
ERROR: The ID value "qual" occurs twice in the same BY group.
NOTE: The above message was for the following BY group:
id=1346 dep_date=01/20/2019
ERROR: The ID value "counter_dep" occurs twice in the same BY group.
ERROR: The ID value "counter_dep" occurs twice in the same BY group.
ERROR: The ID value "counter_nap" occurs twice in the same BY group.
ERROR: The ID value "counter_nap" occurs twice in the same BY group.
ERROR: The ID value "counter_report" occurs twice in the same BY group.
ERROR: The ID value "counter_report" occurs twice in the same BY group.
ERROR: The ID value "counter_to_eat" occurs twice in the same BY group.
ERROR: The ID value "counter_to_eat" occurs twice in the same BY group.
ERROR: The ID value "date" occurs twice in the same BY group.
ERROR: The ID value "date" occurs twice in the same BY group.
ERROR: The ID value "dep_count" occurs twice in the same BY group.
ERROR: The ID value "dep_count" occurs twice in the same BY group.
ERROR: The ID value "dep_duration" occurs twice in the same BY group.
ERROR: The ID value "dep_duration" occurs twice in the same BY group.
ERROR: The ID value "end_datetime" occurs twice in the same BY group.
ERROR: The ID value "end_datetime" occurs twice in the same BY group.
ERROR: The ID value "happy_count" occurs twice in the same BY group.
ERROR: The ID value "happy_count" occurs twice in the same BY group.
ERROR: The ID value "mid" occurs twice in the same BY group.
ERROR: The ID value "mid" occurs twice in the same BY group.
ERROR: The ID value "qual" occurs twice in the same BY group.
ERROR: The ID value "qual" occurs twice in the same BY group.
NOTE: The above message was for the following BY group:
id=1346 dep_date=02/19/2019
ERROR: The ID value "__" occurs twice in the same BY group.
ERROR: The ID value "counter_dep" occurs twice in the same BY group.
ERROR: The ID value "counter_dep" occurs twice in the same BY group.
ERROR: The ID value "counter_nap" occurs twice in the same BY group.
ERROR: The ID value "counter_nap" occurs twice in the same BY group.
ERROR: The ID value "counter_report" occurs twice in the same BY group.
ERROR: The ID value "counter_report" occurs twice in the same BY group.
ERROR: The ID value "counter_to_eat" occurs twice in the same BY group.
ERROR: The ID value "counter_to_eat" occurs twice in the same BY group.
ERROR: The ID value "date" occurs twice in the same BY group.
ERROR: The ID value "date" occurs twice in the same BY group.
ERROR: The ID value "dep_count" occurs twice in the same BY group.
ERROR: The ID value "dep_count" occurs twice in the same BY group.
ERROR: The ID value "dep_duration" occurs twice in the same BY group.
ERROR: The ID value "dep_duration" occurs twice in the same BY group.
ERROR: The ID value "end_datetime" occurs twice in the same BY group.
ERROR: The ID value "end_datetime" occurs twice in the same BY group.
ERROR: The ID value "happy_count" occurs twice in the same BY group.
ERROR: The ID value "happy_count" occurs twice in the same BY group.
ERROR: The ID value "mid" occurs twice in the same BY group.
ERROR: The ID value "mid" occurs twice in the same BY group.
ERROR: The ID value "qual" occurs twice in the same BY group.
ERROR: The ID value "qual" occurs twice in the same BY group.
NOTE: The above message was for the following BY group:
id=1786 dep_date=03/14/2019
WARNING: 3 BY groups omitted due to earlier errors.
NOTE: There were 109 observations read from the data set WORK.TALL.
NOTE: The data set WORK.WANT has 1 observations and 17 variables.
NOTE: PROCEDURE TRANSPOSE used (Total process time):
real time 0.02 seconds
cpu time 0.01 seconds
I understand why I'm getting this error; I therefore created a counter by id, dep_date, and name ("day_count") and added "day_count" as a BY variable in the PROC TRANSPOSE, which solved the issue. My data now look like this:
id | dep_date | day_count | _NAME_ | __ | _ | counter_dep | counter_nap | counter_report | counter_to_eat | date | dep_count | dep_duration | end_datetime | happy_count | mid | qual | ___ |
1346 | 1/20/2019 | 1 | value | 15 | 297 | 0 | 0 | "2019-01-25" | 4 | 4 | "2019-01-25T06 | 10 | "2019-01-25T04 | 95 | |||
1346 | 1/20/2019 | 2 | value | 13 | 300 | 0 | 0 | "2019-01-26" | 0 | 0 | "2019-01-26T07 | 5 | "2019-01-26T04 | 96 | |||
1346 | 1/20/2019 | 3 | value | 2 | 296 | 6 | 0 | "2019-01-27" | 1 | 1 | "2019-01-27T06 | 4 | "2019-01-27T04 | 99 | |||
1346 | 2/19/2019 | 1 | value | 17 | 392 | 0 | 0 | "2019-02-04" | 0 | 0 | "2019-02-04T12 | 15 | "2019-02-04T09 | 96 | |||
1346 | 2/19/2019 | 2 | value | 26 | 327 | 0 | 0 | "2019-02-05" | 4 | 7 | "2019-02-05T09 | 13 | "2019-02-05T06 | 93 | |||
1346 | 2/19/2019 | 3 | value | 21 | 444 | 1 | 0 | "2019-02-06" | 1 | 3 | "2019-02-06T12 | 12 | "2019-02-06T08 | 95 | |||
1786 | 3/14/2019 | 1 | value | 9 | 307 | 0 | 0 | "2019-03-10" | 2 | 2 | "2019-03-10T05 | 8 | "2019-03-10T03 | 97 | |||
1786 | 3/14/2019 | 2 | value | 5 | 332 | 0 | 0 | "2019-03-11" | 0 | 0 | "2019-03-11T04 | 5 | "2019-03-11T01 | 99 | |||
1786 | 3/14/2019 | 3 | value | 34 | 393 | 0 | 0 | "2019-03-12" | 1 | 4 | "2019-03-12T05 | 15 | "2019-03-12T02 | 92 | |||
1786 | 4/14/2019 | 1 | value |
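A rough sketch of that counter approach (hypothetical reconstruction from the description above; the tall dataset is assumed to have variables id, dep_date, name, and value):

```sas
/* Hypothetical sketch of the day_count fix described above:
   number the repeated names within each id/dep_date group,
   then include day_count in the BY statement */
proc sort data=tall;
  by id dep_date name;
run;

data tall2;
  set tall;
  by id dep_date name;
  if first.name then day_count=0;
  day_count+1;  /* nth occurrence of each name = nth JSON record */
run;

proc sort data=tall2;
  by id dep_date day_count;
run;

proc transpose data=tall2 out=want;
  by id dep_date day_count;
  id name;
  var value;
run;
```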
However:
1) The value is cut off for end_datetime and mid (e.g., "2019-01-25T06" should be "2019-01-25T06:44:30Z"). This problem was evident in the original "tall" dataset, as you can see in the first table I posted.
2) I need to strip the quotation marks off of date (which should be in date format), end_datetime, and mid (which should both be in datetime format).
Thanks.
Looks like you are almost there.
You appear to be using too short a variable when scanning, or perhaps your JSON text is just being truncated when you read it into SAS.
You appear to have used the wrong delimiters in the SCAN() function call if it is treating the slash in the date strings as a delimiter.
You appear to have multiple values for the same NAME. What do you want to do with those? You could add a new counter variable. Then you could either add the new variable to the BY group or the ID variable list so that you have distinct names.
Someone above posted an actual file. So let's use that file.
The first thing that needs to be fixed is to remove the END OF LINE characters from some of the values so that each observation is only ONE LINE in the CSV file. You can use this macro: https://github.com/sasutils/macros/blob/master/replace_crlf.sas
Let's replace the LINE FEED characters with a PIPE character so we know where the line breaks used to be.
filename csv "c:\downloads\dep_PA to paste sample v3 for sas forum.csv";
filename csvfix temp;
%replace_crlf(csv encoding=any,csvfix,cr=,lf='|');
Now we can read the CSV file into an actual SAS dataset.
data have;
  infile csvfix dsd truncover firstobs=2;
  input id $ dep_date :mmddyy. dep_1_PA :$1500. date_2 :mmddyy. dep_2_PA :$1500.;
  format dep_date date_2 yymmdd10.;
run;
Now let's parse the strings into records and then into terms and then into NAME/VALUE pairs.
data tall;
  set have;
  array x DEP_1_PA DEP_2_PA;
  length col line row 8 name $32 value $200 term $300 record $1500;
  do col=1 to dim(x);
    /* split each JSON variable into records at the inserted pipes */
    do line=1 by 1 until(record=' ');
      record=scan(x[col],line,'|');
      /* split each record into NAME:VALUE terms */
      do row=1 by 1 until(term=' ');
        term=scan(record,row,"[{,'}]");
        term=translate(term,' ','A0'x); /* non-breaking spaces to spaces */
        name=dequote(scan(term,1,':','q'));
        value=dequote(scan(term,2,':','q'));
        if term ne ' ' then output;
      end;
    end;
  end;
  drop DEP_1_PA DEP_2_PA term record;
run;
Now we can transpose
proc transpose data=tall out=want(drop=_name_);
  by id dep_date date_2 col line;
  id name;
  var value;
run;
Note that everything is a character string, but most of those variables look like numbers to me.
You could write the file back out into an actual CSV file and let PROC IMPORT try to read it in and it will do a pretty good job with this small sample.
filename newcsv temp;
proc export data=want file=newcsv dbms=csv;
run;
proc import file=newcsv dbms=csv out=want2 ;
run;
Variables in Creation Order

 #  Variable         Type  Len  Format      Informat
 1  id               Num   8    BEST12.     BEST32.
 2  dep_date         Num   8    YYMMDD10.   YYMMDD10.
 3  date_2           Num   8    YYMMDD10.   YYMMDD10.
 4  col              Num   8    BEST12.     BEST32.
 5  line             Num   8    BEST12.     BEST32.
 6  dep_count        Num   8    BEST12.     BEST32.
 7  dep_duration     Num   8    BEST12.     BEST32.
 8  happy_count      Num   8    BEST12.     BEST32.
 9  qual             Num   8    BEST12.     BEST32.
10  end_datetime     Num   8    B8601DZ35.  B8601DZ35.
11  date             Num   8    YYMMDD10.   YYMMDD10.
12  mid              Num   8    B8601DZ35.  B8601DZ35.
13  counter_report   Num   8    BEST12.     BEST32.
14  counter_nap      Num   8    BEST12.     BEST32.
15  counter_dep      Num   8    BEST12.     BEST32.
16  counter_to_eat   Num   8    BEST12.     BEST32.
Hi,
This is nearly perfect, but:
1) One line of data is missing: the one where end_datetime=2019-02-05T09:09:30Z (it should fall between rows 4 and 5). See below:
Obs | id | dep_date | col | line | dep_count | dep_duration | happy_count | qual | end_datetime | date | mid | counter_report | counter_nap | counter_dep | counter_to_eat |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1346 | 2019-01-20 | 1 | 1 | 4 | 4 | 10 | 95 | 25JAN19:06:44:30 | 2019-01-25 | 25JAN19:04:08:00 | 0 | 297 | 15 | 0 |
2 | 1346 | 2019-01-20 | 1 | 2 | 0 | 0 | 5 | 96 | 26JAN19:07:15:30 | 2019-01-26 | 26JAN19:04:39:30 | 0 | 300 | 13 | 0 |
3 | 1346 | 2019-01-20 | 1 | 3 | 1 | 1 | 4 | 99 | 27JAN19:06:46:00 | 2019-01-27 | 27JAN19:04:13:30 | 6 | 296 | 2 | 0 |
4 | 1346 | 2019-02-19 | 2 | 1 | 0 | 0 | 15 | 96 | 04FEB19:12:25:30 | 2019-02-04 | 04FEB19:09:01:30 | 0 | 392 | 17 | 0 |
5 | 1346 | 2019-02-19 | 2 | 3 | 1 | 3 | 12 | 95 | 06FEB19:12:05:30 | 2019-02-06 | 06FEB19:08:12:00 | 1 | 444 | 21 | 0 |
6 | 1786 | 2019-03-14 | 1 | 1 | 2 | 2 | 8 | 97 | 10MAR19:05:39:00 | 2019-03-10 | 10MAR19:03:00:30 | 0 | 307 | 9 | 0 |
7 | 1786 | 2019-03-14 | 1 | 2 | 0 | 0 | 5 | 99 | 11MAR19:04:38:30 | 2019-03-11 | 11MAR19:01:50:30 | 0 | 332 | 5 | 0 |
8 | 1786 | 2019-03-14 | 1 | 3 | 1 | 4 | 15 | 92 | 12MAR19:05:49:00 | 2019-03-12 | 12MAR19:02:16:00 | 0 | 393 | 34 | 0 |
The line of data should look like this:
ID | dep_date | col | line | dep_count | dep_duration | happy_count | qual | end_datetime | date | mid | counter_report | counter_nap | counter_dep | counter_to_eat |
1346 | 2019-02-19 | 2 | 2 | 4 | 7 | 13 | 93 | 05FEB19:09:09:30Z | 2019-02-05 | 05FEB19:06:13:00Z | 0 | 327 | 26 | 0 |
I've discovered the data gets lost somewhere when moving from "have" to "tall" (the code below):
data tall;
  set have;
  array x DEP_1_PA DEP_2_PA;
  length col line row 8 name $32 value $200 term $300 record $1500;
  do col=1 to dim(x);
    do line=1 by 1 until(record=' ');
      record=scan(x[col],line,'|');
      do row=1 by 1 until(term=' ');
        term=scan(record,row,"[{,'}]");
        term=translate(term,' ','A0'x);
        name=dequote(scan(term,1,':','q'));
        value=dequote(scan(term,2,':','q'));
        if term ne ' ' then output;
      end;
    end;
  end;
  drop DEP_1_PA DEP_2_PA term record;
run;
2) The issue with the duplicate dates ("date_2") is still present; e.g., where date is 2019-01-25, 2019-01-26, or 2019-01-27, dep_date should be 1/20/2019, whereas where date is 2019-02-04 or 2019-02-06, dep_date should be 2/19/2019. None of these values belongs to both of those dates (dep_date and date_2 should be collapsed into a single variable). I used the code below:
data want2;
  set want2;
  if col=2 then dep_date=date_2;
  drop date_2;
run;
However, is there a better way to do this - to avoid creating "date_2" to begin with? (My larger dataset also has more than 2 "dep_dates" per person, which will make IF-based corrections like the one above more tedious.)
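One way to generalize the IF-based fix for any number of reference-date columns is an array lookup keyed on col. This is a hypothetical sketch; it assumes each extra date column (date_2, date_3, ...) corresponds to col=2, col=3, and so on:

```sas
/* Hypothetical sketch: select the reference date matching col,
   so every extra date column collapses into dep_date */
data want2;
  set want2;
  array d dep_date date_2;  /* extend: dep_date date_2 date_3 ... */
  dep_date = d[col];        /* col=1 keeps dep_date, col=2 takes date_2 */
  drop date_2;              /* drop the extra date columns */
run;
```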
Thanks.
Look at the formatting of LINE 2 for that source observation.
Remove the DROP statement and you should see it in RECORD.
See if you can figure out why it is different than line 1 and line 3 for that observation.
Do perhaps some of the values not have NAME:VALUE pairs? Does it have some extra strange characters that perhaps should be used as delimiters in the second scan() to look for the pairs?
Looks like there are more non-breaking spaces inserted. You could perhaps include those as delimiters in the second SCAN() function. You could probably also remove the line that replaces them with actual spaces, since you are now using them as delimiters.
data tall;
  set have;
  array x DEP_1_PA DEP_2_PA;
  length col line row 8 name $32 value $200 term $300 record $1500;
  do col=1 to dim(x);
    do line=1 by 1 until(record=' ');
      record=scan(x[col],line,'|');
      do row=1 by 1 until(term=' ');
        term=scan(record,row,"[{,'}]"||'A0'x); /* treat non-breaking spaces as delimiters too */
        term=translate(term,' ','A0'x);
        name=dequote(scan(term,1,':','q'));
        value=dequote(scan(term,2,':','q'));
        if term ne ' ' then output;
      end;
    end;
  end;
  drop DEP_1_PA DEP_2_PA term record;
run;
3392  data _null_;
3393    set tall;
3394    where id='1346' and col=2 and line=2;
3395    put (_all_) (=);
3396  run;

id=1346 dep_date=2019-01-20 date_2=2019-02-19 col=2 line=2 row=1 name=dep_count value=4
id=1346 dep_date=2019-01-20 date_2=2019-02-19 col=2 line=2 row=2 name=dep_duration value=7
id=1346 dep_date=2019-01-20 date_2=2019-02-19 col=2 line=2 row=3 name=happy_count value=13
id=1346 dep_date=2019-01-20 date_2=2019-02-19 col=2 line=2 row=4 name=qual value=93
id=1346 dep_date=2019-01-20 date_2=2019-02-19 col=2 line=2 row=5 name=end_datetime value=2019-02-05T09:09:30Z
id=1346 dep_date=2019-01-20 date_2=2019-02-19 col=2 line=2 row=6 name=date value=2019-02-05
id=1346 dep_date=2019-01-20 date_2=2019-02-19 col=2 line=2 row=7 name=mid value=2019-02-05T06:13:00Z
id=1346 dep_date=2019-01-20 date_2=2019-02-19 col=2 line=2 row=8 name=counter_report value=0
id=1346 dep_date=2019-01-20 date_2=2019-02-19 col=2 line=2 row=9 name=counter_nap value=327
id=1346 dep_date=2019-01-20 date_2=2019-02-19 col=2 line=2 row=10 name=counter_dep value=26
id=1346 dep_date=2019-01-20 date_2=2019-02-19 col=2 line=2 row=11 name=counter_to_eat value=0

NOTE: There were 11 observations read from the data set WORK.TALL.
      WHERE (id='1346') and (col=2) and (line=2);
NOTE: DATA statement used (Total process time):
      real time 0.01 seconds
      cpu time 0.01 seconds
What is the meaning of the two date columns in the CSV file?
How do those dates relate to the dates in the JSON text?
Why not just use a different name for those two dates that does not conflict with the names in the JSON text?
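That renaming could be done when reading the CSV in the first place. A hypothetical variation on the earlier DATA step, with the two reference dates given names that cannot collide with anything in the JSON text:

```sas
/* Hypothetical: read the reference dates under non-conflicting names */
data have;
  infile csvfix dsd truncover firstobs=2;
  input id $ ref_date1 :mmddyy. dep_1_PA :$1500.
           ref_date2 :mmddyy. dep_2_PA :$1500.;
  format ref_date1 ref_date2 yymmdd10.;
run;
```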
Thanks, the new code worked.
The dep_date is a reference date. It's actually not that important - but I'm not sure what you mean by " use a different name." The dep_date and date_2 names are not used for any of the JSON data.
The way the code I posted works is that it takes each observation and splits it into separate observations for each JSON string VARIABLE and each LINE within that variable.
The only way it could cause name conflicts is if you have the same name in the same line, or the name conflicts with one of the grouping variables. If you have the same NAME in the same LINE of the same VARIABLE, then what does that mean? And what do you want to DO about it?
(Note this is why JSON is a terrible format for storing relational data.)