
Hello,

 

I have a .csv file with JSON data that looks as follows (sample of data below):

 

id dep_date dep_1_PA date_2 dep_2_PA
1346 1/20/2019 ['{"dep_count":4,"dep_duration":4,"happy_count":10,"qual":95,"end_datetime":"2019-01-25T06:44:30Z","date":"2019-01-25","mid":"2019-01-25T04:08:00Z","counter_report":0,"counter_nap":297,"counter_dep":15,"counter_to_eat":0}'

'{"dep_count":0,"dep_duration":0,"happy_count":5,"qual":96,"end_datetime":"2019-01-26T07:15:30Z","date":"2019-01-26","mid":"2019-01-26T04:39:30Z","counter_report":0,"counter_nap":300,"counter_dep":13,"counter_to_eat":0}'

'{"dep_count":1,"dep_duration":1,"happy_count":4,"qual":99,"end_datetime":"2019-01-27T06:46:00Z","date":"2019-01-27","mid":"2019-01-27T04:13:30Z","counter_report":6,"counter_nap":296,"counter_dep":2,"counter_to_eat":0}']
2/19/2019 ['{"dep_count":0,"dep_duration":0,"happy_count":15,"qual":96,"end_datetime":"2019-02-04T12:25:30Z","date":"2019-02-04","mid":"2019-02-04T09:01:30Z","counter_report":0,"counter_nap":392,"counter_dep":17,"counter_to_eat":0}

 '{"dep_count":4,"dep_duration":7,"happy_count":13,"qual":93,"end_datetime":"2019-02-05T09:09:30Z","date":"2019-02-05","mid":"2019-02-05T06:13:00Z","counter_report":0,"counter_nap":327,"counter_dep":26,"counter_to_eat":0}'

'{"dep_count":1,"dep_duration":3,"happy_count":12,"qual":95,"end_datetime":"2019-02-06T12:05:30Z","date":"2019-02-06","mid":"2019-02-06T08:12:00Z","counter_report":1,"counter_nap":444,"counter_dep":21,"counter_to_eat":0}']
1786 3/14/2019 ['{"dep_count":2,"dep_duration":2,"happy_count":8,"qual":97,"end_datetime":"2019-03-10T05:39:00Z","date":"2019-03-10","mid":"2019-03-10T03:00:30Z","counter_report":0,"counter_nap":307,"counter_dep":9,"counter_to_eat":0}'

'{"dep_count":0,"dep_duration":0,"happy_count":5,"qual":99,"end_datetime":"2019-03-11T04:38:30Z","date":"2019-03-11","mid":"2019-03-11T01:50:30Z","counter_report":0,"counter_nap":332,"counter_dep":5,"counter_to_eat":0}'

'{"dep_count":1,"dep_duration":4,"happy_count":15,"qual":92,"end_datetime":"2019-03-12T05:49:00Z","date":"2019-03-12","mid":"2019-03-12T02:16:00Z","counter_report":0,"counter_nap":393,"counter_dep":34,"counter_to_eat":0}']
4/14/2019 []

 

I need it to look like this:

 

ID dep_date dep_count dep_duration happy_count qual end_datetime date mid counter_report counter_nap counter_dep counter_to_eat
1346 1/20/2019 4 4 10 95 2019-01-25T06:44:30Z 1/25/2019 2019-01-25T04:08:00Z 0 297 15 0
1346 1/20/2019 0 0 5 96 2019-01-26T07:15:30Z 1/26/2019 2019-01-26T04:39:30Z 0 300 13 0
1346 1/20/2019 1 1 4 99 2019-01-27T06:46:00Z 1/27/2019 2019-01-27T04:13:30Z 6 296 2 0
1346 2/19/2019 0 0 15 96 2019-02-04T12:25:30Z 2/4/2019 2019-02-04T09:01:30Z 0 392 17 0
1346 2/19/2019 4 7 13 93 2019-02-05T09:09:30Z 2/5/2019 2019-02-05T06:13:00Z 0 327 26 0
1346 2/19/2019 1 3 12 95 2019-02-06T12:05:30Z 2/6/2019 2019-02-06T08:12:00Z 1 444 21 0
1786 3/14/2019 2 2 8 97 2019-03-10T05:39:00Z 3/10/2019 2019-03-10T03:00:30Z 0 307 9 0
1786 3/14/2019 0 0 5 99 2019-03-11T04:38:30Z 3/11/2019 2019-03-11T01:50:30Z 0 332 5 0
1786 3/14/2019 1 4 15 92 2019-03-12T05:49:00Z 3/12/2019 2019-03-12T02:16:00Z 0 393 34 0
1786 4/14/2019                      
1786 4/14/2019                      
1786 4/14/2019                    

 

 

 

 

Is there a way to parse this JSON data into that layout in SAS? I can't do it in Excel. Thanks.

 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User

Once you have the JSON data properly extracted from the CSV file, you can turn to parsing it.

data tall;
  set json(keep=id type rep json);
  where json ne ' ';
  length line row 8 name $32 value $200 term $300 record $1500;
  do line=1 to countw(json,'|');                /* each '|' marks where a line break used to be */
     record=left(scan(json,line,'|'));
     do row=1 by 1 until(term=' ');
        term=scan(record,row,"[{,'}]"||'A0'x);  /* split the record into name:value terms; treat non-breaking spaces as delimiters too */
        term=translate(term,' ','A0'x);
        name=dequote(scan(term,1,':','q'));     /* 'q' keeps the colons inside quoted values intact */
        value=dequote(scan(term,2,':','q'));     
        if term ne ' ' then output;
    end;
  end;
  drop json term record;
run;

proc sort;
  by id type rep line row;
run;

proc transpose data=tall out=wide(drop=_name_) ;
  where name ne 'id';
  by id type rep line ;
  id name;
  var value;
run;

proc sort;
  by id type rep line row;
run;

proc transpose data=tall out=wide(drop=_name_) ;
  where name ne 'id';
  by id type rep line ;
  id name;
  var value;
run;

Looks like you also might need to do a little more cleaning.  It appears some of the JSON strings have MID instead of MIDPOINT for one of the timestamp variables.
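For example, something along these lines would fold the two together after the transpose (just a sketch - it assumes the transposed dataset WIDE ends up with both a MID and a MIDPOINT character column, only one of which is populated on any given row):

data wide;
  set wide;
  if mid=' ' then mid=midpoint;  /* take MIDPOINT where MID is blank */
  drop midpoint;
run;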

 


 


22 REPLIES
Patrick
Opal | Level 21

Please attach a representative sample .csv. We need to understand in detail what your .csv looks like - especially where the line feeds are. Make sure the data you share really looks like the source the SAS code needs to process (i.e., avoid any intermediary step using Excel to provide the data).

confooseddesi89
Quartz | Level 8

I don't see how I can attach the .csv to a post, so I've uploaded it here

Patrick
Opal | Level 21

@confooseddesi89 wrote:

I don't see how I can attach the .csv to a post, so I've uploaded it here


@confooseddesi89 Just drag&drop. I've done it here now because many people won't download from some link.


 

The following code won't return the final result you're likely after, but it should take you one step further. Please don't just state what's still missing - tell us where you need further help.

One of the challenges with your data: the JSON in the data needs some clean-up before it can be used with a JSON engine. The question is where all this additional quoting got added.

Could it be that the .csv you've shared is some export from a database where the json string is stored in a CLOB field? Would you in the end directly access the database? And if so: Which database?

%let source_path=C:\temp;
data step1;
  infile "&source_path\dep_PA to paste sample v3 for sas forum.csv" 
    firstobs=2
    lrecl=10000
    encoding="utf-8"
    termstr=crlf
    dlm=','
    truncover
    dsd 
    ;    

  length id dep 8;
  format dep_date date9.;

  input id dep_date:mmddyy10. dep_PA:$2000. @;
  dep=1;
  dep_pa=transtrn(dep_pa,"['",'[');
  dep_pa=transtrn(dep_pa,"']",']');
  dep_pa=transtrn(dep_pa,'""','"');
  output;
  input dep_date:mmddyy10. dep_PA:$2000.;
  dep=2;
  dep_pa=transtrn(dep_pa,"['",'[');
  dep_pa=transtrn(dep_pa,"']",']');
  dep_pa=transtrn(dep_pa,'""','"');
  output;

run;

data step2;
  set step1;
  length name $32 value $40 ;

  do i=1 by 1 until(name=' ');
    name=dequote(scan(scan(dep_pa,i,"{{,'}]"),1,':'));
    value=scan(scan(dep_pa,i,"{{,'}]"),2,':');
    if name ne ' ' then output;
  end;
run;
  

 

Tom
Super User

You have posted a listing, not a CSV file. A CSV file is a text file.  A CSV file has only one line per observation.

The listing you posted appears to have line breaks in the middle of some of the cells.  That will not work.

So the first step is to make sure you actually have just one line of text per observation in the CSV file.

 

Also, it would be a lot easier to use something other than a COMMA as the separator in your text file, since COMMA is a common character in JSON text.  So perhaps use a pipe character as the delimiter, or TAB if you never have to actually look at the file with an editor - TAB characters are normally invisible, which makes it hard to look at or modify files that contain them.
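For example, if the file were pipe-delimited instead, reading it would just be a matter of pointing the DLM= option at the pipe (a sketch only - the file name is made up and the informats are guessed from your listing):

data have;
  infile 'c:\downloads\sample_pipe_delimited.txt' dsd dlm='|' truncover firstobs=2 lrecl=10000;
  input id $ dep_date :mmddyy. dep_1_PA :$1500. date_2 :mmddyy. dep_2_PA :$1500.;
  format dep_date date_2 yymmdd10.;
run;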

 

The text you posted also does not look like JSON text to me.  What is that extra single quote doing there in the beginning?

['{"dep_count":4,"dep_duration":4,"happy_count":10

 

 

Tom
Super User

It is hard to tell from your post what the values in the DEP_1_PA and DEP_2_PA fields really have in them, but if they are consistently structured as those NAME:VALUE pairs then perhaps you can just use SCAN() to help you.

 

Try doing something like this to see what you get.

data tall;
  set have;
  length variable name $32 value $40 ;
  array x DEP_1_PA DEP_2_PA;
  do over x;
     variable = scan(vname(x),2,'_');
     do i=1 by 1 until(name=' ');
        name=dequote(scan(scan(x,i,"{{,'}]"),1,':'));
        value=scan(scan(x,i,"{{,'}]"),2,':');
        if name ne ' ' then output;
    end;
  end;
  drop DEP_1_PA DEP_2_PA;
run;

So you should now have a dataset with ID, DEP_DATE, DATE_2, VARIABLE, NAME, and VALUE.

You can then try re-ordering and using PROC TRANSPOSE.

proc sort data=tall;
  by id dep_date date_2 name variable ;
run;
proc transpose data=tall out=want;
  by id dep_date date_2;
  id name variable;
  var value;
run;

 

 

confooseddesi89
Quartz | Level 8

Hello,

 

Using the code you gave me, I get this error:

 

ERROR: The ID value "__1" occurs twice in the same BY group.
ERROR: The ID value "counter_dep1" occurs twice in the same BY group.
ERROR: The ID value "counter_dep1" occurs twice in the same BY group.
ERROR: The ID value "counter_dep2" occurs twice in the same BY group.
ERROR: The ID value "counter_dep2" occurs twice in the same BY group.
ERROR: The ID value "counter_nap1" occurs twice in the same BY group.
ERROR: The ID value "counter_nap1" occurs twice in the same BY group.
ERROR: The ID value "counter_nap2" occurs twice in the same BY group.
ERROR: The ID value "counter_nap2" occurs twice in the same BY group.
ERROR: The ID value "counter_report1" occurs twice in the same BY group.
ERROR: The ID value "counter_report1" occurs twice in the same BY group.
ERROR: The ID value "counter_report2" occurs twice in the same BY group.
ERROR: The ID value "counter_report2" occurs twice in the same BY group.
ERROR: The ID value "counter_to_eat1" occurs twice in the same BY group.
ERROR: The ID value "counter_to_eat1" occurs twice in the same BY group.
ERROR: The ID value "counter_to_eat2" occurs twice in the same BY group.
ERROR: The ID value "counter_to_eat2" occurs twice in the same BY group.
ERROR: The ID value "date1" occurs twice in the same BY group.
ERROR: The ID value "date1" occurs twice in the same BY group.
ERROR: The ID value "date2" occurs twice in the same BY group.
ERROR: The ID value "date2" occurs twice in the same BY group.
ERROR: The ID value "dep_count1" occurs twice in the same BY group.
ERROR: The ID value "dep_count1" occurs twice in the same BY group.
ERROR: The ID value "dep_count2" occurs twice in the same BY group.
ERROR: The ID value "dep_count2" occurs twice in the same BY group.
ERROR: The ID value "dep_duration1" occurs twice in the same BY group.
ERROR: The ID value "dep_duration1" occurs twice in the same BY group.
ERROR: The ID value "dep_duration2" occurs twice in the same BY group.
ERROR: The ID value "dep_duration2" occurs twice in the same BY group.
ERROR: The ID value "end_datetime1" occurs twice in the same BY group.
ERROR: The ID value "end_datetime1" occurs twice in the same BY group.
ERROR: The ID value "end_datetime2" occurs twice in the same BY group.
ERROR: The ID value "end_datetime2" occurs twice in the same BY group.
ERROR: The ID value "happy_count1" occurs twice in the same BY group.
ERROR: The ID value "happy_count1" occurs twice in the same BY group.
ERROR: The ID value "happy_count2" occurs twice in the same BY group.
ERROR: The ID value "happy_count2" occurs twice in the same BY group.
ERROR: The ID value "mid1" occurs twice in the same BY group.
ERROR: The ID value "mid1" occurs twice in the same BY group.
ERROR: The ID value "mid2" occurs twice in the same BY group.
ERROR: The ID value "mid2" occurs twice in the same BY group.
ERROR: The ID value "qual1" occurs twice in the same BY group.
ERROR: The ID value "qual1" occurs twice in the same BY group.
ERROR: The ID value "qual2" occurs twice in the same BY group.
ERROR: The ID value "qual2" occurs twice in the same BY group.
NOTE: The above message was for the following BY group:
      id=1346 dep_date=01/20/2019 date_2=02/19/2019
ERROR: The ID value "__1" occurs twice in the same BY group.
ERROR: The ID value "counter_dep1" occurs twice in the same BY group.
ERROR: The ID value "counter_dep1" occurs twice in the same BY group.
ERROR: The ID value "counter_nap1" occurs twice in the same BY group.
ERROR: The ID value "counter_nap1" occurs twice in the same BY group.
ERROR: The ID value "counter_report1" occurs twice in the same BY group.
ERROR: The ID value "counter_report1" occurs twice in the same BY group.
ERROR: The ID value "counter_to_eat1" occurs twice in the same BY group.
ERROR: The ID value "counter_to_eat1" occurs twice in the same BY group.
ERROR: The ID value "date1" occurs twice in the same BY group.
ERROR: The ID value "date1" occurs twice in the same BY group.
ERROR: The ID value "dep_count1" occurs twice in the same BY group.
ERROR: The ID value "dep_count1" occurs twice in the same BY group.
ERROR: The ID value "dep_duration1" occurs twice in the same BY group.
ERROR: The ID value "dep_duration1" occurs twice in the same BY group.
ERROR: The ID value "end_datetime1" occurs twice in the same BY group.
ERROR: The ID value "end_datetime1" occurs twice in the same BY group.
ERROR: The ID value "happy_count1" occurs twice in the same BY group.
ERROR: The ID value "happy_count1" occurs twice in the same BY group.
ERROR: The ID value "mid1" occurs twice in the same BY group.
ERROR: The ID value "mid1" occurs twice in the same BY group.
ERROR: The ID value "qual1" occurs twice in the same BY group.
ERROR: The ID value "qual1" occurs twice in the same BY group.
NOTE: The above message was for the following BY group:
      id=1786 dep_date=03/14/2019 date_2=04/14/2019
ERROR: All BY groups were bad.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 109 observations read from the data set WORK.TALL.
WARNING: The data set WORK.WANT may be incomplete.  When this step was stopped there were 0
         observations and 0 variables.
WARNING: Data set WORK.WANT was not replaced because this step was stopped.
NOTE: PROCEDURE TRANSPOSE used (Total process time):
      real time           0.02 seconds
      cpu time            0.01 seconds

 

 

Also, the "tall" dataset looks like below:

 

id dep_date date_2 variable name value i
1346 1/20/2019 2/19/2019 1     13
1346 1/20/2019 2/19/2019 1     25
1346 1/20/2019 2/19/2019 2     25
1346 1/20/2019 2/19/2019 2     13
1346 1/20/2019 2/19/2019 1 [   1
1346 1/20/2019 2/19/2019 2 [   1
1346 1/20/2019 2/19/2019 1 counter_dep 15 11
1346 1/20/2019 2/19/2019 1 counter_dep 13 23
1346 1/20/2019 2/19/2019 1 counter_dep 2 35
1346 1/20/2019 2/19/2019 2 counter_dep 17 11
1346 1/20/2019 2/19/2019 2 counter_dep 26 23
1346 1/20/2019 2/19/2019 2 counter_dep 21 35
1346 1/20/2019 2/19/2019 1 counter_nap 297 10
1346 1/20/2019 2/19/2019 1 counter_nap 300 22
1346 1/20/2019 2/19/2019 1 counter_nap 296 34
1346 1/20/2019 2/19/2019 2 counter_nap 392 10
1346 1/20/2019 2/19/2019 2 counter_nap 327 22
1346 1/20/2019 2/19/2019 2 counter_nap 444 34
1346 1/20/2019 2/19/2019 1 counter_report 0 9
1346 1/20/2019 2/19/2019 1 counter_report 0 21
1346 1/20/2019 2/19/2019 1 counter_report 6 33
1346 1/20/2019 2/19/2019 2 counter_report 0 9
1346 1/20/2019 2/19/2019 2 counter_report 0 21
1346 1/20/2019 2/19/2019 2 counter_report 1 33
1346 1/20/2019 2/19/2019 1 counter_to_eat 0 12
1346 1/20/2019 2/19/2019 1 counter_to_eat 0 24
1346 1/20/2019 2/19/2019 1 counter_to_eat 0 36
1346 1/20/2019 2/19/2019 2 counter_to_eat 0 12
1346 1/20/2019 2/19/2019 2 counter_to_eat 0 24
1346 1/20/2019 2/19/2019 2 counter_to_eat 0 36
1346 1/20/2019 2/19/2019 1 date "2019-01-25" 7
1346 1/20/2019 2/19/2019 1 date "2019-01-26" 19
1346 1/20/2019 2/19/2019 1 date "2019-01-27" 31
1346 1/20/2019 2/19/2019 2 date "2019-02-04" 7
1346 1/20/2019 2/19/2019 2 date "2019-02-05" 19
1346 1/20/2019 2/19/2019 2 date "2019-02-06" 31
1346 1/20/2019 2/19/2019 1 dep_count 4 2
1346 1/20/2019 2/19/2019 1 dep_count 0 14
1346 1/20/2019 2/19/2019 1 dep_count 1 26
1346 1/20/2019 2/19/2019 2 dep_count 0 2
1346 1/20/2019 2/19/2019 2 dep_count 4 14
1346 1/20/2019 2/19/2019 2 dep_count 1 26
1346 1/20/2019 2/19/2019 1 dep_duration 4 3
1346 1/20/2019 2/19/2019 1 dep_duration 0 15
1346 1/20/2019 2/19/2019 1 dep_duration 1 27
1346 1/20/2019 2/19/2019 2 dep_duration 0 3
1346 1/20/2019 2/19/2019 2 dep_duration 7 15
1346 1/20/2019 2/19/2019 2 dep_duration 3 27
1346 1/20/2019 2/19/2019 1 end_datetime "2019-01-25T06 6
1346 1/20/2019 2/19/2019 1 end_datetime "2019-01-26T07 18
1346 1/20/2019 2/19/2019 1 end_datetime "2019-01-27T06 30
1346 1/20/2019 2/19/2019 2 end_datetime "2019-02-04T12 6
1346 1/20/2019 2/19/2019 2 end_datetime "2019-02-05T09 18
1346 1/20/2019 2/19/2019 2 end_datetime "2019-02-06T12 30
1346 1/20/2019 2/19/2019 1 happy_count 10 4
1346 1/20/2019 2/19/2019 1 happy_count 5 16
1346 1/20/2019 2/19/2019 1 happy_count 4 28
1346 1/20/2019 2/19/2019 2 happy_count 15 4
1346 1/20/2019 2/19/2019 2 happy_count 13 16
1346 1/20/2019 2/19/2019 2 happy_count 12 28
1346 1/20/2019 2/19/2019 1 mid "2019-01-25T04 8
1346 1/20/2019 2/19/2019 1 mid "2019-01-26T04 20
1346 1/20/2019 2/19/2019 1 mid "2019-01-27T04 32
1346 1/20/2019 2/19/2019 2 mid "2019-02-04T09 8
1346 1/20/2019 2/19/2019 2 mid "2019-02-05T06 20
1346 1/20/2019 2/19/2019 2 mid "2019-02-06T08 32
1346 1/20/2019 2/19/2019 1 qual 95 5
1346 1/20/2019 2/19/2019 1 qual 96 17
1346 1/20/2019 2/19/2019 1 qual 99 29
1346 1/20/2019 2/19/2019 2 qual 96 5
1346 1/20/2019 2/19/2019 2 qual 93 17
1346 1/20/2019 2/19/2019 2 qual 95 29
1786 3/14/2019 4/14/2019 1     13
1786 3/14/2019 4/14/2019 1     25
1786 3/14/2019 4/14/2019 1 [   1
1786 3/14/2019 4/14/2019 2 [   1
1786 3/14/2019 4/14/2019 1 counter_dep 9 11
1786 3/14/2019 4/14/2019 1 counter_dep 5 23
1786 3/14/2019 4/14/2019 1 counter_dep 34 35
1786 3/14/2019 4/14/2019 1 counter_nap 307 10
1786 3/14/2019 4/14/2019 1 counter_nap 332 22
1786 3/14/2019 4/14/2019 1 counter_nap 393 34
1786 3/14/2019 4/14/2019 1 counter_report 0 9
1786 3/14/2019 4/14/2019 1 counter_report 0 21
1786 3/14/2019 4/14/2019 1 counter_report 0 33
1786 3/14/2019 4/14/2019 1 counter_to_eat 0 12
1786 3/14/2019 4/14/2019 1 counter_to_eat 0 24
1786 3/14/2019 4/14/2019 1 counter_to_eat 0 36
1786 3/14/2019 4/14/2019 1 date "2019-03-10" 7
1786 3/14/2019 4/14/2019 1 date "2019-03-11" 19
1786 3/14/2019 4/14/2019 1 date "2019-03-12" 31
1786 3/14/2019 4/14/2019 1 dep_count 2 2
1786 3/14/2019 4/14/2019 1 dep_count 0 14
1786 3/14/2019 4/14/2019 1 dep_count 1 26
1786 3/14/2019 4/14/2019 1 dep_duration 2 3
1786 3/14/2019 4/14/2019 1 dep_duration 0 15
1786 3/14/2019 4/14/2019 1 dep_duration 4 27
1786 3/14/2019 4/14/2019 1 end_datetime "2019-03-10T05 6
1786 3/14/2019 4/14/2019 1 end_datetime "2019-03-11T04 18
1786 3/14/2019 4/14/2019 1 end_datetime "2019-03-12T05 30
1786 3/14/2019 4/14/2019 1 happy_count 8 4
1786 3/14/2019 4/14/2019 1 happy_count 5 16
1786 3/14/2019 4/14/2019 1 happy_count 15 28
1786 3/14/2019 4/14/2019 1 mid "2019-03-10T03 8
1786 3/14/2019 4/14/2019 1 mid "2019-03-11T01 20
1786 3/14/2019 4/14/2019 1 mid "2019-03-12T02 32
1786 3/14/2019 4/14/2019 1 qual 97 5
1786 3/14/2019 4/14/2019 1 qual 99 17
1786 3/14/2019 4/14/2019 1 qual 92 29

 

The parsing by dep_date / date_2 is incorrect. For example, the values 15, 13, and 2 in the "value" column belong to date 1/20/2019, whereas the values 17, 26, and 21 belong to date 2/19/2019. None of these values belong to both of these dates (the 2nd and 3rd columns should be collapsed). I used the code below:

 

data tall;
  set tall;
  if Variable=2 then dep_date=date_2;
  drop date_2;
run;

 

To get this:

 

id dep_date variable name value i
1346 1/20/2019 1 [   1
1346 1/20/2019 1 dep_count 4 2
1346 1/20/2019 1 dep_duration 4 3
1346 1/20/2019 1 happy_count 10 4
1346 1/20/2019 1 qual 95 5
1346 1/20/2019 1 end_datetime "2019-01-25T06 6
1346 1/20/2019 1 date "2019-01-25" 7
1346 1/20/2019 1 mid "2019-01-25T04 8
1346 1/20/2019 1 counter_report 0 9
1346 1/20/2019 1 counter_nap 297 10
1346 1/20/2019 1 counter_dep 15 11
1346 1/20/2019 1 counter_to_eat 0 12
1346 1/20/2019 1     13
1346 1/20/2019 1 dep_count 0 14
1346 1/20/2019 1 dep_duration 0 15
1346 1/20/2019 1 happy_count 5 16
1346 1/20/2019 1 qual 96 17
1346 1/20/2019 1 end_datetime "2019-01-26T07 18
1346 1/20/2019 1 date "2019-01-26" 19
1346 1/20/2019 1 mid "2019-01-26T04 20
1346 1/20/2019 1 counter_report 0 21
1346 1/20/2019 1 counter_nap 300 22
1346 1/20/2019 1 counter_dep 13 23
1346 1/20/2019 1 counter_to_eat 0 24
1346 1/20/2019 1     25
1346 1/20/2019 1 dep_count 1 26
1346 1/20/2019 1 dep_duration 1 27
1346 1/20/2019 1 happy_count 4 28
1346 1/20/2019 1 qual 99 29
1346 1/20/2019 1 end_datetime "2019-01-27T06 30
1346 1/20/2019 1 date "2019-01-27" 31
1346 1/20/2019 1 mid "2019-01-27T04 32
1346 1/20/2019 1 counter_report 6 33
1346 1/20/2019 1 counter_nap 296 34
1346 1/20/2019 1 counter_dep 2 35
1346 1/20/2019 1 counter_to_eat 0 36
1346 2/19/2019 2 [   1
1346 2/19/2019 2 dep_count 0 2
1346 2/19/2019 2 dep_duration 0 3
1346 2/19/2019 2 happy_count 15 4
1346 2/19/2019 2 qual 96 5
1346 2/19/2019 2 end_datetime "2019-02-04T12 6
1346 2/19/2019 2 date "2019-02-04" 7
1346 2/19/2019 2 mid "2019-02-04T09 8
1346 2/19/2019 2 counter_report 0 9
1346 2/19/2019 2 counter_nap 392 10
1346 2/19/2019 2 counter_dep 17 11
1346 2/19/2019 2 counter_to_eat 0 12
1346 2/19/2019 2     13
1346 2/19/2019 2 dep_count 4 14
1346 2/19/2019 2 dep_duration 7 15
1346 2/19/2019 2 happy_count 13 16
1346 2/19/2019 2 qual 93 17
1346 2/19/2019 2 end_datetime "2019-02-05T09 18
1346 2/19/2019 2 date "2019-02-05" 19
1346 2/19/2019 2 mid "2019-02-05T06 20
1346 2/19/2019 2 counter_report 0 21
1346 2/19/2019 2 counter_nap 327 22
1346 2/19/2019 2 counter_dep 26 23
1346 2/19/2019 2 counter_to_eat 0 24
1346 2/19/2019 2     25
1346 2/19/2019 2 dep_count 1 26
1346 2/19/2019 2 dep_duration 3 27
1346 2/19/2019 2 happy_count 12 28
1346 2/19/2019 2 qual 95 29
1346 2/19/2019 2 end_datetime "2019-02-06T12 30
1346 2/19/2019 2 date "2019-02-06" 31
1346 2/19/2019 2 mid "2019-02-06T08 32
1346 2/19/2019 2 counter_report 1 33
1346 2/19/2019 2 counter_nap 444 34
1346 2/19/2019 2 counter_dep 21 35
1346 2/19/2019 2 counter_to_eat 0 36
1786 3/14/2019 1 [   1
1786 3/14/2019 1 dep_count 2 2
1786 3/14/2019 1 dep_duration 2 3
1786 3/14/2019 1 happy_count 8 4
1786 3/14/2019 1 qual 97 5
1786 3/14/2019 1 end_datetime "2019-03-10T05 6
1786 3/14/2019 1 date "2019-03-10" 7
1786 3/14/2019 1 mid "2019-03-10T03 8
1786 3/14/2019 1 counter_report 0 9
1786 3/14/2019 1 counter_nap 307 10
1786 3/14/2019 1 counter_dep 9 11
1786 3/14/2019 1 counter_to_eat 0 12
1786 3/14/2019 1     13
1786 3/14/2019 1 dep_count 0 14
1786 3/14/2019 1 dep_duration 0 15
1786 3/14/2019 1 happy_count 5 16
1786 3/14/2019 1 qual 99 17
1786 3/14/2019 1 end_datetime "2019-03-11T04 18
1786 3/14/2019 1 date "2019-03-11" 19
1786 3/14/2019 1 mid "2019-03-11T01 20
1786 3/14/2019 1 counter_report 0 21
1786 3/14/2019 1 counter_nap 332 22
1786 3/14/2019 1 counter_dep 5 23
1786 3/14/2019 1 counter_to_eat 0 24
1786 3/14/2019 1     25
1786 3/14/2019 1 dep_count 1 26
1786 3/14/2019 1 dep_duration 4 27
1786 3/14/2019 1 happy_count 15 28
1786 3/14/2019 1 qual 92 29
1786 3/14/2019 1 end_datetime "2019-03-12T05 30
1786 3/14/2019 1 date "2019-03-12" 31
1786 3/14/2019 1 mid "2019-03-12T02 32
1786 3/14/2019 1 counter_report 0 33
1786 3/14/2019 1 counter_nap 393 34
1786 3/14/2019 1 counter_dep 34 35
1786 3/14/2019 1 counter_to_eat 0 36
1786 4/14/2019 2 [   1

 

I then used the PROC TRANSPOSE code you gave me, deleting "date_2" and "variable" (which should no longer be needed, as the data should now remain by dep_date alone), and I still got the error "The ID value [name] occurs twice in the same BY group." See below:
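The step I ran was roughly this (date_2 removed from the BY statement and variable removed from the ID statement):

proc sort data=tall;
  by id dep_date name;
run;
proc transpose data=tall out=want;
  by id dep_date;
  id name;
  var value;
run;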

 

 

ERROR: The ID value "__" occurs twice in the same BY group.
ERROR: The ID value "counter_dep" occurs twice in the same BY group.
ERROR: The ID value "counter_dep" occurs twice in the same BY group.
ERROR: The ID value "counter_nap" occurs twice in the same BY group.
ERROR: The ID value "counter_nap" occurs twice in the same BY group.
ERROR: The ID value "counter_report" occurs twice in the same BY group.
ERROR: The ID value "counter_report" occurs twice in the same BY group.
ERROR: The ID value "counter_to_eat" occurs twice in the same BY group.
ERROR: The ID value "counter_to_eat" occurs twice in the same BY group.
ERROR: The ID value "date" occurs twice in the same BY group.
ERROR: The ID value "date" occurs twice in the same BY group.
ERROR: The ID value "dep_count" occurs twice in the same BY group.
ERROR: The ID value "dep_count" occurs twice in the same BY group.
ERROR: The ID value "dep_duration" occurs twice in the same BY group.
ERROR: The ID value "dep_duration" occurs twice in the same BY group.
ERROR: The ID value "end_datetime" occurs twice in the same BY group.
ERROR: The ID value "end_datetime" occurs twice in the same BY group.
ERROR: The ID value "happy_count" occurs twice in the same BY group.
ERROR: The ID value "happy_count" occurs twice in the same BY group.
ERROR: The ID value "mid" occurs twice in the same BY group.
ERROR: The ID value "mid" occurs twice in the same BY group.
ERROR: The ID value "qual" occurs twice in the same BY group.
ERROR: The ID value "qual" occurs twice in the same BY group.
NOTE: The above message was for the following BY group:
      id=1346 dep_date=01/20/2019
ERROR: The ID value "counter_dep" occurs twice in the same BY group.
ERROR: The ID value "counter_dep" occurs twice in the same BY group.
ERROR: The ID value "counter_nap" occurs twice in the same BY group.
ERROR: The ID value "counter_nap" occurs twice in the same BY group.
ERROR: The ID value "counter_report" occurs twice in the same BY group.
ERROR: The ID value "counter_report" occurs twice in the same BY group.
ERROR: The ID value "counter_to_eat" occurs twice in the same BY group.
ERROR: The ID value "counter_to_eat" occurs twice in the same BY group.
ERROR: The ID value "date" occurs twice in the same BY group.
ERROR: The ID value "date" occurs twice in the same BY group.
ERROR: The ID value "dep_count" occurs twice in the same BY group.
ERROR: The ID value "dep_count" occurs twice in the same BY group.
ERROR: The ID value "dep_duration" occurs twice in the same BY group.
ERROR: The ID value "dep_duration" occurs twice in the same BY group.
ERROR: The ID value "end_datetime" occurs twice in the same BY group.
ERROR: The ID value "end_datetime" occurs twice in the same BY group.
ERROR: The ID value "happy_count" occurs twice in the same BY group.
ERROR: The ID value "happy_count" occurs twice in the same BY group.
ERROR: The ID value "mid" occurs twice in the same BY group.
ERROR: The ID value "mid" occurs twice in the same BY group.
ERROR: The ID value "qual" occurs twice in the same BY group.
ERROR: The ID value "qual" occurs twice in the same BY group.
NOTE: The above message was for the following BY group:
      id=1346 dep_date=02/19/2019
ERROR: The ID value "__" occurs twice in the same BY group.
ERROR: The ID value "counter_dep" occurs twice in the same BY group.
ERROR: The ID value "counter_dep" occurs twice in the same BY group.
ERROR: The ID value "counter_nap" occurs twice in the same BY group.
ERROR: The ID value "counter_nap" occurs twice in the same BY group.
ERROR: The ID value "counter_report" occurs twice in the same BY group.
ERROR: The ID value "counter_report" occurs twice in the same BY group.
ERROR: The ID value "counter_to_eat" occurs twice in the same BY group.
ERROR: The ID value "counter_to_eat" occurs twice in the same BY group.
ERROR: The ID value "date" occurs twice in the same BY group.
ERROR: The ID value "date" occurs twice in the same BY group.
ERROR: The ID value "dep_count" occurs twice in the same BY group.
ERROR: The ID value "dep_count" occurs twice in the same BY group.
ERROR: The ID value "dep_duration" occurs twice in the same BY group.
ERROR: The ID value "dep_duration" occurs twice in the same BY group.
ERROR: The ID value "end_datetime" occurs twice in the same BY group.
ERROR: The ID value "end_datetime" occurs twice in the same BY group.
ERROR: The ID value "happy_count" occurs twice in the same BY group.
ERROR: The ID value "happy_count" occurs twice in the same BY group.
ERROR: The ID value "mid" occurs twice in the same BY group.
ERROR: The ID value "mid" occurs twice in the same BY group.
ERROR: The ID value "qual" occurs twice in the same BY group.
ERROR: The ID value "qual" occurs twice in the same BY group.
NOTE: The above message was for the following BY group:
      id=1786 dep_date=03/14/2019
WARNING: 3 BY groups omitted due to earlier errors.
NOTE: There were 109 observations read from the data set WORK.TALL.
NOTE: The data set WORK.WANT has 1 observations and 17 variables.
NOTE: PROCEDURE TRANSPOSE used (Total process time):
      real time           0.02 seconds
      cpu time            0.01 seconds

 

I understand why I'm getting this error; I therefore created a counter by id, dep_date, and name ("day_count") and added day_count as a BY variable in the PROC TRANSPOSE, which solved the issue (a rough sketch of that counter step follows the table). My data now look like this:

 

id dep_date day_count _NAME_ __ _ counter_dep counter_nap counter_report counter_to_eat date dep_count dep_duration end_datetime happy_count mid qual ___
1346 1/20/2019 1 value     15 297 0 0 "2019-01-25" 4 4 "2019-01-25T06 10 "2019-01-25T04 95  
1346 1/20/2019 2 value     13 300 0 0 "2019-01-26" 0 0 "2019-01-26T07 5 "2019-01-26T04 96  
1346 1/20/2019 3 value     2 296 6 0 "2019-01-27" 1 1 "2019-01-27T06 4 "2019-01-27T04 99  
1346 2/19/2019 1 value     17 392 0 0 "2019-02-04" 0 0 "2019-02-04T12 15 "2019-02-04T09 96  
1346 2/19/2019 2 value     26 327 0 0 "2019-02-05" 4 7 "2019-02-05T09 13 "2019-02-05T06 93  
1346 2/19/2019 3 value     21 444 1 0 "2019-02-06" 1 3 "2019-02-06T12 12 "2019-02-06T08 95  
1786 3/14/2019 1 value     9 307 0 0 "2019-03-10" 2 2 "2019-03-10T05 8 "2019-03-10T03 97  
1786 3/14/2019 2 value     5 332 0 0 "2019-03-11" 0 0 "2019-03-11T04 5 "2019-03-11T01 99  
1786 3/14/2019 3 value     34 393 0 0 "2019-03-12" 1 4 "2019-03-12T05 15 "2019-03-12T02 92  
1786 4/14/2019 1 value                            
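For reference, the counter step I described is roughly this (a sketch only - dataset and variable names as in the "tall" listing above):

proc sort data=tall;
  by id dep_date name i;
run;

data tall;
  set tall;
  by id dep_date name;
  if first.name then day_count=0;  /* restart the counter for each name within id/dep_date */
  day_count+1;
run;

proc sort data=tall;
  by id dep_date day_count;
run;

proc transpose data=tall out=want;
  by id dep_date day_count;
  id name;
  var value;
run;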

 

However:

 

1) The value is cut off for end_datetime and mid (e.g., "2019-01-25T06 should be 2019-01-25T06:44:30Z). This problem was already evident in the original "tall" dataset, as you can see in the first table I posted.

2) I need to strip the quotation marks off of date (which should be in date format), end_datetime, and mid (which should both be in datetime format).
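(For point 2, I assume something along these lines would do the conversion once the truncation is fixed - a sketch using the variable names from the table above; dataset names are assumed:)

data want;
  set want;
  date_num   = input(dequote(date), yymmdd10.);          /* "2019-01-25" -> SAS date           */
  end_dt_num = input(dequote(end_datetime), e8601dz35.); /* ISO 8601 timestamp -> SAS datetime */
  mid_num    = input(dequote(mid), e8601dz35.);
  format date_num yymmdd10. end_dt_num mid_num datetime20.;
  drop date end_datetime mid;
run;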

 

Thanks.

 

Tom
Super User

Looks like you are almost there.

You appear to be using too short a variable when scanning, or perhaps your JSON text is just being truncated when you read it into SAS?

You appear to have used the wrong delimiters in the SCAN() function call if it is using the slash in the date strings as a delimiter.

You appear to have multiple values for the same NAME.  What do you want to do with those?  You could add a new counter variable.  Then you could either add the new variable to the BY group or to the ID variable list so that you have distinct names.
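For the truncation specifically, something like this should help (a sketch based on the earlier step): give VALUE more room and add the 'q' modifier to SCAN() so the colons inside the quoted timestamp values are not treated as delimiters.

data tall;
  set have;
  length variable name $32 value $200 ;
  array x DEP_1_PA DEP_2_PA;
  do over x;
     variable = scan(vname(x),2,'_');
     do i=1 by 1 until(name=' ');
        name=dequote(scan(scan(x,i,"[{,'}]"),1,':','q'));   /* 'q' ignores delimiters inside quotes */
        value=dequote(scan(scan(x,i,"[{,'}]"),2,':','q'));
        if name ne ' ' then output;
    end;
  end;
  drop DEP_1_PA DEP_2_PA;
run;

Starting the delimiter list with "[" should also get rid of the stray "[" rows you saw in TALL.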

Tom
Super User

Someone above posted an actual file. So let's use that file.

The first thing that needs to be fixed is to remove the END OF LINE characters from some of the values so that each observation is only ONE LINE in the CSV file.  You can use this macro:  https://github.com/sasutils/macros/blob/master/replace_crlf.sas

Let's replace the LINE FEED characters with a PIPE character so we know where the line breaks used to be.

filename csv "c:\downloads\dep_PA to paste sample v3 for sas forum.csv";
filename csvfix temp;
%replace_crlf(csv encoding=any,csvfix,cr=,lf='|');

Now we can read the CSV file into an actual SAS dataset.

data have;
  infile csvfix dsd truncover firstobs=2 ;
  input id $ dep_date :mmddyy. dep_1_PA :$1500. date_2 :mmddyy. dep_2_PA :$1500.;
  format dep_date date_2 yymmdd10.;
run;

Now let's parse the strings into records and then into terms and then into NAME/VALUE pairs.

data tall;
  set have;
  array x DEP_1_PA DEP_2_PA;
  length col line row 8 name $32 value $200 term $300 record $1500;
  do col=1 to dim(x);
   do line=1 by 1 until(record=' ');
     record=scan(x[col],line,'|');
     do row=1 by 1 until(term=' ');
        term=scan(record,row,"[{,'}]");
        term=translate(term,' ','A0'x);
        name=dequote(scan(term,1,':','q'));
        value=dequote(scan(term,2,':','q'));     
        if term ne ' ' then output;
    end;
   end;
  end;
  drop DEP_1_PA DEP_2_PA term record;
run;

Now we can transpose

proc transpose data=tall out=want(drop=_name_);
  by id dep_date date_2 col line;
  id name ;
  var value;
run;


Note that everything is a character string, but most of those variables look like numbers to me.  

You could write the file back out into an actual CSV file and let PROC IMPORT try to read it in; it will do a pretty good job with this small sample.

filename newcsv temp;
proc export data=want file=newcsv dbms=csv;
run;

proc import file=newcsv dbms=csv out=want2 ;
run;
                  Variables in Creation Order

 #    Variable          Type    Len    Format        Informat

 1    id                Num       8    BEST12.       BEST32.
 2    dep_date          Num       8    YYMMDD10.     YYMMDD10.
 3    date_2            Num       8    YYMMDD10.     YYMMDD10.
 4    col               Num       8    BEST12.       BEST32.
 5    line              Num       8    BEST12.       BEST32.
 6    dep_count         Num       8    BEST12.       BEST32.
 7    dep_duration      Num       8    BEST12.       BEST32.
 8    happy_count       Num       8    BEST12.       BEST32.
 9    qual              Num       8    BEST12.       BEST32.
10    end_datetime      Num       8    B8601DZ35.    B8601DZ35.
11    date              Num       8    YYMMDD10.     YYMMDD10.
12    mid               Num       8    B8601DZ35.    B8601DZ35.
13    counter_report    Num       8    BEST12.       BEST32.
14    counter_nap       Num       8    BEST12.       BEST32.
15    counter_dep       Num       8    BEST12.       BEST32.
16    counter_to_eat    Num       8    BEST12.       BEST32.
confooseddesi89
Quartz | Level 8

Hi,

 

This is nearly perfect, but:

 

1.) One line of data is missing; where end_datetime=2019-02-05T09:09:30Z (should be between rows 4 and 5). See below:

 

Obs id dep_date col line dep_count dep_duration happy_count qual end_datetime date mid counter_report counter_nap counter_dep counter_to_eat
1 1346 2019-01-20 1 1 4 4 10 95 25JAN19:06:44:30 2019-01-25 25JAN19:04:08:00 0 297 15 0
2 1346 2019-01-20 1 2 0 0 5 96 26JAN19:07:15:30 2019-01-26 26JAN19:04:39:30 0 300 13 0
3 1346 2019-01-20 1 3 1 1 4 99 27JAN19:06:46:00 2019-01-27 27JAN19:04:13:30 6 296 2 0
4 1346 2019-02-19 2 1 0 0 15 96 04FEB19:12:25:30 2019-02-04 04FEB19:09:01:30 0 392 17 0
5 1346 2019-02-19 2 3 1 3 12 95 06FEB19:12:05:30 2019-02-06 06FEB19:08:12:00 1 444 21 0
6 1786 2019-03-14 1 1 2 2 8 97 10MAR19:05:39:00 2019-03-10 10MAR19:03:00:30 0 307 9 0
7 1786 2019-03-14 1 2 0 0 5 99 11MAR19:04:38:30 2019-03-11 11MAR19:01:50:30 0 332 5 0
8 1786 2019-03-14 1 3 1 4 15 92 12MAR19:05:49:00 2019-03-12 12MAR19:02:16:00 0 393 34 0

 

The line of data should look like this:

ID dep_date col line dep_count dep_duration happy_count qual end_datetime date mid counter_report counter_nap counter_dep counter_to_eat
1346 2019-02-19 2 2 4 7 13 93 05FEB19:09:09:30Z 2019-02-05 05FEB19:06:13:00Z 0 327 26 0

 

I've discovered the data gets lost somewhere when moving from "have" to "tall" (the code below):

 

data tall;
  set have;
  array x DEP_1_PA DEP_2_PA;
  length col line row 8 name $32 value $200 term $300 record $1500;
  do col=1 to dim(x);
   do line=1 by 1 until(record=' ');
     record=scan(x[col],line,'|');
     do row=1 by 1 until(term=' ');
        term=scan(record,row,"[{,'}]");
        term=translate(term,' ','A0'x);
        name=dequote(scan(term,1,':','q'));
        value=dequote(scan(term,2,':','q'));     
        if term ne ' ' then output;
    end;
   end;
  end;
  drop DEP_1_PA DEP_2_PA term record;
run;

 

 

2) The issue with the duplicate dates ("date_2") is still present; e.g., where dates are 2019-01-25, 2019-01-26, and 2019-01-27, the dep_date should be 1/20/2019, whereas where dates are 2019-02-04 and 2019-02-06, dep_date should be 2/19/2019. None of these values belong to both of these dates (dep_date and date_2 should be collapsed). I used the code below:

 

data want2;
  set want2;
  if col=2 then dep_date=date_2;
  drop date_2;
run;

 

However, is there a better way to do this - to not create "date_2" to begin with? (My larger dataset also has more than two "dep_dates" per person, which will make correcting the issue with the above "if" code more tedious.)

 

Thanks.

Tom
Super User

Look at the formatting of LINE 2 for that source observation.

Remove the DROP statement and you should see it in RECORD.

See if you can figure out why it is different from lines 1 and 3 for that observation.

Do some of the values perhaps not have NAME:VALUE pairs?  Does it have some strange extra characters that should perhaps be added as delimiters in the second SCAN() that looks for the pairs?
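For example, a quick way to eyeball that record (a sketch, using the variables from the step above):

data _null_;
  set have;
  where id='1346';
  length record $1500;
  record=scan(dep_2_pa,2,'|');   /* line 2 of the second JSON column */
  put record=;                   /* the raw text                     */
  put record= $hex80.;           /* a hex dump makes odd characters such as 'A0'x visible */
run;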

Tom
Super User

Looks like there are more non-breaking spaces inserted.  You could perhaps include those in the second SCAN() function.  You could probably remove the line that replaces them with actual spaces since you are now using them as a delimiter.

data tall;
  set have;
  array x DEP_1_PA DEP_2_PA;
  length col line row 8 name $32 value $200 term $300 record $1500;
  do col=1 to dim(x);
   do line=1 by 1 until(record=' ');
     record=scan(x[col],line,'|');
     do row=1 by 1 until(term=' ');
        term=scan(record,row,"[{,'}]"||'A0'x);
        term=translate(term,' ','A0'x);
        name=dequote(scan(term,1,':','q'));
        value=dequote(scan(term,2,':','q'));     
        if term ne ' ' then output;
    end;
   end;
  end;
  drop DEP_1_PA DEP_2_PA term record;
run;
3392  data _null_;
3393    set tall;
3394    where id='1346' and col=2 and line=2;
3395    put (_all_) (=) ;
3396  run;

id=1346 dep_date=2019-01-20 date_2=2019-02-19 col=2 line=2 row=1 name=dep_count value=4
id=1346 dep_date=2019-01-20 date_2=2019-02-19 col=2 line=2 row=2 name=dep_duration value=7
id=1346 dep_date=2019-01-20 date_2=2019-02-19 col=2 line=2 row=3 name=happy_count value=13
id=1346 dep_date=2019-01-20 date_2=2019-02-19 col=2 line=2 row=4 name=qual value=93
id=1346 dep_date=2019-01-20 date_2=2019-02-19 col=2 line=2 row=5 name=end_datetime value=2019-02-05T09:09:30Z
id=1346 dep_date=2019-01-20 date_2=2019-02-19 col=2 line=2 row=6 name=date value=2019-02-05
id=1346 dep_date=2019-01-20 date_2=2019-02-19 col=2 line=2 row=7 name=mid value=2019-02-05T06:13:00Z
id=1346 dep_date=2019-01-20 date_2=2019-02-19 col=2 line=2 row=8 name=counter_report value=0
id=1346 dep_date=2019-01-20 date_2=2019-02-19 col=2 line=2 row=9 name=counter_nap value=327
id=1346 dep_date=2019-01-20 date_2=2019-02-19 col=2 line=2 row=10 name=counter_dep value=26
id=1346 dep_date=2019-01-20 date_2=2019-02-19 col=2 line=2 row=11 name=counter_to_eat value=0
NOTE: There were 11 observations read from the data set WORK.TALL.
      WHERE (id='1346') and (col=2) and (line=2);
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

 

Tom
Super User

What is the meaning of the two date columns in the CSV file? 

How do those dates relate to the dates in the JSON text?

Why not just use a different name for those two dates that does not conflict with the names in the JSON text?

confooseddesi89
Quartz | Level 8

Thanks, the new code worked.

The dep_date is a reference date. It's actually not that important - but I'm not sure what you mean by "use a different name." The dep_date and date_2 names are not used for any of the JSON data.

Tom
Super User

The code I posted works: it takes that one observation and splits it into separate observations for each JSON string VARIABLE and each LINE within that variable.

 

The only way it could cause name conflicts is if you have the same name twice in the same line, or if the name conflicts with one of the grouping variables.  If you have the same NAME in the same LINE of the same VARIABLE, then what does that mean?  And what do you want to DO about it?

 

(Note this is why JSON is a terrible format for storing relational data.)

 

 

 
