Every developer must have at some point come across JSON (JavaScript Object Notation). JSON is arguably the most popular format by which data is transferred across the internet. Since the rise of client server architecture where we usually have a frontend browser or mobile application and a backend server application, the client constructs its request body as JSON and sends it to the backend, the backend also responds with JSON. So JSON is very popular and at some point we’ve all used them.
When the frontend constructs the JSON request payload and sends it over the network to the backend, the backend server receives this request payload as bytes and then depending on the web framework being used on the backend or programming language, we are able to parse these bytes into a more concrete/familiar data structure that can easily be used in that programming language. For example in Python, the JSON bytes are converted into pure Python objects, JSON objects to Python dictionaries, JSON arrays to Python lists, JSON true to Python True, JSON false to Python False, JSON null to Python None, JSON number to Python int or float and JSON strings to Python strings.
A lot of web frameworks and programming languages have this parsing capability built into them, or there’s usually a JSON library that can used to parse JSON bytes/strings into the programming language’s or web framework’s data structures.
The aim of this article is to demonstrate how to implement such parsers in pure Python. At the end of this article, we would have a function that takes a JSON string and outputs the equivalent Python object.
To follow along with this article, you only need to have Python3 installed and of course, an IDE like Visual Studio Code.
The JSON Grammar and Syntax
The JSON grammar is an LL(k) grammar or LL(1) grammar to be precise, which means you can determine with certainty what data structure should be produced from the very first character. This means that we can parse JSON using an LL parser with recursive descent approach.
Let's go...
Open your terminal and cd into any folder of your choice and run:
touch simple_json_parser.py
code simple_json_parser.py
touch simple_json_parser.py
would create the Python file and code simple_json_parser.py
would open the file in Vs Code.
Before we go further, let’s define clearly what our code wants to do.
We want to write a simple Python - let’s call it parse_json
function that would take a Python string containing JSON data and return the equivalent Python object. Here is a conversion table:
JSON | Python |
---|---|
Object | Dictionary |
Array | List |
Null | None |
true | True |
false | False |
number | int or float |
string | string |
Some very important rules to keep in mind for a valid JSON
- A valid JSON must either be an object or an array.
- JSON strings use double quotes.
- JSON can be nested to any depth, I suppose.
- JSON is whitespace agnostic meaning
{ "count" : 4}
is the same as{"count":4}
Let’s define our parser function in our open file:
def parse_json(json_string: str ) => dict | list:
...
Since JSON is whitespace agnostic as we’ve established, we need a way to skip whitespaces while parsing our JSON string so we only have to deal with non-whitespace characters.
Let’s define a function to do that for us just above our parser function.
import re
def skip_whitespace(inp: str, start_index: int = 0) -> int:
m = re.compile(r"\s*")
return m.match(inp, start_index).end()
def parse_json(json_string: str ) -> dict | list:
...
Our skip_white_space
function will take an input string inp
and a start_index
and return the index of the first non-whitespace character. We use regex to compile a pattern matching all consecutive whitespaces and we store that in a variable m
and then m.match(inp, start_index)
will match all consecutive whitespace character starting from the character at the start_index
of our inp
string. The .end()
will return the index of the first non-whitespace character it finds or in other words, .end()
will return last_whitespace_char_index + 1
where last_whitespace_char_index
is the index of the last whitespace character it finds.
Now that we have a function to skip all whitespaces from a particular position in our input string, returning the index of the first non-whitespace character it comes across, let’s continue with our parser function:
.... # everything above our parser function
def parse_json(json_string: str) -> dict | list:
start_index = skip_whitespace(json_string)
if start_index >= len(json_string):
# json string is empty
raise_invalid_json_error(json_string, start_index)
first_char = json_string[start_index]
if first_char not in ["{", "["]:
raise_invalid_json_error(json_string, start_index)
First, we skip any leading whitespace in the string to get the index of the first non-whitespace character. If the string is empty (i.e. it only has whitespaces), the returned index would be equal to the length of the string. In this case, we raise a Python Exception (we would define this shortly). If we do have non-whitespace characters, we ensure that the first character in our JSON is either a left curly brace or left bracket (as we’ve described in our rule 1 of valid JSON above) otherwise we raise an exception.
import re
class InvalidJSONError(Exception):
...
# ... skip_whitespace function is somewhere here
def raise_invalid_json_error(inp: str, index: int, message=None):
if message is None:
message = """Unexpected token at char({index}):
{inp}
{caret}"""
raise InvalidJSONError(message.format(index=index, inp=inp, caret=" " * index + "^"))
# ...... parse_json function is somewhere here
We define a custom exception class just for our parser. The raise_invalid_json_error function would raise this custom exception class and use a caret (^) to sort of pinpoint the character in the input string that’s causing this error.
def parse_json(json_string: str) -> dict|list:
# ... previous code goes here ....
try:
result, index = _parse_json(json_string, start_index)
index = skip_whitespace(json_string, index + 1)
if index < len(json_string):
raise_invalid_json_error(json_string, index)
return result
except IndexError:
raise_invalid_json_error(json_string, len(json_string))
After we’ve ensured that our JSON is either an object or an array, we parse the entire JSON string using the _parse_json
function we would define shortly. This function would recursively parse the string starting from the start_index
and it will return the resulting Python object from the JSON string and an index
. The returned index is the index of the last character that makes up the result in the input JSON string.
So for example:
inp = " {\"count\": 4 } "
the result would be {"count": 4}
and the index would be 16 because the }
is the last character that makes up our resulting Python object and it is at 16 in our input string.
The next line in our function skips any whitespaces occurring after our parsed JSON. We do this because in the next line, we check if the index returned from skipping the whitespaces is less than the length of the input string. If it is less, it means there is still at least one non-whitespace character in the string after parsing and therefore, the JSON string is not a valid JSON.
We wrap this section of the code in try and except, catching IndexError
because as we would see when we implement the rest of the code, our parser continuously moves forward through the input string until it makes sense of what it’s trying to parse. If it is unable to make sense of what it’s currently parsing by the end of the string, moving forward will raise the IndexError
.
# ... previous code here
def _parse_json(inp: str, index: int) -> Any:
first_char = inp[index]
if first_char == "{":
return parse_object(inp, index)
elif first_char == "[":
return parse_array(inp, index)
elif first_char == "\"":
return parse_string(inp, index)
elif first_char == "t":
return parse_true(inp, index)
elif first_char == "f":
return parse_false(inp, index)
elif first_char == "n":
return parse_null(inp, index)
else:
return parse_number(inp, index)
# ... parse_json function defined here
Like we’ve established before, the JSON grammar is an LL(1) grammar, therefore, we can determine what is expected by using 1 character. Our _parse_json
function takes the character at the provided index
of the inp
string and uses the right parser function to get the desired Python object. Every parse_{x}
function where {x}
is either object, array, string, true, false, null or number
returns the resulting Python object and the index of the last character that makes the result in the string.
Let’s start with the easy and straight forward parser functions. Shall we?
def parse_true(inp: str, index: int) -> tuple[bool, int]:
if inp[index] != "t":
raise raise_invalid_json_error(inp, index)
if inp[index: index+4] != "true":
raise_invalid_json_error(inp, index)
return True, index + 3
def parse_false(inp: str, index: int) -> tuple[bool, int]:
if inp[index] != "f":
raise raise_invalid_json_error(inp, index)
if inp[index: index+5] != "true":
raise_invalid_json_error(inp, index)
return False, index + 4
def parse_null(inp: str, index: int) -> tuple[None, int]:
if inp[index] != "n":
raise raise_invalid_json_error(inp, index)
if inp[index: index+4] != "null":
raise_invalid_json_error(inp, index)
return None, index + 3
When _parse_json
comes across the character t
or character f
or n
it knows to look for the JSON true
, false
or null
tokens respectively, it then uses the parse_true
, parse_false
or parse_null
functions respectively to parse the JSON true
to Python’s True
value, JSON false
to Python False
and JSON null
to Python None
. These parser functions first ensures that the character at index
of the inp
string does indeed start with a t
for parse_true
, f
for parse_false
or n
for parse_null
and then, they look forward into the input string by 3 characters for the true
and null
and by 4 characters for the false
.
The functions raise an exception if they find anything other than what they are looking for and when they do find what they are looking for, they return the desired Python value and the index of the last character that makes up the result.
def parse_number(inp: str, index: int) -> tuple[int | float, int]:
try:
int(inp[index])
result_string = ""
dot_count = 0
while inp[index] in "0123456789.":
result_string += inp[index]
if inp[index] == ".":
dot_count += 1
if dot_count > 1:
raise_invalid_json_error(inp, index)
if inp[index+1] not in "0123456789":
raise_invalid_json_error(inp, index)
index += 1
# loop will end when the character at 'index' is not a valid number
# the index for the last parsed character is index - 1
if dot_count > 0:
return float(result_string), index - 1
else:
return int(result_string), index - 1
except ValueError:
raise_invalid_json_error(inp, index)
When _parse_json
comes across a character it doesn’t recognize, it tries to parse it as a number. First, we ensure that the character at index
of the inp
string is indeed a number by doing int(inp[index])
. If it’s not a number, then we get ValueError
which will handle and raise the invalid JSON exception.
We instantiate 2 variables, result_string
to hold all the characters we’ve identified as numbers and dot_count
to count how many dots as in .
we’ve come across. A valid float should have only 1 .
.
We then start a while loop that loops through the characters starting from the current character and checks if this character is a number between 0 to 9 or a dot. If it is, we add the character to our result string and if it’s a dot, we increase our dot count and check if we now have more than a single dot.
If we have more than a single dot, we raise an exception, if not, we check the next character and ensure that it’s a number between 0 and 9; this is to ensure we don’t consider tokens like 456.
as a valid number - there must be a number after the dot. If there isn’t, we raise an exception and move on to the next character.
Our loop will terminate when the character at the current index
is not a number or a dot. This means we have to return the previous index for the function to comply with the rule of returning the index of the last character that makes up the result; in this case, the index of the last number. We also cast the result string into integer or float depending on the presence of a dot or not.
I am pretty sure there is a more elegant way to parse numbers 🙂
def parse_string(inp: str, index: int) -> tuple[str, int]:
if inp[index] != "\"":
# This is not a valid string
raise_invalid_json_error(inp, index)
start_index = index + 1
end_index = start_index
while inp[end_index] != "\"":
if inp[end_index] == "\\":
# if the character at that index is an escape sequence
# move end index by 2
end_index += 2
else:
end_index += 1
result = bytes(inp[start_index:end_index], "utf-8").decode("unicode_escape")
return result, end_index
To parse a string, first, we check to ensure that the input does indeed start with a "
at the index
. We then instantiate two variables to hold a start and end index for the first and last characters in the strings excluding the quotes "
. This follows the sliding window algorithm. At first, the end and start index are in the same position and then, we slide the end index by one character at every iteration checking to see if we have hit the closing quote "
.
The most interesting part of this whole loop is the escape sequence check, the escape character \
actually takes two positions and hence, we slide the index by two characters when we come across it.
At the end of the loop, we have our string content without the quotes and then, we need to properly handle any escape characters that might have been in the string. To do that, we convert the string into bytes and then decode with unicode_escape
. Finally, we return our string and the index of the closing quote.
def parse_object(inp: str, index: int) -> tuple[dict, int]:
if inp[index] != "{":
# This is not a valid json object
raise_invalid_json_error(inp, index)
result = {}
while inp[index] != "}":
# skip any white space between the open curly brace "{" and the key
key_start_index = skip_whitespace(inp, index + 1)
key, key_end_index = parse_string(inp, key_start_index)
# skip any white space between the key and the colon
colon_index = skip_whitespace(inp, key_end_index + 1)
if inp[colon_index] != ":":
# if the next item is not a colon the object is invalid
raise_invalid_json_error(inp, index)
# skip any white space between the colon index and the value
value_start_index = skip_whitespace(inp, colon_index + 1)
value, value_end_index = _parse_json(inp, value_start_index)
# store key and value in dictionary
result[key] = value
# skip any white space between the value and comma or closing brace
next_character_index = skip_whitespace(inp, value_end_index + 1)
if inp[next_character_index] == ",":
# if we have a comma it means there is another key value pair
# therefore the next character should not be a "}" otherwise
# the loop would end.
if inp[skip_whitespace(inp, next_character_index + 1)] == "}":
raise_invalid_json_error(inp, next_character_index + 1)
index = next_character_index
else:
# if it's not a comma then it should be a closing curly brace
if inp[next_character_index] != "}":
raise_invalid_json_error(inp, next_character_index)
index = next_character_index # this would terminate the loop
return result, index
To parse a JSON object, first, we ensure it does start with a left curly brace {
. We then instantiate our resulting Python dictionary and loop through every character starting from the index of the left curly brace {
. In every iteration, we check if we’ve hit the closing curly brace }
to indicate the end of the loop.
The JSON object follows a very simple structure. After the open curly brace, {
we must either have the close curly brace }
immediately (of course we are skipping all the whitespaces in between these tokens) for an empty object {}
or we must have a key value pair. The key value pair must be followed by a comma ,
to indicate that we have another key value pair or it must be followed by the closing curly brace }
to close the object. To parse a key value pair, we first parse the string for the key and then we ensure the next non-whitespace character after the key string is a colon :
and then, we ensure parsing the next non-whitespace character from the colon is a valid JSON value using _parse_json
function.
At the end of the function, we return the resulting Python dictionary and the index of the closing curly brace.
def parse_array(inp: str, index: int) -> tuple[list, int]:
if inp[index] != "[":
# This is not a valid json object
raise_invalid_json_error(inp, index)
result = []
while inp[index] != "]":
# skip white space between open bracket and the first value in array
value_start_index = skip_whitespace(inp, index + 1)
value, value_end_index = _parse_json(inp, value_start_index)
result.append(value)
next_character_index = skip_whitespace(inp, value_end_index + 1)
if inp[next_character_index] == ",":
# peek at the next c
if inp[skip_whitespace(inp, next_character_index + 1)] == "]":
raise_invalid_json_error(inp, next_character_index + 1)
index = next_character_index
else:
if inp[next_character_index] != "]":
raise_invalid_json_error(inp, next_character_index)
index = next_character_index
return result, index
Parsing a JSON array is quite similar to parsing a JSON object. After the open bracket [
, the next non-whitespace character should either be a closing bracket ]
for an empty array or a JSON value that we parse with _parse_json
.
After the JSON value, we should either have a comma ,
indicating that we have another item in the array, or we have the closing bracket ]
to close the array. At the end of the function, we would return the resulting Python list and the index of the closing bracket.
Now we have defined all our parser functions and our parser should now be complete. Here is the complete code:
import re
from typing import Any
class InvalidJSONError(Exception):
...
def skip_whitespace(inp: str, start_index: int = 0) -> int:
m = re.compile(r"\s*")
return m.match(inp, start_index).end()
def raise_invalid_json_error(inp: str, index: int, message=None):
if message is None:
message = """Unexpected token at char({index}):
{inp}
{caret}"""
raise InvalidJSONError(message.format(index=index, inp=inp, caret=" " * index + "^"))
def parse_string(inp: str, index: int) -> tuple[str, int]:
if inp[index] != "\"":
# This is not a valid string
raise_invalid_json_error(inp, index)
start_index = index + 1
end_index = start_index
while inp[end_index] != "\"":
if inp[end_index] == "\\":
# if the character at that index is an escape sequence
# move end index by 2
end_index += 2
else:
end_index += 1
result = bytes(inp[start_index:end_index], "utf-8").decode("unicode_escape")
return result, end_index
def parse_number(inp: str, index: int) -> tuple[int | float, int]:
try:
int(inp[index])
result_string = ""
dot_count = 0
while inp[index] in "0123456789.":
result_string += inp[index]
if inp[index] == ".":
dot_count += 1
if dot_count > 1:
raise_invalid_json_error(inp, index)
if inp[index+1] not in "0123456789":
raise_invalid_json_error(inp, index)
index += 1
# loop will end when the character at 'index' is not a valid number
# the index for the last parsed character is index - 1
if dot_count > 0:
return float(result_string), index - 1
else:
return int(result_string), index - 1
except ValueError:
raise_invalid_json_error(inp, index)
def parse_true(inp: str, index: int) -> tuple[bool, int]:
if inp[index] != "t":
raise raise_invalid_json_error(inp, index)
if inp[index: index+4] != "true":
raise_invalid_json_error(inp, index)
return True, index + 3
def parse_false(inp: str, index: int) -> tuple[bool, int]:
if inp[index] != "f":
raise raise_invalid_json_error(inp, index)
if inp[index: index+5] != "true":
raise_invalid_json_error(inp, index)
return False, index + 4
def parse_null(inp: str, index: int) -> tuple[None, int]:
if inp[index] != "n":
raise raise_invalid_json_error(inp, index)
if inp[index: index+4] != "null":
raise_invalid_json_error(inp, index)
return None, index + 3
def parse_object(inp: str, index: int) -> tuple[dict, int]:
if inp[index] != "{":
# This is not a valid json object
raise_invalid_json_error(inp, index)
result = {}
while inp[index] != "}":
# skip any white space between the open curly brace "{" and the key
key_start_index = skip_whitespace(inp, index + 1)
key, key_end_index = parse_string(inp, key_start_index)
# skip any white space between the key and the colon
colon_index = skip_whitespace(inp, key_end_index + 1)
if inp[colon_index] != ":":
# if the next item is not a colon the object is invalid
raise_invalid_json_error(inp, index)
# skip any white space between the colon index and the value
value_start_index = skip_whitespace(inp, colon_index + 1)
value, value_end_index = _parse_json(inp, value_start_index)
# store key and value in dictionary
result[key] = value
# skip any white space between the value and comma or closing brace
next_character_index = skip_whitespace(inp, value_end_index + 1)
if inp[next_character_index] == ",":
# if we have a comma it means there is another key value pair
# therefore the next character should not be a "}" otherwise
# the loop would end.
if inp[skip_whitespace(inp, next_character_index + 1)] == "}":
raise_invalid_json_error(inp, next_character_index + 1)
index = next_character_index
else:
# if it's not a comma then it should be a closing curly brace
if inp[next_character_index] != "}":
raise_invalid_json_error(inp, next_character_index)
index = next_character_index # this would terminate the loop
return result, index
def parse_array(inp: str, index: int) -> tuple[list, int]:
if inp[index] != "[":
# This is not a valid json object
raise_invalid_json_error(inp, index)
result = []
while inp[index] != "]":
# skip white space between open bracket and the first value in array
value_start_index = skip_whitespace(inp, index + 1)
value, value_end_index = _parse_json(inp, value_start_index)
result.append(value)
next_character_index = skip_whitespace(inp, value_end_index + 1)
if inp[next_character_index] == ",":
# peek at the next c
if inp[skip_whitespace(inp, next_character_index + 1)] == "]":
raise_invalid_json_error(inp, next_character_index + 1)
index = next_character_index
else:
if inp[next_character_index] != "]":
raise_invalid_json_error(inp, next_character_index)
index = next_character_index
return result, index
def _parse_json(inp: str, index: int) -> Any:
first_char = inp[index]
if first_char == "{":
return parse_object(inp, index)
elif first_char == "[":
return parse_array(inp, index)
elif first_char == "\"":
return parse_string(inp, index)
elif first_char == "t":
return parse_true(inp, index)
elif first_char == "f":
return parse_false(inp, index)
elif first_char == "n":
return parse_null(inp, index)
else:
return parse_number(inp, index)
def parse_json(json_string: str) -> dict|list:
start_index = skip_whitespace(json_string)
if start_index >= len(json_string):
# json string is empty
raise_invalid_json_error(json_string, start_index)
first_char = json_string[start_index]
if first_char not in ["{", "["]:
raise_invalid_json_error(json_string, start_index)
try:
result, index = _parse_json(json_string, start_index)
index = skip_whitespace(json_string, index + 1)
if index < len(json_string):
raise_invalid_json_error(json_string, index)
return result
except IndexError:
raise_invalid_json_error(json_string, len(json_string))
We can test that our JSON parser works correctly by:
if __name__ == "__main__":
import json
x = json.dumps({
"hello": "world\nworld\thi there",
"hi": 123,
"how": 56.8,
"are": [
1, 2, 3, 5.7,
"x", "y", "zed",
{"a": 1, "b": 2, "c": 3}
],
"you": {
"a": 56,
"bee": "hen",
"cee": 90
}
})
print(parse_json(x) == json.loads(x))
print(parse_json("""{"hello": "World", "hi": 34.5, "how": 456, "gee": null}"""))
We get the following output:
True
{'hello': 'World', 'hi': 34.5, 'how': 456, 'gee': None}
Conclusion
We have now implemented a simple JSON parser in Python using the recursive descent approach. By looking one character ahead we were able to determine the expected JSON token and parse it accordingly. You can find the full source code here.