Write a Simple JSON Parser in Python

Shaibu Shaibu

Software Engineer

Published: Jan 24, 2024

Every developer must have at some point come across JSON (JavaScript Object Notation). JSON is arguably the most popular format by which data is transferred across the internet. Since the rise of client server architecture where we usually have a frontend browser or mobile application and a backend server application, the client constructs its request body as JSON and sends it to the backend, the backend also responds with JSON. So JSON is very popular and at some point we’ve all used them.

When the frontend constructs the JSON request payload and sends it over the network to the backend, the backend server receives this request payload as bytes and then depending on the web framework being used on the backend or programming language, we are able to parse these bytes into a more concrete/familiar data structure that can easily be used in that programming language. For example in Python, the JSON bytes are converted into pure Python objects, JSON objects to Python dictionaries, JSON arrays to Python lists, JSON true to Python True, JSON false to Python False, JSON null to Python None, JSON number to Python int or float and JSON strings to Python strings.

A lot of web frameworks and programming languages have this parsing capability built into them, or there’s usually a JSON library that can used to parse JSON bytes/strings into the programming language’s or web framework’s data structures.

The aim of this article is to demonstrate how to implement such parsers in pure Python. At the end of this article, we would have a function that takes a JSON string and outputs the equivalent Python object.

To follow along with this article, you only need to have Python3 installed and of course, an IDE like Visual Studio Code.

The JSON Grammar and Syntax

The JSON grammar is an LL(k) grammar or LL(1) grammar to be precise, which means you can determine with certainty what data structure should be produced from the very first character. This means that we can parse JSON using an LL parser with recursive descent approach.

Let's go...

Open your terminal and cd into any folder of your choice and run:

touch simple_json_parser.py
code simple_json_parser.py

touch simple_json_parser.py would create the Python file and code simple_json_parser.py would open the file in Vs Code.

Before we go further, let’s define clearly what our code wants to do.

We want to write a simple Python - let’s call it parse_json function that would take a Python string containing JSON data and return the equivalent Python object. Here is a conversion table:

JSON	Python
Object	Dictionary
Array	List
Null	None
true	True
false	False
number	int or float
string	string

Some very important rules to keep in mind for a valid JSON

A valid JSON must either be an object or an array.
JSON strings use double quotes.
JSON can be nested to any depth, I suppose.
JSON is whitespace agnostic meaning { "count" : 4} is the same as {"count":4}

Let’s define our parser function in our open file:

def parse_json(json_string: str ) => dict | list:
    ...

Since JSON is whitespace agnostic as we’ve established, we need a way to skip whitespaces while parsing our JSON string so we only have to deal with non-whitespace characters.

Let’s define a function to do that for us just above our parser function.

import re

def skip_whitespace(inp: str, start_index: int = 0) -> int:
    m = re.compile(r"\s*")
    return m.match(inp, start_index).end()

def parse_json(json_string: str ) -> dict | list:
    ...

Our skip_white_space function will take an input string inp and a start_index and return the index of the first non-whitespace character. We use regex to compile a pattern matching all consecutive whitespaces and we store that in a variable m and then m.match(inp, start_index) will match all consecutive whitespace character starting from the character at the start_index of our inp string. The .end() will return the index of the first non-whitespace character it finds or in other words, .end() will return last_whitespace_char_index + 1 where last_whitespace_char_index is the index of the last whitespace character it finds.

Now that we have a function to skip all whitespaces from a particular position in our input string, returning the index of the first non-whitespace character it comes across, let’s continue with our parser function:

.... # everything above our parser function
def parse_json(json_string: str) -> dict | list:
    start_index = skip_whitespace(json_string)
		if start_index >= len(json_string):
        # json string is empty
        raise_invalid_json_error(json_string, start_index)
		first_char = json_string[start_index]
    if first_char not in ["{", "["]:
        raise_invalid_json_error(json_string, start_index)

First, we skip any leading whitespace in the string to get the index of the first non-whitespace character. If the string is empty (i.e. it only has whitespaces), the returned index would be equal to the length of the string. In this case, we raise a Python Exception (we would define this shortly). If we do have non-whitespace characters, we ensure that the first character in our JSON is either a left curly brace or left bracket (as we’ve described in our rule 1 of valid JSON above) otherwise we raise an exception.

import re

class InvalidJSONError(Exception):
    ...

# ... skip_whitespace function is somewhere here


def raise_invalid_json_error(inp: str, index: int, message=None):
    if message is None:
        message = """Unexpected token at char({index}):
{inp}
{caret}"""
    raise InvalidJSONError(message.format(index=index, inp=inp, caret=" " * index + "^"))

# ...... parse_json function is somewhere here

We define a custom exception class just for our parser. The raise_invalid_json_error function would raise this custom exception class and use a caret (^) to sort of pinpoint the character in the input string that’s causing this error.

def parse_json(json_string: str) -> dict|list:
    # ... previous code goes here ....
    try:
        result, index = _parse_json(json_string, start_index)
        index = skip_whitespace(json_string, index + 1)
        if index < len(json_string):
            raise_invalid_json_error(json_string, index)
        return result
    except IndexError:
        raise_invalid_json_error(json_string, len(json_string))

After we’ve ensured that our JSON is either an object or an array, we parse the entire JSON string using the _parse_json function we would define shortly. This function would recursively parse the string starting from the start_index and it will return the resulting Python object from the JSON string and an index . The returned index is the index of the last character that makes up the result in the input JSON string.

So for example:

inp = " {\"count\": 4 } " the result would be {"count": 4} and the index would be 16 because the } is the last character that makes up our resulting Python object and it is at 16 in our input string.

The next line in our function skips any whitespaces occurring after our parsed JSON. We do this because in the next line, we check if the index returned from skipping the whitespaces is less than the length of the input string. If it is less, it means there is still at least one non-whitespace character in the string after parsing and therefore, the JSON string is not a valid JSON.

We wrap this section of the code in try and except, catching IndexError because as we would see when we implement the rest of the code, our parser continuously moves forward through the input string until it makes sense of what it’s trying to parse. If it is unable to make sense of what it’s currently parsing by the end of the string, moving forward will raise the IndexError .

# ... previous code here

def _parse_json(inp: str, index: int) -> Any:
    first_char = inp[index]
    if first_char == "{":
        return parse_object(inp, index)
    elif first_char == "[":
        return parse_array(inp, index)
    elif first_char == "\"":
        return parse_string(inp, index)
    elif first_char == "t":
        return parse_true(inp, index)
    elif first_char == "f":
        return parse_false(inp, index)
    elif first_char == "n":
        return parse_null(inp, index)
    else:
        return parse_number(inp, index)

# ... parse_json function defined here

Like we’ve established before, the JSON grammar is an LL(1) grammar, therefore, we can determine what is expected by using 1 character. Our _parse_json function takes the character at the provided index of the inp string and uses the right parser function to get the desired Python object. Every parse_{x} function where {x} is either object, array, string, true, false, null or number returns the resulting Python object and the index of the last character that makes the result in the string.

Let’s start with the easy and straight forward parser functions. Shall we?

def parse_true(inp: str, index: int) -> tuple[bool, int]:
    if inp[index] != "t":
        raise raise_invalid_json_error(inp, index)
    if inp[index: index+4] != "true":
        raise_invalid_json_error(inp, index)
    return True, index + 3


def parse_false(inp: str, index: int) -> tuple[bool, int]:
    if inp[index] != "f":
        raise raise_invalid_json_error(inp, index)
    if inp[index: index+5] != "true":
        raise_invalid_json_error(inp, index)
    return False, index + 4


def parse_null(inp: str, index: int) -> tuple[None, int]:
    if inp[index] != "n":
        raise raise_invalid_json_error(inp, index)
    if inp[index: index+4] != "null":
        raise_invalid_json_error(inp, index)
    return None, index + 3

When _parse_json comes across the character t or character f or n it knows to look for the JSON true, false or null tokens respectively, it then uses the parse_true , parse_false or parse_null functions respectively to parse the JSON true to Python’s True value, JSON false to Python False and JSON null to Python None. These parser functions first ensures that the character at index of the inp string does indeed start with a t for parse_true, f for parse_false or n for parse_null and then, they look forward into the input string by 3 characters for the true and null and by 4 characters for the false.

The functions raise an exception if they find anything other than what they are looking for and when they do find what they are looking for, they return the desired Python value and the index of the last character that makes up the result.

def parse_number(inp: str, index: int) -> tuple[int | float, int]:
    try:
        int(inp[index])
        result_string = ""
        dot_count = 0
        while inp[index] in "0123456789.":
            result_string += inp[index]
            if inp[index] == ".":
                dot_count += 1
                if dot_count > 1:
                    raise_invalid_json_error(inp, index)
                if inp[index+1] not in "0123456789":
                    raise_invalid_json_error(inp, index)
            index += 1
        # loop will end when the character at 'index' is not a valid number
        # the index for the last parsed character is index - 1
        if dot_count > 0:
            return float(result_string), index - 1
        else:
            return int(result_string), index - 1
    except ValueError:
        raise_invalid_json_error(inp, index)

When _parse_json comes across a character it doesn’t recognize, it tries to parse it as a number. First, we ensure that the character at index of the inp string is indeed a number by doing int(inp[index]) . If it’s not a number, then we get ValueError which will handle and raise the invalid JSON exception.

We instantiate 2 variables, result_string to hold all the characters we’ve identified as numbers and dot_count to count how many dots as in . we’ve come across. A valid float should have only 1 . .

We then start a while loop that loops through the characters starting from the current character and checks if this character is a number between 0 to 9 or a dot. If it is, we add the character to our result string and if it’s a dot, we increase our dot count and check if we now have more than a single dot.

If we have more than a single dot, we raise an exception, if not, we check the next character and ensure that it’s a number between 0 and 9; this is to ensure we don’t consider tokens like 456. as a valid number - there must be a number after the dot. If there isn’t, we raise an exception and move on to the next character.

Our loop will terminate when the character at the current index is not a number or a dot. This means we have to return the previous index for the function to comply with the rule of returning the index of the last character that makes up the result; in this case, the index of the last number. We also cast the result string into integer or float depending on the presence of a dot or not.

I am pretty sure there is a more elegant way to parse numbers 🙂

def parse_string(inp: str, index: int) -> tuple[str, int]:
    if inp[index] != "\"":
        # This is not a valid string
        raise_invalid_json_error(inp, index)
    start_index = index + 1
    end_index = start_index
    while inp[end_index] != "\"":
        if inp[end_index] == "\\":
            # if the character at that index is an escape sequence
            # move end index by 2
            end_index += 2
        else:
            end_index += 1
    result = bytes(inp[start_index:end_index], "utf-8").decode("unicode_escape")
    return result, end_index

To parse a string, first, we check to ensure that the input does indeed start with a " at the index . We then instantiate two variables to hold a start and end index for the first and last characters in the strings excluding the quotes " . This follows the sliding window algorithm. At first, the end and start index are in the same position and then, we slide the end index by one character at every iteration checking to see if we have hit the closing quote ".

The most interesting part of this whole loop is the escape sequence check, the escape character \ actually takes two positions and hence, we slide the index by two characters when we come across it.

At the end of the loop, we have our string content without the quotes and then, we need to properly handle any escape characters that might have been in the string. To do that, we convert the string into bytes and then decode with unicode_escape. Finally, we return our string and the index of the closing quote.

def parse_object(inp: str, index: int) -> tuple[dict, int]:
    if inp[index] != "{":
        # This is not a valid json object
        raise_invalid_json_error(inp, index)
    result = {}
    while inp[index] != "}":
        # skip any white space between the open curly brace "{" and the key
        key_start_index = skip_whitespace(inp, index + 1)
        key, key_end_index = parse_string(inp, key_start_index)
        # skip any white space between the key and the colon
        colon_index = skip_whitespace(inp, key_end_index + 1)
        if inp[colon_index] != ":":
            # if the next item is not a colon the object is invalid
            raise_invalid_json_error(inp, index)
        # skip any white space between the colon index and the value
        value_start_index = skip_whitespace(inp, colon_index + 1)
        value, value_end_index = _parse_json(inp, value_start_index)
        # store key and value in dictionary
        result[key] = value
        # skip any white space between the value and comma or closing brace
        next_character_index = skip_whitespace(inp, value_end_index + 1)
        if inp[next_character_index] == ",":
            # if we have a comma it means there is another key value pair
            # therefore the next character should not be a "}" otherwise
            # the loop would end.
            if inp[skip_whitespace(inp, next_character_index + 1)] == "}":
                raise_invalid_json_error(inp, next_character_index + 1)
            index = next_character_index
        else:
            # if it's not a comma then it should be a closing curly brace
            if inp[next_character_index] != "}":
                raise_invalid_json_error(inp, next_character_index)
            index = next_character_index  # this would terminate the loop
    return result, index

To parse a JSON object, first, we ensure it does start with a left curly brace {. We then instantiate our resulting Python dictionary and loop through every character starting from the index of the left curly brace { . In every iteration, we check if we’ve hit the closing curly brace } to indicate the end of the loop.

The JSON object follows a very simple structure. After the open curly brace, { we must either have the close curly brace } immediately (of course we are skipping all the whitespaces in between these tokens) for an empty object {} or we must have a key value pair. The key value pair must be followed by a comma , to indicate that we have another key value pair or it must be followed by the closing curly brace } to close the object. To parse a key value pair, we first parse the string for the key and then we ensure the next non-whitespace character after the key string is a colon : and then, we ensure parsing the next non-whitespace character from the colon is a valid JSON value using _parse_json function.

At the end of the function, we return the resulting Python dictionary and the index of the closing curly brace.

def parse_array(inp: str, index: int) -> tuple[list, int]:
    if inp[index] != "[":
        # This is not a valid json object
        raise_invalid_json_error(inp, index)
    result = []
    while inp[index] != "]":
        # skip white space between open bracket and the first value in array
        value_start_index = skip_whitespace(inp, index + 1)
        value, value_end_index = _parse_json(inp, value_start_index)
        result.append(value)
        next_character_index = skip_whitespace(inp, value_end_index + 1)
        if inp[next_character_index] == ",":
            # peek at the next c
            if inp[skip_whitespace(inp, next_character_index + 1)] == "]":
                raise_invalid_json_error(inp, next_character_index + 1)
            index = next_character_index
        else:
            if inp[next_character_index] != "]":
                raise_invalid_json_error(inp, next_character_index)
            index = next_character_index
    return result, index

Parsing a JSON array is quite similar to parsing a JSON object. After the open bracket [, the next non-whitespace character should either be a closing bracket ] for an empty array or a JSON value that we parse with _parse_json .

After the JSON value, we should either have a comma , indicating that we have another item in the array, or we have the closing bracket ] to close the array. At the end of the function, we would return the resulting Python list and the index of the closing bracket.

Now we have defined all our parser functions and our parser should now be complete. Here is the complete code:

import re
from typing import Any


class InvalidJSONError(Exception):
    ...


def skip_whitespace(inp: str, start_index: int = 0) -> int:
    m = re.compile(r"\s*")
    return m.match(inp, start_index).end()


def raise_invalid_json_error(inp: str, index: int, message=None):
    if message is None:
        message = """Unexpected token at char({index}):
{inp}
{caret}"""
    raise InvalidJSONError(message.format(index=index, inp=inp, caret=" " * index + "^"))


def parse_string(inp: str, index: int) -> tuple[str, int]:
    if inp[index] != "\"":
        # This is not a valid string
        raise_invalid_json_error(inp, index)
    start_index = index + 1
    end_index = start_index
    while inp[end_index] != "\"":
        if inp[end_index] == "\\":
            # if the character at that index is an escape sequence
            # move end index by 2
            end_index += 2
        else:
            end_index += 1
    result = bytes(inp[start_index:end_index], "utf-8").decode("unicode_escape")
    return result, end_index


def parse_number(inp: str, index: int) -> tuple[int | float, int]:
    try:
        int(inp[index])
        result_string = ""
        dot_count = 0
        while inp[index] in "0123456789.":
            result_string += inp[index]
            if inp[index] == ".":
                dot_count += 1
                if dot_count > 1:
                    raise_invalid_json_error(inp, index)
                if inp[index+1] not in "0123456789":
                    raise_invalid_json_error(inp, index)
            index += 1
        # loop will end when the character at 'index' is not a valid number
        # the index for the last parsed character is index - 1
        if dot_count > 0:
            return float(result_string), index - 1
        else:
            return int(result_string), index - 1
    except ValueError:
        raise_invalid_json_error(inp, index)


def parse_true(inp: str, index: int) -> tuple[bool, int]:
    if inp[index] != "t":
        raise raise_invalid_json_error(inp, index)
    if inp[index: index+4] != "true":
        raise_invalid_json_error(inp, index)
    return True, index + 3


def parse_false(inp: str, index: int) -> tuple[bool, int]:
    if inp[index] != "f":
        raise raise_invalid_json_error(inp, index)
    if inp[index: index+5] != "true":
        raise_invalid_json_error(inp, index)
    return False, index + 4


def parse_null(inp: str, index: int) -> tuple[None, int]:
    if inp[index] != "n":
        raise raise_invalid_json_error(inp, index)
    if inp[index: index+4] != "null":
        raise_invalid_json_error(inp, index)
    return None, index + 3


def parse_object(inp: str, index: int) -> tuple[dict, int]:
    if inp[index] != "{":
        # This is not a valid json object
        raise_invalid_json_error(inp, index)
    result = {}
    while inp[index] != "}":
        # skip any white space between the open curly brace "{" and the key
        key_start_index = skip_whitespace(inp, index + 1)
        key, key_end_index = parse_string(inp, key_start_index)
        # skip any white space between the key and the colon
        colon_index = skip_whitespace(inp, key_end_index + 1)
        if inp[colon_index] != ":":
            # if the next item is not a colon the object is invalid
            raise_invalid_json_error(inp, index)
        # skip any white space between the colon index and the value
        value_start_index = skip_whitespace(inp, colon_index + 1)
        value, value_end_index = _parse_json(inp, value_start_index)
        # store key and value in dictionary
        result[key] = value
        # skip any white space between the value and comma or closing brace
        next_character_index = skip_whitespace(inp, value_end_index + 1)
        if inp[next_character_index] == ",":
            # if we have a comma it means there is another key value pair
            # therefore the next character should not be a "}" otherwise
            # the loop would end.
            if inp[skip_whitespace(inp, next_character_index + 1)] == "}":
                raise_invalid_json_error(inp, next_character_index + 1)
            index = next_character_index
        else:
            # if it's not a comma then it should be a closing curly brace
            if inp[next_character_index] != "}":
                raise_invalid_json_error(inp, next_character_index)
            index = next_character_index  # this would terminate the loop
    return result, index


def parse_array(inp: str, index: int) -> tuple[list, int]:
    if inp[index] != "[":
        # This is not a valid json object
        raise_invalid_json_error(inp, index)
    result = []
    while inp[index] != "]":
        # skip white space between open bracket and the first value in array
        value_start_index = skip_whitespace(inp, index + 1)
        value, value_end_index = _parse_json(inp, value_start_index)
        result.append(value)
        next_character_index = skip_whitespace(inp, value_end_index + 1)
        if inp[next_character_index] == ",":
            # peek at the next c
            if inp[skip_whitespace(inp, next_character_index + 1)] == "]":
                raise_invalid_json_error(inp, next_character_index + 1)
            index = next_character_index
        else:
            if inp[next_character_index] != "]":
                raise_invalid_json_error(inp, next_character_index)
            index = next_character_index
    return result, index


def _parse_json(inp: str, index: int) -> Any:
    first_char = inp[index]
    if first_char == "{":
        return parse_object(inp, index)
    elif first_char == "[":
        return parse_array(inp, index)
    elif first_char == "\"":
        return parse_string(inp, index)
    elif first_char == "t":
        return parse_true(inp, index)
    elif first_char == "f":
        return parse_false(inp, index)
    elif first_char == "n":
        return parse_null(inp, index)
    else:
        return parse_number(inp, index)


def parse_json(json_string: str) -> dict|list:
    start_index = skip_whitespace(json_string)
    if start_index >= len(json_string):
        # json string is empty
        raise_invalid_json_error(json_string, start_index)
    first_char = json_string[start_index]
    if first_char not in ["{", "["]:
        raise_invalid_json_error(json_string, start_index)
    try:
        result, index = _parse_json(json_string, start_index)
        index = skip_whitespace(json_string, index + 1)
        if index < len(json_string):
            raise_invalid_json_error(json_string, index)
        return result
    except IndexError:
        raise_invalid_json_error(json_string, len(json_string))

We can test that our JSON parser works correctly by:

if __name__ == "__main__":
    import json
    x = json.dumps({
        "hello": "world\nworld\thi there",
        "hi": 123,
        "how": 56.8,
        "are": [
            1, 2, 3, 5.7,
            "x", "y", "zed",
            {"a": 1, "b": 2, "c": 3}
        ],
        "you": {
            "a": 56,
            "bee": "hen",
            "cee": 90
        }
    })
    print(parse_json(x) == json.loads(x))
    print(parse_json("""{"hello": "World", "hi": 34.5, "how": 456, "gee": null}"""))

We get the following output:

True
{'hello': 'World', 'hi': 34.5, 'how': 456, 'gee': None}

Conclusion

We have now implemented a simple JSON parser in Python using the recursive descent approach. By looking one character ahead we were able to determine the expected JSON token and parse it accordingly. You can find the full source code here.