My question can be described as follows:
What is the best practice for serializing/deserializing complex Python objects to/from JSON in a way that accounts for subclassing and prevents the same object from being stored multiple times (assuming we know how to distinguish between different instances of the same class)?
In a nutshell, I'm writing a small scientific library and want people to use it. After watching Raymond Hettinger's talk "Python's Class Development Toolkit", I decided that it would be a good exercise for me to implement subclassing-aware behaviour. So far it has gone fine, but now I've hit the JSON serialization task.
Until now I've looked around and found the following about JSON serialization in Python:
The two main obstacles I have are accounting for possible subclassing and storing a single copy per instance.
After several attempts to solve this in pure Python without changing the JSON representation of the object, I came to understand that at deserialization time there is no way to know which subclass the serialized instance belonged to. So some mention of it has to be made in the JSON itself, and I've ended up with something like this:
import json


class MyClassJSONEncoder(json.JSONEncoder):

    @classmethod
    def represent_object(cls, obj):
        """
        This is a way to serialize all built-ins as is, and all complex objects as their id, which is hash(obj) in this implementation
        """
        if isinstance(obj, (int, float, str, bool)) or obj is None:
            return obj
        elif isinstance(obj, (list, dict, tuple)):
            return cls.represent_iterable(obj)
        else:
            return hash(obj)

    @classmethod
    def represent_iterable(cls, iterable):
        """
        JSON supports iterables, so they shall be processed
        """
        if isinstance(iterable, (list, tuple)):
            return [cls.represent_object(value) for value in iterable]
        elif isinstance(iterable, dict):
            return {cls.represent_object(key): cls.represent_object(value) for key, value in iterable.items()}

    def default(self, obj):
        if isinstance(obj, MyClass):
            result = {"MyClass_id": hash(obj),
                      "py__class__": ":".join([obj.__class__.__module__, obj.__class__.__qualname__])}
            # iterate over the attributes of the object being encoded, not of the encoder
            for attr, value in obj.__dict__.items():
                result[attr] = self.represent_object(value)
            return result
        return super().default(obj)  # accounting for JSONEncoder subclassing
Here, subclassing is accounted for by this entry:
    "py__class__": ":".join([obj.__class__.__module__, obj.__class__.__qualname__])
The JSONDecoder is to be implemented as follows:
import importlib


class MyClassJSONDecoder(json.JSONDecoder):

    def decode(self, data):
        if isinstance(data, str):
            data = super().decode(data)
        if "py__class__" in data:
            module_name, class_name = data["py__class__"].split(":")
            # import the module named in the tag and look the class up in it
            object_class = getattr(importlib.import_module(module_name), class_name)
        else:
            object_class = MyClass
        # drop the bookkeeping keys (the "*_id" key and the class tag) before calling the constructor
        data = {key: value for key, value in data.items() if not key.endswith("_id") and key != "py__class__"}
        return object_class(**data)
As can be seen, we account for possible subclassing with a "py__class__" attribute in the JSON representation of the object. If no such attribute is present (this can be the case if the JSON was generated by another program, say in C++, that just wants to pass us information about a plain MyClass object and doesn't care about inheritance), the default approach of creating an instance of MyClass is pursued. This is, by the way, the reason why a single JSONDecoder cannot be created for all objects: it has to have a default class to instantiate if no py__class__ is specified.
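To make the tag handling above concrete, here is a minimal sketch of building the "module:qualname" tag and resolving it back to a class, with a fallback to a default class when the tag is missing. The helper names class_tag and resolve_class are hypothetical and are not part of my library. Standard-library classes stand in for MyClass and its subclasses so that the snippet runs on its own, and the qualname is walked part by part on the assumption that nested classes should resolve too.

```python
import importlib
from collections import OrderedDict
from fractions import Fraction


def class_tag(obj):
    """Build the "module:qualname" string stored under "py__class__"."""
    cls = type(obj)
    return ":".join([cls.__module__, cls.__qualname__])


def resolve_class(data, default):
    """Return the class named by data["py__class__"], or `default` if the tag is absent."""
    tag = data.get("py__class__")
    if tag is None:
        return default
    module_name, qualname = tag.split(":")
    target = importlib.import_module(module_name)
    # Walk dotted qualnames so that nested classes resolve too.
    for part in qualname.split("."):
        target = getattr(target, part)
    return target


# OrderedDict and Fraction are stand-ins for a subclass and the default class.
tagged = {"py__class__": class_tag(OrderedDict())}
untagged = {"numerator": 1, "denominator": 2}

print(resolve_class(tagged, default=Fraction))
print(resolve_class(untagged, default=Fraction))
```

Note that importing whatever module the JSON names is only reasonable when the JSON comes from a trusted source.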
As for keeping a single copy of every instance: each object is serialized with a special JSON key, MyClass_id, and all attribute values are serialized as primitives (lists, tuples, dicts, and built-ins are preserved, while a complex object that is the value of some attribute is stored only as its hash). Storing object hashes this way allows each object to be serialized exactly once; then, knowing the structure of the object being decoded from its JSON representation, the decoder can look up the respective objects and assign them afterwards. The following example illustrates this:
class MyClass(object):
    json_encoder = MyClassJSONEncoder()
    json_decoder = MyClassJSONDecoder()

    def __init__(self, attr1):
        self.attr1 = attr1
        # complex_object_1 and complex_object_2 stand for instances of some ComplexObject class
        self.attr2 = [complex_object_1, complex_object_2]

    def to_json(self, top_level=None):
        if top_level is None:
            top_level = {}
        top_level["my_class"] = self.json_encoder.encode(self)
        top_level["complex_objects"] = [obj.to_json(top_level=top_level) for obj in self.attr2]
        return top_level

    @classmethod
    def from_json(cls, data, class_specific_data=None):
        if isinstance(data, str):
            data = json.loads(data)
        if class_specific_data is None:
            class_specific_data = data["my_class"]  # I know the flat structure of json, and I know the attribute name, this class will be stored
        result = cls.json_decoder.decode(class_specific_data)
        # repopulate complex valued attributes with real python objects
        # rather than their id aliases
        complex_objects = {co_data["ComplexObject_id"]: ComplexObject.from_json(data, class_specific_data=co_data) for co_data in data["complex_objects"]}
        # after decoding, attr2 holds the ids of the complex objects
        result.attr2 = [complex_objects[c_o_id] for c_o_id in result.attr2]
        # finish such repopulation
        return result
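The idea of storing each object once and referring to it by id can also be shown in a stripped-down, self-contained form. This is only a sketch under assumptions: Node, dump_flat and load_flat are hypothetical names, and a per-dump counter is used as the id instead of hash(obj), which is a swapped-in choice rather than what my code above does.

```python
import json


class Node:
    """Hypothetical class: `name` is a primitive, `children` holds other Node objects."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def dump_flat(root):
    """Serialize every reachable Node exactly once, keyed by a per-dump id."""
    ids = {}      # id(obj) -> key used in the JSON document
    records = {}  # key -> JSON-ready record

    def visit(node):
        if id(node) in ids:          # already stored: emit only the reference
            return ids[id(node)]
        key = str(len(ids))
        ids[id(node)] = key          # register before recursing, so cycles terminate
        records[key] = {"name": node.name, "children": []}
        records[key]["children"] = [visit(child) for child in node.children]
        return key

    return json.dumps({"root": visit(root), "objects": records})


def load_flat(text):
    """Rebuild the object graph in two passes."""
    data = json.loads(text)
    # Pass 1: create each object once, without its references.
    objects = {key: Node(rec["name"]) for key, rec in data["objects"].items()}
    # Pass 2: replace the id aliases with the real objects.
    for key, rec in data["objects"].items():
        objects[key].children = [objects[ref] for ref in rec["children"]]
    return objects[data["root"]]


shared = Node("shared")
root = Node("root", [Node("a", [shared]), Node("b", [shared])])
text = dump_flat(root)
restored = load_flat(text)
```

Here the shared node appears once under "objects", and after loading both parents point at the same Python object again.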
Is this even the right way to go? Is there a more robust way? Have I missed some programming pattern that should be applied in this particular situation?
I just really want to understand what the most correct and Pythonic way is to implement JSON serialization that accounts for subclassing and also prevents multiple copies of the same object from being stored.
Thanks in advance!