Pandas Immutable DataFrame

I'm interested in an immutable DataFrame to use in a program as a reference table, with read-only behavior enforced once it has been built (in my case, in a class's __init__() method).

I see that index objects are frozen.
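For reference, the frozen behavior of index objects can be checked directly. This is a minimal sketch, assuming a reasonably recent pandas; the frame and variable names are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}, index=[10, 20, 30])

# Item assignment on an Index is rejected with a TypeError...
try:
    df.index[0] = 99
    index_is_frozen = False
except TypeError:
    index_is_frozen = True

# ...but the values of the DataFrame itself can still be changed in place.
df.loc[10, "a"] = -1
```

So the index elements are protected, while the data they label is not.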

Is there a way to make an entire DataFrame immutable?

+10
3 answers

Try something like this:

    import pandas as pd

    class Bla(object):
        def __init__(self):
            self._df = pd.DataFrame(index=[1, 2, 3])

        @property
        def df(self):
            # Hand out a copy so callers cannot modify the stored frame.
            return self._df.copy()

This lets you read the frame with b.df, but you cannot assign to it. In short, the class holds a DataFrame that behaves like an "immutable DataFrame" only in the sense that changes to the original are blocked. The returned object is still a mutable DataFrame, however, so it does not behave as immutable in other ways; for example, you cannot use it as a dictionary key.
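The behavior described above can be checked with a short sketch. The column name "a" and the flag variables are assumptions added for illustration; the class itself follows the answer:

```python
import pandas as pd

class Bla(object):
    def __init__(self):
        self._df = pd.DataFrame({"a": [1, 2, 3]})

    @property
    def df(self):
        return self._df.copy()

b = Bla()

# Changes made to the returned copy do not reach the stored frame.
copy_of_df = b.df
copy_of_df.loc[0, "a"] = 100
original_unchanged = b.df.loc[0, "a"] == 1

# The property has no setter, so rebinding b.df fails.
try:
    b.df = pd.DataFrame()
    can_assign = True
except AttributeError:
    can_assign = False

# The copy is an ordinary DataFrame and is therefore unhashable.
try:
    hash(b.df)
    is_hashable = True
except TypeError:
    is_hashable = False
```

Note that b._df is still reachable and mutable; the leading underscore protects it by convention only, and every access to b.df pays for a full copy.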

+5

If you really want the DataFrame to behave as immutable, rather than using @Joop's copy-based solution (which I would recommend), you could build on the following structure.

Please note that this is just a starting point.

Basically, it is a proxy object that hides everything that would change state and allows hashing; all instances wrapping the same source data will have the same hash. There are probably modules that do this more elegantly, but I figured it could be educational as an example.

Some warnings:

  • Depending on how the string representation of the proxied object is built, two different proxied objects can get the same hash. However, the implementation works with DataFrames, among other objects.

  • Changes to the source object will affect the proxy object.

  • Equality checks can lead to nasty unbounded recursion if the other object hands the equality question back (which is why list has a special case).

  • The DataFrame proxy helper is just a start. The problem is that any method that changes the state of the original object must be disallowed: either manually overridden by the helper, or masked entirely through the extraFilter parameter when creating the _ReadOnly instance. See DataFrameProxy.sort.

  • Proxies will not show up as derived from the proxied object's type (for example, an isinstance check against DataFrame will fail).
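The first warning can be made concrete. The sketch below hashes two different frames through their string form, as the proxy's constructor does, and forces a truncated repr with pd.option_context so the result does not depend on local display settings. The helper name repr_hash is hypothetical, and the content-based comparison with pd.util.hash_pandas_object is an alternative of my own choosing, not part of the answer:

```python
import hashlib
import pandas as pd

def repr_hash(obj):
    """Hash an object through its string representation."""
    return int(hashlib.md5(str(obj).encode("utf-8")).hexdigest(), 16)

a = pd.DataFrame({"x": range(1000)})
b = a.copy()
b.loc[500, "x"] = -1  # differs only in a row that a truncated repr hides

# With a truncated repr, both frames print identically and so hash identically.
with pd.option_context("display.max_rows", 10):
    hashes_collide = repr_hash(a) == repr_hash(b)

# Hashing the actual row contents does tell the two frames apart.
content_differs = not pd.util.hash_pandas_object(a).equals(
    pd.util.hash_pandas_object(b)
)
```

A repr-based hash is therefore only as reliable as the repr is complete, which matters if the proxy is used as a dictionary key for large frames.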

Generic read-only proxy

It can be used for any object.

    import hashlib   # replaces the Python 2 only md5 module
    import warnings  # used by DataFrameProxy further down

    class _ReadOnly(object):

        def __init__(self, obj, extraFilter=tuple()):
            # Write through __dict__ because __setattr__ is blocked below.
            self.__dict__['_obj'] = obj
            self.__dict__['_d'] = None
            self.__dict__['_extraFilter'] = extraFilter
            # The hash is derived from the string representation of obj.
            self.__dict__['_hash'] = int(
                hashlib.md5(str(obj).encode('utf-8')).hexdigest(), 16)

        @staticmethod
        def _cloak(obj):
            # Hashable objects are returned as-is; anything else gets wrapped.
            try:
                hash(obj)
                return obj
            except TypeError:
                return _ReadOnly(obj)

        def __getitem__(self, value):
            return _ReadOnly._cloak(self._obj[value])

        def __setitem__(self, key, value):
            raise TypeError(
                "{0} has a _ReadOnly proxy around it".format(type(self._obj)))

        def __delitem__(self, key):
            raise TypeError(
                "{0} has a _ReadOnly proxy around it".format(type(self._obj)))

        def __getattr__(self, value):
            if value in self.__dir__():
                return _ReadOnly._cloak(getattr(self._obj, value))
            elif value in dir(self._obj):
                raise AttributeError("{0} attribute {1} is cloaked".format(
                    type(self._obj), value))
            else:
                raise AttributeError("{0} has no {1}".format(
                    type(self._obj), value))

        def __setattr__(self, key, value):
            raise TypeError(
                "{0} has a _ReadOnly proxy around it".format(type(self._obj)))

        def __delattr__(self, key):
            raise TypeError(
                "{0} has a _ReadOnly proxy around it".format(type(self._obj)))

        def __dir__(self):
            # Expose everything except set* names and the extra filter.
            if self._d is None:
                self.__dict__['_d'] = [
                    i for i in dir(self._obj)
                    if not i.startswith('set') and i not in self._extraFilter]
            return self._d

        def __repr__(self):
            return self._obj.__repr__()

        def __call__(self, *args, **kwargs):
            if hasattr(self._obj, "__call__"):
                return self._obj(*args, **kwargs)
            else:
                raise TypeError("{0} not callable".format(type(self._obj)))

        def __hash__(self):
            return self._hash

        def __eq__(self, other):
            try:
                return hash(self) == hash(other)
            except TypeError:
                if isinstance(other, list):
                    # Lists are unhashable, so compare element by element.
                    try:
                        return all(a == b for a, b in zip(self, other))
                    except Exception:
                        return False
                return other == self

DataFrame Proxy

It should really be extended with more overrides like sort, and with filtering of all other state-changing methods.

You can instantiate it either with a DataFrame instance as the only argument, or with the same arguments you would pass to create a DataFrame.

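    import warnings
    import pandas as pd

    class DataFrameProxy(_ReadOnly):

        # Mutating methods that are hidden from the proxy entirely.
        EXTRA_FILTER = ('drop', 'drop_duplicates', 'dropna')

        def __init__(self, *args, **kwargs):
            # Wrap an existing DataFrame directly when it is the only argument.
            if (len(args) == 1 and not len(kwargs) and
                    isinstance(args[0], pd.DataFrame)):
                super(DataFrameProxy, self).__init__(
                    args[0], DataFrameProxy.EXTRA_FILTER)
            else:
                super(DataFrameProxy, self).__init__(
                    pd.DataFrame(*args, **kwargs), DataFrameProxy.EXTRA_FILTER)

        def sort(self, *args, **kwargs):
            # Never sort in place; always return a sorted copy.
            if kwargs.pop('inplace', False):
                warnings.warn("Inplace sorting overridden")
            # DataFrame.sort was removed in pandas 0.20; sort_values replaces it.
            return self._obj.sort_values(*args, **kwargs)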

Finally:

However, fun as it is to build this contraption, why not simply have a DataFrame that you do not modify? If it is only exposed to you, it is better to just make sure you do not change it...

+3

The StaticFrame package (of which I am the author) implements a Pandas-like interface and many common Pandas operations, while enforcing immutability in the underlying NumPy arrays and providing immutable Series and Frame containers.
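What immutability at the NumPy level means can be illustrated with plain NumPy. This is only a sketch of the general mechanism (the array's writeable flag) and is not taken from StaticFrame's own code:

```python
import numpy as np

values = np.array([1, 2, 3])
values.setflags(write=False)  # mark the array's buffer as read-only

# Any attempt to write into the array now raises a ValueError.
try:
    values[0] = 99
    write_blocked = False
except ValueError:
    write_blocked = True
```

A container built on such arrays cannot have its data changed in place, which is a stronger guarantee than handing out copies.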

You can make an entire Pandas DataFrame immutable by converting it to a StaticFrame Frame with static_frame.Frame.from_pandas(df). You can then use it as a truly read-only table.

See the StaticFrame documentation for this method: https://static-frame.readthedocs.io/en/latest/api_creation.html#static_frame.Series.from_pandas.

+2

Source: https://habr.com/ru/post/972752/

