Monday, 29 August 2022

Making sense out of PyArrow

I am browsing the tutorials and the documentation of PyArrow. I see some redundancies. For example, when reading a Parquet dataset (or folder), I could use either of the following:

import pyarrow.parquet
import pyarrow.dataset

type1 = pyarrow.parquet.ParquetDataset("Pqfolder/", use_legacy_dataset=False)
# or
type2 = pyarrow.dataset.dataset("Pqfolder/", format="parquet")

What are pyarrow.parquet and pyarrow.dataset? Are they modules of the pyarrow package? Where do I find the docs? It looks like pyarrow.dataset is explained in https://arrow.apache.org/docs/python/api/dataset.html and pyarrow.parquet in https://arrow.apache.org/docs/python/parquet.html. So I wonder why it is not pyarrow.api.dataset...

From what I understand, the API (pyarrow.dataset) also lets you filter the data with the scanner method, whereas with pyarrow.parquet I can only filter at the moment I read the file(s), using filters; after that, I can only read without filtering. Filtering with pyarrow.dataset is also richer thanks to expressions. So what is the point of having pyarrow.parquet if it can only do a subset of what pyarrow.dataset does (using a different notation)?

The issue here is that I have understood all of this through guessing and trial and error. Is this the standard way to learn a new library, or did I miss some docs? I think I am missing some basics in software design. I was wondering if anyone could point me to a reference on this.
