I am browsing the tutorials and the documentation of PyArrow. I see some redundancies. For example, when reading a Parquet dataset (or folder) I could use either of these:
import pyarrow.parquet
import pyarrow.dataset

type1 = pyarrow.parquet.ParquetDataset("Pqfolder/", use_legacy_dataset=False)
# or
type2 = pyarrow.dataset.dataset('Pqfolder/', format='parquet')
What are pyarrow.parquet and pyarrow.dataset? Are they modules of the pyarrow package? Where do I find the docs? It looks like pyarrow.dataset is explained in https://arrow.apache.org/docs/python/api/dataset.html and pyarrow.parquet in https://arrow.apache.org/docs/python/parquet.html, so I wonder why it is not pyarrow.api.dataset...
From what I understand, the dataset API (pyarrow.dataset) also lets you filter the data with the scanner method, while with pyarrow.parquet I can only filter when I read the file(s), via filters; after that I can only read without filtering. Also, filtering is richer thanks to expressions... So, what's the point of having pyarrow.parquet if it can only do a subset of what pyarrow.dataset does (using a different notation)?
The issue here is that I have worked all this out by guessing and by trial and error. Is this the standard way to learn a new library, or did I miss some docs? I think I am missing some basics in software design. Could anyone point me to a reference about this?