
Get the number of rows for a parquet file

2021-12-17 09:52

We were using pandas to get the number of rows in a Parquet file:

import pandas as pd

df = pd.read_parquet("my.parquet")
print(df.shape[0])

This is easy, but it costs a lot of time and memory when the Parquet file is very large, because pandas loads every column into a DataFrame before the rows can be counted. For example, reading a 10GB Parquet file may take more than 100GB of memory.

If we only need the number of rows and not the data itself, PyArrow is a better solution:

import pyarrow.parquet as pq

# Request no columns: only the row count is needed
table = pq.read_table("my.parquet", columns=[])
print(table.num_rows)

This method takes only a couple of seconds and uses about 2GB of memory for the same Parquet file.


