Get the number of rows for a parquet file
We were using Pandas to get the number of rows for a parquet file:

import pandas as pd

df = pd.read_parquet("my.parquet")
print(df.shape[0])

This is easy but will cost a lot of time...
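A lighter-weight alternative, as a sketch (assuming pyarrow is installed), reads only the file's footer metadata instead of loading every row into memory:

import pyarrow.parquet as pq

# Only the parquet footer is read; the row data stays on disk
print(pq.ParquetFile("my.parquet").metadata.num_rows)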
How could I conveniently get the CREATE statement (DDL) of a table in BigQuery? We could use INFORMATION_SCHEMA:

SELECT table_name, ddl
FROM `data-to-insights.taxi.INFORMATION_SCHEMA.TABLES`
WHERE table_name ...
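To run that query programmatically, here is a minimal sketch with the google-cloud-bigquery client (the project and dataset names are taken from the snippet above):

from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT table_name, ddl
    FROM `data-to-insights.taxi.INFORMATION_SCHEMA.TABLES`
"""
# Each result row carries a table name and its full CREATE TABLE statement
for row in client.query(query).result():
    print(row.table_name, row.ddl)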
When running a job on a Dataproc cluster, it reported: java.util.concurrent.ExecutionException: java.lang.ClassNotFoundException: Failed to find data source: BIGQUERY. The reason is I h...
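The excerpt cuts off before the explanation, but this error usually means the spark-bigquery connector is missing from the job's classpath. A sketch of one way to attach it (the public connector jar path and the table name are assumptions, not from the post):

# Submit the job with the connector attached, e.g.:
#   gcloud dataproc jobs submit pyspark job.py \
#       --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-read").getOrCreate()

# "bigquery" only resolves as a data source once the connector jar is loaded
df = spark.read.format("bigquery").option("table", "project.dataset.table").load()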
If you accidentally truncate a table in BigQuery, you can try this article to recover the data. Furthermore, I found out that the "bq cp project:dataset.table@-36000 project:dataset.table" m...
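The @-36000 suffix is a relative snapshot decorator measured in milliseconds, i.e. the table as it was 36 seconds before now. BigQuery's standard SQL time travel exposes the same idea; a sketch, with the table name reused from the command above and the 36-second offset mirroring the decorator:

from google.cloud import bigquery

client = bigquery.Client()
# FOR SYSTEM_TIME AS OF reads the table as it existed at a past timestamp
query = """
    SELECT *
    FROM `project.dataset.table`
    FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 36 SECOND)
"""
rows = client.query(query).result()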
We can easily add a new column to a table in BigQuery:

ALTER TABLE mydataset.mytable ADD COLUMN new_col STRING

But when you want to delete or rename an existing column, there is no SQL to imp...
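One widely used workaround (an assumption about where the truncated post is heading, and "old_col" is a hypothetical column name) is to rewrite the table without the unwanted column using SELECT * EXCEPT:

from google.cloud import bigquery

client = bigquery.Client()
# Recreates the table with every column except old_col;
# note this scans the full table, so it has a query cost
query = """
    CREATE OR REPLACE TABLE mydataset.mytable AS
    SELECT * EXCEPT (old_col)
    FROM mydataset.mytable
"""
client.query(query).result()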
This article recommends using "return" to jump out of a PySpark application. But after I followed what it said, it reported an error:

File "test.py", line 333
    return ...
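The excerpt truncates the traceback, but a bare return at module level is a SyntaxError in Python: return is only legal inside a function. Two common ways out, as a sketch (some_condition is a placeholder flag), are wrapping the driver code in a main() function, or calling sys.exit() to stop the interpreter outright:

some_condition = True  # placeholder flag for illustration

def main():
    if some_condition:
        return  # legal here: "return" may only appear inside a function
    # ... rest of the PySpark driver logic ...

if __name__ == "__main__":
    main()
    # Alternatively: import sys; sys.exit(0) works even at module level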
To save memory for my program using Pandas, I changed the types of some columns from string to category, following the reference:

df[["os_type", "cpu_type", "chip_brand"]] = d...
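The excerpt cuts the assignment off; the usual pattern (a sketch with made-up sample data) casts those columns with astype:

import pandas as pd

df = pd.DataFrame({
    "os_type": ["android", "ios", "android"],
    "cpu_type": ["arm", "arm", "x86"],
    "chip_brand": ["qualcomm", "apple", "intel"],
})

# Low-cardinality string columns stored as categories use far less memory
df[["os_type", "cpu_type", "chip_brand"]] = (
    df[["os_type", "cpu_type", "chip_brand"]].astype("category")
)
print(df.memory_usage(deep=True))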
1. Get the memory size of a Pandas DataFrame

df.memory_usage(deep=True).sum()

2. Upload a large Pandas DataFrame to a BigQuery table

If your DataFrame is too big, the uploading operation...
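The second item truncates before its answer; one common route (an assumption, not necessarily the one the post takes, with df being the DataFrame measured in item 1) is the BigQuery client's load-job API, which uploads the DataFrame as a single load job rather than row-by-row inserts:

from google.cloud import bigquery

client = bigquery.Client()
# Runs a load job; requires pyarrow for the DataFrame conversion
job = client.load_table_from_dataframe(df, "mydataset.mytable")
job.result()  # block until the upload finishes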
Imagine we have a small CSV file:

name,enroll_time
robin,2021-01-15 09:50:33
tony,2021-01-14 01:50:33
jaime,2021-01-13 00:50:33
tyrion,2021-2-15 13:22:17
bran,2022-3-16 14:00:01

Let's ...
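The excerpt stops here, but presumably the post goes on to work with enroll_time as timestamps. A minimal sketch of loading it that way (the filename is assumed):

import pandas as pd

# parse_dates turns the enroll_time strings into datetime64 values
df = pd.read_csv("enroll.csv", parse_dates=["enroll_time"])
print(df.dtypes)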
Below is an example from the pandas official documentation for pivot():

import pandas as pd

df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'bar': ['A', 'B', 'C', 'A', 'B', 'C'],...
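For a runnable version of that example (the baz column and the pivot call are reconstructed from the pandas documentation, since the excerpt is cut off):

import pandas as pd

df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6]})

# Each unique value of bar becomes a column; foo becomes the index
print(df.pivot(index='foo', columns='bar', values='baz'))
#      A  B  C
# one  1  2  3
# two  4  5  6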
Previously I just used this snippet to get all the column names of a parquet file:

import pandas as pd

df = pd.read_parquet("hello.parquet")
print(list(df.columns))

But if the parq...
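The excerpt cuts off, but the obvious drawback is that read_parquet loads the whole file just to inspect its columns. A sketch of a metadata-only alternative with pyarrow:

import pyarrow.parquet as pq

# Reads only the schema from the parquet footer, not the data
print(pq.read_schema("hello.parquet").names)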
In my recent work, I needed to run some SQL snippets from Redshift on Google's BigQuery platform. Since different data warehouses have different recipes for SQL, the transferring work couldn...
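As one concrete illustration of those dialect gaps (my example, not the post's): Redshift's GETDATE() and DATEADD() must be rewritten for BigQuery. A sketch using the sqlglot library, assuming its Redshift and BigQuery dialects cover the functions involved:

import sqlglot

redshift_sql = "SELECT GETDATE(), DATEADD(day, 7, enroll_time) FROM users"
# Transpile from the Redshift dialect into BigQuery's
print(sqlglot.transpile(redshift_sql, read="redshift", write="bigquery")[0])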