Aggregating the best IT technical articles from China, sharing IT highlights, and helping IT professionals grow

  • 3914 views

    Get the number of rows for a parquet file

    We were using Pandas to get the number of rows in a Parquet file: import pandas as pd; df = pd.read_parquet("my.parquet"); print(df.shape[0]) This is easy but will cost a lot of ti...

    Category: Technical Articles | Time: 2021-12-17 09:52

  • 3488 views

    Get DDL of a table in BigQuery

    How can I conveniently get the CREATE statement (DDL) of a table in BigQuery? We can use INFORMATION_SCHEMA: SELECT table_name, ddl FROM `data-to-insights.taxi.INFORMATION_SCHEMA.TABLES` WHERE ta...

    Category: Technical Articles | Time: 2021-10-22 10:31

  • 3881 views

    Some hints on Dataproc

    When I ran a job on a Dataproc cluster, it reported: java.util.concurrent.ExecutionException: java.lang.ClassNotFoundException: Failed to find data source: BIGQUERY. The reason is I h...

    Category: Technical Articles | Time: 2021-09-03 12:20

  • 2440 views

    Recover truncated table in BigQuery

    If you accidentally truncate a table in BigQuery, you can follow this article to recover the data. Furthermore, I found out that the "bq cp project:dataset.table@-36000 project:dataset.table" m...

    Category: Technical Articles | Time: 2021-06-03 15:08
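The number after "@-" in the command above looks like a relative snapshot decorator. A minimal sketch, assuming the offset is expressed in milliseconds before now, as in BigQuery's legacy table decorators: the helpers below (snapshot_decorator and bq_cp_args are hypothetical names) only build the argument list and do not call bq.

```python
from datetime import timedelta


def snapshot_decorator(age):
    """Format a relative snapshot decorator: '@-' plus the age in whole milliseconds."""
    millis = age // timedelta(milliseconds=1)  # exact integer division of timedeltas
    if millis <= 0:
        raise ValueError("age must be a positive duration")
    return f"@-{millis}"


def bq_cp_args(source_table, dest_table, age):
    """Build (but do not run) a 'bq cp' command that copies an older snapshot."""
    return ["bq", "cp", source_table + snapshot_decorator(age), dest_table]


args = bq_cp_args("project:dataset.table", "project:dataset.table", timedelta(seconds=36))
command = " ".join(args)
print(command)
```

Under that assumption, -36000 corresponds to 36 seconds ago and one hour ago would be -3600000, so the offset has to reach back to before the truncation happened.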

  • 6522 views

    Change the schema of BigQuery tables

    We can easily add a new column to a table in BigQuery: ALTER TABLE mydataset.mytable ADD COLUMN new_col STRING; But when you want to delete or rename an existing column, there is no SQL to imp...

    Category: Technical Articles | Time: 2021-03-11 12:58

  • 4884 views

    How to gracefully end a PySpark application

    This article recommends using “return” to exit a PySpark application. But when I followed that advice, it reported an error: File "test.py", line 333 return ...

    Category: Technical Articles | Time: 2021-02-12 12:18
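The error text above is truncated, but assuming it is the SyntaxError Python raises for a "return" outside any function, a plain-Python sketch of the usual workaround is to move the job body into a function so that an early return becomes legal. The main function and its nothing_to_do flag are hypothetical stand-ins; no Spark session is created here.

```python
def main():
    """Entry point of a hypothetical job; an early `return` is legal inside a function."""
    nothing_to_do = True  # stand-in for a real check, such as an empty input
    if nothing_to_do:
        return 0  # leave early; a real job would call spark.stop() before this
    return 1


# A bare `return` at module level is rejected when the script is compiled.
try:
    compile("return", "test.py", "exec")
    top_level_return_ok = True
except SyntaxError:
    top_level_return_ok = False

exit_code = main()
print(top_level_return_ok, exit_code)
```

In a real script the last line would typically be sys.exit(main()) under an if __name__ == "__main__": guard, so the return value becomes the process exit code.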

  • 4261 views

    An old bug about PyArrow

    To save memory in my Pandas program, I changed the types of some columns from string to category, following the reference: df[["os_type", "cpu_type", "chip_brand"]] = d...

    Category: Technical Articles | Time: 2021-02-05 10:25
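The conversion in the excerpt above is truncated, so the following is only an illustrative sketch of the general idea, not the author's exact code: casting repetitive string columns to the category dtype and comparing memory before and after. The column names come from the excerpt; the values and row count are invented.

```python
import pandas as pd

cols = ["os_type", "cpu_type", "chip_brand"]

# Invented sample data: few distinct values repeated many times.
df = pd.DataFrame({
    "os_type": ["android", "ios"] * 5000,
    "cpu_type": ["arm64", "x86_64"] * 5000,
    "chip_brand": ["qualcomm", "apple"] * 5000,
})

before_bytes = int(df.memory_usage(deep=True).sum())

# astype with a dict casts only the listed columns and returns a new DataFrame.
df = df.astype({col: "category" for col in cols})

after_bytes = int(df.memory_usage(deep=True).sum())
print(before_bytes, after_bytes)
```

The saving comes from storing each distinct string once plus a small integer code per row, so it shrinks as the number of distinct values approaches the number of rows.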

  • 4052 views

    A few notes for Pandas and BigQuery

    1. Get the memory size of a Pandas DataFrame: df.memory_usage(deep=True).sum() 2. Upload a large Pandas DataFrame to a BigQuery table: If your DataFrame is too big, the uploading operation...

    Category: Technical Articles | Time: 2021-01-22 10:29

  • 4484 views

    Import date column in Pandas to BigQuery

    Imagine we have a small CSV file:
    name,enroll_time
    robin,2021-01-15 09:50:33
    tony,2021-01-14 01:50:33
    jaime,2021-01-13 00:50:33
    tyrion,2021-2-15 13:22:17
    bran,2022-3-16 14:00:01
    Let’s ...

    Category: Technical Articles | Time: 2021-01-15 12:26

  • 3935 views

    To solve the problem about pivot() of Pandas

    Below is an example from the official pandas documentation for pivot(): import pandas as pd; df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'], 'bar': ['A', 'B', 'C', 'A', 'B', 'C'],...

    Category: Technical Articles | Time: 2021-01-08 10:33

  • 6589 views

    Get the schema of a parquet file

    Previously I just used this snippet to get all the column names of a Parquet file: import pandas as pd; df = pd.read_parquet("hello.parquet"); print(list(df.columns)) But if the parq...

    Category: Technical Articles | Time: 2020-11-26 11:35

  • 4705 views

    Transfer Redshift SQL to BigQuery SQL

    In my recent work, I needed to run some SQL snippets from Redshift on Google’s BigQuery platform. Since different data warehouses use different SQL dialects, the transfer work couldn’...

    Category: Technical Articles | Time: 2020-08-20 15:01