10 - Parquet Crawler
awswrangler can extract only the metadata from Parquet files and Partitions and then add it to the Glue Catalog.
Enter your bucket name:
Creating a Parquet Table from the NOAA’s CSV files
Reference
|
id |
dt |
element |
value |
m_flag |
q_flag |
s_flag |
obs_time |
0 |
AGE00135039 |
1890-01-01 |
TMAX |
160 |
NaN |
NaN |
E |
NaN |
1 |
AGE00135039 |
1890-01-01 |
TMIN |
30 |
NaN |
NaN |
E |
NaN |
2 |
AGE00135039 |
1890-01-01 |
PRCP |
45 |
NaN |
NaN |
E |
NaN |
3 |
AGE00147705 |
1890-01-01 |
TMAX |
140 |
NaN |
NaN |
E |
NaN |
4 |
AGE00147705 |
1890-01-01 |
TMIN |
74 |
NaN |
NaN |
E |
NaN |
... |
... |
... |
... |
... |
... |
... |
... |
... |
29249753 |
UZM00038457 |
1899-12-31 |
PRCP |
16 |
NaN |
NaN |
r |
NaN |
29249754 |
UZM00038457 |
1899-12-31 |
TAVG |
-73 |
NaN |
NaN |
r |
NaN |
29249755 |
UZM00038618 |
1899-12-31 |
TMIN |
-76 |
NaN |
NaN |
r |
NaN |
29249756 |
UZM00038618 |
1899-12-31 |
PRCP |
0 |
NaN |
NaN |
r |
NaN |
29249757 |
UZM00038618 |
1899-12-31 |
TAVG |
-60 |
NaN |
NaN |
r |
NaN |
29249758 rows × 8 columns
|
id |
dt |
element |
value |
m_flag |
q_flag |
s_flag |
obs_time |
year |
0 |
AGE00135039 |
1890-01-01 |
TMAX |
160 |
NaN |
NaN |
E |
NaN |
1890 |
1 |
AGE00135039 |
1890-01-01 |
TMIN |
30 |
NaN |
NaN |
E |
NaN |
1890 |
2 |
AGE00135039 |
1890-01-01 |
PRCP |
45 |
NaN |
NaN |
E |
NaN |
1890 |
['year=1890/06a519afcf8e48c9b08c8908f30adcfe.snappy.parquet',
'year=1891/5a99c28dbef54008bfc770c946099e02.snappy.parquet',
'year=1892/9b1ea5d1cfad40f78c920f93540ca8ec.snappy.parquet',
'year=1893/92259b49c134401eaf772506ee802af6.snappy.parquet',
'year=1894/c734469ffff944f69dc277c630064a16.snappy.parquet',
'year=1895/cf7ccde86aaf4d138f86c379c0817aa6.snappy.parquet',
'year=1896/ce02f4c2c554438786b766b33db451b6.snappy.parquet',
'year=1897/e04de04ad3c444deadcc9c410ab97ca1.snappy.parquet',
'year=1898/acb0e02878f04b56a6200f4b5a97be0e.snappy.parquet',
'year=1899/a269bdbb0f6a48faac55f3bcfef7df7a.snappy.parquet']
Crawling!
CPU times: user 1.81 s, sys: 528 ms, total: 2.33 s
Wall time: 3.21 s
Checking
|
Column Name |
Type |
Partition |
Comment |
0 |
id |
string |
False |
|
1 |
dt |
timestamp |
False |
|
2 |
element |
string |
False |
|
3 |
value |
bigint |
False |
|
4 |
m_flag |
string |
False |
|
5 |
q_flag |
string |
False |
|
6 |
s_flag |
string |
False |
|
7 |
obs_time |
string |
False |
|
8 |
year |
int |
True |
|
CPU times: user 3.52 s, sys: 811 ms, total: 4.33 s
Wall time: 9.6 s
|
id |
dt |
element |
value |
m_flag |
q_flag |
s_flag |
obs_time |
year |
0 |
USC00195145 |
1890-01-01 |
TMIN |
-28 |
<NA> |
<NA> |
6 |
<NA> |
1890 |
1 |
USC00196770 |
1890-01-01 |
PRCP |
0 |
P |
<NA> |
6 |
<NA> |
1890 |
2 |
USC00196770 |
1890-01-01 |
SNOW |
0 |
<NA> |
<NA> |
6 |
<NA> |
1890 |
3 |
USC00196915 |
1890-01-01 |
PRCP |
0 |
P |
<NA> |
6 |
<NA> |
1890 |
4 |
USC00196915 |
1890-01-01 |
SNOW |
0 |
<NA> |
<NA> |
6 |
<NA> |
1890 |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
6139 |
ASN00022006 |
1890-12-03 |
PRCP |
0 |
<NA> |
<NA> |
a |
<NA> |
1890 |
6140 |
ASN00022007 |
1890-12-03 |
PRCP |
0 |
<NA> |
<NA> |
a |
<NA> |
1890 |
6141 |
ASN00022008 |
1890-12-03 |
PRCP |
0 |
<NA> |
<NA> |
a |
<NA> |
1890 |
6142 |
ASN00022009 |
1890-12-03 |
PRCP |
0 |
<NA> |
<NA> |
a |
<NA> |
1890 |
6143 |
ASN00022011 |
1890-12-03 |
PRCP |
0 |
<NA> |
<NA> |
a |
<NA> |
1890 |
1276246 rows × 9 columns