18 - QuickSight¶
For this tutorial we will use the public AWS COVID-19 data lake.
References:
Please, install the CloudFormation template above to have access to the public data lake.
P.S. To be able to access the public data lake, you must allow explicitly QuickSight to access the related external bucket.
[1]:
from time import sleep
import awswrangler as wr
List users of QuickSight account
[2]:
[{"username": user["UserName"], "role": user["Role"]} for user in wr.quicksight.list_users("default")]
[2]:
[{'username': 'dev', 'role': 'ADMIN'}]
[3]:
wr.catalog.databases()
[3]:
Database | Description | |
---|---|---|
0 | aws_sdk_pandas | AWS SDK for pandas Test Arena - Glue Database |
1 | awswrangler_test | |
2 | covid-19 | |
3 | default | Default Hive database |
[4]:
wr.catalog.tables(database="covid-19")
[4]:
Database | Table | Description | Columns | Partitions | |
---|---|---|---|---|---|
0 | covid-19 | alleninstitute_comprehend_medical | Comprehend Medical results run against Allen I... | paper_id, date, dx_name, test_name, procedure_... | |
1 | covid-19 | alleninstitute_metadata | Metadata on papers pulled from the Allen Insti... | cord_uid, sha, source_x, title, doi, pmcid, pu... | |
2 | covid-19 | country_codes | Lookup table for country codes | country, alpha-2 code, alpha-3 code, numeric c... | |
3 | covid-19 | county_populations | Lookup table for population for each county ba... | id, id2, county, state, population estimate 2018 | |
4 | covid-19 | covid_knowledge_graph_edges | AWS Knowledge Graph for COVID-19 data | id, label, from, to, score | |
5 | covid-19 | covid_knowledge_graph_nodes_author | AWS Knowledge Graph for COVID-19 data | id, label, first, last, full_name | |
6 | covid-19 | covid_knowledge_graph_nodes_concept | AWS Knowledge Graph for COVID-19 data | id, label, entity, concept | |
7 | covid-19 | covid_knowledge_graph_nodes_institution | AWS Knowledge Graph for COVID-19 data | id, label, institution, country, settlement | |
8 | covid-19 | covid_knowledge_graph_nodes_paper | AWS Knowledge Graph for COVID-19 data | id, label, doi, sha_code, publish_time, source... | |
9 | covid-19 | covid_knowledge_graph_nodes_topic | AWS Knowledge Graph for COVID-19 data | id, label, topic, topic_num | |
10 | covid-19 | covid_testing_states_daily | USA total test daily trend by state. Sourced ... | date, state, positive, negative, pending, hosp... | |
11 | covid-19 | covid_testing_us_daily | USA total test daily trend. Sourced from covi... | date, states, positive, negative, posneg, pend... | |
12 | covid-19 | covid_testing_us_total | USA total tests. Sourced from covidtracking.c... | positive, negative, posneg, hospitalized, deat... | |
13 | covid-19 | covidcast_data | CMU Delphi's COVID-19 Surveillance Data | data_source, signal, geo_type, time_value, geo... | |
14 | covid-19 | covidcast_metadata | CMU Delphi's COVID-19 Surveillance Metadata | data_source, signal, time_type, geo_type, min_... | |
15 | covid-19 | enigma_jhu | Johns Hopkins University Consolidated data on ... | fips, admin2, province_state, country_region, ... | |
16 | covid-19 | enigma_jhu_timeseries | Johns Hopkins University data on COVID-19 case... | uid, fips, iso2, iso3, code3, admin2, latitude... | |
17 | covid-19 | hospital_beds | Data on hospital beds and their utilization in... | objectid, hospital_name, hospital_type, hq_add... | |
18 | covid-19 | nytimes_counties | Data on COVID-19 cases from NY Times at US cou... | date, county, state, fips, cases, deaths | |
19 | covid-19 | nytimes_states | Data on COVID-19 cases from NY Times at US sta... | date, state, fips, cases, deaths | |
20 | covid-19 | prediction_models_county_predictions | County-level Predictions Data. Sourced from Yu... | countyfips, countyname, statename, severity_co... | |
21 | covid-19 | prediction_models_severity_index | Severity Index models. Sourced from Yu Group a... | severity_1-day, severity_2-day, severity_3-day... | |
22 | covid-19 | tableau_covid_datahub | COVID-19 data that has been gathered and unifi... | country_short_name, country_alpha_3_code, coun... | |
23 | covid-19 | tableau_jhu | Johns Hopkins University data on COVID-19 case... | case_type, cases, difference, date, country_re... | |
24 | covid-19 | us_state_abbreviations | Lookup table for US state abbreviations | state, abbreviation | |
25 | covid-19 | world_cases_deaths_testing | Data on confirmed cases, deaths, and testing. ... | iso_code, location, date, total_cases, new_cas... |
Create data source of QuickSight Note: data source stores the connection information.
[5]:
wr.quicksight.create_athena_data_source(
name="covid-19",
workgroup="primary",
allowed_to_manage={"users": ["dev"]},
)
[6]:
wr.catalog.tables(database="covid-19", name_contains="nyt")
[6]:
Database | Table | Description | Columns | Partitions | |
---|---|---|---|---|---|
0 | covid-19 | nytimes_counties | Data on COVID-19 cases from NY Times at US cou... | date, county, state, fips, cases, deaths | |
1 | covid-19 | nytimes_states | Data on COVID-19 cases from NY Times at US sta... | date, state, fips, cases, deaths |
[7]:
wr.athena.read_sql_query("SELECT * FROM nytimes_counties limit 10", database="covid-19", ctas_approach=False)
[7]:
date | county | state | fips | cases | deaths | |
---|---|---|---|---|---|---|
0 | 2020-01-21 | Snohomish | Washington | 53061 | 1 | 0 |
1 | 2020-01-22 | Snohomish | Washington | 53061 | 1 | 0 |
2 | 2020-01-23 | Snohomish | Washington | 53061 | 1 | 0 |
3 | 2020-01-24 | Cook | Illinois | 17031 | 1 | 0 |
4 | 2020-01-24 | Snohomish | Washington | 53061 | 1 | 0 |
5 | 2020-01-25 | Orange | California | 06059 | 1 | 0 |
6 | 2020-01-25 | Cook | Illinois | 17031 | 1 | 0 |
7 | 2020-01-25 | Snohomish | Washington | 53061 | 1 | 0 |
8 | 2020-01-26 | Maricopa | Arizona | 04013 | 1 | 0 |
9 | 2020-01-26 | Los Angeles | California | 06037 | 1 | 0 |
[8]:
sql = """
SELECT
j.*,
co.Population,
co.county AS county2,
hb.*
FROM
(
SELECT
date,
county,
state,
fips,
cases as confirmed,
deaths
FROM "covid-19".nytimes_counties
) j
LEFT OUTER JOIN (
SELECT
DISTINCT county,
state,
"population estimate 2018" AS Population
FROM
"covid-19".county_populations
WHERE
state IN (
SELECT
DISTINCT state
FROM
"covid-19".nytimes_counties
)
AND county IN (
SELECT
DISTINCT county as county
FROM "covid-19".nytimes_counties
)
) co ON co.county = j.county
AND co.state = j.state
LEFT OUTER JOIN (
SELECT
count(objectid) as Hospital,
fips as hospital_fips,
sum(num_licensed_beds) as licensed_beds,
sum(num_staffed_beds) as staffed_beds,
sum(num_icu_beds) as icu_beds,
avg(bed_utilization) as bed_utilization,
sum(
potential_increase_in_bed_capac
) as potential_increase_bed_capacity
FROM "covid-19".hospital_beds
WHERE
fips in (
SELECT
DISTINCT fips
FROM
"covid-19".nytimes_counties
)
GROUP BY
2
) hb ON hb.hospital_fips = j.fips
"""
wr.athena.read_sql_query(sql, database="covid-19", ctas_approach=False)
[8]:
date | county | state | fips | confirmed | deaths | population | county2 | Hospital | hospital_fips | licensed_beds | staffed_beds | icu_beds | bed_utilization | potential_increase_bed_capacity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2020-04-12 | Park | Montana | 30067 | 7 | 0 | 16736 | Park | 0 | 30067 | 25 | 25 | 4 | 0.432548 | 0 |
1 | 2020-04-12 | Ravalli | Montana | 30081 | 3 | 0 | 43172 | Ravalli | 0 | 30081 | 25 | 25 | 5 | 0.567781 | 0 |
2 | 2020-04-12 | Silver Bow | Montana | 30093 | 11 | 0 | 34993 | Silver Bow | 0 | 30093 | 98 | 71 | 11 | 0.551457 | 27 |
3 | 2020-04-12 | Clay | Nebraska | 31035 | 2 | 0 | 6214 | Clay | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | <NA> |
4 | 2020-04-12 | Cuming | Nebraska | 31039 | 2 | 0 | 8940 | Cuming | 0 | 31039 | 25 | 25 | 4 | 0.204493 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
227684 | 2020-06-11 | Hockley | Texas | 48219 | 28 | 1 | 22980 | Hockley | 0 | 48219 | 48 | 48 | 8 | 0.120605 | 0 |
227685 | 2020-06-11 | Hudspeth | Texas | 48229 | 11 | 0 | 4795 | Hudspeth | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | <NA> |
227686 | 2020-06-11 | Jones | Texas | 48253 | 633 | 0 | 19817 | Jones | 0 | 48253 | 45 | 7 | 1 | 0.718591 | 38 |
227687 | 2020-06-11 | La Salle | Texas | 48283 | 4 | 0 | 7531 | La Salle | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | <NA> |
227688 | 2020-06-11 | Limestone | Texas | 48293 | 36 | 1 | 23519 | Limestone | 0 | 48293 | 78 | 69 | 9 | 0.163940 | 9 |
227689 rows × 15 columns
Create Dataset with custom SQL option
[9]:
wr.quicksight.create_athena_dataset(
name="covid19-nytimes-usa",
sql=sql,
sql_name="CustomSQL",
data_source_name="covid-19",
import_mode="SPICE",
allowed_to_manage={"users": ["dev"]},
)
[10]:
ingestion_id = wr.quicksight.create_ingestion("covid19-nytimes-usa")
Wait ingestion
[11]:
while wr.quicksight.describe_ingestion(ingestion_id=ingestion_id, dataset_name="covid19-nytimes-usa")[
"IngestionStatus"
] not in ["COMPLETED", "FAILED"]:
sleep(1)
Describe last ingestion
[12]:
wr.quicksight.describe_ingestion(ingestion_id=ingestion_id, dataset_name="covid19-nytimes-usa")["RowInfo"]
[12]:
{'RowsIngested': 227689, 'RowsDropped': 0}
List all ingestions
[13]:
[
{"time": user["CreatedTime"], "source": user["RequestSource"]}
for user in wr.quicksight.list_ingestions("covid19-nytimes-usa")
]
[13]:
[{'time': datetime.datetime(2020, 6, 12, 15, 13, 46, 996000, tzinfo=tzlocal()),
'source': 'MANUAL'},
{'time': datetime.datetime(2020, 6, 12, 15, 13, 42, 344000, tzinfo=tzlocal()),
'source': 'MANUAL'}]
Create new dataset from a table directly
[14]:
wr.quicksight.create_athena_dataset(
name="covid-19-tableau_jhu",
table="tableau_jhu",
data_source_name="covid-19",
database="covid-19",
import_mode="DIRECT_QUERY",
rename_columns={"cases": "Count_of_Cases", "combined_key": "County"},
cast_columns_types={"Count_of_Cases": "INTEGER"},
tag_columns={"combined_key": [{"ColumnGeographicRole": "COUNTY"}]},
allowed_to_manage={"users": ["dev"]},
)
Cleaning up
[15]:
wr.quicksight.delete_data_source("covid-19")
wr.quicksight.delete_dataset("covid19-nytimes-usa")
wr.quicksight.delete_dataset("covid-19-tableau_jhu")