
Udacity Data Engineering Capstone Project

  • Post published: July 4, 2020
  • Post category: Data Engineering / Machine Learning

Project Summary

The project follows these steps:

Step 1: Scope the Project and Gather Data

Step 2: Explore and Assess the Data

Step 3: Define the Data Model

Step 4: Run ETL to Model the Data

Step 5: Complete Project Write Up

The project is one provided by Udacity to showcase what the student has learned throughout the program. Four datasets, listed below, are used to complete the project.

  • I94 Immigration Sample Data: sample data from the US National Tourism and Trade Office. This table is used as the fact table in this project.
  • World Temperature Data (world_temperature): temperature data for various cities from the 1700s to 2013, from Kaggle. This table is not used because the data is only available up to 2013.
  • U.S. City Demographic Data (us-cities-demographics): population details of all US cities and census-designated places, including gender and race information, from OpenSoft. The table is grouped by state to get aggregated statistics.
  • Airport Codes: a simple table of airport codes and corresponding cities. Only the rows where IATA codes are available are selected for this project.

The project builds a data lake using PySpark that can support the analytics department of the US immigration department in querying information extracted from all the sources. The conceptual data model is a factless, transactional star schema with dimension tables. Examples of the information that can be queried from the data model include the number of visitors by nationality, visitors' main country of residence, their demographics, and flight information. Python is the main language used to complete the project, and the libraries used to perform the ETL are pandas, PyArrow and PySpark. The environment used is the workspace provided by Udacity. Immigration data was transformed from SAS format to parquet format using PySpark. These parquet files were ingested using PyArrow and explored using pandas to gain an understanding of the data before building a conceptual data model; a sketch of this ingestion step follows the notebook list below. PySpark was then used to build the ETL pipeline. The data sources provided were cleaned and transformed to create new features, and the resulting data tables were saved as parquet files. The two notebooks with all the code and output are as follows:

1. exploringUsingPandas.ipynb

2. exploringUsingPyspark.ipynb
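Below is a minimal sketch of the ingestion and exploration step described above, assuming the parquet files written by PySpark live in a local sas_data folder (the path and the column name are assumptions):

import pandas as pd
import pyarrow.parquet as pq

# Ingest the parquet dataset with PyArrow, then hand it to pandas.
table = pq.read_table("sas_data")
df = table.to_pandas()

print(df.shape)                                  # roughly 3.1 million rows for April 2016
print(df["gender"].value_counts(dropna=False))   # e.g. inspect the gender distribution
print(df.isnull().sum().sort_values(ascending=False).head(10))  # columns with the most missing values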

Describe and Gather Data

Immigration Data

“Form I-94, the Arrival-Departure Record Card, is a form used by the U.S. Customs and Border Protection (CBP) intended to keep track of the arrival and departure to/from the United States of people who are not United States citizens or lawful permanent residents (with the exception of those who are entering using the Visa Waiver Program or Compact of Free Association, using Border Crossing Cards, re-entering via automatic visa revalidation, or entering temporarily as crew members)” ( https://en.wikipedia.org/wiki/Form_I-94 ). It lists the traveler’s immigration category, port of entry, date of entry into the United States, status expiration date, and a unique 11-digit identifying number assigned to it. Its purpose is to record the traveler’s lawful admission to the United States ( https://i94.cbp.dhs.gov/I94/ ).

This is the main dataset, and there is a file for each month of the year 2016 available in the directory ../../data/18-83510-I94-Data-2016/ . It is in the SAS binary database storage format sas7bdat. This project uses the parquet files available in the workspace, in the folder called sas_data. The data is for the month of April 2016, which has more than three million records (3,096,313). The fact table is derived from this table.

World Temperature Data

The data is from a newer compilation put together by Berkeley Earth, which is affiliated with Lawrence Berkeley National Laboratory. The Berkeley Earth Surface Temperature Study combines 1.6 billion temperature reports from 16 pre-existing archives. It is nicely packaged and allows for slicing into interesting subsets (for example by country). They publish the source data and the code for the transformations they applied. The original dataset from Kaggle includes several files ( https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data ), but for this project only GlobalLandTemperaturesByCity was analyzed. The dataset covers a long period of the world’s temperature (from 1743 to 2013). However, since the immigration dataset only has data for the year 2016, the vast majority of the data here is not suitable.

Airports Data

Airport data includes the IATA airport code. “An IATA airport code, also known as an IATA location identifier, IATA station code or simply a location identifier, is a three-letter geocode designating many airports and metropolitan areas around the world, defined by the International Air Transport Association (IATA). IATA codes are used in passenger reservation, ticketing and baggage-handling systems” ( https://en.wikipedia.org/wiki/IATA_airport_code ). It was downloaded from a public domain source ( http://ourairports.com/data/ ).

U.S. City Demographic Data

This dataset contains information about the demographics of all US cities and census-designated places with a population greater than or equal to 65,000. This data comes from the US Census Bureau’s 2015 American Community Survey. This product uses the Census Bureau Data API but is not endorsed or certified by the Census Bureau. The US City Demographics dataset is the source of the STATE dimension in the data model and is grouped by state.

Explore the Data

exploringUsingPandas.ipynb shows the workings to assess and explore the data. The main findings and the necessary cleaning steps are as follows:

  • The dataset covers the 30 days of April 2016.
  • Most people used air as the mode of travel. Some people do not report their mode of transport.
  • Males immigrated more than females.
  • i94 has missing values. These rows need to be dropped.
  • There are no duplicate gender and address values for each cicid.
  • Immigration was to 243 different cities across multiple states.
  • Immigration was from 229 different cities.
  • For some records the departure date is earlier than the arrival date; these visitors are presumably still in the country.
  • airline and fltno are also missing in some rows, where the mode of transport was not air.
  • The I-94 form supports O, M and F gender values; null values are considered invalid.
  • Some arrival and departure records don’t have a matching flag (matflag).
  • The minimum age is -3; only records with age greater than zero are selected.
  • The dates are stored in SAS date format, a value that represents the number of days between January 1, 1960 and the specified date. We need to convert the dates in the dataframe to string dates in the pattern YYYY-MM-DD (see the sketch after this list).
  • insnum can be dropped as it applies to US residents or citizens.
  • count, dtadfile, admnum, i94res, dtaddto, occup and visapost can be dropped as they do not provide any extra information or have many missing values.
  • The demographics dataset does not have many missing values but has data for only 48 states.
  • Most of the iata_code values are missing. Almost 50% of local codes are also missing.
  • Select only US airports where IATA codes are available and the airport type is large, medium or small.
  • Extract ISO regions and drop the continent column.
  • Rename the columns of the dataset to more meaningful names.
  • Convert the data types of the columns.
  • Remove city and race from the demographics data.
  • Group the data to provide aggregated statistics per US state.
  • Drop duplicates.
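A minimal sketch of the SAS-date conversion mentioned above, assuming a Spark dataframe df with the I94 columns arrdate and depdate:

from pyspark.sql import functions as F

# SAS dates count days from 1960-01-01; add that offset to the epoch,
# then format the result as a YYYY-MM-DD string.
def sas_to_iso(col_name):
    return F.date_format(
        F.expr(f"date_add('1960-01-01', cast({col_name} as int))"), "yyyy-MM-dd")

df = df.withColumn("arrdate", sas_to_iso("arrdate")) \
       .withColumn("depdate", sas_to_iso("depdate"))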

3.1 Conceptual Data Model

Map out the conceptual data model and explain why you chose that model

For this project, a star schema is deployed in a relational database management system as dimensional structures. Star schemas characteristically consist of fact tables linked to associated dimension tables via primary/foreign key relationships.

3.2 Mapping Out Data Pipelines

The project involved four key decisions during the design of a dimensional model:

  • Select the business process.

The business process for the immigration department is to allow valid visitors into the country. The process generates events and captures performance metrics that translate into facts in a fact table.

  • Declare the grain.

The grain establishes exactly what a single fact table row represents. In this project, a record is created as the event of a visitor entering the USA occurs. The grain is declared before choosing the fact and dimension tables and becomes a binding contract on the design. This ensures uniformity across all dimensional designs and is critical to BI application performance and ease of use.

  • Identify the dimensions.

Dimension tables provide the “who, what, where, when, why, and how” context surrounding a business process event. Dimension tables contain the descriptive attributes used by BI applications for filtering and grouping the facts. In this project, a dimension is single valued when associated with a given fact row. Every dimension table has a single primary key column. This primary key is embedded as a foreign key in the associated fact table where the dimension row’s descriptive context is exactly correct for that fact table row.

Dimension tables are wide, flat denormalized tables with many low cardinality text attributes. It is designed with one column serving as a unique primary key. This primary key is not the operational system’s natural key because there will be multiple dimension rows for that natural key when changes are tracked over time. These surrogate keys are simple integers, assigned in sequence. The tables also denormalize the many-to-one fixed depth hierarchies into separate attributes on a flattened dimension row. Dimension denormalization supports dimensional modeling’s twin objectives of simplicity and speed.

  • Identify the facts

The fact table focuses on the results of a single business process. A single fact table row has a one-to-one relationship to a measurement event as described by the fact table’s grain. Thus a fact table design is entirely based on a physical activity and is not influenced by the demands of a particular report. Within a fact table, only facts consistent with the declared grain are allowed. In this project, the information about the visitor is the fact. The fact table is transactional, with each row corresponding to a measurement event at a point in space and time. It is also a factless fact table, as the event merely records a set of dimensional entities coming together at a moment in time. Factless fact tables can also be used to analyze what didn’t happen. These queries always have two parts: a factless coverage table that contains all the possibilities of events that might happen and an activity table that contains the events that did happen. When the activity is subtracted from the coverage, the result is the set of events that did not happen. Each row corresponds to an event. The fact table contains foreign keys for each of its associated dimensions, as well as date stamps. Fact tables are the primary target of computations and dynamic aggregations arising from queries.

( http://www.kimballgroup.com/wp-content/uploads/2013/08/2013.09-Kimball-Dimensional-Modeling-Techniques11.pdf )

Step 4: Run Pipelines to Model the Data

4.1 Create the Data Model

Build the data pipelines to create the data model.

4.2 Data Quality Checks

Explain the data quality checks you’ll perform to ensure the pipeline ran as expected. These could include:

  • Integrity constraints on the relational database (e.g., unique key, data type, etc.)
  • Unit tests for the scripts to ensure they are doing the right thing
  • Source/Count checks to ensure completeness

exploringUsingPyspark.ipynb contains the workings for tasks 4.1 and 4.2.

4.3 Data Dictionary

datadictionary.md contains the data model.

  • The data was increased by 100x: 1. Use Redshift ( https://aws.amazon.com/redshift/ ), which allows querying petabytes of structured and semi-structured data across the data warehouse. 2. Use Cassandra ( http://cassandra.apache.org/ ), which offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low-latency operations for all clients.
  • The data populates a dashboard that must be updated on a daily basis by 7am every day: 1. For small datasets, a cron job will be sufficient. 2. Use Airflow ( https://airflow.apache.org/docs/stable/macros.html ).
  • The database needed to be accessed by 100+ people: 1. Use Redshift with auto-scaling capabilities and good read performance. 2. Use Cassandra with pre-defined indexes to optimize read queries. 3. Use Elastic MapReduce ( https://aws.amazon.com/emr/ ), which allows provisioning one, hundreds, or thousands of compute instances to process data at any scale.


In this long post I present the project I developed for Udacity's Data Engineering Nanodegree (DEND). What to develop was left to the developer's free choice, provided certain criteria were met, for example working with a database of at least 3 million records.

This is the first notebook of the project; the second contains examples of queries that can be run on the data lake.

Data Lake with Apache Spark

Data Engineering Capstone Project

Project Summary

The Organization for Tourism Development ( OTD ) wants to analyze migration flows in the USA, in order to find insights to significantly and sustainably develop tourism in the USA.

To support their core idea they have identified a set of analysis/queries they want to run on the raw data available.

The project deals with building a data pipeline, to go from raw data to the data insights on the migration flux.

The raw data are gathered from different sources, saved in files and made available for download.

The project shows the execution and decisional flow, specifically:

  • Describe the data and how they have been obtained
  • Answer the question “how to achieve the target?”
  • What infrastructure (storage, computation, communication) has been used and why
  • Explore the data
  • Check the data for issues, for example null, NaN, or other inconsistencies
  • Why this data model has been chosen
  • How it is implemented
  • Load the data from S3 into the SQL database, if any
  • Perform quality checks on the database
  • Perform example queries
  • Documentation of the project
  • Possible scenario extensions
  • 1. Scope of the Project
  • 1.1 What data
  • 1.2 What tools
  • 1.3 The I94 immigration data
  • 1.3.1 What is an I94?
  • 1.3.2 The I94 dataset
  • 1.3.3 The SAS date format
  • 1.3.4 Loading I94 SAS data
  • 1.4 World Temperature Data
  • 1.5 Airport Code Table
  • 1.6 U.S. City Demographic Data
  • 2. Data Exploration
  • 2.1 The I94 dataset
  • 2.2 I94 SAS data load
  • 2.3 Explore I94 data
  • 2.4 Cleaning the I94 dataset
  • 2.5 Store I94 data as parquet
  • 2.6 Airport codes dataset: load, clean, save
  • 3. The Data Model
  • 3.1 Mapping Out Data Pipelines
  • 4. Run Pipelines to Model the Data
  • 4.1 Provision the AWS S3 infrastructure
  • 4.2 Transfer raw data to S3 bucket
  • 4.3 EMR cluster on EC2
  • 4.3.1 Provision the EMR cluster
  • 4.3.2 Coded fields: I94CIT and I94RES
  • 4.3.3 Coded field: I94PORT
  • 4.3.4 Data cleaning
  • 4.3.5 Save clean data (parquet/json) to S3
  • 4.3.6 Loading, cleaning and saving airport codes
  • 4.4 Querying data on-the-fly
  • 4.5 Querying data using the SQL querying style
  • 4.6 Data Quality Checks
  • Lesson learned

1. Scope of the Project

The OTD wants to run pre-defined queries on the data on a periodic schedule.

They also want to maintain the flexibility to run different queries on the data, using BI tools connected to an SQL-like database.

The core data is the dataset provided by US government agencies, containing the requests for access to the USA (I94 forms).

They also have other, lower-value data available that are not part of the core analysis; their use is unclear, so they are stored in the data lake for possible future use.

1.1 What data

The following datasets are used in the project:

  • I94 immigration data for the year 2016, used for the main analysis
  • World Temperature Data
  • Airport Code Table
  • U.S. City Demographic Data

1.2 What tools

Because the analysis is not time-critical (monthly or weekly batches), the choice fell on a cheaper S3-based data lake with on-demand, on-the-fly analytical capability: an EMR cluster with Apache Spark, and optionally Apache Airflow for scheduled execution (not implemented here).

The architecture shown below has been implemented.

[Figure: architecture]

  • Starting from a common storage solution (currently the Udacity workspace) where both the OTD and its partners have access, the data is ingested into an S3 bucket in raw format
  • To ease future operations, the data is immediately processed, validated and cleansed using a Spark cluster and stored into S3 in parquet format. Raw and parquet data formats coexist in the data lake.
  • By default, the project doesn't use a costly Redshift cluster; instead, data are queried in place on the S3 parquet data.
  • The EMR cluster serves the analytical needs of the project. SQL-based queries are performed using Spark SQL directly on the S3 parquet data
  • A Spark job can be triggered monthly, using the parquet data. The data is aggregated to gain insights on the evolution of the migration flows

1.3 The I94 immigration data

The data are provided by the US National Tourism and Trade Office. It is a collection of all I-94 forms filed in 2016.

1.3.1 What is an I94?

To give some context, it is useful to explain what an I-94 form is.

From the government website : “The I-94 is the Arrival/Departure Record, in either paper or electronic format, issued by a Customs and Border Protection (CBP) Officer to foreign visitors entering the United States.”

1.3.2 The I94 dataset

Each record contains these fields:

  • CICID, unique number of the file
  • I94YR, 4 digit year of the application
  • I94MON, Numeric month of the application
  • I94CIT, country where the applicant was born
  • I94RES, country where the applicant is resident
  • I94PORT, location (port) where the application is issued
  • ARRDATE, arrival date in USA in SAS date format
  • I94MODE, how the applicant arrived in the USA
  • I94ADDR, US state where the port is
  • DEPDATE is the Departure Date from the USA
  • I94BIR, age of applicant in years
  • I94VISA, what kind of VISA
  • COUNT, used for summary statistics, always 1
  • DTADFILE, date added to I-94 Files
  • VISAPOST, Department of State office where the visa was issued
  • OCCUP, occupation that will be performed in U.S.
  • ENTDEPA, arrival Flag
  • ENTDEPD, departure Flag
  • ENTDEPU, update Flag
  • MATFLAG, match flag
  • BIRYEAR, 4 digit year of birth
  • DTADDTO, date to which admitted to U.S. (allowed to stay until)
  • GENDER, non-immigrant sex
  • INSNUM, INS number
  • AIRLINE, airline used to arrive in USA
  • ADMNUM, admission Number
  • FLTNO, flight number of Airline used to arrive in USA
  • VISATYPE, class of admission legally admitting the non-immigrant to temporarily stay in USA

More details in the file I94_SAS_Labels_Descriptions.SAS

1.3.3 The SAS date format

A SAS date represents any date D0 as the number of days between D0 and 1 January 1960

1.3.4 Loading I94 SAS data

The package saurfang:spark-sas7bdat:2.0.0-s_2.11 and its dependency parso-2.0.8 are needed to read the SAS data format.

To load them, use the config option spark.jars and give the URL of the repositories, as Spark itself wasn't able to resolve the dependencies.
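A minimal sketch of that session setup, assuming the two jars are fetched from their public repositories (the exact URLs and the April file name are assumptions):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("I94-data-lake")
         # spark.jars takes a comma-separated list of jar locations
         .config("spark.jars",
                 "https://repos.spark-packages.org/saurfang/spark-sas7bdat/2.0.0-s_2.11/spark-sas7bdat-2.0.0-s_2.11.jar,"
                 "https://repo1.maven.org/maven2/com/epam/parso/2.0.8/parso-2.0.8.jar")
         .getOrCreate())

# The package registers this data source format for sas7bdat files.
df = (spark.read
      .format("com.github.saurfang.sas.spark")
      .load("../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat"))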

1.4 World temperature data

The dataset is from Kaggle. It can be found here.

The dataset contains temperature data:

  • Global Land and Ocean-and-Land Temperatures (GlobalTemperatures.csv)
  • Global Average Land Temperature by Country (GlobalLandTemperaturesByCountry.csv)
  • Global Average Land Temperature by State (GlobalLandTemperaturesByState.csv)
  • Global Land Temperatures By Major City (GlobalLandTemperaturesByMajorCity.csv)
  • Global Land Temperatures By City (GlobalLandTemperaturesByCity.csv)

[Figure: land temp]

1.5 Airport codes data

This is a table of airport codes and information on the corresponding cities, like GPS coordinates, elevation, country, etc. It comes from the Datahub website.

[Figure: airport codes]

1.6 U.S. City Demographic Data

The dataset comes from OpenSoft. It can be found here.

[Figure: us city demo]

2. Data Exploration

In this chapter we proceed to identify data quality issues, like missing values, duplicate data, etc.

The purpose is to map out the steps of the data pipeline that programmatically correct data issues.

In this step we work on local data.

2.1 The I94 dataset

  • How many files are in the I94 dataset?
  • What is the size of the files?

2.2 I94 SAS data load

To read the SAS data format I need to specify the com.github.saurfang.sas.spark format.

  • Let’s see the schema Spark applied on reading the file

Most columns hold categorical data, meaning the information is coded; for example, in I94CIT=101, 101 is the country code for Albania.

Other columns represent integer data.

It is clear that no field needs to be defined as double => let's change those fields to integer

Verifying the schema is correct.

  • convert string columns dtadfile and dtaddto to date type

These fields come in a simple string format. To be able to run time-based queries they are converted to date type

  • convert columns arrdate and depdate from SAS-date format to a timestamp type.

A date in SAS format is simply the number of days between the chosen date and the reference date (01-01-1960)

  • print final schema
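A minimal sketch of the schema fixes in this section, assuming df is the I94 dataframe loaded above (the column lists and the string-date patterns are assumptions):

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Double -> integer for the coded and numeric fields.
for c in ["cicid", "i94yr", "i94mon", "i94cit", "i94res", "i94mode", "i94bir"]:
    df = df.withColumn(c, F.col(c).cast(IntegerType()))

# Simple string dates -> date type (assumed patterns: dtadfile is yyyyMMdd,
# dtaddto is MMddyyyy).
df = df.withColumn("dtadfile", F.to_date("dtadfile", "yyyyMMdd")) \
       .withColumn("dtaddto", F.to_date("dtaddto", "MMddyyyy"))

# SAS dates -> date type: days elapsed since 1960-01-01.
sas_to_date = lambda c: F.expr(f"date_add('1960-01-01', cast({c} as int))")
df = df.withColumn("arrdate", sas_to_date("arrdate")) \
       .withColumn("depdate", sas_to_date("depdate"))

df.printSchema()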

2.3 Explore I94 data

  • How many rows does the I94 database have?
  • Let’s see the gender distribution of the applicants
  • Where are the I94 applicants coming from?

I want to know the 10 most represented nations

The i94res code 135, where the highest number of visitors come from, corresponds to the United Kingdom, as can be read in the accompanying file I94_SAS_Labels_Descriptions.SAS

  • What port registered the highest number of arrivals?

New York City port registered the highest number of arrivals.

2.4 Cleaning the I94 dataset

These are the steps to perform on the I94 database:

  • Identify null and NaN values. Remove duplicates ( quality check ).
  • Find errors in the records ( quality check ), for example dates not in year 2016
  • Count how many NaN values there are in each column, excluding the date-type columns dtadfile , dtaddto , arrdate , depdate , because the isnan function works only on numerical types
  • How many rows of the I94 database have null values?

The number of nulls equals the number of rows. It means there is at least one null on each row of the dataframe.

  • Now we can count how many nulls there are in each column

There are many nulls in many columns.

The question is whether there is a need to correct/fill those nulls.

Looking at the data, it seems like some fields have been left empty for lack of information.

Because these are categorical data there is no use, at this step, in assigning arbitrary values to the nulls.

The nulls are not going to be filled a priori, but only if a specific need comes up.
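A minimal sketch of the null/NaN counting just described, assuming df is the I94 dataframe; date-type columns are skipped because isnan only accepts numeric types:

from pyspark.sql import functions as F
from pyspark.sql.types import NumericType

date_cols = {"dtadfile", "dtaddto", "arrdate", "depdate"}
exprs = []
for field in df.schema.fields:
    if field.name in date_cols:
        continue
    test = F.col(field.name).isNull()
    if isinstance(field.dataType, NumericType):
        test = test | F.isnan(field.name)   # isnan is valid only for numeric columns
    exprs.append(F.count(F.when(test, field.name)).alias(field.name))

df.select(exprs).show()

# Rows that contain at least one null in any column.
print(df.count() - df.dropna(how="any").count())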

  • Are there duplicated rows?

Dropping duplicate rows

Checking if the number changed

No row has been dropped => no duplicated rows

  • Verify that all rows have the i94yr column equal to 2016

This gives confidence in the consistency of the data

2.5 Store I94 data as parquet

I94 data are stored in parquet format in an S3 bucket; they are partitioned using the fields year and month (see the sketch below).
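A one-line sketch of that write, assuming the year and month fields are i94yr and i94mon and using a placeholder bucket name:

df.write.mode("overwrite") \
  .partitionBy("i94yr", "i94mon") \
  .parquet("s3a://my-i94-data-lake/i94/parquet/")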

2.6 The Airport codes dataset

A snippet of the data

How many records?

There are no duplicates

We discover there are some null fields:

The nulls are in these columns:

No action taken to fill the nulls

Finally, let’s save the data in parquet format in our temporary folder mimicking the S3 bucket.

3. The Data Model

The core of the architecture is a data lake , with S3 storage and EMR processing.

The data are stored into S3 in raw and parquet format.

Apache Spark is the tool chosen for analytical tasks; therefore all data are loaded into Spark dataframes using a schema-on-read approach.

For SQL-style queries on the data, Spark temporary views are generated.

3.1 Mapping Out Data Pipelines

  • Provision the AWS S3 infrastructure
  • Transfer data from the common storage to the S3 lake storage
  • Provision an EMR cluster. It runs 2 steps and then auto-terminates. These are the 2 steps: 3.1 run a Spark job to extract codes from the file I94_SAS_Labels_Descriptions.SAS and save them to S3; 3.2 data cleaning: find NaN, null and duplicate values, then save the clean data to parquet files
  • Generate reports using Spark query on S3 parquet data
  • On-the-fly queries with Spark SQL

[Figure: data lineage]

4. Run Pipelines to Model the Data

4.1 Provision the AWS S3 infrastructure

Reading credentials and configuration from file

Create the bucket if it does not exist
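A minimal sketch of this provisioning step, assuming credentials and region are read from config.cfg (the section/key names and the bucket name are placeholders):

import configparser
import boto3

config = configparser.ConfigParser()
config.read("config.cfg")

s3 = boto3.resource(
    "s3",
    region_name=config["AWS"]["REGION"],
    aws_access_key_id=config["AWS"]["KEY"],
    aws_secret_access_key=config["AWS"]["SECRET"],
)

bucket_name = "my-i94-data-lake"
# Create the bucket only if it does not already exist.
if bucket_name not in [b.name for b in s3.buckets.all()]:
    s3.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={"LocationConstraint": config["AWS"]["REGION"]},
    )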

4.2 Transfer raw data to S3 bucket

Transfer the data from the current shared storage (currently the Udacity workspace) to the S3 lake storage.

A naive metadata system is implemented. It uses a JSON file to store basic information on each file added to the S3 bucket (a sketch of such a record follows the list below):

  • file name: file being processed
  • added by: user logged as | aws access id
  • date added: timestamp of date of processing
  • modified on: timestamp of modification time
  • notes: any additional information
  • access granted to (role or policy): admin | anyone | I94 access policy | weather data access policy |
  • expire date: 5 years (default)
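A minimal sketch of one such metadata record, matching the fields listed above (the values are illustrative):

import json
from datetime import datetime, timezone

metadata = {
    "file_name": "i94_apr16_sub.sas7bdat",
    "added_by": "admin | <aws access id>",
    "date_added": datetime.now(timezone.utc).isoformat(),
    "modified_on": datetime.now(timezone.utc).isoformat(),
    "notes": "raw I94 data for April 2016",
    "access_granted_to": "I94 access policy",
    "expire_date": "5 years",
}

# One JSON record is appended per file added to the bucket.
with open("s3_metadata.json", "a") as f:
    f.write(json.dumps(metadata) + "\n")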

These datasets are moved to the S3 lake storage:

  • I94 immigration data
  • airport codes
  • US cities demographics

4.3 EMR cluster on EC2

An EMR cluster on EC2 instances with Apache Spark preinstalled is used to perform the ELT work.

A 3-node cluster of m5.xlarge instances is configured by default in the config.cfg file.

If the performance requires it, the cluster can be scaled up to use more nodes and/or bigger instances.

After the cluster has been created, the steps that execute the Spark cleaning jobs are added to the EMR job flow; the steps live in separate .py files. These steps are added:

  • extract I94res, i94cit, i94port codes
  • save the codes in a json file in S3
  • load I94 raw data from S3
  • change schema
  • data cleaning
  • save parquet data to S3

The cluster is set to auto-terminate by default after executing all the steps.

4.3.1 Provision the EMR cluster

Create the cluster using the code emr_cluster.py [Ref. 3] and emr_cluster_spark_submit.py, and set the steps to execute spark_script_1 and spark_script_2.

These scripts have already been previously uploaded to a dedicated folder in the project’s S3 bucket, and are accessible from the EMR cluster.

The file spark_4_emr_codes_extraction.py contains the code for the following paragraph 4.3.1

The file spark_4_emr_I94_processing.py contains the code for the following paragraphs 4.3.2, 4.3.3, 4.3.4

4.3.2 Coded fields: I94CIT and I94RES

I94CIT, I94RES contain codes indicating the country where the applicant is born (I94CIT), or resident (I94RES).

The data is extracted from I94_SAS_Labels_Descriptions.SAS . This can be done sporadically or every time a change occurs, for example when a new code is added.

The conceptual flow below was implemented.

[Figure: data transform]

The first steps are to define credentials to access S3A and then load the data into a dataframe, in a single row

Find the section of the file where I94CIT and I94RES are specified.

It starts with I94CIT & I94RES and finishes with the semicolon character.

To match the section, it is important to have the complete text in a single row; I did this using the option wholetext=True in the previous dataframe read operation

Now I can split it into a dataframe with multiple rows

I filter the rows with structure \ = \

And then create 2 different columns with code and country

I can finally store the data in a single file in json format
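A condensed sketch of the flow just described (the bucket paths are placeholders, and the regular expressions are assumptions about the label file's layout):

from pyspark.sql import functions as F

# Whole file in a single row, so the section regex can span lines.
raw = spark.read.text(
    "s3a://my-i94-data-lake/raw/I94_SAS_Labels_Descriptions.SAS", wholetext=True)

# Keep the I94CIT & I94RES section, which ends at the first semicolon.
section = raw.select(
    F.regexp_extract("value", r"(?s)I94CIT & I94RES(.*?);", 1).alias("body"))

codes = (section
         .select(F.explode(F.split("body", "\n")).alias("line"))  # one row per line
         .filter(F.col("line").rlike(r"^\s*\d+\s*="))             # rows shaped like code = 'country'
         .select(F.regexp_extract("line", r"(\d+)", 1).cast("int").alias("code"),
                 F.regexp_extract("line", r"'(.*)'", 1).alias("country")))

codes.coalesce(1).write.mode("overwrite").json("s3a://my-i94-data-lake/codes/i94res/")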

4.3.3 Coded field: I94PORT

The I94PORT codes are extracted in a similar way

The complete code for codes extraction is in spark_4_emr_codes_extraction.py

4.3.4 Data cleaning

The cleaning steps have already been shown in section 2; here they are only summarized

  • Load dataset
  • Numeric fields: double to integer
  • Fields dtadfile and dtaddto : string to date
  • Fields arrdate and depdate : sas to date
  • Handle nulls: no fill is set by default
  • Drop duplicates

4.3.5 Save clean data (parquet/json) to S3

The complete code, refactored and modularized, is in spark_4_emr_I94_processing.py

As a side note, saving the test file as parquet takes about 3 minutes on the provisioned cluster. The complete script execution takes 6 minutes.

4.3.6 Loading, cleaning and saving airport codes

4.4 Querying data on-the-fly

The data in the data lake can be queried in place; that is, the Spark cluster on EMR operates directly on the S3 data.

There are two possible ways to query the data:

  • using Spark dataframe functions
  • using SQL on tables

We see examples of both programming styles.

These are some typical queries that are run on the data:

  • For each port, in a given period, how many arrivals are there each day?
  • Where are the I94 applicants coming from, in a given period?
  • In the given period, what port registered the highest number of arrivals?
  • Number of arrivals in a given city for a given period
  • Travelers genders
  • Is there a city where the difference between male and female travelers is higher?
  • Find most visited city (the function)

The queries are collected in the Jupyter notebook Capstone project 1 – Querying the data lake.ipynb

4.5 Querying data using the SQL querying style
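A minimal sketch of this querying style, assuming the cleaned I94 parquet data is registered as a temporary view (the bucket path and date range are placeholders):

i94 = spark.read.parquet("s3a://my-i94-data-lake/i94/parquet/")
i94.createOrReplaceTempView("i94")

# Arrivals per port per day over a given period.
spark.sql("""
    SELECT i94port, arrdate, COUNT(*) AS arrivals
    FROM i94
    WHERE arrdate BETWEEN '2016-04-01' AND '2016-04-07'
    GROUP BY i94port, arrdate
    ORDER BY arrivals DESC
""").show()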

4.6 Data Quality Checks

The query-in-place concept implemented here uses a very short pipeline: data are loaded from S3 and, after a cleaning process, saved as parquet. Quality of the data is guaranteed by design.

5. Write Up

The project has been set up with scalability in mind. All components used, S3 and EMR, offer a high degree of scalability, both horizontal and vertical.

The tool used for the processing, Apache Spark, is the de facto tool for big data processing.

To achieve such a level of scalability we sacrificed processing speed. A data warehouse solution with a Redshift database or an OLAP cube would have been faster at answering the queries. Still, nothing prevents adding a DWH to stage the data in case of a more intensive, real-time-responsive usage of the data.

An important part of an ELT/ETL process is automation. Although it has not been touched here, I believe the code developed here lends itself to automation with reasonably small effort. A tool like Apache Airflow can be used for the purpose.

Scenario extension

  • The data was increased by 100x.

In an increased-data scenario, the EMR hardware needs to be scaled up accordingly. This is done by simply changing the configuration in the config.cfg file. Apache Spark is the tool for big data processing, and it is already used as the project's analytic tool.

  • The data populates a dashboard that must be updated on a daily basis by 7am every day.

In this case an orchestration tool like Apache Airflow is required. A DAG that triggers Python scripts and Spark job executions needs to be scheduled for daily execution at 7am.

The results of the queries for the dashboard can be saved in a file.

  • The database needed to be accessed by 100+ people.

A proper database wasn't used; on the contrary, Amazon S3 is used to store the data and query it in place. S3 is designed with massive scale in mind and is able to handle sudden traffic spikes. Therefore, having the data accessed by many people shouldn't be an issue.

The code used in the project provisions an EMR cluster for any user who plans to run queries. 100+ EMR clusters would probably be expensive for the company; a more efficient sharing of processing resources must be realized.

6. Lessons learned

EMR 5.28.1 uses Python 2 as default

  • As a consequence important Python packages like pandas are not installed by default for Python 3.
  • Install packages for Python 3: python3 -m pip install \

Adding jar packages to Spark

For some reason, adding the packages in the Python program when instantiating the SparkSession doesn't work (error message: package not found).

The packages must be added in the spark-submit:
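A hedged reconstruction of the working invocation, reusing the package coordinates from section 1.3.4 (the script name and the repository URL are assumptions):

spark-submit \
    --packages saurfang:spark-sas7bdat:2.0.0-s_2.11 \
    --repositories https://repos.spark-packages.org \
    spark_4_emr_I94_processing.py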

Debugging Spark on EMR

While everything works locally, it doesn't necessarily mean it is going to work on the EMR cluster. Debugging the code is easier with SSH on EMR.

Reading an S3 file from Python is tricky

While reading with Spark is straightforward (one just needs to give the s3://…. address), from plain Python boto3 must be used.
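A minimal sketch of the boto3 read (the bucket and key are placeholders):

import boto3

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-i94-data-lake", Key="codes/i94res/part-00000.json")
body = obj["Body"].read().decode("utf-8")
print(body[:200])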

Transferring files to S3

During the debugging phase, when the code on S3 must be changed many times, using the web interface is slow and impractical ( permanently delete ). Memorize this command: aws s3 cp <local file> <s3 folder>

Removing the content of a directory from Python

import shutil

dirPath = 'metastore_db'
shutil.rmtree(dirPath)

7. References

  • AWS CLI Command Reference
  • EMR provisioning is based on: Github repo Boto-3 provisioning
  • Boto3 Command Reference


Engineering students’ Capstone solutions on track to real-world impact  

April 23, 2024

UBC Engineering students walk into the 2024 Capstone Showcase under a sunny sky with clouds.

SOE student projects soar at 2024 Capstone Design Showcase  

Engineering students from UBC Okanagan landed at KF Aerospace Centre for Excellence on Friday, April 12 for the 2024 Capstone Design Showcase and Competition, where their innovative ideas were on display for UBCO faculty, staff, industry partners and the public.  

Following the competition, many of these ideas are poised to have a real-world impact in industry and communities.  

This year’s showcase saw 48 design solutions by 283 students evaluated. The popular annual event is the culmination of students’ learning as part of the ENG499 Engineering Capstone Design Project course. In order to be successful, students had to draw upon knowledge and skills gained in courses throughout their undergraduate program, including engineering design, engineering science, project management, communication, and more, with guidance from a faculty advisor.  

This year student teams worked independently on industry-proposed, faculty-proposed, or student-proposed problems, with support from engineering industry professionals and local entrepreneurs, to create tangible, interdisciplinary solutions spanning five key themes:

  • advanced manufacturing,   
  • biocompatible systems,   
  • complex systems,   
  • infrastructure, and  
  • sustainability.  

“We took a slightly different approach to Capstone this year,” explains Dr. Ken Chau. “In organizing projects by themes instead of by disciplines, we encouraged students to think about what they aspire to be, about what their colleagues aspire to be, and then in turn how they can work together across disciplines to come up with innovative solutions to real problems and needs in our society. It was incredibly inspiring to see how students embraced that challenge. The quality of their work is evidence of their highly effective collaboration.”  

Each project team delivered a comprehensive project report and a formal presentation, some including fabricated prototypes, to a distinguished panel of judges comprising industry leaders and esteemed engineering faculty.

The event concluded with an award ceremony emceed by Capstone faculty leads Dr. Ken Chau, Associate Professor of Electrical Engineering, and Dr. Alon Eisenstein, Assistant Professor of Teaching, Technology Entrepreneurship and Professional Development. Students also received congratulations from Kelowna City Councillor and Deputy Mayor Loyal Wooldridge.  

Team 21 – Cable Tray Robot won within the Complex Systems theme and was declared the overall winner of this year's capstone competition

Team 37 – Thrust Cushion Vehicle won within the Advanced Manufacturing theme  

Team 41 – Emergency Cervical Collar for Patients with Abnormal Anatomical Neck Positions won within the Biocompatible Systems theme  

Team 11 – 10080 Chase Road Subdivision Development Feasibility Study won within the Infrastructure theme  

Team 28 – Battery Module for a Solar-Powered Car won within the Sustainability theme  

“Congratulations to all the students – every team worked incredibly hard and yielded insightful, well-thought-out data and applicable solutions. The teams that earned the top prizes demonstrated particularly impressive creativity, a strong grasp of related engineering principles, excellent technical skill, and outstanding collaboration and communication, all things that will serve them well in their futures,” said Dr. Alon Eisenstein.   

“On behalf of everyone at the School of Engineering, we are deeply grateful to everyone who makes the Capstone course and showcase so rewarding for students. Thank you to our industry partners, faculty advisors, and event organizers, as well as to the family, friends and community members who joined us to recognize students for their outstanding achievements.”   

2024 Capstone Showcase top overall team members hold their certificates.

This year’s top overall team and winner of the complex systems category were elated to hear their names called after working right up until the wire.   

The team comprised students Kieran McIntosh, Daniel Holmes, Tanner Boutin, James Flood, Jackie Zhou, and Everett Douglas; they were supported by Faculty Advisor Dr. Klaske van Heusden. The team's solution, a cable tray robot, is designed to help reduce human effort and the cost of placing electrical cables in industrial facilities.

“Placement and organization of electrical cables into cable trays is a physically demanding and expensive task for electricians,” explains the team. “We have designed a novel way to increase the efficiency and affordability of this task via the cable tray robot, which can pull cable along the length of the tray.”

“We’re proud and elated,” said the team. “We finished our prototype the day before the competition. We worked very hard to refine it right up until the event. It feels great that our teamwork paid off like this.”   

You can view a photo gallery from the event and watch recorded video of the award presentation.   

More information about the ENG 499 Capstone course is available on the School of Engineering's Capstone page.

College of Engineering

Surgical Tool, Airport Navigation Aid Top Spring 2024 Capstone Expo

Projects that could help doctors save lives and restore independence for visually impaired travelers impress judges at semester-ending showcase.

A device that could save lives when things go wrong during surgery and a wearable to ease airport navigation for visually impaired people shared top honors at the Spring 2024 Capstone Design Expo April 23.

For the second Expo in a row, two teams tied in judging for the best overall project: Team Seekr with their navigation aid and Team Left Atrial Files with a catheter-based tool that quickly and safely retrieves a dislodged medical device in a patient’s heart. 

They were just two of 204 teams from 12 schools and three different Georgia Tech colleges at the Expo, a showcase of students’ semester-long senior design projects. An army of judges from industry and across campus selected the top project in each discipline, best interdisciplinary team, and overall best project. 

The five biomedical engineers on the Left Atrial Files team created a tool they called EmPath to recapture a dislodged clot-blocking device that’s implanted in the heart. 

In some patients with an irregular heartbeat called atrial fibrillation, doctors place a small device in a pocket of their heart where blood clots often form. Those clots can leave the heart and cause strokes. But sometimes, after doctors insert the device, it gets loose and endangers the patient’s life. 

The EmPath tool uses a small net to capture it and a claw to quickly pull it out through a catheter.

It’s a project that was personal for several team members who’ve lost family members to strokes and related complications, including Emily Yan.

“My grandfather suffered from a stroke. And we believe it's one of the major reasons that he passed away,” Yan said. “We wanted to make something that would be able to save someone's life in a situation where they can't take blood thinners and they have to go under this procedure. If something goes wrong, there’s something to save them.”

Yan said the team has filed a provisional patent on their device alongside their sponsor, physician Kevin Graham.

The other top project aimed to restore independence to blind and visually impaired airline passengers. Team members found that travelers with limited vision often use video calls with friends or relatives to help them navigate. Others depend on airport personnel pushing them to the gate in a wheelchair.

Their solution is a device the size of a crossbody bag that passengers could borrow from the airport and turn in as they board their flight. It offers voice directions and beeps to help them make their way through the terminal.

A look at more projects and teams from the spring Expo.

Team Seekr created a wearable device to help visually impaired travelers walk through airports.

Team EMG Controlled Wheelchair created technology that controls a wheelchair with arm movements.

MigraGuard created an ear plug that reduces the likelihood of migraine headaches during storms.

GloSCOPE designed an affordable laparoscope to increase accessibility in low-to-middle income countries.

“Complaints to ADA coordinators [at airports] have been up 167% since 2020, so airports are really looking for a solution,” said team member Aislinn Abbott, a mechanical engineering major.

The team of mechanical and computer engineering and computer science students has tested their concept in large venues with nearly a dozen people with visual impairments. Airport testing is the next step, and the team is in conversation with several, including Atlanta’s. This summer, they’ll participate in the CREATE-X Startup Launch program to continue developing the idea and potentially turn it into a commercial venture.

Elsewhere at the Expo, teams worked with Georgia Power to use historical power outage data to more effectively develop and communicate power restoration estimates to customers during storms. They helped develop models for the Georgia Tech football team's recruitment efforts and created a wearable device to help ease pain for migraine sufferers.

Students created a robotic boxer that can throw punches to help humans train.

A device created by Team Well Done detects the doneness of steak without puncturing it.

Computer engineers devised a system that uses face identification software to lock and unlock house doors.

Other projects included: 

  • A low-cost motorized wheelchair system controlled by electromyography — the electrical activity of muscles. 
  • A team that worked with restor3d to create a new retractor to give doctors better access during total shoulder replacement surgery and improve patient outcomes.
  • A vibration absorber to help Volvo improve construction equipment energy efficiency .
  • Two devices to help semiconductor manufacturer KLA keep cleanrooms clean .
  • Tools to help clothing manufacturer Crystal sas automate sock packaging .
  • An automated packaging line to help American Baitworks speed up packing their fishing baits.
  • An under-the-counter system to wirelessly power small kitchen appliances .
  • A beach-cleaning robot to remove litter while filtering out and leaving sand.

The spring Expo also welcomed more than 200 high school students to inspire them to explore science, technology, engineering, and math — and also to serve as reviewers. They picked two projects for special honorable mentions: an idea for improving battery recycling and an autonomous drone design that can follow the terrain and fly low to the ground.

The Expo also included more than 70 industry sponsors. Their donations support Transforming Tomorrow: The Campaign for Georgia Tech , a more than $2 billion comprehensive campaign designed to secure resources that will advance the Institute and its impact — on people’s lives, on the way we work together to create innovative solutions, and on our world — for decades to come.

See all the winners from Capstone Expo below and visit expo.gatech.edu for more projects.

Team RCT created a control system that allows roller coaster model enthusiasts to learn more about complex electrical and computer systems used on real rides.

Biomedical engineering students aimed to decrease the time it takes to set up wound dressing systems by automating a currently manual process.

A team of mechanical engineers built a robotic system that builds floral arrangements.

My Tab is an automated bartender that allows busy bars and venues to serve more customers.

Atlanta Mayor and Georgia Tech engineering graduate Andre Dickens congratulates the 2024 Capstone Design Expo participants.

Capstone Results

Left Atrial Files

OVERALL BEST PROJECT (TIE)

Left atrial appendage closure device retrieval 

  • Mitali Gupte, BME (North Andover, MA)
  • Santosh Nachimuthu, BME (Cumming, GA)
  • Jeremiah Sirait, BME (Denver, CO)
  • Emily Yan, BME (Atlanta, GA)
  • Arda Yigitkanli, BME (Woodbridge)

Assisted airport navigation for the visually impaired

  • Aislinn Abbott, ME (York, PA)
  • Jackie Chen, CmpE (Calcutta, OH)
  • Alaz Cig, ME (Istanbul, Turkey)
  • Andrew Gunawan, CmpE (Austin, TX)
  • James Mead, ME (Eatonton, GA)
  • Rithvi Ravichandran, CmpE (Jacksonville, FL)
  • Hanrui Wang, CS (Tianjin, China)

EMG Controlled Wheelchair

INTERDISCIPLINARY

Wheelchair control for tetraplegics

  • Bareesh Bhaduri, EE (Knoxville, TN)
  • Indraja Chatterjee, CmpE (Carmel, IN)
  • Yash Fichadia, CmpE (Omaha, NE)
  • Philip Kuhle, EE (Camas, WA)
  • Nicholas Leone, EE (Orlando, FL)
  • Kartik Parameswaran, EE (Chelmsford, MA)
  • Eduardo Sanchez, ME (San Juan)

DANIEL GUGGENHEIM SCHOOL OF AEROSPACE ENGINEERING

Orbital anomaly recovery system 

  • Oscar Haase (Pelham, MA)
  • Elliot Kantor (Jacksonville, FL)
  • Vishal Rachapudi (Hillsborough, NJ)
  • Samuel Stoknes (Oslo)
  • Aiden Wilson (Louisville, KY)

Gray and Nola

SCHOOL OF ARCHITECTURE

  • Nola Timmins (New Orleans, LA)
  • Gray Walters (Marietta, GA)  

WALLACE H. COULTER DEPARTMENT OF BIOMEDICAL ENGINEERING

Improved hospital wound care device 

  • Shangze Lyu (Zhengzhou, Henan, China)
  • Deniz Onalir (Istanbul, Turkey)
  • Aya Samadi (Lexington, KY)
  • Xiaokun Xie (Jiangsu, China)

EcoPeach Solutions

SCHOOL OF CIVIL AND ENVIRONMENTAL ENGINEERING

Sustainable fresh produce system for NASA 

  • Jessica Brown, CE (Brooklyn, NY)
  • Pearl Dumbu, CE (Pretoria, South Africa)
  • Ananya Kumar, EnvE (Dacula, GA)
  • Annabelle Sarkissian, EnvE (Atlanta, GA)

Electric Pump for Rocket Propellants

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING

  • William Ewles, EE (Hamilton, Bermuda)
  • Cody Kaminsky, EE (Fairfax, VA)
  • Mihir Kasmalkar, EE (San Jose, CA)
  • Ochogie Omot, EE (Atlanta, GA)

The Americoldest

H. MILTON STEWART SCHOOL OF INDUSTRIAL AND SYSTEMS ENGINEERING

Labor planning model for Americold Logistics

  • Rohan Bagade (Johns Creek, GA)
  • Landen Ledford (Chatsworth, GA)
  • Curran Myers (Alpharetta, GA)
  • Chandler Pittman (Rome, GA)
  • Justin Siegel (Dunwoody, GA)
  • Stephen Sowatzka (Atlanta, GA)
  • Nicholas Van (Savannah, GA)
  • Ashley Wilds Jr. (Augusta, GA)

INDUSTRIAL DESIGN AND ENGINEERING

Clothing solution for post-mastectomy patients

  • Fatimah Ahmed, ID (Atlanta, GA)
  • Catherine Ettershank, ID (Suwanee, GA)
  • Wyatt Pangan, ME (Austin, TX)
  • Christopher Rowley, ME (West Chester)
  • Max Shapiro, MSE (Atlanta, GA)

SCHOOL OF MATERIALS SCIENCE AND ENGINEERING

Lignin-filled natural rubber for tread applications

  • Katherine Cauffiel (Kennesaw, GA)
  • Téa Cook (Simpsonville, SC)
  • Ryan Cortes (Locust Grove, GA)
  • Arian Patel (Leesburg, VA)
  • Evan Wilson (Newnan, GA)

GEORGE W. WOODRUFF SCHOOL OF MECHANICAL ENGINEERING

Total shoulder arthroplasty retractor set 

  • Miguel Daly (Orange Park, FL)
  • Maxwell Gart (Avon, CT)
  • Isabelle Gustafson (Carrollton, GA)
  • Sana Hafeez (Atlanta, GA)
  • Lena Moller (Flowery Branch, GA)
  • Claudia Vitale (Tampa, FL)

NUCLEAR AND RADIOLOGICAL ENGINEERING

Compact on-board reactor assembly for trains

  • Evelyn Ayers (Chattanooga, TN)
  • Samuel Cochran (Charlotte)
  • Andrew Scheuermann (Evans, GA)
  • Kayla Watanabe (Lincolnshire, IL)

Mobile Crisis Response

SCHOOL OF PUBLIC POLICY (TIE)

Mental health crisis response

  • Katie Adcock (Dublin, GA)
  • Ann Brumbaugh (Atlanta, GA)
  • Ashley Cotsman (Alpharetta, GA)
  • Blaine Kantor (Suwanee, GA)
  • Angela Kim (Cumming, GA)
  • Clare Moegerle (Atlanta, GA)
  • Adaiba Nwasike (Marietta, GA)

SCRAH Patriots

Sustainable, climate-resilient affordable housing

  • Sophia Abedi (Alpharetta, GA)
  • Noemi Carrillo (Marietta, GA)
  • Linda Liu (Lawrenceville, GA)
  • Lillian Mason
  • Sophie Opolka
  • Abigail Peters

Honorable Mentions

  • Buzz Timer Pro , Interdisciplinary
  • Reparations Art + Design Residency , Architecture
  • Arterial Avengers , Interdisciplinary
  • ROBO , Interdisciplinary
  • PopUp Spaces , Interdisciplinary
  • Plus Minus , Interdisciplinary
  • AIRplanes , ECE



Moscow hopes to become first 5G city by 2020

One of the 5G network's specifications will be a speed of 100 megabits per second for residents of large cities.

The Moscow mayor's office is in talks with a consortium of mobile operators over the possibility of developing 5G networks, the Kommersant daily reported on April 7. The government is determined to make the project an attractive investment for the operators and hopes the Russian capital will have 5G networks in 2020. 

Moscow's telecom market is divided between four major players: Russian companies Megafon, VimpelCom, and MTS, plus European Tele2, which entered the fray in 2015. A query from RBTH about a 5G consortium received an optimistic response from Megafon and Tele2, but VimpelCom and MTS declined to answer.

"The consortium may lay the foundation for the joint development of this technology by all the operators," said Konstantin Prokshin, head of strategic communications at Tele2.


The support of the authorities is important for telecom operators because such issues as equipment deployment and power supply can often be solved only with the government’s help, explained Yulia Dorokhina, head of the press service at Megafon.

2018 World Cup and rivalry with London

City of London Corp., which runs London's financial center at the municipal level, has announced its plans to switch to the 5G standard as soon as it becomes available, writes The Financial Times. The company has signed a multimillion dollar wireless Internet upgrade contract with Cornerstone, which is owned by the Vodafone and O2 telecom operators.

Global capitals will be competing with each other over which of them will become the first to switch to 5G, said Konstantin Prokshin. The pace at which new technologies are introduced suggests that Moscow can indeed become one of the leaders in the development of 5G, he added. "Moscow's mobile market is one of the most developed in the world, with a low average cost of services and high quality," Prokshin pointed out.

During the 2018 FIFA World Cup in Moscow and St. Petersburg, Megafon plans to set up 5G test zones, Yulia Dorokhina said. "One of the main advantages offered by the new network is its huge capacity. The client receives high-quality signal in places of mass gathering of people – stadiums, railway stations, traffic jams," she added.

What is known about 5G today

Exact 5G specifications are still being developed, but one of them – as identified by the Next Generation Mobile Networks alliance – will be a speed of 100 megabits per second for residents of large cities.

"So far, some disparate research experiments have been conducted. What exactly the 'fifth generation' will provide is not quite clear," said Vladimir Korovkin, head of Innovations and Digital Technologies at the Moscow School of Management Skolkovo.

He added that the focus of 5G developers is not to increase the bandwidth of the channel, but to provide a guaranteed high-speed signal and density of coverage. "Both these features are important for mass use of M2M (machine to machine) networks," Korovkin explained.

One of the crucial questions is who will be producing the technical equipment and how the link to international networks will work. For the first time, Chinese companies, in particular Huawei, are taking an active part in creating a new standard, Korovkin pointed out. For example, Megafon has successfully tested mobile data transmission at 1 Gbit/s using Huawei equipment and at 5 Gbit/s during network equipment tests with the Finnish company Nokia, Dorokhina said.



Udacity Data Engineering Nanodegree Capstone Project

Modingwa/Data-Engineering-Capstone-Project

Project Summary

The objective of this project was to create an ETL pipeline for the I94 immigration, global land temperatures, and US demographics datasets to form an analytics database on immigration events. A use case for this analytics database is finding immigration patterns to the US. For example, we could try to answer questions such as: do people from countries with warmer or colder climates immigrate to the US in larger numbers?
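
To make that use case concrete, here is a minimal sketch of such a query, assuming hypothetical output paths and column names (immigration_fact, country_dim, average_temperature, country_residence_code) for the finished model:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("immigration-analytics").getOrCreate()

# Hypothetical locations of the finished dimensional tables
fact = spark.read.parquet("s3a://my-bucket/immigration_fact")
country = spark.read.parquet("s3a://my-bucket/country_dim")

# Label each country of residence as warm or cold and count arrivals per label
(fact.join(country, fact.country_residence_code == country.country_code)
     .withColumn("climate", F.when(F.col("average_temperature") > 15, "warm")
                             .otherwise("cold"))
     .groupBy("climate")
     .count()
     .show())
```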

Data and Code

All the data for this project was loaded into S3 prior to commencing the project. The exception is the i94res.csv file, which was loaded into the Amazon EMR HDFS filesystem.
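
As a sketch of how the raw sources might be loaded (the S3 and HDFS paths here are assumptions), the saurfang spark-sas7bdat package provides the reader for the SAS binary immigration files:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("capstone-etl")
         .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.10")
         .getOrCreate())

# I94 immigration data in SAS binary format, staged in S3 (path is an assumption)
immigration_df = (spark.read
                  .format("com.github.saurfang.sas.spark")
                  .load("s3a://my-bucket/i94_apr16_sub.sas7bdat"))

# i94res country-code mapping file sitting on the EMR HDFS filesystem
i94res_df = spark.read.csv("hdfs:///i94res.csv", header=True, inferSchema=True)
```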

In addition to the data files, the project workspace includes:

  • etl.py - reads data from S3, processes it using Spark, and writes the processed data back to S3 as a set of dimensional tables
  • etl_functions.py and utility.py - these modules contain the functions for creating the fact and dimension tables, data visualizations, and data cleaning
  • config.cfg - contains the configuration that allows the ETL pipeline to access the AWS EMR cluster
  • Jupyter Notebook - the notebook used for building the ETL pipeline

Prerequisites

  • AWS EMR cluster
  • Apache Spark
  • Python 3 with the configparser module, needed to run the Python scripts
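
For illustration, a minimal sketch of how config.cfg might be read with configparser to give Spark access to AWS; the section and key names are assumptions:

```python
import configparser
import os

config = configparser.ConfigParser()
config.read("config.cfg")

# Expose the credentials to the S3 connector via environment variables
# (the [AWS] section and key names are assumptions about config.cfg)
os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
```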

The project follows these steps:

  • Step 1: Scope the Project and Gather Data
  • Step 2: Explore and Assess the Data
  • Step 3: Define the Data Model
  • Step 4: Run ETL to Model the Data
  • Step 5: Complete Project Write Up

Project Scope

To create the analytics database, the following steps will be carried out:

  • Use Spark to load the data into dataframes.
  • Exploratory data analysis of I94 immigration dataset to identify missing values and strategies for data cleaning.
  • Exploratory data analysis of demographics dataset to identify missing values and strategies for data cleaning.
  • Exploratory data analysis of global land temperatures by city dataset to identify missing values and strategies for data cleaning.
  • Perform data cleaning functions on all the datasets.
  • Create the immigration calendar dimension table from the I94 immigration dataset; this table links to the fact table through the arrdate field (see the sketch after this list).
  • Create the country dimension table from the I94 immigration and global temperatures datasets. The global land temperatures data was aggregated at country level. The table links to the fact table through the country-of-residence code, allowing analysts to study the correlation between the climate of a country of residence and immigration to US states.
  • Create the US demographics dimension table from the US cities demographics data. This table links to the fact table through the state code field.
  • Create the fact table from the cleaned I94 immigration dataset and the visa_type dimension.
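
Continuing from the loading sketch above, here is a minimal sketch of the calendar dimension; it assumes arrdate is a SAS numeric date, i.e. a day offset from 1960-01-01, which is how the I94 files encode dates:

```python
import pyspark.sql.functions as F

calendar_dim = (immigration_df
    .select("arrdate")
    .dropDuplicates()
    # SAS dates count days from 1960-01-01
    .withColumn("arrival_date",
                F.expr("date_add(to_date('1960-01-01'), cast(arrdate as int))"))
    .withColumn("arrival_year", F.year("arrival_date"))
    .withColumn("arrival_month", F.month("arrival_date"))
    .withColumn("arrival_day", F.dayofmonth("arrival_date"))
    .withColumn("arrival_weekday", F.dayofweek("arrival_date")))
```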

The technologies used in this project are Amazon S3 and Apache Spark. Data will be read and staged from the customer's repository using Spark.

Refer to the Jupyter notebook for the exploratory data analysis.

3.1 Conceptual Data Model

Database schema

The country dimension table is made up of data from the global land temperatures by city and the immigration datasets. The combination of these two datasets allows analysts to study correlations between global land temperatures and immigration patterns to the US.
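
A sketch of that aggregation and join, reusing names from the earlier loading sketch; the i94res_df column names and the temperature file location are assumptions (the Kaggle source file is GlobalLandTemperaturesByCity.csv):

```python
import pyspark.sql.functions as F

temperatures_df = spark.read.csv("s3a://my-bucket/GlobalLandTemperaturesByCity.csv",
                                 header=True, inferSchema=True)

# Aggregate city-level readings up to one average temperature per country
country_temps = (temperatures_df
    .groupBy("Country")
    .agg(F.avg("AverageTemperature").alias("average_temperature")))

# Join onto the I94 country-of-residence code mapping (column names assumed)
country_dim = (i94res_df
    .join(country_temps,
          F.lower(i94res_df.country_name) == F.lower(country_temps.Country),
          "left")
    .select("country_code", "country_name", "average_temperature"))
```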

The US demographics dimension table comes from the demographics dataset and links to the immigration fact table at US state level. This dimension would allow analysts to get insights into migration patterns into the US based on demographics as well as the overall population of states. We could ask questions such as: do populous states attract more visitors on a monthly basis? One can envision a dashboard built on this data model with drill-downs into granular information on visits to the US. Such a dashboard could foster a culture of data-driven decision making within tourism and immigration departments at state level.
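
An illustrative query for that question; the table locations and the total_population column are assumptions about the finished model:

```python
fact = spark.read.parquet("s3a://my-bucket/immigration_fact")
demographics_dim = spark.read.parquet("s3a://my-bucket/us_demographics_dim")

fact.createOrReplaceTempView("immigration_fact")
demographics_dim.createOrReplaceTempView("us_demographics_dim")

# Visitor counts per state next to state population, most populous first
spark.sql("""
    SELECT d.state_code,
           MAX(d.total_population) AS population,
           COUNT(*)                AS visitors
    FROM immigration_fact f
    JOIN us_demographics_dim d ON f.state_code = d.state_code
    GROUP BY d.state_code
    ORDER BY population DESC
""").show()
```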

The visa type dimension table comes from the immigration dataset and links to the immigration fact table via the visa_type_key.
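
A sketch of that dimension, reusing immigration_df from the loading sketch; visatype is the I94 column, and the surrogate key is generated:

```python
import pyspark.sql.functions as F

# Distinct visa types with a generated surrogate key
visa_type_dim = (immigration_df
    .select("visatype")
    .dropDuplicates()
    .withColumn("visa_type_key", F.monotonically_increasing_id()))
```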

The immigration fact table is the heart of the data model. This table's data comes from the immigration datasets and contains the keys that link to the dimension tables. The data dictionary of the immigration dataset contains detailed information on the data that makes up the fact table.
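
A sketch of assembling the fact table from the pieces above; the column renames are illustrative, but cicid, i94res, i94addr, and visatype are actual I94 fields:

```python
immigration_fact = (immigration_df
    .join(visa_type_dim, on="visatype", how="left")   # pick up visa_type_key
    .withColumnRenamed("cicid", "record_id")
    .withColumnRenamed("i94res", "country_residence_code")
    .withColumnRenamed("i94addr", "state_code")
    .drop("visatype"))
```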

3.2 Mapping Out Data Pipelines

The pipeline steps are as follows (a condensed sketch of the orchestration follows the list):

  • Load the datasets
  • Clean the I94 immigration data to create a Spark dataframe for each month
  • Create visa_type dimension table
  • Create calendar dimension table
  • Extract clean global temperatures data
  • Create country dimension table
  • Create immigration fact table
  • Load demographics data
  • Clean demographics data
  • Create demographic dimension table
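
A condensed sketch of how etl.py might orchestrate these steps; the function names are assumptions about what etl_functions.py exposes, and the input and output paths are hypothetical:

```python
from pyspark.sql import SparkSession

# Assumed helper names; the real ones live in etl_functions.py / utility.py
from etl_functions import (clean_immigration_data, create_visa_type_dimension,
                           create_calendar_dimension, create_country_dimension,
                           create_immigration_fact, clean_demographics_data,
                           create_demographics_dimension)

def main():
    spark = SparkSession.builder.appName("capstone-etl").getOrCreate()

    immigration_df = clean_immigration_data(
        spark.read.format("com.github.saurfang.sas.spark")
             .load("s3a://my-bucket/i94_apr16_sub.sas7bdat"))

    visa_type_dim = create_visa_type_dimension(immigration_df)
    calendar_dim = create_calendar_dimension(immigration_df)
    country_dim = create_country_dimension(
        spark.read.csv("s3a://my-bucket/GlobalLandTemperaturesByCity.csv",
                       header=True, inferSchema=True),
        spark.read.csv("hdfs:///i94res.csv", header=True, inferSchema=True))
    fact = create_immigration_fact(immigration_df, visa_type_dim)
    demographics_dim = create_demographics_dimension(
        clean_demographics_data(
            spark.read.csv("s3a://my-bucket/us-cities-demographics.csv",
                           sep=";", header=True, inferSchema=True)))

    # Persist each table back to S3 as parquet
    for name, df in [("visa_type_dim", visa_type_dim),
                     ("calendar_dim", calendar_dim),
                     ("country_dim", country_dim),
                     ("immigration_fact", fact),
                     ("us_demographics_dim", demographics_dim)]:
        df.write.mode("overwrite").parquet(f"s3a://my-bucket/{name}")

if __name__ == "__main__":
    main()
```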

Step 4: Run Pipelines to Model the Data

4.1 Create the Data Model

Refer to the Jupyter notebook for the data dictionary.

4.2 Running the ETL pipeline

The ETL pipeline is defined in the etl.py script, which uses the utility.py and etl_functions.py modules to create the final tables in Amazon S3.

```
spark-submit --packages saurfang:spark-sas7bdat:2.0.0-s_2.10 etl.py
```
