Warehousing-Stock-Tweet-Data

A large-scale data framework that will enable us to store and analyze financial market data and drive future predictions for investment.

Big Data Lake Solution for Warehousing Stock Data and Tweet Data

Created by Stuart Miller, Paul Adams, and Rikel Djoko.

Problem Statement

We want to build a large-scale data framework that will enable us to store and analyze financial market data as well as drive future predictions for investment.

For this project, we will use two types of data: stock market data and tweet data.

Overview of the Big Data Solution

Overview of Solution

Data Warehouse Overview

Two schemas were designed for this data warehouse: a fully normalized snowflake schema and a denormalized star schema. We will investigate the performance of the two schemas in the context of this problem. Conceptual diagrams of both schemas are shown below.

More detailed schema diagrams were created with MySQL Workbench; the schema design can be accessed here.

Snowflake Schema

A diagram of the data warehouse snowflake schema is shown below.

Snowflake

Denormalized Star Schema

A diagram of the data warehouse star schema is shown below.

Optimized_Star_Schema
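
To make the difference between the two designs concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It is not the project's actual schema: the table names (`sector`, `company_sf`, `company_star`, `price_fact`), the columns, the placeholder ticker `XYZ`, and the toy rows are all assumptions chosen for illustration. The point is only that the same question needs one extra join in the normalized (snowflake) layout than in the denormalized (star) layout.

```python
import sqlite3

# In-memory database; every table, column, and row below is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflake (normalized): the company dimension points to a separate sector table.
CREATE TABLE sector (sector_id INTEGER PRIMARY KEY, sector_name TEXT);
CREATE TABLE company_sf (
    company_id INTEGER PRIMARY KEY,
    ticker TEXT,
    sector_id INTEGER REFERENCES sector(sector_id)
);

-- Star (denormalized): the sector name is stored directly on the company dimension.
CREATE TABLE company_star (
    company_id INTEGER PRIMARY KEY,
    ticker TEXT,
    sector_name TEXT
);

-- A single fact table shared by both variants, to keep the sketch short.
CREATE TABLE price_fact (company_id INTEGER, trade_date TEXT, close_price REAL);

-- Made-up toy rows, not real market data.
INSERT INTO sector VALUES (1, 'Technology');
INSERT INTO company_sf VALUES (10, 'XYZ', 1);
INSERT INTO company_star VALUES (10, 'XYZ', 'Technology');
INSERT INTO price_fact VALUES (10, '2020-01-02', 100.0), (10, '2020-01-03', 110.0);
""")

# Snowflake: two joins are needed to reach the sector name.
SNOWFLAKE_QUERY = """
SELECT s.sector_name, AVG(f.close_price)
FROM price_fact f
JOIN company_sf c ON c.company_id = f.company_id
JOIN sector s ON s.sector_id = c.sector_id
GROUP BY s.sector_name
"""

# Star: one join is enough because the dimension is already flattened.
STAR_QUERY = """
SELECT c.sector_name, AVG(f.close_price)
FROM price_fact f
JOIN company_star c ON c.company_id = f.company_id
GROUP BY c.sector_name
"""

snowflake_rows = conn.execute(SNOWFLAKE_QUERY).fetchall()
star_rows = conn.execute(STAR_QUERY).fetchall()
```

Both queries return the same rows. The trade-off is that the star layout repeats the sector name on every company row in exchange for fewer joins at query time.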

Big Data Solution Implementation

The big data solution is built on AWS.

Big_data_solution_AWS

Results

Queries were run on the two schemas with different EMR cluster sizes to see the impact of normalization on query time. The collected data is located here. A plot summarizing the results is shown below.

box_plot
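
A comparison like this can be collected with a small timing harness. The sketch below is an assumption about how such measurements might be taken, not the code used for this project: `time_query` and `summarize` are hypothetical helpers, and the workload is a local stand-in rather than a query submitted to an EMR cluster.

```python
import statistics
import time


def time_query(run_query, repeats=5):
    """Call run_query() `repeats` times and return the elapsed seconds of each call."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Return the median, minimum, and maximum of a list of timings."""
    return {
        "median": statistics.median(timings),
        "min": min(timings),
        "max": max(timings),
    }


# Stand-in workload; in practice run_query would execute one query
# against one schema on a cluster of a given size.
timings = time_query(lambda: sum(range(10_000)), repeats=5)
summary = summarize(timings)
```

Repeating each query several times and reporting the spread, rather than a single run, is what makes a box plot of the timings meaningful.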

Reports

These reports were created during the course of this project.