214 Best Big Data Research Topics for Your Thesis Paper

Finding an ideal big data research topic can take a long time. Big data, IoT, and robotics have evolved rapidly, and future generations will be immersed in technologies that make work easier. Work that once took 10 people can now be done by one person or a machine. Some jobs will be lost, but new ones will be created.

Big data is a major topic that is being embraced globally. Data science and analytics are helping institutions, governments, and the private sector. We will share with you the best big data research topics.

On top of that, we can offer you writing tips to help you succeed in your studies. As a university student, you need to do proper research to get top grades. You can consult us if you need research paper writing services.

Big Data Analytics Research Topics for your Research Project

Are you looking for an ideal big data analytics research topic? Once you choose a topic, consult your professor to evaluate whether it is a great topic. This will help you to get good grades.

  • Which are the best tools and software for big data processing?
  • Evaluate the security issues that face big data.
  • An analysis of large-scale data for social networks globally.
  • The influence of big data storage systems.
  • The best platforms for big data computing.
  • The relation between business intelligence and big data analytics.
  • The importance of semantics and visualization of big data.
  • Analysis of big data technologies for businesses.
  • The common methods used for machine learning in big data.
  • The difference between self-tuning and symmetrical spectral clustering.
  • The importance of information-based clustering.
  • Evaluate the hierarchical clustering and density-based clustering application.
  • How is data mining used to analyze transaction data?
  • The major importance of dependency modeling.
  • The influence of probabilistic classification in data mining.

Interesting Big Data Analytics Topics

Who said big data had to be boring? Here are some interesting big data analytics topics that you can try. They look at how big data is applied to real-world problems.

  • Discuss the privacy issues in big data.
  • Evaluate scalable storage systems in big data.
  • The best big data processing software and tools.
  • The most popular data mining tools and techniques.
  • Evaluate the scalable architectures for parallel data processing.
  • The major natural language processing methods.
  • Which are the best big data tools and deployment platforms?
  • The best algorithms for data visualization.
  • Analyze the anomaly detection in cloud servers.
  • The scrutiny normally done for the recruitment of big data job profiles.
  • The malicious user detection in big data collection.
  • Learning long-term dependencies via the Fourier recurrent units.
  • Nomadic computing for big data analytics.
  • The elementary estimators for graphical models.
  • The memory-efficient kernel approximation.

Big Data Latest Research Topics

Do you know the latest research topics at the moment? These 15 topics will help you to dive into interesting research. You may even build on research done by other scholars.

  • Evaluate the data mining process.
  • The influence of the various dimension reduction methods and techniques.
  • The best data classification methods.
  • The simple linear regression modeling methods.
  • Evaluate the logistic regression modeling.
  • What are the commonly used theorems?
  • The influence of cluster analysis methods in big data.
  • The importance of smoothing methods analysis in big data.
  • How is fraud detection done through AI?
  • Analyze the use of GIS and spatial data.
  • How important is artificial intelligence in the modern world?
  • What is agile data science?
  • Analyze the behavioral analytics process.
  • Semantic analytics distribution.
  • How is domain knowledge important in data analysis?

Big Data Debate Topics

If you want to prosper in the field of big data, you need to tackle hard topics too. These big data debate topics are interesting and will help you to get a better understanding.

  • The difference between big data analytics and traditional data analytics methods.
  • Why do you think the organization should think beyond the Hadoop hype?
  • Does the size of the data matter more than how recent the data is?
  • Is it true that bigger data are not always better?
  • The debate of privacy and personalization in maintaining ethics in big data.
  • The relation between data science and privacy.
  • Do you think data science is a rebranding of statistics?
  • Who delivers better results between data scientists and domain experts?
  • According to your view, is data science dead?
  • Do you think analytics teams need to be centralized or decentralized?
  • The best methods to resource an analytics team.
  • The best business case for investing in analytics.
  • The societal implications of the use of predictive analytics within Education.
  • Is there a need for greater control to prevent experimentation on social media users without their consent?
  • How is the government using big data; for the improvement of public statistics or to control the population?

University Dissertation Topics on Big Data

Are you doing your Master's or Ph.D. and wondering which dissertation or thesis topic to choose? Why not try any of these? They are interesting and based on various phenomena. While doing the research, ensure you relate the phenomenon to modern society.

  • The machine learning algorithms used for fall recognition.
  • The divergence and convergence of the internet of things.
  • The reliable data movements using bandwidth provision strategies.
  • How is big data analytics using artificial neural networks in cloud gaming?
  • How is Twitter accounts classification done using network-based features?
  • How is online anomaly detection done in the cloud collaborative environment?
  • Evaluate the public transportation insights provided by big data.
  • Evaluate the paradigm for cancer patients using the nursing EHR to predict the outcome.
  • Discuss the current data lossless compression in the smart grid.
  • How does online advertising traffic prediction help in boosting businesses?
  • How is the hyperspectral classification done using the multiple kernel learning paradigm?
  • The analysis of large data sets downloaded from websites.
  • How does social media data help advertising companies globally?
  • Which are the systems recognizing and enforcing ownership of data records?
  • The alternate possibilities emerging for edge computing.

The Best Big Data Analysis Research Topics and Essays

There are a lot of issues that are associated with big data. Here are some of the research topics that you can use in your essays. These topics are ideal whether in high school or college.

  • The various errors and uncertainty in making data decisions.
  • The application of big data on tourism.
  • The automation innovation with big data or related technology.
  • The business models of big data ecosystems.
  • Privacy awareness in the era of big data and machine learning.
  • The data privacy for big automotive data.
  • How is traffic managed in defined data center networks?
  • Big data analytics for fault detection.
  • The need for machine learning with big data.
  • The innovative big data processing used in health care institutions.
  • The money normalization and extraction from texts.
  • How is text categorization done in AI?
  • The opportunistic development of data-driven interactive applications.
  • The use of data science and big data towards personalized medicine.
  • The programming and optimization of big data applications.

The Latest Big Data Research Topics for your Research Proposal

Doing a research proposal can be hard at first unless you choose an ideal topic. If you are just diving into the big data field, you can use any of these topics to get a deeper understanding.

  • The data-centric network of things.
  • Big data management using artificial intelligence supply chain.
  • The big data analytics for maintenance.
  • The high confidence network predictions for big biological data.
  • The performance optimization techniques and tools for data-intensive computation platforms.
  • The predictive modeling in the legal context.
  • Analysis of large data sets in life sciences.
  • How to understand the mobility and transport modal disparities using emerging data sources?
  • How do you think data analytics can support asset management decisions?
  • An analysis of travel patterns for cellular network data.
  • The data-driven strategic planning for citywide building retrofitting.
  • How is money normalization done in data analytics?
  • Major techniques used in data mining.
  • The big data adaptation and analytics of cloud computing.
  • The predictive data maintenance for fault diagnosis.

Interesting Research Topics on A/B Testing In Big Data

A/B testing topics are different from the normal big data topics. However, you use a broadly similar methodology to find the reasons behind the issues. These topics are interesting and will help you to get a deeper understanding.

  • How is ultra-targeted marketing done?
  • The transition of A/B testing from digital to offline.
  • How can big data and A/B testing be done to win an election?
  • Evaluate the use of A/B testing on big data.
  • Evaluate A/B testing as a randomized controlled experiment.
  • How does A/B testing work?
  • The mistakes to avoid while conducting the A/B testing.
  • The most ideal time to use A/B testing.
  • The best way to interpret results for an A/B test.
  • The major principles of A/B tests.
  • Evaluate the cluster randomization in big data.
  • The best way to analyze A/B test results and the statistical significance.
  • How is A/B testing used in boosting businesses?
  • The importance of data analysis in conversion research.
  • The importance of A/B testing in data science.
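
For the topics above on interpreting A/B test results and statistical significance, one common starting point is a two-proportion z-test on conversion rates. The sketch below is illustrative only: the function name and the visitor and conversion counts are made up, and it assumes samples large enough for the normal approximation to hold.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: variant A converts 200 of 5,000 visitors,
# variant B converts 260 of 5,000.
z, p = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference is unlikely to be due to chance alone, but it says nothing about whether the difference is large enough to matter for the business.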

Amazing Research Topics on Big Data and Local Governments

Governments are now using big data to make the lives of citizens better, both in central government and in its various institutions. These topics are based on real-life experiences of making the world better.

  • Assess the benefits and barriers of big data in the public sector.
  • The best approach to smart city data ecosystems.
  • The big analytics used for policymaking.
  • Evaluate the smart technology and emergence algorithm bureaucracy.
  • Evaluate the use of citizen scoring in public services.
  • An analysis of the government administrative data globally.
  • The public values found in the era of big data.
  • Public engagement on local government data use.
  • Data analytics use in policymaking.
  • How are algorithms used in public sector decision-making?
  • The democratic governance in the big data era.
  • The best business model innovation to be used in sustainable organizations.
  • How does the government use the collected data from various sources?
  • The role of big data for smart cities.
  • How does big data play a role in policymaking?

Easy Research Topics on Big Data

Who said big data topics had to be hard? Here are some of the easiest research topics. They are based on data management, research, and data retention. Pick one and try it!

  • Who uses big data analytics?
  • Evaluate structure machine learning.
  • Explain the whole deep learning process.
  • Which are the best ways to manage platforms for enterprise analytics?
  • Which are the new technologies used in data management?
  • What is the importance of data retention?
  • The best way to work with images when doing research.
  • The best way to promote research outreach through data management.
  • The best way to source and manage external data.
  • Does machine learning improve the quality of data?
  • Describe the security technologies that can be used in data protection.
  • Evaluate token-based authentication and its importance.
  • How can poor data security lead to the loss of information?
  • How to determine secure data.
  • What is the importance of centralized key management?

Unique IoT and Big Data Research Topics

Internet of Things has evolved and many devices are now using it. There are smart devices, smart cities, smart locks, and much more. Things can now be controlled by the touch of a button.

  • Evaluate the 5G networks and IoT.
  • Analyze the use of Artificial intelligence in the modern world.
  • How do ultra-power IoT technologies work?
  • Evaluate the adaptive systems and models at runtime.
  • How have smart cities and smart environments improved the living space?
  • The importance of the IoT-based supply chains.
  • How does smart agriculture influence water management?
  • The internet applications naming and identifiers.
  • How does the smart grid influence energy management?
  • Which are the best design principles for IoT application development?
  • The best human-device interactions for the Internet of Things.
  • The relation between urban dynamics and crowdsourcing services.
  • The best wireless sensor network for IoT security.
  • The best intrusion detection in IoT.
  • The importance of big data on the Internet of Things.

Big Data Database Research Topics You Should Try

Big data is broad and interesting. These big data database research topics will put you in a better place in your research. You also get to evaluate the roles of various phenomena.

  • The best cloud computing platforms for big data analytics.
  • The parallel programming techniques for big data processing.
  • The importance of big data models and algorithms in research.
  • Evaluate the role of big data analytics for smart healthcare.
  • How is big data analytics used in business intelligence?
  • The best machine learning methods for big data.
  • Evaluate the Hadoop programming in big data analytics.
  • What is privacy-preserving big data analytics?
  • The best tools for massive big data processing.
  • IoT deployment in Governments and Internet service providers.
  • How will IoT be used for future internet architectures?
  • How does big data close the gap between research and implementation?
  • What are the cross-layer attacks in IoT?
  • The influence of big data and smart city planning in society.
  • Why do you think user access control is important?

Big Data Scala Research Topics

Scala is a programming language that runs on the Java Virtual Machine and is widely used in big data processing; Apache Spark, for example, is written in Scala. It interoperates closely with Java. Here are some of the best Scala questions that you can research.

  • Which are the most used languages in big data?
  • How is Scala used in big data research?
  • Is Scala better than Java in big data?
  • How is Scala a concise programming language?
  • How does the Scala language process streams in real-time?
  • Which are the various libraries for data science and data analysis?
  • How does Scala allow imperative programming in data collection?
  • Evaluate how Scala includes a useful REPL for interaction.
  • Evaluate Scala’s IDE support.
  • The data catalog reference model.
  • Evaluate the basics of data management and its influence on research.
  • Discuss the behavioral analytics process.
  • What can you term as the experience economy?
  • The difference between agile data science and the Scala language.
  • Explain the graph analytics process.

Independent Research Topics for Big Data

These independent research topics for big data are based on the various technologies and how they are related. Big data will greatly be important for modern society.

  • The biggest investments in big data analysis.
  • How are multi-cloud and hybrid settings taking deep root?
  • Why do you think machine learning will be in focus for a long while?
  • Discuss in-memory computing.
  • What is the difference between edge computing and in-memory computing?
  • The relation between the Internet of things and big data.
  • How will digital transformation make the world a better place?
  • How does data analysis help in social network optimization?
  • How will complex big data be essential for future enterprises?
  • Compare the various big data frameworks.
  • The best way to gather and monitor traffic information using CCTV images.
  • Evaluate the hierarchical structure of groups and clusters in the decision tree.
  • Which are the 3D mapping techniques for live streaming data?
  • How does machine learning help to improve data analysis?
  • Evaluate DataStream management in task allocation.
  • How is big data provisioned through edge computing?
  • The model-based clustering of texts.
  • The best ways to manage big data.
  • The use of machine learning in big data.

Is Your Big Data Thesis Giving You Problems?

These are some of the best topics that you can use to prosper in your studies. Not only are they easy to research, but they also reflect real-world issues. Whether in university or college, you need to put enough effort into your studies to prosper. However, if you have time constraints, we can provide professional writing help. Are you looking for online expert writers? Look no further: we will provide quality work at a cheap price.

Top 15 Big Data Projects (With Source Code)

Almost 6.5 billion connected devices communicate data over the Internet today, and this figure is expected to climb to 20 billion by 2025. Big data analytics turns this “sea of data” into the information that is reshaping our world. Big data refers to massive, diverse volumes of data, both structured and unstructured, that bombard enterprises daily and grow at an exponential rate. But it’s not simply the type or quantity of data that matters; it’s also what businesses do with it. Big data can be analyzed for insights that help people make key business decisions with more confidence. Three factors characterize it, known as the “three V’s” of big data: the volume of data, the velocity or speed with which it is created and collected, and the variety or scope of the data points covered. Big data is frequently gathered through data mining and is available in a variety of formats.

Big data comes in two types: structured and unstructured. Structured data has a set length and format; numbers, dates, and strings (sequences of characters such as words and numbers) are examples. Unstructured data is unorganized data that does not fit into a predetermined model or format. It includes information gleaned from social media sources, which helps organizations gather information on customer demands.

Key Takeaway

  • Big data is a large amount of diversified information that is arriving in ever-increasing volumes and at ever-increasing speeds.
  • Big data can be structured (typically numerical, readily formatted, and easily stored) or unstructured (more free-form, less quantifiable, often non-numerical, and difficult to format and store).
  • Big data analysis may benefit nearly every function in a company, but dealing with the clutter and noise can be difficult.
  • Big data can be gathered willingly through personal devices and applications, through questionnaires, product purchases, and electronic check-ins, as well as publicly published remarks on social networks and websites.
  • Big data is frequently kept in computer databases and examined with software intended to deal with huge, complicated data sets.

Just knowing the theory of big data isn’t going to get you very far. You’ll need to put what you’ve learned into practice. You may put your big data talents to the test by working on big data projects. Projects are an excellent opportunity to put your abilities to the test. They’re also great for your resume. In this article, we are going to discuss some great Big Data projects that you can work on to showcase your big data skills.

1. Traffic control using Big Data

Big Data initiatives that simulate and predict traffic in real-time have a wide range of applications and advantages. Real-time traffic simulation has been modeled successfully. However, anticipating route traffic has long been a challenge, because developing predictive models for real-time traffic prediction involves high latency, large amounts of data, and ever-increasing costs.

The following project is a Lambda Architecture application that monitors the traffic safety and congestion of each street in Chicago. It depicts current traffic collisions, red light and speed camera infractions, as well as traffic patterns on 1,250 street segments within the city borders.

These datasets have been taken from the City of Chicago’s open data portal:

  • Traffic Crashes shows each crash that occurred within city streets as reported in the electronic crash reporting system (E-Crash) at CPD. Citywide data are available starting September 2017.
  • Red Light Camera Violations reflect the daily number of red light camera violations recorded by the City of Chicago Red Light Program for each camera since 2014.
  • Speed Camera Violations reflect the daily number of speed camera violations recorded by each camera in Children’s Safety Zones since 2014.
  • Historical Traffic Congestion Estimates estimates traffic congestion on Chicago’s arterial streets in real-time by monitoring and analyzing GPS traces received from Chicago Transit Authority (CTA) buses.
  • Current Traffic Congestion Estimate shows current estimated speed for street segments covering 300 miles of arterial roads. Congestion estimates are produced every ten minutes.

The project implements the three layers of the Lambda Architecture:

  • Batch layer – manages the master dataset (the source of truth), which is an immutable, append-only set of raw data. It pre-computes batch views from the master dataset.
  • Serving layer – responds to ad-hoc queries by returning pre-computed views (from the batch layer) or building views from the processed data.
  • Speed layer – deals with up-to-date data only to compensate for the high latency of the batch layer.
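
The way the three layers cooperate can be sketched in a few lines. This is a toy, in-memory illustration and not the project's code: the class name, the segment identifiers, and the choice of counting crashes per street segment are all assumptions made for the example.

```python
from collections import Counter

class LambdaCounter:
    """Toy Lambda Architecture that counts crashes per street segment."""

    def __init__(self):
        self.master = []             # batch layer: append-only raw events
        self.batch_view = Counter()  # view pre-computed by the slow batch job
        self.speed_view = Counter()  # speed layer: events newer than the batch view

    def ingest(self, segment_id):
        # New events are appended to the master dataset and also
        # folded into the low-latency speed view.
        self.master.append(segment_id)
        self.speed_view[segment_id] += 1

    def run_batch(self):
        # Recompute the batch view from the whole master dataset, then
        # reset the speed view because the batch view now covers those events.
        self.batch_view = Counter(self.master)
        self.speed_view = Counter()

    def query(self, segment_id):
        # Serving layer: merge the pre-computed view with recent data.
        return self.batch_view[segment_id] + self.speed_view[segment_id]

# Hypothetical segment ids.
store = LambdaCounter()
store.ingest("seg-1")
store.ingest("seg-1")
store.ingest("seg-2")
store.run_batch()
store.ingest("seg-1")  # arrives after the batch job finished
print(store.query("seg-1"))
```

In a real deployment each layer would typically be a separate system, but the division of responsibilities is the same: the master data is never edited in place, and queries combine a complete but stale view with a partial but fresh one.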

Source Code – Traffic Control

2. Search Engine

To comprehend what people are looking for, search engines must deal with trillions of network objects and monitor the online behavior of billions of people. Search engines convert website material into quantifiable data. The given project is a full-featured search engine built on top of a 75-gigabyte Wikipedia corpus with sub-second search latency. It uses two data files: stopwords.txt (a text file containing all the stop words, located in the current directory of the code) and wiki_dump.xml (the XML file containing the full data of Wikipedia). The results show wiki pages sorted by TF-IDF (Term Frequency-Inverse Document Frequency) relevance based on the search terms entered. This project addresses latency, indexing, and large-data concerns with efficient code and the K-way merge sort method.
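
To make TF-IDF ranking and K-way merging concrete, here is a minimal sketch on three tiny in-memory documents. It is not the project's implementation: the stop-word set stands in for stopwords.txt, the documents stand in for parsed Wikipedia pages, and the scoring formula (raw term frequency times the natural log of inverse document frequency) is one common variant chosen as an assumption.

```python
import heapq
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "is", "in", "for"}  # stand-in for stopwords.txt

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_index(docs):
    """Map each term to {doc_id: term frequency}."""
    index = {}
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index.setdefault(term, {})[doc_id] = tf
    return index

def search(query, index, n_docs):
    """Rank documents by summed TF-IDF over the query terms."""
    scores = Counter()
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return [doc_id for doc_id, _ in scores.most_common()]

docs = {
    "Chicago": "chicago is a city in illinois",
    "Toronto": "toronto is a city in ontario",
    "Hadoop": "hadoop is a framework for distributed storage",
}
index = build_index(docs)
results = search("city illinois", index, len(docs))
print(results)

# K-way merge: combine already-sorted term lists from several partial
# indexes into one sorted stream without loading everything at once.
partial_term_lists = [["chicago", "city"], ["city", "toronto"], ["framework", "hadoop"]]
merged_terms = list(heapq.merge(*partial_term_lists))
print(merged_terms)
```

The merge step matters at scale because an index over a corpus this large is usually built in chunks that do not fit in memory together; each chunk is sorted on its own and the sorted chunks are merged in a single pass.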

Source Code – Search Engine

3. Medical Insurance Fraud Detection

This project is a data science model that uses real-time analysis and classification algorithms to help predict fraud in the medical insurance market. The tool can be used by the government to benefit patients, pharmacies, and doctors, ultimately helping to improve industry confidence, address rising healthcare expenses, and reduce the impact of fraud. Healthcare fraud is a major problem that costs Medicare/Medicaid and the insurance industry a lot of money.

Four different big datasets have been joined in this project to get a single table for final data analysis. The datasets collected are:

  • Part D prescriber services: data such as the doctor's name, the doctor's address, disease, symptoms, etc.
  • List of Excluded Individuals and Entities (LEIE) database: this database contains a list of individuals and entities that are excluded from participating in federally funded healthcare programs (for example, Medicare) because of past healthcare fraud.
  • Payments received by physicians from pharmaceutical companies.
  • CMS Part D dataset: data from the Centers for Medicare & Medicaid Services.

The model was developed by considering different key features and applying different machine learning algorithms to see which one performs better. The algorithms are trained to detect irregularities in the dataset so that the authorities can be alerted.
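
The joining step described above can be sketched as follows. This is an illustration under stated assumptions, not the project's pipeline: the records are invented, and the field names (`npi`, `total_claims`) and the idea of using a provider identifier as the join key are hypothetical choices for the example.

```python
# Invented sample rows; field names are hypothetical.
prescribers = [
    {"npi": "1001", "name": "Dr. A", "total_claims": 120},
    {"npi": "1002", "name": "Dr. B", "total_claims": 4300},
    {"npi": "1003", "name": "Dr. C", "total_claims": 95},
]
pharma_payments = {"1001": 250.0, "1002": 18000.0}  # npi -> total payments received
excluded_npis = {"1002"}                            # stand-in for the exclusion list

def build_training_table(prescribers, payments, excluded):
    """Left-join payments onto prescribers and add a binary exclusion label."""
    table = []
    for row in prescribers:
        table.append({
            **row,
            "payments": payments.get(row["npi"], 0.0),   # no record -> no payments
            "is_excluded": int(row["npi"] in excluded),  # label for a classifier
        })
    return table

table = build_training_table(prescribers, pharma_payments, excluded_npis)
for row in table:
    print(row)
```

The resulting single table, with one row per provider and a label column, is the shape that classification algorithms generally expect as input.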

Source Code – Medical Insurance Fraud

4. Data Warehouse Design for an E-Commerce Site

A data warehouse is essentially a vast collection of data for a company that assists the company in making educated decisions based on data analysis. The data warehouse designed in this project is a central repository for an e-commerce site, containing unified data ranging from searches to purchases made by site visitors. The site can manage supply based on demand (inventory management), logistics, the price for maximum profitability, and advertisements based on searches and things purchased by establishing such a data warehouse. Recommendations can also be made based on tendencies in a certain area, as well as age groups, sex, and other shared interests. This is a data warehouse implementation for an e-commerce website “Infibeam” which sells digital and consumer electronics.
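
A common way to organize such a repository is a star schema: one fact table of transactions surrounded by dimension tables. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the table names, columns, and sample rows are assumptions for illustration and are not taken from the Infibeam project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, age_group TEXT, region TEXT);
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    quantity INTEGER,
    amount REAL
);
""")
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                 [(1, "Phone", "Electronics"), (2, "E-book", "Digital")])
conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)",
                 [(1, "18-25", "West"), (2, "26-35", "West"), (3, "18-25", "North")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, 1, 1, 300.0), (2, 2, 1, 2, 10.0),
                  (3, 1, 2, 1, 300.0), (4, 2, 3, 1, 5.0)])

# Revenue by region and product category: the kind of roll-up that
# supports inventory and advertising decisions.
rows = conn.execute("""
    SELECT c.region, p.category, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_customer c ON c.customer_id = f.customer_id
    GROUP BY c.region, p.category
    ORDER BY c.region, p.category
""").fetchall()
for row in rows:
    print(row)
```

Keeping descriptive attributes such as region and age group in dimension tables means the same fact table can answer questions sliced by area, age group, or category without duplicating the sales data.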

Source Code – Data Warehouse Design

5. Text Mining Project

You will be required to perform text analysis and visualization of the delivered documents as part of this project. For beginners, this is one of the best deep learning project ideas. Text mining is in high demand, and it can help you demonstrate your abilities as a data scientist. You can apply natural language processing (NLP) techniques to extract useful information from the resources at the link provided below. The link contains a collection of NLP tools and resources for various languages.

Source Code – Text Mining

6. Big Data Cybersecurity

The major goal of this Big Data project is to use complex multivariate time series data to exploit vulnerability disclosure trends in real-world cybersecurity concerns. In this project, outlier and anomaly detection technologies based on Hadoop, Spark, and Storm are interwoven with the system’s machine learning and automation engine for real-time fraud detection, intrusion detection, and forensics.

For independent Big Data Multi-Inspection / Forensics of high-level risks or volume datasets exceeding local resources, it uses the Ophidia Analytics Framework. Ophidia Analytics Framework is an open-source big data analytics framework that contains cluster-aware parallel operators for data analysis and mining (subsetting, reduction, metadata processing, and so on). The framework is completely connected with Ophidia Server: it takes commands from the server and responds with alerts, allowing processes to run smoothly.

Lumify, an open-source big data analysis and visualization platform, is also included in the Cyber Security System. It provides analysis and visualization of each fraud or intrusion event in temporary, compartmentalized virtual machines, creating a full snapshot of the network infrastructure and infected device. This allows for in-depth analytics and forensic review, and provides a transportable threat analysis for executive-level next steps.

Lumify, a big data analysis and visualization tool developed by Cyberitis, is launched using both local and cloud resources (customizable per environment and user). Only the backend servers (Hadoop, Accumulo, Elasticsearch, RabbitMQ, Zookeeper) are included in the Open Source Lumify Dev Virtual Machine. This VM allows developers to get up and running quickly without having to install the entire stack on their development workstations.
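
The outlier detection on time series data mentioned for this project can be illustrated with a very simple single-variable technique: flag a point when it lies far from the mean of the preceding window. This is a swapped-in teaching example, not the project's Hadoop, Spark, or Storm pipeline; the function name, window size, threshold, and login counts are all assumptions.

```python
from statistics import mean, stdev

def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Return indexes whose value deviates strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # Flat history: any change counts as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Hypothetical logins per minute with one burst at index 7.
logins_per_minute = [20, 22, 19, 21, 20, 23, 21, 250, 22, 20]
flagged = rolling_zscore_anomalies(logins_per_minute)
print(flagged)
```

One limitation worth noting is that a large spike inflates the mean and standard deviation of the windows that follow it, so nearby anomalies can be masked; robust statistics such as the median are a common remedy.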

Source Code – Big Data Cybersecurity

7. Crime Detection

The following project is a multi-class classification model for predicting the types of crimes in the city of Toronto. The dataset, sourced from Toronto Police, includes every major crime committed in the city from 2014 to 2017, with detailed information about the location and time of each offense. Using this data, the developer of the project constructed a multi-class classification model with a Random Forest classifier to predict the type of major crime committed based on time of day, neighborhood, division, year, month, etc.

Big data analytics is used here to discover crime tendencies automatically. Automated, data-driven tools can help analysts and police better comprehend crime patterns, allowing for more precise estimates of past crimes and helping to narrow down suspects.

Source Code – Crime Detection

8. Disease Prediction Based on Symptoms

With the rapid advancement of technology and data, the healthcare domain is one of the most significant study fields in the contemporary era. The enormous amount of patient data is tough to manage. Big Data Analytics makes it easier to manage this information (Electronic Health Records are one of the biggest examples of the application of big data in healthcare). Knowledge derived from big data analysis gives healthcare specialists insights that were not available before. In healthcare, big data is used at every stage of the process, from medical research to patient experience and outcomes. There are numerous ways of treating various ailments throughout the world. Machine Learning and Big Data are new approaches that aid in disease prediction and diagnosis. This research explored how machine learning algorithms can be used to forecast diseases based on symptoms. The following algorithms have been explored in code:

  • Naive Bayes
  • Decision Tree
  • Random Forest
  • Gradient Boosting
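
As an illustration of the first algorithm in the list, here is a small Bernoulli Naive Bayes classifier written from scratch. It is a sketch, not the project's code: the training records and disease names are invented, and Laplace smoothing with a constant of 1 is an assumed choice.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(records):
    """records: list of (set_of_symptoms, disease). Returns the model parts."""
    disease_counts = Counter(disease for _, disease in records)
    symptom_counts = defaultdict(Counter)
    vocabulary = set()
    for symptoms, disease in records:
        vocabulary |= symptoms
        for s in symptoms:
            symptom_counts[disease][s] += 1
    return disease_counts, symptom_counts, vocabulary

def predict(symptoms, model):
    """Return the disease with the highest posterior log-probability."""
    disease_counts, symptom_counts, vocabulary = model
    total = sum(disease_counts.values())
    best, best_score = None, -math.inf
    for disease, n in disease_counts.items():
        score = math.log(n / total)  # log prior
        for s in vocabulary:
            # Laplace-smoothed probability that a patient with this disease reports s.
            p = (symptom_counts[disease][s] + 1) / (n + 2)
            score += math.log(p if s in symptoms else 1 - p)
        if score > best_score:
            best, best_score = disease, score
    return best

# Invented training data.
records = [
    ({"fever", "cough"}, "flu"),
    ({"fever", "cough", "fatigue"}, "flu"),
    ({"sneezing", "itchy eyes"}, "allergy"),
    ({"sneezing", "runny nose"}, "allergy"),
]
model = train_naive_bayes(records)
print(predict({"fever", "cough"}, model))
print(predict({"sneezing"}, model))
```

Naive Bayes treats symptoms as independent given the disease, which is rarely true in practice, so it mainly serves as a fast baseline against which tree-based models such as Random Forest and Gradient Boosting can be compared.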

Source Code – Disease Prediction

9. Yelp Review Analysis

Yelp is a forum for users to submit reviews and rate businesses with a star rating. According to studies, an increase of one star resulted in a 5 to 9 percent rise in revenue for independent businesses. As a result, we believe the Yelp dataset has a lot of potential as a powerful insight source. Customer reviews on Yelp are a gold mine waiting to be discovered.

This project’s main goal is to conduct in-depth analyses of seven different cuisine types of restaurants: Korean, Japanese, Chinese, Vietnamese, Thai, French, and Italian, to determine what makes a good restaurant and what concerns customers, and then make recommendations for future improvement and profit growth. We will mostly evaluate customer reviews to determine why customers like or dislike the business. We can turn the unstructured data (reviews) into actionable insights using big data, allowing businesses to better understand how and why customers prefer their products or services and make business improvements as rapidly as feasible.

Source Code – Review Analysis

10. Recommendation System

Online services typically offer thousands, millions, or even billions of items, such as merchandise, video clips, movies, music, news articles, blog entries, and advertisements. The Google Play Store, for example, has millions of apps, and YouTube has billions of videos. The Netflix Recommendation Engine is made up of algorithms that select material based on each user's profile. The engine filters over 3,000 titles at a time, using 1,300 recommendation clusters based on user preferences. It is so accurate that its customized recommendations drive 80 percent of Netflix viewer activity. Big data provides recommendation systems with plenty of user data, such as past purchases, browsing history, and comments, so that they can deliver relevant and effective recommendations. In a nutshell, without massive data, even the most advanced recommenders will be ineffective. Big data is also the driving force behind our mini movie recommendation system. The goal of this project is to compare the performance of various recommendation models on the Hadoop framework.

Source Code – Recommendation System

11. Anomaly Detection in Cloud Servers

Anomaly detection is a useful tool for cloud platform managers who want to keep track of and analyze cloud behavior in order to improve cloud reliability. It assists cloud platform managers in detecting unexpected system activity so that preventative actions can be taken before a system crash or service failure occurs.

This project provides a reference implementation of a Cloud Dataflow streaming pipeline that integrates with BigQuery ML and Cloud AI Platform to perform anomaly detection. A key component of the implementation uses Dataflow for feature extraction and real-time outlier identification, and has been tested on over 20 TB of data.
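To make the idea of real-time outlier identification concrete, here is a minimal sketch in plain Python. It is not the Dataflow or BigQuery ML implementation used by the project; it swaps in a simple rolling z-score as a stand-in technique. The function name `detect_outliers`, the window size, the threshold, and the sample CPU readings are all assumptions made for illustration.

```python
from collections import deque
from statistics import mean, pstdev


def detect_outliers(stream, window=5, threshold=3.0):
    """Yield (index, value) for points far from the mean of the preceding window."""
    recent = deque(maxlen=window)  # holds only the last `window` values
    for index, value in enumerate(stream):
        if len(recent) == window:
            mu = mean(recent)
            sigma = pstdev(recent)
            # Flag the point when it lies more than `threshold` standard
            # deviations away from the recent average.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                yield index, value
        recent.append(value)


# Hypothetical CPU utilisation readings (percent) with one obvious spike.
cpu_readings = [50, 52, 51, 49, 50, 51, 95, 50, 52]
outliers = list(detect_outliers(cpu_readings))
```

In a streaming setting the same logic would run per key (for example, per server) over an unbounded input; the bounded `deque` is what keeps memory use constant.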

Source Code – Anomaly Detection

12. Smart Cities Using Big Data

A smart city is a technologically advanced metropolitan region that collects data using various electronic technologies, voice activation methods, and sensors. The information gleaned from the data is utilized to efficiently manage assets, resources, and services; in turn, the data is used to improve operations throughout the city. Data is collected from citizens, devices, buildings, and assets, which is then processed and analyzed to monitor and manage traffic and transportation systems, power plants, utilities, water supply networks, waste, crime detection, information systems, schools, libraries, hospitals, and other community services. Big data obtains this information and with the help of advanced algorithms, smart network infrastructures and various analytics platforms can implement the sophisticated features of a smart city.  This smart city reference pipeline shows how to integrate various media building blocks, with analytics powered by the OpenVINO Toolkit, for traffic or stadium sensing, analytics, and management tasks.

Source Code – Smart Cities

13. Tourist Behavior Analysis

This is one of the most innovative big data project concepts. This Big Data project aims to study visitor behavior to discover travelers’ preferences and most frequented destinations, as well as forecast future tourism demand. 

What is the role of big data in the project? Because visitors utilize the internet and other technologies while on vacation, they leave digital traces that Big Data can readily collect and distribute – the majority of the data comes from external sources such as social media sites. The sheer volume of data is simply too much for a standard database to handle, necessitating the use of big data analytics.  All the information from these sources can be used to help firms in the aviation, hotel, and tourist industries find new customers and advertise their services. It can also assist tourism organizations in visualizing and forecasting current and future trends.

Source Code – Tourist Behavior Analysis

14. Web Server Log Analysis

A web server log records page requests as well as the actions the server has taken. This log data can be stored, analyzed, and mined for further examination. In this way, the effectiveness of page advertising can be assessed and SEO (search engine optimization) can be performed. Web server log analysis can also be used to get a sense of the overall user experience. This type of processing is advantageous to any company that relies largely on its website for revenue generation or client communication. This interesting big data project demonstrates parsing (including incorrectly formatted strings) and analysis of web server log data.
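The parsing step described above, including the handling of incorrectly formatted lines, can be sketched as follows. This is an illustrative example and not the linked project's code: it assumes the Apache Common Log Format, and the names `LOG_PATTERN`, `parse_line`, `summarize`, and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Assumed layout (Common Log Format): host ident user [time] "request" status size
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_line(line):
    """Return a dict of fields, or None if the line is malformed."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no body was returned.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def summarize(lines):
    """Count requests per path and per status code, skipping malformed lines."""
    paths, statuses, malformed = Counter(), Counter(), 0
    for line in lines:
        record = parse_line(line)
        if record is None:
            malformed += 1
            continue
        paths[record["path"]] += 1
        statuses[record["status"]] += 1
    return paths, statuses, malformed


SAMPLE_LOG = [
    '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '127.0.0.1 - - [10/Oct/2000:13:56:01 -0700] "GET /missing.html HTTP/1.0" 404 -',
    'this line is not a valid log entry',
    '10.0.0.2 - - [10/Oct/2000:13:57:12 -0700] "GET /index.html HTTP/1.0" 200 2326',
]
paths, statuses, malformed = summarize(SAMPLE_LOG)
```

Counting malformed lines instead of raising an error keeps a long-running job alive while still exposing how much of the input was unusable.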

Source Code – Web Server Log Analysis

15. Image Caption Generator

Because of the rise of social media and the importance of digital marketing, businesses must now upload engaging content. Visuals that are appealing to the eye are essential, but captions that describe the images are also required. Hashtags and attention-getting captions can help you reach the right people even more. Large datasets of correlated photos and captions must be managed. Image processing and deep learning are used to understand the image, and artificial intelligence is used to generate captions that are both relevant and appealing. The Big Data source code can be written in Python. Image caption generation is not a beginner-level Big Data project and is indeed challenging. The project given below uses a neural network to generate captions for an image, combining a CNN (Convolutional Neural Network) and an RNN (Recurrent Neural Network) with beam search. (Beam search is a heuristic search algorithm that explores a graph by expanding only the most promising nodes in a limited set.)
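A small sketch may help show how beam search picks a caption token by token. It is not the linked project's code: a hypothetical lookup table (`TOY_MODEL`) stands in for the CNN and RNN, and `beam_search` and `toy_step` are names invented for this example.

```python
import math


def beam_search(step_fn, start, end, beam_width=2, max_len=10):
    """Keep the `beam_width` best partial sequences at each step.

    step_fn(sequence) must return {next_token: probability}.
    """
    beams = [([start], 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end:
                candidates.append((seq, score))  # finished: carry over unchanged
                continue
            for token, prob in step_fn(seq).items():
                candidates.append((seq + [token], score + math.log(prob)))
        # Prune: keep only the highest-scoring sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == end for seq, _ in beams):
            break
    return beams[0][0]


# Hypothetical next-word probabilities, keyed by the previous word only.
TOY_MODEL = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.5, "cat": 0.5},
    "the": {"dog": 0.9, "cat": 0.1},
    "dog": {"</s>": 1.0},
    "cat": {"</s>": 1.0},
}


def toy_step(sequence):
    return TOY_MODEL[sequence[-1]]


best_caption = beam_search(toy_step, "<s>", "</s>", beam_width=2)
greedy_caption = beam_search(toy_step, "<s>", "</s>", beam_width=1)
```

With a beam width of 1 the search is greedy and commits to "a" because it is the most likely first word. With a width of 2 it also keeps "the" alive, which leads to the sequence with the higher overall probability.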

There are currently rich and colorful datasets in the image description generating work, such as MSCOCO, Flickr8k, Flickr30k, PASCAL 1K, AI Challenger Dataset, and STAIR Captions, which are progressively becoming a trend of discussion. The given project utilizes state-of-the-art ML and big data algorithms to build an effective image caption generator.

Source Code – Image Caption Generator

Big Data is a fascinating topic. It helps in the discovery of patterns and outcomes that might otherwise go unnoticed. Big Data is being used by businesses to learn what their customers want, who their best customers are, and why people choose different products. The more information a business has about its customers, the more competitive it is.

It can be combined with Machine Learning to create market strategies based on customer predictions. Companies that use big data become more customer-centric.

This expertise is in high demand and learning it will help you progress your career swiftly. As a result, if you’re new to big data, the greatest thing you can do is brainstorm some big data project ideas. 

We’ve examined some of the best big data project ideas in this article. We began with some simple projects that you can complete quickly. After you’ve completed these beginner tasks, I recommend going back to understand a few additional principles before moving on to the intermediate projects. After you’ve gained confidence, you can go on to more advanced projects.

What are the 3 types of big data? Big data is classified into three main types:

  • Structured
  • Unstructured
  • Semi-structured

What can big data be used for? Some important use cases of big data are:

  • Improving Science and research
  • Improving governance
  • Smart cities
  • Understanding and targeting customers
  • Understanding and Optimizing Business Processes
  • Improving Healthcare and Public Health
  • Financial Trading
  • Optimizing Machine and Device Performance

What industries use big data? Big data finds its application in various domains. Some fields where big data can be used efficiently are:

  • Travel and tourism
  • Financial and banking sector
  • Telecommunication and media
  • Government and Military
  • Social Media

  • Open access
  • Published: 29 May 2021

Big data quality framework: a holistic approach to continuous quality management

  • Ikbal Taleb 1 ,
  • Mohamed Adel Serhani   ORCID: orcid.org/0000-0001-7001-3710 2 ,
  • Chafik Bouhaddioui 3 &
  • Rachida Dssouli 4  

Journal of Big Data, volume 8, Article number: 76 (2021)


Big Data is an essential research area for governments, institutions, and private agencies to support their analytics decisions. Big Data is concerned with how data is collected, processed, and analyzed to generate value-added, data-driven insights and decisions. Degradation in Data Quality may result in unpredictable consequences, because confidence in the data and its source is lost. In the Big Data context, data characteristics such as volume, multiple heterogeneous data sources, and fast data generation increase the risk of quality degradation and require efficient mechanisms to check data worthiness. However, ensuring Big Data Quality (BDQ) is a very costly and time-consuming process, since excessive computing resources are required. Maintaining quality through the Big Data lifecycle requires quality profiling and verification before any processing decision is made. A BDQ Management Framework for enhancing the pre-processing activities while strengthening data control is proposed. The proposed framework uses a new concept called the Big Data Quality Profile, which captures the quality outline, requirements, attributes, dimensions, scores, and rules. Using the Big Data profiling and sampling components of the framework, a faster and more efficient data quality estimation is initiated before and after an intermediate pre-processing phase. The exploratory profiling component of the framework plays an initial role in quality profiling; it uses a set of predefined quality metrics to evaluate important data quality dimensions. It generates quality rules by applying various pre-processing activities and their related functions. These rules mainly target the Data Quality Profile and result in quality scores for the selected quality attributes. The framework implementation and dataflow management across the various quality management processes are discussed. The paper concludes with ongoing work on framework evaluation and deployment to support quality evaluation decisions.


Big Data is universal [ 1 ]. It consists of large volumes of data of unconventional types, which may be structured, unstructured, or in continuous motion. Whether it is used by industry, governments, or research institutions, a new way of handling Big Data, from the technology perspective to research approaches in its management, is highly required to support data-driven decisions. Expectations of Big Data analytics range from trend finding to pattern discovery in different application domains such as healthcare, business, and scientific exploration. The aim is to extract significant insights and decisions. Extracting this precious information from large datasets is not an easy task. Dedicated planning and an appropriate selection of tools and techniques are needed to optimize the exploration of Big Data.

Owning a huge amount of data does not often lead to valuable insights and decisions since Big Data does not necessarily mean Big insights. In fact, it can complicate the processes involved in fulfilling such expectations. Also, a lot of resources may be required, in addition to adapting the existing analytics algorithms to cope with Big Data requirements. Generally, data is not ready to be processed as it is. It should go through many stages, including cleansing and pre-processing, before undergoing any refining, evaluation, and preparation treatment for the next stages along its lifecycle.

Data Quality (DQ) is a very important aspect of Big Data for assessing the aforementioned pre-processing data transformations. This is because Big Data is mostly obtained from the web, social networks, and the IoT, where it may be found in a structured or unstructured form with no schema and possibly with no quality properties. Exploring data profiling, and more specifically DQ profiling, is essential before data preparation and pre-processing for both structured and unstructured data. Also, a DQ assessment should be conducted for all data-related content, including attributes and features. Then, an analysis of the assessment results can provide the necessary elements to enhance, control, monitor, and enforce DQ along the Big Data lifecycle; for example, maintaining high Data Quality (conforming to its requirements) in the processing phase.

Data Quality has been an active and attractive research area for several years [ 2 , 3 ]. In the context of Big Data, quality assessment processes are hard to implement, since they are time-consuming and costly, especially for the pre-processing activities. These issues have intensified because the available quality assessment techniques were initially developed for well-structured data and are not fully appropriate for Big Data. Consequently, new Data Quality processes must be carefully developed to assess the data origin, domain, format, and type. An appropriate DQ management scheme is critical when dealing with Big Data. Furthermore, Big Data architectures do not incorporate quality assessment practices throughout the Big Data lifecycle apart from pre-processing. Some new initiatives are still limited to specific applications [ 4 , 5 , 6 ]. However, the evaluation and estimation of Big Data Quality should be handled in all phases of the Big Data lifecycle, from data inception to analytics, in order to support data-driven decisions.

The work presented in this paper is related to Big Data Quality management through the Big Data lifecycle. The objective of such a management perspective is to provide users or data scientists with a framework capable of managing DQ from data inception to analytics and visualization, and thereby to support decisions. The definition of acceptable Big Data quality depends largely on the type of application and on the Big Data requirements. There is a pressing need to evaluate the quality of Big Data before engaging in any Big Data related project, because doing so avoids the high cost of processing useless data at an early stage of its lifecycle. More challenges to the data quality evaluation process may occur when dealing with unstructured, schema-less data collected from multiple sources. Moreover, a Big Data Quality Management Framework can provide quality management mechanisms to handle and ensure data quality throughout the Big Data lifecycle by:

  • Improving the processes of the Big Data lifecycle to be quality-driven, in a way that it integrates quality assessment (built-in) at every stage of the Big Data architecture.
  • Providing quality assessment and enhancement mechanisms to support cross-process data quality enforcement.
  • Introducing the concept of Big Data Quality Profile (DQP) to manage and trace the whole data pre-processing procedures from data source selection to final pre-processed data and beyond (processing and analytics).
  • Supporting profiling of data quality and quality rules discovery based on quantitative quality assessments.
  • Supporting deep quality assessment using qualitative quality evaluations on data samples obtained using data reduction techniques.
  • Supporting data-driven decision making based on the latest data assessments and analytics results.

The remainder of this paper is organized as follows. In Sect. " Overview and background ", we provide background on Big Data and data quality, and introduce the problem statement and the research objectives. The research literature related to Big Data quality assessment approaches is presented in Sect. " Related research studies ". The components of the proposed framework and an explanation of their main functionalities are described in Sect. " Big data quality management framework ". Finally, implementation discussion and dataflow management are detailed in Sect. " Implementations: Dataflow and quality processes development ", whereas Sect. " Conclusion " concludes the paper and points to our ongoing research developments.

Overview and background

An exponential increase in global inter-network activities and data storage has triggered the Big Data Era. Moreover, application domains, including Facebook, Amazon, Twitter, YouTube, Internet of Things Sensors, and mobile smartphones, are the main players and data generators. The amount of data generated daily is around 2.5 quintillion bytes (2.5 exabytes, 1 EB = 10^18 bytes).

According to IBM, Big Data is a high-volume, high-velocity, and high-variety information asset that demands cost-effective, innovative forms of information processing for enhanced insights and decision-making. It is used to describe a massive volume of both structured and unstructured data; therefore, Big Data processing using traditional database and software tools is a difficult task. Big Data also refers to the technologies and storage facilities required by an organization to handle and manage large amounts of data.

Originally, in [ 7 ], the McKinsey Global Institute identifies three Big Data characteristics commonly known as ''3Vs'' for Volume, Variety, and Velocity [ 1 , 7 , 8 , 9 , 10 , 11 ]. These characteristics have been extended to more dimensions, moving to 10 Vs (Volume, Velocity, Variety, Veracity, Value, Vitality, Viscosity, Visualization, Vulnerability) [ 12 , 13 , 14 ].

In [ 10 , 15 , 16 ], the authors define important Big Data systems architectures. The data in Big Data comes from (1) heterogeneous data sources (e-Gov: Census data, Social networking: Facebook, and Web: Google page rank data), (2) data in different formats (video, text), and (3) data of various forms (unstructured: raw text data with no schema, and semi-structured: metadata, graph structure as text). Moreover, data travels through different stages, composing the Big Data lifecycle. Many aspects of Big Data architectures were compiled from the literature. Our enhanced design contributions are illustrated in Fig.  1 and described as follows:

Data generation: this is the phase of data creation. Many data sources can generate this data such as electrophysiology signals, sensors used to gather climate information, surveillance devices, posts to social media sites, videos and still images, transaction records, stock market indices, GPS location, etc.

Data acquisition: it consists of data collection, data transmission, and data pre-processing [ 1 , 10 ]. Due to the exponential growth and availability of heterogeneous data production sources, an unprecedented amount of structured, semi-structured, and unstructured data is available. Therefore, Big Data pre-processing consists of typical data pre-processing activities: integration, enhancement and enrichment, transformation, reduction, discretization, and cleansing.

Data storage: it consists of the data center infrastructure, where the data is stored and distributed among several clusters and data centers, spread geographically around the world. The storage software is supported by the Hadoop ecosystem to ensure a certain degree of fault tolerance, storage reliability, and efficiency through replication. The data storage stage is responsible for all input and output data that circulates within the lifecycle.

Data analysis: (Processing, Analytics, and Visualization); it involves the application of data mining and machine learning algorithms to process the data and extract useful insights for better decision making. Data scientists are the most valuable users of this phase since they have the expertise to apply what is needed, on what must be analyzed.

Fig. 1 Big data lifecycle value chain

Data quality, quality dimensions, and metrics

The majority of studies in the area of DQ originate from the database context [ 2 , 3 ] and management research communities. According to [ 17 ], DQ is not an easy concept to define; its definition depends on the data domain. There is a consensus that data quality always depends on the quality of the data source [ 18 ]. However, this also highlights that enormous quality issues are hidden inside data and their values.

In the following, the definitions of data quality, data quality dimensions, and quality metrics and their measurements are given:

Data quality: It has many meanings that are related to data context, domain, area, and the fields from which it is used [ 19 , 20 ]. Academia interprets DQ differently than industry. In [ 21 ], data quality is reduced to “The capability of data to satisfy stated and implied needs when used under specified conditions”. Also, DQ is defined as “fitness for use”. Yet, [ 20 ] define data quality as the property corresponding to quality management, which is appropriate for use or meeting user needs.

Data quality dimensions: DQD’s are used to measure, quantify, and manage DQ [ 20 , 22 , 23 ]. Each quality dimension has a specific metric, which measures its performance. There are several DQDs, which can be organized into 4 categories according to [ 24 , 25 ], intrinsic, contextual, accessibility, and representational [ 14 , 15 , 22 , 24 , 26 , 27 ]. Two important categories (intrinsic and contextual) are illustrated in Fig.  2 . Examples of intrinsic quality dimensions are illustrated in Table 1 .

Metrics and measurements: Once the data is generated, its quality should be measured. This means that a data-driven strategy is considered to act on the data. Hence, it is mandatory to measure and quantify the DQD. Structured or semi-structured data is available as a set of attributes represented in columns or rows, and their values are respectively recorded. In [ 28 ], a quality metric is defined as a quantitative or categorical representation of one or more attributes. Any data quality metric should define whether the values of an attribute respect a targeted quality dimension. The author of [ 29 ] noted that data quality metrics tend to produce either binary results (correct or incorrect) or a value between 0 and 100 (with 100% representing the highest quality). This applies to some quality dimensions such as accuracy, completeness, consistency, and currency. Examples of DQD metrics are illustrated in Table 2 .
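As an illustration of metrics that return a value between 0 and 100, the sketch below scores a single attribute on two dimensions. The formulas are common ratio-style definitions and are an assumption here, not a reproduction of the metrics in Table 2. The range check in `validity` is only a simple proxy for accuracy, and the function names and the `ages` sample are hypothetical.

```python
def completeness(values):
    """Percentage of values that are present (not None and not an empty string)."""
    if not values:
        return 0.0
    present = sum(1 for v in values if v is not None and v != "")
    return 100.0 * present / len(values)


def validity(values, is_valid):
    """Percentage of present values that satisfy the `is_valid` predicate."""
    present = [v for v in values if v is not None and v != ""]
    if not present:
        return 0.0
    return 100.0 * sum(1 for v in present if is_valid(v)) / len(present)


# Hypothetical "age" attribute with two missing values and one implausible value.
ages = [34, None, 27, 212, 45, "", 19, 51]
age_scores = {
    "completeness": completeness(ages),                     # 6 of 8 present
    "validity": validity(ages, lambda a: 0 <= a <= 120),    # 5 of 6 in range
}
```

Each attribute gets one score per dimension, so a dataset with many attributes yields a small table of scores that can be compared against the quality requirements.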

Fig. 2 Data quality dimensions

DQD’s must be relevant to data quality problems that have been identified. Thus, a metric tends to measure if attributes comply with defined DQD’s. These measurements are performed for each attribute, given their type and data ranges of values collected from the data profiling process. The measurements produce DQD’s scores for the designed metrics of all attributes [ 30 ]. Specific metrics need to be defined, to estimate specific quality dimensions of other data types such as images, videos, and audio [ 5 ].

Big data characteristics and data quality

The main Big Data characteristics, commonly referred to as the V’s, were initially Volume, Velocity, Variety, and Veracity. Since the inception of Big Data, 10 V’s have been defined, and new V’s will probably be adopted [ 12 ]. For example, veracity expresses the trustworthiness of data, mostly known as data quality. Accuracy is often related to precision, reliability, and veracity [ 31 ]. Our tentative mapping among these characteristics, data, and data quality is shown in Table 3 . It is based on the studies in [ 5 , 32 , 33 ], whose authors attempted to link the V’s to the data quality dimensions. In another study, the authors of [ 34 ] addressed the mapping of the DQD Accuracy to the Big Data characteristic Volume and showed that the data size has an impact on DQ.

Big data lifecycle: where quality matters?

According to [ 21 , 35 ], data quality issues may appear in each phase of the Big Data value chain. Because each phase has its own features, data quality can be addressed with different strategies: improving the quality of existing data, and/or refining, reassessing, and redesigning the processes that generate and collect data, with the aim of improving their quality.

Big Data quality issues were addressed by many studies in the literature [ 36 , 37 , 38 ]. These studies generally elaborated on the issues and proposed generic frameworks with no comprehensive approaches and techniques to manage quality across the Big Data lifecycle. Among these, generic frameworks are presented in [ 5 , 39 , 40 ].

Fig.  3 illustrates where data quality can and must be addressed in the Big Data value chain phases/stages from (1) to (7).

(1) In the data generation phase, there is a need to define how and what data is generated.

(2) In the data transmission phase, the data distribution scheme relies on the underlying networks. Unreliable networks may affect data transfer. Transmission quality is expressed in terms of data loss and transmission errors.

(3) Data collection refers to where, when, and how the data is collected and handled. Well-defined structured constraint verification on data must be established.

(4) The pre-processing phase is one of the main focus points of the proposed work. It follows a data-driven strategy, which is largely focused on data. An evaluation process provides the necessary means to ensure the quality of data for the next phases. An evaluation of the DQ before (Pre) and after (Post) pre-processing on data samples is necessary to strengthen the DQP.

(5) In the Big Data storage phase, some aspects of data quality, such as storage failure, are handled by replicating data on multiple storage systems. Replication also helps in data transmission when a network fails to transmit data.

(6) In the Data Processing and Analytics phases, the quality is influenced by both the applied process and data quality itself. Among the various data mining and machine learning algorithms and techniques suitable for Big Data, those that converge rapidly and consume fewer cloud resources will be highly adopted. The relation between DQ and the processing methods is substantial. A certain DQ requirement on these methods or algorithms might be imposed to ensure efficient performance.

(7) Finally, for an ongoing iterative value chain, the visualization phase is essentially a representation of the data in a presentable form such as a dashboard. This gives decision-makers a clear picture of the data and its valuable insights. In this work, Big Data is transformed into useful Small Data, which is easy to visualize and interpret.
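The pre and post evaluation on data samples mentioned for the pre-processing phase can be sketched as follows. This is a minimal illustration under stated assumptions and not the framework's implementation: completeness is the only dimension scored, the pre-processing activity simply drops records with a missing value, and every name (`draw_sample`, `drop_missing`, `pre_post_quality`, the `readings` data) is hypothetical.

```python
import random


def completeness(records, field):
    """Percentage of records whose `field` is present (not None)."""
    if not records:
        return 0.0
    present = sum(1 for r in records if r.get(field) is not None)
    return 100.0 * present / len(records)


def draw_sample(records, fraction, seed=0):
    """Draw a reproducible random sample so that scoring stays cheap."""
    rng = random.Random(seed)
    size = max(1, int(len(records) * fraction))
    return rng.sample(records, size)


def drop_missing(records, field):
    """A stand-in pre-processing activity: remove records with a missing value."""
    return [r for r in records if r.get(field) is not None]


def pre_post_quality(records, field, preprocess, fraction=0.1):
    """Score a sample before (pre) and after (post) a pre-processing activity."""
    sample = draw_sample(records, fraction)
    pre = completeness(sample, field)
    post = completeness(preprocess(sample, field), field)
    return pre, post


# Synthetic sensor readings in which every fifth temperature is missing.
readings = [
    {"sensor": i, "temp": None if i % 5 == 0 else 20.0 + i % 7}
    for i in range(1000)
]
pre_score, post_score = pre_post_quality(readings, "temp", drop_missing)
```

Scoring a 10 percent sample rather than the full dataset is what keeps the evaluation affordable; the gap between the two scores indicates how much the activity improved the measured dimension.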

Fig. 3 Where quality matters in big data lifecycle?

Data quality issues

Data quality issues generally appear when the quality requirements are not met on the data values [ 41 ]. These issues are due to several factors or processes having occurred at different levels:

Data source level: unreliability, trust, data copying, inconsistency, multi-sources, and data domain.

Generation level: human data entry, sensors’ readings, social media, unstructured data, and missing values.

Process level (acquisition: collection, transmission).

In [ 21 , 35 , 42 ], many causes of poor data quality were enumerated, and a list of elements, which affect the quality and DQD’s was produced. This list is illustrated in Table 4 .

Related research studies

Research directions on Big Data differ between industry and academia. Industry scientists mainly focus on the technical implementations, infrastructures, and solutions for Big Data management, whereas researchers from academia tackle theoretical issues of Big Data. Academia’s efforts mainly include the development of new algorithms for data analytics, data replication, data distribution, and optimization of data handling. In this section, the literature review is classified into 3 categories, which are described in the following sub-sections.

Data quality assessment approaches

Existing studies on data quality have approached the topic from different perspectives. In the majority of the papers, the authors agree that data quality is related to the phases or processes of its lifecycle [ 8 ]. Specifically, data quality is highly related to the data generation phases and/or to its origin. The methodologies adopted to assess data quality are based on traditional data strategies and should be adapted to Big Data. Moreover, the application domain and type of information (content-based, context-based, or rating-based) affect the way the quality evaluation metrics are designed and applied. In content-based quality metrics, the information itself is used as a quality indicator, whereas in context-based metrics, metadata is used as the quality indicator.

There are two main strategies to improve data quality according to [ 20 , 23 ]: data-driven and process-driven. The first strategy handles data quality in the pre-processing phase by applying pre-processing activities (PPAs) such as cleansing, filtering, and normalization. These PPAs are important and occur before the data processing stage, preferably as early as possible. In contrast, the process-driven quality strategy is applied to each stage of the Big Data value chain.
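A small sketch of the data-driven strategy, chaining the three pre-processing activities named above, is given here for illustration. The concrete behaviour of each step (trimming strings, dropping incomplete records, min-max scaling) and the names `cleanse`, `filter_records`, and `normalize` are assumptions, not definitions taken from the cited works.

```python
def cleanse(records):
    """Cleansing: trim string values and turn empty strings into None."""
    cleaned = []
    for record in records:
        row = {}
        for key, value in record.items():
            if isinstance(value, str):
                value = value.strip() or None
            row[key] = value
        cleaned.append(row)
    return cleaned


def filter_records(records, required):
    """Filtering: keep only records where every required field is present."""
    return [r for r in records if all(r.get(f) is not None for f in required)]


def normalize(records, field):
    """Normalization: rescale a numeric field to the range [0, 1] (min-max)."""
    values = [r[field] for r in records]
    low, high = min(values), max(values)
    span = high - low
    return [
        {**r, field: (r[field] - low) / span if span else 0.0}
        for r in records
    ]


raw = [
    {"city": " Paris ", "temp": 10.0},
    {"city": "", "temp": 20.0},      # empty city: removed after cleansing
    {"city": "Oslo", "temp": None},  # missing temperature: removed
    {"city": "Rome", "temp": 30.0},
]
prepared = normalize(filter_records(cleanse(raw), ["city", "temp"]), "temp")
```

The order matters: cleansing runs first so that values such as empty strings are recognised as missing before filtering, and filtering runs before normalization so that the minimum and maximum are computed over valid values only.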

Data quality assessment was discussed early in the literature [ 10 ]. It is divided into two main categories: subjective and objective. Moreover, an approach that combines these two categories to provide organizations with usable data quality metrics to evaluate their data was proposed. However, the proposed approach was not developed to deal with Big Data.

In summary, Big Data quality should be addressed early in the pre-processing stage during the data lifecycle. The aforementioned Big Data quality challenges have not been investigated in the literature from all perspectives. There are still many open issues, which must be addressed especially at the pre-processing stage.

Rule-based quality methodologies

Since the data quality concept is context-driven, it may differ from one application domain to another. The definition of quality rules involves establishing a set of constraints on data generation, entry, and creation. Poor data can always exist, and rules are created or discovered to correct or eliminate this data. Rules themselves are only one part of the data quality assessment approach. A consistent process for creating, discovering, and applying quality rules should include the following steps:

Characterize the quality of data being good or bad from its profile and quality requirements.

Select the data quality dimensions that apply to the data quality assessment context.

Generate quality rules based on data quality requirements, quantitative, and qualitative assessments.

Check, filter, optimize, validate, run, and test rules on data samples for efficient rules’ management.

Generate a statistical quality profile with quality rules. These rules represent an overview of successful valid rules with the expected quality levels.
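One possible reading of these steps is sketched below: candidate rules are run on a data sample, each rule receives a score, and the result is collected in a small profile together with the expected quality level. This is an illustration only. Representing a rule as a named predicate, the `build_rule_profile` function, the sample records, and the 0.8 expected level are all assumptions.

```python
def build_rule_profile(rules, sample, expected_level):
    """Run each rule on the sample and record its score against the expected level.

    rules: {rule_name: predicate(record) -> bool}
    Returns {rule_name: {"score": fraction of records passing, "met": bool}}.
    """
    profile = {}
    for name, predicate in rules.items():
        passed = sum(1 for record in sample if predicate(record))
        score = passed / len(sample) if sample else 0.0
        profile[name] = {"score": score, "met": score >= expected_level}
    return profile


# Hypothetical data sample and quality rules derived from its requirements.
sample = [
    {"age": 34, "email": "a@example.org"},
    {"age": -1, "email": "b@example.org"},
    {"age": 27, "email": None},
    {"age": 45, "email": "c@example.org"},
    {"age": 60, "email": None},
]
rules = {
    "age_in_range": lambda r: r["age"] is not None and 0 <= r["age"] <= 120,
    "email_present": lambda r: r["email"] is not None,
}
profile = build_rule_profile(rules, sample, expected_level=0.8)
rules_needing_action = [name for name, result in profile.items() if not result["met"]]
```

Rules that fall below the expected level point to the attributes where a pre-processing activity is most needed, which is how a rule profile can feed back into pre-processing.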

Hereafter, the data quality rules are discovered from data quality evaluation. These rules will be used in Big Data pre-processing activities to improve the quality of data. The discovery process reveals many challenges, which should consider different factors, including data attributes, data quality dimensions, data quality rules discovery, and their relationship with pre-processing activities.

In (Lee et al., 2003), the authors concluded that the data quality problems depend on data, time, and context. Quality rules are applied to the data to solve and/or avoid quality problems. Accordingly, quality rules must be continuously assessed, updated, and optimized.

Most studies on the discovery of data quality rules come from the database community. These studies are often based on conditional functional dependencies (CFDs) to detect inconsistencies in data. CFDs are used to formulate data quality rules, which are generally expressed manually and discovered automatically using several CFD approaches [ 3 , 43 ].
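To make the idea of a CFD-based rule concrete, the sketch below checks one conditional functional dependency on a small list of records. The rule itself (for rows where `country` is "UK", `zip` determines `city`), the attribute names, and the helper `cfd_violations` are illustrative assumptions and are not taken from the cited approaches.

```python
from collections import defaultdict

def cfd_violations(rows, condition, lhs, rhs):
    """Return the lhs values that violate the CFD (condition: lhs -> rhs).

    condition: dict of attribute -> constant selecting the rows the rule covers.
    lhs, rhs: tuples of attribute names; among the selected rows, equal lhs
    values must imply equal rhs values.
    """
    groups = defaultdict(set)
    for row in rows:
        if all(row.get(attr) == value for attr, value in condition.items()):
            key = tuple(row[a] for a in lhs)
            groups[key].add(tuple(row[a] for a in rhs))
    # A violation is any lhs value that maps to more than one rhs value.
    return {key: sorted(vals) for key, vals in groups.items() if len(vals) > 1}

# Hypothetical records used only for illustration.
rows = [
    {"country": "UK", "zip": "EH4", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH4", "city": "London"},
    {"country": "US", "zip": "EH4", "city": "Springfield"},
    {"country": "UK", "zip": "W1B", "city": "London"},
]
violations = cfd_violations(rows, {"country": "UK"}, ("zip",), ("city",))
print(violations)
```

The US record shares the same `zip` value but is ignored because it does not satisfy the condition, which is what distinguishes a conditional dependency from a plain functional dependency.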

Data quality assessment in Big Data has been addressed in several studies. In [ 32 ], a Data Quality-in-Use model was proposed to assess the quality of Big Data. Business rules for data quality are used to decide which data must meet the pre-defined constraints or requirements. In [ 44 ], a new quality assessment approach was introduced that involved both the data provider and the data consumer. The assessment was mainly based on data consistency rules provided as metadata.

The majority of research studies on data quality and the discovery of data quality rules are based on CFDs and databases. In Big Data quality, the size, variety, and veracity of data are key characteristics that must be considered. These characteristics should be processed to reduce the quality assessment time and resources since they are handled before the pre-processing phase. Regarding quality rules, it is fundamental to consider these rules to eliminate poor data and enforce quality on existing data, while following a data-driven quality context.

Big data pre-processing frameworks

The pre-processing of data before performing any analytics is essential. However, several challenges have emerged at this crucial phase of the Big Data value chain [ 10 ]. Data quality is one of these challenges, which must be highly considered in the Big Data context.

As pointed out in [ 45 ], data quality problems arise when dealing with multiple data sources. This increases the requirements for data cleansing significantly. Additionally, the large size of datasets, which arrive at an uncontrolled speed, generates an overhead on the cleansing processes. In [ 46 , 47 , 48 ], NADEEF, an extensible data cleaning system, was proposed. The extension for Big Data cleaning based on NADEEF was presented in [ 49 ] for streaming data. The system deals with data quality from the data cleaning activity using data quality rules and functional dependencies rules [ 14 ].

Numerous other studies on Big Data management frameworks exist. In these studies, the authors surveyed and proposed Big Data management models dealing with storage, pre-processing, and processing [ 50 , 51 , 52 ]. An up-to-date review of the techniques and methods for each process involved in the management processes is also included.

The importance of quality evaluation in Big Data Management has generally not been addressed. In some studies, Big Data characteristics are the only recommendations for quality. However, no mechanisms have been proposed to map or handle quality issues that might be a consequence of these Big Data V's. A Big Data Management Framework, which includes data quality management, must be developed to cope with end-to-end quality management across the Big Data lifecycle.

Finally, it is worth mentioning that research initiatives and solutions on Big Data quality are still in their preliminary phase; there is much to do on the development and standardization of Big Data quality. Big Data quality is a multidisciplinary, complex, and multi-variant domain, where new evaluation techniques, processing and analytics algorithms, storage and processing technologies, and platforms will play a key role in the development and maturity of this active research area. We anticipate that researchers from academia will contribute to the development of new Big Data quality approaches, algorithms, and optimization techniques, which will advance beyond the traditional approaches used in databases and data warehouses. Additionally, industries will lead development initiatives of new platforms, solutions, and technologies optimized to support end-to-end quality management within the Big Data lifecycle.

Big data quality management framework

The purpose of proposing a Big Data Quality Management Framework (BDQMF) is to address quality at all stages of the Big Data lifecycle. This can be achieved by managing data quality before and after the pre-processing stage while providing feedback at each stage and looping back to the previous phase whenever possible. We also believe that data quality must be handled at data inception. However, this is not considered in this work.

To overcome the limitations of the existing Big Data architectures for managing data quality, a Big Data Quality pre-processing approach is proposed: a Quality Framework [ 53 ]. In our framework, the quality evaluation process tends to extract the actual quality status of Big Data and proposes efficient actions to avoid, eliminate, or enhance poor data, thus improving its quality. The framework features the creation and management of a DQP and its repository. The proposed scheme deals with data quality evaluation before and after the pre-processing phase. These practices are essential to ensure a certain quality level for the next phases while maintaining the optimal cost of the evaluation.

In this work, a quantitative approach is used. This approach consists of an end-to-end data quality management system that deals with DQ through the execution of pre-pre-processing tasks to evaluate BDQ on data. It starts with data sampling, data and DQ profiling, and gathering user DQ requirements. It then proceeds to DQD evaluation and discovery of Quality rules from quality scores and requirements. Each data quality rule is represented by one-to-many Pre-Processing Functions (PPF’s) under a specific Pre-Processing Activity (PPA). A PPA, such as cleansing, aims at increasing data quality. Pre-processing is applied to Big Data samples and re-evaluated once again to update and certify that the quality profile is complete. It is applied to the whole Big Dataset, not only to data samples. Before pre-processing, the DQP is tuned and revisited by quality experts for endorsement based on an equivalent data quality report. This report states the quality scores of the data, not the rules.

Framework description

The BDQM framework is illustrated in Fig.  4 , where all the components cooperate, relying on the Data Quality Profile. It is initially created as a Data Profile and is progressively extended from the data collection phase to the analytics phase to capture important quality-related information. For example, it contains quality requirements, targeted data quality dimensions, quality scores, and quality rules.

figure 4

Big data sources

Data lifecycle stages are part of the BDQMF. Feedback generated at all stages is analyzed and used to correct and improve data quality and to detect any DQ management-related failures. The key components of the proposed BDQMF include:

Big Data Quality Project (Data Sources, Data Model, User/App Quality Requirements, Data domain),

Data Quality Profile and its Repository,

Data Preparation (Sampling and Profiling),

Exploratory Quality Profiling,

Quality Parameters and Mapping,

Quantitative Quality Evaluation,

Quality Control,

Quality Rules Discovery,

Quality Rules Validation,

Quality Rules Optimization,

Big Data Pre-Processing,

Data Processing,

Data Visualization, and

Quality Monitoring.

A detailed description of each of these components is provided hereafter.

Framework key components

In the following sub-sections, each component is described. Its input(s) and output(s), its main functions, and its roles and interactions with the other framework’s components, are also described. Consequently, at each Big Data stage, the Data Quality Profile is created, updated, and adapted until it achieves the quality requirements already set by the users or applications at the beginning of the Big Data Quality Project.

Big data quality project module

The Big Data Quality Project Module contains all the elements that define the data sources and the quality requirements set by either the Big Data users or Big Data applications to represent the quality foundations of the Big Data project. Any Big Data Quality Project should specify a set of quality requirements as targeted quality goals (Fig. 5).

It represents the first module of the framework. The Big Data quality project represents the starting point of the BDQMF, where specifications of the data model, data sources, and targeted quality goals for DQD and data attributes are defined. These requirements are represented as data quality scores/ratios, which express the acceptance level of the evaluated data quality dimensions. For example, 80% of data accuracy, 60% data completeness, and 85% data consistency are judged by quality experts as accepted levels (or tolerance ratios). These levels can be relaxed using a range of values, depending on the context, the application domain, and the targeted processing algorithm’s requirements.

Let us denote by BDQP(DS , DS’ , Req) a Big Data Quality Project Request that initiates many automatic processes:

A data sampling and profiling process.

An exploratory quality profiling process, which is included in many quality assessment procedures.

A pre-processing phase, which is eventually considered if the resulting quality scores do not meet the requirements.

The BDQP contains the input dataset DS , output dataset DS’ , and Req . The Quality requirements are presented as a tuple of sets Req  = ( D , L , A ), where:

D represents a set of data quality dimensions DQD’s (e.g., accuracy, consistency): \({D}=\left\{{{\varvec{d}}}_{0},\dots ,{{\varvec{d}}}_{{\varvec{i}}},\dots ,{{\varvec{d}}}_{{\varvec{m}}}\right\},\)

L is a set of DQD acceptance (tolerance) level ratios (%) set by the user or the application related to the quality project and associated with each DQD, respectively: \({L}=\left\{{{\varvec{l}}}_{0},\dots ,{{\varvec{l}}}_{{\varvec{i}}},\dots ,{{\varvec{l}}}_{{\varvec{m}}}\right\},\)

A is the set of targeted data attributes. If it is not specified, the DQD’s are assessed for the dataset, which includes all possible attributes, since some dimensions need more detailed requirements to be assessed. Therefore, it depends on the DQD and the attribute type: \({A}=\left\{{{\varvec{a}}}_{0},\dots ,{{\varvec{a}}}_{{\varvec{i}}},\dots ,{{\varvec{a}}}_{{\varvec{m}}}\right\}\)

The Data quality requirements might be updated with some more aspects, whereas the profiling component provides well-detailed information about the data ( DQP Level 0 ). This update is performed within the quality mapping component and interfaces with user experts to refine, reconfirm, and restructure their data quality parameters over the data attributes.

Data sources: There are multiple Big Data sources. Most of them are generated from internet-based new media (e.g., social media). Other data sources are based on new technologies such as the cloud, sensors, and IoT. A list of Big Data sources is illustrated in the corresponding figure.

Data users, data applications, and quality requirements: This module identifies and specifies the input sources of the quality requirements parameters for the data sources. These sources include users' quality requirements (e.g., domain experts, researchers, analysts, and data scientists) or application quality requirements (applications may vary from simple data processing to machine learning or AI-based applications). For the users, a dashboard-like interface is used to capture users' data requirements and other quality information. This interface can be enriched with information from the data sources, such as attributes and their types, if available. This can efficiently guide users to the inputs and ensure the right data is used. This phase can be initiated after sample profiling or exploratory quality profiling. Otherwise, a general quality request is entered in the form of targeted Data Quality dimensions and their expected quality scores after the pre-processing phase. All the quality requirements parameters and settings are recorded in the Data Quality Profile ( DQP 0 ). DQP Level 0 is created when the quality project is set.

The quality requirements are specifically set as quality score ratios, goals, or targets to be achieved by the BDQMF. They are expressed as targeted DQDs in the Big Data Quality Project.

Let us denote by Req , a set of quality requirements presented as Req = \(\left\{{{\varvec{r}}}_{0},\dots ,{{\varvec{r}}}_{{\varvec{i}}},\dots ,{{\varvec{r}}}_{{\varvec{m}}}\right\}\) and constructed with a tuple ( D , L, A ). The Req quality requirements list is identified by elements, where each of these elements is a quality requirement characterized by \({{\varvec{r}}}_{{\varvec{i}}}=\left({{\varvec{d}}}_{{\varvec{i}}},{{\varvec{l}}}_{{\varvec{i}}},{{\varvec{a}}}_{{\varvec{i}}}\right)\) ; \({{\varvec{r}}}_{{\varvec{i}}}\) represents a \({{\varvec{d}}}_{{\varvec{i}}}\) in the DQD with a minimum accepted ratio level \({{\varvec{l}}}_{{\varvec{i}}}\) for all or a sub-list of selected attributes \({{\varvec{a}}}_{{\varvec{i}}}.\)

The initial DQP originating from this module is a DQP Level 0, containing the following tuple, as illustrated in Fig.  6 : BDQP (DS, DS’, Req) with Req  =  ( D , L, A )
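As a rough illustration of how the tuple BDQP(DS, DS', Req) with Req = (D, L, A) could be held in memory, the sketch below models each requirement r_i = (d_i, l_i, a_i) as a small record. The class names, field names, the `unmet` helper, and the measured scores are hypothetical and only meant to mirror the notation; the tolerance levels reuse the 80%, 60%, and 85% example given earlier.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class QualityRequirement:
    dimension: str                          # d_i, e.g. "accuracy"
    level: float                            # l_i, minimum accepted ratio in percent
    attributes: Optional[List[str]] = None  # a_i; None is read as "all attributes"

@dataclass
class BDQP:
    ds: str                                 # input dataset DS (an identifier here)
    ds_out: str                             # output dataset DS'
    req: List[QualityRequirement] = field(default_factory=list)

    def unmet(self, scores: Dict[str, float]) -> List[QualityRequirement]:
        """Return the requirements whose measured score is below the accepted level."""
        return [r for r in self.req if scores.get(r.dimension, 0.0) < r.level]

project = BDQP(
    ds="raw_data",
    ds_out="clean_data",
    req=[
        QualityRequirement("accuracy", 80.0),
        QualityRequirement("completeness", 60.0, ["age", "zip"]),
        QualityRequirement("consistency", 85.0),
    ],
)

# Hypothetical measured scores, as a later evaluation step might produce them.
measured = {"accuracy": 91.0, "completeness": 55.0, "consistency": 85.0}
failed = [r.dimension for r in project.unmet(measured)]
print(failed)
```

In this reading, a non-empty `failed` list is the condition under which a pre-processing phase would be considered.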

Data models and data domains

Data models: If the Data is structured, then a schema is provided to add more detailed quality settings for all attributes. In other cases, if there are no such attributes or types, the data is considered as unstructured data, and its quality evaluation will consist of a set of general Quality Indicators (QI). In our Framework, these QI are provided especially for the cases, where a direct identification of DQD’s is not available for an easy quality assessment.

Data domains: Each data domain has a unique set of default quality requirements. Some are very sensitive to accuracy and completeness; others prioritize data currency and higher timeliness. This module adds value to users or applications when it comes to quality requirements elicitation.

Fig. 6 BDQP and quality requirements settings

Fig. 7 Exploratory quality profiling modules

Data quality profile creation: Once the Big Data Quality Project (BDQP) is initiated, the DQP level 0 (DQP0) is created and consists of the following elements, as illustrated in Fig. 7 :

Data sources information, which may include datasets, location, URL, origin, type, and size.

Information about data that can be created or extracted from metadata if available, such as database schema, data attributes names and types, data profile, or basic data profile.

Data domains such as business, health, commerce, or transportation.

Data users, which may include the names and positions of each member of the project, security credentials, and data access levels.

Data application platforms, software, programming languages, or applications that are used to process the data. These may include R, Python, Java, Julia, Orange, Rapid Miner, SPSS, Spark, and Hadoop.

Data quality requirements: for each dataset, the expected quality ratios and the tolerance levels within which the data is accepted; otherwise, the data is discarded or repaired. They can also be set as a range of quality tolerance levels. For example, the DQD completeness is defined as equal to or higher than 67%, which means the accepted ratio of missing values is equal to or less than 33% (100% – 67%).
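The completeness example can be checked with a few lines of code. The sketch below is only an illustration: the function names are hypothetical, and treating both `None` and the empty string as missing values is an assumption.

```python
def completeness(values):
    """Percentage of non-missing values; None and "" are treated as missing."""
    if not values:
        return 0.0
    present = sum(1 for v in values if v is not None and v != "")
    return 100.0 * present / len(values)

def accept(score, minimum=67.0):
    """Accept the data when the score reaches the tolerance level."""
    return score >= minimum

column = ["a", None, "c", "d", "", "f"]   # 4 of 6 values are present
score = completeness(column)
print(round(score, 2), accept(score))
```

With two of six values missing, the missing ratio slightly exceeds 33%, so this column falls just below the 67% tolerance level and would be discarded or repaired.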

Data quality profile (DQP) and repository (DQPREPO)

We describe hereafter the content of DQP and the DQP repository and the DQP levels captured through the lifecycle of framework processes.

  • Data quality profile

The data quality profile is generated once a Big Data Quality Project is created. It contains, for example, information about the data sources, domain, attributes, or features. This information may be retrieved from metadata, data provenance, schema, or from the dataset itself. If not available, data preparation (sampling and profiling) is needed to collect and extract important information, which will support the upcoming processes, as the Data Profile (DP) is created.

An Exploratory Quality Profiling will generate a quality rules proposal list. The DP is updated with these rules and converted into a DQP. This will help the user to obtain an overview of some DQDs and make better attributes selection based on this first quality approximation with a ready-to-use list of rules for pre-processing.

The User/App quality requirements (quality tolerance levels, DQDs, and targeted attributes) are set and added to the DQP. Most likely, the previously proposed quality rules are updated and tuned; otherwise, a complete redefinition of the quality requirement parameters is performed.

The mapping and selection phase will update the DQP with a DQES, which contains the set of attributes to be evaluated for a set of DQDs, using a set of metrics from the DQP repository.

The Quantitative Quality Evaluation component assesses the DQ and updates the DQES with DQD Scores.

The DQES scores pass through quality control. If they are validated, the DQP is executed in the pre-processing stage and confirmed in the repository.

If the scores (based on the quality requirements) are not valid, a quality rules discovery, validation, and optimization will be added/updated to the DQP configuration to obtain a valid DQD score that satisfies the quality requirements.

A continuous quality monitoring is performed for an eventual DQ failure that triggers a DQP update.

The DQP Repository: The DQPREPO contains detailed data quality profiles per data source and dataset. In the following, an information list managed by the repository is presented:

Data Quality User/App requirements.

Data Profiles, Metadata, and Data Provenance.

Data Quality Profiles (e.g. Data Quality Evaluation Schemes, and Data Quality Rules).

Data Quality Dimensions and related Metrics (metrics formulas and aggregate functions).

Data Domains (DQD’s, BD Characteristics).

DQD’s vs BD Characteristics.

Pre-processing Activities (e.g. Cleansing, and Normalizing) and functions (to replace missing values).

DQD’s vs DQ Issues vs PPF: Pre-processing Functions.

DQD’s priority processing in Quality Rules.

At every stage, module, task, or process, the DQP repository is incrementally updated with quality-related information. This includes, for example, quality requirements, DQES, DQD scores, data quality rules, Pre-Processing activities, activity functions, DQD metrics, and Data Profiles. Moreover, the DQP’s are organized per Data Domain and datatype to allow reuse. Adaptation is performed in the case of additional Big Datasets.

In Table 5 , an example of DQP Repository managed information along with its preprocessing activities (PPA) and their related functions (PPAF), is presented.

DQP lifecycle (Levels) : The DQP goes through the complete process flow of the proposed BDQMF. It starts with the specification of the Big Data Quality Project and ends with quality monitoring as an ongoing process that closes the quality enforcement loop and triggers other processes, which handle DQP adaptation, upgrade, or reuse. In Table 6 , the various DQP levels and their interaction within the BDQM Framework components are described. Each component involves process operations applied to the DQP.

Data preparation: sampling and profiling

Data preparation generates representative Big Data samples that serve as an entry for profiling, quality evaluation, and quality rules validation.

Sampling: Several sampling strategies can be applied to Big Data, as surveyed in [ 54 , 55 ]. In these works, the authors evaluated the effect of sampling methods on Big Data and concluded that sampling large datasets reduces the run-time and computational footprint of link prediction algorithms while maintaining adequate prediction performance. In statistics, the bootstrap technique estimates the sampling distribution of an estimator by repeatedly resampling, with replacement, from the original sample. In the Big Data context, bootstrap sampling has been studied in several works [ 56 , 57 ]. In the proposed data quality evaluation scheme, it was decided to use the Bag of Little Bootstraps (BLB) [ 58 ], which combines the results of bootstrapping multiple small subsets of a Big Data dataset. The BLB algorithm starts from the original Big Dataset, from which small samples are generated without replacement. For each generated sample, another set of samples is created by resampling with replacement.
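The two-level structure of BLB (small subsets drawn without replacement, each resampled with replacement) can be sketched as follows. This is a simplified illustration for the mean estimator only, not the implementation used in the framework; the function name, the subset size exponent of 0.6, and the other default parameters are assumptions.

```python
import random
import statistics

def blb_mean(data, n_subsets=5, subset_size=None, n_resamples=20, seed=0):
    """Estimate the mean and its standard error with a BLB-style procedure."""
    rng = random.Random(seed)
    n = len(data)
    b = subset_size or int(n ** 0.6)       # size of each small subset (assumed exponent)
    estimates, std_errors = [], []
    for _ in range(n_subsets):
        subset = rng.sample(data, b)       # small sample drawn without replacement
        resample_means = [
            # resample n points from the subset with replacement
            statistics.fmean(rng.choices(subset, k=n))
            for _ in range(n_resamples)
        ]
        estimates.append(statistics.fmean(resample_means))
        std_errors.append(statistics.stdev(resample_means))
    # Combine the per-subset results by averaging them.
    return statistics.fmean(estimates), statistics.fmean(std_errors)

# Synthetic data for illustration: values 0..99 repeated, true mean 49.5.
data = [i % 100 for i in range(2000)]
estimate, std_error = blb_mean(data)
print(round(estimate, 2), round(std_error, 2))
```

Each resample has the size of the full dataset but only touches the values of one small subset, which is what keeps the per-subset computation cheap.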

Profiling: The data profiling module performs the data quality screening based on statistics and information summary [ 59 , 60 , 61 ]. Since profiling is meant to discover data characteristics from data sources, it is considered as a data assessment process that provides a first summary of the data quality reported in its data profile. Such information includes, for example, the data format description, the different attributes with their types and values, basic quality dimension evaluations, data constraints (if any), and data ranges (max and min, a set of specific values or subsets).
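A minimal profiling pass over a data sample might look like the sketch below. It collects, per attribute, the observed types, the number of missing values, the number of distinct values, and the numeric range. The function name, the output layout, and the sample records are assumptions made for illustration.

```python
def profile(rows):
    """Build a basic per-attribute profile from a list of dict records."""
    attributes = sorted({key for row in rows for key in row})
    result = {}
    for attr in attributes:
        values = [row.get(attr) for row in rows]        # absent keys count as missing
        present = [v for v in values if v is not None]
        numeric = [v for v in present
                   if isinstance(v, (int, float)) and not isinstance(v, bool)]
        result[attr] = {
            "types": sorted({type(v).__name__ for v in present}),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "min": min(numeric) if numeric else None,
            "max": max(numeric) if numeric else None,
        }
    return result

# Hypothetical sample records.
sample = [
    {"age": 34, "zip": "10115"},
    {"age": None, "zip": "10117"},
    {"age": 51, "zip": "10115"},
    {"age": 29},
]
data_profile = profile(sample)
print(data_profile["age"])
```

The missing counts give a first approximation of completeness, and the observed types and ranges are natural candidates for the data constraints mentioned above.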

More precisely, the information about the data is presented in two types: technical and functional data. This information can be extracted from the data itself without any additional representation using metadata or any descriptive header file or by parsing the data using analysis tools. This task may become very costly in Big Data. Therefore, to avoid costs generated by the data size, the same sampling process (based on BLB) is used. Thus, the data is reduced to a representative population sample, in addition to the combination of profiling results. More precisely, a data profile in the proposed framework is represented as a data quality profile of the first level ( DQP1 ), which is generated after the profiling phase. Moreover, data profiling provides some useful information that leads to significant data quality rules, usually named as data constraints. These rules are mostly equivalent to a structured-data schema, which is represented as technical and functional rules.

According to [ 61 ], there are many activities and techniques used to profile the data. These may range from online, incremental, and structural, to continuous profiling. Profiling tasks aim at discovering information about the data schema. Some data sources are already provided with their data profiles, sometimes with minimal information. In the following, some other techniques are introduced. These techniques can enrich and bring value-added information to a data profile:

Data provenance inquiry : it tracks the data origin and provides information about data transformations, data copying, and its related data quality through the data lifecycle [ 62 , 63 , 64 ].

Metadata : it provides descriptive and structural information about the data. Many data types, such as images, videos, and documents, use metadata to provide deep information about their contents. Metadata can be represented in many formats, including XML, or it can be extracted directly from the data itself without any additional representation.

Data parsing (supervised/manual/automatic) : data parsing is required since not all the data has a provenance or metadata that describes the data. The hardest way to gather extra information about the data is to parse it. Automatic parsing can be initially applied. Then, it is tuned and supervised manually by a data expert. This task may become very costly when Big Data is concerned, especially in the case of unstructured data. Consequently, a data profile is generated to represent only certain parts of the data that make sense. Therefore, multiple data profiles for multiple data partitions must be taken into consideration.

Data profile : it is generated early in the Big Data Project as DQP Level 0 (Data profile in its early form) and upgraded as a data quality profile within the data preparation component as DQP Level 1. Then, it is updated and extended through all the components of the Big Data Quality Management Framework until it reaches a DQP Level 2 . The DQP Level 8 is the profile applied to the data in the pre-processing phase, with its quality rules and related activities, to output pre-processed data that conforms to the quality requirements.

Exploratory quality profiling

Since a data-driven approach is followed, in which quality dimensions are evaluated quantitatively from the data itself, two evaluation steps are adopted: Quantitative Quality Evaluation based on user requirements, and Exploratory Quality Profiling.

The exploratory quality profiling component is responsible for automatic data quality dimensions’ exploration without user interventions. The Quality Rules Proposals module, which produces a list of actions to elevate data quality, is based on some elementary DQDs that fit all varieties and data types.

A list of quality rules proposition, which is based on the quality evaluation of the most likely considered DQDs (e.g., completeness, accuracy, and uniqueness), is produced. This preliminary assessment is performed based on the data itself and using predefined scenarios. These scenarios are meant to increase data quality for some basic DQDs. In Fig. 7 , the steps involved in the exploratory quality profiling for quality rules proposals generation are depicted. DQP1 is extended to DQP2, after adding the Data Quality Rules Proposal ( DQRP ), which is generated by the “quality rules proposals” process.

This module is part of the DQ profiling process, which varies the DQD tolerance levels from min to max scores and applies a systematic list of predefined quality rules. These predefined rules are a set of actions applied to the data when the measured DQD scores fall outside the tolerance level defined by the min and max scores. The actions range from deleting only attributes or discarding only observations to a combination of both. After these actions, a re-evaluation of the new DQD scores will lead to a quality rules proposal (DQRP) with known DQD target scores after performing an analysis. In Table 7 , some examples of these predefined rule scenarios for the DQD completeness ( dqd  =  Comp ), with an execution priority for each set of grouped actions, are described. The DQD levels are set to vary from a 5% to a 95% tolerance score with a granularity step of 5%. They can be set differently according to the DQD choice and its sensitivity to the data model and domain. The selection of the best proposed data quality rules is based on the KNN algorithm using Euclidean distance (Deng et al., 2016) [ 65 ]. It gives the closest quality rule parameters that achieve (by default) high completeness with less data reduction. The process might be refined by specifying other quality parameters.
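The scenario sweep can be illustrated with a small sketch. It considers only one action (deleting attributes whose completeness is below the tolerance level), sweeps the level from 5% to 95% in steps of 5, and then picks the scenario nearest to an ideal point of 100% completeness and 100% data retention. This nearest-neighbour selection with a weighted Euclidean distance is a simplified stand-in for the KNN-based selection mentioned above; the function names, the weights, and the sample data are assumptions.

```python
import math

def column_completeness(rows, attr):
    """Percentage of non-missing values in one attribute."""
    return 100.0 * sum(r[attr] is not None for r in rows) / len(rows)

def evaluate_scenario(rows, attrs, level):
    """Delete attributes whose completeness is below `level`, then re-evaluate."""
    kept = [a for a in attrs if column_completeness(rows, a) >= level]
    kept_cells = len(rows) * len(kept)
    filled = sum(r[a] is not None for r in rows for a in kept)
    return {
        "level": level,
        "kept": kept,
        "completeness": 100.0 * filled / kept_cells if kept_cells else 0.0,
        "retention": 100.0 * kept_cells / (len(rows) * len(attrs)),
    }

def propose_rule(rows, attrs, target=(100.0, 100.0), weights=(4.0, 1.0)):
    """Pick the scenario closest to the target (completeness, retention) point."""
    def distance(s):
        point = (s["completeness"], s["retention"])
        return math.sqrt(sum(w * (p - t) ** 2
                             for p, t, w in zip(point, target, weights)))
    scenarios = [evaluate_scenario(rows, attrs, lvl) for lvl in range(5, 100, 5)]
    return min(scenarios, key=distance)

# Hypothetical sample: a and b are complete, c is 90% complete, d is 10% complete.
attrs = ["a", "b", "c", "d"]
rows = [{"a": i, "b": i, "c": None if i == 0 else i, "d": i if i == 0 else None}
        for i in range(10)]
best = propose_rule(rows, attrs)
print(best["level"], best["kept"], round(best["completeness"], 2), best["retention"])
```

With these weights, the proposal drops only the sparse attribute `d`; with equal weights, keeping all attributes would win, which shows how sensitive the proposal is to the chosen quality parameters.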

In Fig.  8 , the modules involved in the exploratory quality profiling for quality rules proposals generation are illustrated.

Fig. 8 Quality rules proposals with exploratory quality profiling

Quality mapping and selection

The quality mapping and selection module of the BDQM framework is responsible for mapping data features or attributes to DQD’s to target pre-required quality evaluation scores. It generates a Data Quality Evaluation Scheme ( DQES ) and then adds it (updates) to the DQP. The DQES contains the DQD’s of the appropriate attributes to be evaluated using adequate metric formulas. The DQES, as a part of DQP, contains (for each of the selected data attributes) the following list, which is considered essential for the quantitative quality evaluation:

The attributes: all or a selected list,

The data quality dimensions (DQD’s) to be evaluated for each selected attribute,

Each DQD has a metric that returns the quality score, and

The quality requirement scores for each DQD needed in the score’s validation.

These requirements are general and target many global quality levels. The mapping component acts as a refinement of the global settings with precise quality goals. Therefore, a mapping must be performed between the data quality dimensions and targeted data features/attributes before proceeding with the quality assessment. Each DQD is measured for each attribute and sample. The mapping generates a DQES , which contains Quality Evaluation Requests ( QER ) \(Q_x\). Each QER \(Q_x\) targets a data quality dimension (DQD) for an attribute, all attributes, or a set of selected attributes, where x is the number of requests.

Quality mapping: Many approaches are available to accomplish an efficient mapping process. These include automatic, interactive, and manual techniques, as well as techniques based on quality rules proposals:

Automatic : it completes the alignment and comparison of the data attributes (from DQP) with the data quality requirements (either per attribute type, or name). A set of DQDs is associated with each attribute for quality evaluation. It results in a set of associations to be executed and evaluated in the quality assessment component.

Interactive : it relies on experts’ involvement to refine, amend, or confirm the previous automated associations.

Manual : it uses a dashboard similar to, but more advanced than, the one used to capture the users' quality requirements, with more detail at the attribute level.

Quality rules proposals : the proposal list collected from the DQP2 is used to obtain an understanding of the impact of a DQD level and the data reduction ratio. These quality insights help decide which DQD is best when compared to the quality requirements.

Quality selection (of DQD, Metrics and Attributes): It consists of selecting an appropriate quality metric to evaluate data quality dimensions for an attribute of a Big Data sample set; the metric returns a count of the correct values, that is, those that comply with the metric formula. Each metric will be computed if the attribute values reflect the DQD constraints. For example, accuracy can be defined as a count of correct attribute values in a certain range of values \([v_1, v_2]\). Similarly, it can be defined to satisfy a certain number of constraints related to the type of data such as zip code, email, social security number, dates, or addresses.
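The two kinds of accuracy metric described here, a value range and a type-related format constraint, can be sketched as small predicate builders. The helper names, the age range of 0 to 120, and the five-digit zip code pattern are illustrative assumptions.

```python
import re

def in_range(low, high):
    """Metric: the value is numeric and lies in [low, high]."""
    return lambda v: isinstance(v, (int, float)) and low <= v <= high

def matches(pattern):
    """Metric: the value is a string that fully matches the given pattern."""
    compiled = re.compile(pattern)
    return lambda v: isinstance(v, str) and compiled.fullmatch(v) is not None

def count_correct(values, metric):
    """Count the values that comply with the metric."""
    return sum(1 for v in values if metric(v))

# Hypothetical attribute values.
ages = [25, 130, 47, -3, 61]
zips = ["10115", "1011", "ABCDE", "90210"]

age_ok = count_correct(ages, in_range(0, 120))    # range [v1, v2] = [0, 120]
zip_ok = count_correct(zips, matches(r"\d{5}"))   # assumed 5-digit zip format
print(age_ok, zip_ok)
```

Dividing each count by the number of values gives the ratio that is later compared with the tolerance level of the corresponding DQD.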

Let us define the tuple DQES (S, D, A, M) . Most of the information is provided by the BDQP(DS , DS’ , Req) with Req  =  ( D , L, A ) parameters. The profiling information is used to select the appropriate quality metrics \({{\varvec{m}}}_{{\varvec{l}}}\) to evaluate the data quality dimensions \({{\varvec{q}}}_{{\varvec{l}}}\) for an attribute \({{\varvec{a}}}_{{\varvec{k}}}\) with a weight \({{\varvec{w}}}_{{\varvec{j}}}\) . In addition to the previous settings, let us consider the following: S : S ( DS , N , n, R ) \(\to\) \({{\varvec{S}}}_{{\varvec{i}}}\) a sampling strategy

Let us denote by M a set of quality metrics \({\varvec{M}}=\left\{{{\varvec{m}}}_{1},\dots ,{{\varvec{m}}}_{{\varvec{l}}},\dots ,{{\varvec{m}}}_{{\varvec{d}}}\right\}\) where \({{\varvec{m}}}_{{\varvec{l}}}\) is a quality metric that measures and evaluates a DQD \({{\varvec{q}}}_{{\varvec{l}}}\) for each value of an attribute \({{\varvec{a}}}_{{\varvec{k}}}\) in the sample \({{\varvec{s}}}_{{\varvec{i}}}\) and returns 1 if the value is correct and 0 if not. Each \({{\varvec{m}}}_{{\varvec{l}}}\) metric will be computed if the value of the attribute reflects the \({{\varvec{q}}}_{{\varvec{l}}}\) constraint. For example, if the accuracy of an attribute is defined as a range of values between 0 and 100, a value within this range is correct; otherwise, it is incorrect. If the same DQD \({{\varvec{q}}}_{{\varvec{l}}}\) is evaluated for a set of attributes, and if the weights are all equal, a simple mean is computed. The metric \({{\varvec{m}}}_{{\varvec{l}}}\) will be evaluated to measure whether each attribute has its \({{\varvec{m}}}_{{\varvec{l}}}\) correct. This is performed for each instance (cell or row) of the sample \({{\varvec{s}}}_{{\varvec{i}}}\) .
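A possible reading of this definition in code is given below: each metric returns 1 or 0 per value, the per-attribute ratios are computed over the sample, and the ratios are combined with attribute weights, which reduces to a simple mean when all weights are equal. The function name, the attribute names, and the weights are assumptions made for illustration.

```python
def dqd_score(rows, metrics, weights=None):
    """Weighted DQD score (in percent) over a set of attributes.

    metrics: attribute -> function returning 1 (correct) or 0 (incorrect) per value.
    weights: attribute -> weight; equal weights are assumed when omitted.
    """
    attrs = list(metrics)
    weights = weights or {a: 1.0 for a in attrs}
    total_weight = sum(weights[a] for a in attrs)
    score = 0.0
    for a in attrs:
        ratio = sum(metrics[a](row[a]) for row in rows) / len(rows)
        score += weights[a] * ratio
    return 100.0 * score / total_weight

# Hypothetical sample: "score" must lie in [0, 100] and "rate" in [0, 1].
rows = [
    {"score": 50, "rate": 0.2},
    {"score": 120, "rate": 0.5},
    {"score": 99, "rate": 1.5},
    {"score": 0, "rate": 2.0},
]
metrics = {
    "score": lambda v: 1 if 0 <= v <= 100 else 0,
    "rate": lambda v: 1 if 0 <= v <= 1 else 0,
}
equal = dqd_score(rows, metrics)                                # simple mean
weighted = dqd_score(rows, metrics, {"score": 3.0, "rate": 1.0})
print(equal, weighted)
```

Here three of four `score` values and two of four `rate` values are correct, so the equal-weight score is the plain mean of 75% and 50%.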

Let us denote by \({{{\varvec{M}}}_{{\varvec{l}}}}^{\left(i\right)}, i=1,\dots ,{\varvec{N}}\) , a metric total \({{\varvec{m}}}_{{\varvec{l}}}\) , which evaluates and counts the number of observations that satisfy this metric, for a DQD \({{\varvec{q}}}_{{\varvec{l}}}\) of an attribute \({{\varvec{a}}}_{{\varvec{k}}}\) of N samples from the dataset DS . The proportion of observations under the adequacy rule is calculated by:

The proportion of observations under the adequacy rule in a sample \({{\varvec{s}}}_{{\varvec{i}}}\) is given by:

The total proportion of observations under the adequacy rule for all samples is given by:

where \({{\varvec{M}}}_{{\varvec{l}}}\) characterizes the \({{\varvec{q}}}_{{\varvec{l}}}\) mean score for the whole dataset.
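The counting and averaging described above can be illustrated with a minimal Python sketch. It assumes an accuracy metric defined as the range 0 to 100 (the example used earlier), equally sized samples, and a plain mean over samples; the function names `accuracy_metric`, `sample_proportion`, and `dataset_score` are hypothetical and not part of the framework.

```python
from statistics import mean


def accuracy_metric(value, low=0, high=100):
    """Return 1 if the value lies in [low, high], else 0 (None counts as incorrect)."""
    return int(value is not None and low <= value <= high)


def sample_proportion(sample, metric):
    """Proportion of observations in one sample that satisfy the metric."""
    return sum(metric(v) for v in sample) / len(sample)


def dataset_score(samples, metric):
    """Mean of the per-sample proportions over all N samples."""
    return mean(sample_proportion(s, metric) for s in samples)


samples = [
    [10, 50, 120, 99],   # 3 of 4 values are in range
    [0, 100, -5, 200],   # 2 of 4 values are in range
]
per_sample = [sample_proportion(s, accuracy_metric) for s in samples]
overall = dataset_score(samples, accuracy_metric)
print(per_sample, overall)
```

Here each per-sample proportion plays the role of the score for one sample, and `overall` plays the role of the mean score for the whole dataset.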

Let \({{\varvec{Q}}}_{{\varvec{x}}}\left({{\varvec{a}}}_{{\varvec{k}}},{{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}}\right)\) represent a request for a quality evaluation, which results in the mean quality score for a DQD \({{\varvec{q}}}_{{\varvec{l}}}\) for a measurable attribute \({{\varvec{a}}}_{{\varvec{k}}}\) calculated by \({{\varvec{M}}}_{{\varvec{l}}}\). Big Data samples are evaluated for a DQD \({{\varvec{q}}}_{{\varvec{l}}}\) in a sample \({{\varvec{s}}}_{{\varvec{i}}}\) for an attribute \({{\varvec{a}}}_{{\varvec{k}}}\) with a metric \({{\varvec{m}}}_{{\varvec{l}}}\), which provides a \({{\varvec{q}}}_{{\varvec{l}}}{{\varvec{s}}}_{{\varvec{i}}}\) score for each sample (described below in Quantitative Quality Evaluation). Then, the sample mean of \({{\varvec{q}}}_{{\varvec{l}}}\) is the final score for \({{\varvec{a}}}_{{\varvec{k}}}\).

Let us denote a process, which sorts and combines the requests of a quality evaluation (QER) by DQD or by an attribute, resulting in a re-arrangement of the \({{\varvec{Q}}}_{{\varvec{x}}}\left({{\varvec{a}}}_{{\varvec{k}}},{{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}}\right)\) tuple into two types, depending on the evaluation selection group parameter:

Per DQD identified as \({{\varvec{Q}}}_{{\varvec{x}}}\left({\varvec{A}}{\varvec{L}}{\varvec{i}}{\varvec{s}}{\varvec{t}}\left({{\varvec{a}}}_{{\varvec{z}}}\right),{{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}}\right)\) where AList(a z ) represents the attributes \({{\varvec{a}}}_{{\varvec{z}}}\) ( z:1…R ) to be evaluated for the DQD \({{\varvec{q}}}_{{\varvec{l}}}\) .

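Per attribute, identified as \({{\varvec{Q}}}_{{\varvec{x}}}\left({{\varvec{a}}}_{{\varvec{k}}},{\varvec{D}}{\varvec{L}}{\varvec{i}}{\varvec{s}}{\varvec{t}}({{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}})\right)\), where DList( \({{\varvec{q}}}_{{\varvec{l}}}\) , \({{\varvec{m}}}_{{\varvec{l}}}\) ) represents the data quality dimensions \({{\varvec{q}}}_{{\varvec{l}}}\) ( l:1…d ) to be evaluated for the attribute \({{\varvec{a}}}_{{\varvec{k}}}\) .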

In some cases, the type of combination is automatically selected for a certain DQD, such as consistency, when all the attributes are constrained towards specific conditions. The combination is either based on attributes or DQD’s, and the DQES will be constructed as follows:

DQES ( \({{\varvec{Q}}}_{{\varvec{x}}}\left({\varvec{A}}{\varvec{L}}{\varvec{i}}{\varvec{s}}{\varvec{t}}\left({{\varvec{a}}}_{{\varvec{z}}}\right),{{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}}\right)\) ,…,…), or

DQES ( \({{\varvec{Q}}}_{{\varvec{x}}}\left({{\varvec{a}}}_{{\varvec{k}}},{\varvec{D}}{\varvec{L}}{\varvec{i}}{\varvec{s}}{\varvec{t}}({{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}})\right)\) ,…,…)

The completion of the quality mapping process updates the DQP Level 2 with a DQES set as follows:

DQES ( \({{\varvec{Q}}}_{{\varvec{x}}}\left({{\varvec{a}}}_{{\varvec{k}}},{{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}}\right)\) ,…,…) , where x ranges from 1 to a defined number of evaluation requests. Each Q x element is a quality evaluation request of an attribute \({{\varvec{a}}}_{{\varvec{k}}}\) for a quality dimension \({{\varvec{q}}}_{{\varvec{l}}}\) , with a DQD metric m l .

The output of this phase generates a DQES score, which contains the mean score for each DQ dimension for one or many attributes. The mapping and selection data flow initiated using Big Data quality project (BDQP) settings is illustrated in Fig.  9 . This is accomplished either using the same BDQP Req or defining more detailed and refined quality parameters and a sampling strategy. Two types of DQES can be yielded:

Data Quality Dimension-wise evaluation of a list of attributes or

Attribute-wise evaluation of many DQDs.

As described before, the quality mapping and selection component generates a DQES evaluation scheme for the dataset, identifying which DQD and attribute tuples to evaluate using a specific quality metric. Therefore, a more detailed and refined set of parameters can also be set, as described in previous sections. In the following, the steps that construct the DQES in the mapping component are described:

The QMS function extracts the Req parameters from BDQP as (D, L, A) .

A quality evaluation request \(\left({a}_{k},{q}_{l},{m}_{l}\right)\) , is generated from the (D, A) tuple.

A list is constructed with these quality evaluation requests.

A list sorting is performed either by DQD or by Attributes producing two types of lists:

A combination of requests per DQD generates quality requests for a set of attributes \(\left(AList\left({a}_{z}\right),{q}_{l},{m}_{l}\right)\) .

A combination of requests per attribute generates quality requests for a set of DQD’s \(\left({a}_{k},DList({q}_{l},{m}_{l})\right)\) .

A DQES is returned based on the evaluation selection group parameter (per DQD, per attribute).
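The construction steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: requests are plain tuples, one request is generated for every attribute and DQD pair, and each DQD has a single metric; the names `build_requests`, `build_dqes`, and the metric labels are hypothetical.

```python
from collections import defaultdict


def build_requests(dqds, attributes, metrics):
    """Create one (attribute, DQD, metric) request for every attribute/DQD pair."""
    return [(a, q, metrics[q]) for a in attributes for q in dqds]


def build_dqes(requests, group_by="dqd"):
    """Group requests per DQD as (AList, q, m) or per attribute as (a, DList)."""
    grouped = defaultdict(list)
    if group_by == "dqd":
        for a, q, m in requests:
            grouped[(q, m)].append(a)
        return [(sorted(attrs), q, m) for (q, m), attrs in sorted(grouped.items())]
    if group_by == "attribute":
        for a, q, m in requests:
            grouped[a].append((q, m))
        return [(a, sorted(dlist)) for a, dlist in sorted(grouped.items())]
    raise ValueError("group_by must be 'dqd' or 'attribute'")


# Hypothetical DQD-to-metric mapping and attribute names.
metrics = {"accuracy": "in_range", "completeness": "not_null"}
requests = build_requests(["accuracy", "completeness"], ["age", "zip"], metrics)
per_dqd = build_dqes(requests, "dqd")
per_attr = build_dqes(requests, "attribute")
print(per_dqd)
print(per_attr)
```

The `group_by` argument stands in for the evaluation selection group parameter: the same request list yields either DQD-wise or attribute-wise entries.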

Fig. 9. DQES parameters settings

Quantitative quality evaluation

The authors in [66] addressed how to evaluate a set of DQDs over a set of attributes. According to this study, the evaluation of Big Data quality is applied and iterated over many samples. The aggregation and combination of DQD scores are performed after each iteration. The evaluation scores are added to the DQES, which results in updating the DQP. We proposed an algorithm that computes the quality scores for a dataset based on a certain quality mapping and quality metrics.

This algorithm is based on quality metrics evaluation using scores after collecting and validating the scores with quality requirements and generating quality rules from these scores [66, 67]. There are rules related to each pre-processing activity, such as data cleaning rules, which eliminate data, and data enrichment rules, which replace or add data. Other activities, such as data reduction, reduce the data size by decreasing the number of features or attributes that have certain characteristics, such as low variance or high correlation with other features.

In this phase, all the information collected from previous components (profiling, mapping, DQES) is included in the data quality profile level 3. The important elements are the set of samples and the data quality evaluation scheme, which are executed on each sample to evaluate its quality attributes for a specific DQD.

DQP Level 3 provides all the information needed about the settings represented by the DQES to proceed with the quality evaluation. The DQES contains the following:

The selected DQDs and their related metrics.

The selected attributes with the DQD to be evaluated.

The DQD selection, which is based on the Big Data quality requirements expressed early when initiating a Big Data Quality Project.

Attributes selection is set in the quality selection mapping component (3).

The quantitative quality evaluation methodology is described as follows:

The selected DQD quality metrics will measure and evaluate the DQD for each attribute observation in each sample from the sample set. For each attribute observation, it returns a value 1, if correct, or 0, if incorrect.

Each metric will be computed if the attribute values of all the sample observations reflect the constraints. For example, the accuracy metric of an attribute may define a range of values between 20 and 70 as valid; any value outside this range is invalid. The count of correct values out of the total sample observations is the DQD ratio, expressed as a percentage (%). This is performed for all selected attributes and their selected DQDs.

The sample mean from all samples for each evaluated DQD represents a Data Quality Score (DQS) estimation \(\left(\overline{DQS }\right)\) of a data quality dimension of the data source.

DQP Level 4 : an update to the DQP level 3 includes a data quality evaluation scheme (DQES) with the quality scores per DQD and per attribute ( DQES  +  Scores ).

In summary, the quantitative quality evaluation proceeds through sampling, selection of the DQDs and their metrics, mapping with data attributes, quality measurements, and computation of the sample mean of the DQD ratios.

Let us denote by \({{\varvec{Q}}}_{{\varvec{x}}}\) Score (quality score) the evaluation result of each quality evaluation request \({{\varvec{Q}}}_{{\varvec{x}}}\) in the DQES. Depending on the evaluation type, two types of DQES, and therefore two kinds of result scores, can be identified, organized per DQD for all attributes or per attribute for all DQDs:

\({{\varvec{Q}}}_{{\varvec{x}}}\left({\varvec{A}}{\varvec{L}}{\varvec{i}}{\varvec{s}}{\varvec{t}}\left({{\varvec{a}}}_{{\varvec{z}}}\right),{{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}}\right)\to\) \({{\varvec{Q}}}_{{\varvec{x}}}\) ScoreList \(\left({\varvec{A}}{\varvec{L}}{\varvec{i}}{\varvec{s}}{\varvec{t}}\left({{\varvec{a}}}_{{\varvec{z}}},{\varvec{S}}{\varvec{c}}{\varvec{o}}{\varvec{r}}{\varvec{e}}\right),{{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}}\right)\), or

\({{\varvec{Q}}}_{{\varvec{x}}}\left({{\varvec{a}}}_{{\varvec{z}}},{\varvec{D}}{\varvec{L}}{\varvec{i}}{\varvec{s}}{\varvec{t}}({{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}})\right)\) \(\to\) Q x ScoreList \(\left({{\varvec{a}}}_{{\varvec{z}}},{\varvec{D}}{\varvec{L}}{\varvec{i}}{\varvec{s}}{\varvec{t}}\left({{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}},{\varvec{S}}{\varvec{c}}{\varvec{o}}{\varvec{r}}{\varvec{e}}\right)\right)\)

where \({\varvec{z}}=1,\dots ,{\varvec{r}}\), with \({\varvec{r}}\) the number of selected attributes, and \({\varvec{l}}=1,\dots ,{\varvec{d}}\), with \({\varvec{d}}\) the number of selected DQDs.

The quality evaluation generates quality scores \({{\varvec{Q}}}_{{\varvec{x}}}\) Score . A quality scoring model is used to assess these results. It is provided in the form of quality requirements, expressed as quality acceptance level percentages, against which the resulting scores are interpreted. These quality requirements might be a set of values, an interval in which values are accepted or rejected, or a single score ratio percentage. The analysis of these scores against quality requirements leads to the discovery and generation of quality rules for attributes violating the quality requirements.

The quantitative quality evaluation process follows the steps described below for the case of the evaluation of a DQD’s list among several attributes ( \({{\varvec{Q}}}_{{\varvec{x}}}\left({{\varvec{a}}}_{{\varvec{z}}},{\varvec{D}}{\varvec{L}}{\varvec{i}}{\varvec{s}}{\varvec{t}}({{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}})\right)\) ):

N samples (of size n ) are generated from the dataset DS using a BLB-based bootstrap sampling approach.

For each sample \({{\varvec{s}}}_{{\varvec{i}}}\) generated in step 1, and

For each \({{\varvec{a}}}_{{\varvec{z}}}\) ( \({\varvec{z}}=1,\dots ,{\varvec{r}}\) ) selected attribute in DQES in step 1, evaluate all the DQD’s in the DList using their related metrics to obtain Q x ScoreList \(\left({{\varvec{a}}}_{{\varvec{z}}},{\varvec{D}}{\varvec{L}}{\varvec{i}}{\varvec{s}}{\varvec{t}}\left({{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}},{\varvec{S}}{\varvec{c}}{\varvec{o}}{\varvec{r}}{\varvec{e}}\right),{{\varvec{s}}}_{{\varvec{i}}}\right)\) for each sample \({{\varvec{s}}}_{{\varvec{i}}}\) .

For all the sample scores, evaluate the sample mean over all N samples for each attribute \({{\varvec{a}}}_{{\varvec{z}}}\) related to the \({{\varvec{q}}}_{{\varvec{l}}}\) evaluation scores, as \({\overline{{\varvec{q}}} }_{{\varvec{z}}{\varvec{l}}}\).

For the dataset DS , evaluate the quality score mean \({\overline{{\varvec{q}}} }_{{\varvec{l}}}\) for each DQD for all attributes \({{\varvec{a}}}_{{\varvec{z}}}\) , as follows:

The illustration in Fig. 10 shows that the \({{\varvec{q}}}_{{\varvec{z}}{\varvec{l}}}{{\varvec{s}}}_{{\varvec{i}}}{\varvec{S}}{\varvec{c}}{\varvec{o}}{\varvec{r}}{\varvec{e}}\) is the evaluation of DQD \({{\varvec{q}}}_{{\varvec{l}}}\) for the sample \({{\varvec{s}}}_{{\varvec{i}}}\) for an attribute \({{\varvec{a}}}_{{\varvec{z}}}\) with a metric \({{\varvec{m}}}_{{\varvec{l}}}\), and that \({\overline{{\varvec{q}}} }_{{\varvec{z}}{\varvec{l}}}\) represents the quality score sample mean for the attribute \({{\varvec{a}}}_{{\varvec{z}}}\) .

Fig. 10. Big data sampling and quantitative quality evaluation
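The evaluation steps for the attribute-wise case can be sketched in Python. This is a simplified illustration, not the authors' implementation: plain sampling with replacement stands in for the BLB-based bootstrap, all attributes are assumed to carry equal weights when averaging a DQD across attributes, and the names `draw_samples`, `evaluate`, `in_range`, and `not_null` are hypothetical.

```python
import random
from statistics import mean


def in_range(low, high):
    """Build an accuracy metric: 1 if the value lies in [low, high], else 0."""
    return lambda v: int(v is not None and low <= v <= high)


def not_null(v):
    """Completeness metric: 1 if the value is present, else 0."""
    return int(v is not None)


def draw_samples(rows, n_samples, size, seed=0):
    """Plain sampling with replacement; a simple stand-in for BLB."""
    rng = random.Random(seed)
    return [[rng.choice(rows) for _ in range(size)] for _ in range(n_samples)]


def evaluate(rows, dqes, n_samples=5, size=50, seed=0):
    """dqes maps attribute -> {DQD name: metric}.

    Returns the per-attribute sample means and the per-DQD means over attributes.
    """
    samples = draw_samples(rows, n_samples, size, seed)
    q_zl = {}
    for attr, dlist in dqes.items():
        for dqd, metric in dlist.items():
            per_sample = [mean(metric(row[attr]) for row in s) for s in samples]
            q_zl[(attr, dqd)] = mean(per_sample)
    dqds = {dqd for _, dqd in q_zl}
    # Equal attribute weights are assumed, so a simple mean is used per DQD.
    q_l = {d: mean(v for (_, dqd), v in q_zl.items() if dqd == d) for d in dqds}
    return q_zl, q_l


rows = [{"age": 30, "score": 55}, {"age": None, "score": 55}]
dqes = {
    "age": {"completeness": not_null},
    "score": {"completeness": not_null, "accuracy": in_range(20, 70)},
}
q_zl, q_l = evaluate(rows, dqes)
print(q_zl)
print(q_l)
```

With unequal attribute weights, the final per-DQD mean would become a weighted mean; that variant is left out of the sketch.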

Quality control

The quality control is initiated when the quality evaluation results are available and reported in the DQES in DQP Level 4 . During quality control, all the quality scores are checked against the quality requirements of the Big Data project. If any anomaly or non-conformance is detected, the quality control component forwards a DQP Level 5 to the data quality rules discovery component.

At this point, various cases are highlighted. An iteration process is performed until the required quality levels are satisfied, or the experts decide to stop the quality evaluation process and re-evaluate their requirements. At each phase, there is a kind of quality control, even if it is not explicitly specified, within each quality process.

The quality control acts in the following cases:

Case 1: This case applies when the quality is estimated, and no rules are yet included in the DQP Level 4 (the DQP is considered as a report, since the data quality is still inspected, and only reports are generated with no actions yet to be performed).

In the case of accepted quality scores, no quality actions need to be applied to data. The DQP Level 4 remains unchanged and acts as a full data quality report, which is updated with positive validation of the data per quality requirement. However, it might include some simple pre-processing such as attribute selection and filtering. According to the data analytics requirements and expected results planned in the Big Data project, more specific data pre-processing actions are performed but not related to quality in this case.

In the case when quality scores are not accepted, the DQP Level 4 DQES scores are analyzed, and the DQP is updated with a quality error report about the related DQD scores and their data attributes. DQP Level 5 is created, and it will be analyzed by the quality rules discovery component for the pre-processing activities to be executed on the data.

Case 2: In the presence of a DQP Level 6 that contains a quality evaluation request of the pre-processed samples with discovered quality rules, the following situations may occur:

When the quality control checks that the DQP Level 6 rules are valid and satisfy the quality requirements, the DQP Level 6 is updated to DQP Level 7 and confirmed as the final data quality profile, which will be applied to the data in the pre-processing phase. DQP Level 7 is considered as important if it contains validated quality rules.

When the quality control is not satisfied, totally or partially, the DQP Level 6 is sent back to the quality selection and mapping component for adaptation, together with the valid and invalid quality rules, quality scores, and error reports. These reports highlight, with an unacceptable score interval, the quality rules that have not satisfied the quality requirements. The quality selection and mapping component provides automatic or manual analysis and assessment of the unsatisfied quality rules concerning their targeted DQDs, attributes, and quality requirements. An adaptation of quality requirements is needed to re-validate these rules. Finally, the user experts have the final word on whether to continue or to break the process and proceed to the pre-processing phase with the valid rules. As part of the framework reuse specification, the invalid rules are kept within the DQP for future re-evaluation.

Case 3: The control component will always proceed based on the quality scores and quality requirements for both input and pre-processed data. Continuous control and monitoring are responsible for initiating DQP updates and adaptation if the quality requirements are relaxed.
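The decision made in Case 1 can be sketched as a comparison of scores against requirements. The sketch below is an assumption-laden illustration: scores and requirements are dictionaries keyed by (attribute, DQD), requirements are minimum acceptance ratios, and `quality_control` is a hypothetical function name.

```python
def quality_control(scores, requirements):
    """Compare DQD scores with minimum acceptance levels.

    Both arguments map (attribute, dqd) -> ratio in [0, 1].
    Returns the resulting DQP level (4 if all accepted, 5 otherwise)
    and an error report listing the failed entries.
    """
    failures = {}
    for key, minimum in requirements.items():
        score = scores.get(key, 0.0)  # a missing score is treated as a failure
        if score < minimum:
            failures[key] = {"score": score, "required": minimum}
    level = 5 if failures else 4
    return level, failures


scores = {("age", "completeness"): 0.5, ("score", "accuracy"): 0.95}
requirements = {("age", "completeness"): 0.8, ("score", "accuracy"): 0.9}
level, report = quality_control(scores, requirements)
print(level, report)
```

An empty report corresponds to accepted quality scores, where the DQP Level 4 remains unchanged; a non-empty report corresponds to the error report forwarded with DQP Level 5.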

Quality rules: discovery, validation, optimization, and execution

In [67], it was reported that if the DQD scores do not conform to the quality requirements, then failed scores are used to discover data quality rules. When executed on data, these rules enhance its quality. They are based on known pre-processing activities such as data cleansing. Each activity has a set of functions targeting different types of data in order to increase its DQD ratio and the overall data quality (of the data source or the datasets).

When Quality Rules ( QR) are applied to a sample set S , a pre-processed sample set S’ is generated. A quality evaluation process is invoked on S’ , generating DQD scores for S’ . Thus, a score comparison between S and S’ is conducted to filter only qualified and valid rules with a higher percentage of success among data. Then, an optimization scheme is applied to the list of valid quality rules before their application on production data. The predefined optimization schemes vary from (1) rules priority to (2) rules redundancy, (3) rules removal, (4) rules grouping per attribute, or (5) per DQD’s, or (6) per duplicate rules.

Quality rules discovery: The discovery is based on the DQP Level 5 from the quality control component. An analysis of the quality scores is initiated, and an error report is extracted. If the DQD scores do not conform to the quality requirements, then failed scores are used to discover data quality rules. When executed on data, these rules enhance its quality. They are based on known pre-processing activities such as data cleansing. The discovery component comprises several modules, covering the analysis of the DQES DQD scores against the requirements, the combination of attribute pre-processing activities for each targeted DQD, and the generation of rules.

For example, an attribute with a 50% missing data score is not accepted when the required score is 20% or less. This initiates the generation of a quality rule, which consists of a data cleansing activity for observations that do not satisfy the quality requirements. The data cleansing or data enrichment activity is selected from the Big Data quality profile repository. The quality rule will target all the related attributes marked for pre-processing to reduce the missing data ratio from 50% to 20% for the DQD completeness. Moreover, in the case of completeness, cleansing is not the only option for missing values; many alternative pre-processing activities are available, such as a missing value replacement activity with functions for several replacement methods like the mean, the mode, and the median.
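The completeness example can be made concrete with a short Python sketch. It assumes a rule is a simple dictionary and that missing values are represented by `None`; the rule fields and the names `discover_rule` and `apply_rule` are hypothetical and only illustrate the idea of generating a replacement rule when the missing ratio exceeds the requirement.

```python
from statistics import mean, median, mode

# Replacement functions mentioned in the text: mean, median, and mode.
IMPUTERS = {"mean": mean, "median": median, "mode": mode}


def missing_ratio(values):
    """Share of missing (None) observations for one attribute."""
    return sum(v is None for v in values) / len(values)


def discover_rule(attribute, values, max_missing, function="mean"):
    """Return a hypothetical completeness rule if the missing ratio is too high."""
    ratio = missing_ratio(values)
    if ratio <= max_missing:
        return None
    return {
        "attribute": attribute,
        "dqd": "completeness",
        "activity": "missing_value_replacement",
        "function": function,
        "observed_missing": ratio,
        "max_missing": max_missing,
    }


def apply_rule(rule, values):
    """Replace missing values using the function named in the rule."""
    present = [v for v in values if v is not None]
    fill = IMPUTERS[rule["function"]](present)
    return [fill if v is None else v for v in values]


ages = [20, None, 40, None]                            # 50% missing
rule = discover_rule("age", ages, max_missing=0.20)    # requirement: 20% or less
cleaned = apply_rule(rule, ages)
print(rule)
print(cleaned)
```

After the rule is applied, re-running `discover_rule` on the cleaned values returns `None`, since the completeness requirement is then met.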

The pre-processing activities are provided by the repository to achieve the required data quality. Many possibilities for pre-processing activities selection are available:

Automatic , by discovering and suggesting a set of activities or DQ rules.

Predefined , by selecting ready-to-use quality rules proposals from the exploratory quality profiling component, predefined pre-processing activity functions from the repository, indexed by DQDs.

Manual, giving the expert the ability to query the exploratory quality profiling results for the best rules, achieving the required quality using KNN-based filtering.

Quality rules validation: The generated quality rules from the discovery component are set in the DQP Level 6. The rules validation process starts when the DQR list is applied to the sample set S , resulting in a pre-processed sample set S’ , which is generated by the related pre-processing activities. Then, a quality evaluation process is invoked on S’ , generating DQD scores for S’ . Thus, a score comparison between S and S’ is conducted to filter only qualified and valid rules with a higher percentage of success among data. After analyzing these scores, two sets of rules are identified: successful and failed rules.
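The comparison between S and S’ can be sketched as follows. The sketch assumes each rule is a named transformation of a sample and that a rule is successful when it strictly improves the score; `validate_rules`, `completeness`, `drop_missing`, and `keep_as_is` are hypothetical names used only for illustration.

```python
def validate_rules(rules, sample, score_fn):
    """Split rules into successful and failed by comparing scores on S and S'."""
    baseline = score_fn(sample)
    successful, failed = [], []
    for name, transform in rules:
        processed = transform(sample)            # S' for this rule
        gain = score_fn(processed) - baseline    # score(S') - score(S)
        (successful if gain > 0 else failed).append((name, gain))
    return successful, failed


def completeness(sample):
    """Ratio of non-missing values; an empty sample scores 0."""
    return sum(v is not None for v in sample) / len(sample) if sample else 0.0


def drop_missing(sample):
    """Cleansing rule: remove observations with missing values."""
    return [v for v in sample if v is not None]


def keep_as_is(sample):
    """A rule with no effect, included to show a failed validation."""
    return list(sample)


sample = [1, None, 3, None]
rules = [("drop_missing", drop_missing), ("keep_as_is", keep_as_is)]
ok, ko = validate_rules(rules, sample, completeness)
print(ok, ko)
```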

Quality rules optimization: After the set of discovered valid quality rules is selected, an optimization process is activated to reorganize and filter the rules. This is due to the nature of the evaluation parameters set in the mapping component and the refinement of the quality requirement. These choices with the rule’s validation process will produce a list of individual quality rules that, if applied as generated, might have the following consequences:

Redundant rules.

Ineffective rules due to the order of execution.

Multiple rules, which target the same DQD with the same requirements.

Multiple rules, which target the same attributes for the same DQD and requirements.

Rules that drop attributes or rows must be applied first, or be given a higher priority, to avoid applying rules to data items that are meant to be dropped (Table 8).

The quality rules optimization component applies an optimization scheme to the list of valid quality rules before their application to production data in the pre-processing phase. The predefined optimization schemes vary according to the following:

Rules execution priority per attribute or DQD, per pre-processing activity, or pre-processing function.

Rules redundancy removal per attributes or DQDs.

Rules grouping, combination, per activity, per attribute, per DQD’s, or duplicates.

For invalid rules, the component consists of several actions, including rules removal or rules adaptation from previously generated proposals in the exploratory quality profiling component for the same targeted tuple (attributes, DQDs).
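Two of the optimization schemes, redundancy removal and execution priority, can be sketched in a few lines of Python. The rule dictionaries and the convention that drop activities are named with a `drop` prefix are assumptions made for this illustration; `optimize_rules` is a hypothetical name.

```python
def optimize_rules(rules):
    """Remove duplicate rules and order them so that drop rules run first."""
    seen, unique = set(), []
    for rule in rules:
        key = (rule["attribute"], rule["dqd"], rule["activity"], rule["function"])
        if key not in seen:
            seen.add(key)
            unique.append(rule)
    # sorted() is stable, so rules with equal priority keep their original order.
    return sorted(unique, key=lambda r: 0 if r["activity"].startswith("drop") else 1)


rules = [
    {"attribute": "age", "dqd": "completeness",
     "activity": "replace_missing", "function": "mean"},
    {"attribute": "notes", "dqd": "completeness",
     "activity": "drop_attribute", "function": "drop"},
    {"attribute": "age", "dqd": "completeness",
     "activity": "replace_missing", "function": "mean"},   # duplicate of the first rule
]
optimized = optimize_rules(rules)
print([(r["attribute"], r["activity"]) for r in optimized])
```

Running the drop rule first avoids imputing values in an attribute that is about to be removed.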

Quality rules execution: The quality rules execution consists of pre-processing data using the DQP, which embeds the data quality rules that enhance the quality to reach the agreed requirements. As part of the monitoring module, a sampling set from the pre-processed data is used to re-assess the quality and detect eventual failures.

Quality monitoring

Quality monitoring is a continuous quality control process, which relies on the DQP. The purpose of monitoring is to validate the DQP across all the Big Data lifecycle processes. The QP repository is updated during and after the complete lifecycle, as well as after user feedback on the data, quality requirements, and mapping.

As illustrated in Fig.  11 , the monitoring process takes a scheduled snapshot of the pre-processed Big Data all along the BDQMF for the BDQ project. This data snapshot is a set of samples that have their quality evaluated in the BDQMF component (4). Then, quality control is conducted on the quality scores, and an update is performed to the DQP. The quality report may highlight the quality failure and its ratio evolution through multiple sampling snapshots of data.

Fig. 11. Quality monitoring component

The monitoring process strengthens and enforces the quality across the Big Data value chain using the BDQM framework while reusing the data quality profile information. For each quality monitoring iteration on the datasets from the data source, quality reports are added to the data quality profile, updating it to a DQP Level 10 .
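A monitoring iteration over scheduled snapshots can be sketched as below. This is a minimal illustration assuming a single quality score per snapshot and a minimum acceptance level; `monitor` and `completeness` are hypothetical names, and the report structure is not the framework's DQP format.

```python
def monitor(snapshots, evaluate, minimum):
    """Evaluate each scheduled snapshot and report score evolution and failures."""
    report, previous = [], None
    for index, snapshot in enumerate(snapshots):
        score = evaluate(snapshot)
        report.append({
            "snapshot": index,
            "score": score,
            "change": None if previous is None else score - previous,
            "failed": score < minimum,
        })
        previous = score
    return report


def completeness(sample):
    """Ratio of non-missing values in a snapshot sample."""
    return sum(v is not None for v in sample) / len(sample)


snapshots = [[1, 2, 3, 4], [1, None, 3, None]]
report = monitor(snapshots, completeness, minimum=0.8)
for entry in report:
    print(entry)
```

Each report entry shows whether the snapshot failed the requirement and how the ratio evolved since the previous snapshot.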

Data processing, analytics, and visualization

This process involves the application of algorithms or methodologies, which extract insights from the ready-to-use data, with enhanced quality. Then, the value of processed data is projected visually as a dashboard and graphically enhanced charts for the decision-makers to act economically. Big Data visualization approaches are of high importance for the final exploitation of the data.

Implementations: Dataflow and quality processes development

In this section, we give an overview of the dataflow across the various processes of the framework. We also highlight the implemented quality management processes along with the application interfaces developed to support the main processes. Finally, we describe the ongoing process implementations and evaluations.

Framework dataflow

In Fig.  12 , we illustrate the whole process flow of the framework, from the inception of the quality project in its specification and requirements to the quality monitoring phase. As an ongoing process, monitoring is a part of the quality enforcement loop and may trigger other processes that handle several quality profile operations like DQP adaptation, upgrade, or reuse.

Fig. 12. Big data quality management framework data flow

In Table 9 , we enumerate and detail the multiple processes and their interactions within the BDQM Framework components including their inputs and outputs after executing related activities with the quality profile (DQP), as detailed in the previous section.

Quality management processes’ implementation

In this section, we describe the implementation of our framework's important components, processes, and their contributions towards the quality management of Big Data across its lifecycle.

Core processes implementation

As described above, the core framework processes have been implemented and evaluated. In the following, we describe how this was done for each component.

Quality profiling : one of the central components of our framework is the data quality profile (DQP). Initially, the DQP implements a simple data profile of a Big Data set as an XML file (DQP Sample illustrated in Fig.  13 ).

Fig. 13. Example of data quality profile

After traversing the processes of several framework components, it is updated to a data quality profile. The data quality evaluation process is one of the activities that updates the DQP with quality scores that are later used to discover data quality rules. These rules, when applied to the original data, will ensure an output data set with higher quality. The DQP is finally executed by the pre-processing component. By the end of the lifecycle, the DQP contains all pieces of information, such as data quality rules that target a set of data sources with multiple datasets, data attributes, data quality dimensions such as accuracy, and pre-processing activities like data cleansing, data integration, and data normalization. The Data Quality Profile (DQP) contains all the information about the data, its quality, the user quality requirements, DQDs, quality levels, attributes, the Data Quality Evaluation Scheme (DQES), quality scores, and the data quality rules. The DQP is stored in the DQP repository, which contains the following modules and performs many tasks related to the DQP. In the following, the DQP lifecycle and its repository are described.

Quality requirement dashboard : developed as a web-based application as shown in Fig.  14 below to capture user’s requirements and other quality information. Such requirements include for instance data quality dimension requirements specification. This application can be extended with extra information about data sources such as attributes and their types. The user is guided through the interface to specify the right attributes’ values and also given the option to upload an XML file containing the relationship between attributes. The recorded requirements are finally saved to a data quality profile level 0 which will be used in the next stage of the quality management process.

Fig. 14. Quality requirements dashboard

Data preparation and sampling : The framework operations start when the quality project's minimal specifications are set. It initiates and provides a data quality summary named data quality profile (DQP) by running an exploratory quality profiling assessment on data samples (using BLB sampling algorithm). This DQP is projected to be the core component of the framework and every update and every result regarding the quality is noted/recorded. The DQP is stored in a quality repository and registered in the Big Data’s provenance to keep track of data changes due to quality enhancements.

Data quality mapping and rule discovery components : data quality mapping eases the whole data quality assessment process and adds more data quality control to it. The implemented mapping links and categorizes all the elements required by the quality project, from Big Data quality characteristics, pre-processing activities, and their related technique functions, to data quality rules, dimensions, and their metrics. The implementation of data quality rules discovery from evaluation results reveals the actions and transformations that, when applied to the data set, will achieve the targeted quality level. These rules are the main ingredients of pre-processing activities. The role of a DQ rule is to address the sources of bad quality by defining a list of actions related to each quality score. The DQ rules are the results of systematic and planned data quality assessment analysis.

Quality profile repository (QPREPO) : Finally, our framework implements the QPREPO to manage the data quality profiles for different data types and domains and to adapt or optimize existing profiles. This repository manages the data quality dimensions with their related metrics, and the pre-processing activities with their activity functions. A QPREPO entry is implemented for each Big Data quality project with the related DQP containing information about each dataset, data source, data domain, and data user. This information is essential for DQP reuse, adaptation, and enhancement for the same or different data sources.

Implemented approaches for quality assessment

The framework uses various approaches for quality assessment: (1) Exploratory Quality Profiling; (2) a Quantitative Quality Assessment approach using DQD metrics; and (3) a Qualitative Quality Assessment, which is anticipated to be added as a new component.

Exploratory Quality Profiling implements an automatic quality evaluation that is done systematically on all data attributes for basic DQDs. The resulting scores are used to generate quality rules for each quality tolerance ratio variation. These rules are then applied to other data samples and the quality is reassessed. An analysis of the results provides an interactive quality-based rules search using several ranking algorithms (maximization, minimization, applying weight).

The Quantitative Quality Assessment implements a quick data quality evaluation strategy supported through sampling and profiling processes for Big Data. The evaluation is conducted by measuring the data quality dimensions (DQDs) for attributes using specific metrics to calculate a quality score.

The Qualitative Quality Assessment approach implements a deep quality assessment to discover hidden quality aspects and their impact on the Big Data Lifecycle outputs. These quality aspects must be quantified into scores and mapped with related attributes and DQD’s. This quantification is achieved by applying several feature selection strategies and algorithms to data samples. These qualitative insights are combined with those obtained before the quantitative quality evaluation early in the Quality management process.

Framework development, deployment, and evaluation

Development, deployment, and evaluation of our BDQMF framework follow a systematic modular approach where the various components of the framework are developed and tested independently, then integrated with the other components to compose the integrated solution. Most of the components are implemented in R and Python using the SparkR and PySpark libraries, respectively. The supporting files, like the DQP, DQES, and configuration files, are written in XML and JSON formats. Big Data quality project requests and constraints, including the data sources and the quality expectations, are implemented within the solution, where more than one module might be involved. The BDQMF components are deployed following the Apache Hadoop and Spark ecosystem architecture.

The deployed BDQMF modules and the developed APIs are described below:

Quality settings mapper (QSM): implements an interface for the automatic selection and mapping of DQDs and dataset attributes from the initial DQP.

Quality settings parser (QSP): responsible for parsing and loading parameters to the execution environment from DQP settings to data files. It is also used to extract quality rules and scores from the DQES in the DQP.

Data loader (DL): implements filtering, selecting, and loading of all types of data files required by the BDQMF, including datasets from data sources, into the Spark environment (e.g. DataFrames, tables). The loaded data is either used by various processes or persisted in the database for further reuse. For data selection, the loader uses SQL to retrieve only the attributes set in the DQP settings.
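The attribute selection performed by the data loader can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module instead of Spark SQL, and the `dqp_settings` dictionary, the `patients` table, and the `build_select` helper are hypothetical names, not part of the framework.

```python
import sqlite3
from typing import Dict, List, Union

# Hypothetical fragment of the DQP settings: source table and selected attributes.
dqp_settings: Dict[str, Union[str, List[str]]] = {
    "table": "patients",
    "attributes": ["id", "age"],
}


def build_select(settings: Dict[str, Union[str, List[str]]]) -> str:
    """Build a SELECT statement that retrieves only the attributes in the settings."""
    table = settings["table"]
    columns = settings["attributes"]
    # Identifiers cannot be bound as SQL parameters, so they are validated instead.
    for name in [table, *columns]:
        if not isinstance(name, str) or not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    return f"SELECT {', '.join(columns)} FROM {table}"


# In-memory database standing in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER, age INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?)",
    [(1, 34, "Al Ain"), (2, 51, "Montreal")],
)

query = build_select(dqp_settings)
rows = conn.execute(query).fetchall()
print(query)
print(rows)
```

Selecting only the configured attributes keeps unneeded columns out of the execution environment, which matters when the source tables are wide.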

Data samples generator (DSG): it generates data samples from multiple data sources.
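A minimal sketch of sample generation from several sources is shown below. It assumes simple random sampling without replacement from the pooled rows; the source names, the sample sizes, and the function `generate_samples` are hypothetical, and the framework may well use a different sampling scheme.

```python
import random
from typing import Any, Dict, List

Row = Dict[str, Any]


def generate_samples(sources: Dict[str, List[Row]],
                     sample_size: int,
                     n_samples: int,
                     seed: int = 0) -> List[List[Row]]:
    """Draw `n_samples` random samples from the rows pooled across all sources."""
    rng = random.Random(seed)  # a fixed seed keeps the samples reproducible
    pool = [row for rows in sources.values() for row in rows]
    size = min(sample_size, len(pool))  # never ask for more rows than exist
    return [rng.sample(pool, size) for _ in range(n_samples)]


# Two hypothetical data sources with five rows each.
sources = {
    "clinic_a": [{"source": "clinic_a", "id": i} for i in range(5)],
    "clinic_b": [{"source": "clinic_b", "id": i} for i in range(5)],
}

samples = generate_samples(sources, sample_size=4, n_samples=3)
for sample in samples:
    print([(row["source"], row["id"]) for row in sample])
```

Drawing several samples rather than one lets the quality scores computed on each sample be compared, so that rules generated on one sample can be checked on the others.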

Quality inspector and profiler (QIP): responsible for all qualitative and quantitative quality evaluations of data samples in all the BDQMF lifecycle phases. The inspector assesses all the default and required DQDs, and all quality evaluations are recorded in the DQES within the DQP file.

Preprocessing activities and functions execution engine (PPAF-E): all the preprocessing activities in the repository, along with their related functions, are implemented as APIs in Python and R. When requested, this library loads the necessary methods and executes them within the preprocessing activities for rules validation and rules execution in phase 9.

Quality rules manager (QRM): one of the most important modules of the framework. It implements and delivers the following features:

Analyzes quality results.

Discovers and generates quality rules proposals.

Validates quality rules against the requirements settings.

Refines and optimizes quality rules.

Performs ACID operations on quality rules in the DQP files and the repository.

Quality monitor (QM): responsible for monitoring, triggering, and reporting any quality change across the Big Data lifecycle, to assure the efficiency of the quality improvement achieved by the discovered data quality rules.
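The monitoring idea can be sketched as a comparison between a baseline set of quality scores and a later one. The threshold, the score values, and the function `detect_quality_changes` are assumptions made for this illustration; the source does not specify how the monitor detects or reports changes.

```python
from typing import Dict, Tuple


def detect_quality_changes(baseline: Dict[str, float],
                           current: Dict[str, float],
                           threshold: float = 0.05) -> Dict[str, Tuple[float, float]]:
    """Return attributes whose score moved by more than `threshold`.

    Each reported attribute maps to a (baseline score, current score) pair.
    Attributes missing from `current` are skipped.
    """
    changes = {}
    for attribute, old_score in baseline.items():
        new_score = current.get(attribute)
        if new_score is not None and abs(new_score - old_score) > threshold:
            changes[attribute] = (old_score, new_score)
    return changes


# Hypothetical scores measured at two points in the lifecycle.
baseline = {"age": 0.95, "city": 0.75, "id": 1.0}
current = {"age": 0.80, "city": 0.76, "id": 1.0}

report = detect_quality_changes(baseline, current)
print(report)
```

In a deployed setting, a non-empty report would be the trigger for re-running the quality evaluation or revisiting the affected rules.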

BDQMF-Repo: the repository where all the quality-related files, settings, requirements, and results are stored. The repository uses HBase or MongoDB to meet the requirements of Big Data ecosystem environments and to scale for intensive data updates.

Big Data quality has attracted the attention of researchers because it is considered the key differentiator that leads to high-quality insights and data-driven decisions. In this paper, we proposed a Big Data Quality Management Framework that addresses end-to-end quality across the Big Data lifecycle. The framework is based on a Data Quality Profile (DQP), which is augmented with valuable information as it travels across the different stages of the framework, starting from the Big Data project parameters and continuing through quality requirements, quality profiling, and quality rules proposals. The exploratory quality profiling feature, which extracts quality information from the data, helped in building a robust DQP with a quality rules proposal and served as a stepping stone for configuring the data quality evaluation scheme. Moreover, the extracted quality rules proposals are highly beneficial for the quality dimensions mapping and attribute selection component, which provides users with quality indicators characterized by their profile.

The framework dataflow shows that the quality of any Big Data set is evaluated through the exploratory quality profiling component and through quality rules extraction and validation, with the aim of improving its quality. It is of great importance to select the right combination of targeted DQD levels, observations (rows), and attributes (columns) for efficient quality results, without sacrificing vital data by considering only one DQD. The resulting quality profile, based on the quality assessment results, confirms that the quality information it contains significantly improves the quality of Big Data.

In future work, we plan to extend the quantitative quality profiling with qualitative evaluation. We also plan to extend the framework to cope with unstructured Big Data quality assessment.

Availability of data and materials

The data used in this work are available from the first author and can be provided upon request. The data include sampling data, pre-processed data, etc.




This work is supported by fund #12R005 from ZCHS at UAE University.

Author information

Authors and affiliations.

College of Technological Innovation, Zayed University, P.O. Box 144534, Abu Dhabi, United Arab Emirates

Ikbal Taleb

College of Information Technology, UAE University, P.O. Box 15551, Al Ain, United Arab Emirates

Mohamed Adel Serhani

Department of Statistics, College of Business and Economics, UAE University, P.O. Box 15551, Al Ain, United Arab Emirates

Chafik Bouhaddioui

Concordia Institute for Information Systems Engineering, Concordia University, Montreal, QC, H4B 1R6, Canada

Rachida Dssouli



IT conceived the main conceptual ideas related to the Big Data quality framework and the proof outline. He designed the framework and its main modules, and he also worked on the implementation and validation of some of the framework’s components. MAS supervised the study and was in charge of direction and planning. He also contributed to a couple of sections, including the introduction, the abstract, the framework, and the implementation and conclusion sections. CB contributed to data preparation, sampling, and profiling. He also reviewed and validated all formulations and statistical modeling included in this work. RD contributed to the review and discussion of the core contributions and their validation. All authors read and approved the final manuscript.

Authors’ information

Dr. Ikbal Taleb is currently an Assistant Professor in the College of Technological Innovation, Zayed University, Abu Dhabi, U.A.E. He received his Ph.D. in information and systems engineering from Concordia University in 2019 and his M.Sc. in Software Engineering from the University of Montreal, Canada, in 2006. His research interests include data and Big Data quality, quality profiling, quality assessment, cloud computing, web services, and mobile web services.

Prof. M. Adel Serhani is currently a Professor and Assistant Dean for Research and Graduate Studies in the College of Information Technology, U.A.E. University, Al Ain, U.A.E. He is also an adjunct faculty member in CIISE, Concordia University, Canada. He received his Ph.D. in Computer Engineering from Concordia University in 2006 and his M.Sc. in Software Engineering from the University of Montreal, Canada, in 2002. His research interests include cloud for data-intensive e-health applications and services; SLA enforcement in cloud data centers; the Big Data value chain; cloud federation and monitoring; non-invasive smart health monitoring; management of communities of Web services; and Web services applications and security. He has extensive experience gained through his involvement in and management of different R&D projects. He has served on several organizing and technical program committees: he was the program Co-Chair of the International Conference on Web Services (ICWS’2020), Co-Chair of the IEEE Conference on Innovations in Information Technology (IIT’13), Chair of the IEEE Workshop on Web Service (IWCMC’13), Chair of the IEEE Workshop on Web, Mobile, and Cloud Services (IWCMC’12), and Co-Chair of the International Workshop on Wireless Sensor Networks and their Applications (NDT’12). He has published around 130 refereed publications, including conference papers, journal articles, a book, and book chapters.

Dr. Chafik Bouhaddioui is an Associate Professor of Statistics in the College of Business and Economics at UAE University. He received his Ph.D. from the University of Montreal in Canada. He worked as a lecturer at Concordia University for 4 years. He has rich experience in applied statistics in finance in the private and public sectors. He worked as an assistant researcher in the Finance Ministry in Canada, and as a Senior Analyst at the National Bank of Canada, where he developed statistical methods used in stock market forecasting. In 2004, he joined a team of researchers in the finance group at CIRANO in Canada to develop statistical tools and modules for finance and risk analysis. He has published several papers in well-known journals on multivariate time series analysis and its applications in economics and finance. His research areas are diverse and include modeling and prediction in multivariate time series, causality and independence tests, biostatistics, and Big Data.

Prof. Rachida Dssouli is a full professor and Director of the Concordia Institute for Information Systems Engineering, Faculty of Engineering and Computer Science, Concordia University. Dr. Dssouli received a Master's degree (1978), a Diplôme d'études approfondies (1979), and a Doctorat de 3ème cycle in Networking (1981) from Université Paul Sabatier, Toulouse, France. She earned her Ph.D. in Computer Science (1987) from Université de Montréal, Canada. Her research interests are in Communication Software Engineering, a subdiscipline of Software Engineering. Her contributions are in testing based on formal methods, requirements engineering, systems engineering, telecommunication service engineering, and quality of service. She has published more than 200 papers in journals and refereed conferences in her area of research. She has supervised or co-supervised more than 50 graduate students, among them 20 Ph.D. students. Dr. Dssouli has been the founding Director of the Concordia Institute for Information Systems Engineering (CIISE) since June 2002. The Institute now hosts more than 550 graduate students and 20 faculty members, 4 master's programs, and a Ph.D. program.

Corresponding author

Correspondence to Mohamed Adel Serhani.

Ethics declarations

Ethics approval and consent to participate, consent for publication, competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article.

Taleb, I., Serhani, M.A., Bouhaddioui, C. et al. Big data quality framework: a holistic approach to continuous quality management. J Big Data 8, 76 (2021). https://doi.org/10.1186/s40537-021-00468-0


Received: 06 February 2021

Accepted: 15 May 2021

Published: 29 May 2021

DOI: https://doi.org/10.1186/s40537-021-00468-0


  • Big data quality
  • Quality assessment
  • Quality metrics and scores
  • Pre-processing

  • Review Article
  • Open access
  • Published: 05 March 2020

Big data in digital healthcare: lessons learnt and recommendations for general practice

  • Raag Agrawal 1,2 &
  • Sudhakaran Prabakaran (ORCID: orcid.org/0000-0002-6527-1085) 1,3,4

Heredity volume 124, pages 525–534 (2020)


  • Developing world

Big Data will be an integral part of the next generation of technological developments, allowing us to gain new insights from the vast quantities of data being produced by modern life. There is significant potential for the application of Big Data to healthcare, but there are still some impediments to overcome, such as fragmentation, high costs, and questions around data ownership. Envisioning a future role for Big Data within the digital healthcare context means balancing the benefits of improving patient outcomes against the potential pitfalls of increasing physician burnout when poor implementation adds complexity. Oncology, the field where Big Data collection and utilization got a head start with programs like TCGA and the Cancer Moon Shot, provides an instructive example, as we see different perspectives provided by the United States (US), the United Kingdom (UK), and other nations in the implementation of Big Data in patient care with regard to their centralization and regulatory approach to data. By drawing upon global approaches, we propose recommendations for guidelines and regulations of data use in healthcare, centering on the creation of a unique global patient ID that can integrate data from a variety of healthcare providers. In addition, we expand upon the topic by discussing potential pitfalls of Big Data, such as the lack of diversity in Big Data research and the security and transparency risks posed by machine learning algorithms.


The advent of Next Generation Sequencing promises to revolutionize medicine, as it has become possible to cheaply and reliably sequence entire genomes, transcriptomes, proteomes, metabolomes, etc. (Shendure and Ji 2008; Topol 2019a). “Genomical” data alone is predicted to be in the range of 2–40 exabytes by 2025, eclipsing the amount of data acquired by all other technological platforms (Stephens et al. 2015). In 2018, the price for research-grade sequencing of the human genome had dropped to under $1000 (Wetterstrand 2019). Other “omics” techniques such as proteomics have also become accessible and cheap, and have added depth to our knowledge of biology (Hasin et al. 2017; Madhavan et al. 2018). Consumer device development has also led to significant advances in clinical data collection, as it becomes possible to continuously collect patient vitals and analyze them in real time. In addition to the reductions in the cost of sequencing strategies, computational power and storage have become extremely cheap. While all these developments have brought enormous advances in disease diagnosis and treatment, they have also introduced new challenges, as large-scale information becomes increasingly difficult to store, analyze, and interpret (Adibuzzaman et al. 2018). This problem has given way to a new era of “Big Data” in which scientists across a variety of fields are exploring new ways to understand the large amounts of unstructured and unlinked data generated by modern technologies, and leveraging it to discover new knowledge (Krumholz 2014; Fessele 2018). Successful scientific applications of Big Data have already been demonstrated in biology, as initiatives such as the Genotype-Tissue Expression Project are producing enormous quantities of data to better understand genetic regulation (Aguet et al. 2017). Yet, despite these advances, we see few examples of Big Data being leveraged in healthcare, even though it presents opportunities for creating personalized and effective treatments.

Effective use of Big Data in Healthcare is enabled by the development and deployment of machine learning (ML) approaches. ML approaches are often interchangeably used with artificial intelligence (AI) approaches. ML and AI only now make it possible to unravel the patterns, associations, correlations and causations in complex, unstructured, nonnormalized, and unscaled datasets that the Big Data era brings (Camacho et al. 2018 ). This allows it to provide actionable analysis on datasets as varied as sequences of images (applicable in Radiology) or narratives (patient records) using Natural Language Processing (Deng et al. 2018 ; Esteva et al. 2019 ) and bringing all these datasets together to generate prediction models, such as response of a patient to a treatment regimen. Application of ML tools is also supplemented by the now widespread adoption of Electronic Health Records (EHRs) after the passage of the Affordable Care Act (2010) and Health Information Technology for Economic and Clinical Health Act (2009) in the US, and recent limited adoption in the National Health Service (NHS) (Garber et al. 2014 ). EHRs allow patient data to become more accessible to both patients and a variety of physicians, but also researchers by allowing for remote electronic access and easy data manipulation. Oncology care specifically is instructive as to how Big Data can make a direct impact on patient care. Integrating EHRs and diagnostic tests such as MRIs, genomic sequencing, and other technologies is the big opportunity for Big Data as it will allow physicians to better understand the genetic causes behind cancers, and therefore design more effective treatment regimens while also improving prevention and screening measures (Raghupathi and Raghupathi 2014 ; Norgeot et al. 2019 ). Here, we survey the current challenges in Big Data in healthcare and use oncology as an instructive vignette, highlighting issues of data ownership, sharing, and privacy. 
Our review builds on findings from the US, UK, and other global healthcare systems to propose a fundamental reorganization of EHRs around unique patient identifiers and ML.

Current successes of Big Data in healthcare

The UK and the US are both global leaders in healthcare that will play important roles in the adoption of Big Data. We see this global leadership already in oncology (The Cancer Genome Atlas (TCGA), Pan-Cancer Analysis of Whole Genomes (PCAWG)) and neuropsychiatric diseases (PsychENCODE) (Tomczak et al. 2015; Akbarian et al. 2015; Campbell et al. 2020). These Big Data generation and open-access models have resulted in hundreds of applications and scientific publications. The success of these initiatives in convincing the scientific and healthcare communities of the advantages of sharing clinical and molecular data has led to major Big Data generation initiatives in a variety of fields across the world, such as the “All of Us” project in the US (Denny et al. 2019). The UK has now established a clear national strategy that has resulted in the likes of the UK Biobank and 100,000 Genomes projects (Topol 2019b). These projects dovetail with a national strategy for the implementation of genomic medicine with the opening of multiple genome-sequencing sites, and the introduction of genome sequencing as a standard part of care for the NHS (Marx 2015). The US has no such national strategy, and while it has started its own large genomic study, “All of Us”, it does not have any plans for implementation in its own healthcare system (Topol 2019b). In this review, we have focused our discussion on developments in Big Data in Oncology as a method to understand this complex and fast-moving field, and to develop general guidelines for healthcare at large.

Big Data initiatives in the United Kingdom

The UK Biobank is a prospective cohort initiative that is composed of individuals between the ages of 40 and 69 before disease onset (Allen et al. 2012 ; Elliott et al. 2018 ). The project has collected rich data on 500,000 individuals, collating together biological samples, physical measures of patient health, and sociological information such as lifestyle and demographics (Allen et al. 2012 ). In addition to its size, the UK Biobank offers an unparalleled link to outcomes through integration with the NHS. This unified healthcare system allows researchers to link initial baseline measures with disease outcomes, and with multiple sources of medical information from hospital admission to clinical visits. This allows researchers to be better positioned to minimize error in disease classification and diagnosis. The UK Biobank will also be conducting routine follow-up trials to continue to provide information regarding activity and further expanded biological testing to improve disease and risk factor association.

Beyond the UK Biobank, Public Health England launched the 100,000 Genomes project with the intent to understand the genetic origins behind common cancers (Turnbull et al. 2018). The massive effort consists of NHS patients consenting to have their genome sequenced and linked to their health records. Without the significant phenotypic information collected in the UK Biobank, the project holds limited use as a prospective epidemiological study, but it is a great tool for researchers interested in identifying disease-causing single-nucleotide polymorphisms (SNPs). The size of the dataset itself is its main advance, as it provides the statistical power to discover the associated SNPs even for rare diseases. Furthermore, the 100,000 Genomes Project’s ancillary aim is to stimulate private sector growth in the genomics industry within England.

Big Data initiatives in the United States and abroad

In the United States, the “All of Us” project is expanding upon the UK Biobank model by creating a direct link between patient genome data and their phenotypes by integrating EHRs, behavioral, and family data into a unique patient profile (Denny et al. 2019 ). By creating a standardized and linked database for all patients—“All of Us” will allow researchers greater scope than the UK BioBank to understand cancers and discover the associated genetic causes. In addition, “All of Us” succeeds in focusing on minority populations and health, an area of focus that sets it apart and gives it greater clinical significance. The UK should learn from this effort by expanding the UK Biobank project to further include minority populations and integrate it with ancillary patient data such as from wearables—the current UK Biobank has ~500,000 patients that identify as white versus ~12,000 (i.e., just <2.5%) that identified as non-white (Cohn et al. 2017 ). Meanwhile, individuals of Asian ethnicities made up over 7.5% of the UK population as per the 2011 UK Census, with the proportion of minorities projected to rise in the coming years (O’Brien and Potter-Collins 2015 ; Cohn et al. 2017 ).

Sweden too provides an informative example of the power of investment in rich electronic research registries (Webster 2014). The Swedish government has committed over $70 million in funding per annum to expand a variety of cancer registries that would allow researchers insight into risk factors for oncogenesis. In addition, its data sources are particularly valuable for scientists, as each patient’s entries are linked to unique identity numbers that can be cross-referenced with over 90 other registries to give a more complete understanding of a patient’s health and social circumstances. These registries are not limited to disease states and treatments, but also encompass extensive public administrative records that can provide researchers considerable insight into social indicators of health such as income, occupation, and marital status (Connelly et al. 2016). These data sources become even more valuable to Swedish researchers as they have been in place for decades with commendable consistency, increasing the power of long-term analysis (Connelly et al. 2016). Other nations can learn from the Swedish example by paying particular attention to the use of unique patient identifiers that can map onto a number of datasets collected by government and academia, an idea that was first mentioned in the US Health Insurance Portability and Accountability Act of 1996 (HIPAA) but has not yet been implemented (Davis 2019).
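The linkage idea described above can be illustrated with a minimal sketch. It assumes three tiny, made-up registries that share a `person_id` field; the field names, identifiers, and values are hypothetical and are not taken from any real Swedish registry.

```python
from collections import defaultdict

# Hypothetical registries; every row carries the same made-up person_id key.
cancer_registry = [
    {"person_id": "P001", "diagnosis": "C50", "year": 2015},
    {"person_id": "P002", "diagnosis": "C61", "year": 2017},
]
income_registry = [
    {"person_id": "P001", "income_band": "middle"},
    {"person_id": "P003", "income_band": "high"},
]
civil_registry = [
    {"person_id": "P001", "marital_status": "married"},
    {"person_id": "P002", "marital_status": "single"},
]


def link_registries(*registries):
    """Merge rows from several registries into one profile per person_id.

    Simplification: if a person appears twice in one registry, later
    fields overwrite earlier ones.
    """
    profiles = defaultdict(dict)
    for registry in registries:
        for row in registry:
            profiles[row["person_id"]].update(row)
    return dict(profiles)


profiles = link_registries(cancer_registry, income_registry, civil_registry)
print(profiles["P001"])
```

Because every dataset uses the same key, adding another registry only requires passing one more list to `link_registries`; no pairwise matching on names or addresses is needed.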

China has recently become a leader in implementation and development of new digital technologies, and it has begun to approach healthcare with an emphasis on data standardization and volume. Already, the central government in China has initiated several funding initiatives aimed at pushing Big Data into healthcare use cases, with a particular eye on linking together administrative data, regional claims data from the national health insurance program, and electronic medical records (Zhang et al. 2018). China hopes to do this through leveraging its existing personal identification system that covers all Chinese nationals, similar to the Swedish model of maintaining a variety of regional and national registries linked by personal identification numbers. This is particularly relevant to cancer research as China has established a new cancer registry (National Central Cancer Registry of China) that will take advantage of the nation’s population size to give unique insight into otherwise rare oncogenesis. Major concerns regarding this initiative are data quality and time. China has only relatively recently adopted the International Classification of Diseases (ICD) revision ten coding system, a standardized method for recording disease states alongside prescribed treatments. China is also still implementing standardized record keeping terminologies at the regional level. This creates considerable heterogeneity in data quality, as well as a lack of interoperability between regions, which is a major obstacle in any national registry effort (Zhang et al. 2018). The recency of these efforts also means that some time is required until researchers will be able to take advantage of longitudinal analysis, which is vital for oncology research that aims to spot recurrences or track patient survival.
In the future we can expect significant findings to come out of China’s efforts to make hundreds of millions of patient files available to researchers, but significant obstacles in standards of care and interoperability must first be overcome.

The large variety of “Big Data” research projects being undertaken around the world are proposing different approaches to the future of patient records. The UK is broadly leveraging the centralization of the NHS to link genomic data with clinical care records, and opening up the disease endpoints to researchers through a patient ID. Sweden and China are also adopting this model—leveraging unique identity numbers issued to citizens to link otherwise disconnected datasets from administrative and healthcare records (Connelly et al. 2016 ; Cnudde et al. 2016 ; Zhang et al. 2018 ). In this way, tests, technologies and methods will be integrated in a way that is specific to the patient but not necessarily to the hospital or clinic. This allows for significant flexibility in the seamless transfer of information between sites and for physicians to take full advantage of all the data generated. The US’ “All of Us” program is similar in integrating a variety of patient records into a single-patient file that is stored in the cloud (Denny et al. 2019 ). However, it does not significantly link to public administrative data sources, and thus is limited in its usefulness for long-term analysis of the effects of social contributors to cancer progression and risk. This foretells greater problems with the current ecosystem of clinical data—where lack of integration, misguided design, and ambiguous data ownership make research and clinical care more difficult rather than easier.

Survey of problems in clinical data use


Fragmentation is the primary problem that needs to be addressed if EHRs have any hope of being used in any serious clinical capacity. Fragmentation arises when EHRs are unable to communicate effectively with one another, locking patient information into a proprietary system. While there are major players in the US EHR space such as Epic and General Electric, there are also dozens of minor and niche companies that produce their own products, many of which are not able to communicate effectively or easily with one another (DeMartino and Larsen 2013). The Clinical Oncology Requirements for the EHR and the National Community Cancer Centers Program have both spoken out about the need for interoperability requirements for EHRs and even published guidelines (Miller 2011). In addition, the Certification Commission for Health Information Technology was created to issue guidelines and standards for interoperability of EHRs (Miller 2011). Fast Healthcare Interoperability Resources (FHIR) is the current standard for healthcare data exchange published by Health Level 7 (HL7). It builds upon past standards from both HL7 and a variety of other sources such as the Reference Information Model. FHIR offers new principles on which data sharing can take place through RESTful APIs, and projects such as Argonaut are working to expand adoption to EHRs (Chambers et al. 2019). Even with the introduction of the HL7 Ambulatory Oncology EHR Functional Profile, EHRs have not improved and have actually become pain points for clinicians as they struggle to integrate the diagnostics from separate labs or hospitals, and can even leave physicians in the dark about clinical history if the patient has moved providers (Reisman 2017; Blobel 2018).
Even in integrated care providers such as Kaiser Permanente there are interoperability issues that make EHRs unpopular among clinicians as they struggle to receive outside test results or the narratives of patients who have recently moved (Leonard and Tozzi 2012 ).
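To make the RESTful exchange mentioned above more concrete, the following sketch builds FHIR-style read and search URLs and parses a minimal Patient resource. The server address and the patient data are hypothetical, and no network request is made; the URL patterns (`[base]/[type]/[id]` for a read, `[base]/[type]?parameter=value` for a search) and the `resourceType`, `name`, and `birthDate` fields follow the published FHIR conventions.

```python
import json
from urllib.parse import urlencode

# Hypothetical FHIR server base URL; not a real endpoint.
BASE_URL = "https://fhir.example.org/r4"


def read_url(resource_type, resource_id):
    """URL for reading one resource by its logical id."""
    return f"{BASE_URL}/{resource_type}/{resource_id}"


def search_url(resource_type, **params):
    """URL for searching a resource type with query parameters."""
    return f"{BASE_URL}/{resource_type}?{urlencode(params)}"


# A minimal, made-up Patient resource as a server might return it.
patient_json = """
{
  "resourceType": "Patient",
  "id": "example-123",
  "name": [{"family": "Doe", "given": ["Jane"]}],
  "birthDate": "1970-01-01"
}
"""


def summarize_patient(raw):
    """Extract a few fields from a Patient resource given as JSON text."""
    resource = json.loads(raw)
    if resource.get("resourceType") != "Patient":
        raise ValueError("expected a Patient resource")
    name = resource["name"][0]
    return {
        "id": resource["id"],
        "full_name": " ".join(name["given"] + [name["family"]]),
        "birth_date": resource["birthDate"],
    }


print(read_url("Patient", "example-123"))
print(search_url("Observation", patient="example-123"))
print(summarize_patient(patient_json))
```

The point of the sketch is that any two systems agreeing on these resource shapes and URL patterns can exchange a record without knowing each other's internal database layout.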

The UK provides an informative contrast in its NHS, a single government-run enterprise that provides free healthcare at the point of service. Currently, the NHS is able to successfully integrate a variety of health records—a step ahead of the US—but relies on outdated technology with security vulnerabilities such as fax machines (Macaulay 2016 ). The NHS has recently also begun the process of digitizing its health service, with separate NHS Trusts adopting American EHR solutions, such as the Cambridgeshire NHS trust’s recent agreement with Epic (Honeyman et al. 2016 ). However, the NHS still lags behind the US in broad use and uptake across all of its services (Wallace 2016 ). Furthermore, it will need to force the variety of EHRs being adopted to conform to centralized standards and interoperability requirements that allow services as far afield as genome sequencing to be added to a patient record.

Misguided EHR design

Another issue often identified with the modern incarnation of EHRs is that they are often not helpful for doctors in diagnosis—and have been identified by leading clinicians as a hindrance to patient care (Lenzer 2017 ; Gawande 2018 ). A common denominator among the current generation of EHRs is their focus on billing codes, a set of numbers assigned to every task, service, and drug dispensed by a healthcare professional that is used to determine the level of reimbursement the provider will receive. This focus on billing codes is a necessity of the insurance system in the US, which reimburses providers on a service-rendered basis (Essin 2012 ; Lenzer 2017 ). Due to the need for every part of the care process to be billed to insurers (of which there are many) and sometimes to multiple insurers simultaneously, EHRs in the US are designed foremost with insurance needs in mind. As a result, EHRs are hampered by government regulations around billing codes, the requirements of insurance companies, and only then are able to consider the needs of providers or researchers (Bang and Baik 2019 ). And because purchasing decisions for EHRs are not made by physicians, the priority given to patient care outcomes falls behind other needs. The American Medical Association has cited the difficulty of EHRs as a contributing factor in physician burnout and as a waste of valuable time (Lenzer 2017 ; Gardner et al. 2019 ). The NHS, due to its reliance on American manufacturers of EHRs, must suffer through the same problems despite its fundamentally different structure.

Related to the problem of EHRs being optimized for billing, not patient care, is their lack of development beyond repositories of patient information into diagnostic aids. A study of modern day EHR use in the clinic notes many pain points for physicians and healthcare teams (Assis-Hassid et al. 2019 ). Foremost was the variance in EHR use within the clinic—in part because these programs are often not designed with provider workflows in mind (Assis-Hassid et al. 2019 ). In addition, EHRs were found to distract from interpersonal communication and did not integrate the many different types of data being created by nurses, physician assistants, laboratories, and other providers into usable information for physicians (Assis-Hassid et al. 2019 ).

Data ownership

One of the major challenges of current implementations of Big Data is the lack of regulations, incentives, and systems to manage ownership and responsibilities for data. In the clinical space, in the US, this takes the form of compliance with HIPAA, a now decades-old law that aimed to set rules for patient privacy and control for data (Adibuzzaman et al. 2018). As more types of data are generated for patients and uploaded to electronic platforms, HIPAA becomes a major roadblock to data sharing as it creates significant privacy concerns that hamper research. Today, a researcher who searches on even simple demographic variables and disease states can rapidly identify an otherwise de-identified patient (Adibuzzaman et al. 2018). Concerns around breaking HIPAA prevent complete and open data sharing agreements, blocking a path to the specificity needed for the next generation of research, and also throw a wrench into clinical application of these technologies as data sharing becomes bogged down by the ambiguity surrounding old regulations on patient privacy. Furthermore, compliance with the General Data Protection Regulation (GDPR) in the EU has hampered international collaborations as compliance with both HIPAA and GDPR is not yet standardized (Rabesandratana 2019).
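The re-identification risk described above can be illustrated with a simple k-anonymity check, a standard privacy measure that the text does not name and that is introduced here only for illustration. The records, field names, and choice of quasi-identifiers below are made up; the sketch counts how many records share each combination of demographic fields and flags combinations that occur only once.

```python
from collections import Counter

# Made-up "de-identified" records: no names, but demographic fields remain.
records = [
    {"age_band": "40-49", "sex": "F", "postcode_prefix": "CB1", "diagnosis": "C50"},
    {"age_band": "40-49", "sex": "F", "postcode_prefix": "CB1", "diagnosis": "C18"},
    {"age_band": "60-69", "sex": "M", "postcode_prefix": "CB2", "diagnosis": "C61"},
]

# Assumption: these fields could be known to an outside party.
QUASI_IDENTIFIERS = ("age_band", "sex", "postcode_prefix")


def group_sizes(rows, keys):
    """Count how many rows share each combination of quasi-identifiers."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


def k_anonymity(rows, keys):
    """Smallest group size; k == 1 means at least one row is unique."""
    return min(group_sizes(rows, keys).values(), default=0)


def unique_rows(rows, keys):
    """Rows whose quasi-identifier combination appears exactly once."""
    sizes = group_sizes(rows, keys)
    return [row for row in rows if sizes[tuple(row[k] for k in keys)] == 1]


print(k_anonymity(records, QUASI_IDENTIFIERS))
print(unique_rows(records, QUASI_IDENTIFIERS))
```

In this toy dataset the third record is the only man aged 60-69 in his postcode area, so anyone who knows those three facts about him also learns his diagnosis.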

Data sharing is further complicated by the need to develop new technologies to integrate across a variety of providers. Taking from the example of the Informatics for Integrating Biology and the Bedside (i2b2) program funded by the NIH with Partners Healthcare, it is difficult and enormously expensive to overlay programs on top of existing EHRs (Adibuzzaman et al. 2018). Rather, a new approach needs to be developed to solve the problem of data sharing. Blockchain provides an innovative approach and has been recently explored in the literature as a solution that centers patient control of their data, and also promotes safe and secure data sharing through data transfer transactions secured by encryption (Gordon and Catalini 2018). Companies exploring this mechanism for data sharing include Nebula Genomics, a firm founded by George Church, that is aimed at securing genomic data in blockchain in a way that scales commercially, and can be used for research purposes with permission only from data owners, the patients themselves. Other firms, such as Doc.Ai, are exploring using a variety of data types stored in blockchain to create predictive models of disease, but all are centrally based on the idea of a blockchain to secure patient data and ensure private, accurate transfer between sites (Agbo et al. 2019). Advantages of blockchain for healthcare data transfer and storage lie in its security and privacy, but the approach has yet to gain widespread use.
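The core property that makes a blockchain attractive for recording data-sharing transactions is that each entry is cryptographically tied to the one before it, so past entries cannot be silently altered. The sketch below shows only that hash-linking idea; it is not how Nebula Genomics or Doc.Ai work, it has no network, consensus, or encryption, and the transaction fields are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(block):
    """SHA-256 over the block's index, previous hash, and transaction."""
    payload = json.dumps(
        {k: block[k] for k in ("index", "prev_hash", "transaction")},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain, transaction):
    """Add a consent transaction, linking it to the previous block's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    block = {"index": len(chain), "prev_hash": prev_hash, "transaction": transaction}
    block["hash"] = block_hash(block)
    chain.append(block)
    return block


def is_valid(chain):
    """Check that every block's hash and back-link are still consistent."""
    for i, block in enumerate(chain):
        expected_prev = chain[i - 1]["hash"] if i else GENESIS_HASH
        if block["prev_hash"] != expected_prev or block["hash"] != block_hash(block):
            return False
    return True


# Hypothetical consent log: a patient grants and later revokes access.
chain = []
append_block(chain, {"patient": "P001", "grantee": "research-lab-A", "action": "grant"})
append_block(chain, {"patient": "P001", "grantee": "research-lab-A", "action": "revoke"})
print(is_valid(chain))
```

Editing any earlier transaction changes its hash and breaks the link stored in the following block, which `is_valid` detects.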

Recommendations for clinical application

Design a new generation of EHRs

It is conceivable that physicians in the near future will be faced with terabytes of data, with patients coming to their clinics with years of continuous data monitoring their heart rate, blood sugar, and a variety of other factors (Topol 2019a). Gaining clinical insight from such a large quantity of data is an impossible expectation to place upon physicians. In order to solve this problem of the exploding numbers of tests, assays, and results, EHRs will need to be extended from simply being records of patient–physician interactions and digital folders, to being diagnostic aids (Fig. 1). Companies such as Roche–Flatiron are already moving towards this model by building predictive and analytical tools into their EHRs when they provide them to providers. However, broader adoption across a variety of providers, along with the transparency and portability of the models generated, will also be vital. AI-based clinical decision-making support will need to be auditable in order to avoid racial bias and other potential pitfalls (Char et al. 2018). Patients will soon request to have permanent access to the models and predictions being generated by ML models to gain greater clarity into how clinical decisions were made, and to guard against malpractice.

figure 1

In this example we demonstrate how many possible factors may come together to better target patients for early screening measures, which can lower aggregate costs for the healthcare system.

Designing this next generation of EHRs will require collaboration between physicians, patients, providers, and insurers in order to ensure ease of use and efficacy. In terms of specific recommendations for the NHS, the Veterans Administration provides a fruitful approach as it was able to develop its own EHR that compares extremely favorably with the privately produced Epic EHR (Garber et al. 2014 ). Its solution was open access, public-domain, and won the loyalty of physicians in improving patient care (Garber et al. 2014 ). However, the VA’s solution was not actively adopted due to lack of support for continuous maintenance and limited support for billing (Garber et al. 2014 ). While the NHS does not need to consider the insurance industry’s input, it does need to take note that private EHRs were able to gain market prominence in part because they provided a hand to hold for providers, and were far more responsive to personalized concerns raised (Garber et al. 2014 ). Evidence from Denmark suggests that EHR implementation in the UK would benefit from private competitors implementing solutions at the regional rather than national level in order to balance the need for competition and standardization (Kierkegaard 2013 ).

Develop new EHR workflows

Already, researchers and enterprise are developing predictive models that can better diagnose cancers based on imaging data (Bibault et al. 2016 ). While these products and tools are not yet market ready and are far off from clinical approval—they portend things to come. We envision a future where the job of an Oncologist becomes increasingly interpretive rather than diagnostic. But to get to that future, we will need to train our algorithms much like we train our future doctors—with millions of examples. In order to build this corpus of data, we will need to create a digital infrastructure around Big Data that can both handle the demands of researchers and enterprise as they continuously improve their models—with those of patients and physicians who must continue their important work using existing tools and knowledge. In Fig. 2 , we demonstrate a hypothetical workflow based on models provided by other researchers in the field (Bibault et al. 2016 ; Topol 2019a ). This simplified workflow posits EHRs as an integrative tool that can facilitate the capture of a large variety of data sources and can transform them into a standardized format to be stored in a secure cloud storage facility (Osong et al. 2019 ). Current limitations in HIPAA in the US have prevented innovation in this field, so reform will need to both guarantee the protection of private patient data and the open access to patient histories for the next generation of diagnostic tools. The introduction of accurate predictive models for patient treatment will mean that cancer diagnosis will fundamentally change. We will see the job of oncologists transforming itself as they balance recommendations provided by digital tools that can instantly integrate literature and electronic records from past patients, and their own best clinical judgment.

figure 2

Here, various heterogeneous data types are fed into a centralized EHR system that will be uploaded to a secure digital cloud where it can be de-identified and used by research and enterprise, but primarily by physicians and patients.

Use a global patient ID

While we are already seeing the fruits of decades of research into ML methods, there is a whole new set of techniques that will soon leave research labs and be applied in the clinic. This set of “omics”, a term often used to refer to proteomics, genomics, metabolomics, and others, will reveal even more specificity about a patient’s cancer at lower cost (Cho 2015). However, like other technologies, they will create petabytes of data that will need to be stored and integrated to help physicians.

As tests and healthcare providers diversify, EHRs will need to address the question of extensibility and flexibility. Providers as disparate as counseling offices and MRI imaging centers cannot be expected to use the same software, or even similar software. As specific solutions for diverse providers are created, they will need to interface in a standard format with existing EHRs. The UK Biobank creates a model for these types of interactions in its use of a singular patient ID to link a variety of data types, allowing for extensibility as future iterations and improvements add data sources for the project. Sweden and China are also informative examples in their usage of national citizen identification numbers as a method of linking clinical and administrative datasets together (Cnudde et al. 2016; Zhang et al. 2018). Singular patient identification numbers do not yet exist in the US, despite their inclusion in HIPAA, due to subsequent Congressional action preventing their creation (Davis 2019). Instead, private providers have stepped in to bridge the gap, but have also called on the US government to create an official patient ID system (Davis 2019). Not only would a singular patient ID allow researchers to link US administrative data together with clinical outcomes, but it would also provide a solution to the questions of data ownership and fragmentation that plague the current system.

The future of healthcare will build on the Big Data projects currently being pioneered around the world. The models of data integration being pioneered by the “All of Us” trial and analytics championed by P4 medicine will come to define the patient experience (Flores et al. 2013). However, in this piece we have demonstrated a series of hurdles that the field must overcome to avoid imposing additional burdens on physicians and to deliver significant value. We recommend a set of proposals built upon an examination of the NHS and other publicly administered healthcare models and the US multi-payer system to bridge the gap between the market competition needed to develop these new technologies and effective patient care.

Access to patient data must be a paramount guiding principle as regulators begin to approach the problem of wrangling the many streams of data that are already being generated. Data must be accessible to physicians and patients, but must also be secured and de-identified for the benefit of research. A pathway taken by the UK Biobank to guarantee data integration and universal access has been through the creation of a single database and protocol for accessing its contents (Allen et al. 2012). It is then feasible to suggest a similar system for the NHS, which is already centralized with a single funding source. However, this system will necessarily also be a security concern due to its centralized nature, even if patient data is encrypted (Fig. 3). Another approach is to follow in the footsteps of the US’ HIPAA, which suggested the creation of unique patient IDs over 20 years ago. With a single patient identifier, EHRs would then be allowed to communicate with heterogeneous systems especially designed for labs or imaging centers or counseling services and more (Fig. 4). However, this design presupposes a standardized format and protocol for communication across a variety of databases, similar to the HL7 standards that already exist (Bender and Sartipi 2013). In place of a centralized authority building out a digital infrastructure to house and communicate patient data, mandating protocols and security standards will allow for the development of specialized EHR solutions for an ever diversifying set of healthcare providers and encourage the market needed for continual development and support of these systems. Avoiding data fragmentation as seen already in the US then becomes an exercise in mandating data sharing in law.
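One way to reconcile a single patient identifier with de-identified research data is keyed pseudonymization: the real ID is replaced by a keyed hash, so records from different providers still link to each other but cannot be traced back without the key. The sketch below is an illustration under stated assumptions, not a description of how the UK Biobank or any EHR vendor does it; the key, field names, and record are hypothetical, and it covers de-identification only, not encryption of the stored data.

```python
import hashlib
import hmac

# Hypothetical secret held by the data custodian; never hard-code a real key.
SECRET_KEY = b"demo-key-not-for-production"

# Assumption: these fields directly identify a person and must be dropped.
DIRECT_IDENTIFIERS = {"name", "address"}


def pseudonym(patient_id, key=SECRET_KEY):
    """Keyed SHA-256 hash (HMAC) of the patient ID, as a hex string."""
    return hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def de_identify(record, key=SECRET_KEY):
    """Drop direct identifiers and replace patient_id with a pseudonym."""
    cleaned = {
        field: value
        for field, value in record.items()
        if field not in DIRECT_IDENTIFIERS and field != "patient_id"
    }
    cleaned["pseudonym"] = pseudonym(record["patient_id"], key)
    return cleaned


lab_record = {"patient_id": "P001", "name": "Jane Doe", "address": "1 High St", "hba1c": 6.1}
imaging_record = {"patient_id": "P001", "name": "Jane Doe", "modality": "MRI"}

lab_clean = de_identify(lab_record)
imaging_clean = de_identify(imaging_record)

# Both cleaned records carry the same pseudonym, so they remain linkable.
print(lab_clean["pseudonym"] == imaging_clean["pseudonym"])
```

Using a keyed hash rather than a plain hash matters here: without the key, an attacker cannot simply hash every possible patient ID and look for matches.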

figure 3

Future implementations of Big Data will need to not only integrate data, but also encrypt and de-identify it for secure storage.

figure 4

Hypothetical healthcare system design based on unique patient identifiers that function across a variety of systems and providers—linking together disparate datasets into a complete patient profile.

The next problem then becomes the inevitable application of AI to healthcare. Any such tool created will have to stand up to the scrutiny not just of being asked to outclass human diagnoses, but to also reveal its methods. Because of the opacity of ML models, the “black box” effect means that diagnoses cannot be scrutinized or understood by outside observers (Fig. 5 ). This makes clinical use extremely limited, unless further techniques are developed to deconvolute the decision-making process of these models. Until then, we expect that AI models will only provide support for diagnoses.

figure 5

Without transparency in many of the models being implemented as to why and how decisions are being made, there exists room for algorithmic bias and no room for improvement or criticism by physicians. The “black box” of machine learning obscures why decisions are made and what actually affects predictions.

Furthermore, AI models often simply replicate biases in existing datasets. Cohn et al. (2017) demonstrated clear areas of deficiency in the minority representation of patients in the UK Biobank. Any research conducted on these datasets will necessarily only be able to create models that generalize to the population in them (a largely homogeneous white-British group) (Fig. 6). In order to protect against algorithmic bias and the black box of current models hiding their decision-making, regulators must enforce rules that expose the decision-making of future predictive healthcare models to public and physician scrutiny. Similar to the existing FDA regulatory framework for medical devices, algorithms too must be put up to regulatory scrutiny to prevent discrimination, while also ensuring transparency of care.
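A first step toward the scrutiny called for above is to report a model's performance separately for each demographic group rather than as a single overall figure. The sketch below does this for accuracy; the group labels and predictions are invented for illustration, and accuracy is only one of several metrics an audit might compare.

```python
from collections import defaultdict

# Invented audit data: (group, true_label, predicted_label).
results = [
    ("majority", 1, 1),
    ("majority", 0, 0),
    ("majority", 1, 1),
    ("majority", 0, 0),
    ("minority", 1, 0),
    ("minority", 0, 0),
]


def accuracy_by_group(rows):
    """Fraction of correct predictions within each group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, predicted in rows:
        total[group] += 1
        correct[group] += int(truth == predicted)
    return {group: correct[group] / total[group] for group in total}


def accuracy_gap(rows):
    """Difference between the best and worst per-group accuracy."""
    accuracies = accuracy_by_group(rows)
    return max(accuracies.values()) - min(accuracies.values())


print(accuracy_by_group(results))
print(accuracy_gap(results))
```

In this toy example the overall accuracy is five out of six, which hides the fact that the model is right only half the time for the under-represented group.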

figure 6

The “All of Us” study will meet this need by specifically aiming to recruit a diverse pool of participants to develop disease models that generalize to every citizen, not just the majority (Denny et al. 2019 ). Future global Big Data generation projects should learn from this example in order to guarantee equality of care for all patients.

The future of healthcare will increasingly live on server racks and be built in glass office buildings by teams of programmers. The US must take seriously the benefits of centralized regulations and protocols that have allowed the NHS to be enormously successful in preventing the problem of data fragmentation—while the NHS must approach the possibility of freer markets for healthcare devices and technologies as a necessary condition for entering the next generation of healthcare delivery which will require constant reinvention and improvement to deliver accurate care.

Overall, we are entering a transition in how we think about caring for patients and about the role of a physician. Rather than maintaining a reactive healthcare system that finds cancers only once they have advanced to a serious stage, Big Data offers us the opportunity to fine-tune screening and prevention protocols to significantly reduce the burden of diseases such as advanced-stage cancers and metastasis. This development allows physicians to treat each patient more individually, as they can draw on information beyond rough demographic indicators, such as genomic sequencing of the patient's tumor. Healthcare is not yet prepared for this shift, so governments around the world must pay attention to how others have implemented Big Data in healthcare in order to write the regulatory structure of the future. Ensuring competition, data security, and algorithmic transparency will be the hallmarks of how we think about guaranteeing better patient care.

Adibuzzaman M, DeLaurentis P, Hill J, Benneyworth BD (2018) Big data in healthcare—the promises, challenges and opportunities from a research perspective: a case study with a model database. AMIA Annu Symp Proc 2017:384–392

Agbo CC, Mahmoud QH, Eklund JM (2019) Blockchain technology in healthcare: a systematic review. Healthcare 7:56

Aguet F, Brown AA, Castel SE, Davis JR, He Y, Jo B et al. (2017) Genetic effects on gene expression across human tissues. Nature 550:204–213

Akbarian S, Liu C, Knowles JA, Vaccarino FM, Farnham PJ, Crawford GE et al. (2015) The PsychENCODE project. Nat Neurosci 18:1707–1712

Allen N, Sudlow C, Downey P, Peakman T, Danesh J, Elliott P et al. (2012) UK Biobank: current status and what it means for epidemiology. Health Policy Technol 1:123–126

Assis-Hassid S, Grosz BJ, Zimlichman E, Rozenblum R, Bates DW (2019) Assessing EHR use during hospital morning rounds: a multi-faceted study. PLoS ONE 14:e0212816

Bang CS, Baik GH (2019) Using big data to see the forest and the trees: endoscopic submucosal dissection of early gastric cancer in Korea. Korean J Intern Med 34:772–774

Bender D, Sartipi K (2013) HL7 FHIR: an agile and RESTful approach to healthcare information exchange. In Proceedings of the 26th IEEE International Symposium on Computer-Based Medical Systems, IEEE. pp 326–331

Bibault J-E, Giraud P, Burgun A (2016) Big Data and machine learning in radiation oncology: state of the art and future prospects. Cancer Lett 382:110–117

Blobel B (2018) Interoperable EHR systems—challenges, standards and solutions. Eur J Biomed Inf 14:10–19

Camacho DM, Collins KM, Powers RK, Costello JC, Collins JJ (2018) Next-generation machine learning for biological networks. Cell 173:1581–1592

Campbell PJ, Getz G, Stuart JM, Korbel JO, Stein LD (2020) Pan-cancer analysis of whole genomes. Nature https://www.nature.com/articles/s41586-020-1969-6

Chambers DA, Amir E, Saleh RR, Rodin D, Keating NL, Osterman TJ, Chen JL (2019) The impact of Big Data research on practice, policy, and cancer care. Am Soc Clin Oncol Educ Book 39:e167–e175

Char DS, Shah NH, Magnus D (2018) Implementing machine learning in health care—addressing ethical challenges. N Engl J Med 378:981–983

Cho WC (2015) Big Data for cancer research. Clin Med Insights Oncol 9:135–136

Cnudde P, Rolfson O, Nemes S, Kärrholm J, Rehnberg C, Rogmark C, Timperley J, Garellick G (2016) Linking Swedish health data registers to establish a research database and a shared decision-making tool in hip replacement. BMC Musculoskelet Disord 17:414

Cohn EG, Hamilton N, Larson EL, Williams JK (2017) Self-reported race and ethnicity of US biobank participants compared to the US Census. J Community Genet 8:229–238

Connelly R, Playford CJ, Gayle V, Dibben C (2016) The role of administrative data in the big data revolution in social science research. Soc Sci Res 59:1–12

Davis J (2019) National patient identifier HIPAA provision removed in proposed bill. HealthITSecurity https://healthitsecurity.com/news/national-patient-identifier-hipaa-provision-removed-in-proposed-bill

DeMartino JK, Larsen JK (2013) Data needs in oncology: “Making Sense of The Big Data Soup”. J Natl Compr Canc Netw 11:S1–S12

Deng J, El Naqa I, Xing L (2018) Editorial: machine learning with radiation oncology big data. Front Oncol 8:416

Denny JC, Rutter JL, Goldstein DB, Philippakis A, Smoller JW, Jenkins G et al. (2019) The “All of Us” research program. N Engl J Med 381:668–676

Elliott LT, Sharp K, Alfaro-Almagro F, Shi S, Miller KL, Douaud G et al. (2018) Genome-wide association studies of brain imaging phenotypes in UK Biobank. Nature 562:210–216

Essin D (2012) Improve EHR systems by rethinking medical billing. Physicians Pract. https://www.physicianspractice.com/ehr/improve-ehr-systems-rethinking-medical-billing

Esteva A, Robicquet A, Ramsundar B, Kuleshov V, DePristo M, Chou K et al. (2019) A guide to deep learning in healthcare. Nat Med 25:24–29

Fessele KL (2018) The rise of Big Data in oncology. Semin Oncol Nurs 34:168–176

Flores M, Glusman G, Brogaard K, Price ND, Hood L (2013) P4 medicine: how systems medicine will transform the healthcare sector and society. Pers Med 10:565–576

Garber S, Gates SM, Keeler EB, Vaiana ME, Mulcahy AW, Lau C et al. (2014) Redirecting innovation in U.S. Health Care: options to decrease spending and increase value: Case Studies 133

Gardner RL, Cooper E, Haskell J, Harris DA, Poplau S, Kroth PJ et al. (2019) Physician stress and burnout: the impact of health information technology. J Am Med Inf Assoc 26:106–114

Gawande A (2018) Why doctors hate their computers. The New Yorker, 12 Nov. https://www.newyorker.com/magazine/2018/11/12/why-doctors-hate-their-computers

Gordon WJ, Catalini C (2018) Blockchain technology for healthcare: facilitating the transition to patient-driven interoperability. Comput Struct Biotechnol J 16:224–230

Hasin Y, Seldin M, Lusis A (2017) Multi-omics approaches to disease. Genome Biol 18:83

Honeyman M, Dunn P, McKenna H (2016) A Digital NHS. An introduction to the digital agenda and plans for implementation https://www.kingsfund.org.uk/sites/default/files/field/field_publication_file/A_digital_NHS_Kings_Fund_Sep_2016.pdf

Kierkegaard P (2013) eHealth in Denmark: A Case Study. J Med Syst 37

Krumholz HM (2014) Big Data and new knowledge in medicine: the thinking, training, and tools needed for a learning health system. Health Aff 33:1163–1170

Lenzer J (2017) Commentary: the real problem is that electronic health records focus too much on billing. BMJ 356:j326

Leonard D, Tozzi J (2012) Why don’t more hospitals use electronic health records. Bloom Bus Week

Macaulay T (2016) Progress towards a paperless NHS. BMJ 355:i4448

Madhavan S, Subramaniam S, Brown TD, Chen JL (2018) Art and challenges of precision medicine: interpreting and integrating genomic data into clinical practice. Am Soc Clin Oncol Educ Book 38:546–553

Marx V (2015) The DNA of a nation. Nature 524:503–505

Miller RS (2011) Electronic health record certification in oncology: role of the certification commission for health information technology. J Oncol Pr 7:209–213

Norgeot B, Glicksberg BS, Butte AJ (2019) A call for deep-learning healthcare. Nat Med 25:14–15

O’Brien R, Potter-Collins A (2015) 2011 Census analysis: ethnicity and religion of the non-UK born population in England and Wales: 2011. Office for National Statistics. https://www.ons.gov.uk/peoplepopulationandcommunity/culturalidentity/ethnicity/articles/2011censusanalysisethnicityandreligionofthenonukbornpopulationinenglandandwales/2015-06-18

Osong AB, Dekker A, van Soest J (2019) Big data for better cancer care. Br J Hosp Med (Lond) 80:304–305

Rabesandratana T (2019) European data law is impeding studies on diabetes and Alzheimer’s, researchers warn. Sci AAAS. https://doi.org/10.1126/science.aba2926

Raghupathi W, Raghupathi V (2014) Big data analytics in healthcare: promise and potential. Health Inf Sci Syst 2:3

Reisman M (2017) EHRs: the challenge of making electronic data usable and interoperable. Pharm Ther 42:572–575

Shendure J, Ji H (2008) Next-generation DNA sequencing. Nature Biotechnology 26:1135–1145

Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ et al. (2015) Big Data: astronomical or genomical? PLOS Biol 13:e1002195

Tomczak K, Czerwińska P, Wiznerowicz M (2015) The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemp Oncol 19:A68–A77

Topol E (2019a) High-performance medicine: the convergence of human and artificial intelligence. Nat Med 25:44

Topol E (2019b) The topol review: preparing the healthcare workforce to deliver the digital future. Health Education England https://topol.hee.nhs.uk/

Turnbull C, Scott RH, Thomas E, Jones L, Murugaesu N, Pretty FB, Halai D, Baple E, Craig C, Hamblin A, et al. (2018) The 100 000 Genomes Project: bringing whole genome sequencing to the NHS. BMJ 361

Wallace WA (2016) Why the US has overtaken the NHS with its EMR. National Health Executive Magazine, pp 32–34 http://www.nationalhealthexecutive.com/Comment/why-the-us-has-overtaken-the-nhs-with-its-emr

Webster PC (2014) Sweden’s health data goldmine. CMAJ Can Med Assoc J 186:E310

Wetterstrand KA (2019) DNA sequencing costs: data from the NHGRI Genome Sequencing Program (GSP). Natl Hum Genome Res Inst. www.genome.gov/sequencingcostsdata, accessed 2019

Zhang L, Wang H, Li Q, Zhao M-H, Zhan Q-M (2018) Big data and medical research in China. BMJ 360:j5910

Author information

Authors and affiliations.

Department of Genetics, University of Cambridge, Downing Site, Cambridge, CB2 3EH, UK

Raag Agrawal & Sudhakaran Prabakaran

Department of Biology, Columbia University, 116th and Broadway, New York, NY, 10027, USA

Raag Agrawal

Department of Biology, Indian Institute of Science Education and Research, Pune, Maharashtra, 411008, India

Sudhakaran Prabakaran

St Edmund’s College, University of Cambridge, Cambridge, CB3 0BN, UK

Corresponding author

Correspondence to Sudhakaran Prabakaran.

Ethics declarations

Conflict of interest.

SP is co-founder of Nonexomics.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Associate editor: Frank Hailer

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Agrawal, R., Prabakaran, S. Big data in digital healthcare: lessons learnt and recommendations for general practice. Heredity 124, 525–534 (2020). https://doi.org/10.1038/s41437-020-0303-2

Received: 28 June 2019

Revised: 25 February 2020

Accepted: 25 February 2020

Published: 05 March 2020

Issue Date: April 2020

DOI: https://doi.org/10.1038/s41437-020-0303-2

25+ Solved End-to-End Big Data Projects with Source Code

Solved End-to-End Real World Mini Big Data Projects Ideas with Source Code For Beginners and Students to master big data tools like Hadoop and Spark.

Ace your big data analytics interview by adding some unique and exciting Big Data projects to your portfolio. This blog lists more than 25 big data analytics projects you can work on to showcase your big data skills and gain hands-on experience with big data tools and technologies. You will find projects for every level of expertise, including big data projects for students and big data projects for beginners.

Have you ever looked for sneakers on Amazon and then seen advertisements for similar sneakers while searching the internet for the perfect cake recipe? Maybe you started using Instagram to search for some fitness videos, and now Instagram keeps recommending videos from fitness influencers to you. And even if you're not very active on social media, you probably check your phone now and then before leaving the house to see what the traffic is like on your route and how long it could take you to reach your destination. None of this would be possible without the big data analysis carried out by modern data-driven companies. We bring you the top big data projects for 2023, specially curated for students, beginners, and anybody looking to get started with mastering data skills.

Table of Contents

  • What is a Big Data Project
  • How Do You Create a Good Big Data Project
  • 25+ Big Data Project Ideas to Help Boost Your Resume
  • Big Data Project Ideas for Beginners
  • Intermediate Projects on Data Analytics
  • Advanced Level Examples of Big Data Projects
  • Real-Time Big Data Projects with Source Code
  • Sample Big Data Project Ideas for Final Year Students
  • Big Data Project Ideas Using Hadoop
  • Big Data Projects Using Spark
  • GCP and AWS Big Data Projects
  • Best Big Data Project Ideas for Masters Students
  • Fun Big Data Project Ideas
  • Top 5 Apache Big Data Projects
  • Top Big Data Projects on GitHub with Source Code
  • Level-Up Your Big Data Expertise with ProjectPro's Big Data Projects
  • FAQs on Big Data Projects

A big data project is a data analysis project that applies machine learning algorithms and other data analytics techniques to structured and unstructured data for purposes such as predictive modeling and other advanced analytics applications. Before working on any big data project, data engineers must become proficient in the relevant areas, such as deep learning, machine learning, data visualization, data analytics, and data science.

Many platforms, like GitHub and ProjectPro, offer big data projects for professionals at all skill levels: beginner, intermediate, and advanced. However, before moving on to a list of big data project ideas worth exploring and adding to your portfolio, let us first get a clear picture of what big data is and why everyone is interested in it.

Kicking off a big data analytics project is always the most challenging part. You will face questions such as: What are the project goals? How can you become familiar with the dataset? What challenges are you trying to address? What skills does the project require? What metrics will you use to evaluate your model?

Well! The first crucial step to launching your project initiative is to have a solid project plan. To build a big data project, you should always adhere to a clearly defined workflow. Before starting any big data project, it is essential to become familiar with the fundamental processes and steps involved, from gathering raw data to creating a machine learning model to its effective implementation.

Understand the Business Goals of the Big Data Project

The first step of any good big data analytics project is understanding the business or industry that you are working in. Before you even consider analyzing the data, go out and speak with the individuals whose processes you aim to transform with it. Afterward, establish a timeline and specific key performance indicators. Although planning and procedures can appear tedious, they are a crucial step in launching your data initiative. You must identify a definite purpose for the data, such as a specific question to be answered or a data product to be built, to provide motivation and direction.

Collect Data for the Big Data Project

The next step in a big data project is looking for data once you've established your goal. To create a successful data project, collect and integrate data from as many different sources as possible. 

Here are some options for collecting data that you can utilize:

Connect to an existing database that is already public or access your private database.

Consider the APIs of all the tools your organization uses and the data those tools have gathered. You will need to put in some effort to set up those APIs before you can use data such as email open and click statistics or the support requests customers have sent.

There are plenty of datasets on the Internet that can provide more information than what you already have. There are open data platforms in several regions (like data.gov in the U.S.). These open data sets are a fantastic resource if you're working on a personal project for fun.

Data Preparation and Cleaning

Data preparation comes next, and it may consume up to 80% of the time allocated to any big data or data engineering project. Once you have the data, start exploring what you have and how you can combine everything to meet the primary goal. To understand the relevance of all your data, make notes on your initial analyses and put your most important questions to businesspeople, the IT team, or other groups. Data cleaning follows: to ensure that the data is consistent and accurate, review each column and check for errors, missing values, and similar problems.
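
As a rough illustration of the column-by-column review described above, the sketch below counts missing values per column and values that fail to parse as numbers. The column names and sample rows are hypothetical, and only the Python standard library is assumed.

```python
import csv
import io

def profile_columns(rows, numeric_columns):
    """Count missing values per column and unparseable values in numeric columns."""
    report = {}
    for row in rows:
        for column, value in row.items():
            stats = report.setdefault(column, {"missing": 0, "invalid": 0})
            text = (value or "").strip()
            if text == "":
                stats["missing"] += 1
            elif column in numeric_columns:
                try:
                    float(text)
                except ValueError:
                    stats["invalid"] += 1
    return report

# Hypothetical sample data standing in for a real extract.
SAMPLE_CSV = """order_id,amount,city
1,19.99,Austin
2,,Boston
3,abc,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = profile_columns(rows, numeric_columns={"amount"})
for column, stats in report.items():
    print(column, stats)
```

A report like this only tells you where to look; deciding whether to drop, fill, or correct the flagged values still depends on the project goal.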

Making sure that your project and your data are compatible with data privacy standards is a key aspect of data preparation that should not be overlooked. Personal data privacy and protection are becoming increasingly crucial, and you should prioritize them immediately as you embark on your big data journey. You must consolidate all your data initiatives, sources, and datasets into one location or platform to facilitate governance and carry out privacy-compliant projects. 

Data Transformation and Manipulation

Now that the data is clean, it's time to transform it so you can extract useful information. Start by combining your various sources and grouping logs so that the data is focused on its most significant aspects. You can then enrich it, for instance, by adding time-based attributes such as:

Acquiring date-related elements (month, hour, day of the week, week of the year, etc.)

Calculating the variations between date-column values, etc.
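
The two time-based attributes listed above can be sketched with Python's standard `datetime` module. The order and delivery timestamps are made-up values used only for illustration.

```python
from datetime import datetime

def date_features(timestamp):
    """Derive calendar attributes from a single datetime value."""
    return {
        "month": timestamp.month,
        "hour": timestamp.hour,
        "day_of_week": timestamp.weekday(),          # Monday is 0, Sunday is 6
        "week_of_year": timestamp.isocalendar()[1],  # ISO 8601 week number
    }

def days_between(start, end):
    """Difference between two date-column values, in whole days."""
    return (end - start).days

# Hypothetical values for an "ordered" and a "delivered" column.
ordered = datetime(2023, 3, 6, 14, 30)
delivered = datetime(2023, 3, 9, 9, 0)

features = date_features(ordered)
lead_time = days_between(ordered, delivered)
print(features, lead_time)
```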

Joining datasets is another way to improve data, which entails extracting columns from one dataset or tab and adding them to a reference dataset. This is a crucial component of any analysis, but it can become a challenge when you have many data sources.
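
A minimal sketch of such a join, assuming both datasets are small enough to hold in memory as lists of dictionaries. The `orders` and `customers` records and the `customer_id` key are hypothetical.

```python
def left_join(reference_rows, lookup_rows, key, columns):
    """Add selected columns from lookup_rows to each reference row, matching on key."""
    lookup = {row[key]: row for row in lookup_rows}
    joined = []
    for row in reference_rows:
        match = lookup.get(row[key], {})
        enriched = dict(row)
        for column in columns:
            enriched[column] = match.get(column)  # None when no match exists
        joined.append(enriched)
    return joined

# Hypothetical reference dataset and lookup dataset.
orders = [
    {"order_id": 1, "customer_id": "c1"},
    {"order_id": 2, "customer_id": "c9"},
]
customers = [{"customer_id": "c1", "country": "US"}]

result = left_join(orders, customers, key="customer_id", columns=["country"])
print(result)
```

Keeping unmatched rows with empty values, rather than dropping them, makes gaps between sources visible, which matters more as the number of data sources grows.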

Visualize Your Data

Now that you have a decent dataset (or perhaps several), it is a good time to start exploring it with dashboards, charts, or graphs. The next stage of any data analytics project should focus on visualization because it is one of the most effective ways to analyze and present insights when working with massive amounts of data.

Graphs are also a way to enrich your dataset and create more interesting features. For instance, by plotting your data points on a map, you may discover that some geographic regions are more informative than others.

Build Predictive Models Using Machine Learning Algorithms

Machine learning algorithms can take your big data project to the next level by revealing more detail and making predictions about future trends. By working with clustering techniques (a form of unsupervised learning), you can build models that find patterns in the data that were not visible in graphs. These techniques group similar records into clusters and indicate, more or less explicitly, which characteristics define each cluster.
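
To make the idea of clustering concrete, here is a small k-means sketch in plain Python. K-means is only one of many clustering techniques, and the points and starting centroids below are invented for illustration.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    centroids = list(centroids)
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                centroids[i] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, assignments

# Hypothetical data: two visibly separate groups of points.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, assignments = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids, assignments)
```

In a real project you would more likely use a library implementation that scales to large datasets; the loop above only shows the assign-then-update logic.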

Advanced data scientists can use supervised algorithms to predict future trends. They discover features that have influenced previous data patterns by reviewing historical data and can then generate predictions using these features. 
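
As a minimal example of learning from historical data to predict a future value, the sketch below fits a straight line by ordinary least squares. The monthly sales figures are hypothetical, and real projects typically use many features and a dedicated library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: sales for months 1 to 5.
months = [1, 2, 3, 4, 5]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept  # prediction for month 6
print(slope, intercept, forecast)
```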

Lastly, your predictive model needs to be operationalized for the project to be truly valuable. Deploying a machine learning model for adoption by all individuals within an organization is referred to as operationalization.

Repeat The Process

This is the last step in completing your big data project, and it's crucial to the whole data life cycle. One of the biggest mistakes individuals make when it comes to machine learning is assuming that once a model is created and implemented, it will always function normally. On the contrary, if models aren't updated with the latest data and regularly modified, their quality will deteriorate with time.

To complete your first data project successfully, you need to accept that your model will never truly be "complete". You need to continually reevaluate it, retrain it, and create new features for it to stay accurate and valuable.
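
One simple way to decide when a model needs attention is to compare its error on recent data with the error measured at deployment. The sketch below assumes a regression model, mean absolute error as the metric, and a hypothetical tolerance factor of 1.25; all three are illustrative choices.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def needs_retraining(baseline_error, recent_actual, recent_predicted, tolerance=1.25):
    """Flag the model when recent error exceeds the baseline by the given factor."""
    recent_error = mean_absolute_error(recent_actual, recent_predicted)
    return recent_error > baseline_error * tolerance, recent_error

# Hypothetical numbers: error at deployment was 2.0; recent predictions have drifted.
flag, recent_error = needs_retraining(
    baseline_error=2.0,
    recent_actual=[10, 12, 15],
    recent_predicted=[11, 16, 19],
)
print(flag, recent_error)
```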

If you are new to Big Data, keep in mind that it is not an easy field, but also remember that nothing good in life comes easy; you have to work for it. The most helpful way to learn a skill is through hands-on experience. Below is a list of Big Data analytics project ideas, along with an idea of the approach you could take to develop them. We hope it helps you learn more about Big Data and even kick-start a career in it.

Yelp Data Processing Using Spark And Hive Part 1

Yelp Data Processing using Spark and Hive Part 2

Hadoop Project for Beginners-SQL Analytics with Hive

Tough engineering choices with large datasets in Hive Part - 1

Finding Unique URL's using Hadoop Hive

AWS Project - Build an ETL Data Pipeline on AWS EMR Cluster

Orchestrate Redshift ETL using AWS Glue and Step Functions

Analyze Yelp Dataset with Spark & Parquet Format on Azure Databricks

Data Warehouse Design for E-commerce Environments

Analyzing Big Data with Twitter Sentiments using Spark Streaming

PySpark Tutorial - Learn to use Apache Spark with Python

Tough engineering choices with large datasets in Hive Part - 2

Event Data Analysis using AWS ELK Stack

Web Server Log Processing using Hadoop

Data processing with Spark SQL

Build a Time Series Analysis Dashboard with Spark and Grafana

GCP Data Ingestion with SQL using Google Cloud Dataflow

Deploying auto-reply Twitter handle with Kafka, Spark, and LSTM

Dealing with Slowly Changing Dimensions using Snowflake

Spark Project -Real-Time data collection and Spark Streaming Aggregation

Snowflake Real-Time Data Warehouse Project for Beginners-1

Real-Time Log Processing using Spark Streaming Architecture

Real-Time Auto Tracking with Spark-Redis

Building Real-Time AWS Log Analytics Solution

In this section, you will find a list of good big data project ideas for masters students.

Hadoop Project-Analysis of Yelp Dataset using Hadoop Hive

Online Hadoop Projects -Solving small file problem in Hadoop

Airline Dataset Analysis using Hadoop, Hive, Pig, and Impala

AWS Project-Website Monitoring using AWS Lambda and Aurora

Explore features of Spark SQL in practice on Spark 2.0

MovieLens Dataset Exploratory Analysis

Bitcoin Data Mining on AWS

Create A Data Pipeline Based On Messaging Using PySpark And Hive - Covid-19 Analysis

Spark Project-Analysis and Visualization on Yelp Dataset

Project Ideas on Big Data Analytics

Let us now begin with a more detailed list of good big data project ideas that you can easily implement.

This section will introduce you to a list of project ideas on big data that use Hadoop along with descriptions of how to implement them.

1. Visualizing Wikipedia Trends

Human brains tend to process visual data better than data in any other format. By some estimates, 90% of the information transmitted to the brain is visual, and the human brain can process an image in just 13 milliseconds. Wikipedia is a website accessed by people all around the world for research, for general information, and just to satisfy their occasional curiosity.

Visualizing Wikipedia Trends Big Data Project

Raw page count data from Wikipedia can be collected and processed with Hadoop. The processed data can then be visualized in Zeppelin notebooks to analyze trends, which can be broken down by demographics or other parameters. This is a good pick for anyone who wants to understand how analysis and visualization work at big data scale, and it also makes an excellent Apache big data project idea.
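
Before moving to a cluster, it can help to prototype the aggregation on a small sample. The sketch below assumes each raw line holds a project code, a page title, a view count, and a byte count separated by spaces; that layout and the sample lines are assumptions, so check them against the files you actually download. A Hadoop job would apply the same count-by-key logic across many files.

```python
from collections import Counter

def top_pages(lines, project="en", limit=3):
    """Sum view counts per page title for one project and return the most viewed pages."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4 or parts[0] != project:
            continue  # skip malformed lines and other projects
        try:
            counts[parts[1]] += int(parts[2])
        except ValueError:
            continue  # skip lines whose count is not a number
    return counts.most_common(limit)

# Hypothetical sample lines in the assumed format.
sample = [
    "en Main_Page 120 0",
    "en Big_data 45 0",
    "de Hauptseite 80 0",
    "en Big_data 30 0",
    "malformed line",
]
ranking = top_pages(sample)
print(ranking)
```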

Visualizing Wikipedia Trends Big Data Project with Source Code .

2. Visualizing Website Clickstream Data

Clickstream data analysis refers to collecting, processing, and understanding all the web pages a particular user visits. This analysis benefits web page marketing, product management, and targeted advertisement. Since users tend to visit sites based on their requirements and interests, clickstream analysis can help to get an idea of what a user is looking for. 

Visualizing this data helps identify these trends, so that advertisements can be tailored to individuals. Ads on webpages provide a source of income for the page and help the advertiser reach both the targeted customer and other internet users. Building it with Hadoop makes this an Apache big data project.
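
One basic clickstream measure is how often users move from one page to another. The sketch below counts those transitions from a list of (user, timestamp, page) events; the event layout and the sample values are hypothetical.

```python
from collections import Counter, defaultdict

def page_transitions(events):
    """Count page-to-page transitions per user from (user, timestamp, page) events."""
    by_user = defaultdict(list)
    # Sort by user and time so each user's pages are in visit order.
    for user, timestamp, page in sorted(events, key=lambda e: (e[0], e[1])):
        by_user[user].append(page)
    transitions = Counter()
    for pages in by_user.values():
        transitions.update(zip(pages, pages[1:]))
    return transitions

# Hypothetical events; timestamps are simple integers for readability.
events = [
    ("u1", 1, "/home"),
    ("u1", 2, "/shoes"),
    ("u2", 1, "/home"),
    ("u2", 3, "/shoes"),
    ("u1", 5, "/cart"),
]
transitions = page_transitions(events)
print(transitions.most_common())
```

The resulting counts are exactly the kind of data that can feed a flow chart of how visitors move through a site.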

Big Data Analytics Projects Solution for Visualization of Clickstream Data on a Website

3. Web Server Log Processing

A web server log maintains a list of page requests and the activities the server has performed. The data on web servers can be stored, processed, and mined for further analysis. In this manner, webpage ads can be determined, and SEO (search engine optimization) can also be carried out. The overall user experience can also be improved through web server log analysis. This kind of processing benefits any business that relies heavily on its website to generate revenue or to reach its customers. The open-source Apache Hadoop ecosystem, with tools such as Pig, Impala, Hive, Spark, Kafka, Oozie, and HDFS, can be used for storage and processing.
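
The parsing step can be prototyped in a few lines before it is scaled out. The sketch below assumes the logs follow the widely used Common Log Format; the sample lines are invented, and your server may log additional fields.

```python
import re
from collections import Counter

# Client, identity, user, [timestamp], "method path protocol", status, size.
LOG_PATTERN = re.compile(
    r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)'
)

def summarize(lines):
    """Count requests per path and per status code; also count unparseable lines."""
    paths, statuses, skipped = Counter(), Counter(), 0
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            skipped += 1
            continue
        _ip, _time, _method, path, status, _size = match.groups()
        paths[path] += 1
        statuses[status] += 1
    return paths, statuses, skipped

# Hypothetical log lines.
sample_log = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '203.0.113.9 - - [10/Oct/2023:13:56:01 +0000] "GET /missing HTTP/1.1" 404 153',
    '203.0.113.5 - - [10/Oct/2023:13:57:12 +0000] "GET /index.html HTTP/1.1" 200 2326',
    'not a log line',
]
paths, statuses, skipped = summarize(sample_log)
print(paths, statuses, skipped)
```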

Big Data Project using Hadoop with Source Code for Web Server Log Processing 

This section will provide you with a list of projects that utilize Apache Spark for their implementation.

4. Analysis of Twitter Sentiments Using Spark Streaming

Sentiment analysis is another interesting big data project topic. It deals with determining whether a given opinion is positive, negative, or neutral. For a business, knowing how a group of people reacts to a new product launch or a new event can help determine the profitability of the product, and understanding how customers feel can help the business extend its reach. From a political standpoint, the sentiments of the crowd toward a candidate or toward a decision taken by a party can help determine what keeps a specific group of people happy and satisfied. You can use Twitter sentiments to predict election results as well.

Sentiment Analysis Big Data Project

Sentiment analysis has to be done on a large dataset, since Twitter has over 180 million monetizable daily active users (https://www.businessofapps.com/data/twitter-statistics/). The analysis also has to be done in real time. Spark Streaming can be used to gather data from Twitter in real time. NLP (Natural Language Processing) models are needed for the sentiment analysis, and they have to be trained on prior datasets. Because it involves NLP, sentiment analysis is one of the more advanced projects that showcase the use of Big Data.
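
To show the shape of the problem without a cluster or a trained model, the sketch below swaps in a tiny word-list classifier and a plain Python generator in place of an NLP model and Spark Streaming. The word lists and sample messages are hypothetical, and this approach is far cruder than what a real project would use.

```python
from collections import Counter

# Tiny hand-made word lists (hypothetical); a real project would use a trained NLP model.
POSITIVE = {"love", "great", "good", "happy"}
NEGATIVE = {"hate", "bad", "terrible", "sad"}

def classify(text):
    """Label a message as positive, negative, or neutral by counting lexicon words."""
    words = [word.strip(".,!?").lower() for word in text.split()]
    score = sum(word in POSITIVE for word in words) - sum(word in NEGATIVE for word in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def running_tally(messages):
    """Consume messages one at a time and yield the updated label counts after each."""
    tally = Counter()
    for message in messages:
        tally[classify(message)] += 1
        yield dict(tally)

# Hypothetical messages standing in for a live stream.
sample_stream = [
    "I love this phone, great battery!",
    "Terrible update, I hate it",
    "Launch is tomorrow",
]
snapshots = list(running_tally(sample_stream))
print(snapshots[-1])
```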

Access Big Data Project Solution to Twitter Sentiment Analysis

5. Real-time Analysis of Log-entries from Applications Using Streaming Architectures

If you are looking to practice and get your hands dirty with a real-time big data project, then this project should be on your list. Whereas web server log processing handles data in batches, applications that stream data produce log files that have to be processed in real time for better analysis. Real-time analysis of streaming behavior gives more insight into customer behavior and can help find more content to keep users engaged. It can also help detect a security breach so that the necessary action can be taken immediately. Many social media networks rely on real-time analysis of the content streamed by users on their applications. Spark includes a streaming component, Spark Streaming, that can process real-time streaming data.
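
The core idea behind this kind of real-time alerting is a sliding time window. The sketch below is a single-process illustration of that idea, not Spark Streaming code; the window length, threshold, and log entries are hypothetical, and entries are assumed to arrive in timestamp order.

```python
from collections import deque

class ErrorRateMonitor:
    """Track ERROR-level log entries inside a sliding time window."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.error_times = deque()

    def observe(self, timestamp, level):
        """Record one log entry; return True when the alert threshold is reached."""
        if level == "ERROR":
            self.error_times.append(timestamp)
        # Drop errors that have fallen out of the window.
        while self.error_times and self.error_times[0] <= timestamp - self.window_seconds:
            self.error_times.popleft()
        return len(self.error_times) >= self.threshold

# Hypothetical entries as (seconds since start, log level).
entries = [(0, "INFO"), (10, "ERROR"), (20, "ERROR"), (30, "ERROR"), (100, "INFO")]
monitor = ErrorRateMonitor(window_seconds=60, threshold=3)
alerts = [monitor.observe(timestamp, level) for timestamp, level in entries]
print(alerts)
```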

Access Big Data Spark Project Solution to Real-time Analysis of log-entries from applications using Streaming Architecture

6. Analysis of Crime Datasets

Analysis of crimes such as shootings, robberies, and murders can result in finding trends that can be used to keep the police alert for the likelihood of crimes that can happen in a given area. These trends can help to come up with a more strategized and optimal planning approach to selecting police stations and stationing personnel. 

With access to CCTV surveillance in real-time, behavior detection can help identify suspicious activities. Similarly, facial recognition software can play a bigger role in identifying criminals. A basic analysis of a crime dataset is one of the ideal Big Data projects for students. However, it can be made more complex by adding in the prediction of crime and facial recognition in places where it is required.

Big Data Analytics Projects for Students on Chicago Crime Data Analysis with Source Code

In this section, you will find big data projects that rely on cloud service providers such as AWS and GCP.

7. Build a Scalable Event-Based GCP Data Pipeline using DataFlow

Suppose you are running an eCommerce website, and a customer places an order. In that case, you must inform the warehouse team so that they can check stock availability and commit to fulfilling the order. After that, the parcel has to be assigned to a delivery firm so it can be shipped to the customer. In such scenarios, data-driven integration becomes less suitable, and event-based data integration is the better choice.

This project will teach you how to design and implement an event-based data integration pipeline on the Google Cloud Platform by processing data using DataFlow.
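The order scenario above can be sketched as a chain of events, where each step reacts to the previous one instead of polling for new data. This is an in-process illustration only and does not use any GCP service; the `EventBus` class, the topic names, and the inventory dictionary are all hypothetical.

```python
from collections import defaultdict


class EventBus:
    """A tiny in-process publish/subscribe bus."""

    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.handlers[topic].append(handler)

    def publish(self, topic, event):
        for handler in list(self.handlers[topic]):
            handler(event)


bus = EventBus()
log = []
stock = {"book": 1}  # hypothetical inventory


def check_stock(order):
    # Warehouse step: confirm availability, then emit the next event.
    if stock.get(order["item"], 0) > 0:
        stock[order["item"]] -= 1
        log.append(f"reserved {order['item']} for order {order['id']}")
        bus.publish("order_confirmed", order)
    else:
        log.append(f"order {order['id']} rejected: out of stock")


def assign_courier(order):
    # Delivery step: runs only after the warehouse has confirmed the order.
    log.append(f"order {order['id']} assigned to courier")


bus.subscribe("order_placed", check_stock)
bus.subscribe("order_confirmed", assign_courier)

bus.publish("order_placed", {"id": 1, "item": "book"})
bus.publish("order_placed", {"id": 2, "item": "book"})
print("\n".join(log))
```

In a managed cloud pipeline, a messaging service would play the role of the bus and each handler would run as a separate, independently scalable component.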

Data Description: You will use the Covid-19 dataset (COVID-19 Cases.csv) from data.world for this project, which contains a few of the following attributes:

Language Used: Python 3.7

Services: Cloud Composer, Google Cloud Storage (GCS), Pub/Sub, Cloud Functions, BigQuery, BigTable

Big Data Project with Source Code: Build a Scalable Event-Based GCP Data Pipeline using DataFlow  

8. Topic Modeling

The future is AI! You must have come across similar quotes about artificial intelligence (AI). Initially, most people found it difficult to believe that could be true. Still, we are witnessing top multinational companies drift towards automating tasks using machine learning tools. 

Understand the reason behind this drift by working on one of our repository's most practical data engineering project examples.

Project Objective: Understand the end-to-end implementation of machine learning operations (MLOps) using cloud computing.

Learnings from the Project: This project will introduce you to various applications of AWS services. You will learn how to convert an ML application to a Flask application and deploy it using the Gunicorn web server. You will implement this project solution in AWS CodeBuild. This project will also help you understand ECS cluster task definitions.

Tech Stack:

Language: Python

Libraries: Flask, gunicorn, scipy, nltk, tqdm, numpy, joblib, pandas, scikit-learn, boto3

Services: Flask, Docker, AWS, Gunicorn

Source Code: MLOps AWS Project on Topic Modeling using Gunicorn Flask

9. MLOps on GCP Project for Autoregression using uWSGI Flask

Here is a project that combines Machine Learning Operations (MLOps) and Google Cloud Platform (GCP). As companies are switching to automation using machine learning algorithms, they have realized hardware plays a crucial role. Thus, many cloud service providers have come up to help such companies overcome their hardware limitations. Therefore, we have added this project to our repository to assist you with the end-to-end deployment of a machine learning project.

Project Objective: Deploying the moving average time-series machine-learning model on the cloud using GCP and Flask.
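As a rough illustration of the kind of model being deployed, the sketch below forecasts the next value of a series as the mean of its most recent observations. This is a simple rolling-mean forecast, which is an assumption on our part: the project may instead use a statistical moving-average or autoregressive model, and the function names and sample data are invented.

```python
def moving_average_forecast(series, window):
    """Forecast the next value as the mean of the last `window` observations."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return sum(series[-window:]) / window


def rolling_means(series, window):
    """Return the moving average at every position where a full window exists."""
    return [
        sum(series[i - window:i]) / window
        for i in range(window, len(series) + 1)
    ]


daily_orders = [10, 12, 14, 16, 18]  # made-up sample data
print(rolling_means(daily_orders, 3))            # [12.0, 14.0, 16.0]
print(moving_average_forecast(daily_orders, 3))  # 16.0
```

Once a function like this sits behind a web endpoint, deployment becomes a matter of packaging it in a container and running it on the cloud platform.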

Learnings from the Project: You will work with Flask and uWSGI model files in this project. You will learn about creating Docker Images and Kubernetes architecture. You will also get to explore different components of GCP and their significance. You will understand how to clone the git repository with the source repository. Flask and Kubernetes deployment will also be discussed in this project.

Tech Stack: Language - Python

Services - GCP, uWSGI, Flask, Kubernetes, Docker

This section has good big data project ideas for graduate students who have enrolled in a master's course.

10. Real-time Traffic Analysis

Traffic is an issue in many major cities, especially during the busier hours of the day. If traffic is monitored in real time over popular and alternate routes, steps can be taken to reduce congestion on some roads. Real-time traffic analysis can also be used to program traffic lights at junctions so that they stay green for longer on roads with heavier traffic and for less time on roads showing less vehicular movement at a given time. It can also help businesses manage their logistics and help working people plan their commute. Deep learning concepts can be used to analyze this kind of dataset properly.
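The traffic-light idea can be sketched as a proportional split of one signal cycle. The example is a toy model under stated assumptions: the function name `allocate_green_time`, the 10-second minimum green time, and the vehicle counts are invented, and real signal control has to account for pedestrians, turning lanes, and safety intervals.

```python
def allocate_green_time(vehicle_counts, cycle_seconds, min_green=10):
    """Split one signal cycle across approaches in proportion to observed traffic."""
    if min_green * len(vehicle_counts) > cycle_seconds:
        raise ValueError("cycle too short for the minimum green times")
    # Every approach gets the minimum; the rest is shared by traffic volume.
    spare = cycle_seconds - min_green * len(vehicle_counts)
    total = sum(vehicle_counts.values())
    plan = {}
    for road, count in vehicle_counts.items():
        share = count / total if total else 1 / len(vehicle_counts)
        plan[road] = min_green + spare * share
    return plan


counts = {"main_street": 60, "side_road": 20}  # vehicles seen in the last interval
plan = allocate_green_time(counts, cycle_seconds=100)
print(plan)  # {'main_street': 70.0, 'side_road': 30.0}
```

Recomputing the plan every few minutes from fresh counts is what turns this from a fixed timetable into a real-time system.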

11. Health Status Prediction

“Health is wealth” is a prevalent saying, and rightly so: there cannot be wealth unless one is healthy enough to enjoy worldly pleasures. Many diseases have risk factors that can be genetic, environmental, or dietary, and some are more common in a specific age group or sex, or in certain races or areas. By gathering datasets of this information for particular diseases, e.g., breast cancer, Parkinson’s disease, and diabetes, the number of risk factors present can be used to estimate the probability of the onset of one of these conditions.

In cases where the risk factors are not already known, analysis of the datasets can be used to identify patterns of risk factors and hence predict the likelihood of onset accordingly. The level of complexity could vary depending on the type of analysis that has to be done for different diseases. Nevertheless, since prediction tools have to be applied, this is not a beginner-level big data project idea.

12. Analysis of Tourist Behavior

Tourism is a large sector that provides a livelihood for many people and can significantly affect a country's economy. Not all tourists behave similarly, simply because individuals have different preferences. Analyzing this behavior based on decision-making, perception, choice of destination, and level of satisfaction can help travelers and locals have a more wholesome experience. Behavior analysis, like sentiment analysis, is one of the more advanced project ideas in the Big Data field.

13. Detection of Fake News on Social Media

With the popularity of social media, a major concern is the spread of fake news on various sites. Even worse, this misinformation tends to spread even faster than factual information. According to Wikipedia, fake news can be visual-based, which refers to images, videos, and even graphical representations of data, or linguistics-based, which refers to fake news in the form of text or a string of characters. Different cues are used based on the type of news to differentiate fake news from real. A site like Twitter has 330 million users, while Facebook has 2.8 billion users. A large amount of data will make rounds on these sites, which must be processed to determine the post's validity. Various data models based on machine learning techniques and computational methods based on NLP will have to be used to build an algorithm that can be used to detect fake news on social media.

14. Prediction of Calamities in a Given Area

Certain calamities, such as landslides and wildfires, occur more frequently during a particular season and in certain areas. Using certain geospatial technologies such as remote sensing and GIS (Geographic Information System) models makes it possible to monitor areas prone to these calamities and identify triggers that lead to such issues. 

If calamities can be predicted more accurately, steps can be taken to protect the residents from them, contain the disasters, and maybe even prevent them in the first place. Historical landslide data has to be analyzed, while at the same time, on-site ground monitoring data has to be collected alongside remote sensing. The sooner the calamity can be identified, the easier it is to contain the harm. The need for knowledge and application of GIS adds to the complexity of this Big Data project.

15. Generating Image Captions

With the emergence of social media and the importance of digital marketing, it has become essential for businesses to upload engaging content. Catchy images are a requirement, but captions for images have to be added to describe them. The additional use of hashtags and attention-drawing captions can help a little more to reach the correct target audience. Large datasets have to be handled which correlate images and captions. 

This involves image processing and deep learning to understand the image and artificial intelligence to generate relevant but appealing captions. Python can be used to implement this Big Data project. Image caption generation cannot exactly be considered a beginner-level Big Data project idea. It is probably better to get some exposure to one of the other projects before proceeding with this.

16. Credit Card Fraud Detection

The goal is to identify fraudulent credit card transactions so that a customer is not billed for an item they did not purchase. This can be challenging since the datasets are huge, and detection has to happen as soon as possible so that fraudsters cannot continue to make purchases. Another challenge is data availability: the data is supposed to be primarily private, yet since this project involves machine learning, the results will be more accurate with a larger dataset. Credit card fraud detection is helpful for a business since customers are likely to trust companies with better fraud detection applications, as they will not be billed for purchases made by someone else. Fraud detection can be considered one of the most common Big Data project ideas for beginners and students.
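A very simple baseline for spotting suspicious transactions is to compare each new amount with the customer's own spending history. The sketch below flags amounts that lie far above the historical mean using a z-score. It is a statistical stand-in, not the machine learning approach a full project would use, and the function name `flag_unusual`, the threshold of 3, and the sample amounts are assumptions.

```python
import statistics


def flag_unusual(history, new_amounts, threshold=3.0):
    """Flag amounts more than `threshold` standard deviations above the mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    flagged = []
    for amount in new_amounts:
        # With zero spread in the history, no z-score can be computed.
        z = (amount - mean) / stdev if stdev else 0.0
        if z > threshold:
            flagged.append(amount)
    return flagged


history = [20, 25, 22, 30, 28, 24, 26]  # made-up past purchases
print(flag_unusual(history, [27, 31, 400]))  # [400]
```

A production system would combine many more signals, such as merchant, location, and time of day, and would learn the decision boundary from labeled examples.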

If you are looking for big data project examples that are fun to implement then do not miss out on this section.

17. GIS Analytics for Better Waste Management

Due to urbanization and population growth, large amounts of waste are being generated globally. Improper waste management is a hazard not only to the environment but also to us. Waste management involves the process of handling, transporting, storing, collecting, recycling, and disposing of the waste generated. Optimal routing of solid waste collection trucks can be done using GIS modeling to ensure that waste is picked up, transferred to a transfer site, and reaches the landfills or recycling plants most efficiently. GIS modeling can also be used to select the best sites for landfills. The location and placement of garbage bins within city localities must also be analyzed. 
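Routing a collection truck can be approximated with a greedy heuristic: from the current position, always drive to the closest bin that has not been emptied yet. The sketch below uses straight-line distances between invented coordinates, which is an assumption; a GIS-based solution would use the actual road network, and the greedy route is not guaranteed to be the shortest one.

```python
import math


def nearest_neighbor_route(depot, stops):
    """Greedy route: always drive to the closest unvisited stop, then return."""
    route, current, remaining = [depot], depot, list(stops)
    while remaining:
        nxt = min(remaining, key=lambda p: math.dist(current, p))
        remaining.remove(nxt)
        route.append(nxt)
        current = nxt
    route.append(depot)
    return route


def route_length(route):
    """Total straight-line length of a route given as a list of points."""
    return sum(math.dist(a, b) for a, b in zip(route, route[1:]))


depot = (0, 0)
bins = [(0, 5), (3, 0), (3, 4)]  # hypothetical bin coordinates
route = nearest_neighbor_route(depot, bins)
print(route)
print(round(route_length(route), 2))
```

The same function can be rerun with a different depot location to compare candidate sites for a transfer station.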

18. Customized Programs for Students

We all tend to have different strengths and paces of learning. There are different kinds of intelligence, and the curriculum only focuses on a few things. Data analytics can help modify academic programs to nurture students better. Programs can be designed based on a student’s attention span and can be modified according to an individual’s pace, which can be different for different subjects. E.g., one student may find it easier to grasp language subjects but struggle with mathematical concepts. In contrast, another might find it easier to work with math but not be able to breeze through language subjects. Customized programs can boost students’ morale, which could also reduce the number of dropouts. Analysis of a student’s strong subjects, monitoring their attention span, and their responses to specific topics in a subject can help build the dataset to create these customized programs.

19. Real-time Tracking of Vehicles

Transportation plays a significant role in many activities. Every day, goods have to be shipped across cities and countries; kids commute to school, and employees have to get to work. Some of these modes might have to be closely monitored for safety and tracking purposes. I’m sure parents would love to know if their children’s school buses were delayed while coming back from school for some reason. 

Taxi applications have to keep track of their users to ensure the safety of the drivers and the users. Tracking has to be done in real time, as the vehicles are continuously on the move, so there is a continuous stream of data flowing in. This data has to be processed so that information on how the vehicles move is available, both to improve routes where required and simply to know the general whereabouts of the vehicles.

20. Analysis of Network Traffic and Call Data Records

Large volumes of data make the rounds in the telecommunications industry. However, very little of this data is currently being used to improve the business. According to a MindCommerce study: “An average telecom operator generates billions of records per day, and data should be analyzed in real or near real-time to gain maximum benefit.”

The main challenge here is that these large amounts of data must be processed in real-time. With big data analysis, telecom industries can make decisions that can improve the customer experience by monitoring the network traffic. Issues such as call drops and network interruptions must be closely monitored to be addressed accordingly. By evaluating the usage patterns of customers, better service plans can be designed to meet these required usage needs. The complexity and tools used could vary based on the usage requirements of this project.

This section contains project ideas in big data that are primarily open-source and have been developed by Apache.

21. Apache Hadoop

Apache Hadoop is an open-source big data processing framework that allows distributed storage and processing of large datasets across clusters of commodity hardware. It provides a scalable, reliable, and cost-effective solution for processing and analyzing big data.

22. Apache Spark

Apache Spark is an open-source big data processing engine that provides high-speed processing for large-scale data tasks. It offers a unified analytics platform for batch processing, real-time processing, machine learning, and graph processing.

23. Apache NiFi

Apache NiFi is an open-source data integration tool that enables users to easily and securely transfer data between systems, databases, and applications. It provides a web-based user interface for creating, scheduling, and monitoring data flows, making it easy to manage and automate data integration tasks.

24. Apache Flink

Apache Flink is an open-source big data processing framework that provides scalable, high-throughput, and fault-tolerant data stream processing capabilities. It offers low-latency data processing and provides APIs for batch processing, stream processing, and graph processing.

25. Apache Storm

Apache Storm is an open-source distributed real-time processing system that provides scalable and fault-tolerant stream processing capabilities. It allows users to process large amounts of data in real-time and provides APIs for creating data pipelines and processing data streams.

This section has projects on big data along with links to their source code on GitHub.

26. Fruit Image Classification

This project aims to make a mobile application that lets users take pictures of fruits and get details about them for fruit harvesting. The project develops a data processing chain in a big data environment using Amazon Web Services (AWS) cloud tools, including steps like dimensionality reduction and data preprocessing, and implements a fruit image classification engine.

The project involves generating PySpark scripts and utilizing the AWS cloud to benefit from a Big Data architecture (EC2, S3, IAM) built on an EC2 Linux server. This project also uses Databricks since it is compatible with AWS.

Source Code: Fruit Image Classification

27. Airline Customer Service App

In this project, you will build a web application that uses machine learning and Azure Databricks to forecast travel delays using weather data and airline delay statistics. Planning a bulk data import operation is the first step in the project. Next comes preparation, which includes cleaning and preparing the data for testing and building your machine learning model.

This project will teach you how to deploy the trained model to Docker containers for on-demand predictions after storing it in Azure Machine Learning Model Management. It transfers data using Azure Data Factory (ADF) and summarises data using Azure Databricks and Spark SQL. The project uses Power BI to visualize batch forecasts.

Source Code: Airline Customer Service App

28. Criminal Network Analysis

This fascinating big data project seeks to find patterns to predict and detect links in a dynamic criminal network. This project uses a stream processing technique to extract relevant information as soon as data is generated since the criminal network is a dynamic social graph. It also suggests three brand-new social network similarity metrics for criminal link discovery and prediction. The next step is to develop a flexible data stream processing application using the Apache Flink framework, which enables the deployment and evaluation of the newly proposed and existing metrics.
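To give a feel for what a similarity metric for link prediction looks like, the sketch below ranks unconnected pairs of nodes by the Jaccard coefficient of their neighbor sets. Jaccard is a standard, existing metric and is not one of the three new metrics the project proposes; the graph, node names, and function names are hypothetical, and the example runs on a static snapshot rather than on a Flink data stream.

```python
from itertools import combinations


def jaccard(graph, a, b):
    """Jaccard coefficient of the neighbor sets of nodes a and b."""
    union = graph[a] | graph[b]
    return len(graph[a] & graph[b]) / len(union) if union else 0.0


def rank_missing_links(graph):
    """Score every unconnected pair; higher scores suggest a likelier hidden link."""
    scores = []
    for a, b in combinations(sorted(graph), 2):
        if b not in graph[a]:
            scores.append((jaccard(graph, a, b), a, b))
    return sorted(scores, reverse=True)


# Hypothetical undirected network stored as adjacency sets.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"B", "C", "E"},
    "E": {"D"},
}

ranked = rank_missing_links(graph)
for score, a, b in ranked:
    print(f"{a}-{b}: {score:.2f}")
```

Here the pair A-D scores highest because the two nodes share most of their contacts. In a streaming version, the scores would be updated incrementally as new edges arrive.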

Source Code: Criminal Network Analysis

Trying out the big data project ideas mentioned in this blog will help you get used to the popular tools in the industry. But these projects are not enough if you are planning to land a job in the big data industry. If you are curious about what else will get you closer to landing your dream job, we highly recommend you check out ProjectPro. ProjectPro hosts a repository of solved projects in Data Science and Big Data prepared by experts in the industry. It offers a subscription to that repository, which contains solutions in the form of guided videos along with supporting documentation to help you understand the projects end-to-end. So, don’t wait any longer to get your hands dirty with ProjectPro projects, and subscribe to the repository today!

1. Why are big data projects important?

Big data projects are important as they will help you to master the necessary big data skills for any job role in the relevant field. These days, most businesses use big data to understand what their customers want, their best customers, and why individuals select specific items. This indicates a huge demand for big data experts in every industry, and you must add some good big data projects to your portfolio to stay ahead of your competitors.

2. What are some good big data projects?

Design a Network Crawler by Mining GitHub Social Profiles - In this big data project, you'll work on a Spark GraphX algorithm and a network crawler to mine the relationships between people around various GitHub projects.

Visualize Daily Wikipedia Trends using Hadoop

Modeling & Thinking in Graphs(Neo4J) using Movielens Dataset - You will reconstruct the movielens dataset in a graph structure and use that structure to answer queries in various ways in this Neo4j big data project.

3. How long does it take to complete a big data project?

A big data project might take a few hours to hundreds of days to complete. It depends on various factors such as the type of data you are using, its size, where it's stored, whether it is easily accessible, whether you need to perform any considerable amount of ETL processing on the data, etc. 

About the Author

ProjectPro is the only online platform designed to help professionals gain practical, hands-on experience in big data, data engineering, data science, and machine learning related technologies. Having over 270+ reusable project templates in data science and big data with step-by-step walkthroughs,

© 2024 Iconiq Inc.

Title: Open Research Issues and Tools for Visualization and Big Data Analytics

Abstract: The new age of digital growth has marked all fields. This technological evolution has affected data flows, which have expanded so rapidly over the last decade that traditional data processing can no longer keep up with the flow of massive data. In this context, implementing a big data analytics system becomes crucial to make big data more relevant and valuable. With these new opportunities come new problems of processing very high data volumes, which require companies to look for solutions specialized in big data. These solutions are based on techniques that process these masses of information to facilitate decision-making. Among them is data visualization, which makes big data more intelligible by providing accurate illustrations that are accessible to all. This paper examines the big data visualization project based on its characteristics, benefits, challenges and issues. The project also resulted in the provision of tools for beginners as well as experienced users.

Measuring benefits from big data analytics projects: an action research study

  • Original Article
  • Published: 28 February 2023
  • Volume 21, pages 323–352 (2023)

  • Maria Hoffmann Jensen   ORCID: orcid.org/0000-0003-1038-7029 1 ,
  • John Stouby Persson 2 &
  • Peter Axel Nielsen 2  

Big data analytics (BDA) projects are expected to provide organizations with several benefits once the project closes. Nevertheless, many BDA projects are unsuccessful because the benefits did not materialize as expected. Organizations can manage the expected benefits by measuring them, yet very few organizations actually measure benefits after project development, and little has been written about BDA benefits measurements that extend beyond those typically identified in the project business case. This study examines how we should establish measures for BDA benefits in the context of a large wind turbine manufacturer investing in BDA to improve its practices when defining BDA benefits measures. We present lessons learned from our action research that were found useful in establishing BDA benefit measurements. There are three lessons on (1) change, (2) specification of who, and (3) explicitness in establishing a useful BDA benefit measure. We contribute to BDA benefits realization by proposing the lessons to establish BDA benefits measurements. Finally, we discuss the lessons and contributions related to research on BDA value creation and benefits management.


Veiga J, Expósito RR, Touriño J (2018) Performance evaluation of big data analysis. In: Sakr S, Zomaya A (eds) Encyclopedia of Big Data Technologies, Springer, Cham, pp 1265–1271. https://doi.org/10.1007/978-3-319-63962-8_143-1

Vries A, de; C.-M. Chituc and F. Pommeé. (2016) Towards identifying the business value of big data in a digital business ecosystem: a case study from the financial services industry. Lecture Notes in Bus Inform Process 255:28–40

Wamba SF, Gunasekaran A, Akter S, fan RenDubeyChilde SJRSJ (2017) Big data analytics and firm performance: effects of dynamic capabilities. J Bus Res 70:356–365

Ward J, Daniel E (2012) Benefits management. Wiley

Waring T, Casey R, Robson A (2018) Benefits realisation from IT-enabled innovation: a capability challenge for NHS English acute hospital trusts? Inf Technol People 31(3):618–645

Download references

The authors declare that the data supporting the findings in this study are available within the article in form of quotations. The data are not publicly available due to these containing sensitive information from Vestas Wind Systems A/S.

Author information

Authors and affiliations.

Department of Business Development and Technology, Aarhus University, BTECH, Birk Centerpark 15, Herning, Denmark

Maria Hoffmann Jensen

Department of Computer Science, Aalborg University, Selma Lagerløfs Vej, 300 9220, Aalborg, Denmark

John Stouby Persson & Peter Axel Nielsen


Corresponding author

Correspondence to Maria Hoffmann Jensen.

Ethics declarations

Conflict of interest.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. The authors have no conflict of interest to declare.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Jensen, M.H., Persson, J.S. & Nielsen, P.A. Measuring benefits from big data analytics projects: an action research study. Inf Syst E-Bus Manage 21, 323–352 (2023). https://doi.org/10.1007/s10257-022-00620-0


Received: 01 August 2022

Revised: 21 November 2022

Accepted: 19 December 2022

Published: 28 February 2023

Issue Date: June 2023

DOI: https://doi.org/10.1007/s10257-022-00620-0


  • Big data analytics benefits
  • Measuring of benefits
  • Big data analytics projects
  • Benefits management

Research Hub

Big Data Projects


Big Data Projects studies the application of statistical modeling and AI technologies to healthcare.

Mohsen Bayati studies probabilistic and statistical models for decision-making with large-scale and complex data and applies them to healthcare problems. Currently, an area of focus is AI’s use in oncology, and multi-functional research efforts are underway between the GSB and the School of Medicine. For example, AI is the right technology for oncology treatment decision-making methods because of its ability to synthesize rich patient data into prospective individual-level actionable recommendations and retrospectively learn from those decisions at scale.

However, the current set of AI technologies is focused heavily on detection and diagnosis, and major challenges remain in accessing and using the rich set of patient data for the oncologist’s patient-specific treatment decision. The clinical workflow then becomes mainly experience-driven, leading to many care disparities and many hand-offs between oncology specialists. Dr. Bayati’s research supports the development of an oncologist-centric decision support tool that pushes oncological decision-making and AI research further, in a multidisciplinary way, by using AI for day-to-day oncology treatment decisions. He also studies graphical models and message-passing algorithms.

Mohsen Bayati, Faculty Director


List of Best Research and Thesis Topic Ideas for Data Science in 2022

In an era driven by digital and technological transformation, businesses actively seek skilled data science professionals capable of leveraging data insights to enhance business productivity and achieve organizational objectives. In keeping with the increasing demand for data science professionals, universities offer various data science and big data courses to prepare students for the tech industry. Research projects are a crucial part of these programs, and a well-executed data science project can make your CV more robust and compelling. A broad range of data science topics offers exciting possibilities for research, but choosing a data science research topic can be a real challenge for students. After all, a good research project relies first and foremost on data analytics research topics that draw upon both mono-disciplinary and multi-disciplinary research to explore real-world applications.

As one of the leading online master's and PhD dissertation writing services, we are geared to assist students through the entire research process, from initial conception to final execution, to ensure that you have a fulfilling and enriching research experience. These resources are also helpful for students who are taking online classes.

By taking advantage of our best digital marketing research topics in data science you can be assured of producing an innovative research project that will impress your research professors and make a huge difference in attracting the right employers.


Data science thesis topics

We have compiled a list of data science research topics that students can use in data science projects in 2022. Our team of professional data experts has brought together master's or MBA thesis topics in data science that cater to the core areas driving the field of data science and big data, to relieve your research anxieties and provide a solid grounding for an interesting research project. The article features data science thesis ideas that cover a broad research agenda for future data science. These ideas have been drawn from the 8 V's of big data, namely Volume, Value, Veracity, Visualization, Variety, Velocity, Viscosity, and Virality, which provide interesting and challenging research areas for prospective researchers in their master's or PhD theses. Overall, the general big data research topics can be divided into distinct categories to facilitate the research topic selection process.

  • Security and privacy issues
  • Cloud Computing Platforms for Big Data Adoption and Analytics
  • Real-time data analytics for processing of image, video and text
  • Modeling uncertainty


The article will also guide students engaged in doctoral research by introducing them to an outstanding list of data science thesis topics that can lead to major real-time applications of big data analytics in your research projects.

  • Intelligent traffic control: gathering and monitoring traffic information using CCTV images.
  • Asymmetric protected storage methodology over multi-cloud service providers in Big data.
  • Leveraging disseminated data over big data analytics environment.
  • Internet of Things.
  • Large-scale data system and anomaly detection.

What makes us a unique research service for your research needs?

We offer all-round research services with a distinguished track record in helping students secure their desired grades in big data analytics research projects, and hence pave the way for a promising career ahead. These are the features that set us apart in the market for research services and allow us to deal effectively with all significant issues in your research.

  • Plagiarism-free: We strictly adhere to a non-plagiarism policy in all our research work to provide you with well-written, original content with a low similarity index, to maximize the chances of acceptance of your research submissions.
  • Publication: We don’t just suggest PhD data science research topics; our PhD consultancy services take your research to the next level by supporting its publication in well-reputed journals. A PhD thesis is indispensable for a PhD degree, and with our PhD thesis services that tackle all aspects of research writing and cater to the essential requirements of journals, we will bring you closer to your dream of earning a PhD in the field of data analytics.
  • Research ethics: Solid research ethics lie at the core of our services, where we actively seek to protect the privacy and confidentiality of the technical and personal information of our valued customers.
  • Research experience: We take pride in our world-class team of computing industry professionals equipped with the expertise and experience to assist in choosing data science research topics and in the subsequent phases of research, including finding solutions, code development and final manuscript writing.
  • Business ethics: We are driven by a business philosophy that is wholly committed to achieving total customer satisfaction by providing constant online and offline support and timely submissions, so that you can keep track of the progress of your research.

Now, we’ll proceed to cover specific research problems encompassing both data analytics research topics and big data thesis topics that have applications across multiple domains.


Multi-modal Transfer Learning for Cross-Modal Information Retrieval

Aim and objectives.

The research aims to examine and explore the use of the cross-modal retrieval (CMR) approach in bringing about a flexible retrieval experience by combining data across different modalities to make use of abundant multimedia data.

  • Develop methods to enable learning across different modalities in shared cross-modal spaces comprising texts and images, and consider the limitations of existing cross-modal retrieval algorithms.
  • Investigate the presence and effects of bias in cross-modal transfer learning and suggest strategies for bias detection and mitigation.
  • Develop a tool with query expansion and relevance feedback capabilities to facilitate search and retrieval of multi-modal data.
  • Investigate the methods of multi modal learning and elaborate on the importance of multi-modal deep learning to provide a comprehensive learning experience.
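
The idea of a shared cross-modal space in the objectives above can be sketched as follows. This is a minimal illustration, not a retrieval system: the embedding vectors, file names, and the `retrieve` helper are hypothetical, and in practice the vectors would come from trained text and image encoders that map both modalities into the same space.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, index, top_k=2):
    """Rank indexed items by similarity to a query from another modality."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Hypothetical image embeddings in a shared 3-dimensional space (made-up values).
IMAGE_INDEX = {
    "photo_of_dog.jpg": [0.9, 0.1, 0.0],
    "photo_of_car.jpg": [0.0, 0.2, 0.9],
    "photo_of_cat.jpg": [0.7, 0.6, 0.1],
}

# Stands in for the embedding of a text query; also a made-up value.
text_query = [1.0, 0.2, 0.0]
print(retrieve(text_query, IMAGE_INDEX))
```

Because text and images share one vector space, the same ranking function serves text-to-image and image-to-text queries; only the encoders differ.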

The Role of Machine Learning in Facilitating the Implementation of Scientific Computing and Software Engineering

  • Evaluate how machine learning leads to improvements in computational tools and thus aids in the implementation of scientific computing.
  • Evaluating the effectiveness of machine learning in solving complex problems and improving the efficiency of scientific computing and software engineering processes.
  • Assessing the potential benefits and challenges of using machine learning in these fields, including factors such as cost, accuracy, and scalability.
  • Examining the ethical and social implications of using machine learning in scientific computing and software engineering, such as issues related to bias, transparency, and accountability.

Trustworthy AI

The research aims to explore the crucial role of data science in advancing scientific goals and solving problems as well as the implications involved in use of AI systems especially with respect to ethical concerns.

  • Investigate the value of digital infrastructures available through open data in aiding the sharing and interlinking of data for enhanced global collaborative research efforts.
  • Provide explanations of the outcomes of a machine learning model for a meaningful interpretation, to build trust among users about the reliability and authenticity of data.
  • Investigate how formal models can be used to verify and establish the efficacy of the results derived from probabilistic models.
  • Review the concept of Trustworthy computing as a relevant framework for addressing the ethical concerns associated with AI systems.

The Implementation of Data Science and its impact on the management environment and sustainability

The aim of the research is to demonstrate how data science and analytics can be leveraged in achieving sustainable development.

  • To examine the implementation of data science using data-driven decision-making tools
  • To evaluate the impact of modern information technology on management environment and sustainability.
  • To examine the use of data science in achieving more effective and efficient environment management.
  • To explore how data science and analytics can be used to achieve sustainability goals across the three dimensions: economic, social and environmental.

Big data analytics in healthcare systems

The aim of the research is to examine the application of big data analytics in creating smart healthcare systems and how it can lead to more efficient, accessible and cost-effective healthcare.

  • Identify the potential areas or opportunities in big data to transform the healthcare system, such as diagnosis, treatment planning, or drug development.
  • Assessing the potential benefits and challenges of using AI and deep learning in healthcare, including factors such as cost, efficiency, and accessibility
  • Evaluating the effectiveness of AI and deep learning in improving patient outcomes, such as reducing morbidity and mortality rates, improving accuracy and speed of diagnoses, or reducing medical errors
  • Examining the ethical and social implications of using AI and deep learning in healthcare, such as issues related to bias, privacy, and autonomy.

Large-Scale Data-Driven Financial Risk Assessment

The research aims to explore the possibilities offered by big data for consistent, real-time assessment of financial risks.

  • Investigate how the use of big data can help to identify and forecast risks that can harm a business.
  • Categorize the types of financial risks faced by companies.
  • Describe the importance of financial risk management for companies in business terms.
  • Train a machine learning model to classify transactions as fraudulent or genuine.
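
The last objective above can be illustrated with a small classifier. The following is a minimal sketch, assuming a logistic regression trained by batch gradient descent; the two features (transaction amount in thousands and a foreign-transaction flag) and all data values are made up for illustration, and a real study would use a labelled transaction dataset and a held-out test set.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, lr=0.1, epochs=2000):
    """Fit logistic regression weights with batch gradient descent."""
    n_features = len(features[0])
    n = len(features)
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss w.r.t. the linear score
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (fraudulent) or 0 (genuine) for one transaction."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0


# Made-up training data: [amount in thousands, is_foreign]; label 1 = fraudulent.
X = [[0.2, 0], [0.5, 0], [0.3, 1], [0.8, 0], [0.4, 0],
     [3.0, 1], [2.5, 1], [4.0, 0], [3.5, 1], [2.8, 0]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

weights, bias = train_logistic(X, y)
print(predict(weights, bias, [3.2, 1]), predict(weights, bias, [0.3, 0]))
```

Real fraud data is heavily imbalanced, so a thesis project would also need to consider class weighting or resampling and report precision and recall rather than accuracy alone.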

Scalable Architectures for Parallel Data Processing

Big data has exposed us to an ever-growing volume of data which cannot be handled through traditional data management and analysis systems. This has given rise to the use of scalable system architectures to efficiently process big data and exploit its true value. The research aims to analyze the current state of practice in scalable architectures and identify common patterns and techniques to design scalable architectures for parallel data processing.

  • To design and implement a prototype scalable architecture for parallel data processing
  • To evaluate the performance and scalability of the prototype architecture using benchmarks and real-world datasets
  • To compare the prototype architecture with existing solutions and identify its strengths and weaknesses
  • To evaluate the trade-offs and limitations of different scalable architectures for parallel data processing
  • To provide recommendations for the use of the prototype architecture in different scenarios, such as batch processing, stream processing, and interactive querying
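
A prototype of the kind described above often starts from the map-and-reduce pattern: split the data into partitions, process each partition independently, then merge the partial results. The sketch below is one hedged illustration of that pattern using a thread pool from the Python standard library; the function names are hypothetical, and a real prototype would more likely run the map step on separate processes or cluster nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, n_partitions):
    """Round-robin split so each worker receives a similar share of records."""
    return [records[i::n_partitions] for i in range(n_partitions)]


def map_partition(lines):
    """Map step: count words within a single partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(records, n_workers=4):
    """Run the map step concurrently, then reduce by merging partial counts."""
    partitions = split_into_partitions(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(map_partition, partitions))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # reduce step: add counts key by key
    return total


print(parallel_word_count(["big data", "big compute", "data data"], n_workers=2))
```

The reduce step works because word counts are additive across partitions; the same structure applies to any aggregation that can be merged from partial results, which is a useful criterion when evaluating the scalability of an architecture.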

Robotic manipulation modelling

The aim of this research is to develop and validate a model-based control approach for robotic manipulation of small, precise objects.

  • Develop a mathematical model of the robotic system that captures the dynamics of the manipulator and the grasped object.
  • Design a control algorithm that uses the developed model to achieve stable and accurate grasping of the object.
  • Test the proposed approach in simulation and validate the results through experiments with a physical robotic system.
  • Evaluate the performance of the proposed approach in terms of stability, accuracy, and robustness to uncertainties and perturbations.
  • Identify potential applications and areas for future work in the field of robotic manipulation for precision tasks.
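
The first three objectives above (model, controller, simulation) can be sketched on a deliberately simplified system. This assumes a one-dimensional point mass driven by a proportional-derivative (PD) controller, which is a stand-in chosen for illustration rather than a manipulator model; the gains, mass, and time step are made-up values.

```python
def simulate_pd_control(target, mass=1.0, kp=20.0, kd=9.0, dt=0.001, steps=5000):
    """Simulate a 1-D point mass moved to `target` by a PD controller.

    Model: mass * acceleration = force, integrated with small Euler steps.
    Controller: force = kp * position_error - kd * velocity.
    """
    position, velocity = 0.0, 0.0
    for _ in range(steps):
        error = target - position
        force = kp * error - kd * velocity
        acceleration = force / mass
        velocity += acceleration * dt
        position += velocity * dt
    return position


# Move 5 cm and report the final position after 5 simulated seconds.
final_position = simulate_pd_control(0.05)
print(abs(final_position - 0.05) < 1e-4)
```

With these gains the closed loop is overdamped, so the mass approaches the target without overshoot, which is the kind of behavior wanted when handling small, precise objects. Testing robustness, as the fourth objective asks, could start by varying `mass` while keeping the gains fixed.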

Big data analytics and its impacts on marketing strategy

The aim of this research is to investigate the impact of big data analytics on marketing strategy and to identify best practices for leveraging this technology to inform decision-making.

  • Review the literature on big data analytics and marketing strategy to identify key trends and challenges
  • Conduct a case study analysis of companies that have successfully integrated big data analytics into their marketing strategies
  • Identify the key factors that contribute to the effectiveness of big data analytics in marketing decision-making
  • Develop a framework for integrating big data analytics into marketing strategy.
  • Investigate the ethical implications of big data analytics in marketing and suggest best practices for responsible use of this technology.


Platforms for large scale data computing: big data analysis and acceptance

To investigate the performance and scalability of different large-scale data computing platforms.

  • To compare the features and capabilities of different platforms and determine which is most suitable for a given use case.
  • To identify best practices for using these platforms, including considerations for data management, security, and cost.
  • To explore the potential for integrating these platforms with other technologies and tools for data analysis and visualization.
  • To develop case studies or practical examples of how these platforms have been used to solve real-world data analysis challenges.

Distributed data clustering

Distributed data clustering can be a useful approach for analyzing and understanding complex datasets, as it allows for the identification of patterns and relationships that may not be immediately apparent.

To develop and evaluate new algorithms for distributed data clustering that are efficient and scalable.

  • To compare the performance and accuracy of different distributed data clustering algorithms on a variety of datasets.
  • To investigate the impact of different parameters and settings on the performance of distributed data clustering algorithms.
  • To explore the potential for integrating distributed data clustering with other machine learning and data analysis techniques.
  • To apply distributed data clustering to real-world problems and evaluate its effectiveness.
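
One common way to distribute clustering is to let each data partition compute local statistics and have a coordinator merge them. The sketch below assumes k-means with this scheme; it runs the partitions in a plain loop for clarity, whereas a real system would compute `local_stats` on separate workers. All names and data values are hypothetical.

```python
def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )


def local_stats(partition, centroids):
    """Worker step: per-centroid coordinate sums and counts for one partition."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in partition:
        k = nearest(point, centroids)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += point[d]
    return sums, counts


def distributed_kmeans(partitions, centroids, iterations=10):
    """Coordinator step: merge worker statistics and update the centroids."""
    centroids = [list(c) for c in centroids]
    dim = len(centroids[0])
    for _ in range(iterations):
        stats = [local_stats(part, centroids) for part in partitions]
        for k in range(len(centroids)):
            total = sum(counts[k] for _, counts in stats)
            if total == 0:
                continue  # keep an empty cluster's centroid unchanged
            centroids[k] = [
                sum(sums[k][d] for sums, _ in stats) / total for d in range(dim)
            ]
    return centroids


partitions = [[(0, 0), (0, 2)], [(10, 10), (10, 12)], [(0, 1), (10, 11)]]
print(distributed_kmeans(partitions, [(1, 1), (9, 9)]))
```

Only the sums and counts travel between workers and coordinator, not the raw points, so communication cost depends on the number of clusters rather than the dataset size. That trade-off is one of the parameters the objectives above propose to study.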

Analyzing and predicting urbanization patterns using GIS and data mining techniques

The aim of this project is to use GIS and data mining techniques to analyze and predict urbanization patterns in a specific region.

  • To collect and process relevant data on urbanization patterns, including population density, land use, and infrastructure development, using GIS tools.
  • To apply data mining techniques, such as clustering and regression analysis, to identify trends and patterns in the data.
  • To use the results of the data analysis to develop a predictive model for urbanization patterns in the region.
  • To present the results of the analysis and the predictive model in a clear and visually appealing way, using GIS maps and other visualization techniques.

Use of big data and IoT in the media industry

Big data and the Internet of Things (IoT) are emerging technologies that are transforming the way that information is collected, analyzed, and disseminated in the media sector. The aim of the research is to understand how big data and IoT are used to dictate information flow in the media industry.

  • Identifying the key ways in which big data and IoT are being used in the media sector, such as for content creation, audience engagement, or advertising.
  • Analyzing the benefits and challenges of using big data and IoT in the media industry, including factors such as cost, efficiency, and effectiveness.
  • Examining the ethical and social implications of using big data and IoT in the media sector, including issues such as privacy, security, and bias.
  • Determining the potential impact of big data and IoT on the media landscape and the role of traditional media in an increasingly digital world.

Exigency computer systems for meteorology and disaster prevention

The research aims to explore the role of exigency computer systems in detecting weather and other hazards for disaster prevention and response.

  • Identifying the key components and features of exigency computer systems for meteorology and disaster prevention, such as data sources, analytics tools, and communication channels.
  • Evaluating the effectiveness of exigency computer systems in providing accurate and timely information about weather and other hazards.
  • Assessing the impact of exigency computer systems on the ability of decision makers to prepare for and respond to disasters.
  • Examining the challenges and limitations of using exigency computer systems, such as the need for reliable data sources, the complexity of the systems, or the potential for human error.

Network security and cryptography

Overall, the goal of research is to improve our understanding of how to protect communication and information in the digital age, and to develop practical solutions for addressing the complex and evolving security challenges faced by individuals, organizations, and societies.

  • Developing new algorithms and protocols for securing communication over networks, such as for data confidentiality, data integrity, and authentication
  • Investigating the security of existing cryptographic primitives, such as encryption and hashing algorithms, and identifying vulnerabilities that could be exploited by attackers.
  • Evaluating the effectiveness of different network security technologies and protocols, such as firewalls, intrusion detection systems, and virtual private networks (VPNs), in protecting against different types of attacks.
  • Exploring the use of cryptography in emerging areas, such as cloud computing, the Internet of Things (IoT), and blockchain, and identifying the unique security challenges and opportunities presented by these domains.
  • Investigating the trade-offs between security and other factors, such as performance, usability, and cost, and developing strategies for balancing these conflicting priorities.
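
Two of the properties named in the first objective, data integrity and authentication, can be illustrated with a keyed hash. The sketch below uses HMAC-SHA256 from the Python standard library; the message contents and helper names are made up, and the example assumes sender and receiver already share a secret key. It does not provide confidentiality, which would require encryption.

```python
import hashlib
import hmac
import secrets


def sign_message(key: bytes, message: bytes) -> str:
    """Compute an HMAC-SHA256 tag that binds the message to the shared key."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify_message(key: bytes, message: bytes, tag: str) -> bool:
    """Recompute the tag and compare it in constant time."""
    expected = sign_message(key, message)
    return hmac.compare_digest(expected, tag)


key = secrets.token_bytes(32)  # shared secret; key distribution is out of scope
message = b"transfer 100 to account 42"
tag = sign_message(key, message)

print(verify_message(key, message, tag))
print(verify_message(key, b"transfer 900 to account 42", tag))
```

The comparison uses `hmac.compare_digest` rather than `==` so that verification time does not reveal how many leading characters of a forged tag were correct, a small instance of the security versus performance trade-offs mentioned in the last objective.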


The use of Big Data Analytics in healthcare

Kornelia Batko

1 Department of Business Informatics, University of Economics in Katowice, Katowice, Poland

Andrzej Ślęzak

2 Department of Biomedical Processes and Systems, Institute of Health and Nutrition Sciences, Częstochowa University of Technology, Częstochowa, Poland

Associated Data

The datasets for this study are available on request to the corresponding author.

The introduction of Big Data Analytics (BDA) in healthcare will allow the use of new technologies both in the treatment of patients and in health management. The paper aims at analyzing the possibilities of using Big Data Analytics in healthcare. The research is based on a critical analysis of the literature, as well as the presentation of selected results of direct research on the use of Big Data Analytics in medical facilities. The direct research was based on a research questionnaire and conducted on a sample of 217 medical facilities in Poland. Literature studies have shown that the use of Big Data Analytics can bring many benefits to medical facilities, while direct research has shown that medical facilities in Poland are moving towards data-based healthcare because they use structured and unstructured data and reach for analytics in the administrative, business and clinical areas. The research confirmed that medical facilities are working on both structured and unstructured data. The following kinds and sources of data can be distinguished: data from databases, transaction data, unstructured content of emails and documents, and data from devices and sensors. However, the use of data from social media is lower. In their activity, medical facilities reach for analytics not only in the administrative and business areas but also in the clinical area. This clearly shows that the decisions made in medical facilities are highly data-driven. The results of the study confirm what has been analyzed in the literature: medical facilities are moving towards data-based healthcare, together with its benefits.


The main contribution of this paper is to present an analytical overview of using structured and unstructured data (Big Data) analytics in medical facilities in Poland. Medical facilities use both structured and unstructured data in their practice. Structured data has a predetermined schema, whereas unstructured data is extensive, freeform, and comes in a variety of forms [ 27 ]. Unstructured data, referred to here as Big Data (BD), does not fit into the typical data processing format. Big Data is a massive collection of data sets that cannot be stored, processed, or analyzed using traditional tools. It often remains stored but not analyzed. Due to the lack of a well-defined schema, it is difficult to search and analyze such data and, therefore, it requires specific technologies and methods to transform it into value [ 20 , 68 ]. Integrating data stored in both structured and unstructured formats can add significant value to an organization [ 27 ]. Organizations must approach unstructured data in a different way. Therefore, the potential is seen in Big Data Analytics (BDA). Big Data Analytics are techniques and tools used to analyze and extract information from Big Data. The results of Big Data analysis can be used to predict the future. They also help identify trends in historical data. In healthcare, BDA makes it possible to analyze large datasets from thousands of patients, identify clusters and correlations between datasets, and develop predictive models using data mining techniques [ 60 ].

This paper is the first study to consolidate and characterize the use of Big Data from different perspectives. The first part consists of a brief literature review of studies on Big Data (BD) and Big Data Analytics (BDA), while the second part presents results of direct research aimed at diagnosing the use of big data analyses in medical facilities in Poland.

Healthcare is a complex system with varied stakeholders: patients, doctors, hospitals, pharmaceutical companies and healthcare decision-makers. This sector is also limited by strict rules and regulations. However, worldwide one may observe a departure from the traditional doctor-patient approach. The doctor becomes a partner and the patient is involved in the therapeutic process [ 14 ]. Healthcare is no longer focused solely on the treatment of patients. The priority for decision-makers should be to promote proper health attitudes and prevent diseases that can be avoided [ 81 ]. This became visible and important especially during the Covid-19 pandemic [ 44 ].

The next challenges that healthcare will have to face is the growing number of elderly people and a decline in fertility. Fertility rates in the country are found below the reproductive minimum necessary to keep the population stable [ 10 ]. The reflection of both effects, namely the increase in age and lower fertility rates, are demographic load indicators, which is constantly growing. Forecasts show that providing healthcare in the form it is provided today will become impossible in the next 20 years [ 70 ]. It is especially visible now during the Covid-19 pandemic when healthcare faced quite a challenge related to the analysis of huge data amounts and the need to identify trends and predict the spread of the coronavirus. The pandemic showed it even more that patients should have access to information about their health condition, the possibility of digital analysis of this data and access to reliable medical support online. Health monitoring and cooperation with doctors in order to prevent diseases can actually revolutionize the healthcare system. One of the most important aspects of the change necessary in healthcare is putting the patient in the center of the system.

Technology alone is not enough to achieve these goals. Therefore, changes should be made not only at the technological level but also in the management and design of complete healthcare processes and, what is more, they should affect the business models of service providers. The use of Big Data Analytics is becoming more and more common in enterprises [ 17 , 54 ]. However, medical enterprises still cannot keep up with the information needs of patients, clinicians, administrators and policy makers. The adoption of a Big Data approach would allow the implementation of personalized and precise medicine based on personalized information, delivered in real time and tailored to individual patients.

To achieve this goal, it is necessary to implement systems that will be able to learn quickly from the data generated by people within clinical care and everyday life. This will enable data-driven decision making and better personalized predictions about prognosis and responses to treatments; a deeper understanding of the complex factors and their interactions that influence health at the level of the patient, the health system and society; enhanced approaches to detecting safety problems with drugs and devices; and more effective methods of comparing prevention, diagnostic and treatment options [ 40 ].

In the literature, there is a lot of research showing what opportunities can be offered to companies by big data analysis and what data can be analyzed. However, there are few studies showing how data analysis in the area of healthcare is performed, what data is used by medical facilities, and what analyses they carry out and in which areas. This paper aims to fill this gap by presenting the results of research carried out in medical facilities in Poland. The goal is to analyze the possibilities of using Big Data Analytics in healthcare, especially in Polish conditions. In particular, the paper is aimed at determining what data is processed by medical facilities in Poland, what analyses they perform and in what areas, and how they assess their analytical maturity. In order to achieve this goal, a critical analysis of the literature was performed, and the direct research was based on a research questionnaire conducted on a sample of 217 medical facilities in Poland. It was hypothesized that medical facilities in Poland are working on both structured and unstructured data and moving towards data-based healthcare and its benefits. Examining the maturity of healthcare facilities in the use of Big Data and Big Data Analytics is crucial in determining the potential future benefits that the healthcare sector can gain from Big Data Analytics. There is also a pressing need to predict whether, in the coming years, healthcare will be able to cope with the threats and challenges it faces.

This paper is divided into eight parts. The first is the introduction which provides background and the general problem statement of this research. In the second part, this paper discusses considerations on use of Big Data and Big Data Analytics in Healthcare, and then, in the third part, it moves on to challenges and potential benefits of using Big Data Analytics in healthcare. The next part involves the explanation of the proposed method. The result of direct research and discussion are presented in the fifth part, while the following part of the paper is the conclusion. The seventh part of the paper presents practical implications. The final section of the paper provides limitations and directions for future research.

Considerations on the use of Big Data and Big Data Analytics in healthcare

In recent years one can observe a constantly increasing demand for solutions offering effective analytical tools. This trend is also noticeable in the analysis of large volumes of data (Big Data, BD). Organizations are looking for ways to use the power of Big Data to improve their decision making, competitive advantage or business performance [ 7 , 54 ]. Big Data is considered to offer potential solutions to public and private organizations, however, still not much is known about the outcome of the practical use of Big Data in different types of organizations [ 24 ].

As already mentioned, in recent years healthcare management worldwide has changed from a disease-centered model to a patient-centered model, and even to a value-based healthcare delivery model [ 68 ]. In order to meet the requirements of this model and provide effective patient-centered care, it is necessary to manage and analyze healthcare Big Data.

The issue often raised when it comes to the use of data in healthcare is the appropriate use of Big Data. Healthcare has always generated huge amounts of data and nowadays, the introduction of electronic medical records, as well as the huge amount of data sent by various types of sensors or generated by patients in social media causes data streams to constantly grow. Also, the medical industry generates significant amounts of data, including clinical records, medical images, genomic data and health behaviors. Proper use of the data will allow healthcare organizations to support clinical decision-making, disease surveillance, and public health management. The challenge posed by clinical data processing involves not only the quantity of data but also the difficulty in processing it.

In the literature one can find many different definitions of Big Data. This concept has evolved in recent years; however, it is still not clearly understood. Nevertheless, despite the range and differences in definitions, Big Data can be treated as a large amount of digital data, large data sets, a tool, a technology or a phenomenon (cultural or technological).

Big Data can be considered as massive and continually generated digital datasets that are produced via interactions with online technologies [ 53 ]. Big Data can be defined as datasets that are of such large sizes that they pose challenges in traditional storage and analysis techniques [ 28 ]. A similar opinion about Big Data was presented by Ohlhorst who sees Big Data as extremely large data sets, possible neither to manage nor to analyze with traditional data processing tools [ 57 ]. In his opinion, the bigger the data set, the more difficult it is to gain any value from it.

In turn, Knapp perceived Big Data as tools, processes and procedures that allow an organization to create, manipulate and manage very large data sets and storage facilities [ 38 ]. From this point of view, Big Data is identified as a tool to gather information from different databases and processes, allowing users to manage large amounts of data.

A similar perception of the term ‘Big Data’ is shown by Carter. According to him, Big Data technologies refer to a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data by enabling high velocity capture, discovery and/or analysis [ 13 ].

Jordan combines these two approaches by identifying Big Data as a complex system, as it needs data bases for data to be stored in, programs and tools to be managed, as well as expertise and personnel able to retrieve useful information and visualization to be understood [ 37 ].

Following Laney's definition of Big Data, it can be stated that it is a large amount of data generated at very high speed and containing a lot of content [ 43 ]. Such data comes from unstructured sources, such as streams of clicks on the web, social networks (Twitter, blogs, Facebook), video recordings from shops, recordings of calls in a call center, real-time information from various kinds of sensors, RFID, GPS devices, mobile phones and other devices that identify and monitor something [ 8 ]. Big Data is a powerful digital data silo: raw, collected from all sorts of sources, unstructured and difficult, or even impossible, to analyze using the conventional techniques applied so far to relational databases.

While describing Big Data, it cannot be overlooked that the term refers more to a phenomenon than to a specific technology. Therefore, instead of trying to define this phenomenon, more and more authors describe Big Data through its characteristics, expressed as a collection of V’s related to its nature [ 2 , 3 , 23 , 25 , 58 ]:

  • Volume (refers to the amount of data and is one of the biggest challenges in Big Data Analytics),
  • Velocity (speed with which new data is generated, the challenge is to be able to manage data effectively and in real time),
  • Variety (heterogeneity of data, many different types of healthcare data, the challenge is to derive insights by looking at all available heterogeneous data in a holistic manner),
  • Variability (inconsistency of data, the challenge is to correctly interpret data whose meaning can vary significantly depending on the context),
  • Veracity (how trustworthy the data is, quality of the data),
  • Visualization (ability to interpret data and resulting insights, challenging for Big Data due to its other features as described above),
  • Value (the goal of Big Data Analytics is to discover the hidden knowledge from huge amounts of data).

Big Data is defined as an information asset with high volume, velocity, and variety, which requires specific technology and methods for its transformation into value [ 21 , 77 ]. Big Data is also a collection of information characterized by high volume, high volatility or high diversity, requiring new forms of processing in order to support decision-making, the discovery of new phenomena and process optimization [ 5 , 7 ]. Big Data is too large for traditional data-processing systems and software tools to capture, store, manage and analyze; therefore, it requires new technologies [ 28 , 50 , 61 ] to manage (capture, aggregate, process) its volume, velocity and variety [ 9 ].

Undoubtedly, Big Data differs from the data sources used so far by organizations. Therefore, organizations must approach this type of unstructured data in a different way. First of all, organizations must start to see data as flows and not stocks—this entails the need to implement the so-called streaming analytics [ 48 ]. The mentioned features make it necessary to use new IT tools that allow the fullest use of new data [ 58 ]. The Big Data idea, inseparable from the huge increase in data available to various organizations or individuals, creates opportunities for access to valuable analyses, conclusions and enables making more accurate decisions [ 6 , 11 , 59 ].
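The idea of treating data as flows rather than stocks can be illustrated with a minimal sketch. It assumes a stream of numeric readings (the heart-rate values are hypothetical) and uses Welford's online algorithm, chosen here only as one common way to keep summary statistics up to date without storing the whole stream; it is not a method described in the cited sources.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass
class RunningStats:
    """Incrementally updated count, mean and variance (Welford's algorithm)."""

    count: int = 0
    mean: float = 0.0
    _m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        """Fold one new observation into the summary."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance; reported as 0.0 for fewer than two observations."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


def summarize_stream(readings: Iterable[float]) -> Iterator[Tuple[int, float]]:
    """Yield (count, running mean) after each reading, never storing the stream."""
    stats = RunningStats()
    for value in readings:
        stats.update(value)
        yield stats.count, stats.mean


if __name__ == "__main__":
    # Hypothetical heart-rate readings arriving one at a time.
    heart_rate_stream = iter([72.0, 75.0, 71.0, 90.0, 88.0])
    for n, running_mean in summarize_stream(heart_rate_stream):
        print(f"after {n} readings: mean = {running_mean:.1f}")
```

Each reading is processed once and then discarded, which is the property that distinguishes this style of analysis from batch queries over stored data.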

The Big Data concept is constantly evolving and currently it does not focus on huge amounts of data, but rather on the process of creating value from this data [ 52 ]. Big Data is collected from various sources that have different data properties and are processed by different organizational units, resulting in creation of a Big Data chain [ 36 ]. The aim of the organizations is to manage, process and analyze Big Data. In the healthcare sector, Big Data streams consist of various types of data, namely [ 8 , 51 ]:

  • clinical data, i.e. data obtained from electronic medical records, data from hospital information systems, image centers, laboratories, pharmacies and other organizations providing health services, patient generated health data, physician’s free-text notes, genomic data, physiological monitoring data [ 4 ],
  • biometric data provided from various types of devices that monitor weight, pressure, glucose level, etc.,
  • financial data, constituting a full record of economic operations reflecting the conducted activity,
  • data from scientific research activities, i.e. results of research, including drug research, design of medical devices and new methods of treatment,
  • data provided by patients, including description of preferences, level of satisfaction, information from systems for self-monitoring of their activity: exercises, sleep, meals consumed, etc.
  • data from social media.

These data are provided not only by patients but also by organizations and institutions, as well as by various types of monitoring devices, sensors or instruments [ 16 ]. Data that has been generated so far in the healthcare sector is stored in both paper and digital form. Thus, the essence and the specificity of the process of Big Data analyses means that organizations need to face new technological and organizational challenges [ 67 ]. The healthcare sector has always generated huge amounts of data and this is connected, among others, with the need to store medical records of patients. However, the problem with Big Data in healthcare is not limited to an overwhelming volume but also an unprecedented diversity in terms of types, data formats and speed with which it should be analyzed in order to provide the necessary information on an ongoing basis [ 3 ]. It is also difficult to apply traditional tools and methods for management of unstructured data [ 67 ]. Due to the diversity and quantity of data sources that are growing all the time, advanced analytical tools and technologies, as well as Big Data analysis methods which can meet and exceed the possibilities of managing healthcare data, are needed [ 3 , 68 ].

Therefore, the potential is seen in Big Data analyses, especially in the aspect of improving the quality of medical care, saving lives or reducing costs [ 30 ]. Extracting association rules, patterns and trends from this tangle of data will allow health service providers and other stakeholders in the healthcare sector to offer more accurate and more insightful diagnoses of patients, personalized treatment, monitoring of patients, preventive medicine, support of medical research and population health, as well as better quality of medical services and patient care while, at the same time, reducing costs (Fig.  1 ).

Fig. 1 Healthcare Big Data Analytics applications (Source: own elaboration)

The main challenge with Big Data is how to handle such a large amount of information and use it to make data-driven decisions in many areas [ 64 ]. In the context of healthcare data, another major challenge is to adjust big data storage, analysis, presentation of analysis results and inference based on them to a clinical setting. Data analytics systems implemented in healthcare are designed to describe, integrate and present complex data in an appropriate way so that it can be understood better (Fig.  2 ). This would improve the efficiency of acquiring, storing, analyzing and visualizing big data from healthcare [ 71 ].

Fig. 2 Process of Big Data Analytics

The result of data processing with the use of Big Data Analytics is appropriate data storytelling which may contribute to making decisions with both lower risk and data support. This, in turn, can benefit healthcare stakeholders. To take advantage of the potential massive amounts of data in healthcare and to ensure that the right intervention to the right patient is properly timed, personalized, and potentially beneficial to all components of the healthcare system such as the payer, patient, and management, analytics of large datasets must connect communities involved in data analytics and healthcare informatics [ 49 ]. Big Data Analytics can provide insight into clinical data and thus facilitate informed decision-making about the diagnosis and treatment of patients, prevention of diseases or others. Big Data Analytics can also improve the efficiency of healthcare organizations by realizing the data potential [ 3 , 62 ].

Big Data Analytics in medicine and healthcare refers to the integration and analysis of a large amount of complex heterogeneous data, such as various omics (genomics, epigenomics, transcriptomics, proteomics, metabolomics, interactomics, pharmacogenetics, diseasomics), biomedical data, telemedicine data (sensors, medical equipment data) and electronic health records data [ 46 , 65 ].

When analyzing the phenomenon of Big Data in the healthcare sector, it should be noted that it can be considered from the point of view of three areas: epidemiological, clinical and business.

From a clinical point of view, Big Data analysis aims to improve the health and condition of patients, enable long-term predictions about their health status and support the implementation of appropriate therapeutic procedures. Ultimately, the use of data analysis in medicine is to allow the adaptation of therapy to a specific patient, that is, personalized (precision) medicine.

From an epidemiological point of view, it is desirable to obtain an accurate prognosis of morbidity in order to implement preventive programs in advance.

In the business context, Big Data analysis may enable offering personalized packages of commercial services or determining the probability of individual disease and infection occurrence. It is worth noting that Big Data means not only the collection and processing of data but, most of all, the inference and visualization of data necessary to obtain specific business benefits.

In order to introduce new management methods and new solutions in terms of effectiveness and transparency, it becomes necessary to make data more accessible, digital, searchable, as well as analyzed and visualized.

Erickson and Rothberg state that the information and data do not reveal their full value until insights are drawn from them. Data becomes useful when it enhances decision making and decision making is enhanced only when analytical techniques are used and an element of human interaction is applied [ 22 ].

Thus, healthcare has experienced much progress in the usage and analysis of data. Large-scale digitalization and transparency in this sector are a key element of the policies of almost all governments. For centuries, the treatment of patients was based on the judgment of doctors who made treatment decisions. In recent years, however, Evidence-Based Medicine has become more and more important, as it relies on the systematic analysis of clinical data and on treatment decisions based on the best available information [ 42 ]. In the healthcare sector, Big Data Analytics is expected to improve the quality of life and reduce operational costs [ 72 , 82 ]. Big Data Analytics enables organizations to improve and increase their understanding of the information contained in data. It also helps identify data that provides valuable insights for current as well as future decisions [ 28 ].

Big Data Analytics refers to technologies that are grounded mostly in data mining: text mining, web mining, process mining, audio and video analytics, statistical analysis, network analytics, social media analytics and web analytics [ 16 , 25 , 31 ]. Different data mining techniques can be applied on heterogeneous healthcare data sets, such as: anomaly detection, clustering, classification, association rules as well as summarization and visualization of those Big Data sets [ 65 ]. Modern data analytics techniques explore and leverage unique data characteristics even from high-speed data streams and sensor data [ 15 , 16 , 31 , 55 ]. Big Data can be used, for example, for better diagnosis in the context of comprehensive patient data, disease prevention and telemedicine (in particular when using real-time alerts for immediate care), monitoring patients at home, preventing unnecessary hospital visits, integrating medical imaging for a wider diagnosis, creating predictive analytics, reducing fraud and improving data security, better strategic planning and increasing patients’ involvement in their own health.

Big Data Analytics in healthcare can be divided into [ 33 , 73 , 74 ]:

  • descriptive analytics: used to understand past and current healthcare decisions by converting data into useful information for understanding and analyzing healthcare decisions, outcomes and quality, as well as for making informed decisions [ 33 ]. It can be used to create reports (e.g. about patients’ hospitalizations, physicians’ performance, utilization management), visualizations, customized reports and drill-down tables, or to run queries on historical data.
  • predictive analytics: operates on past performance in an effort to predict the future by examining historical or summarized health data, detecting patterns of relationships in these data, and then extrapolating these relationships to make forecasts. It can be used, for example, to predict the response of different patient groups to different drugs (dosages) or reactions (clinical trials), anticipate risk, find relationships in health data and detect hidden patterns [ 62 ]. In this way, it is possible to predict the spread of an epidemic, anticipate service contracts and plan healthcare resources. Predictive analytics is used to support proper diagnosis and the choice of appropriate treatments for patients suffering from certain diseases [ 39 ].
  • prescriptive analytics: applies when health problems involve too many choices or alternatives. It uses health and medical knowledge in addition to data or information. Prescriptive analytics is used in many areas of healthcare, including drug prescriptions and treatment alternatives. Personalized medicine and evidence-based medicine are both supported by prescriptive analytics.
  • discovery analytics: utilizes knowledge about knowledge to discover new “inventions” like drugs (drug discovery), previously unknown diseases and medical conditions, alternative treatments, etc.
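The difference between the first two types of analytics listed above can be sketched in a few lines. The example below is only an illustration under stated assumptions: the monthly admission counts are invented, and an ordinary least-squares trend line is used as a stand-in for predictive modeling, which in practice may rely on far richer models.

```python
from statistics import mean

# Hypothetical monthly admission counts for one ward (illustrative values only).
admissions = [120, 126, 131, 139, 144, 150]


def describe(series):
    """Descriptive analytics: summarize what has already happened."""
    return {"mean": mean(series), "min": min(series), "max": max(series)}


def forecast_next(series):
    """Predictive analytics: fit a least-squares line and extrapolate one step."""
    n = len(series)
    xs = list(range(n))
    x_bar, y_bar = mean(xs), mean(series)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, series))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # The next period has index n (periods are numbered from 0).
    return intercept + slope * n


if __name__ == "__main__":
    print("historical summary:", describe(admissions))
    print("forecast for next month:", round(forecast_next(admissions), 1))
```

The descriptive function only reports on historical data, while the predictive one extrapolates a detected relationship forward, mirroring the distinction drawn in the list.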

Although the models and tools used in descriptive, predictive, prescriptive, and discovery analytics are different, many applications involve all four of them [ 62 ]. Big Data Analytics in healthcare can help enable personalized medicine by identifying optimal patient-specific treatments. This can improve living standards, reduce the waste of healthcare resources and lower the costs of healthcare [ 56 , 63 , 71 ]. The introduction of large data analysis gives new analytical possibilities in terms of scope, flexibility and visualization. Techniques such as data mining (a computational process of discovering patterns in large data sets) facilitate inductive reasoning and exploratory data analysis, enabling scientists to identify data patterns that are independent of specific hypotheses. As a result, predictive analysis and real-time analysis become possible, making it easier for medical staff to start early treatments and reduce potential morbidity and mortality. In addition, document analysis, statistical modeling, discovering patterns and topics in document collections and data in the EHR, as well as an inductive approach can help identify and discover relationships between health phenomena.

Advanced analytical techniques can be applied to the large amount of existing (but not yet analyzed) data on patient health and related medical data to achieve a better understanding of the information and results obtained, as well as to design optimal clinical pathways [ 62 ]. Big Data Analytics in healthcare integrates analysis from several scientific areas such as bioinformatics, medical imaging, sensor informatics, medical informatics and health informatics [ 65 ]. Big Data Analytics in healthcare makes it possible to analyze large datasets from thousands of patients, identify clusters and correlations between datasets, and develop predictive models using data mining techniques [ 65 ]. Discussing all the techniques used for Big Data Analytics goes beyond the scope of a single article [ 25 ].

The success of Big Data analysis and its accuracy depend heavily on the tools and techniques used and on their ability to provide reliable, up-to-date and meaningful information to various stakeholders [ 12 ]. It is believed that the implementation of big data analytics by healthcare organizations could bring many benefits in the upcoming years, including lowering healthcare costs, better diagnosis and prediction of diseases and their spread, improving patient care and developing protocols to prevent re-hospitalization, optimizing staff, optimizing equipment, forecasting the need for hospital beds, operating rooms and treatments, and improving the drug supply chain [ 71 ].

Challenges and potential benefits of using Big Data Analytics in healthcare

Modern analytics makes it possible not only to gain insight into historical data, but also to obtain the information necessary to generate insight into what may happen in the future, including the prediction of evidence-based actions. The emphasis on reform has prompted payers and suppliers to pursue data analysis to reduce risk, detect fraud, improve efficiency and save lives. Everyone, including payers, providers and even patients, is focusing on doing more with fewer resources. Thus, the areas in which enhanced data and analytics can yield the greatest results span various healthcare stakeholders (Table 1).

Table 1 The use of analytics by various healthcare stakeholders

Source: own elaboration on the basis of [ 19 , 20 ]

Healthcare organizations see the opportunity to grow through investments in Big Data Analytics. In recent years, by collecting medical data of patients, converting them into Big Data and applying appropriate algorithms, reliable information has been generated that helps patients, physicians and stakeholders in the health sector to identify values and opportunities [ 31 ]. It is worth noting that there are many changes and challenges in the structure of the healthcare sector. Digitization and effective use of Big Data in healthcare can bring benefits to every stakeholder in this sector. A single doctor would benefit the same as the entire healthcare system. Potential opportunities to achieve benefits and effects from Big Data in healthcare can be divided into four groups [ 8 ]:

  • assessment of diagnoses made by doctors and the manner of treatment of diseases indicated by them based on the decision support system working on Big Data collections,
  • detection of more effective, from a medical point of view, and more cost-effective ways to diagnose and treat patients,
  • analysis of large volumes of data to reach practical information useful for identifying needs, introducing new health services, preventing and overcoming crises,
  • prediction of the incidence of diseases,
  • detecting trends that lead to an improvement in health and lifestyle of the society,
  • analysis of the human genome for the introduction of personalized treatment.
  • doctors’ comparison of current medical cases to cases from the past for better diagnosis and treatment adjustment,
  • detection of diseases at earlier stages when they can be more easily and quickly cured,
  • detecting epidemiological risks and improving control of pathogenic spots and reaction rates,
  • identification of patients who are predicted to have the highest risk of specific, life-threatening diseases by collating data on the history of the most common diseases, in healing people with reports entering insurance companies,
  • health management of each patient individually (personalized medicine) and health management of the whole society,
  • capturing and analyzing large amounts of data from hospitals and homes in real time, life monitoring devices to monitor safety and predict adverse events,
  • analysis of patient profiles to identify people for whom prevention should be applied, lifestyle change or preventive care approach,
  • the ability to predict the occurrence of specific diseases or worsening of patients’ results,
  • predicting disease progression and its determinants, estimating the risk of complications,
  • detecting drug interactions and their side effects.
  • supporting work on new drugs and clinical trials thanks to the possibility of analyzing “all data” instead of selecting a test sample,
  • the ability to identify patients with specific, biological features that will take part in specialized clinical trials,
  • selecting a group of patients for which the tested drug is likely to have the desired effect and no side effects,
  • using modeling and predictive analysis to design better drugs and devices.
  • reduction of costs and counteracting abuse and counseling practices,
  • faster and more effective identification of incorrect or unauthorized financial operations in order to prevent abuse and eliminate errors,
  • increase in profitability by detecting patients generating high costs or identifying doctors whose work, procedures and treatment methods cost the most and offering them solutions that reduce the amount of money spent,
  • identification of unnecessary medical activities and procedures, e.g. duplicate tests.

According to research conducted by Wang, Kung and Byrd, Big Data Analytics benefits can be classified into five categories [ 73 ]:

  • IT infrastructure benefits (reducing system redundancy, avoiding unnecessary IT costs, transferring data quickly among healthcare IT systems, better use of healthcare systems, processing standardization among various healthcare IT systems, reducing IT maintenance costs regarding data storage),
  • operational benefits (improving the quality and accuracy of clinical decisions, processing a large number of health records in seconds, reducing the time of patient travel, immediate access to clinical data to analyze, shortening the time of diagnostic tests, reductions in surgery-related hospitalizations, exploring inconceivable new research avenues),
  • organizational benefits (detecting interoperability problems much more quickly than traditional manual methods, improving cross-functional communication and collaboration among administrative staff, researchers, clinicians and IT staff, enabling data sharing with other institutions and adding new services, content sources and research partners),
  • managerial benefits (gaining quick insights about changing healthcare trends in the market, providing members of the board and heads of department with sound decision-support information on the daily clinical setting, optimizing business growth-related decisions),
  • strategic benefits (providing a big picture view of treatment delivery for meeting future needs, creating highly competitive healthcare services).

The above specification does not constitute a full list of potential areas of use of Big Data Analytics in healthcare because the possibilities of using analysis are practically unlimited. In addition, advanced analytical tools make it possible to analyze data from all possible sources and conduct cross-analyses to provide better data insights [ 26 ]. For example, a cross-analysis can refer to a combination of patient characteristics, costs and care results that can help identify the treatment or treatments that are best in medical terms and most cost-effective, and this may allow a better adjustment of the service provider’s offer [ 62 ].
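A cross-analysis of the kind described above can be sketched as a simple grouping of records by patient characteristic and treatment. The records, field layout and group labels below are hypothetical and serve only to show the mechanics; they do not come from the study.

```python
from collections import defaultdict

# Hypothetical records: (age group, treatment, cost, recovered).
records = [
    ("65+", "A", 1200.0, True),
    ("65+", "A", 1100.0, False),
    ("65+", "B", 900.0, True),
    ("65+", "B", 950.0, True),
    ("<65", "A", 800.0, True),
    ("<65", "B", 850.0, False),
]


def cross_analyze(rows):
    """Combine patient characteristics, costs and outcomes per (group, treatment)."""
    groups = defaultdict(list)
    for age_group, treatment, cost, recovered in rows:
        groups[(age_group, treatment)].append((cost, recovered))

    summary = {}
    for key, items in groups.items():
        n = len(items)
        summary[key] = {
            "n": n,
            "mean_cost": sum(cost for cost, _ in items) / n,
            "recovery_rate": sum(1 for _, recovered in items if recovered) / n,
        }
    return summary


if __name__ == "__main__":
    for (age_group, treatment), stats in sorted(cross_analyze(records).items()):
        print(age_group, treatment, stats)
```

With real data, such a table would be the starting point for comparing treatments on both medical and cost criteria within each patient group; the tiny groups used here are far too small to support any conclusion.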

In turn, the analysis of patient profiles (e.g. segmentation and predictive modeling) allows identification of people who should be subject to prophylaxis or prevention, or who should change their lifestyle [ 8 ]. A shortened list of benefits of Big Data Analytics in healthcare is presented in paper [ 3 ] and consists of: better performance, day-to-day guides, detection of diseases in early stages, making predictive analytics, cost effectiveness, Evidence Based Medicine and effectiveness in patient treatment.

Summarizing, healthcare Big Data represents a huge potential for the transformation of healthcare: improvement of patients’ results, prediction of outbreaks of epidemics, valuable insights, avoidance of preventable diseases, reduction of the cost of healthcare delivery and improvement of the quality of life in general [ 1 ]. Big Data also generates many challenges, such as difficulties in data capture, data storage, data analysis and data visualization [ 15 ]. The main challenges are connected with the issues of: data structure (Big Data should be user-friendly, transparent and menu-driven, but it is fragmented, dispersed, rarely standardized and difficult to aggregate and analyze), security (data security, privacy and sensitivity of healthcare data; there are significant concerns related to confidentiality), data standardization (data is stored in formats that are not compatible with all applications and technologies), storage and transfers (especially costs associated with securing, storing and transferring unstructured data), managerial skills such as data governance, lack of appropriate analytical skills, and problems with real-time analytics (healthcare needs to be able to utilize Big Data in real time) [ 4 , 34 , 41 ].

The research is based on a critical analysis of the literature, as well as the presentation of selected results of direct research on the use of Big Data Analytics in medical facilities in Poland.

The research results presented here come from a larger questionnaire on Big Data Analytics. The direct research was based on an interview questionnaire which contained 100 questions with a 5-point Likert scale (1 = strongly disagree, 2 = I rather disagree, 3 = I do not agree, nor disagree, 4 = I rather agree, 5 = I definitely agree) and 4 metrics questions. The study was conducted in December 2018 on a sample of 217 medical facilities (110 private, 107 public). The research was conducted by a specialized market research agency: Center for Research and Expertise of the University of Economics in Katowice.

In the direct research, the selected entities included entities financed from public sources, i.e. the National Health Fund (23.5%), and entities operating commercially (11.5%). More than half of the surveyed entities (64.9%) are hybrid financed, from both public and commercial sources. The diversity of the research sample also applies to the size of the entities, defined by the number of employees. In terms of the proportions of the surveyed entities, medium-sized (10–50 employees, 34% of the sample) and large (51–250 employees, 27%) entities dominate the sector structure. The research covered the whole of Poland, and the entities in the research sample come from all of the voivodships. The largest groups were entities from the Łódzkie (32%), Śląskie (18%) and Mazowieckie (18%) voivodships, as these voivodships have the largest number of medical institutions. Other regions of the country were represented by single units. The research sample was selected by stratified (layered) random sampling: within the database of medical facilities, groups of private and public medical facilities were identified, and the facilities to which the questionnaire was sent were drawn from each of these groups. The analyses were performed using the GNU PSPP 0.10.2 software.
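The stratified (layered) draw described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the facility identifiers and the stratum sizes in the database (`N_PRIVATE_IN_DB`, `N_PUBLIC_IN_DB`) are hypothetical, and only the drawn sample sizes (110 private, 107 public) come from the study.

```python
import random

# Hypothetical sampling frame: the real database of medical facilities
# is not described in detail, so IDs and stratum sizes are invented.
N_PRIVATE_IN_DB = 1500
N_PUBLIC_IN_DB = 1200
frame = {
    "private": [f"PRV-{i:04d}" for i in range(N_PRIVATE_IN_DB)],
    "public": [f"PUB-{i:04d}" for i in range(N_PUBLIC_IN_DB)],
}

# Sample sizes per stratum as reported in the study.
quota = {"private": 110, "public": 107}


def stratified_sample(frame, quota, seed=0):
    """Draw a simple random sample without replacement from each stratum."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return {stratum: rng.sample(units, quota[stratum])
            for stratum, units in frame.items()}


sample = stratified_sample(frame, quota)
total = sum(len(units) for units in sample.values())
print(total)  # 217
```

Drawing separately within each group guarantees that both ownership forms are represented in fixed numbers, which a single draw from the pooled database would not.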

The aim of the study was to determine whether medical facilities in Poland use Big Data Analytics and if so, in which areas. The characteristics of the research sample are presented in Table 2.

Characteristics of the research sample

The research is non-exhaustive because the regional distribution of the sample is incomplete and uneven, with three voivodeships (Łódzkie, Mazowieckie and Śląskie) overrepresented. The size of the research sample (217 entities) allows the authors of the paper to formulate specific conclusions on the use of Big Data in the process of its management.

For the purpose of this paper, the following research hypotheses were formulated: (1) medical facilities in Poland are working on both structured and unstructured data; (2) medical facilities in Poland are moving towards data-based healthcare and its benefits.

The paper poses the following research questions and statements that coincide with the selected questions from the research questionnaire:

  • What types of data are used by the particular organization, whether structured or unstructured, and to what extent?
  • From what sources do medical facilities obtain data?
  • In which areas (clinical or business) do organizations use data and analytical systems?
  • Is data analytics performed based on historical data or are predictive analyses also performed?
  • Do administrative and medical staff receive complete, accurate and reliable data in a timely manner?
  • Are real-time analyses performed to support the particular organization’s activities?

Results and discussion

On the basis of the literature analysis and research study, a set of questions and statements related to the researched area was formulated. The results from the surveys show that medical facilities use a variety of data sources in their operations. These sources include both structured and unstructured data (Table 3).

Type of data sources used in medical facility (%)

1—strongly disagree, 2—I disagree, 3—I agree or disagree, 4—I rather agree, 5—I strongly agree

According to the data provided by the respondents, for the first statement in the questionnaire, almost half of the medical institutions (47.58%) rather agreed that they collect and use structured data (e.g. databases and data warehouses, reports to external entities), and 10.57% entirely agreed with this statement. As many as 23.35% of representatives of medical institutions answered “I agree or disagree”. Of the remaining facilities, 7.93% disagreed with the statement and 6.17% strongly disagreed. The median of the responses (median: 4) also indicates that medical facilities in Poland collect and use structured data (Table 4).

Collection and use of data determined by the size of medical facility (number of employees)

In turn, 28.19% of the medical institutions rather agreed that they collect and use unstructured data, and 9.25% entirely agreed with this statement. A further 27.31% of representatives of medical institutions answered “I agree or disagree”. Of the remaining facilities, 17.18% disagreed that they collect and use unstructured data and 13.66% strongly disagreed. For unstructured data the median is 3, which means that medical facilities in Poland collect and use this type of data to a lesser extent.
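The two medians can be reproduced from the response shares quoted in the text. The sketch below is a minimal illustration, not the authors' procedure (they used GNU PSPP on the raw answers). The reported shares add up to roughly 95.6% rather than 100%; the source does not explain the remainder, so the helper `likert_median` (a hypothetical name) simply normalises by the total of the shares it is given.

```python
def likert_median(shares):
    """Return the median category of a 5-point scale.

    `shares` maps each answer (1..5) to its share of responses in percent.
    The shares need not sum to 100; they are normalised by their total.
    """
    half = sum(shares.values()) / 2
    cumulative = 0.0
    for answer in sorted(shares):
        cumulative += shares[answer]
        if cumulative >= half:
            return answer
    raise ValueError("empty distribution")


# Shares quoted in the text (1 = strongly disagree ... 5 = strongly agree).
structured = {1: 6.17, 2: 7.93, 3: 23.35, 4: 47.58, 5: 10.57}
unstructured = {1: 13.66, 2: 17.18, 3: 27.31, 4: 28.19, 5: 9.25}

print(likert_median(structured))    # 4
print(likert_median(unstructured))  # 3
```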

In the further part of the analysis, it was checked whether the size of the medical facility and form of ownership have an impact on whether it analyzes unstructured data (Tables 4 and 5). In order to find this out, correlation coefficients were calculated.

Collection and use of data determined by the form of ownership of medical facility

Based on the calculations, it can be concluded that there is a weak but statistically significant monotonic correlation between the size of the medical facility and its collection and use of structured data (p < 0.001; τ = 0.16). This means that the use of structured data increases slightly in larger medical facilities. The size of the medical facility matters more for the use of unstructured data (p < 0.001; τ = 0.23) (Table 4).
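For readers unfamiliar with the τ coefficient, the sketch below shows one common variant, Kendall's tau-b, which corrects for the many ties that ordinal survey data produce. It is an assumption that this is the variant behind the reported values; the source only names Kendall's Tau. The size categories and answers are hypothetical and are not the study's data, and no p-value is computed.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b: rank correlation that corrects for ties."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: counted in neither term
        if dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + ties_x) *
                 (concordant + discordant + ties_y))
    return (concordant - discordant) / denom


# Hypothetical example: size category (1 = smallest) and Likert answer
# (1..5) for eight facilities.
size = [1, 1, 2, 2, 3, 3, 4, 4]
answer = [2, 3, 3, 4, 3, 4, 4, 5]
print(round(kendall_tau_b(size, answer), 2))
```

A value near 0 means no monotonic association and a value near 1 a strong positive one, so the reported τ of 0.16 and 0.23 indicate weak positive relationships. The function divides by zero if one variable is constant, which a production implementation would need to guard against.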

To determine whether the form of medical facility ownership affects data collection, the Mann–Whitney U test was used. The calculations show that the form of ownership does not affect what data the organization collects and uses (Table 5).
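The statistic behind the Mann–Whitney U test can be computed by ranking the pooled answers of both groups. The following is a minimal sketch on hypothetical Likert answers for public and private facilities; it returns only the U statistics and does not derive the p-value that a statistics package such as PSPP reports.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b) for two independent samples."""
    ranks = average_ranks(list(group_a) + list(group_b))
    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    return u_a, n_a * n_b - u_a


# Hypothetical answers (1..5) from a few public and private facilities.
public = [4, 3, 5, 4, 2, 4]
private = [3, 4, 2, 3, 4, 3]
print(mann_whitney_u(public, private))
```

When the two U values are close to each other (both near half the product of the group sizes), the groups answer similarly, which is the pattern the study reports for ownership form and data collection.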

Detailed information on the sources from which medical facilities collect and use data is presented in Table 6.

Data sources used in medical facility

1—we do not use at all, 5—we use extensively

The questionnaire results show that medical facilities especially use information published in databases, reports to external units and transaction data, but they also use unstructured data from e-mails, medical devices, sensors, phone calls, audio and video data (Table 6). Data from social media, RFID and geolocation data are used to a small extent. Similar findings are reported in the literature.

The respondents’ answers show that more than half of the medical facilities have implemented an integrated hospital system (HIS): 43.61% use an integrated hospital system and 16.30% use it extensively (Table 7), while 19.38% of the examined medical facilities do not use one at all. Moreover, most of the examined medical facilities keep medical documentation in electronic form (34.80% use it, 32.16% use it extensively), which creates an opportunity to use data analytics. Only 4.85% of medical facilities do not use it at all.

The use of HIS and electronic documentation in medical facilities (%)

The next questions to be investigated were whether medical facilities in Poland use data analytics and, if so, in what form and in what areas (Table 8). The respondents’ answers about the potential of data analytics in medical facilities show that a similar share of medical facilities use data analytics in administration and business (31.72% agreed with statement no. 5 and 12.33% strongly agreed) as in the clinical area (33.04% agreed with statement no. 6 and 12.33% strongly agreed). On decision-making, 35.24% agree with the statement “the organization uses data and analytical systems to support business decisions” and 8.37% of respondents strongly agree. In turn, 40.09% agree with the statement that “the organization uses data and analytical systems to support clinical decisions (in the field of diagnostics and therapy)” and 15.42% of respondents strongly agree. The examined medical facilities use both analytics based on historical data (33.48% agree with statement no. 7 and 12.78% strongly agree) and predictive analytics (33.04% agree with statement no. 8 and 15.86% strongly agree). Detailed results are presented in Table 8.

Conditions of using Big Data Analytics in medical facilities (%)

Medical facilities focus on development in the field of data processing: they confirm that they conduct analytical planning processes systematically and analyze new opportunities for the strategic use of analytics in business and clinical activities (38.33% rather agree and 10.57% strongly agree with this statement). The picture is less optimistic for real-time data analysis: only 28.19% rather agree and 14.10% strongly agree with the statement that real-time analyses are performed to support the organization’s activities.

When considering whether a facility’s performance in the clinical area depends on the form of ownership, both the averages and the Mann–Whitney U test indicate that it does. A higher degree of use of analyses in the clinical area can be observed in public institutions.

Whether a medical facility performs descriptive or predictive analysis does not depend on the form of ownership (p > 0.05). When analyzing the mean and median, it can be concluded that they are higher in public facilities than in private ones. What is more, the Mann–Whitney U test shows that these variables are dependent on each other (p < 0.05) (Table 9).

Conditions of using Big Data Analytics in medical facilities determined by the form of ownership of medical facility

When considering whether a facility’s performance in the clinical area depends on its size, Kendall’s Tau (τ) indicates that it does (p < 0.001; τ = 0.22): the correlation is weak but statistically significant. This means that the use of data and analytical systems to support clinical decisions (in the field of diagnostics and therapy) increases with the size of the medical facility. A similar but even weaker relationship can be found in the use of descriptive and predictive analyses (Table 10).

Conditions of using Big Data Analytics in medical facilities determined by the size of medical facility (number of employees)

Considering the results of research in the area of analytical maturity of medical facilities, 8.81% of medical facilities stated that they are at the first level of maturity, i.e. the organization has not developed analytical skills and does not perform analyses. A further 13.66% of medical facilities confirmed that they have poor analytical skills, while 38.33% placed themselves at level 3, meaning that “there is a lot to do in analytics”. On the other hand, 28.19% believe that their analytical capabilities are well developed, and 6.61% stated that analytics is at the highest level and their analytical capabilities are very well developed. Detailed data is presented in Table 11. The average is 3.11 and the median is 3.
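The reported average can be checked directly from the shares quoted above. This is a rough consistency check rather than the authors' calculation: it assumes the five percentages correspond to maturity levels 1 to 5 in the order given, and it normalises by their total because they sum to about 95.6% rather than 100%.

```python
# Share of facilities (in percent) at each maturity level, as quoted above.
maturity_shares = {1: 8.81, 2: 13.66, 3: 38.33, 4: 28.19, 5: 6.61}


def weighted_mean(shares):
    """Mean level, weighting each level by its share of responses."""
    total = sum(shares.values())
    return sum(level * share for level, share in shares.items()) / total


print(round(weighted_mean(maturity_shares), 2))  # 3.11
```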

Analytical maturity of examined medical facilities (%)

The results of the research have enabled the formulation of the following conclusions. Medical facilities in Poland are working on both structured and unstructured data. This data comes from databases, transactions, the unstructured content of emails and documents, devices and sensors. Data from social media, however, is used less. The facilities use analytics in the administrative and business area as well as in the clinical area. In addition, the decisions they make are largely data-driven.

In summary, analysis of the literature shows that the benefits that medical facilities can get from using Big Data Analytics in their activities relate primarily to patients, physicians and medical facilities. It can be confirmed that: patients will be better informed, will receive treatments that will work for them, will have prescribed medications that work for them and not be given unnecessary medications [ 78 ]. Physician roles will likely change to more of a consultant than decision maker. They will advise, warn, and help individual patients and have more time to form positive and lasting relationships with their patients in order to help people. Medical facilities will see changes as well, for example in fewer unnecessary hospitalizations, resulting initially in less revenue, but after the market adjusts, also the accomplishment [ 78 ]. The use of Big Data Analytics can literally revolutionize the way healthcare is practiced for better health and disease reduction.

The analysis of the latest data reveals that data analytics increase the accuracy of diagnoses. Physicians can use predictive algorithms to help them make more accurate diagnoses [ 45 ]. Moreover, it could be helpful in preventive medicine and public health because with early intervention, many diseases can be prevented or ameliorated [ 29 ]. Predictive analytics also makes it possible to identify risk factors for a given patient; with this knowledge, patients will be able to change their lives, which, in turn, may dramatically change population disease patterns and result in savings in medical costs. Moreover, personalized medicine is the best solution for an individual patient seeking treatment. It can help doctors decide the exact treatments for those individuals. Better diagnoses and more targeted treatments will naturally lead to increases in good outcomes and fewer resources used, including doctors’ time.

The quantitative analysis of the research carried out and presented in this article made it possible to determine whether medical facilities in Poland use Big Data Analytics and if so, in which areas. Thanks to the results obtained it was possible to formulate the following conclusions. Medical facilities are working on both structured and unstructured data, which comes from databases, transactions, the unstructured content of emails and documents, devices and sensors. They use analytics in the administrative and business area as well as in the clinical area. The results clearly showed that the decisions made are largely data-driven. The results of the study confirm what has been analyzed in the literature. Medical facilities are moving towards data-based healthcare and its benefits.

In conclusion, Big Data Analytics has the potential for positive impact and global implications in healthcare. Future research on the use of Big Data in medical facilities will concern the definition of strategies adopted by medical facilities to promote and implement such solutions, as well as the benefits they gain from the use of Big Data analysis and how the perspectives in this area are seen.

Practical implications

This work sought to narrow the gap that exists in analyzing the possibility of using Big Data Analytics in healthcare. Showing how medical facilities in Poland are doing in this respect is an element that is part of global research carried out in this area, including [ 29 , 32 , 60 ].

Limitations and future directions

The research described in this article does not fully exhaust the questions related to the use of Big Data Analytics in Polish healthcare facilities. Only some of the dimensions characterizing the use of data by medical facilities in Poland have been examined. In order to get the full picture, it would be necessary to examine the results of using structured and unstructured data analytics in healthcare. Future research may examine the benefits that medical institutions achieve as a result of the analysis of structured and unstructured data in the clinical and management areas and what limitations they encounter in these areas. For this purpose, it is planned to conduct in-depth interviews with chosen medical facilities in Poland. These facilities could give additional data for empirical analyses based more on their suggestions. Further research should also include medical institutions from beyond the borders of Poland, enabling international comparative analyses.

Future research in the healthcare field has virtually endless possibilities. These regard the use of Big Data Analytics to diagnose specific conditions [ 47 , 66 , 69 , 76 ], propose an approach that can be used in other healthcare applications and create mechanisms to identify “patients like me” [ 75 , 80 ]. Big Data Analytics could also be used for studies related to the spread of pandemics, the efficacy of covid treatment [ 18 , 79 ], or psychology and psychiatry studies, e.g. emotion recognition [ 35 ].


We would like to thank those who have touched our science paths.

Authors’ contributions

KB proposed the concept of research and its design. The manuscript was prepared by KB with the consultation of AŚ. AŚ reviewed the manuscript for getting its fine shape. KB prepared the manuscript in the contexts such as definition of intellectual content, literature search, data acquisition, data analysis, and so on. AŚ obtained research funding. Both authors read and approved the final manuscript.

This research was fully funded as statutory activity—subsidy of Ministry of Science and Higher Education granted for Technical University of Czestochowa on maintaining research potential in 2018. Research Number: BS/PB–622/3020/2014/P. Publication fee for the paper was financed by the University of Economics in Katowice.

Availability of data and materials


Not applicable.

The author declares no conflict of interest.

Contributor Information

Kornelia Batko

Andrzej Ślęzak

Research projects

Phasor Measurement Unit (PMU) Data Analytics based Smart Grid Diagnostics

Mentors : Hanif Livani and Lei Yang

Project Description : With the proliferation of PMUs in smart grids, time-synchronized high-resolution measurements can be obtained and used for numerous monitoring applications such as state estimation and event diagnostics. Disruptive events frequently occur in smart grids that interrupt the normal operation of the system. Therefore, data-driven event diagnostics are of utmost importance to extract useful information such as the cause or location of events. Moreover, having a repository of data is useful for other post-event analysis, such as preventive maintenance. Accurate disruptive event analysis is beneficial in terms of time, maintenance crew utilization, and further outages prevention. In this research project, a PMU data-driven framework will be developed to distinguish disruptive events, i.e., malfunctioned capacitor bank switching and malfunctioned regulator on-load tap changer (OLTC) switching from normal operating events, namely, the normal abrupt load change and the reconfiguration in smart grids. The event diagnostics will be formulated using a neural network based algorithm, i.e., autoencoders along with softmax classifiers. The performance of the proposed framework will be verified using our state-of-the-art Cyber-Physical-Hardware-in-the-Loop (CP-HIL) testbed. This project will broaden students' perspective in smart grids by utilizing advanced data analytics and hands-on education.

Student Role : The undergraduate students will get an introduction to Matlab so that they can learn some basic data analytics, experiment design, and basic machine learning library usage. With the guidance from the mentors and PhD students, they will learn how to extract PMU data stream from actual devices in the CP-HIL testbed, write computer codes with graphical user interface (GUI) to access data from the SQL server, and execute data-driven event diagnostics tools.
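To make the softmax classification step mentioned in the project description concrete, here is a minimal sketch of a softmax over the four event types named above. Python is used purely for illustration (the project itself mentions Matlab), and the scores are hypothetical; in the described framework they would come from a layer applied to features learned by an autoencoder, which is not reproduced here.

```python
from math import exp

EVENT_CLASSES = [
    "capacitor bank switching malfunction",
    "OLTC switching malfunction",
    "abrupt load change",
    "reconfiguration",
]


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for one PMU event (not taken from the project).
scores = [2.1, 0.3, -1.0, 0.5]
probs = softmax(scores)
predicted = EVENT_CLASSES[probs.index(max(probs))]
print(predicted)  # capacitor bank switching malfunction
```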

Big Data Analytics based Wildfire Smoke Transport and Air Quality Prediction

Mentors : Feng Yan, Lei Yang, and Heather Holmes

Project Description : Smoke can travel very fast, causing sudden changes in air quality and significant health and economic problems. State-of-the-art smoke forecasting models can only do infrequent updates (e.g., every 6 hours) and predict with very limited spatial resolution (e.g., 12 km × 12 km) due to the low spatiotemporal data resolution. To enable real-time prediction of wildfire smoke transport and air quality, data with finer temporal and spatial resolution is needed. The ground-level camera systems (e.g., AlertTahoe Fire Camera Network) generate large amounts of image data at various locations with much finer spatial and temporal resolution, e.g., each camera can generate 30 images per second for a small region (e.g., 1 km × 1 km). By using these images, we can first detect the smoke in each region, then estimate the air quality based on the strong correlation between air pollution concentrations and smoke plume density, and then predict air quality from smoke transport.

Student Role : With the guidance from the mentors and PhD students, the undergraduate students will learn how to detect smoke from image data using Deep Neural Networks (DNN). They will also learn how to predict air quality from smoke transport using Gaussian Markov Random Field (GMRF) based on the strong correlation between air quality and smoke transport.

Big Data Analytics based Robotic Perception for Autonomous Driving

Mentors : Kostas Alexis and Lei Yang

Project Description : Autonomous driving requires an accurate and comprehensive understanding of the vehicle's surroundings. This refers to a multitude of challenges including those of a) detection and classification of objects of interest and b) estimating the relative pose of such objects. State-of-the-art methods face certain limitations. Object detection tasks for traffic sign recognition are well-handled but the respective methods lack in their ability to deal with visually-degraded environments or significant occlusions. Object localization works well when the tracked object is consistently perceived but can fail otherwise due to lack of reliable prediction methods. To achieve reliable autonomous driving, "any-time" and "any-place" robust and safe navigation autonomy must be facilitated. In this project, we have identified three important challenges of progressive complexity and we will offer respective research experiences for students.

Student Role : The undergraduate students will conduct experiments over a pre-trained neural network and will perform detection of traffic signs in both well-lit and low-light conditions with the guidance from the mentors and PhD students. After they get familiar with machine learning and TensorFlow, they will dive deeper into multi-view geometry and will aim to optimize the recognition behavior of a pre-trained network exploiting a sliding window trajectory of the vehicle.

Adaptive and Scalable Big Data Management

Mentors : Lei Yang and Dongfang Zhao

Project Description : In many big data applications, massive amounts of heterogeneous data are collected by various sensing devices, in order to enhance the cognition of the system dynamics and optimize decision making. However, these measurements are subject to communication delay and data packet loss, which can lead to significant errors in system state estimation and prediction. A key observation is that these measurements are spatio-temporal correlated, which can be represented in a low-rank subspace. By constructing and tracking this low-rank subspace, it is possible to reconstruct the delayed or missing data. However, such a low-rank subspace is difficult to characterize, as the measurements are heterogeneous with complex spatio-temporal correlations and such correlations may change over time due to the change of network topology.

Student Role : The undergraduate students will get an introduction to basic data recovery techniques and apply these techniques to recover the missing data. With the guidance from the mentors and PhD students, they will study state-of-the-art tensor-based data recovery techniques and develop adaptive and scalable data recovery methods for various big data applications.
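The idea of recovering missing measurements from a low-rank structure can be illustrated with the simplest possible case: a rank-1 matrix completed by alternating least squares. This is a deliberately simplified stand-in, not the tensor-based or adaptive methods the project refers to, and the sensor readings are hypothetical.

```python
def complete_rank1(matrix, iterations=200):
    """Fill None entries of `matrix` assuming it is (close to) rank 1.

    Alternating least squares: fix v and solve for u, then fix u and
    solve for v, using only the observed entries.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(iterations):
        for i in range(n_rows):
            obs = [j for j in range(n_cols) if matrix[i][j] is not None]
            u[i] = (sum(matrix[i][j] * v[j] for j in obs) /
                    sum(v[j] ** 2 for j in obs))
        for j in range(n_cols):
            obs = [i for i in range(n_rows) if matrix[i][j] is not None]
            v[j] = (sum(matrix[i][j] * u[i] for i in obs) /
                    sum(u[i] ** 2 for i in obs))
    return [[matrix[i][j] if matrix[i][j] is not None else u[i] * v[j]
             for j in range(n_cols)] for i in range(n_rows)]


# Hypothetical readings from three sensors over four time steps; the rows
# are scaled copies of each other, so the full matrix has rank 1.
# None marks a delayed or lost measurement.
readings = [
    [1.0, 2.0, None, 4.0],
    [2.0, None, 6.0, 8.0],
    [3.0, 6.0, 9.0, 12.0],
]
recovered = complete_rank1(readings)
print(round(recovered[0][2], 3), round(recovered[1][1], 3))
```

Because the rows are exact multiples of one another, the two filled-in values should come out very close to the true values 3.0 and 4.0. Real measurements are noisy and of higher rank, which is why the project points to more capable techniques.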

Big Data System Performance and Efficiency Optimization

Mentors : Feng Yan and Dongfang Zhao

Project Description : Big data analytics tasks in many applications (e.g., recognition, prediction, and control for smart cities) are fulfilled in large-scale distributed systems (e.g., Hadoop, Spark, Storm, TensorFlow, and Caffe). The performance of these big data systems depends on the configuration optimization for different applications, workloads, and systems. However, today's computing frameworks provide tens to hundreds of configuration knobs for users to tune their systems, which makes tuning a challenging task for many users with no expertise in either the application domain or the system. Even for experts in both application domain and system, it is time-consuming to optimally configure the system. Therefore, there is an urgent need to develop formal methodologies for automatic configuration tuning to optimize big data system performance and efficiency.

Student Role : The undergraduate students will get an introduction to big data analytics systems, such as Hadoop, Spark, and Storm. With the guidance from the mentors and PhD students, they will get familiar with the configuration knobs and learn some basic skills in configuring the big data analytics systems. Then they will learn how to collect performance measurements from the big data analytics systems and analyze the collected performance measurements. Finally, they will learn to use analytical models, simulation, and machine learning techniques to automatically tune the control knobs to get optimized system performance and efficiency.
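As a minimal illustration of automatic knob tuning, the sketch below runs a plain random search, which is a common baseline rather than the analytical or machine learning methods the project proposes. Everything in it is an assumption for illustration: the knob names and ranges are hypothetical and are not real Hadoop, Spark, or Storm settings, and `measured_runtime` is a synthetic stand-in for actually running a benchmark job.

```python
import random

# Hypothetical knobs and ranges; real systems expose many more.
SEARCH_SPACE = {
    "executor_memory_gb": (1, 16),
    "num_executors": (1, 32),
    "shuffle_partitions": (8, 512),
}
DEFAULT_CONFIG = {"executor_memory_gb": 1, "num_executors": 2,
                  "shuffle_partitions": 200}


def measured_runtime(config):
    """Stand-in for running a benchmark job; returns seconds (synthetic)."""
    work = 3600.0 / config["num_executors"]
    memory_penalty = 400.0 / config["executor_memory_gb"]
    scheduling_overhead = 0.5 * config["shuffle_partitions"]
    coordination = 5.0 * config["num_executors"]
    return work + memory_penalty + scheduling_overhead + coordination


def random_search(trials=200, seed=0):
    """Try random configurations and keep the fastest one seen."""
    rng = random.Random(seed)
    best_config = dict(DEFAULT_CONFIG)
    best_runtime = measured_runtime(best_config)
    for _ in range(trials):
        candidate = {knob: rng.randint(low, high)
                     for knob, (low, high) in SEARCH_SPACE.items()}
        runtime = measured_runtime(candidate)
        if runtime < best_runtime:
            best_config, best_runtime = candidate, runtime
    return best_config, best_runtime


best_config, best_runtime = random_search()
print(best_config, round(best_runtime, 1))
```

Starting from the default configuration guarantees the search never returns something slower than the baseline. Model-based tuners aim to reach a good configuration with far fewer benchmark runs than random sampling needs.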

UC San Diego

Data Science: Guide for Independent Projects

  • Books & Journals
  • Working with Python
  • Working with R
  • Version control & GitHub


Getting started, guided projects, starting projects from scratch, project examples for beginners, more advanced projects, portfolio examples.

  • Finding Data
  • Data Visualization
  • Other Library Resources

Many data science students eventually want to undertake an independent or personal side project. This guide is intended to provide resources for these types of projects. It is not necessarily intended to provide guidance for course projects, internship deliverables, or other formalized projects. Rather, it is meant to help you, as a data science student, get a little extra experience working with data.

The benefits of these types of projects are three-fold: (1) apply what you've learned in your coursework to a new topic, testing your knowledge, (2) learn new skills, including new Python or R packages and other platforms/tools, and (3) produce an output you can put on your resume. If you get really into your project, you can also consider turning it into a guest blog post on a data science site, or otherwise sharing your work with a broader audience.

Still not sure? Use the choice wheels below to help brainstorm a project topic.

  • Pick a topic area For example, maybe you want to do a project related to sports, or social media, or biology. If you're not sure, or don't have a preference, use this spinning wheel to help you pick a topic area.
  • Pick a data science task Data science projects often focus on a specific task, for instance classification, regression, clustering, or others.
  • Pick a data type One way to pick a project is to think about what kind of data you want practice working with. For instance, do you want to practice working with numeric data, or text data? Or maybe you want to practice your data science skills with image data?
  • Pick an additional tool or approach This wheel includes a selection of tools, methods, or approaches often used in data science projects, such as: API queries, recommender systems, sentiment analysis, and more.
  • Pick a random keyword Spin to get a random keyword to help further brainstorm your topic. For instance, if you want to practice classification (task) using tabular data (type) for transportation (area), and incorporate sentiment analysis (additional tool) could you tie this to your chosen keyword? Consider: is there a way to incorporate, say, political preferences based on vote data, with voter sentiment towards expanded highway infrastructure? Or maybe classifying preferences for electric vehicles by median income by Census block, rates of homeowners insurance, or other of these keywords?

Maybe you're not ready to start a project entirely from scratch. That's fine! These links have examples of more guided projects: they provide a dataset, a general question, and either tutorials or hints about what packages and analyses you'll need to use. Think of these as "training wheels projects": they are a way to build your confidence and help you get comfortable with projects outside of class.

  • 24 data science projects to boost knowledge and skills These projects are split into beginner, intermediate, and advanced levels, with links to tutorials and where to download the data in question.
  • 12 Data Science Projects for Beginners and Experts This site presents data science projects in R and Python with source code and data. Areas of project include text analysis, recommender systems, deep learning, supervised and unsupervised machine learning.
  • 8 fun machine learning projects for beginners Machine learning is a popular topic with data science students, and these projects provide a semi-guided way to practice your skills.

Make use of the other resources in this guide!  Check out the " Working with Python " and " Working with R " tabs for information about data analysis and visualization packages. Read through the " Version Control & GitHub " tab for additional information about working with Git and how to properly structure a GitHub repository. The " Finding Data & Statistics " tab redirects to a full guide to help with finding data sources and the " Data Visualization " tab will send you to additional resources about data visualization, including best practices.

  • Project inspiration It can be hard to know how to get started with an independent data science project. Fortunately, there are quite a few websites to peruse for examples and inspiration.
  • Getting started for beginners Starting with visualization is great advice.
  • Options for projects This guide to building a data science portfolio also offers a good overview of different kinds of projects possible: data cleaning, data storytelling, an "end to end" project, and an explanatory project. Picking what kind of project you'd like to undertake is a good start.
  • Guide to starting a data science project You won't need to write a formal proposal (since this is your personal project, you can work on whatever you want), but the other steps in this guide are applicable.
  • Scoping a project This guide to scoping a data science project is more detailed than necessary for a personal side project, but the takeaways are good (define the goal, determine data needs, determine analysis needed).
  • Project style guide Remember: how you put together your project is as important as your project topic! This guide is definitely worth reading.
  • Data science project template From Cookiecutter Data Science: "A logical, reasonably standardized, but flexible project structure for doing and sharing data science work."

Sometimes, you want to look at fully formed examples to get an idea of what you can do for your own project. Here are some examples of data science (or at least, data science-ish) projects suitable for lower division data science students: the projects use available data, (mostly) make the underlying code public, produce effective/interesting visuals, and are easy to read through. These examples also span a range of project options, such as making a tutorial for popular/frequently used datasets, learning new techniques, scraping your own data, or digging into a big dataset.

  • Kaggle Titanic tutorial One way to approach a data science side project is to write up your workflow/results as a tutorial for other people to use. This has multiple benefits: it helps you organize your thoughts, forces you to be explicit about your data wrangling and modeling, and adds your own personal touch when working with popular, frequently used datasets.
  • Visualize Spotify This project visualizes attributes of songs (beats per minute, loudness, length, etc.) from one of this person's Spotify playlists. There is a link in the post to a GitHub repository which includes the data, scripts, notebooks, and figures.
  • Text mining The Office This project uses text mining techniques on a dataset of every line from the TV series The Office. Note the cleaning steps to get the data ready for analysis!
  • Tracking emerging slang This project uses Google Trends data to track where new slang comes from (spatially and temporally). Could you recreate a similar analysis using Python? What other questions could you ask with Google Trends data?
  • Recipe recommendations API This project consists of three parts: scraping recipe data, building recommender models and building an API to be hosted on a web server. How might results change with different recipe data?
  • Video game sales This project using video game data relies heavily on data visualization. This example uses R, but consider: could you make similar plots in Python? What about PowerBI or Tableau?
  • Movie genre prediction This project uses elements of movie posters to predict movie genres using convolutional neural networks (CNN). The code is already available, making this a good project to practice looking through and understanding code written by someone else. What parts of the code are understandable based on prior coursework? Are there Python libraries used that are new to you?
  • Football (soccer) match outcome prediction Projects with this data predict the probability of match outcomes for each target class (home team wins, away (opponent) team wins and draw). This project includes dealing with missing and imbalanced data. A more detailed evaluation of various models can be found in this notebook . Try adapting this workflow to data from other sports of your choice!

Also consider reaching out to your fellow data science students about forming a group to work on an independent project. Group projects are a great way to develop important skills such as code collaboration (particularly using GitHub) and project workflow management. Working with a group also provides a built-in network for brainstorming ideas, troubleshooting code errors, and formalizing your project. Plus, it can be more motivating to work in a group, since you're relying on each other to make progress.

Alternatively, if you prefer to work on your own project, it would still be valuable to reach out to other people for code review . Reviewing someone else's code is a useful learning exercise, and having your own code reviewed by your peers is a good way to make sure you don't have any mistakes in your code. 

  • Using Common Crawl data The Common Crawl corpus contains petabytes of data and is available on Amazon S3. It contains raw web page data, extracted metadata and text extractions collected since 2008. The Common Crawl site includes tutorials and example projects using this data. This is a good dataset to use for a project if you want experience working with truly big data, navigating the Amazon web ecosystem, and using data mining techniques at scale.
  • Wayback Machine (archived web pages) Historical web page captures of sites are available via the Wayback Machine and can be extracted and analyzed with Python in a multistep process. The UC San Diego Library maintains a campus web archiving program to capture web sites relevant to the UCSD community. This presentation from UC Love Data Week 2023 demonstrates an example workflow for accessing and analyzing data in one of these web archive collections.
  • Papers with Code Papers with Code is a free and open resource with machine learning papers, code, datasets, methods, and evaluation tables. Browse by data type and task/method; most of the datasets and code examples are linked to associated peer-reviewed publications. Publications here are beyond the scope of most personal projects, but this site is a good central hub for learning more about cutting-edge (and classic) data science methods and models. A potential project would be to try implementing one of the methods/models, which requires learning new packages/functions (reading documentation), benchmarking and assessment, and interpreting technical results.
  • Previous DSC capstone projects This page includes links to past DSC capstone (180AB) projects. These represent multi-quarter projects, not personal projects, but they provide a good overview of the types of topics and methods found in many advanced personal projects.

When working on a personal project, you are building your data science portfolio , a public collection of your work you can share with future employers.

Having a well-organized GitHub with each project in its own repository is a great start to building your data science portfolio. You may eventually decide to create your own website. The format of your portfolio may vary; the important thing to keep in mind is that this is a way to showcase your work.

For an in-depth guide to developing your data science portfolio, check out this site from UC Davis DataLab .  

  • "Data Projects" section from  Scott Cole's personal website
  • Cultureplot , by Oliver Gladfelter 
  • Kaylin Pavlik's site
  • Sajal Sharma’s site
  • Last Updated: Mar 8, 2024 2:58 PM
  • URL: https://ucsd.libguides.com/data-science

Lesson 5 of 5 By Simplilearn

Big Data Projects

Current estimates suggest that internet users worldwide create 2.5 quintillion bytes of data every day. A big data project involves collecting, processing, evaluating, and interpreting large volumes of data to derive valuable insights, patterns, and trends. These projects frequently require specialized tools and techniques to handle the challenges posed by the sheer volume, velocity, and variety of the data. They are used across many domains, such as business, healthcare, and finance, to make informed decisions and gain a deeper understanding of complex phenomena.

What is a Big Data Project?

A big data project is a complex undertaking that focuses on harnessing large and diverse datasets. The key elements of a big data project include:

  • Volume, Velocity, and Variety 
  • Data Storage 
  • Data Processing
  • Data Integration 
  • Data Analysis and Mining
  • Scalability and Parallel Processing
  • Data Visualization
  • Privacy and Security 
  • Cloud Computing
  • Domain Applications

Why is a Big Data Project Important?

A big data project encompasses the complex processes of acquiring, managing, and analyzing large and varied datasets that often exceed the capabilities of conventional data processing techniques. It involves stages such as data sourcing, storage design, ETL operations, and the application of specialized analytics tools such as Hadoop and Spark. The project's success depends on addressing challenges like data quality, scalability, and privacy. The insights gained can lead to better decision-making, predictive modeling, and improved operational performance. Effective big data projects require a blend of domain knowledge, data engineering skills, and a strategic approach to handling data at an unprecedented scale.

Top 10 Big Data Projects

1. Google Bigtable


Google's Bigtable is a highly scalable NoSQL database system designed to handle large quantities of data while maintaining low-latency performance. It is used internally at Google to power various services, including Google Search, Google Analytics, and Google Earth, and to support big data analytics workloads.

Key Features

  • Bigtable can manage petabytes of data distributed across thousands of machines, making it suitable for very large datasets.
  • It offers low-latency read and write operations, making it suitable for real-time applications.

2. NASA’s Earth Observing System Data and Information System (EOSDIS)


EOSDIS is a comprehensive system that collects, archives, and distributes Earth science data from NASA's satellites, airborne sensors, and other instruments. It aims to give researchers, scientists, and the general public access to diverse environmental data.

  • EOSDIS encompasses several data centers, each specializing in a particular kind of Earth science data, such as land, ocean, and atmosphere.
  • These data centers ensure that the collected data is stored, managed, and made accessible for research and analysis, contributing to our understanding of Earth's systems and climate.

3. Facebook's Hive


Hive is a data warehousing infrastructure built on top of Hadoop, designed for querying and managing large datasets using a SQL-like language called HiveQL.

  • It lets users analyze data stored in the Hadoop Distributed File System (HDFS) with SQL-like queries instead of hand-written MapReduce programs.
  • Hive translates HiveQL queries into MapReduce jobs, making it simpler for data analysts and engineers to work with big data.
  • It supports partitioning, bucketing, and various optimization strategies to improve performance.
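
To make the translation from a query into map and reduce steps more concrete, here is a minimal Python sketch of how a grouped aggregation such as `SELECT country, SUM(bytes_served) FROM logs GROUP BY country` can be broken into map, shuffle, and reduce phases. The table name, column names, and rows are made up for illustration, and this is a toy model of the idea rather than how Hive itself is implemented.

```python
from collections import defaultdict

# Hypothetical rows standing in for a table stored in HDFS:
# (country, bytes_served). The values are invented for this sketch.
ROWS = [("US", 120), ("DE", 80), ("US", 30), ("FR", 50), ("DE", 20)]


def map_phase(rows):
    """Emit one (key, value) pair per input row, as a mapper would."""
    for country, amount in rows:
        yield country, amount


def shuffle_phase(pairs):
    """Group values by key, mimicking the shuffle between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Aggregate each key's values, as a reducer computing SUM(...) would."""
    return {key: sum(values) for key, values in grouped.items()}


def group_by_sum(rows):
    """Run the three phases in order, like one small MapReduce job."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


print(group_by_sum(ROWS))  # {'US': 150, 'DE': 100, 'FR': 50}
```

In a real cluster the map and reduce phases run in parallel on many machines; the sketch only shows the data flow on a single one.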

4. Netflix's Recommendation System


Netflix's recommendation system uses big data analytics and machine learning to personalize content recommendations for its users.

  • By analyzing customer behavior, viewing history, ratings, and preferences, the system suggests films and TV shows that align with each customer's tastes.
  • This improves user engagement and retention, as it helps customers discover content they might enjoy.
  • Netflix's recommendation engine uses a combination of collaborative filtering, content-based filtering, and deep learning algorithms to improve its accuracy and effectiveness.
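
Of the techniques listed above, collaborative filtering is the easiest to sketch. The example below is a minimal user-based version in plain Python: it predicts a rating for each unseen title as a similarity-weighted average of other users' ratings. The user names, titles, and ratings are invented for illustration, and nothing here reflects Netflix's actual implementation.

```python
import math

# Hypothetical ratings (1-5) keyed by user, then by title.
RATINGS = {
    "ana": {"space_doc": 5, "heist_drama": 4, "cooking_show": 1},
    "ben": {"space_doc": 4, "heist_drama": 5, "spy_thriller": 4},
    "cleo": {"cooking_show": 5, "spy_thriller": 2, "space_doc": 1},
}


def cosine(a, b):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def predict_ratings(user, ratings):
    """Predict ratings for titles the user has not rated yet."""
    weighted_sum, weight_total = {}, {}
    for other, their_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their_ratings)
        if sim <= 0:
            continue
        for title, rating in their_ratings.items():
            if title in ratings[user]:
                continue
            weighted_sum[title] = weighted_sum.get(title, 0.0) + sim * rating
            weight_total[title] = weight_total.get(title, 0.0) + sim
    return {t: weighted_sum[t] / weight_total[t] for t in weighted_sum}


def recommend(user, ratings, top_n=1):
    """Return the top_n unseen titles with the highest predicted rating."""
    predicted = predict_ratings(user, ratings)
    return sorted(predicted, key=predicted.get, reverse=True)[:top_n]


print(recommend("ana", RATINGS))  # ['spy_thriller']
```

Because "ana" rates titles much more like "ben" than like "cleo", the predicted rating for the unseen title sits closer to ben's 4 than to cleo's 2.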

5. IBM Watson


IBM Watson is an AI-powered platform that uses big data analytics, natural language processing, and machine learning to understand and process unstructured data. It has been applied in numerous domains, including healthcare, finance, and customer service.

  • Watson's capabilities include language translation, sentiment analysis, image recognition, and question answering.
  • It can process large quantities of data from diverse sources, such as documents, articles, and social media, to extract meaningful insights and provide relevant recommendations.
  • IBM Watson demonstrates the potential of big data technology to enable advanced AI applications and transform industries through data-driven decision-making.

6. Uber's Movement


Uber's Movement project is a prime example of how big data is used in urban mobility analysis. It uses anonymized ride data from Uber trips to offer insights into traffic patterns and transportation trends in towns and cities.

  • The data from Uber Movement can help urban planners, city officials, and researchers make informed decisions about infrastructure upgrades, traffic management, and public transportation plans.
  • Uber Movement provides access to aggregated and anonymized data through visualizations and datasets, allowing for a better understanding of traffic and congestion dynamics in different urban areas.

7. CERN's Large Hadron Collider (LHC)


The Large Hadron Collider (LHC) at CERN is the world's largest and most powerful particle accelerator. It generates huge quantities of data during particle collision experiments. To manage and analyze this data, CERN employs advanced big data technologies.

  • Distributed and grid computing architectures process the large datasets generated by the experiments, allowing scientists to find new particles and gain insights into fundamental physics principles.
  • The data generated by the LHC poses substantial challenges because of its volume and complexity, showing how crucial big data processing is to modern scientific research.

8. Twitter's Real-time Analytics


Twitter's real-time analytics use big data processing to monitor, analyze, and visualize trends, conversations, and user interactions as they happen. This lets companies, researchers, and even the general public gain insights into what is happening on the platform.

  • By processing and analyzing huge volumes of tweets and user engagement data, Twitter identifies trending topics, sentiment, and user behavior patterns.
  • This real-time data aids in understanding public sentiment, monitoring events, and improving marketing strategies.
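
One simple way to picture trending-topic detection is a sliding-window counter: keep only the hashtags seen in the last few minutes and rank them by frequency. The Python sketch below does this in memory for a single process. The class name, the window length, and the sample events are assumptions made for illustration, and the sketch assumes events arrive in non-decreasing timestamp order. It is not a description of Twitter's actual pipeline.

```python
from collections import Counter, deque


class TrendingCounter:
    """Count hashtags seen within the most recent `window_seconds`."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()    # (timestamp, hashtag), oldest first
        self.counts = Counter()  # hashtag -> count inside the window

    def add(self, timestamp, hashtag):
        """Record one hashtag occurrence and drop events that are too old."""
        self.events.append((timestamp, hashtag))
        self.counts[hashtag] += 1
        self._expire(timestamp)

    def _expire(self, now):
        """Remove events at or before the start of the current window."""
        while self.events and self.events[0][0] <= now - self.window:
            _, old_tag = self.events.popleft()
            self.counts[old_tag] -= 1
            if self.counts[old_tag] == 0:
                del self.counts[old_tag]

    def top(self, n):
        """Return the n most frequent hashtags in the current window."""
        return self.counts.most_common(n)


trending = TrendingCounter(window_seconds=60)
for ts, tag in [(0, "#a"), (10, "#b"), (20, "#b"), (70, "#c"), (71, "#c")]:
    trending.add(ts, tag)
print(trending.top(1))  # [('#c', 2)]
```

At production scale the same idea is usually spread across many machines with a stream processing framework, but the windowing logic stays the same.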

9. Walmart's Data Analytics


Walmart, one of the world's largest retailers, makes extensive use of data analytics to optimize many aspects of its operations. Big data analytics enables Walmart to make data-driven decisions in areas ranging from inventory control and supply chain optimization to pricing strategies and customer behavior analysis.

  • It helps maintain efficient inventory levels, minimize waste, improve customer experiences, and enhance overall business performance.
  • Walmart's data analytics efforts show how big data can transform conventional retail practices, resulting in significant improvements across operational areas.

10. City of Chicago's Array of Things


The City of Chicago's Array of Things is a network of sensor nodes deployed throughout the city to gather data on environmental factors such as air quality, temperature, and humidity.

  • The project aims to provide real-time data for urban planning and decision-making. By analyzing this data, city officials can make informed decisions about infrastructure upgrades, public safety, and overall quality of life.
  • The Array of Things project shows how the Internet of Things and big data technologies can contribute to building smarter and more sustainable cities.

Now that you have seen some of the best big data projects, it is time to take your knowledge to the next level. You can do so by enrolling in the Big Data Engineer Course by Simplilearn, offered in collaboration with IBM. Master the skills and move on to more advanced projects.

1. What are some common challenges in big data projects?

Common challenges in big data projects include:

  • Handling data quality.
  • Ensuring scalability.
  • Maintaining data security and privacy.
  • Managing diverse data formats.
  • Addressing hardware and infrastructure constraints.
  • Finding efficient ways to process and analyze massive volumes of data.

2. What are some widely used big data technologies?

Some widely used big data technologies include Hadoop (and ecosystem components such as HDFS, MapReduce, and Spark), NoSQL databases (such as MongoDB and Cassandra), and distributed computing frameworks (like Apache Flink).

3. How do I choose the right tools for my big data project?

To pick the right tools, consider your project's requirements. Evaluate factors like data volume, velocity, variety, and the complexity of the analyses required. Cloud solutions provide scalability, while open-source tools like Hadoop and Spark are flexible enough for many use cases. Choose tools that align with your team's skill set and budget.

4. What skills are needed for a successful big data project?

A successful big data project calls for a blend of skills. Data engineering skills, such as data acquisition, ETL, and data cleansing, are essential. Programming proficiency (e.g., Python, Java) is important for data processing and analysis. Knowledge of big data technologies, including Hadoop and Spark, is useful. Statistical and machine learning skills help in deriving insights from data. Additionally, problem-solving and teamwork are valuable for interpreting results in a meaningful context.

About the Author


Simplilearn is one of the world’s leading providers of online training for Digital Marketing, Cloud Computing, Project Management, Data Science, IT, Software Development, and many other emerging technologies.


IRIS – Institute For Research In Schools

Big Data: COVID-19

Introduce your students to the art of data science through contextual learning. They’ll uncover trends and develop narratives using the global Covid-19 database.

Project timeline

Prepare & launch: Teachers get ready and launch Big Data: Covid-19, using our helpful guidance documents.

Background research & skills development: With access to our support materials, students develop the knowledge and skills required to successfully complete research. This includes seminars and training packages on the virology of COVID-19 and epidemiology analysis using Excel and ‘R’, a mathematical statistics package.

Student research: Students explore the global COVID-19 database, making correlations and statistical analyses of public health data.

Artefact development and conference:  Students produce an article, academic poster presentation or academic paper, based on their research process and/or findings with the aim of exhibiting at IRIS’ conference.

This project is for UK state schools and colleges. It’s free and fully supported by our team. If you are a teacher and would like to start this project at your school, click the join button at the top right.

The volume of data available in our modern world is endless. Learning what good data looks like and how to decipher it will continue to be a valuable skill.

Big Data: Covid-19 introduces students to the art of data science through contextual learning. They start by gaining context on SARS-CoV-2, then move on to big data. Lessons involve working with real virology and SARS-CoV-2 data, providing insight into how epidemiologists model pandemics.

Once they’ve got the basics down, budding data scientists get to further their skills using the global Covid database. Students learn to use Excel and its Data Analysis package to develop a narrative using basic statistics, creating linear regressions, and plotting histograms. Once they master this, they move onto R, a mathematical statistics programming language.
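
The two basic techniques mentioned above, fitting a linear regression and building a histogram, can be sketched in a few lines of code. The project itself uses Excel and R; the Python version below is only an illustration of the underlying arithmetic, and the numbers are made up rather than taken from the global Covid database.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def histogram(values, bin_width):
    """Count values per bin; keys are the lower edge of each bin."""
    counts = {}
    for v in values:
        start = (v // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))


# Invented example: day number versus reported cases.
days = [0, 1, 2, 3, 4]
cases = [10, 12, 14, 16, 18]
slope, intercept = fit_line(days, cases)
print(slope, intercept)  # 2.0 10.0

# Invented example: ages of six patients, grouped into 10-year bins.
print(histogram([3, 7, 12, 18, 11, 25], bin_width=10))  # {0: 2, 10: 3, 20: 1}
```

The slope is the estimated change in cases per day, which is the kind of quantity students can then turn into a narrative about the data.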

A new, a vast, and a powerful language is developed for the future use of analysis, in which to wield its truths so that these may become of more speedy and accurate practical application for the purposes of mankind than the means hitherto in our possession have rendered possible. Ada Lovelace 1843

Can Big Data Have a Role in Treating Dementia? That’s What This Northeastern Student Is Hoping to Help Solve

Dementia is a devastating condition that impacts more than 55 million people globally, according to the World Health Organization. In the United States alone, it’s estimated that one in nine people over the age of 65 has Alzheimer’s.

Conditions like Alzheimer’s and other diseases that cause cognitive impairments can be difficult to treat. Early symptoms are often subtle and may go undetected by medical professionals. And as these diseases progress, they become harder and harder to manage. Individuals begin to lose the ability to speak, think, and move before eventually succumbing to the disease.

Ethan Wong, a fourth-year student at Northeastern University, will use the power of “big data” biology and neuroscience to help develop better early intervention models for those suffering with cognitive impairments when he starts his studies at Churchill College at Cambridge University this fall.

Wong is one of just 16 individuals around the globe this year to be honored with Cambridge University’s Churchill Scholarship. The illustrious honor was created by Churchill College at the request of Sir Winston Churchill when the college was founded in 1960, according to its website.

This isn’t Wong’s first award. He was also a 2023 recipient of the Barry Goldwater Scholarship, which recognizes students pursuing research in math, natural science and engineering.

“If I can do research that gives people one or two extra years to be a father, a mother or a grandparent, I think that’s super worth fighting for,” he says.

Wong, who is set to graduate in May with a major in biology and a minor in data science, has spent his college career at Northeastern University doing research in the neuroscience domain.

He started at Northeastern in 2020 and quickly began doing research at the university’s Laboratory for Movement Neurosciences, learning under professors Gene Tunik and Matthew Yarossi.

Wong focused his studies on the Trail Making Test (TMT), an assessment clinicians use to evaluate cognitive function in patients.

“It’s a connect-the-dots test,” Wong says. “But the unique thing about connecting the dots, is that it is both a cognitive task and a motor task. You have to not only see what the next number is and remember what it is, but you also have to move your hand.”

One of Wong’s projects involved developing a variation of the TMT that involves physical objects.

“We actually set up two shelves with cans on them, and they were labeled one through 10,” he said. “We also put grocery items on the shelf and the task was for people to take the items off the shelf as quickly as possible in the correct order.”

The project earned him a PEAK award from Northeastern University in 2021.

Wong has also completed a co-op at Beth Israel Deaconess Medical Center, working as a patient care tech.

Published on 23.4.2024 in Vol 12 (2024)

A Scalable Pseudonymization Tool for Rapid Deployment in Large Biomedical Research Networks: Development and Evaluation Study

Authors of this article:

  • Hammam Abu Attieh (1), MSc
  • Diogo Telmo Neves (1), BSc
  • Mariana Guedes (2, 3, 4), MSc, MD
  • Massimo Mirandola (5), PhD
  • Chiara Dellacasa (6), MSc
  • Elisa Rossi (6), MSc
  • Fabian Prasser (1), Prof Dr

(1) Medical Informatics Group, Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Berlin, Germany

(2) Infection and Antimicrobial Resistance Control and Prevention Unit, Centro Hospitalar Universitário São João, Porto, Portugal

(3) Infectious Diseases and Microbiology Division, Hospital Universitario Virgen Macarena, Sevilla, Spain

(4) Department of Medicine, University of Sevilla/Instituto de Biomedicina de Sevilla (IBiS)/Consejo Superior de Investigaciones Científicas (CSIC), Sevilla, Spain

(5) Infectious Diseases Division, Diagnostic and Public Health Department, University of Verona, Verona, Italy

(6) High Performance Computing (HPC) Department, CINECA - Consorzio Interuniversitario, Bologna, Italy

Corresponding Author:

Hammam Abu Attieh, MSc

Background: The SARS-CoV-2 pandemic has demonstrated once again that rapid collaborative research is essential for the future of biomedicine. Large research networks are needed to collect, share, and reuse data and biosamples to generate collaborative evidence. However, setting up such networks is often complex and time-consuming, as common tools and policies are needed to ensure interoperability and the required flows of data and samples, especially for handling personal data and the associated data protection issues. In biomedical research, pseudonymization detaches directly identifying details from biomedical data and biosamples and connects them using secure identifiers, the so-called pseudonyms. This protects privacy by design but allows the necessary linkage and reidentification.

Objective: Although pseudonymization is used in almost every biomedical study, there are currently no pseudonymization tools that can be rapidly deployed across many institutions. Moreover, using centralized services is often not possible, for example, when data are reused and consent for this type of data processing is lacking. We present the ORCHESTRA Pseudonymization Tool (OPT), developed under the umbrella of the ORCHESTRA consortium, which faced exactly these challenges when it came to rapidly establishing a large-scale research network in the context of the rapid pandemic response in Europe.

Methods: To overcome challenges caused by the heterogeneity of IT infrastructures across institutions, the OPT was developed based on programmable runtime environments available at practically every institution: office suites. The software is highly configurable and provides many features, from subject and biosample registration to record linkage and the printing of machine-readable codes for labeling biosample tubes. Special care has been taken to ensure that the algorithms implemented are efficient so that the OPT can be used to pseudonymize large data sets, which we demonstrate through a comprehensive evaluation.

Results: The OPT is available for Microsoft Office and LibreOffice, so it can be deployed on Windows, Linux, and MacOS. It provides multiuser support and is configurable to meet the needs of different types of research projects. Within the ORCHESTRA research network, the OPT has been successfully deployed at 13 institutions in 11 countries in Europe and beyond. As of June 2023, the software manages data about more than 30,000 subjects and 15,000 biosamples. Over 10,000 labels have been printed. The results of our experimental evaluation show that the OPT offers practical response times for all major functionalities, pseudonymizing 100,000 subjects in 10 seconds using Microsoft Excel and in 54 seconds using LibreOffice.

Conclusions: Innovative solutions are needed to make the process of establishing large research networks more efficient. The OPT, which leverages the runtime environment of common office suites, can be used to rapidly deploy pseudonymization and biosample management capabilities across research networks. The tool is highly configurable and available as open-source software.


As a response to the SARS-CoV-2 pandemic, many research projects have been rapidly set up to study the virus, its impact, and possible interventions [ 1 , 2 ]. This accelerated the general trend toward large collaborative networks in biomedical research [ 3 , 4 ]. These are motivated by the need to generate sufficiently large data sets and collections of biosamples, which are essential for developing new methods of personalized medicine and generating real-world evidence [ 5 ]. However, setting up such networks usually takes quite some time, as common tools and policies are needed to achieve interoperability and enable the required flows of data and biosamples [ 6 , 7 ]. One area in which this challenge is frequently encountered is the handling of personal data and the related data protection issues, which can arise in all processing steps, from collection [ 8 ] to sharing [ 9 ] and even analysis and visualization [ 10 ].

Laws and regulations, such as the European Union General Data Protection Regulation (GDPR) [ 11 ] or the US Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule [ 12 ], advocate for various strategies for the protection of personal data. In general terms, the GDPR prohibits the processing of sensitive categories of personal data, including medical data, unless consent is given. However, under certain conditions, processing is also possible without consent if technical and organizational safeguards are implemented [ 13 ]. Although there is no consensus on which protection methods are best suited for use in biomedical research [ 14 ], pseudonymization (also called coding or pseudo-anonymization) [ 15 ] is a common strategy, which can also be used to deidentify data under the HIPAA Privacy Rule. Pseudonymization is an essential aspect of the GDPR, as it is mentioned in multiple articles, in particular as a data minimization measure [ 16 ]. In this privacy-by-design approach, directly identifying data about study subjects are stored separately from biomedical data and biosamples, which are needed for scientific analyses [ 17 ]. The link between the different types of data and assets is established through secure identifiers, the so-called pseudonyms [ 18 ], which enable data linkage and allow the reidentification of subjects only if strictly necessary, for example, for follow-up data collection.
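
The separation described above, with identifying data in one place, research data in another, and pseudonyms as the only link between them, can be illustrated with a minimal Python sketch. The class name, the pseudonym format, and the choice to match subjects on normalized name plus birth date are assumptions made for this example. They are not the algorithms or data model of the OPT or of any other tool cited here.

```python
import secrets


class PseudonymRegistry:
    """Keeps identifying data separate and hands out random pseudonyms."""

    def __init__(self):
        self._by_identity = {}   # (name, birth_date) -> pseudonym
        self._by_pseudonym = {}  # pseudonym -> (name, birth_date)

    def pseudonym_for(self, name, birth_date):
        """Return the subject's pseudonym, creating one on first registration."""
        key = (name.strip().lower(), birth_date)
        if key not in self._by_identity:
            pseudonym = "P-" + secrets.token_hex(8)
            while pseudonym in self._by_pseudonym:  # regenerate on collision
                pseudonym = "P-" + secrets.token_hex(8)
            self._by_identity[key] = pseudonym
            self._by_pseudonym[pseudonym] = key
        return self._by_identity[key]

    def reidentify(self, pseudonym):
        """Resolve a pseudonym back to the identity (restricted operation)."""
        return self._by_pseudonym[pseudonym]


# The registry stays with the data custodian. The research data set below
# carries only the pseudonym, never the name or birth date.
registry = PseudonymRegistry()
pid = registry.pseudonym_for("Jane Example", "1980-01-31")
research_record = {"subject": pid, "antibody_titer": 42.0}  # invented value

# Registering the same subject again yields the same pseudonym (linkage).
assert registry.pseudonym_for("  jane example ", "1980-01-31") == pid
print(research_record["subject"].startswith("P-"))  # True
```

Because the pseudonym is random rather than derived from the identifying data, it reveals nothing about the subject on its own; reidentification is only possible through the registry.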

Although pseudonymization is done in almost any biomedical study, there are currently no pseudonymization tools that can rapidly be rolled out across many institutions. Existing tools, such as the Generic Pseudonym Administration Service (gPAS) [ 19 ] and Mainzelliste [ 20 ], are client-server applications, requiring server components to be deployed to and integrated into the institutions’ IT infrastructures. Although this can have some important advantages (see the Limitations and Future Work section), it is usually time-consuming, for example, due to a lack of resources or efforts required to ensure compliance with local security policies. Moreover, using central services, such as the European Unified Patient Identity Management (EUPID) [ 21 ], is often not an option, for example, when data should be reused and consent is missing for this type of processing [ 22 ].

In this paper, we present the ORCHESTRA Pseudonymization Tool (OPT) that has been developed under the umbrella of the ORCHESTRA consortium. This project faced the challenges described in the previous paragraph when quickly establishing a large-scale research network as part of Europe’s rapid pandemic response [ 23 ]. Hence, the OPT has been developed with the aim of supporting (1) the registration, pseudonymization, and management of study subject identities as well as biosamples; (2) rapid rollout across research network partners; and (3) scalability and simple configurability. The objective of this paper is to describe the design and implementation of the OPT and to offer insights into its usability and scalability, as evidenced by its deployment in the ORCHESTRA research network.

Ethical Considerations

The work described in this article covers the design and implementation of a generic research tool; it involved neither research on humans or human specimens nor epidemiological research with personal data. Therefore, no approval was required according to the statutes of the Ethics Committee of the Faculty of Medicine at Charité - Universitätsmedizin Berlin. However, the individual studies that use the tool usually have to apply for ethics approval. For example, the COVID HOME study within the ORCHESTRA project was approved by the Medical Ethical Review Committee of the University Medical Center Groningen (UMCG) under vote number METc 2020/158.

General Approach

The OPT has been designed to support general pseudonymization workflows that are needed in most biomedical research projects, as illustrated in Figure 1.

When a subject is admitted to the hospital, visits a study center, or has a follow-up visit, they are enrolled in the study. In this setting, the physicians or study nurses collect directly identifying and medical data and, according to the study protocol, the appropriate biosamples. The identifying attributes are entered into the OPT to create a unique pseudonym: the OPT Subject ID. During follow-up visits, the study staff can use the OPT to retrieve the existing pseudonym of a subject who is already enrolled in the study. In all downstream data collection or processing, the OPT Subject ID can be used instead of identifying data so that the medical data are protected but still linked to the study subject and across visits. In addition, biosample data can be entered into the OPT and linked to the appropriate subject to generate 1 or more additional pseudonyms: the OPT Biosample IDs. A label can then be generated for each biosample vial, containing the OPT Biosample ID, the OPT Subject ID, and a DataMatrix code, QR code, or barcode (encoding the OPT Biosample ID) for tracking the biosample via scanners commonly used in laboratories. Study-specific information, for example, the exact information to capture for each study subject and biosample, the number and schedule of visits, and the types and schedules of biosample collections, can all be configured in the OPT. Moreover, in addition to its applicability in prospective studies, as described above, the software also supports importing existing data about subjects and biosamples, which enables its use in retrospective study designs.
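The workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the actual OPT is implemented in Basic inside a spreadsheet, and the class name `PseudonymRegistry`, the identifier format (`S-`/`B-` prefix plus 8 random characters), and the choice of identifying attributes are all hypothetical, not taken from the tool.

```python
import secrets

# Alphabet without easily confused characters (an assumption, not the OPT's scheme).
ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"


def random_id(prefix, length=8):
    """Return a random identifier such as 'S-7KQ2M9XA' (hypothetical format)."""
    return prefix + "-" + "".join(secrets.choice(ALPHABET) for _ in range(length))


class PseudonymRegistry:
    """Holds identifying data and pseudonyms, kept apart from the medical data."""

    def __init__(self):
        self.subjects = {}    # normalized identity -> subject ID
        self.biosamples = {}  # biosample ID -> subject ID

    @staticmethod
    def _key(first_name, last_name, birth_date):
        # Normalize so that re-entering the same person yields the same key.
        return (first_name.strip().lower(), last_name.strip().lower(), birth_date)

    def register_subject(self, first_name, last_name, birth_date):
        """Return the existing pseudonym of a subject, or create a new one."""
        key = self._key(first_name, last_name, birth_date)
        if key not in self.subjects:
            new_id = random_id("S")
            while new_id in self.subjects.values():  # avoid rare collisions
                new_id = random_id("S")
            self.subjects[key] = new_id
        return self.subjects[key]

    def register_biosample(self, subject_id):
        """Create a biosample pseudonym linked to an already registered subject."""
        if subject_id not in self.subjects.values():
            raise KeyError(f"unknown subject ID: {subject_id}")
        sample_id = random_id("B")
        while sample_id in self.biosamples:  # avoid rare collisions
            sample_id = random_id("B")
        self.biosamples[sample_id] = subject_id
        return sample_id
```

Calling `register_subject` twice with the same attributes returns the same subject ID, which mirrors the retrieval of an existing pseudonym during a follow-up visit; downstream systems would then store only the subject and biosample IDs.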

Implementation Details

To overcome challenges caused by the heterogeneity of IT infrastructures across different institutions and a potential lack of support by IT departments due to resource constraints, the OPT has been implemented based on programmable runtime environments that are available at practically any institution: office suites. These suites, especially the one by Microsoft, are among the most widely used applications in the world and still play a key role in many sectors today. The OPT is available for Microsoft Office as an Excel application and for LibreOffice as a Calc application. The application logic has been implemented in the embedded Basic scripting language using efficient algorithms for data management. Microsoft Office supports Visual Basic for Applications, whereas LibreOffice supports LibreOffice Basic; the 2 languages share similarities but are not fully compatible with each other. In the development process of the OPT, the Excel version serves as the primary implementation, and changes as well as additions are regularly ported to the LibreOffice version to achieve feature parity.

For generating the labels for the biosample vials, the OPT is delivered together with a single-page label printing application that takes pseudonyms and metadata (eg, visit labels) as input and generates printable labels. Although this application is implemented using web technologies such as HTML, CSS, and JavaScript, it is delivered as files and can be executed locally without access to the internet. The label printing application works in any common web browser and can be called via the OPT. Properties of the labels to be printed can either be automatically transmitted via the URL for a single label or manually copied into the application via an input field for bulk printing of a larger number of labels. It is also possible to host the application on a web server. However, in this case, the URL function will be deactivated in the OPT to ensure that no data are sent to the server that hosts the application. It is important to note that the application still runs completely locally in the browser of the user, and no data ever leave the devices used to print labels. The pseudonyms and biosample metadata will be temporarily managed in the browser of the device.
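As a rough illustration of how label properties might be passed via the URL for a single label, the following Python sketch builds a local `file://` URL with the properties in the query string. The parameter names (`biosampleId`, `subjectId`, `visit`, `type`) and the file path are hypothetical; the OPT's actual URL scheme is not described here.

```python
from urllib.parse import quote, urlencode


def label_url(app_path, biosample_id, subject_id, visit, sample_type):
    """Build a local URL that hands the label properties to the printing page."""
    query = urlencode({
        "biosampleId": biosample_id,  # hypothetical parameter names
        "subjectId": subject_id,
        "visit": visit,
        "type": sample_type,
    })
    # A file:// URL keeps everything on the local device; no server is involved.
    return "file://" + quote(app_path) + "?" + query


if __name__ == "__main__":
    print(label_url("/srv/opt/labels/index.html", "B-7KQ2M9XA", "S-3FH8PZ2L", "Visit 1", "Serum"))
```

Because the query string of a locally opened page is only read by the script running in the browser, this pattern is consistent with the statement that no data leave the device; when the page is hosted on a web server, the same query string would be sent to that server, which is presumably why the URL function is deactivated in that case.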

Specific Functionalities

In addition to study subject and biosample management, the OPT also provides import and export functionalities, statistics, and a range of configuration options. In this section, we will briefly introduce each function, whereas a structured overview can be found in Multimedia Appendix 1. Regarding the subject-related functions, the OPT supports individual or bulk registration and a search function for finding the pseudonyms of already registered subjects. A search is required as the first step of any new subject or sample registration; it is linked to several data quality checks and a fuzzy record linkage process, which together prevent multiple registrations of the same study participant. The search function supports wildcards and fuzzy matching across a configured set of master data attributes. The bulk registration functionality enables the use of the OPT for retrospective pseudonymization of existing data sets. Additional properties for the registered individuals can be documented to account for site-specific requirements.
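To make the idea of a search with wildcards and fuzzy matching concrete, here is a minimal Python sketch. It is not the OPT's record linkage algorithm, which is not specified in this text; it substitutes the standard library's `difflib.SequenceMatcher` similarity ratio and `fnmatch` wildcards, and the threshold of 0.8 and the attribute names are assumptions.

```python
from difflib import SequenceMatcher
from fnmatch import fnmatchcase


def field_matches(term, value, threshold=0.8):
    """Match one search term against one stored attribute value."""
    term, value = term.strip().lower(), value.strip().lower()
    if "*" in term or "?" in term:
        # Wildcard search: '*' matches any run of characters, '?' a single one.
        return fnmatchcase(value, term)
    # Fuzzy search: tolerate small typos via a similarity ratio in [0, 1].
    return SequenceMatcher(None, term, value).ratio() >= threshold


def find_candidates(query, records, threshold=0.8):
    """Return the records whose attributes match every non-empty search term."""
    return [
        record
        for record in records
        if all(
            field_matches(term, record.get(attribute, ""), threshold)
            for attribute, term in query.items()
            if term
        )
    ]


if __name__ == "__main__":
    registered = [
        {"first_name": "Anna", "last_name": "Schmidt"},
        {"first_name": "Jon", "last_name": "Meyer"},
    ]
    # A transposed-letter typo still finds the registered subject.
    print(find_candidates({"last_name": "Schmitd"}, registered))
```

Running such a search before every registration lets the user see likely duplicates first, which is the purpose the text attributes to the mandatory search step.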

Biosample-related functions are designed analogously to those for study subject management. In addition, labels can be generated and printed through the service described in the previous section.

Import and export functionalities are provided to enable the creation of backups (see the next section) and the migration from old versions of the OPT as part of update processes.

Finally, separate worksheets display statistical information about the data captured, such as the number of subjects registered or pseudonyms created for different study visits. Extensive configuration options are also available through a separate worksheet.

All functionalities of the OPT are described briefly in an integrated Quick User Guide and in detail in a comprehensive user manual [ 24 ].

Security Considerations and Features

The data collected during study subject and biosample registration, as well as the pseudonyms generated, are sensitive and a critical part of the data managed in any study. Hence, the confidentiality, integrity, and availability [ 25 ] of the data managed in the OPT must be ensured. In this context, the approach taken by the OPT clearly trades off some of the guarantees that could be provided by a client-server application against the possibility of rapid deployment and rollout. However, as described in the user manual, care has been taken to provide robust guarantees by specifying requirements on how the OPT should be deployed and used [ 24 ]. First, the OPT should not be placed on a local drive but on a network share that is integrated with the institution’s Authentication and Authorization Infrastructure and, hence, provides means for controlling who is able to access the software in read or write mode and from which devices. Second, it is highly recommended that this share be backed up regularly so that data can be restored in case of problems. This should be complemented by regular, for example, daily, manual backups through the export functionality provided by the OPT and according to reminders that are displayed by the software. Finally, the office suites used as runtime environments do not provide multiuser support, and the application can only be opened by 1 user with write permission at any point in time. To enable parallel read access, the OPT comes with a script that opens a temporary read-only copy of the software. This allows, for example, laboratory technicians to use the OPT for generating biosample labels in parallel with ongoing registration processes. The measures described in this section have proven to be effective, and no problems have been encountered to date during extensive use of the software at many institutions (see the Results section).
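The parallel read access described above relies on opening a temporary read-only copy. A minimal sketch of that idea in Python follows; the script shipped with the OPT is not reproduced here, and the function name, the temporary directory prefix, and the example file name are hypothetical.

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path


def make_read_only_copy(source):
    """Copy the workbook to a fresh temporary folder and drop write permission."""
    source = Path(source)
    temp_dir = Path(tempfile.mkdtemp(prefix="opt-readonly-"))
    copy_path = temp_dir / source.name
    shutil.copy2(source, copy_path)     # the original on the share is untouched
    os.chmod(copy_path, stat.S_IREAD)   # owner may read, nobody may write
    return copy_path


if __name__ == "__main__":
    # Hypothetical usage: the path would point to the workbook on the network share.
    demo = Path(tempfile.mkdtemp()) / "opt-demo.xlsm"
    demo.write_bytes(b"demo content")
    print(make_read_only_copy(demo))
```

The copy reflects the state of the data at the moment it was made, so a user working on it, for example to print labels, would not see registrations entered afterward.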

Overview of the Application

The graphical user interface of the OPT is divided into 10 different perspectives that provide access to the functionalities described in the previous sections. One of those sheets, the configuration sheet, is hidden from the users. All other sheets have write protection using the integrated protection functions of the spreadsheet software, except the input fields and the buttons, to ensure that data management is only performed through the specific functionalities provided by the software. A password is set by default for the write protection, which can be changed by the administrator at any time. However, it is important to keep the password safe. Figure 2 provides an overview of 4 important perspectives.

Figure 2A shows the configuration sheet, in which the specifics of the algorithm for generating pseudonyms, the study schedule, and the data fields to be documented can be specified. Figure 2B shows the interface provided for searching and registering subjects, with a search form on the left side of the sheet and a results list on the right side. All study subject data stored in the OPT are listed in the sheet shown in Figure 2C. This sheet also allows users to document any additional data that a site may require. Finally, Figure 2D shows a sheet providing statistical information on the number of subjects and biosamples registered, as well as insights into how these numbers have developed over time.

An overview of the label printing application is provided in Figure 3. As shown in the figure, the data that are to be printed on the labels are listed, and the number of rows and columns can be configured to support printing in bulk or for individual labels. The figure also shows an example of a sheet that can be printed and a detailed image of a single label. The data that are printed on those labels include the biosample and study subject IDs, the associated visit of the study schedule, and the biosample type.

Use of the OPT in the ORCHESTRA Project

ORCHESTRA is a 3-year international research project about the COVID-19 pandemic that was established in December 2020, involving 26 partners from 15 countries. The aim of ORCHESTRA is to share and analyze data from several retrospective and prospective studies to provide rigorous evidence for improving the prevention and treatment of COVID-19 and to better prepare for future pandemics [ 26 , 27 ].

The data management architecture in ORCHESTRA consists of 3 layers that build upon each other. The first layer is formed by “National Data Providers,” which consist of the participating partners (universities, hospitals, and research networks). These provide the subject data and samples for joint analyses. On the second layer, “National Hubs” pool pseudonymized data in national instances of the Research Electronic Data Capture (REDCap) system [ 28 ]. Finally, the “ORCHESTRA Data Portal” forms the third layer, in which access to aggregated data and results is provided through a central repository.

In ORCHESTRA, the OPT was used for implementing pseudonymization at the data providers’ sites. Each participating site named 1 or 2 persons responsible for technical aspects, such as setting up the required network share and installing updates, as well as several study nurses or clinicians, who would use the OPT. With these users, we performed regular training sessions and provided contact details in case of questions. As of June 2023, 19 instances of the OPT have been rolled out to 13 sites in 11 countries, including Germany, France, Italy, and Slovakia in Europe; Congo in Africa; and Argentina in South America. A world map highlighting all the countries in which the OPT has been rolled out can be found in Multimedia Appendix 2.

On average, each instance of the OPT was used by up to 4 staff members. The OPT has been successfully rolled out, used, and maintained at large sites with committed IT departments, as well as at smaller, resource-constrained institutions. Overall, it has been in constant production use for more than 2 years. In the majority of the sites (10/13, 77%), the OPT Microsoft Excel version was used, whereas the remaining sites (3/13, 23%) used the LibreOffice release. In total, more than 10,000 study subjects and 15,000 samples have been registered in the OPT across all sites, and more than 10,000 labels have been printed. To evaluate the usability of the OPT, we conducted a survey among all active users, leveraging the widespread System Usability Scale [ 29 ] questionnaire, which includes 10 Likert-scale questions. During this survey, our system was designed to prevent multiple responses from individual participants and the submission of incomplete responses. We received 6 responses from 9 invited users, resulting in a score of 75 on a scale from 0 to 100, which adjectively translates to “good” [ 30 ].
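For readers unfamiliar with how a System Usability Scale score is derived, the standard scoring rule can be written as a short Python function: odd-numbered items contribute the response minus 1, even-numbered items contribute 5 minus the response, and the sum is multiplied by 2.5. The example responses below are invented for illustration and are not the survey data from this study.

```python
def sus_score(responses):
    """Compute the SUS score (0 to 100) for one respondent's 10 answers (1 to 5)."""
    if len(responses) != 10:
        raise ValueError("the SUS questionnaire has exactly 10 items")
    total = 0
    for item, response in enumerate(responses, start=1):
        if not 1 <= response <= 5:
            raise ValueError("responses must be on a 1 to 5 Likert scale")
        # Odd items are positively worded, even items negatively worded.
        total += (response - 1) if item % 2 == 1 else (5 - response)
    return total * 2.5


def mean_sus(all_responses):
    """Average the individual SUS scores over all respondents."""
    scores = [sus_score(responses) for responses in all_responses]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Invented example: agreeing (4) with positive items, disagreeing (2) with negative ones.
    print(sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))
```

The reported overall score is typically the mean of the individual scores, as computed by `mean_sus`.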

Performance Evaluation

As mentioned, the OPT has been carefully designed to provide acceptable performance, even when large data sets are being processed or a large number of subjects or samples are being managed. In this section, we present the results of a brief performance evaluation. Our test environment consisted of an average office laptop, which was equipped with a quad-core 1.8 GHz Intel Core i7 CPU and a 64-bit Microsoft Windows 10 operating system. On this system, Microsoft Excel 2016 (32-bit) and LibreOffice 7.0 (64-bit) were installed. Figure 4 provides an overview of the execution times of the most important functionalities of the OPT for different cohort sizes.

The numbers clearly show that the OPT works well and provides excellent performance for small or medium-sized data sets and acceptable performance for large data sets.

Figure 4A shows the average execution times for importing data about study subjects and samples. Data about subjects were imported into a completely empty OPT, whereas data about samples were imported into an OPT that already had the corresponding study subjects registered, so that each biosample was assigned to exactly 1 subject. For example, importing the data of 100,000 subjects took about 10 seconds in the Excel version and 54 seconds in the LibreOffice version. During the registration, the existence of the associated study subject in the OPT is checked, which makes the registration of samples slower compared to the registration of subjects. This is also noticeable in Figure 4B, which shows the average execution times for registering a single study subject or sample. As can be seen, using an OPT data set in which 100,000 entities were already registered, this took between 2 and 4 seconds in the Excel version and between 4 and 6 seconds in the LibreOffice version. Figure 4C shows the average execution times for searching for entities and obtaining their pseudonym, which is roughly twice as fast as the registration operation.

As execution times increase linearly with the number of entities already managed, subsecond response times can be expected for instances in which around 15,000 or fewer subjects or samples have been registered. This is consistent with our experiences from the deployments in the ORCHESTRA research network.
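The subsecond estimate can be checked with a small calculation. The sketch below scales the registration times reported for 100,000 registered entities down to 15,000 entities; it assumes strict proportionality with no fixed overhead, which is a simplification of the linear relationship stated above.

```python
def estimated_seconds(n_entities, measured_seconds, measured_at=100_000):
    """Scale a measured execution time linearly to another data set size."""
    return measured_seconds * n_entities / measured_at


if __name__ == "__main__":
    # Registration times reported for 100,000 already registered entities.
    reported = {"Excel": (2.0, 4.0), "LibreOffice": (4.0, 6.0)}
    for version, (low, high) in reported.items():
        print(
            version,
            round(estimated_seconds(15_000, low), 2),
            "to",
            round(estimated_seconds(15_000, high), 2),
            "seconds",
        )
```

Under this assumption, the upper bounds at 15,000 entities come out at roughly 0.6 seconds for the Excel version and 0.9 seconds for the LibreOffice version, both below 1 second.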

Principal Findings

In this paper, we presented the OPT, a comprehensive, scalable, and pragmatic pseudonymization tool that can be rapidly rolled out across large research networks. To achieve this, the software has been implemented based on runtime environments that are available at practically any institution: office suites. The software supports a broad range of functionalities, from registering and pseudonymizing subject and biosample identities to search and depseudonymization functions, statistics about the data managed, as well as import and export features. We have described measures that are recommended to ensure the security of the data managed by the OPT and reported on our experiences gained after 2 years of successful operation in a large research network on COVID-19. Finally, we have also presented the results of a performance evaluation showing that the software provides excellent performance for small or medium-sized data sets and acceptable performance for large data sets. The OPT is available as open-source software [ 31 ] and can be configured to meet the needs of a wide range of biomedical research projects.

Limitations and Future Work

To achieve the design goals of the OPT, some compromises had to be made regarding data management. Compared to using client-server applications that use database management systems to store data, it is more difficult to ensure the confidentiality, integrity, and availability of the data managed with the OPT. There is also limited support for multiuser scenarios. However, we have developed and documented a set of measures that, if taken, help to still ensure a high level of data security. For this to work, it is important that users adhere to those recommendations. Therefore, all users of the OPT should familiarize themselves with the manual [ 24 ], and ideally, they should also be trained in the use and operation of the software. Despite these limitations, we strongly believe that our approach offers an innovative take on pseudonymization tools that can rapidly be rolled out across large research networks. Of course, it would be even more desirable if global standards for pseudonymization functions could be developed and agreed upon. Such global standards would ensure that solutions already existing at many research institutions are interoperable and can readily be used in joint research activities.

Comparison With Related Work

A range of pseudonymization tools have been described in the literature and are available as open-source software. However, they are (1) based on a client-server architecture and hence require considerable effort to be rolled out across sites, (2) based on central services and hence not usable if consent is lacking for this type of processing, or (3) offered as command-line utilities or programming libraries for IT experts.

Examples of client-server approaches include the work by Lablans et al [ 20 ] to provide a RESTful interface to pseudonymization services in modern web applications, which is based on a concept suggested by Pommerening et al [ 6 ] in 2006. Moreover, researchers from the University of Greifswald in Germany have designed and developed several client-server tools that can be used to manage subjects, samples, and other aspects of biomedical studies [ 32 , 33 ].

Examples of central services for pseudonymization include the EUPID, which was developed in 2014 by the Austrian Institute of Technology for the European Network for Cancer Research in Children and Adolescents project [ 21 ]. Another example is the Secure Privacy-preserving Identity management in Distributed Environments for Research (SPIDER) service, which was launched in May 2022 by the Joint Research Centre [ 34 ]. Both services support linking and transferring subject data across registries without revealing their identities. However, biosample data management is not possible with them. Further centralized concepts include the one described by Angelow et al [ 35 ].

Examples of command-line utilities, application programming interfaces, and programming libraries include the generic solution for record linkage of special categories of personal data developed by Fischer et al [ 36 ]; that by Preciado-Marquez et al [ 37 ]; and the PID (patient ID) generator developed by the TMF (Technologies, Methods and Infrastructure for Networked Medical Research e.V.), the German umbrella association for networked medical research [ 6 ].

Widely available office suites provide runtime environments that offer opportunities to rapidly roll out software components for biomedical studies across a wide range of large and resource-constrained research institutions. We have demonstrated this through the development, practical use, and evaluation of the OPT, which offers pseudonymization functionalities for study subjects and biosamples. As we believe that the software is of interest to the larger research community, it has been made available under a permissive open-source license [ 31 ].


This work has been funded by the European Union’s Horizon 2020 research and innovation programme under the project ORCHESTRA (grant agreement 101016167).

Conflicts of Interest

None declared.

Multimedia Appendix 1: Overview of the ORCHESTRA Pseudonymization Tool functions.

Multimedia Appendix 2: Map of countries in which the ORCHESTRA Pseudonymization Tool has been rolled out.

  1. Dron L, Dillman A, Zoratti MJ, Haggstrom J, Mills EJ, Park JJH. Clinical trial data sharing for COVID-19-related research. J Med Internet Res. Mar 12, 2021;23(3):e26718.
  2. R&D Blueprint. A coordinated global research roadmap: 2019 novel coronavirus. World Health Organization; Mar 12, 2020. URL: https://www.who.int/publications/m/item/a-coordinated-global-research-roadmap [Accessed 2024-04-12]
  3. Guinney J, Saez-Rodriguez J. Alternative models for sharing confidential biomedical data. Nat Biotechnol. May 9, 2018;36(5):391-392.
  4. Walport M, Brest P. Sharing research data to improve public health. Lancet. Feb 12, 2011;377(9765):537-539.
  5. Mahmoud A, Ahlborn B, Mansmann U, Reinhardt I. Clientside pseudonymization with trusted third-party using modern web technology. Stud Health Technol Inform. May 27, 2021;281:496-497.
  6. Pommerening K, Schröder M, Petrov D, Schlösser-Faßbender M, Semler SC, Drepper J. Pseudonymization service and data custodians in medical research networks and biobanks. In: INFORMATIK 2006 – INFORMATIK für Menschen. Vol 1. Gesellschaft für Informatik e.V; 2006;715-721. ISBN: 978-3-88579-187-4
  7. Tacconelli E, Gorska A, Carrara E, et al. Challenges of data sharing in European COVID-19 projects: a learning opportunity for advancing pandemic preparedness and response. Lancet Reg Health Eur. Oct 2022;21:100467.
  8. Rumbold J, Pierscionek B. Contextual anonymization for secondary use of big data in biomedical research: proposal for an anonymization matrix. JMIR Med Inform. Nov 22, 2018;6(4):e47.
  9. Aamot H, Kohl CD, Richter D, Knaup-Gregori P. Pseudonymization of patient identifiers for translational research. BMC Med Inform Decis Mak. Jul 24, 2013;13:75.
  10. Wu X, Wang H, Zhang Y, Li R. A secure visual framework for multi-index protection evaluation in networks. Digit Commun Netw. Apr 2023;9(2):327-336.
  11. Regulation (EU) 2016/679 of the European Parliament and of the Council. Official Journal of the European Union. Apr 27, 2016. URL: https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32016R0679 [Accessed 2024-04-12]
  12. U.S. Department of Health and Human Services, Office for Civil Rights. HIPAA administrative simplification: regulation text: 45 CFR parts 160, 162, and 164 (unofficial version, as amended through March 26, 2013). U.S. Department of Health and Human Services. Mar 26, 2013. URL: https://www.hhs.gov/sites/default/files/hipaa-simplification-201303.pdf [Accessed 2024-04-12]
  13. Quinn P. Research under the GDPR - a level playing field for public and private sector research? Life Sci Soc Policy. Mar 1, 2021;17(1):4.
  14. Rodriguez A, Tuck C, Dozier MF, et al. Current recommendations/practices for anonymising data from clinical trials in order to make it available for sharing: a scoping review. Clin Trials. Aug 2022;19(4):452-463.
  15. Kohlmayer F, Lautenschläger R, Prasser F. Pseudonymization for research data collection: is the juice worth the squeeze? BMC Med Inform Decis Mak. Sep 4, 2019;19(1):178.
  16. Gruschka N, Mavroeidis V, Vishi K, Jensen M. Privacy issues and data protection in big data: a case study analysis under GDPR. Presented at: 2018 IEEE International Conference on Big Data (Big Data); Dec 10 to 13, 2018;5027-5033; Seattle, WA.
  17. Lautenschläger R, Kohlmayer F, Prasser F, Kuhn KA. A generic solution for web-based management of pseudonymized data. BMC Med Inform Decis Mak. Nov 30, 2015;15:100.
  18. European Union Agency for Cybersecurity, Drogkaris P, Bourka A. Recommendations on shaping technology according to GDPR provisions - an overview on data pseudonymisation. European Network and Information Security Agency; 2018.
  19. Bialke M, Bahls T, Havemann C, et al. MOSAIC--a modular approach to data management in epidemiological studies. Methods Inf Med. 2015;54(4):364-371.
  20. Lablans M, Borg A, Ückert F. A RESTful interface to pseudonymization services in modern web applications. BMC Med Inform Decis Mak. Feb 7, 2015;15:2.
  21. Nitzlnader M, Schreier G. Patient identity management for secondary use of biomedical research data in a distributed computing environment. Stud Health Technol Inform. 2014;198:211-218.
  22. El Emam K, Rodgers S, Malin B. Anonymising and sharing individual patient data. BMJ. Mar 20, 2015;350:h1139.
  23. Connecting European cohorts to increase common and effective response to SARS-CoV-2 pandemic: ORCHESTRA. European Commission. Apr 21, 2022. URL: https://cordis.europa.eu/project/id/101016167/de [Accessed 2023-06-02]
  24. BIH-MI/opt: ORCHESTRA pseudonymization tool - user manual. GitHub. Sep 24, 2023. URL: https://github.com/BIH-MI/opt/blob/main/development/documentation/user-manual.pdf [Accessed 2023-09-26]
  25. ISO/IEC 27001:2022 information security, cybersecurity and privacy protection - information security management systems - requirements. International Organization for Standardization; 2022. URL: https://www.iso.org/standard/27001 [Accessed 2024-04-12]
  26. Azzini AM, Canziani LM, Davis RJ, et al. How European research projects can support vaccination strategies: the case of the ORCHESTRA project for SARS-CoV-2. Vaccines (Basel). Aug 14, 2023;11(8):1361.
  27. ORCHESTRA - EU horizon 2020 cohort to tackle COVID-19 internationally. ORCHESTRA. Sep 19, 2022. URL: https://orchestra-cohort.eu/ [Accessed 2023-04-12]
  28. Harris PA, Taylor R, Thielke R, Payne J, Gonzalez N, Conde JG. Research electronic data capture (REDCAP)--a metadata-driven methodology and workflow process for providing translational research Informatics support. J Biomed Inform. Apr 2009;42(2):377-381.
  29. Brooke J. SUS: a quick and dirty usability scale. In: Usability Evaluation in Industry. CRC Press; 1996;189-194.
  30. Bangor A, Kortum P, Miller J. Determining what individual SUS scores mean: adding an adjective rating scale. J Usability Stud. May 2009;4(3):114-123. URL: https://uxpajournal.org/wp-content/uploads/sites/7/pdf/JUS_Bangor_May2009.pdf [Accessed 2024-04-12]
  31. BIH-MI/opt: ORCHESTRA pseudonymization tool. GitHub. Jun 2, 2023. URL: https://github.com/BIH-MI/opt [Accessed 2023-06-02]
  32. Bialke M. Werkzeuggestützte Verfahren für die Realisierung einer Treuhandstelle im Rahmen des zentralen Datenmanagements in der epidemiologischen Forschung [Dissertation]. Universitätsmedizin der Ernst-Moritz-Arndt-Universität Greifswald; 2016. URL: https://d-nb.info/1124566945/34 [Accessed 2024-04-12]
  33. Bialke M, Penndorf P, Wegner T, et al. A workflow-driven approach to integrate generic software modules in a trusted third party. J Transl Med. Jun 4, 2015;13:176.
  34. SPIDER pseudonymisation tool. European Commission. May 4, 2023. URL: https://eu-rd-platform.jrc.ec.europa.eu/spider/ [Accessed 2023-06-02]
  35. Angelow A, Schmidt M, Weitmann K, et al. Methods and implementation of a central biosample and data management in a three-centre clinical study. Comput Methods Programs Biomed. Jul 2008;91(1):82-90.
  36. Fischer H, Röhrig R, Thiemann VS. Simple Batch Record Linkage System (SimBa) – a generic tool for record linkage of special categories of personal data in small networked research projects with distributed data sources: lessons learned from the Inno_RD project. In: Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 64. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS). German Medical Science GMS Publishing House; 2019.
  37. Preciado-Marquez D, Becker L, Storck M, Greulich L, Dugas M, Brix TJ. MainzelHandler: a library for a simple integration and usage of the Mainzelliste. Stud Health Technol Inform. May 27, 2021;281:233-237.


Edited by Christian Lovis; submitted 06.06.23; peer-reviewed by James Scheibner, Xiang Wu; final revised version received 03.10.23; accepted 07.03.24; published 23.04.24.

© Hammam Abu Attieh, Diogo Telmo Neves, Mariana Guedes, Massimo Mirandola, Chiara Dellacasa, Elisa Rossi, Fabian Prasser. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 23.4.2024.

This is an open-access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/ , as well as this copyright and license information must be included.

AI Stocks: 9 Biggest Companies in 2024

April 22, 2024 — 04:30 pm EDT

Written by Melissa Pistilli for Investing News Network

Artificial intelligence (AI) may be an emerging technology, but there are plenty of billion-dollar companies in this space.

As the market has grown over the past few years, AI technology has made strong inroads into several key industries, including logistics, manufacturing, finance, healthcare, customer service and cybersecurity.

While AI-driven advancements in robotics have received the most press in recent years, the latest buzz has centered around OpenAI’s ChatGPT. This intelligent chatbot shows how quickly generative AI is advancing, and has attracted the attention of heavyweight technology companies such as Microsoft (NASDAQ:MSFT), which has reportedly invested billions of dollars in the privately held OpenAI. Alphabet (NASDAQ:GOOGL) has also released its own AI chat tool, Google Gemini.

On a global scale, Fortune Business Insights predicts that the AI industry will experience a compound annual growth rate of 20.2 percent between 2024 and 2032 to reach a market value of more than US$2.74 trillion.
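As a rough check on the forecast above, the compound-growth arithmetic can be run backwards to see what 2024 starting value it implies. The sketch below is illustrative only: it assumes eight annual compounding periods (2024 to 2032) and treats US$2.74 trillion as the exact end value, neither of which the forecast states explicitly, and the function names are hypothetical.

```python
def compound(start_value, cagr, years):
    """Grow start_value at a constant compound annual growth rate for `years` periods."""
    return start_value * (1.0 + cagr) ** years


def implied_start_value(end_value, cagr, years):
    """Back out the starting value implied by an end value and a CAGR."""
    return end_value / (1.0 + cagr) ** years


end_2032 = 2.74e12    # US$2.74 trillion, from the forecast quoted in the text
cagr = 0.202          # 20.2 percent per year
years = 2032 - 2024   # assumption: eight compounding periods

start_2024 = implied_start_value(end_2032, cagr, years)
print(f"Implied 2024 market size: US${start_2024 / 1e9:,.0f} billion")
```

Because the forecast says "more than" US$2.74 trillion, the implied starting value should be read as a lower bound under these assumptions.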

Here the Investing News Network profiles some of the biggest AI stocks by market cap on US, Canadian and Australian stock exchanges. Data was gathered on April 12, 2024, using TradingView’s stock screener.

American AI stocks

According to Tracxn Technologies, the number of US AI companies has more than doubled since 2017 with over 70,700 companies working in the sector today.

One of the major factors fueling growth in the American AI market, states Statista, is “the growing investments and partnerships among technology companies, research institutions, and governments.”

Below are three of the top US AI stocks.

1. Microsoft (NASDAQ:MSFT)

Market cap: US$3.134 trillion; share price: US$421.74

In addition to the reported billions Microsoft is committed to investing in OpenAI, the technology behemoth has built its own AI solutions based on the chatbot creator’s technology: Bing AI and Copilot. OpenAI officially licensed its technologies to Microsoft in 2020.

An update to Windows 11 in 2023 integrated Bing into the operating system's search bar, allowing users to interact with the chatbot directly with Microsoft's Edge browser, Chrome and Safari.

Microsoft’s moves into generative AI have translated into higher revenues for its Azure cloud computing business, and a higher market capitalization as the tech giant pushed past the US$3 trillion mark in January 2024. The company is also expected to unveil its first AI PC this year.


2. NVIDIA (NASDAQ:NVDA)

Market cap: US$2.215 trillion; share price: US$866.11

The global leader in graphics processing unit (GPU) technology, NVIDIA is designing specialized chips used to train AI and machine learning models for laptops, workstations, mobile devices, notebooks, and PCs. The company is partnering with several big-name tech firms to bring key AI products to market.

Through its partnership with Dell Technologies (NYSE:DELL), NVIDIA is developing AI applications for enterprises, such as language-based services, speech recognition and cybersecurity. The chip maker has been instrumental in the build-out of Meta Platforms’ (NASDAQ:META) AI supercomputer called the Research SuperCluster, which reportedly uses a total of 16,000 of NVIDIA's GPUs.

Most recently, NVIDIA and the Taiwan Semiconductor Manufacturing Company (NYSE:TSM) have developed the world's first multi-die chip specifically designed for AI applications: the Blackwell GPU. Blackwell’s architecture allows for the increased processing power needed to train larger and more complex AI models.

NVIDIA’s AI ambitions were on full display at its GPU Technology Conference in March, where CEO Jensen Huang presented his company’s plans to build humanoid robots, known as Project GR00T. “Building foundation models for general humanoid robots is one of the most exciting problems to solve in AI today,” stated Huang in his keynote presentation.

3. Alphabet (NASDAQ:GOOGL)

Market cap: US$1.967 trillion; share price: US$158.92

Alphabet holds court with both Microsoft and NVIDIA as part of the tech sector’s Magnificent 7, and its foray into AI has also brought the tech giant much success. As of April 12, Alphabet’s market cap looks set to surpass the US$2 trillion mark.

It would seem investors remain confident in the potential for growth in Alphabet’s AI ventures despite hiccups in the rollout of its subsidiary Google’s AI chatbot Gemini, formerly called Bard. “While the headlines haven’t been favorable, Google’s role in generative AI products will present massive growth opportunities for the stock,” said Sylvia Jablonski, chief executive officer at Defiance ETFs.

In early April, Google introduced a custom AI chip designed for its cloud services customers. Set to be delivered later this year, the technology uses British semiconductor company Arm Holdings’ (NASDAQ:ARM) AI architecture. In the same week, Google revealed its new A3 Mega AI processor based on NVIDIA’s H100 technology.

Canadian AI stocks

Recognized as a world-leading AI research hub, Canada ranks fifth out of 54 countries in the Global AI Index. Since 2017, the Canadian government has invested hundreds of millions of dollars into accelerating the research and commercialization of AI technology in the country through the Pan-Canadian Artificial Intelligence Strategy.

Recent research by IBM (NYSE:IBM) says Canadian businesses are increasingly adopting AI, with 37 percent of IT professionals in large enterprises reporting that they have deployed the technology in their operations.

Below are three of the top Canadian AI stocks.


1. CGI

Market cap: US$33.238 billion; share price: US$143.45

Montreal-based CGI is among the world’s largest IT systems integration companies, and offers a wide range of services, from cloud migration and digital transformation to data analysis, fraud detection, and even supply chain optimization. Its more than 700 clients span the retail, wholesale, consumer packaged goods and consumer services sectors worldwide.

Through a partnership with Google, CGI is leveraging the Google Cloud Platform to strengthen the capabilities of its CGI PulseAI™ solution, which can be integrated with existing applications and workflows.

CGI is aggressively working to expand its generative AI capabilities and client offerings, and reportedly is planning to invest US$1 billion into its AI offerings. In early March, the company launched Elements360 ARC-IBA, an AI-powered platform for brokers and insurers to settle accounts in the UK broking industry.

2. OpenText (TSX:OTEX)

Market cap: C$13.366 billion; share price: C$48.58

Ontario-based OpenText is one of Canada’s largest software companies. The tech firm develops and sells enterprise information management software. Its portfolio includes hundreds of products in the areas of enterprise content management, digital process automation and security, plus AI and analytics tools. OpenText serves small businesses, large enterprises and governments alike.

OpenText's AI & Analytics platform has an open architecture that enables integration with other AI services, including Google Cloud and Azure. It can leverage all types of data, including structured or unstructured data, big data and the internet of things (IoT) to quickly create interactive visuals.

In January, OpenText launched its Cloud Editions 24.1, which includes enhancements to its OpenText Aviator portfolio. “Leveraging AI for impactful results depends on reliable data – without it, even the most skilled data scientists will struggle,” stated OpenText CEO and CTO Mark J. Barrenechea. “By expanding the Aviator portfolio in conjunction with our world class information management platform, Cloud Editions 24.1 empowers customers with the tools and insights needed to get ahead.”

3. Descartes Systems Group (TSX:DSG)

Market cap: C$8.9 billion; share price: C$104

Descartes Systems Group provides on-demand software-as-a-service (SaaS) solutions. The multinational technology company specializes in logistics software, supply chain management software and cloud-based services for logistics businesses.

AI and ML enhancements to Descartes’ routing, mobile and telematics suite are helping the company’s customers optimize fleet performance. “AI and ML are perfect extensions to our advanced route optimization and execution capabilities,” said Ken Wood, executive vice president at Descartes. “From dynamic delivery appointment scheduling through planning and real-time route execution, we’ve used AI and ML to improve our ability to deliver the next level of fleet performance for customers.”

Australian AI stocks

AI investment in Australia is expected to reach AU$5.7 billion in 2026, according to research firm IDC. The biggest spenders when it comes to AI in Australia are the banking industry, the federal government, professional services and retail.

Below are three of the top Australian AI stocks.

1. Xero (ASX:XRO)

Market cap: AU$18.451 billion; share price: AU$121.96

New Zealand-based technology company Xero provides cloud-based accounting software for small and medium-sized businesses. The company’s product portfolio also includes the Xero Accounting app, Xero HQ, Xero Ledger, Xero Workpapers and Xero tax tools.

Xero has made a number of AI enhancements to its platform in recent years, including bank reconciliation predictions that save time and reduce errors, and Analytics Plus, a suite of AI-powered planning and forecasting tools.

In March, the company launched its Gen AI assistant, named ‘Just Ask Xero’ or JAX. Its features include the automation or streamlining of repetitive and time-consuming tasks; the ability to anticipate tasks based on previous user actions; and the ability to make cashflow projections on request.

2. TechnologyOne (ASX:TNE)

Market cap: AU$5.213 billion; share price: AU$16.23

TechnologyOne is another large enterprise technology software firm in Australia. In fact, it is the country’s largest enterprise resource planning SaaS company. TechnologyOne has a client base of over 1,200, including customers in the government, education, health and financial services sectors across Australia, New Zealand and the UK. The company’s research and development center is targeting cloud-based technology, AI and ML.

TechnologyOne recently announced its 2023 financial results, highlighting that it saw record profits for the 14th year. The company’s SaaS annual recurring revenue was up 22 percent and its after-tax profit was up 16 percent. TechnologyOne attributes the strong results to robust demand for its global SaaS enterprise resource planning solution and to the large number of major deals it completed in the government sector over the period.

3. Brainchip Holdings (ASX:BRN)

Market cap: AU$647.603 million; share price: AU$0.345

Global technology company BrainChip Holdings has developed and commercialized a type of edge AI that simulates the functionality of the human neuron. The company's neuromorphic processor, Akida, enables the deployment of edge computing across several applications, including connected cars, consumer electronics and industrial IoT.

BrainChip partnered with AI-based video analytics solutions provider CVEDIA in May 2023 to further develop edge AI and neuromorphic computing. The CVEDIA-RT platform for video analytics will be integrated with BrainChip’s Akida neuromorphic IP. The technology has applications in security and surveillance, transportation, information technology services and retail.

The company has also partnered with MYWAI, a leading artificial intelligence-of-things (AIoT) solution provider. They will leverage BrainChip’s Akida™ with MYWAI’s AIoT Platform for equipment-as-a-service. “The partnership is expected to accelerate the adoption of Edge AI in the industrial and robotic sectors and generate significant value for both companies and their customers,” stated the press release.

FAQs for AI stocks

Which company is leading the AI race?

Google and Microsoft are battling it out for king of the AI hill. While a study from digital marketing firm Critical Mass shows that consumers believe Alphabet’s Google is leading the AI race, analysts are pointing to Microsoft as the clear frontrunner. Microsoft stands to benefit in a big way from its multibillion-dollar investment in OpenAI, the maker of ChatGPT, as advancements in generative AI have the potential to increase revenues for its Azure cloud computing business.

Which country is doing best in AI?

North America is the global hotspot for advancements in AI technology and is home to the majority of the world’s largest AI providers. Of the countries in this region, Canada’s AI industry is showing the fastest growth, according to a report by Markets and Markets. Swiss-based CRM firm InvestGlass positions the US as the primary hub for AI development, and many of the world’s leading tech giants are headquartered there. According to the firm, China comes in a close second.

What is Elon Musk's AI company?

In November 2023, Elon Musk’s AI company xAI, which is incorporated in Nevada, launched its Grok chatbot. While not much is known about the company yet, Musk has said he started it as a "third option" to ChatGPT and Google Gemini; he had earlier referred to the planned product as TruthGPT.

Does Tesla have its own AI?

Tesla (NASDAQ:TSLA) has developed proprietary AI chips and neural network architecture. The company’s autonomous vehicle AI system gathers visual data in real time from eight cameras to produce a 3D output that helps to identify the presence and motion of obstacles, lanes and traffic lights. The AI-driven models also help autonomous vehicles make quick decisions. In addition to developing autonomous vehicles, Tesla is working on bipedal robotics.

Securities Disclosure: I, Melissa Pistilli, hold no direct investment interest in any company mentioned in this article.

The views and opinions expressed herein are the views and opinions of the author and do not necessarily reflect those of Nasdaq, Inc.

Key facts as India surpasses China as the world’s most populous country

India is poised to become the world’s most populous country this year – surpassing China, which has held the distinction since at least 1950, when the United Nations population records begin. The UN expects that India will overtake China in April, though it may have already reached this milestone since the UN estimates are projections.

Here are key facts about India’s population and its projected changes in the coming decades, based on Pew Research Center analyses of data from the UN and other sources.

This Pew Research Center analysis is primarily based on the World Population Prospects 2022 report by the United Nations. The estimates produced by the UN are based on “all available sources of data on population size and levels of fertility, mortality and international migration.”

Population sizes over time come from India’s decennial census. The census has collected detailed information on India’s inhabitants, including on religion, since 1881. Data on fertility and how it is related to factors like education levels and place of residence is from India’s National Family Health Survey (NFHS). The NFHS is a large, nationally representative household survey with more extensive information about childbearing than the census. Data on migration is primarily from the United Nations Population Division.

Because future levels of fertility and mortality are inherently uncertain, the UN uses probabilistic methods to account for both the past experiences of a given country and the past experiences of other countries under similar conditions. The “medium scenario” projection is the median of many thousands of simulations. The “low” and “high” scenarios make different assumptions about fertility: In the high scenario, total fertility is 0.5 births above the total fertility in the medium scenario; in the low scenario, it is 0.5 births below the medium scenario.

Other sources of information for this analysis are available through the links included in the text.
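The scenario logic described above can be illustrated with a toy simulation. This is a minimal sketch, not the UN's model: the growth-rate distribution, its parameters, the starting population and all function names are hypothetical assumptions, chosen only to show how a "medium" figure arises as the median of many simulated outcomes and how the low and high scenarios offset fertility by 0.5 births.

```python
import random
import statistics


def fertility_variants(medium_tfr, offset=0.5):
    """Low/medium/high total fertility rates, offset by 0.5 births as in the text."""
    return {
        "low": medium_tfr - offset,
        "medium": medium_tfr,
        "high": medium_tfr + offset,
    }


def simulate_once(start_pop, years, mean_growth, sd_growth, rng):
    """One simulated trajectory with a random annual growth rate (toy assumption)."""
    pop = start_pop
    for _ in range(years):
        pop *= 1.0 + rng.gauss(mean_growth, sd_growth)
    return pop


def medium_projection(start_pop, years, mean_growth, sd_growth, n_sims=5000, seed=0):
    """Median end-of-period population across many simulated trajectories."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    outcomes = [
        simulate_once(start_pop, years, mean_growth, sd_growth, rng)
        for _ in range(n_sims)
    ]
    return statistics.median(outcomes)


# Hypothetical inputs, for illustration only.
variants = fertility_variants(2.0)
median_pop = medium_projection(1.4e9, years=10, mean_growth=0.005, sd_growth=0.002)
print(variants)
print(f"Median of simulated outcomes: {median_pop / 1e9:.2f} billion")
```

Taking the median rather than the mean keeps the "medium" figure from being pulled around by a handful of extreme simulated trajectories.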

A chart showing that India’s population has more than doubled since 1950

India’s population has grown by more than 1 billion people since 1950, the year the UN population data begins. The exact size of the country’s population is not easily known, given that India has not conducted a census since 2011, but it is estimated to have more than 1.4 billion people – greater than the entire population of Europe (744 million) or the Americas (1.04 billion). China, too, has more than 1.4 billion people, but while China’s population is declining, India’s continues to grow. Under the UN’s “medium variant” projection, a middle-of-the-road estimate, India’s population will surpass 1.5 billion people by the end of this decade and will continue to slowly increase until 2064, when it will peak at 1.7 billion people. In the UN’s “high variant” scenario – in which the total fertility rate in India is projected to be 0.5 births per woman above that of the medium variant scenario – the country’s population would surpass 2 billion people by 2068. The UN’s “low variant” scenario – in which the total fertility rate is projected to be 0.5 births below that of the medium variant scenario – forecasts that India’s population will decline beginning in 2047 and fall to 1 billion people by 2100.

People under the age of 25 account for more than 40% of India’s population. In fact, there are so many Indians in this age group that roughly one-in-five people globally who are under the age of 25 live in India. Looking at India’s age distribution another way, the country’s median age is 28. By comparison, the median age is 38 in the United States and 39 in China.

A chart showing that more than four-in-ten people in India are under 25 years old

The other two most populous countries in the world, China and the U.S., have rapidly aging populations – unlike India. Adults ages 65 and older comprise only 7% of India’s population as of this year, compared with 14% in China and 18% in the U.S., according to the UN. The share of Indians who are 65 and older is likely to remain under 20% until 2063 and will not approach 30% until 2100, under the UN’s medium variant projections.

A chart showing in India, people under 25 are projected to outnumber those ages 65 and older at least until 2078

The fertility rate in India is higher than in China and the U.S., but it has declined rapidly in recent decades. Today, the average Indian woman is expected to have 2.0 children in her lifetime, a fertility rate that is higher than China’s (1.2) or the United States’ (1.6), but much lower than India’s in 1992 (3.4) or 1950 (5.9). Every religious group in the country has seen its fertility rate fall, including the majority Hindu population and the Muslim, Christian, Sikh, Buddhist and Jain minority groups. Among Indian Muslims, for example, the total fertility rate has declined dramatically from 4.4 children per woman in 1992 to 2.4 children in 2019, the most recent year for which data is available from India’s National Family Health Survey (NFHS). Muslims still have the highest fertility rate among India’s major religious groups, but the gaps in childbearing among India’s religious groups are generally much smaller than they used to be.

A chart showing in India, fertility rates have fallen and religious gaps of fertility have shrunk

Fertility rates vary widely by community type and state in India. On average, women in rural areas have 2.1 children in their lifetimes, while women in urban areas have 1.6 children, according to the 2019-21 NFHS. Both numbers are lower than they were 20 years ago, when rural and urban women had an average of 3.7 and 2.7 children, respectively.

Total fertility rates also vary greatly by state in India, from as high as 2.98 in Bihar and 2.91 in Meghalaya to as low as 1.05 in Sikkim and 1.3 in Goa. Likewise, population growth varies across states. The populations of Meghalaya and Arunachal Pradesh both increased by 25% or more between 2001 and 2011, when the last Indian census was conducted. By comparison, the populations of Goa and Kerala increased by less than 10% during that span, while the population in Nagaland shrank by 0.6%. These differences may be linked to uneven economic opportunities and quality of life.

A map showing that populations grew unevenly across India between 2001 and 2011

On average, Indian women in urban areas have their first child 1.5 years later than women in rural areas. Among Indian women ages 25 to 49 who live in urban areas, the median age at first birth is 22.3. Among similarly aged women in rural areas, it is 20.8, according to the 2019-21 NFHS.

Women with more education and more wealth also generally have children at later ages. The median age at first birth is 24.9 among Indian women with 12 or more years of schooling, compared with 19.9 among women with no schooling. Similarly, the median age at first birth is 23.2 for Indian women in the highest wealth quintile, compared with 20.3 among women in the lowest quintile.

Among India’s major religious groups, the median age of first birth is highest among Jains at 24.9 and lowest among Muslims at 20.8.

A chart showing that India’s sex ratio at birth has been moving toward balance in recent years

India’s artificially wide ratio of baby boys to baby girls – which arose in the 1970s from the use of prenatal diagnostic technology to facilitate sex-selective abortions – is narrowing. From a large imbalance of about 111 boys per 100 girls in India’s 2011 census, the sex ratio at birth appears to have normalized slightly over the last decade. It narrowed to about 109 boys per 100 girls in the 2015-16 NFHS and to 108 boys per 100 girls in the 2019-21 NFHS.

To put this recent decline into perspective, the average annual number of baby girls “missing” in India fell from about 480,000 in 2010 to 410,000 in 2019, according to a Pew Research Center study published in 2022. (Read more about how this “missing” population share is defined and calculated in the “How did we count ‘missing’ girls?” box of the report.) And while India’s major religious groups once varied widely in their sex ratios at birth, today there are indications that these differences are shrinking.

Infant mortality in India has decreased 70% in the past three decades but remains high by regional and international standards. There were 89 deaths per 1,000 live births in 1990, a figure that fell to 27 deaths per 1,000 live births in 2020. Since 1960, when the UN Interagency Group for Child Mortality Estimation began compiling this data, the rate of infant deaths in India has dropped between 0.1% and 0.5% each year.

Still, India’s infant mortality rate is higher than those of neighboring Bangladesh (24 deaths per 1,000 live births), Nepal (24), Bhutan (23) and Sri Lanka (6) – and much higher than those of its closest peers in population size, China (6) and the U.S. (5).
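The percentage decline and the regional comparison above follow directly from the quoted rates. A short sketch of the arithmetic, using only figures from the text (the function and variable names are hypothetical helpers):

```python
def percent_decline(old, new):
    """Percentage drop from old to new, relative to old."""
    return (old - new) / old * 100.0


imr_1990 = 89  # infant deaths per 1,000 live births in 1990
imr_2020 = 27  # infant deaths per 1,000 live births in 2020

decline = percent_decline(imr_1990, imr_2020)
print(f"Decline in infant mortality, 1990 to 2020: {decline:.1f}%")

# Rates for the comparison countries listed in the text (deaths per 1,000 live births).
peers = {"Bangladesh": 24, "Nepal": 24, "Bhutan": 23, "Sri Lanka": 6, "China": 6, "U.S.": 5}

# How many times higher India's 2020 rate is than each comparison country's rate.
ratios = {country: imr_2020 / rate for country, rate in peers.items()}
for country, ratio in ratios.items():
    print(f"India's rate is {ratio:.1f}x that of {country}")
```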

A chart showing that out-migration typically exceeds in-migration in India

Typically, more people migrate out of India each year than into it, resulting in negative net migration. India lost about 300,000 people due to migration in 2021, according to the UN Population Division. The UN’s medium variant projections suggest India will continue to experience net negative migration through at least 2100.

But India’s net migration has not always been negative. As recently as 2016, India gained an estimated 68,000 people due to migration (likely to be a result of an increase in asylum-seeking Rohingya fleeing Myanmar). India also recorded increases in net migration on several occasions in the second half of the 20th century.


ABOUT PEW RESEARCH CENTER: Pew Research Center is a nonpartisan fact tank that informs the public about the issues, attitudes and trends shaping the world. It conducts public opinion polling, demographic research, media content analysis and other empirical social science research. Pew Research Center does not take policy positions. It is a subsidiary of The Pew Charitable Trusts.

Copyright 2024 Pew Research Center


  1. Top 10 Interesting Big Data Project Ideas [Innovative Research Topics]

  2. Master Thesis Big Data Projects (Research Guidance)

  3. 21 Best Big Data Research Topics

  4. Top 10+ Big Data Projects for Students

  5. Most Happening Big Data Projects (Advance Level)

  6. The Value Of Big Data In Research & Development


  1. Using Big Data to Revolutionize Sustainability

  2. Big Data Project Use Case

  3. The only end-to-end project solutions

  4. A Secure and Verifiable Access Control Scheme for Big Data Storage in Clouds

  5. Webinar on “Small Data, Big Opportunities: Making The Most Of AI”

  6. Data Analysis with Python


  1. 214 Big Data Research Topics: Interesting Ideas To Try

    These 15 topics will help you to dive into interesting research. You may even build on research done by other scholars. Evaluate the data mining process. The influence of the various dimension reduction methods and techniques. The best data classification methods. The simple linear regression modeling methods.

  2. Top 15 Big Data Projects (With Source Code)

    Recommendations can also be made based on tendencies in a certain area, as well as age groups, sex, and other shared interests. This is a data warehouse implementation for an e-commerce website "Infibeam" which sells digital and consumer electronics. Source Code - Data Warehouse Design. 5. Text Mining Project.

  3. Big Data Research

    The journal aims to promote and communicate advances in big data research by providing a fast and high quality forum for researchers, practitioners and policy makers from the very many different communities working on, and with, this topic. The journal will accept papers on foundational aspects in …

  4. Top 20 Latest Research Problems in Big Data and Data Science

    Even though big data is in the mainstream of operations as of 2020, there are still potential issues or challenges the researchers can address. Some of these issues overlap with the data science field. In this article, the top 20 interesting latest research problems in the combination of big data and data science are covered based on my personal experience (with due respect to the ...

  5. A new theoretical understanding of big data analytics capabilities in

    Big Data Analytics (BDA) usage in the industry has been increased markedly in recent years. As a data-driven tool to facilitate informed decision-making, the need for BDA capability in organizations is recognized, but few studies have communicated an understanding of BDA capabilities in a way that can enhance our theoretical knowledge of using BDA in the organizational domain.

  6. Big data quality framework: a holistic approach to continuous quality

    Big Data is an essential research area for governments, institutions, and private agencies to support their analytics decisions. Big Data refers to all about data, how it is collected, processed, and analyzed to generate value-added data-driven insights and decisions. Degradation in Data Quality may result in unpredictable consequences. In this case, confidence and worthiness in the data and ...

  7. 9 Big Data Projects To Grow Your Skills [Or Land a Job]

    What Is a Big Data Project? A big data project is a data analysis project that uses a very large data set as the basis for its analysis. Any data set larger than a terabyte would be considered big data. Big data projects combine traditional data analysis techniques with others that are tailored to handle large data volumes. Big data engineers often use deep learning, convolutional neural ...

  8. Big data in digital healthcare: lessons learnt and ...

    Big Data initiatives in the United Kingdom. The UK Biobank is a prospective cohort initiative that is composed of individuals between the ages of 40 and 69 before disease onset (Allen et al. 2012 ...

  9. Full article: Big data for scientific research and discovery

    'Big Data for Development: Challenges & Opportunities' by United Nations Global Pulse, an initiative of the Secretary-General on big data, suggesting that projects/programs of big data research promote a national strategy, and pointing out the essential role of big data for the development of society as a whole, including science and ...

  10. A review of big data and medical research

    In this descriptive review, we highlight the roles of big data, the changing research paradigm, and easy access to research participation via the Internet fueled by the need for quick answers. Universally, data volume has increased, with the collection rate doubling every 40 months, ever since the 1980s. 4 The big data age, starting in 2002 ...

  11. 25+ Solved End-to-End Big Data Projects with Source Code

    22. Apache Spark. Apache Spark is an open-source big data processing engine that provides high-speed data processing capabilities for large-scale data processing tasks. It offers a unified analytics platform for batch processing, real-time processing, machine learning, and graph processing.

  12. Title: Open Research Issues and Tools for Visualization and Big Data

    Open Research Issues and Tools for Visualization and Big Data Analytics, by Rania Mkhinini Gahar and 2 other authors. ... This paper examines the big data visualization project based on its characteristics, benefits, challenges and issues. The project also resulted in the provision of tools surging ...

  13. Measuring benefits from big data analytics projects: an action research

    Big data analytics (BDA) projects are expected to provide organizations with several benefits once the project closes. Nevertheless, many BDA projects are unsuccessful because the benefits did not materialize as expected. Organizations can manage the expected benefits by measuring them, yet very few organizations actually measure benefits after project development, and little has been written about ...

  14. Big Data Projects

    Big Data Projects studies the application of statistical modeling and AI technologies to healthcare. Mohsen Bayati studies probabilistic and statistical models for decision-making with large-scale and complex data and applies them to healthcare problems. Currently, an area of focus is AI's use in oncology, and multi-functional research ...

  15. Top 10 Essential Data Science Topics to Real-World Application From the

    1. Introduction. Statistics and data science are more popular than ever in this era of data explosion and technological advances. Decades ago, John Tukey (Brillinger, 2014) said, "The best thing about being a statistician is that you get to play in everyone's backyard." More recently, Xiao-Li Meng (2009) said, "We no longer simply enjoy the privilege of playing in or cleaning up everyone ...

  16. Current approaches for executing big data science projects—a systematic

    This was also consistent with the view that most big data science research has focused on the technical capabilities required for data science and has overlooked the topic of managing data science projects (Saltz & Shamshurin, 2016). However, much has happened during the past 6 years, with respect to research on data science process frameworks.

  17. Best Big Data Science Research Topics for Masters and PhD

    Data science thesis topics. We have compiled a list of data science research topics for students studying data science that can be utilized in data science projects in 2022. Our team of professional data experts has brought together master's or MBA thesis topics in data science that cater to core areas driving the field of data science and big data that will relieve all your research anxieties ...

  18. The use of Big Data Analytics in healthcare

    Future research on the use of Big Data in medical facilities will concern the definition of strategies adopted by medical facilities to promote and implement such solutions, as well as the benefits they gain from the use of Big Data analysis and how the perspectives in this area are seen. ... Croft R. Project management maturity in the age of ...

  19. Collaborative Historical Research in the Age of Big Data

    The Turing Data Safe Haven was conceived in 2019 when a team of researchers at the institute (including three members of our project team) published a paper entitled 'Design choices for productive, secure, data-intensive research at scale in the cloud', which presented a policy and process framework for secure environments deployed in the ...

  20. Big Data REU Projects

    Project Description: Big data analytics tasks in many applications (e.g., recognition, prediction, and control for smart cities) are fulfilled in large-scale distributed systems (e.g., Hadoop, Spark, Storm, TensorFlow, and Caffe). The performance of these big data systems depends on the configuration optimization for different applications ...

  21. 13 Ultimate Big Data Project Ideas & Topics for Beginners [2024]

    Fun Big Data Project Ideas. Social Media Trend Analysis: Gather data from various platforms and analyze trends, topics, and sentiment. Music Recommender System: Build a personalized music recommendation engine based on user preferences. Video Game Analytics: Analyze gaming data to identify patterns and player behavior.

  22. LibGuides: Data Science: Guide for Independent Projects

    It contains raw web page data, extracted metadata and text extractions collected since 2008. The Common Crawl site includes tutorials and example projects using this data. This is a good dataset to use for a project if you want experience working with truly big data, navigating the Amazon web ecosystem, and using data mining techniques at scale.

  23. 10 Mind-Blowing Big Data Projects Revolutionizing Industries

    5. IBM Watson. IBM Watson is an AI-powered platform that uses big data, analytics, natural language processing, and machine learning to understand and process unstructured data. It has been applied in numerous domains, including healthcare, finance, and customer service.

  24. Big Data: COVID-19

    Big Data: Covid-19 introduces students to the art of data science through contextual learning. They start by gaining context into SARS-CoV-2, then move on to big data. Lessons involve working with real virology and SARS-CoV-2 data, providing insight into how epidemiologists model pandemics. Once they've got the basics down, budding data ...

  25. Can Big Data Have a Role in Treating Dementia? That's What This

    This isn't Wong's first award. He was also a 2023 recipient of the Barry Goldwater Scholarship, which recognizes students pursuing research in math, natural science and engineering. "If I can do research that gives people one or two extra years to be a father, a mother or a grandparent, I think that's super worth fighting for," he says.

  27. JMIR Medical Informatics

    Background: The SARS-CoV-2 pandemic has demonstrated once again that rapid collaborative research is essential for the future of biomedicine. Large research networks are needed to collect, share, and reuse data and biosamples to generate collaborative evidence. However, setting up such networks is often complex and time-consuming, as common tools and policies are needed to ensure ...

  28. AI Stocks: 9 Biggest Companies in 2024

    It can leverage all types of data, including structured or unstructured data, big data and the internet of things (IoT) to quickly create interactive visuals. In January, OpenText launched its ...

  29. Key facts about India's growing population as ...

    India's population has grown by more than 1 billion people since 1950, the year the UN population data begins. The exact size of the country's population is not easily known, given that India has not conducted a census since 2011, but it is estimated to have more than 1.4 billion people - greater than the entire population of Europe (744 million) or the Americas (1.04 billion).