A Reader on Data Visualization

Chapter 3 Case Studies

This chapter explores some interesting case studies of data visualizations. Critiquing these case studies is a valuable exercise that helps both expand our knowledge of possible visual representations of data and develop the kind of critical thinking that improves our own visualizations. Furthermore, examining and evaluating case studies helps show that new designs can be just as usable as existing techniques, demonstrating that the field is open to further development.

3.1 Introduction

Visualization is like art; it speaks where words fail. The usefulness of data visualization is not limited to business and analytics; visualizations can explain almost anything in the world. Wars, rescue operations, social issues, and more can be visualized to synthesize the important details relevant to the issues. In particular, phenomena like the Syrian war, the number of flights during Thanksgiving in the USA, and the '#OscarsSoWhite' controversy present such complexity that we could write endless paragraphs and still fail to convince readers. Below are visualizations of some of these important and complex topics - visualizations that are much more persuasive than an essay, and with a tiny fraction of the text.

Many of the case studies mentioned below come from the following articles:

(Nathan Yau 2015a) This source picks the top 10 best data visualizations of 2015. For each pick, the author displays the project plot and also describes his reasoning for choosing that chart as an exemplary visualization. This article is useful for getting a basic understanding of what characteristics a good visualization should include.
(Kayla Darling 2017) The author has chosen fifteen of the best infographics and data visualizations from 2016 and explained the reasoning behind these choices.
(Crooks 2017) This author has chosen 16 examples of data visualization that demonstrate how to represent data in a way that is both compelling and easy to digest.
(Stadd 2015) These 15 data visualizations show the vast range of topics that data analysis can be applied to, from pop culture to public good. Take a look at them for inspiration and understanding for your own work.
(Chibana 2016) This source includes 15 data visualizations that cover current events, including politics, Oscar nominations, and immigration.
(Andy 2009) VizWiz is a blog about Tableau-based data visualization. It has case studies about how to improve visualizations, written by Andy Kriebel, a well-known Tableau Zen Master. This blog is recommended because it is not only practical but also full of insights. One of its best features is "Makeover Monday," which develops a new visualization based on an original one. The blog also includes excellent tips for and examples of Tableau.
Tableau has a gallery that displays great examples of data visualizations created with Tableau. It is useful to see how people are using all kinds of data to create informative yet fun data visuals. The underlying data is attached to each example, so we can try to mimic what others have done.

3.2 Geographic Visualizations

Geovisualization or geovisualisation (short for geographic visualization) refers to a set of tools and techniques supporting the analysis of geospatial data through the use of interactive visualization. Like the related fields of scientific visualization and information visualization, geovisualization emphasizes knowledge construction over knowledge storage or information transmission. To do this, geovisualization communicates geospatial information in ways that, when combined with human understanding, allow for data exploration and decision-making processes. Source: (contributors 2019a). More specifically, geovisualization is a process that alters geographic information so that we can consume it with our eyes. Its purpose is to capitalize on our affinity for visual things and convert the seemingly random collection of information available to us into a form that can be quickly understood. Many tools can be used for geographic visualization, such as Mapbox, Carto, ArcGIS Online, and HERE Data Lens. Source: (Gloag, n.d.: Tools & Techniques)

Often, people use maps to visualize data that should not be mapped. Here are some examples of when a map visualization is a good choice.

3.2.1 Spies in the Skies

The map below is from a Buzzfeed article (Aldhous and Seife 2016) that shows how common it is for the government to observe people from the air. It is filled with red and blue lines (representing FBI and DHS aircraft, respectively) that illustrate the planes' flight paths. When planes circle an area more than once, the circles become darker. The circles change by day and time, and individual cities can be typed into a search bar to see the flight patterns over them. Rather creatively, the visualization looks almost like a hand-drawn map. While it presents an ordinarily uncomfortable topic, it allows individuals to check things for themselves, hopefully providing some peace of mind.

Source: (Kayla Darling 2017 )

New York Flight Patterns

3.2.2 Two Centuries of U.S. Immigration

This interactive map from (Galka 2016 ) shows the rate of immigration into the U.S. from other countries over the last 200 years in 10-year segments. Each colored dot represents 10,000 people coming from the specified country. Countries then light up when they have one of the highest rates of migration. A tracker on the left indicates what countries sent the most people to the U.S. at what times.

This is a good visualization because it is engaging and easy to read and interpret. The movement of the dots draws the reader’s attention while the brightly lit countries make it easy to pick out the highest total migrations. The bright colors and dark background help the information stand out. This map is a bit simple, but effective.

Source: (Kayla Darling 2017).

US Immigration

3.2.3 Uber: Crafting Data-Driven Maps

Map visualization is essential for companies like Uber that need to track metrics using geo-space points. In this article, the designer from Uber talks about the challenges of designing such visualizations and the possible solutions (Klimczak 2016 ) .

To tackle these problems, Uber started by defining base map themes, optimizing detail, color, and typography. On top of that base, data layers are added using scatter plots and hex bins, with careful color selection to help their team make decisions. Uber then took a further step by adding trip lines (see images below), which became a signature Uber visualization. Choropleths are also used to visualize how metrics and values differ across geographic areas; Uber uses US postal codes as geographic boundaries and overlays various datasets to create the color variation.
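To make the layering idea concrete, here is a minimal matplotlib sketch (not Uber's actual tooling) that places a faint scatter layer of pickup points over a dark base color and aggregates the same points into hex bins; the coordinates are randomly generated for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical pickup coordinates (random points around a city center);
# in practice these would come from a geospatial dataset.
rng = np.random.default_rng(0)
lon = -122.42 + 0.05 * rng.standard_normal(5000)
lat = 37.77 + 0.05 * rng.standard_normal(5000)

fig, ax = plt.subplots(figsize=(6, 6))
ax.set_facecolor("#1a1a2e")  # dark base map theme, as the article describes

# Scatter layer: raw points, kept faint so density still reads through
ax.scatter(lon, lat, s=2, color="#00d4ff", alpha=0.15)

# Hex-bin layer: aggregates the same points into hexagonal density cells
hb = ax.hexbin(lon, lat, gridsize=40, cmap="magma", mincnt=1, alpha=0.8)
fig.colorbar(hb, ax=ax, label="pickups per cell")

ax.set_xlabel("longitude")
ax.set_ylabel("latitude")
ax.set_title("Scatter + hex-bin layers over a dark base (illustrative)")
plt.show()
```

Hex bins trade individual-point detail for a readable density surface, which is why they work well as an intermediate layer between raw scatter plots and choropleths.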

The visualization in this article is a classic problem of visualizing geographic data. The detailed explanation of the problems and how they are solved can be beneficial for people or startups trying to conceptualize and make appropriate visualizations that support the decision-making process.

Uber Route Maps

Source: (Klimczak 2016 )

3.3 Demographic Comparisons

One common use of visualization is to compare different groups against each other, such as political parties or generations.

3.3.1 Young Voters, Class and Turnout: How Britain Voted in 2017

This article's goal is to convey how party votes in the 2017 UK general election changed compared to 2015 (Holder, Barr, and Kommenda 2017). The change in party votes is shown with regard to three demographic factors: age, class, and ethnicity. For each factor, there are four graphs (one per political party), each illustrated in the party's standard color. The change in the percentage of votes is shown as an arrow whose shaft length equals the difference between 2015 and 2017, while the x-axis shows the demographic factor split into bins.

This is a good visualization because it is straightforward to read and interpret. The color-coding of the arrows and party names makes it easy to pick out the different parties. The labels are smartly spread across the visualization to reduce cross-referencing, and the colors in the graph are the parties' actual campaign colors. The arrow lengths highlight just how significant a change occurred. For example, in the Age section, it is easy to see the pattern of the Labour party gaining many voters aged 18 to 44 and the Conservative party gaining voters aged 45 and up.
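As a rough sketch of the arrow encoding described above, the following matplotlib snippet draws one arrow per age bin whose length is the change in vote share; the numbers are made up for illustration and are not the Guardian's figures.

```python
import matplotlib.pyplot as plt

# Illustrative (made-up) changes in vote share by age bin -- not the
# Guardian's actual figures -- just to demonstrate the arrow encoding.
age_bins = ["18-24", "25-34", "35-44", "45-54", "55-64", "65+"]
change = [15, 10, 6, -1, -4, -6]   # percentage-point change, 2015 to 2017

fig, ax = plt.subplots(figsize=(7, 4))
for i, delta in enumerate(change):
    # The arrow shaft runs from zero to the size of the change, so its
    # length encodes the magnitude and its direction encodes the sign.
    ax.annotate("", xy=(i, delta), xytext=(i, 0),
                arrowprops=dict(arrowstyle="-|>", color="#d50000", lw=2))
ax.axhline(0, color="grey", lw=0.8)
ax.set_xlim(-0.5, len(age_bins) - 0.5)
ax.set_ylim(-10, 18)               # annotations do not autoscale the axes
ax.set_xticks(range(len(age_bins)))
ax.set_xticklabels(age_bins)
ax.set_ylabel("change in vote share (percentage points)")
ax.set_title("Arrow encoding of vote-share change by age (illustrative)")
plt.show()
```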

UK Party Votes by Age

Source: (Holder, Barr, and Kommenda 2017 )

3.3.2 U.S. Migration Patterns

The New York Times data team mapped out Americans' moving patterns from 1900 to the present, and the results are fascinating to interact with (Aisch, Gebeloff, and Quealy 2014). We can see where people living in each state were born, and where people are moving to and from. The groupings of the destinations vary based on each state's trends, preventing unnecessary clutter while still showing detail when it is vital, as can be seen in the difference between the charts for California and Pennsylvania. When generating interactive charts, one must always assume that the audience will not interact with them; the message of a chart has to be clear enough that anyone viewing only the default view can understand it.

Overall, this type of chart can work well to visualize movement in data over time, such as migration. However, it must be done carefully to maintain clarity. Too many categories with colors and crossing lines can make it difficult for a reader to keep track of what the data is saying, and the chart can quickly go from a striking visualization to a chaotic mess of lines. The designer does a good job with these visualizations by limiting the number of categories, grouping states by region (West, South, Midwest, etc.). But when many more flows are introduced, as in Migration from Pennsylvania, the chart can quickly turn convoluted and hard to read, which costs the audience effort. Finally, it is not completely clear why so many crossing lines are necessary for the Pennsylvania chart; the crossing lines, along with the use of the same color for different lines within the same regional categories, introduce unnecessary complexity.

Migration from California

Migration from Pennsylvania

Source: (Aisch, Gebeloff, and Quealy 2014 )

3.3.3 The American Workday

NPR tapped into American Time Use Survey data to ascertain the share of workers in a wide range of industries who are at work at any given time (Quoctrung Bui 2014). The graph answers the original question of when Americans work, rather than how many hours they work. The chart overlays the traditional 9 AM-5 PM workday as a reference point, helping the audience draw interesting conclusions. Below is a screenshot of the data product; the original graph is interactive and allows the audience to explore when people in different occupations are working.
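A simple way to reproduce the reference-band idea is to shade the 9-to-5 window behind the occupancy curve; the sketch below uses a made-up curve rather than the ATUS data NPR analyzed.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up share-at-work curve over the day -- not the ATUS data NPR used.
hours = np.arange(0, 24, 0.5)
share = np.clip(np.exp(-((hours - 12.5) ** 2) / 18) - 0.05, 0, None)

fig, ax = plt.subplots(figsize=(7, 3))
# Shade the traditional 9-to-5 window as a reference band.
ax.axvspan(9, 17, color="lightgrey", alpha=0.5, label="9 AM - 5 PM")
ax.plot(hours, share, color="tab:blue", label="share at work")
ax.set_xlim(0, 24)
ax.set_xticks(range(0, 25, 3))
ax.set_xlabel("hour of day")
ax.set_ylabel("share of workers at work")
ax.legend(loc="upper left")
plt.show()
```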


Some interesting findings include:

  • Construction workers both start and finish their workday earlier, and the massive drop at noon suggests most do not work through lunch.


  • Servers' and cooks' schedules are the opposite of most other occupations, peaking from lunch through the evening.


This data product is an excellent example because the analytic design contrasts specific occupations with the traditional 9-5 working hours. This is easy to understand and makes particular occupations stand out. The use of color to highlight the selected occupation in the graph also helps distinguish the different occupations.

3.3.4 How People Like You Spend Their Time

This visualization from (Yau 2016) lists several categories such as "personal care" and "work" along one side of a graph, with a line illustrating the amount of time the average person in a particular demographic spends on each category. Entering different parameters at the top, such as changing gender or age, causes the lines to shift to feature that demographic. The simplicity of this visualization helps the information get across without bogging the reader down in statistics. Sometimes, less is more.


3.3.5 Britain’s Diet In Data

This is an excellent example of how to present a significant amount of comprehensive data - distributed across different categories and measured in different metrics - in a simple yet effective manner, while still maintaining interest and aesthetics. The data product attempts to show how the average Briton's diet has changed for the better over the last four decades (Institute 2016). It does this by displaying simple trend lines showing that harmful, fatty foods are being consumed less while healthier, leaner foods are being consumed more. It further breaks down every major food category into tens of its constituent products, and in both the overview and deep-dive versions, provides further levers to draw more meaning out of the data. It also shows how the contribution of different foods to the typical diet has changed over the years. Here, we can toggle the year to see exactly how much of each food was consumed, again with another deep-dive into the constituents of every primary food group.


Such a visualization is ideal for a layman who wants to walk away with an immediate and accurate understanding of the overall dietary changes. It also provides plenty of detail on demand for the more discerning viewer who has more time and inclination to dissect and parse the graphs. It is difficult for the same data product to cater to both types of viewers so well, which is what makes this particular data product so impressive and useful. It satisfies the principle of graphical excellence as stated by Edward Tufte: "Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space."

Source: (Tufte 1986 )

3.3.6 Selfie City

Selfie City, a detailed multi-component visual exploration of 3,200 selfies from five major cities around the world, offers a close look at the demographics and trends of selfies (Manovich et al. 2014 ) . This project is based on a unique dataset compiled by analyzing tens of thousands of images from each city, both through automatic image analysis and human judgment. The team behind the project collected and filtered the data using Instagram and Mechanical Turk. Rich media visualizations (imageplots) assemble thousands of photos to reveal interesting patterns. It provides a demographic and regional comparison of selfies.

Estimated Age and Gender Distribution

Source: (Manovich et al. 2014 )

3.3.7 Evolving Demographics

Another frequent use of visualization is to look at how something changes over time. Time-series data can be shown in many ways; here are some examples.

3.3.7.1 Millennial Generation Diversity

CNNMoney created an interactive chart using U.S. Census Data to show the size and diversity of the millennial generation compared to baby boomers (Kurtz and Yellin 2018 ) . While the article’s main point is that the millennial generation is bigger and more diverse than the baby boomer generation, it also contains information about all of the other living generations. It turns hard numbers into an intriguing story, illustrating the racial makeup of different age groups from 1913 to present.

The author also summarizes three key findings from the graph:

  • The most common age in the US is 22 years old.
  • The median age in the US is 37.6 years old.
  • Among the youngest generation, only 50% of the population is white, so whites may lose their place as the largest racial group in the US.

Racial Diversity of US Generations

Source: (Kurtz and Yellin 2018 )

This is an effective graph because while it contains many data points, it makes the overall trends very clear without sacrificing much detail. You can see the declining share of white people and the growth of the other racial categories.

3.3.7.2 How the Recession Reshaped the Economy, in 255 Charts

The first large graph contains 255 lines to show how the number of jobs has changed for every industry in America, using color to highlight lines and let viewers see the specifics for each industry (Ashkenas and Parlapiano 2014). By hovering over a line, viewers can get detailed information about that industry's job trend. Keeping this extra data hidden until needed makes it easier for readers to absorb the bigger picture from this vast data visualization. The following charts are subsets categorized by job sector and sub-industry. Readers can choose the industry or sector they are interested in and, as with the first graph, view more detailed information by hovering over a line.


Source: (Ashkenas and Parlapiano 2014 )

3.3.7.3 An Aging Population: Projected Number of Children and Older Adults

An aging population is always a hot topic in socioeconomics and politics (United States Census Bureau 2018). Here we explore a collection of data visualizations showing the aging population in the U.S. and the world.


Source: (United States Census Bureau 2018 )

This example includes a bar chart and a line graph to demonstrate the aging population compared with the population of children. This visualization allows easy comparison, employs color to differentiate the categories, and highlights the intersection point.

3.3.7.4 From Pyramid to Pillar: A Century of Change, Population of the U.S.


This is a population pyramid. "A population pyramid is a pair of back-to-back histograms for each sex that displays the distribution of a population in all age groups and in gender" (Bureau 2018b). It is a good way to visualize changes in population distribution by sex and age over time. The shape of the pyramid can also represent other characteristics of a population; for example, a pyramid with a very wide base and a narrow top suggests a population with both high fertility and high death rates. It is a useful tool for making sense of census data. ("An Aging Population," n.d.) offers an animated pyramid.

Comparison of aging population in US and Japan

Source: (“An Aging Population,” n.d. )

This is an animated, multiple-population pyramid used to compare different patterns across countries. An additional benefit of the interactive population pyramid is that it shows how the shape changes by year, which is useful for time-series comparison. A similar project with R code is here.
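For readers who want to build a static version themselves, here is a minimal Python sketch of a back-to-back population pyramid using placeholder percentages rather than actual census data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical age distribution (percent of population per age group) --
# placeholder numbers, not actual census data.
age_groups = ["0-9", "10-19", "20-29", "30-39", "40-49",
              "50-59", "60-69", "70-79", "80+"]
male = np.array([6.5, 6.6, 7.0, 6.8, 6.2, 6.4, 5.6, 3.4, 1.8])
female = np.array([6.2, 6.3, 6.8, 6.9, 6.3, 6.7, 6.1, 4.2, 3.0])

fig, ax = plt.subplots(figsize=(6, 5))
y = np.arange(len(age_groups))
# Back-to-back horizontal bars: males plotted as negative values to the left.
ax.barh(y, -male, color="#4a90d9", label="Male")
ax.barh(y, female, color="#d94a7a", label="Female")

ax.set_yticks(y)
ax.set_yticklabels(age_groups)
# Show absolute percentages on both sides of the axis.
ticks = np.arange(-8, 9, 2)
ax.set_xlim(-8, 8)
ax.set_xticks(ticks)
ax.set_xticklabels([f"{abs(t)}%" for t in ticks])
ax.set_xlabel("share of population")
ax.legend()
ax.set_title("Population pyramid (illustrative data)")
plt.show()
```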

3.3.7.5 Music Timeline

Google's Music Timeline illustrates a variety of music genres waxing and waning in popularity from 2010 to the present day, based on how many Google Play Music users have an artist or album in their library, along with other data such as album release dates (Google 2014). One useful feature of this graph is the reader's ability to explore one specific genre and its subgenres at a more detailed level, as well as view the general timeline of all music. The drill-down interaction allows for more detail without cluttering the overview of the visualization. Labeling the areas directly (e.g., Rock/Pop) makes similarly colored regions easy to distinguish.


Source: (Google 2014 )

3.4 Visualizing Urban Data for Social Change

(Neira 2016 )

One field in which visualization can have a meaningful social impact is promoting understanding of, and generating discussion around, cities. As cities develop, demographic change and economic, environmental, and social problems become important issues. Visualization plays an important role in promoting understanding of how cities and the societies within them work, debating the problems that cities face, and engaging citizens to work toward their dream cities.

Recently, as part of a Habitat III side event, LlactaLAB (Sustainable Cities Research Group) presented a project called Live Infographics. It is an interactive methodology that puts citizens' and experts' opinions about the New Urban Agenda on one platform to help generate 'horizontal governance.' The different opinions were materialized on a dynamic map to visualize the generated data. The primary objective of the project is to generate citizen-led data collection, to enable governments to build a better understanding of public sentiment, and then to engage people in the process.


A great urban data visualization ought to have the capacity to spark "sociological imagination." It should provoke individuals to consider how their individual choices, issues, struggles, and daily lives in general are an extension of society, and how their choices collectively influence public opinion. Another key aspect of these kinds of data visualizations is their ability to make the audience understand how their activities impact the cities they live in and help them work toward the betterment of those cities.

The following is an example of a visualization that is trying to effect social change. It shows, state by state, how we have polluted our way to wealth at the expense of the environment, together with estimates of the percentage of adults who support the cause. Source: ("We Have Poluted Our Way to the Wealth in the Expense of the Environment," n.d.)


Urbanization and the spread of information technologies are transforming cities into huge data pools, and that data will play a major role in understanding how urban areas have changed and are likely to change in the future. Urban data visualization gives us a quick view of the architectural contrast of urban change in cities. (MORPHOCODE 2019)

This urban data visualization is based on a NYC Department of City Planning dataset. The result is a snapshot of Brooklyn's evolution, revealing how development has rippled across certain neighborhoods while leaving some pockets unchanged for decades, even centuries. The visualization is interactive: the reader can check every block's name and year built. (MORPHOCODE 2019)


As urban areas continue to develop, diverse and complex issues evolve along with them. Disparity, isolation, loss of biodiversity, and declining environmental quality are all important but thorny issues, and finding successful solutions will require uniting policymakers, academics, designers, and citizens. Visualization, if done right, can help jumpstart important discussions between these diverse groups of people and help solve the issues that emerge as the world becomes more urbanized.

3.5 Animated Data Visualization

Like the evolving demographics above, these visualizations show data that change over time. Here, however, the charts are self-animated rather than interactive.

3.5.1 A Day in the Life of Americans

This animated data visualization shows the time people spend on daily activities throughout the day (Nathan Yau 2015 b ) . The plot is simple and easy to interpret, but it also includes a good number of variables including time, activity type, number of people doing each activity, and the order in which activities are done.

One of the plot’s biggest strengths is that by using one dot to represent each person in the study and using animation, we can drill down to the level of an individual and follow him or her throughout the day. The accumulation of dots for each particular activity also gives us an aggregate-level view of the same data, so that we get both individual and aggregate insights.

A drawback of the plot is that it is hard for our eyes to keep track of 1000 simultaneously moving dots. The author of the post addresses this by creating subsequent plots with stationary lines at crucial times of the day. This represents people’s movements from one activity to another without overwhelming the reader.

Overall, this is an engaging, informative, relevant, and fun animated plot that tells a story.


Source: (Nathan Yau 2015 b )

3.5.2 Hans Rosling’s 200 Countries, 200 Years, 4 Minutes

Global health data expert Hans Rosling’s famous statistical documentary “The Joy of Stats” aired on BBC in 2010, but it is still turning heads. In the remarkable segment “200 Countries, 200 Years, 4 Minutes”, Rosling uses augmented reality to explore public health data in 200 countries over 200 years using 120,000 numbers, in just four minutes (Rosling, Hans 2010 ) .

Screenshot from "200 Countries, 200 Years, 4 Minutes"

Source: (Rosling, Hans 2010 )

What makes this visualization so well-known is its use of animation and narration to highlight different stories within the overall data. While the visualization could have been made as an interactive chart where the audience can select the year, instead it is a video. Rosling’s narration of how various regions have fluctuated over the last two hundred years is necessary for his argument since there is no other description or explanation.

3.6 Dust in the Wind: Visualization and Environmental Problems

Environmental issues can quickly become extremely complex. When dealing with site assessments, environmental remediation design, monitoring, and environmental litigation, the quantity of data involved can quickly become overwhelming. Merely maintaining and organizing that data is insufficient. Visualization provides an invaluable means of condensing vast quantities of complex data and communicating them in a form that is intelligible to all parties. There are many case studies on the visualization of environment-related issues; some of them are discussed below.

3.6.1 Global Carbon Emissions

This data visualization, based on data from the World Resource Institute’s Climate Analysis Indicators Tool and the Intergovernmental Panel on Climate Change, shows how national CO₂ emissions have transformed over the last 150 years and what the future might hold. It also allows the audience to explore emissions by country for a range of different scenarios (World Resources Institute 2014 ) .


Source: (World Resources Institute 2014 )

3.6.2 What’s really warming the world?

This case study begins by clearly explaining necessary background information and the analytic questions it seeks to answer. Next, it analyzes each factor separately using both verbal explanations and dynamic graphics to compare the observed temperature movements, and then categorizes related factors into “natural factors” or “human factors.” After that, it combines all the dynamic graphics into one, which makes the results more accessible and more straightforward to compare. Lastly, the authors provide further detailed explanations of dataset sources to support their results. Overall, this case study is straightforward, easy to understand and informative (Roston and Migliozzi 2015 ) (Crooks 2017 ) .


Source: (Roston and Migliozzi 2015 )

3.6.3 Understanding Plastic pollution using visualization

Plastic pollution is the accumulation of plastic products in the environment that adversely affects wildlife, wildlife habitat, or humans. Human usage of plastic has increased manyfold in the last few decades. Because plastic is inexpensive and durable, it has a wide variety of uses in everyday life. Since the 1950s, an estimated 6.3 billion tons of plastic waste has been generated, of which only about 9% has been recycled (contributors 2019b).


Plastic has become part of our daily life, and human dependence on plastic has increased over time. The visualization below shows some common plastic products that undermine environmental health. (Grün 2016)


With a share of 26 percent, China may be the largest plastic producer in the world, yet the largest plastic consumer is neighboring Japan: people living in the island nation consume more plastic than Africa and the rest of Asia combined.

A donut chart is a modern version of a pie chart that looks cleaner, and the embedded visual imagery makes the distribution easy to understand. (Grün 2016)
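As a small illustration of the donut format, the following matplotlib sketch turns a pie chart into a ring by shrinking the wedge width; apart from China's 26 percent share quoted above, the values are placeholders rather than the DW article's figures.

```python
import matplotlib.pyplot as plt

# Shares of global plastic production by region -- China's 26% comes from
# the text above; the remaining values are placeholders for illustration.
regions = ["China", "Rest of Asia", "Europe", "North America", "Other"]
shares = [26, 24, 20, 18, 12]

fig, ax = plt.subplots(figsize=(5, 5))
wedges, texts, autotexts = ax.pie(
    shares, labels=regions, autopct="%1.0f%%", startangle=90,
    wedgeprops=dict(width=0.4)   # a wedge width < 1 turns the pie into a donut
)
ax.set_title("Donut chart: plastic production by region (illustrative)")
plt.show()
```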

Plastic Use: Industrial nations top the charts (Grün 2016 )


This visualization uses a simple line chart to show an increasing trend. A positive aspect of this chart is the removal of the vertical gridlines, which would add noise when the objective is to show the trend rather than the exact numbers.


"Plastic where it shouldn't be" combines four large-scale plastic marine pollution datasets, each published in a different scientific journal over the last five years and totaling 9,490 surface net tows. It is a symbol map showing how plastic waste is distributed across the oceans. Please note: just because no plastic is displayed in a certain region does not mean that it isn't there; the open ocean is vast, and pollution research is both time- and cost-intensive. (Moret 2014)


How long does plastic remain in the ocean? (Grün 2016 )

Overall, this visualization is useful in the following ways:

  • It provides context: the plots serve one of the primary purposes of data visualization - storytelling - and naturally lead the audience to understand the effects of plastic pollution.
  • Effective use of charts: the correct use of different types of plots makes the visualization both effective and engaging.
  • Efficient use of color: this visualization is a good example of color playing an essential role by guiding the reader to grasp the relationships in the data. No color is redundant, and no essential color is missing.

3.7 Language

3.7.1 Green Honey

Language shapes the way we view the world. Different languages may have vastly different ways of describing things—including color.


Source: (Lee 2016 )

3.7.2 Linguistic Concepts

This case study is about a tool for teaching linguistic concepts; it discusses how the data is used and how visual graphics deliver the central insights. It presents an educational tool that integrates computational linguistics resources for use in non-technical undergraduate language science courses. Used in conjunction with case studies, the tool gives students opportunities to understand linguistic concepts and analysis through the lens of practical problems (Alm, Meyers, and Prud'hommeaux 2017).

HistoBankVis is a novel visualization system designed for the interactive analysis of complex, multidimensional data to facilitate historical linguistic work (Michael Hund 2015 ) . In this paper, the visualization’s efficacy and power are illustrated utilizing a concrete case study investigating the diachronic interaction of word order and subject case in Icelandic.

Much of what computational linguists (CL) fall back upon to improve natural language processing and model language "understanding" is structure that has, at best, only an indirect attestation in observable data. The sheer complexity of these structures and the patterns on which they are based, however, usually limits their accessibility, often even to the researchers creating or studying them. Traditional statistical graphs and custom-designed data illustrations fill the pages of CL papers, providing insight into linguistic and algorithmic structures, but visual 'externalizations' such as these are almost exclusively used in CL for presentation and explanation. There are statistical methods, falling under the rubric of "exploratory data analysis," and visualization techniques designed precisely for this purpose, but they are not widely used. These novel data visualization techniques offer the potential for creating new methods that reveal structure and detail in data. Visualization can provide new ways of interacting with large corpora and complex linguistic structures, and can lead to a better understanding of the states of stochastic processes.

3.7.3 State of the Union 2014 Minute by Minute on Twitter

Twitter's data team assembled an impressive interactive data hub that depicts how Twitter users across the globe reacted to each paragraph of President Obama's 2014 State of the Union address (Belmonte 2014). You can slice and dice the data by topic hashtag (for example, #budget, #defense, or #education) and by state, resulting in a powerful and detailed, if somewhat cluttered, visualization. Since the visualization shows topic density within a specific time frame, a similar format may be useful when we need to display event counts over time, such as data following a Poisson distribution.


Source: (Belmonte 2014 )

3.8 Political Relationships

3.8.1 Connecting the Dots Behind the Election

This article in the New York Times lists several different candidates and creates compelling visuals that link their campaigns to previous ones (Aisch and Yourish 2015) (Kayla Darling 2017). Each visual contains several differently sized dots that represent a specific campaign, administration, or other governmental organization related to the candidate's current campaign, connected by arrows. Hovering over a specific dot highlights the connections between the groups. This visual is a great way to summarize what would otherwise require a long slog through years of information into an easily accessible and viewable format, so that voters can figure out where the candidates' experience lies.

Clinton 2016 Campaign Staff

3.8.2 A Guide to Who is Fighting Whom in Syria

One of the charts featured in (Crooks 2017), the visualization 'A Guide to Who Is Fighting Whom in Syria' is an interesting graphic to study. The visualization and its accompanying report can be seen at (Keating and Kirk 2015).

Who is Fighting Whom in Syria

Source: (Keating and Kirk 2015 )

This visualization helps elucidate an extremely complicated topic like the Syrian war. It consists of three different emoji in three different colors, with each color and facial expression combination showing the ties and conflicts between the various groups involved in the war. When you click on each emoji, a small dialogue box pops up that explains the relationship between the two countries or rebel groups involved. This is not only easy to understand but also pleasing to the eye.

On the other hand, the inherent complexity of the relationships between different groups makes it difficult to understand the complete picture. If the list of involved parties could be sorted into simplified "sides" (such as the Syrian government on one end and the Syrian rebels on the other) or ranked by how friendly they are, it might be easier for a trend to emerge at first glance. Also, the table format means the data is duplicated across the diagonal, making it appear even more complicated; one side of the diagonal divide could be greyed out to simplify the audience's experience with this visualization.

Green emoji shows 'Friendly' relationship

Red emoji shows the 'Enemies' relationship

Yellow emoji shows 'Complicated' relationship

3.9 Uncategorized

3.9.1 Simpson's Paradox

The Visualizing Urban Data Idealab (VUDlab) out of the University of California, Berkeley put together this visual representation of data that disproves the claim of a 1973 suit charging the school with sex discrimination. Though the graduate schools had accepted 44% of male applicants but only 35% of female applicants, researchers later uncovered that when the data were properly pooled, there was a small but statistically significant bias in favor of women. This is known as Simpson's paradox.

By "properly pooled," the investigators meant broken down by department. For instance, men were more inclined toward the sciences and women toward the humanities; the science departments required more specialized skills, while the humanities would accept applicants with a more standard undergraduate curriculum. Because women applied in greater numbers to the departments with the lowest acceptance rates, the pooled figures made it look as though women were admitted less often, which is what creates the Simpson's paradox.
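The reversal is easy to reproduce with a few lines of pandas and some illustrative numbers (not the actual Berkeley admissions data): within each department women are admitted at a slightly higher rate, yet the pooled totals favor men.

```python
import pandas as pd

# Illustrative admissions counts (not the actual Berkeley figures) showing
# how aggregation can reverse a trend: within each department women are
# admitted at an equal-or-higher rate, yet the pooled rate favors men.
data = pd.DataFrame({
    "dept":     ["A", "A", "B", "B"],
    "gender":   ["male", "female", "male", "female"],
    "applied":  [800, 100, 200, 900],
    "admitted": [500, 65, 30, 150],
})

# Per-department admission rates
by_dept = data.assign(rate=data.admitted / data.applied)
print(by_dept[["dept", "gender", "rate"]])

# Pooled (aggregated) admission rates
pooled = data.groupby("gender")[["applied", "admitted"]].sum()
pooled["rate"] = pooled.admitted / pooled.applied
print(pooled)
```

Running the snippet prints per-department rates of roughly 62% vs 65% and 15% vs 17% (favoring women), but pooled rates of 53% for men and 21.5% for women - the same kind of reversal the VUDlab visualization explains.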

Simpson's Paradox originally from vudlab.com

Source: (Lewis Lehe 2013 )

3.9.2 Every Satellite Orbiting Earth

This interactive graph, built using a database from the Union of Concerned Scientists, displays the trajectories of the 1,300 active satellites currently orbiting the Earth. Each satellite is represented by a circular icon, color-coded by country and sized according to launch mass (Yanofsky and Fernholz 2015 ) .

Low Earth Orbit Satellites

Source: (Yanofsky and Fernholz 2015 )

Interactive graphs have their own specific advantages; they help bridge the gap between programmers and non-programmers. This plot is a good example of why an interactive graph is a good idea:

  • It provides an intuitive way for anyone to understand the data, regardless of their technical knowledge.
  • It helps identify causes and trends more quickly.
  • It tells a consistent story through the data.
  • It represents the data more efficiently.

3.9.3 Malaria

The author of VizWiz redesigned "The Seasonality of Confirmed Malaria Cases in Zambia Southern Province," pointing out what works well, what could be improved, and why the new visualization is better (Andy 2009).

Original visualization of malaria cases

The chart below shows the number of malaria cases reported by health facilities and by community health workers, and compares the two over the years. From this chart we can clearly see that as summer approaches, malaria cases increase, indicating seasonality. The colors are also distinct from each other.

The original visualization effectively shows the seasonality of malaria cases, but it is unclear whether the two reporting categories are stacked or plotted one behind the other, and the color scheme is rather garish. The creator of the redesign made the seasonality more obvious by combining the reporting categories and explaining the spikes better.

Furthermore, adding the yearly data split by district points toward possible actionable steps in the study of malaria cases in Zambia, which is an important objective of visualization. The author combined the health facility and community health worker figures to see what the data looks like as a whole, and the new color scheme is much more effective than the original, making the seasonality more evident.

Redesigned visualization of malaria cases

3.9.4 Is it Better to Rent or Buy?

There are many factors involved in deciding whether to rent or buy a house, which has led to many calculators that are supposed to simplify this decision. This calculator includes several sloping charts, each covering a factor that affects how much you will have to pay, such as the cost of your home and your mortgage rate (Bostock, Carter, and Tse 2014). A movable scale along the bottom of each chart allows you to enter different data, which updates the equivalent "cost of rent per month" shown on the side. This can be useful for price comparison: if you can find a similar house to rent for that much per month or less, it is more cost-effective just to rent. This visualization is incredibly thorough and a useful tool for prospective homeowners of any age and status.


Source: (Bostock, Carter, and Tse 2014 )

3.9.5 An Interactive Visualization of NYC Street Trees

Using data from NYC Open Data, this interactive visualization shows the variety and quantity of street trees planted across the five New York City boroughs (Zapata 2014 ) . As the reader hovers over a tree or bar segment, the connected sections light up, making it easier for the reader to look at what otherwise could have been a very dense chart.

We can see which trees are common and which are uncommon across the five boroughs of New York City. The visualization allows one to see the distribution quickly and to make inferences from it, such as that trees in the Bronx and Manhattan seem to be distributed more uniformly than in the other three boroughs. It gives a direct comparison between the five boroughs, which the audience could use to make informed decisions.

NYC Street Trees

Source: (Zapata 2014 )

Interactivity is an advantage here: it enables the display and intuitive understanding of multidimensional data, supports a variety of chart types, and lets the audience accomplish traditional data exploration tasks directly in the chart. Moreover, this visualization provides a good example of that: it enables the audience to explore on their own and find interesting facts about NYC street trees.

3.9.6 Adding Up the White Oscar Winners

A visualization of all previous Best Actor and Best Actress Oscar winners can be seen in an article by Bloomberg ("Adding up the White Oscar Winners" 2016). From the attributes of past Oscar winners, the authors have developed a set of attributes that they believe will continue to be prevalent in future winners. It is fascinating to see how the article shows the features of the Best Actress, Best Actor, winning movies, etc. in a simple and captivating visual.

The visualization is interactive, and we can click on each attribute, like 'Hair Color' or 'Eye Color,' to see the features of the actors and actresses who are likely to win the Oscars. Based on the attributes selected, the visualization changes to give you the data specific to those attributes, along with a fact about each one. For instance, when you select race, it states that "In the entire history of the Oscars all but 8 of the Best Actors and Best Actresses have been white." Similarly, the visualization gives information about the aspects of movies that are more likely to win, like 'Length,' 'Month,' and 'Budget,' and also predicts which future nominees are likely to win an Oscar.

Best Actor and Best Actress

Best Picture

Source: (“How to Build an Oscar Winner” 2015 )

3.9.7 Kissmetrics blog: visualization of metrics

The Kissmetrics blog is a place where people talk about analytics, marketing, and testing through narratives and visualizations of metrics. Metrics are essential in the real world, especially when developing and promoting products. Visualizing metrics is also essential so that stakeholders can monitor performance, identify problems, and dive deep into potential issues.

This example from the Kissmetrics blog is about Facebook's organic reach (Patel 2018). One crucial question discussed in the post is whether Facebook's organic reach is decreasing drastically.

The general trend shows a considerable decline in the organic reach of Facebook pages.


The following graphs, however, show that engagement is increasing; that is, while the quantity (reach) is decreasing, the quality (engagement) is increasing.


Source: (Patel 2018 )

This resonates with what we have learned in class about how different perspectives on the same data can lead to different conclusions.

3.9.8 Describe Artists with Emoji

Using data from Spotify, the author listed the ten most distinctive emoji used in playlists related to favorite artists (Insights 2017). The table used in this visual makes it straightforward to link each artist to their emoji and easy to compare across artists. When you hover over an emoji, further information is presented.


Source: (Insights 2017 )

3.9.9 Goldilocks Exoplanets

Using data from the Planetary Habitability Laboratory at the University of Puerto Rico, the interactive graph on Astrobiology plots planetary mass, atmospheric pressure, and temperature to determine what exoplanets might be home, or have been home at one point, to living beings (Tomanio and Gonzalez Veira 2014 ) .

One highlight of the graph is how color has been used. The red dots represent planets that are too hot, the blue dots planets that are too cold, and the green ones planets with just the right temperature. This is very intuitive to understand without needing to read through the notes. The dots are semi-transparent, so the overlapping of planets does not detract from the audience's ability to read the graph. (VERGANO 2014)

Additionally, the size of each dot represents the radius of each planet. At first glance, one might assume that most planets are much larger than Earth, but the visualization includes a note explaining that larger planets are easier to find. This is a good example of how much explanation to include in a visualization: not so much that the audience is distracted from the graph, but enough that they have the information needed to interpret it.
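The same multi-channel encoding (color for temperature class, marker size for radius, transparency for overlap) is easy to sketch in matplotlib; the data below are synthetic, not the Planetary Habitability Laboratory catalog used by the graphic.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic planets -- random values for illustration only.
rng = np.random.default_rng(1)
n = 120
temperature = rng.uniform(-150, 400, n)   # equilibrium temperature, deg C
radius = rng.uniform(0.5, 12, n)          # planet radius, Earth radii
flux = rng.uniform(0, 3, n)               # stellar flux received

# Map temperature class to the intuitive colors the article describes.
colors = np.where(temperature < -20, "tab:blue",        # too cold
          np.where(temperature > 60, "tab:red",         # too hot
                   "tab:green"))                        # just right

fig, ax = plt.subplots(figsize=(7, 5))
ax.scatter(flux, temperature,
           s=20 * radius,   # marker area encodes planet radius
           c=colors,
           alpha=0.5,       # semi-transparency keeps overlaps readable
           edgecolors="none")
ax.set_xlabel("stellar flux (relative to Earth)")
ax.set_ylabel("temperature (deg C)")
ax.set_title("Color, size and transparency encodings (synthetic data)")
plt.show()
```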


Source: (Tomanio and Gonzalez Veira 2014)

3.9.10 Washington Wizards’ Shooting Stars

This detailed data visualization demonstrates D.C.’s basketball team’s shooting success during the 2013 season (Lindeman and Gamio 2014 ) . Using statistics released by the NBA, the visualization allows viewers to examine data for each of 15 players. For example, viewers can see how successful each player was at a variety of types of shots from a range of spots on the court, compared to others in the league.


Source: (Lindeman and Gamio 2014 )

Generally, this is a good data visualization for the following reasons: it presents complex information in a simple, topic-related format; it highlights key numbers to convey important information; and its use of color is restrained but efficient. However, it is not clear who the target audience is, and reducing the number of lines could further lower the cognitive load.

3.9.11 Visualization of big data security: a case study on the KDD99 cup data set

This paper utilized a visualization algorithm together with big data analysis to gain better insights into the KDD99 dataset:

Abstract Cybersecurity has been thrust into the limelight in the modern technological era because of an array of attacks often bypassing new intrusion detection systems (IDSs). Therefore, deciphering better methods for identifying attack types to train IDSs more effectively has become a field of great interest. Critical cyber-attack insights exist in big data; however, an efficient approach is required to determine strong attack types to train IDSs to become more active in critical areas. Despite the rising growth in IDS research, there is a lack of studies involving big data visualization, which is crucial. The KDD99 dataset has served as a reliable benchmark since 1999; therefore, this dataset was utilized in the experiment. This study utilized a hash algorithm, a weight table, and sampling method to deal with the inherent problems caused by analyzing big data: volume, variety, and velocity. By utilizing a visualization algorithm, the researchers were able to gain insights into the KDD99 dataset with precise identification of “normal” clusters and described distinct clusters of possible attacks.

To read the full paper, please follow the reference link:

(Ruan et al. 2017 )

3.9.12 The Atlas of Sustainable Development Goals 2018 - Data Visualization of World Development

(TEAM 2018 )

This is an exciting source and an excellent visual guide to data and development. It discusses trends, comparisons, and measurement issues using accessible and shareable data visualizations that are informative and clean.


The Atlas features maps and data visualizations primarily drawn from the World Development Indicators (WDI), the World Bank's compilation of internationally comparable statistics about global development and the quality of people's lives. For each of the SDGs, relevant indicators have been chosen to illustrate important ideas.

The content has been selected to emphasize essential issues by experts in the World Bank's Global Practices. The Atlas aims to reflect the breadth of the Goals themselves and presents national and regional trends and snapshots of progress towards the UN's seventeen Sustainable Development Goals, which relate to: poverty, hunger, health, education, gender, water, energy, jobs, infrastructure, inequalities, cities, consumption, climate, oceans, the environment, peace, institutions, and partnerships.

Contents of this publication: (Group 2018 a ) . The data is available at (Group 2018 b ) . The code used to generate the majority of figures is available at (Whitby 2018 ) .

3.9.13 Is Beauty Important?

This case study is about this article: https://www.infoworld.com/article/3048315/the-inevitability-of-data-visualization-criticism.html

Andy Cotgreave is a Senior Technical Evangelist at Tableau. In the article above, he defends the use of elaborate visualizations and argues that beauty is a quality worth pursuing when making data visualizations. One visualization he focuses on is a heat map made by the Wall Street Journal that shows the effect of introducing vaccines on the number of polio cases in the US. This particular visualization received a great deal of attention and was shared around the internet to demonstrate the positive effects of vaccination. After it had circulated for some time, another author, Randy Olson, responded with his own article in which he remade the heat map as a simple line graph. Both versions are shown below.


In his article, Cotgreave argues that the heat map was visually striking and that its novelty made him more likely to interact with it. As someone involved in visualization, he has seen hundreds, if not thousands, of line graphs and would likely have skipped over the line graph version. Cotgreave doubts that the line version would have won awards or been shared virally the way the heat map was. While he acknowledges the readability of the line graph, he ultimately feels that there is a place for visualizations to be beautiful.

The takeaway, then, is that the visualization you choose to present should be tailored to your situation; in other words, think of your audience. If you are presenting your visualization to the internet at large, then being beautiful and novel is important: if your visualization goes viral, it will carry your message to many more people. On the other hand, if you have a more limited audience, like a team of managers who want visualizations that can be read quickly, then the line chart will be more suitable.

“Adding up the White Oscar Winners.” 2016. https://www.bloomberg.com/graphics/2016-oscar-winners/ .

Aisch, Gregor, Robert Gebeloff, and Kevin Quealy. 2014. “Where We Came From and Where We Went, State by State.” https://www.nytimes.com/interactive/2014/08/13/upshot/where-people-in-each-state-were-born.html?abt=0002{\&}abg=0 .

Aisch, Gregor, and Karen Yourish. 2015. “Connecting the Dots Behind the 2016 Presidential Candidates.” https://www.nytimes.com/interactive/2015/05/17/us/elections/2016-presidential-campaigns-staff-connections-clinton-bush-cruz-paul-rubio-walker.html?{\_}r=1 .

Aldhous, Peter, and Charles Seife. 2016. “Spies in the Skies.” https://www.buzzfeed.com/peteraldhous/spies-in-the-skies?utm{\_}term=.so1GQ6ZGDo{\#}.ec8kL3WkZe .

Alm, Cecilia Ovesdotter, Benjamin S. Meyers, and Emily Prud’hommeaux. 2017. An Analysis and Visualization Tool for Case Study Learning of Linguistic Concepts . Copenhagen, Denmark. http://www.aclweb.org/anthology/D17-2003 .

“An Aging Population.” n.d. https://fathom.info/aging/ .

Andy, Kriebel. 2009. “VizWiz.” http://www.vizwiz.com/ .

Ashkenas, Jeremy, and Alicia Parlapiano. 2014. “How the Recession Reshaped the Economy, in 255 Charts.” https://www.nytimes.com/interactive/2014/06/05/upshot/how-the-recession-reshaped-the-economy-in-255-charts.html .

Belmonte, Nicolas. 2014. “SOTU2014: See the State of The Union address minute by minute on Twitter.” http://twitter.github.io/interactive/sotu2014/ .

Bostock, Mike, Shan Carter, and Archie Tse. 2014. “Is It Better to Rent or Buy?” https://www.nytimes.com/interactive/2014/upshot/buy-rent-calculator.html?{\_}r=0 .

Bureau, United States Census. 2018b. “From Pyramid to Pillar: A Century of Change, Population of the U.S.” https://www.census.gov/library/visualizations/2018/comm/century-of-change.html .

Chibana, Nayomi. 2016. “15 Data Visualizations That Explain Trump, the White Oscars and Other Crazy Current Events.” http://blog.visme.co/data-visualizations-current-events/ .

contributors, Wikipedia. 2019a. “Geovisualization.” https://en.wikipedia.org/wiki/Geovisualization .

contributors, Wikipedia. 2019b. “Plastic Pollution.” https://en.wikipedia.org/wiki/Plastic_pollution .

Crooks, Ross. 2017. “16 Captivating Data Visualization Examples.” https://blog.hubspot.com/marketing/great-data-visualization-examples .

Galka, Max. 2016. “Here’s Everyone Who’s Immigrated to the U.S. Since 1820.” http://metrocosm.com/animated-immigration-map/ .

Gloag, David. n.d. “Geovisualization: Tools & Techniques.” https://study.com/academy/lesson/geovisualization-tools-techniques.html .

Google. 2014. “Music Timeline.” https://research.google.com/bigpicture/music/ .

Group, The World Bank. 2018a. “Atlas of Sustainable Development Goals 2018 : From World Development Indicators.” https://openknowledge.worldbank.org/handle/10986/29788 .

Group, The World Bank. 2018b. “Atlas of the Sustainable Development Goals 2018: From the World Development Indicators.” https://datacatalog.worldbank.org/dataset/atlas-sustainable-development-goals-2018-world-development-indicators .

Grün, Gianna-Carina. 2016. “Six Data Visualizations That Explain the Plastic Problem.” http://www.dw.com/en/six-data-visualizations-that-explain-the-plastic-problem/a-36861883 .

Holder, Josh, Caelainn Barr, and Niko Kommenda. 2017. “Young voters, class and turnout: how Britain voted in 2017.” https://www.theguardian.com/politics/datablog/ng-interactive/2017/jun/20/young-voters-class-and-turnout-how-britain-voted-in-2017 .

“How to Build an Oscar Winner.” 2015. http://archive-e.blogspot.com/2015/02/how-to-build-oscar-winner-if-hollywood.html .

Insights, Spotify. 2017. “What Emoji Say About Music.” https://public.tableau.com/en-us/s/gallery/what-emoji-say-about-music?gallery=featured .

Institute, Open Data. 2016. “Britain’s Diet in Data.” http://britains-diet.labs.theodi.org/ .

Kayla Darling. 2017. “15 Cool Information Graphics and Data Viz from 2016.” http://blog.visme.co/best-information-graphics-2016/ .

Keating, Joshua, and Chris Kirk. 2015. “A Guide to Who Is Fighting Whom in Syria.” http://www.slate.com/blogs/the_slatest/2015/10/06/syrian_conflict_relationships_explained.html .

Klimczak, Erik. 2016. “Crafting Data-Driven Maps.” https://medium.com/uber-design/crafting-data-driven-maps-b0835b620554 .

Kroulek, Alison. n.d. "Colors in Translation." https://www.k-international.com/blog/colors-in-translation/ .

Kurtz, Annalyn, and Tal Yellin. 2018. “Millennial generation is bigger, more diverse than boomers.” http://money.cnn.com/interactive/economy/diversity-millennials-boomers/ .

Lee, Muyueh. 2016. “Green Honey.” http://muyueh.com/greenhoney/?es{\_}p=1228877 .

Lewis Lehe, Victor Powell. 2013. “A Visual Explanation of Simpson’s Paradox.” https://flowingdata.com/2013/09/19/a-visual-explanation-of-simpsons-paradox/ .

Lindeman, Todd, and Lazaro Gamio. 2014. “The Wizards’ Shooting Stars.” http://www.washingtonpost.com/wp-srv/special/sports/wizards-shooting-stars/ .

Manovich, Lev, Moritz Stefaner, Mehrdad Yazdani, Dominikus Baur, Daniel Goddemeyer, and Alise Tifentale. 2014. “SelfieCity.” http://selfiecity.net/ .

Michael Hund, Frederik L. Dennig. 2015. “HistoBankVis: Detecting Language Change via Data Visualization.” http://aclweb.org/anthology/W17-0507 .

Moret, Skye. 2014. “Visualization of Ocean Plastic Collection.” https://www.northeastern.edu/visualization/allprojects/visualization-of-ocean-plastic-collection/ .

MORPHOCODE. 2019. “Data and the City: Urban Visualizations.” https://morphocode.com/data-city-urban-visualizations/ .

Nathan Yau. 2015a. “10 Best Data Visualization Projects of 2015.” http://flowingdata.com/2015/12/22/10-best-data-visualization-projects-of-2015/ .

Nathan Yau. 2015b. “A Day in the Life of Americans.” http://flowingdata.com/2015/12/15/a-day-in-the-life-of-americans/ .

Neira, Mateo. 2016. “Data Visualization: A Tool for Social Change.” https://medium.com/@mateoneira/data-visualization-a-tool-for-social-change-cefb02b7ce4a .

Patel, Neil. 2018. “Is Facebook Organic Reach Really Dead?” https://blog.kissmetrics.com/is-facebook-organic-reach-dead/ .

Qualman, Darrin. 2017. “Global Plastic Production, 1917 to 2017.” https://www.darrinqualman.com/global-plastics-production/ .

Quoctrung Bui. 2014. “Who’s In The Office? The American Workday In One Graph.” https://www.npr.org/sections/money/2014/08/27/343415569/whos-in-the-office-the-american-workday-in-one-graph .

Rosling, Hans. 2010. “Hans Rosling’s 200 Countries, 200 Years, 4 Minutes - The Joy of States- BBC Four.” BBC. https://www.youtube.com/watch?v=jbkSRLYSojo .

Roston, Eric, and Blacki Migliozzi. 2015. “What’s Really Warming the World?” https://www.bloomberg.com/graphics/2015-whats-warming-the-world/ .

Routley, Nick. 2018. “The Plastic Problem, Visualized.” http://www.visualcapitalist.com/ocean-plastic-problem/ .

Ruan, Zichan, Yuantian Miao, Lei Pan, Nicholas Patterson, and Jun Zhang. 2017. “Visualization of big data security: a case study on the KDD99 cup data set.” Digital Communications and Networks 3 (4): 250–59. https://doi.org/10.1016/j.dcan.2017.07.004 .

Stadd, Allison. 2015. “15 Data Visualizations That Will Blow Your Mind.” https://blog.udacity.com/2015/01/15-data-visualizations-will-blow-mind.html .

World Bank Data Team. 2018. “The 2018 Atlas of Sustainable Development Goals: An All-New Visual Guide to Data and Development.” http://blogs.worldbank.org/opendata/2018-atlas-sustainable-development-goals-all-new-visual-guide-data-and-development .

Tomanio, John, and Xaquin Gonzalez Veira. 2014. “Goldilocks Worlds: Just Right for Life?” https://www.nationalgeographic.com/astrobiology/goldilocks-worlds/ .

Tufte, Edward R. 1986. The Visual Display of Quantitative Information . Cheshire, CT, USA: Graphics Press.

United States Census Bureau. 2018. “An Aging Nation: Projected Number of Children and Older Adults.” https://www.census.gov/library/visualizations/2018/comm/historic-first.html .

Vergano, Dan. 2014. “Kepler Telescope Discovers Most Earth-Like Planet Yet.” https://news.nationalgeographic.com/news/2014/04/140417-earth-planet-kepler-habitable-science-nasa/?_ga=2.208654481.2018531223.1556082373-1845695105.1556082373 .

“We Have Polluted Our Way to the Wealth in the Expense of the Environment.” n.d. https://public.tableau.com/shared/6F6TG3KJD?:display_count=yes&:showVizHome=no .

Whitby, Andrew. 2018. “Replication Code for the World Bank Atlas of Sustainable Development Goals 2018.” https://github.com/worldbank/sdgatlas2018 .

World Resources Institute. 2014. “Carbon Emissions: past, present and future - interactive.” https://www.theguardian.com/environment/ng-interactive/2014/dec/01/carbon-emissions-past-present-and-future-interactive .

Yanofsky, David, and Tim Fernholz. 2015. “This is every active satellite orbiting earth.” https://qz.com/296941/interactive-graphic-every-active-satellite-orbiting-earth .

Yau, Nathan. 2016. “How People Like You Spend Their Time.” http://flowingdata.com/2016/12/06/how-people-like-you-spend-their-time .

Zapata, Cristian. 2014. “An Interactive Visualization of NYC Street Trees.” https://www.cloudred.com/labprojects/nyctrees/ .

The Top 5 Python Libraries for Data Visualization


Which Python library should you pick for data visualization? Read on to see our assessment!

Data visualization is an increasingly valuable skill, one that’s sought after in many organizations. It helps you to find insight into data and to communicate your findings to less technical audiences. You can benefit from it in your career and use it to pivot toward a data-focused role.

There are many paths to learning and practicing data visualization. Many require setting up, maintaining, and using elaborate BI tools with limited capabilities.

Python, the number one language in data science, offers a better way. It is versatile, needs little maintenance, and can access almost any available data source.

If you want to use Python to get insights from data, check out our Introduction to Python for Data Science course. It offers over 140 exercises to improve your Python skills and practice loading, transforming, and visualizing data.

You might wonder which Python data visualization library you should learn or use for a given project. Python has a vast ecosystem of visualization tools; it can be hard to pick the right one.

Python’s visualization landscape in 2018 (source).

This article helps you with that. It lays out why data visualization is important and why Python is one of the best visualization tools. It goes on to showcase the top five Python data visualization libraries, their main features, and when it is a good idea to use them.

Why Data Visualization Is Important

Data visualization is a powerful way to gain and communicate insights from data, especially in enterprise automation, where streamlining processes and optimizing workflows rely on data-driven decision-making.

The main challenge of data analysis is understanding relationships within a dataset and their relevance to the use case. A good visualization often reveals insights faster than hours of data munging (aka data wrangling) and is more intuitive for non-technical audiences. For these reasons, data visualization is a central activity in any organization that wants to make complex data-based decisions.

An example of a sales data visualization (source).

There are many situations where you can benefit from visualizing data – like doing a sales presentation, conducting market research, or setting up a KPI dashboard. You can also use different tools for that. However, some of the tools require too much overhead to set up or are limited in their capabilities.

What if there were a tool versatile enough to handle a wide range of problems, data sources, and use cases, with few infrastructure requirements?

Fortunately, there is such a tool: Python!

Why Python is a Great Language for Data Visualization

Python is currently one of the most popular programming languages and the primary one when it comes to data science, making it a safe learning choice.

Python is an excellent language to learn for your career and a great one to introduce to your organization. It is easy to learn, helps with automation, and provides access to data and analytics. Many big companies use Python to run critical operations within their business.

Python has a thriving data science ecosystem, including data visualization libraries that surpass Excel’s capabilities. This makes Python especially useful in domains where you need to complement your work with analytics, like marketing or sales.

However, Python’s popularity and rich ecosystem might be intimidating for newcomers, as it is hard to understand which visualization library to use for which use case. To help you with that, the rest of our article will give you an overview of the top five Python visualization libraries.

Data Visualization Libraries

We can characterize data visualization libraries using the following factors:

  • Interactivity: Whether the library offers interactive elements.
  • Syntax: What level of control the library offers, and whether it follows a specific paradigm.
  • Main Strength and Use Case: In what situation is the library the best choice?

The following table summarizes the top Python visualization libraries according to these factors:

Library    | Interactive Features     | Syntax                                                  | Main Strength and Use Case
---------- | ------------------------ | ------------------------------------------------------- | -------------------------------------------
Matplotlib | Limited                  | Low-level                                               | Highly customized plots
seaborn    | Limited (via Matplotlib) | High-level                                              | Fast, presentable reports
Bokeh      | Yes                      | High- and low-level, influenced by grammar of graphics  | Interactive visualization of big data sets
Altair     | Yes                      | High-level, declarative, follows grammar of graphics    | Data exploration and interactive reports
Plotly     | Yes                      | High- and low-level                                     | Commercial applications and dashboards

Let’s discuss each library individually.

Matplotlib is the most widely used visualization library. It was born in 2003 as an open-source replacement for MATLAB, a scientific computing and graphing package.

Because of its early start and popularity, there is a huge community around Matplotlib. You can easily find tutorials and forums discussing it, and many toolkits extend its use (e.g. into geographic data or 3D use cases). Also, many Python libraries (e.g. pandas) rely on it in their visualization features.

Matplotlib provides granular control of plots, making it a versatile package with a wide range of graph types and configuration options. However, its many configuration possibilities complicate its use and can lead to boilerplate code. The default Matplotlib theme does not follow visualization best practices, and you need to rely on other packages for some fundamental features (e.g. handling time data).
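For a flavor of what that looks like in practice, here is a minimal sketch using Matplotlib's object-oriented interface; the data is invented for the example:

```python
import matplotlib.pyplot as plt

# Invented example data.
years = [2019, 2020, 2021, 2022, 2023]
revenue = [1.2, 1.8, 2.4, 3.1, 4.0]

# The object-oriented interface: create a Figure and an Axes, then configure both.
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(years, revenue, marker="o")
ax.set_xlabel("Year")
ax.set_ylabel("Revenue (millions)")
ax.set_title("Revenue over time")
plt.show()
```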

Matplotlib is a good choice in the following cases:

  • You need detailed control over your plots (e.g. in a research setting with unique visualization problems).
  • You want something reliable, with a huge community.
  • You don’t mind the learning curve.

Matplotlib plots (source).

seaborn is a visualization library that makes Matplotlib plots practical. It abstracts away Matplotlib’s complexity and offers an intuitive syntax and presentable results right out of the box.

The seaborn library supports the creation of statistical graphs. It interfaces well with pandas dataframes, provides data mapping onto visualizations, and can transform the data as part of plot creation.

It also has a meaningful default theme, and it offers different color palettes defined around best practices.

Because seaborn is a wrapper around Matplotlib, you can configure your plots by accessing the underlying Matplotlib objects.
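As a rough sketch of that workflow, using the built-in "tips" example dataset that ships with seaborn:

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme()  # apply seaborn's sensible default theme

# "tips" is one of seaborn's bundled example datasets.
tips = sns.load_dataset("tips")

# Map DataFrame columns straight onto plot aesthetics.
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="time")
plt.show()
```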

seaborn is a good choice if:

  • You value speed.
  • You do not need interactivity.
  • You don’t need low-level configuration.

Heatmap created with seaborn (source).

Bokeh is a visualization library, influenced by the grammar of graphics paradigm, that was developed for web-based visualization of big datasets.

It provides a structured way to create plots and supports server-side rendering of interactive visualizations in web applications. It has both a high-level and a low-level interface that you can use depending on your actual need, time, and skill.
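A minimal sketch of the high-level bokeh.plotting interface; the data is invented, and show() normally writes an HTML file and opens it in a browser:

```python
from bokeh.plotting import figure, show

# Invented example data.
x = [1, 2, 3, 4, 5]
y = [6, 7, 2, 4, 5]

# Pan, zoom, and other interactive tools come for free in the rendered output.
p = figure(title="Simple interactive plot", x_axis_label="x", y_axis_label="y")
p.line(x, y, line_width=2)
p.scatter(x, y, size=8)
show(p)
```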

Use Bokeh when:

  • You need to create interactive visualizations in a web application (e.g. a dashboard).
  • You like the grammar of graphics approach but do not find Altair intuitive.
  • You have high-level and low-level use cases (e.g. data science and production).

An interactive Bokeh plot (source).

Altair is a visualization library that provides a unique declarative syntax for interactive plot creation. It relies on the Vega-Lite grammar specification, allowing you to compose charts from graphical units and combine them in a modular way.

Altair’s declarative approach lets you focus on the intended visualization outcome and leave the data transformations to the library. This feature is especially useful for data exploration, when you try to combine different ways to examine and visualize a problem.
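A small sketch of the declarative style; the DataFrame is invented for the example:

```python
import altair as alt
import pandas as pd

# Invented example data.
df = pd.DataFrame(
    {
        "x": [1, 2, 3, 4, 5],
        "y": [5, 3, 6, 7, 2],
        "category": ["a", "b", "a", "b", "a"],
    }
)

# Declare the chart: a mark type, then encodings from columns to visual channels.
chart = (
    alt.Chart(df)
    .mark_point()
    .encode(x="x", y="y", color="category")
    .interactive()  # adds pan and zoom
)
chart.save("chart.html")  # or display it directly in a notebook
```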

Altair is especially useful in the following cases:

  • You are doing lots of data exploration and experimentation and want to share the results in an interactive format.
  • You don’t need low-level customization.
  • You like the grammar of graphics approach and prefer Altair’s syntax.

An Altair visualization with interactive linked brush filtering (source).

Plotly is an open-source data visualization library and part of the ecosystem developed by Plotly, Inc. The company also develops Dash, a Python dashboarding library, and offers data visualization application services for enterprise clients. For this reason, Plotly is a great tool for building business-focused interactive visualizations and dashboards.

Plotly offers a high-level interface for fast development and a low-level one for more control. It also renders plots from simple dictionaries and has a wide range of predefined graph types.
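A short sketch using Plotly Express, the high-level interface, with the Gapminder sample dataset that is bundled with Plotly:

```python
import plotly.express as px

# Gapminder sample data bundled with Plotly Express.
df = px.data.gapminder().query("year == 2007")

fig = px.scatter(
    df,
    x="gdpPercap",
    y="lifeExp",
    size="pop",
    color="continent",
    hover_name="country",
    log_x=True,
    title="GDP per capita vs. life expectancy (2007)",
)
fig.show()  # renders an interactive figure in the browser or notebook
```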

Plotly is beneficial when:

  • You are building commercial products and dashboards with complex relationships and data pipelines.
  • You need a wide range of interactive graphs used in business and research.

An interactive Plotly report (source).

Learn to Visualize Your Data with Python!

This article showed you the usefulness of Python in data visualization. It gave you an overview of the top five Python data visualization libraries. We hope this has helped you pick the right library for your project.

Regardless of your choice, you need to know your way around Python to be able to use these libraries. One of the best ways to do this is to learn and practice Python in a course that’s built around practical projects. We created our Python learning tracks, Python Basics and Python for Data Science, especially with these aspects in mind. Feel free to check them out!


Data Visualizations with Python


Data Visualizations with Python Course details

In this course, you will:

Learn the fundamental principles of data visualization, and apply them to a real-world project.

Use Python to conduct advanced analyses using data visualization algorithms.

Gain proficiency using different visualization libraries in Python.

Work 1:1 with an expert mentor, who'll provide you with individualized support, advice, and feedback.

Join an active community of over 5000 graduates and 700 instructors, and get access to exclusive events and webinars.


Fully online

Study for an average of 15–20 hours per week for 2 months


Personalized mentorship

Our course mentors are rated 4.96/5


Outcome-oriented

Finish with a certificate of completion and a complete portfolio project

Why learn data visualizations?

Deepen your understanding of Python

By learning Python-based visualizations, you’ll not only improve your Python library proficiency, you’ll also pave the way for transitioning into data science.

Boost your career with data visualization skills

Supercharge your resume—whether you want to work on a side project, build your own business, or simply contribute a broader skillset to your company, learning data visualization is a surefire way to maximize the value you provide.

Data visualization experts are in high demand

Employers highly value analysts who possess the ability to present data in a user-friendly and comprehensible manner, enabling effective communication across various sectors within an organization.

Why choose a CareerFoundry course?

Work with your very own course mentor

You'll enjoy a truly collaborative online learning experience, with tailored written and video feedback on everything you do from an expert who works in your new field day in, day out.

Get the perfect balance of theory and practice

With a curriculum designed in-house in collaboration with leading data visualization experts, the course will help you get to grips with various Python libraries and present your insights using an interactive dashboard.

Finish with a job-ready portfolio

Guided by the expert advice of your mentor, you’ll finish the course with a portfolio, complete with a professional case study that showcases your ability to think like a data visualization expert.

Data visualization transforms data into powerful visualizations and provides tactical, operational, and strategic insights.


Meet your new team

At CareerFoundry, you’re never alone! From the moment you start the course, you’ll be assigned a personal mentor. This seasoned and influential expert will act as your teacher, coach, and confidant through every step of the course—providing individualized support, advice, and feedback.


Your mentor

Your mentor will provide detailed video reviews of each project you complete during the course.

Our mentors haven’t just made a name for themselves at top companies in the industry—they’ve helped shape it.


A project-based curriculum that gets you thinking like a data visualization expert.

Learn the skills you need to stand out as a data analyst with data visualization expertise.

Created by experienced instructional designers, authored by industry experts, and kept up-to-date by course editors, our curriculum will serve as the foundation of your learning experience.

Achievement 1: Network Visualizations and Natural Language Processing with Python

1.1 Intro to Freelance and Python Tools: Analyze tools for writing and executing Python code and evaluate the advantages and disadvantages of different tools by summarizing their efficacy in various data analytics projects.
1.2 Setting up the Python Workspace: Apply knowledge of GitHub commands by cloning a repository and performing push and pull requests, create an SSH key to access a repository from a local machine, and execute code to push new requests in a GitHub repository.
1.3 Virtual Environments in Python: Demonstrate an understanding of Anaconda by creating and activating a virtual environment and execute apply() and lambda() functions to create a flag column in Anaconda.
1.4 Accessing Web Data with Data Scraping: Explain the legal and ethical intricacies of data scraping by checking the terms of use. Organize an environment to perform a data scrape by implementing Python libraries into a virtual environment and execute a data scrape on a website to collect web data.
1.5 Text Mining: Apply text mining techniques to a data set, evaluate various text mining techniques to perform a particular analysis, and create bar plots to showcase text mining data.
1.6 Intro to NLP and Network Analysis: Evaluate data format by determining if a data set needs wrangling, apply a Natural Language Processing algorithm to text, and apply output from an NLP algorithm to create a data frame for network analysis.
1.7 Creating Network Visualizations: Create a network graph using Python code, analyze the quality of a network graph by explaining how it can be improved, and apply improvements to a network graph to create a final iteration.

Achievement 2: Dashboards with Python

2.1 Tools for Creating Dashboards: Analyze different visualization tools by comparing their usability in different scenarios, analyze the different Python libraries available for dashboards by conducting research, and evaluate a Python dashboard.
2.2 Project Planning and Sourcing Web Data with an API: Analyze a task by creating a list of questions to solve a problem, and develop a plan to create a dashboard. Apply an API to a web page to gather data and apply Python code to structured and unstructured data sets to clean and merge them.
2.3 Fundamentals of Visualization Libraries 1: Differentiate between procedural and object-oriented programming, create bar and line charts using Matplotlib, define components in visualization charts by applying customization features, and analyze visualizations to interpret data.
2.4 Fundamentals of Visualization Libraries 2: Create bar and line charts using Seaborn, analyze a categorical variable using a visualization, and explain the benefits of utilizing a FacetGrid for analysis.
2.5 Advanced Geospatial Plotting: Apply functions in Python to aggregate data, create a geospatial plot using Kepler.gl, and analyze a map by applying filtering to recognize patterns in data.
2.6 Creating a Python Dashboard: Create charts in Plotly, execute a Python script to run a program, apply code to initialize a Python dashboard in Streamlit, and create a dashboard integrating multiple Python libraries.
2.7 Refining and Presenting a Dashboard: Apply fundamental design principles to a dashboard, create a multipage dashboard using Streamlit, deploy a dashboard using Streamlit, and analyze data by interpreting visualizations and deliver an analysis through a presentation.

  • Learn through our comprehensive, project-based curriculum
  • Receive regular, personalized feedback from your course tutor
  • Deliver two data visualization projects, which will form the basis of your professional portfolio
  • Get an in-depth review of your portfolio project from your mentor on a video call
  • Gain exclusive access to our global community—plus events and webinars

This course is for those who would like to learn how to use Python to conduct advanced analyses using data visualization algorithms and models. The Data Visualizations with Python Course is available as a specialization course for our Data Analytics Program, or it can be taken as a standalone course.

To successfully complete the course, you’ll need to have experience with the programming language Python. You’ll also need to have experience with tools and software such as Excel and Jupyter Notebook, and experience working with statistical algorithms. You should be comfortable using many of the common libraries for working with data, for instance, Pandas, NumPy, and Seaborn. It’s also necessary to have practiced executing basic Python commands in Jupyter Notebook or a similar tool, such as Google Colab, nteract, VS Code, or JupyterLite. Additionally, you’ll need:*

  • Interest in machine learning
  • Written and spoken English proficiency at a B2 level or higher
  • A computer (macOS, Windows, or Linux) with a webcam, microphone, and an internet connection.

*Note: You will be required to invest some independent study time (approximately 1-2 hours per week) towards familiarizing yourself with the tools you’ll use throughout the course, and learning how to use them.

You’ll be using the following tools: JupyterLab, GitHub, Anaconda, and Python. You'll also use a word processing tool and a presentation tool of your choice. All the tools and software you’ll need are free to use—with no additional cost to you.

Compatible operating systems: Windows 11, macOS versions 10.13 and later, Ubuntu, Debian, CentOS, or Fedora (Linux). Questions? Contact us for more information on requirements for your specific operating system.

Yes, the course is entirely asynchronous and online—so you can study when and wherever you’d like so long as you can get online and complete the course on time. But this doesn’t mean the learning experience is isolated or lonely! You’ll have your mentor, tutor, and student advisor there to support you—as well as access to our active student community on Slack.

We take a rigorously practical approach to learning. You’ll have the opportunity to apply everything you learn in practical ways. Every exercise builds up to a completed portfolio project that your mentor will review and that will show employers the in-demand skills you learn in the course.

If you set aside 15-20 hours per week to study, you’ll complete the course in approximately two months (eight weeks). If you’re able to devote 30-40 hours per week, you can complete the course in about a month (four weeks).

This course offers immersive training in the field of data visualization—including expert-authored curriculum, hands-on projects, and personalized mentorship and support from experts in the field. Everything you need to stand out in the field as the specialist you’ll be. Find out more here:

  • How it works : From the curriculum to your support team, and beyond—here are the details.
  • Meet our mentors : Get to know who the CareerFoundry mentors are and how the dual-mentorship model works.
  • Graduate outcomes : Here’s some of the work our graduates did in the full program—and where they’re at today.

Yes, we offer two payment options for your specialization course. First, you can save a little money by paying your full tuition up front. If that’s not feasible, you can pay a set amount up front (varies depending on currency), and then the remainder in three monthly payments.


Google Playstore Case Study

In this module you’ll be learning data visualisation with the help of a case study. This will enable you to understand how visualisation aids you in solving business problems.

Problem Statement

The team at Google Play Store wants to develop a feature that would enable them to boost visibility for the most promising apps. Now, this analysis would require a preliminary understanding of the features that define a well-performing app. You can ask questions like:

  • Does a higher size or price necessarily mean that an app would perform better than the other apps?
  • Or does a higher number of installs give a clear picture of which app would have a better rating than others?
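One way to start exploring these questions is a quick exploratory plot. The file name and column names below (googleplaystore.csv, Installs, Rating) are assumptions for the sake of illustration; adjust them to match the actual Play Store dataset:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical file and column names; adjust to the actual dataset.
apps = pd.read_csv("googleplaystore.csv")

# Install counts are often stored as strings like "1,000,000+"; coerce them to numbers.
installs = apps["Installs"].str.replace(",", "", regex=False).str.rstrip("+")
apps["Installs"] = pd.to_numeric(installs, errors="coerce")

# Do more installs go hand in hand with higher ratings?
sns.scatterplot(data=apps, x="Installs", y="Rating", alpha=0.3)
plt.xscale("log")
plt.show()
```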

Session 1 - Introduction to Data Visualisation


Advanced Visualization Cookbook

This Project Pythia Cookbook covers advanced visualization techniques building upon and combining various Python packages.

The possibilities of data visualization in Python are almost endless. Using Matplotlib alone, the workhorse behind many visualization packages, the user already has a lot of customization options available. cartopy, metpy, seaborn, geocat-viz, and datashader are also great packages that can offer unique additions to your Python visualization toolbox.

This Cookbook will house various visualization workflow examples that use different visualization packages, highlight the differences in functionality between the packages and any notable syntax distinctions, and demonstrate how to combine tools to achieve a specific outcome.

Julia Kent, Anissa Zacharias, Orhan Eroglu, Philip Chmielowiec, John Clyne

Contributors


This cookbook is broken up into a few sections - a “Basics of Geoscience Visualization” intro that compares different visualization packages and plot elements, and then example workflows of advanced visualization applications that are further subdivided.

Basics of Geoscience Visualization

Here we introduce the basics of geoscience visualization, the elements of a plot, different types of plots, and some unique considerations when dealing with model and measured data. Here we also share a comparison of different visualization packages available to the Scientific Python programmer.

Specialty Plots

There are some plot types that are unique to atmospheric science such as Taylor Diagrams or Skew-T plots. Here we will use metpy and geocat-viz to demonstrate these specialty plots.

Visualization of Structured Grids

In this section we will demonstrate how to visualize data that is on a structured grid. Namely, we will look at Spaghetti Hurricane plots. Here we will have workflows that utilize packages such as cartopy and geocat-viz .

Animated Plots

Animated plots are great tools for science communication and outreach. We will demonstrate how to make your plots come to life. In this book, we use “animated plots” to refer to stable animations, such as the creation of gifs or videos.

Interactivity

Dynamically rendering, animating, panning, and zooming over a plot can be great ways to increase data fidelity. We will showcase how to use Holoviz technologies with a Bokeh backend to create interactive plots, utilizing unstructured grid data in the Model for Prediction Across Scales (MPAS) format.

Running the Notebooks

You can either run the notebook using Binder or on your local machine.

Running on Binder

The simplest way to interact with a Jupyter Notebook is through Binder, which enables the execution of a Jupyter Book in the cloud. The details of how this works are not important for now. All you need to know is how to launch a Pythia Cookbook chapter via Binder. Simply navigate your mouse to the top right corner of the book chapter you are viewing, click on the rocket ship icon (see figure below), and be sure to select “launch Binder”. After a moment you should be presented with a notebook that you can interact with. That is, you’ll be able to execute and even change the example programs. You’ll see that the code cells have no output at first, until you execute them by pressing Shift + Enter. Complete details on how to interact with a live Jupyter notebook are described in Getting Started with Jupyter.

Running on Your Own Machine

If you are interested in running this material locally on your computer, follow these steps:

  • Clone the https://github.com/ProjectPythia/advanced-viz-cookbook repository.
  • Move into the advanced-viz-cookbook directory.
  • Create and activate your conda environment from the environment.yml file.
  • Move into the notebooks directory and start up JupyterLab.


Data Visualization

Transform raw data into actionable insights with interactive visualizations, dashboards, and data apps.


Python Data Visualization

Izzy Miller

Leverage Hex to transition from SQL data warehouse insights to dynamic visual storytelling using Python's top visualization libraries.


Python mapping libraries

Access Python's powerful mapping ecosystem right alongside SQL and native geospatial tools


Geospatial Data Analysis

Use powerful GIS techniques alongside SQL and built-in mapping tools


A quick guide to Data Visualization

Data visualization is the representation of information in a graphical or pictorial format. It allows us to understand patterns, trends, and correlations in data, making complex data more accessible, understandable, and usable. It is an essential part of data analysis and business intelligence. By conveying information in a universally accessible way, data visualization helps to share ideas convincingly and to make informed decisions based on data.

Python is the leading tool in data visualization due to its simplicity, versatility, and the powerful libraries it provides for this purpose. Python's Matplotlib, Seaborn, and Plotly, among other libraries, offer a wide array of options for creating static, animated, and interactive plots, making Python a one-stop-shop for all data visualization needs.

Python for Data Visualization

Python basics like data types, variables, lists, and control structures help handle data effectively for preprocessing and visualization. Libraries such as NumPy and Pandas provide tools for data manipulation, while Matplotlib and Seaborn are essential for data visualization.

Pandas provides two key data structures: DataFrames and Series. These structures are highly flexible and powerful, allowing manipulation of heterogeneously-typed data and integration with many other Python libraries, making them the de-facto structures for data manipulation in Python.

NumPy provides an object for multi-dimensional array manipulation known as the ndarray. This structure allows for efficient operations on large datasets and supports a wide range of mathematical operations, including vectorized operations.
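A minimal sketch of these structures, with invented values:

```python
import numpy as np
import pandas as pd

# An ndarray supports fast, vectorized math on every element at once.
temps_c = np.array([21.5, 23.0, 19.8, 24.1])
temps_f = temps_c * 1.8 + 32  # vectorized conversion, no explicit loop

# A Series is a labeled 1-D structure; a DataFrame is a labeled 2-D table.
sales = pd.Series([120, 95, 130], index=["Jan", "Feb", "Mar"])
df = pd.DataFrame({"month": sales.index, "units_sold": sales.values})

print(temps_f)
print(df)
```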

Once you have your data in these data structures, you can then start to use Python’s plotting libraries to create visualizations:

Matplotlib is the foundation of data visualization in Python, providing a flexible and comprehensive platform for creating static, animated, and interactive visualizations in Python. Its versatility makes it a valuable tool for any data scientist or analyst.

Seaborn simplifies the creation of more complex visualizations, providing a high-level interface for attractive statistical graphics. It is particularly useful when working with DataFrames, offering a more sophisticated approach to visualizing data distributions.

Plotly stands out for its ability to produce interactive and browser-based plots. With its wide array of chart types, Plotly allows users to create complex visuals with ease, adding a layer of engagement and interactivity to data presentations.

Creating Basic Visualizations

Line Plots : Line plots are excellent for showcasing trends over time. They are created by connecting data points in the order they appear in the dataset and are especially useful when working with time-series data. You can create this with Matplotlib’s .plot() method.

Scatter Plots : Scatter plots are used to display the relationship between two numerical variables. By visualizing the data distribution, scatter plots can give a quick overview of correlations, trends, and outliers. You can create this with Matplotlib’s .scatter() method.

Bar Charts : Bar charts represent categorical data with rectangular bars. Each bar's height or length corresponds to the quantity of the data it represents. Bar charts are effective at comparing quantities across different categories. You can create this with Matplotlib’s .bar() method.

Box Plots : Box plots provide a summary of the statistical properties of data, including the median, quartiles, and potential outliers. This makes them a powerful tool for understanding data distribution and variability. You can create this with Matplotlib’s .boxplot() method.
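A compact sketch showing all four plot types side by side; the data is invented for the example:

```python
import matplotlib.pyplot as plt

# Invented example data.
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [100, 120, 90, 140]
ad_spend = [20, 25, 18, 30]

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].plot(months, sales)        # line plot: trend over time
axes[0, 1].scatter(ad_spend, sales)   # scatter plot: relationship between two variables
axes[1, 0].bar(months, sales)         # bar chart: comparison across categories
axes[1, 1].boxplot(sales)             # box plot: distribution summary
plt.tight_layout()
plt.show()
```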

You can then create more advanced visualizations, such as multi-dimensional data visualization to observe complex patterns across multiple variables. Techniques such as parallel coordinate plots, scatterplot matrices, and heatmaps help explore these relationships. You can also create interactive visualizations to allow users to engage with the data more effectively. They can zoom, pan, and hover over the data for more detailed information, leading to better understanding and insight.



Data visualization software is important in business as it enables decision makers to see analytics presented visually, helping them understand complex data, spot patterns, trends, and outliers, and make strategic decisions accordingly.

The best data visualization tool for SQL is often Hex, as it allows direct connection to SQL databases, and you can use SQL queries within the tool to manipulate data before visualization.

The best tools for web-based data visualization include D3.js for customizable, interactive visuals. Hex apps can also be embedded in web pages for data visualization.

R is excellent for data visualization, especially with the ggplot2 package, which provides a powerful and flexible system for creating a wide variety of visualizations with a high level of customization.

The best data visualization tools for data analysis are Python-based tools and their libraries (Matplotlib, Seaborn, Plotly), thanks to their deep integration with data manipulation and analysis libraries.

The best free data visualization tools include Matplotlib and Seaborn for Python, ggplot2 for R, and free tools like Tableau Public and Google Data Studio.

To create data visualizations in Python using Matplotlib, you first need to import the library using "import matplotlib.pyplot as plt", and then use Matplotlib's functions like 'plt.plot()', 'plt.scatter()', or 'plt.bar()' to create line, scatter, and bar plots respectively.



Using Python for Data Analysis

Table of Contents

  • Understanding the Need for a Data Analysis Workflow
  • Setting Your Objectives
  • Reading Data from CSV Files
  • Reading Data from Other Sources
  • Creating Meaningful Column Names
  • Dealing with Missing Data
  • Handling Financial Columns
  • Correcting Invalid Data Types
  • Fixing Inconsistencies in Data
  • Correcting Spelling Errors
  • Checking for Invalid Outliers
  • Removing Duplicate Data
  • Storing Your Cleansed Data
  • Performing a Regression Analysis
  • Investigating a Statistical Distribution
  • Finding No Relationship
  • Communicating Your Findings
  • Resolving an Anomaly

Data analysis is a broad term that covers a wide range of techniques that enable you to reveal any insights and relationships that may exist within raw data. As you might expect, Python lends itself readily to data analysis. Once Python has analyzed your data, you can then use your findings to make good business decisions, improve procedures, and even make informed predictions based on what you’ve discovered.

In this tutorial, you’ll:

  • Understand the need for a sound data analysis workflow
  • Understand the different stages of a data analysis workflow
  • Learn how you can use Python for data analysis

Before you start, you should familiarize yourself with Jupyter Notebook , a popular tool for data analysis. Alternatively, JupyterLab will give you an enhanced notebook experience. You might also like to learn how a pandas DataFrame stores its data. Knowing the difference between a DataFrame and a pandas Series will also prove useful.

Get Your Code: Click here to download the free data files and sample code for your mission into data analysis with Python.

In this tutorial, you’ll use a file named james_bond_data.csv . This is a doctored version of the free James Bond Movie Dataset . The james_bond_data.csv file contains a subset of the original data with some of the records altered to make them suitable for this tutorial. You’ll find it in the downloadable materials. Once you have your data file, you’re ready to begin your first mission into data analysis.

Data analysis is a very popular field and can involve performing many different tasks of varying complexity. Which specific analysis steps you perform will depend on which dataset you’re analyzing and what information you hope to glean. To overcome these scope and complexity issues, you need to take a strategic approach when performing your analysis. This is where a data analysis workflow can help you.

A data analysis workflow is a process that provides a set of steps for your analysis team to follow when analyzing data. The implementation of each of these steps will vary depending on the nature of your analysis, but following an agreed-upon workflow allows everyone involved to know what needs to happen and to see how the project is progressing.

Using a workflow also helps futureproof your analysis methodology. By following the defined set of steps, your efforts become systematic, which minimizes the possibility that you’ll make mistakes or miss something. Furthermore, when you carefully document your work, you can reapply your procedures against future data as it becomes available. Data analysis workflows therefore also provide repeatability and scalability.

There’s no single data workflow process that suits every analysis, nor is there universal terminology for the procedures used within it. To provide a structure for the rest of this tutorial, the diagram below illustrates the stages that you’ll commonly find in most workflows:

Diagram: a data analysis workflow with iterations.

The solid arrows show the standard data analysis workflow that you’ll work through to learn what happens at each stage. The dashed arrows indicate where you may need to carry out some of the individual steps several times depending upon the success of your analysis. Indeed, you may even have to repeat the entire process should your first analysis reveal something interesting that demands further attention.

Now that you have an understanding of the need for a data analysis workflow, you’ll work through its steps and perform an analysis of movie data. The movies that you’ll analyze all relate to the British secret agent Bond … James Bond.

The very first workflow step in data analysis is to carefully but clearly define your objectives. It’s vitally important for you and your analysis team to be clear on what exactly you’re all trying to achieve. This step doesn’t involve any programming but is every bit as important because, without an understanding of where you want to go, you’re unlikely to ever get there.

The objectives of your data analysis will vary depending on what you’re analyzing. Your team leader may want to know why a new product hasn’t sold, or perhaps your government wants information about a clinical test of a new medical drug. You may even be asked to make investment recommendations based on the past results of a particular financial instrument. Regardless, you must still be clear on your objectives. These define your scope.

In this tutorial, you’ll gain experience in data analysis by having some fun with the James Bond movie dataset mentioned earlier. What are your objectives? Now pay attention, 007 :

  • Is there any relationship between the Rotten Tomatoes ratings and those from IMDb?
  • Are there any insights to be gleaned from analyzing the lengths of the movies?
  • Is there a relationship between the number of enemies James Bond has killed and the user ratings of the movie in which they were killed?

Now that you’ve been briefed on your mission, it’s time to get out into the field and see what intelligence you can uncover.

Acquiring Your Data

Once you’ve established your objectives, your next step is to think about what data you’ll need to achieve them. Hopefully, this data will be readily available, but you may have to work hard to get it. You may need to extract it from the data storage systems within an organization or collect survey data . Regardless, you’ll somehow need to get the data.

In this case, you’re in luck. When your bosses briefed you on your objectives, they also gave you the data in the james_bond_data.csv file. You must now spend some time becoming familiar with what you have in front of you. During the briefing, you made some notes on the content of this file:

  • The release date of the movie
  • The title of the movie
  • The actor playing the title role
  • The manufacturer of James Bond’s car
  • The movie’s gross US earnings
  • The movie’s gross worldwide earnings
  • The movie’s budget, in thousands of US dollars
  • The running time of the movie
  • The average user rating from IMDb
  • The average user rating from Rotten Tomatoes
  • The number of martinis that Bond drank in the movie

As you can see, you have quite a variety of data. You won’t need all of it to meet your objectives, but you can think more about this later. For now, you’ll concentrate on getting the data out of the file and into Python for cleansing and analysis.

Remember, also, it’s considered best practice to retain the original file in case you need it in the future. So you decide to create a second data file with a cleansed version of the data. This will also simplify any future analysis that may arise as a consequence of your mission.

You can obtain your data in a variety of file formats. One of the most common is the comma-separated values (CSV) file. This is a text file that separates each piece of data with commas. The first row is usually a header row that defines the file’s content, with the subsequent rows containing the actual data. CSV files have been in use for several years and remain popular because several data storage programs use them.

Because james_bond_data.csv is a text file, you can open it in any text editor. The screenshot below shows it opened in Notepad:

Screenshot: the raw CSV file opened in Notepad.

As you can see, a CSV file isn’t a pleasant read. Fortunately, you rarely need to read them in their raw form.

When you need to analyze data, Python’s pandas library is a popular option. To install pandas in a Jupyter Notebook, add a new code cell and type !python -m pip install pandas. When you run the cell, you’ll install the library. If you’re working in the command line, then you use the same command, only without the exclamation point (!).

With pandas installed, you can now use it to read your data file into a pandas DataFrame. The code below will do this for you:
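Based on the description that follows, the code in question looks something like this:

```python
import pandas as pd

# Read the CSV file and let pandas pick the best extension data types.
james_bond_data = pd.read_csv("james_bond_data.csv").convert_dtypes()
```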

Firstly, you import the pandas library into your program. It’s standard practice to alias pandas as pd so that your code can refer to it concisely. Next, you use the read_csv() function to read your data file into a DataFrame named james_bond_data. This will not only read your file but also take care of sorting out the headings from the data and indexing each record.

While using pd.read_csv() alone will work, you can also use .convert_dtypes() . This good practice allows pandas to optimize the data types that it uses in the DataFrame.

Suppose your CSV file contained a column of integers with missing values. By default, these will be assigned the numpy.NaN floating-point constant. This forces pandas to assign the column a float64 data type. Any integers in the column are then cast as floats for consistency.

These floating-point values could cause other undesirable floats to appear in the results of subsequent calculations. Similarly, if the original numbers were, for example, ages, then having them cast into floats probably wouldn’t be what you want.

Your use of .convert_dtypes() means that columns will be assigned one of the extension data types . Any integer columns, which were of type int , will now become the new Int64 type. This occurs because pandas.NA represents the original missing values and can be read as an Int64 . Similarly, text columns become string types, rather than the more generic object . Incidentally, floats become the new Float64 extension type, with a capital F.

After creating the DataFrame, you then decide to take a quick look at it to make sure the read has worked as you expected it to. A quick way to do this is to use .head() . This function will display the first five records for you by default, but you can customize .head() to display any number you like by passing an integer to it. Here, you decide to view the default five records:
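In code, that is simply:

```python
# Display the first five rows of the DataFrame.
james_bond_data.head()
```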

You now have a pandas DataFrame containing the records along with their headings and a numerical index on the left-hand side. If you’re using a Jupyter Notebook, then the output will look like this:

Output: the DataFrame showing an initial view of the data.

As you can see, the Jupyter Notebook output is even more readable. However, both are much better than the CSV file that you started with.

Although CSV is a popular data file format, it isn’t particularly good. Lack of format standardization means that some CSV files contain multiple header and footer rows, while others contain neither. Also, the lack of a defined date format and the use of different separator and delimiter characters within and between data can cause issues when you read it.

Fortunately, pandas allows you to read many other formats, like JSON and Excel . It also provides web-scraping capabilities to allow you to read tables from websites. One particularly interesting and relatively new format is the column-oriented Apache Parquet file format used for handling bulk data. Parquet files are also cost-effective when working with cloud storage systems because of their compression ability.

Although having the ability to read basic CSV files is sufficient for this analysis, the downloads section provides some alternative file formats containing the same data as james_bond_data.csv . Each file is named james_bond_data , with a file-specific extension. Why not see if you can figure out how to read each of them into a DataFrame in the same way as you did with your CSV file?

If you want an additional challenge, then try scraping the “Books, by publication sequence” table from Wikipedia. If you succeed, then you’ll have gained some valuable knowledge, and M will be very pleased with you.

For solutions to these challenges, expand the following collapsible sections:

How to Read a JSON File

To read in a JSON file, you use pd.read_json() :
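Something along these lines, with the .json file named as described earlier:

```python
# Assumes a james_bond_data.json file, following the naming convention above.
james_bond_data = pd.read_json("james_bond_data.json").convert_dtypes()
```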

As you can see, you only need to specify the JSON file that you want to read. You can also specify some interesting formatting and data conversion options if you need to. The docs page will tell you more.

How to Read an Excel File

Before this will work, you must install the openpyxl library. You use the command !python -m pip install openpyxl from within your Jupyter Notebook or python -m pip install openpyxl at the terminal. To read your Excel file, you then use .read_excel() :
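Presumably along these lines (the .xlsx extension is an assumption):

```python
# Assumes a james_bond_data.xlsx workbook with a single worksheet.
james_bond_data = pd.read_excel("james_bond_data.xlsx").convert_dtypes()
```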

As before, you only need to specify the filename. In cases where you’re reading from one of several worksheets, you must also specify the worksheet name by using the sheet_name argument. The docs page will tell you more.

How to Read a Parquet File

Before this will work, you must install a serialization engine such as pyarrow . To do this, you use the command !python -m pip install pyarrow from within your Jupyter Notebook or python -m pip install pyarrow at the terminal. To read your parquet file, you then use .read_parquet() :
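Presumably along these lines (the .parquet extension is an assumption):

```python
# Assumes a james_bond_data.parquet file; pyarrow handles the deserialization.
james_bond_data = pd.read_parquet("james_bond_data.parquet").convert_dtypes()
```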

As before, you only need to specify the filename. The docs page will tell you more, including how to use alternative serialization engines.

How to Web Scrape an HTML Table

Before this will work, you must install the lxml library to allow you to read HTML files. To do this, you use the command !python -m pip install lxml from within your Jupyter Notebook or python -m pip install lxml at the terminal. To read, or scrape , an HTML table, you use read_html() :
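For example, something like the following; the exact URL is an assumption, and read_html() returns a list of every table found on the page:

```python
# The URL is an assumed example; any page containing the target table will do.
tables = pd.read_html("https://en.wikipedia.org/wiki/James_Bond")
books = tables[1]  # finding the right list index may take trial and error
```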

This time, you pass the URL of the website that you wish to scrape. The read_html() function will return a list of the tables on the web page. The one that interests you in this example is at list index 1 , but finding the one you want may require a certain amount of trial and error. The docs page will tell you more.

Now that you have your data, you might think it’s time to dive deep into it and start your analysis. While this is tempting, you can’t do it just yet. This is because your data might not yet be analyzable. In the next step, you’ll fix this.

Cleansing Your Data With Python

The data cleansing stage of the data analysis workflow is often the stage that takes the longest, particularly when there’s a large volume of data to be analyzed. It’s at this stage that you must check over your data to make sure that it’s free from poorly formatted, incorrect, duplicated, or incomplete data. Unless you have quality data to analyze, your Python data analysis code is highly unlikely to return quality results.

While you must check and re-check your data to resolve as many problems as possible before the analysis, you must also accept that additional problems could appear during your analysis. That is why there’s a possible iteration between the data cleansing and analysis stages in the diagram that you saw earlier .

The traditional way to cleanse data is by applying pandas methods separately until the data has been cleansed. While this works, it means that you create a set of intermediate DataFrame versions, each with a separate fix applied. However, this creates reproducibility problems with future cleansings because you must reapply each fix in strict order.

A better approach is for you to cleanse data by repeatedly updating the same DataFrame in memory using a single piece of code. When writing data cleansing code, you should build it up in increments and test it after writing each increment. Then, once you’ve written enough to cleanse your data fully, you’ll have a highly reusable script for cleansing of any future data that you may need to analyze. This is the approach that you’ll adopt here.

When you extract data from some systems, the column names may not be as meaningful as you’d like. It’s good practice to make sure the columns in your DataFrame are sensibly named. To keep them readable within code, you should adopt the Python variable-naming convention of using all lowercase characters, with multiple words being separated by underscores. This forces your analysis code to use these names and makes it more readable as a result.

To rename the columns in a DataFrame, you use .rename() . You pass it a Python dictionary whose keys are the original column names and whose values are the replacement names:
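A sketch of that call; only the new names imdb and rotten_tomatoes are confirmed later in the text, so the original column names shown here are hypothetical stand-ins:

```python
data = james_bond_data.rename(
    columns={
        # Hypothetical original names mapped to Pythonic replacements.
        "Avg_User_IMDB": "imdb",
        "Avg_User_Rtn_Tom": "rotten_tomatoes",
        # ...plus similar entries for the remaining columns.
    }
)
```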

In the code above, you’ve replaced each of the column names with something more Pythonic. This returns a fresh DataFrame that’s referenced using the data variable, not the original DataFrame referenced by james_bond_data . The data DataFrame is the one that you’ll work with from this point forward.

Note: When analyzing data, it’s good practice to retain the raw data in its original form. This is necessary to ensure that others can reproduce your analysis to confirm its validity. Remember, it’s the raw data, not your cleansed version, that provides the real proof of your conclusions.

As with all stages in data cleansing, it’s important to test that your code has worked as you expect:
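For example:

```python
# List the column labels of the renamed DataFrame.
data.columns
```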

To quickly view the column labels in your DataFrame, you use the DataFrame’s .columns property. As you can see, you’ve successfully renamed the columns. You’re now ready to move on and cleanse the actual data itself.

As a starting point, you can quickly check to see if anything is missing within your data. The DataFrame’s .info() method allows you to quickly do this:
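
Calling it on your DataFrame is all that's needed:

```python
data.info()
```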

When this method runs, you see a very concise summary of the DataFrame. The .info() method has revealed that there’s missing data. The RangeIndex line near the top of the output tells you that there have been twenty-seven rows of data read into the DataFrame. However, the imdb and rotten_tomatoes columns contain only twenty-six non-null values each. Each of these columns has one piece of missing data.

You may also have noticed that some data columns have incorrect data types. To begin with, you’ll concentrate on fixing missing data. You’ll deal with the data type issues afterward.

Before you can fix these columns, you need to see them. The code below will reveal them to you:
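
Putting together the pieces described in the next few paragraphs gives a one-liner like this:

```python
data.loc[data.isna().any(axis="columns")]
```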

To find rows with missing data, you can make use of the DataFrame’s .isna() method. This will analyze the data DataFrame and return a second, identically sized Boolean DataFrame that contains True or False values, depending on whether or not the corresponding values in the data DataFrame are <NA>.

Once you have this second Boolean DataFrame, you then use its .any(axis="columns") method to return a pandas Series. This Series contains True for each row of the Boolean DataFrame that holds at least one True value, and False otherwise. The True values in this Series indicate rows containing missing data, while the False values indicate rows where there’s no missing data.

At this point, you have a Boolean Series of values. To see the rows themselves, you can make use of the DataFrame’s .loc property. Although you usually use .loc to access subsets of rows and columns by their labels, you can also pass it your Boolean Series and get back a DataFrame containing only those rows corresponding to the True entries in the Series. These are the rows with missing data.

If you put all of this together, then you get data.loc[data.isna().any(axis="columns")] . As you can see, the output displays only one row that contains both <NA> values.

When you first saw only one row appear, you might have felt a bit shaken , but now you’re not stirred because you understand why.

JupyterLab: Nobody Does it Better

One of your aims is to produce code that you can reuse in the future. The previous piece of code is really only for locating the rows with missing data and won’t be part of your final production code. If you’re working in a Jupyter Notebook, then you may be tempted to include code such as this. While this is necessary if you want to document everything that you’ve done, you’ll end up with a messy notebook that will be distracting for others to read.

If you’re working within a notebook in JupyterLab, then a good workflow tactic is to open a new console within JupyterLab against your notebook and run your test and exploratory code inside that console. You can copy any code that gives you the desired results to your Jupyter notebook, and you can discard any code that doesn’t do what you expected or that you don’t need.

To add a new console to your notebook, right-click anywhere on the running notebook and choose New Console for Notebook from the pop-up menu that appears. A new console will appear below your notebook. Type any code that you wish to experiment with into the console and tap Shift + Enter to run it. You’ll see the results appear above the code, allowing you to decide whether or not you wish to keep it.

Once your analysis is finished, you should reset and retest your entire Jupyter notebook from scratch. To do this, select Kernel → Restart Kernel → Clear Outputs of All Cells from the menu. This will reset your notebook’s kernel, removing all traces of the previous results. You can then rerun the code cells sequentially to verify that everything works correctly.

Two Jupyter notebooks are provided as part of the tutorial’s downloadable content:

The data_analysis_results.ipynb notebook contains a reusable version of the code for cleansing and analyzing the data, while the data_analysis_findings.ipynb notebook contains a log of the procedures used to arrive at these final results.

You can complete this tutorial using other Python environments, but Jupyter Notebook within JupyterLab is highly recommended.

To fix these errors, you need to update the data DataFrame. As you learned earlier, you’ll build up all changes temporarily in a DataFrame referenced by the data variable then write them to disk when they are all complete. You’ll now add some code to fix those <NA> values that you’ve discovered.

After doing some research, you find out that the missing imdb and rotten_tomatoes values are 7.1 and 6.8, respectively. The code below will update each missing value correctly:
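
A sketch of the fix, using an illustrative variable name the_fixes and row label 10 for the row with the missing values:

```python
the_fixes = pd.DataFrame(
    {
        "imdb": {10: 7.1},
        "rotten_tomatoes": {10: 6.8},
    }
)

data = data.combine_first(the_fixes)
```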

Here, you’ve chosen to define a DataFrame using a Python dictionary. The keys of the dictionary define its column headings, while its values define the data. Each value consists of a nested dictionary. The keys of this nested dictionary provide the row index, while the values provide the updates. The DataFrame looks like this:

Then when you call .combine_first() and pass it this DataFrame, the two missing values in the imdb and rotten_tomatoes columns in row 10 are replaced by 7.1 and 6.8 , respectively. Remember, you haven’t updated the original james_bond_data DataFrame. You’ve only changed the DataFrame referenced by the data variable.

You must now test your efforts. Go ahead and run data[data.isna().any(axis="columns")] to make sure no rows are returned. You should see an empty DataFrame.

Now you’ll fix the invalid data types. Without this, numerical analysis of your data is meaningless, if not impossible. To begin with, you’ll fix the currency columns.

The data.info() code that you ran earlier also revealed a subtler issue. The income_usa, income_world, movie_budget, and film_length columns all have string data types. However, these should all be numeric types because strings are of little use for calculations. Similarly, the release_date column, which contains the release date, is also a string. This should be a date type.

First of all, you need to take a look at some of the data in each of the columns to learn what the problem is:
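
For example, a look at the first few rows of the affected columns:

```python
data[["income_usa", "income_world", "movie_budget", "film_length"]].head()
```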

To access multiple columns, you pass a list of column names into the DataFrame’s [] operator. Although you could also use data.loc[] , using data[] alone is cleaner. Either option will return a DataFrame containing all the data from those columns. To keep things manageable, you use the .head() method to restrict the output to the first five records.

As you can see, the three financial columns each have dollar signs and comma separators, while the film_length column contains "mins" . You’ll need to remove all of this to use the remaining numbers in the analysis. These additional characters are why the data types are being misinterpreted as strings.

Although you could replace the $ sign in the entire DataFrame, this may remove it in places where you don’t want to. It’s safer if you remove it one column at a time. To do this, you can make excellent use of the .assign() method of a DataFrame. This can either add a new column to a DataFrame, or replace existing columns with updated values.

As a starting point, suppose you wanted to replace the $ symbols in the income_usa column of the data DataFrame that you’re creating. The additional code below achieves this:
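
A sketch of this step, assuming the illustrative new_column_names and the_fixes objects from earlier and folding everything into one chained pipeline, might look like this:

```python
data = (
    james_bond_data.rename(columns=new_column_names)
    .combine_first(the_fixes)
    .assign(
        income_usa=lambda data: (
            data["income_usa"]
            .replace("[$,]", "", regex=True)
            .astype("Float64")
        ),
    )
)
```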

To correct the income_usa column, you define its new data as a pandas Series and pass it into the data DataFrame’s .assign() method. This method will then either overwrite an existing column with a new Series or create a new column containing it. You define the name of the column to be updated or created as a named parameter that references the new data Series. In this case, you’ll pass a parameter named income_usa .

It’s best to create the new Series using a lambda function. The lambda function used in this example accepts the data DataFrame as its argument and then uses the .replace() method to remove the $ and comma separators from each value in the income_usa column. Finally, it converts the remaining digits, which are currently of type string , to Float64 .

To actually remove the $ symbol and commas, you pass the regular expression [$,] into .replace() . By enclosing both characters in [] , you’re specifying that you want to remove all instances of both. Then you define their replacements as "" . You also set the regex parameter to True to allow [$,] to be interpreted as a regular expression.

The result of the lambda function is a Series with no $ or comma separators. You then assign this Series to the variable income_usa . This causes the .assign() method to overwrite the existing income_usa column’s data with the cleansed updates.

Take another look at the above code, and you’ll see how this all fits together. You pass .assign() a parameter named income_usa, which references a lambda function that calculates a Series containing the updated content. Because the parameter shares its name with the existing column, .assign() overwrites the income_usa column with the new values.

Now go ahead and run this code to remove the offending characters from the income_usa column. Don’t forget to test your work and verify that you’ve made the replacements. Also, remember to verify that the data type of income_usa is indeed Float64 .

Note: When using assign() , it’s also possible to pass in a column directly by using .assign(income_usa=data["income_usa"]...) . However, this causes problems if the income_usa column has been changed earlier in the pipeline. These changes won’t be available for the calculation of the updated data. By using a lambda function, you force the calculation of a new set of column data based on the most recent version of that data.

Of course, it isn’t only the income_usa column that you need to work on. You also need to do the same with the income_world and movie_budget columns. You can also achieve this using the same .assign() method. You can use it to create and assign as many columns as you like. You simply pass them in as separate named arguments.

Why not go ahead and see if you can write the code that removes the same two characters from the income_world and movie_budget columns? As before, don’t forget to verify that your code has worked as you expect, but remember to check the correct columns!

Note : Remember that the parameter names you use within .assign() to update columns must be valid Python identifiers. This was another reason why changing the column names the way you did at the beginning of your data cleansing was a good idea.

Once you’ve tried your hand at resolving the remaining issues with these columns, you can reveal the solution below:

Removing Remaining Currency Symbols and Separators

In the code below, you’ve used the earlier code but added a lambda to remove the remaining "$" and separator strings:
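
One possible version, with the two extra lambdas sitting alongside income_usa in the same .assign() call:

```python
data = (
    james_bond_data.rename(columns=new_column_names)
    .combine_first(the_fixes)
    .assign(
        income_usa=lambda data: (
            data["income_usa"].replace("[$,]", "", regex=True).astype("Float64")
        ),
        income_world=lambda data: (
            data["income_world"].replace("[$,]", "", regex=True).astype("Float64")
        ),
        movie_budget=lambda data: (
            data["movie_budget"].replace("[$,]", "", regex=True).astype("Float64")
        ),
    )
)
```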

One new lambda deals with the income_world data, while another deals with the movie_budget data. As you can see, all three lambdas work in the same way.

Once you’ve made these corrections, remember to test your code by using data.info() . You’ll see the financial figures are no longer string types, but Float64 numbers. To view the actual changes, you can use data.head() .

With the currency columns’ data types now corrected, you can fix the remaining invalid types.

Next, you must remove the "mins" string from the film length values, then convert the column to the integer type. This will allow you to analyze the values. To remove the offending "mins" text, you decide to use pandas’ .str.removesuffix() Series method. This allows you to remove the string passed to it from the right-hand side of the film_length column. You can then use .astype("Int64") to take care of the data type.

Using the above information, go ahead and see if you can update the film_length column using a lambda, and add it in as another parameter to the .assign() method.

You can reveal the solution below:

Removing a Substring

In the code below, you’re using the earlier code but adding a lambda to remove the "mins" string:
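
A sketch of that addition, with the new film_length lambda joining the same .assign() call (the earlier lambdas are elided as a comment):

```python
data = (
    james_bond_data.rename(columns=new_column_names)
    .combine_first(the_fixes)
    .assign(
        # ...the three financial lambdas from the previous step stay here unchanged...
        film_length=lambda data: (
            data["film_length"]
            .str.removesuffix("mins")
            .astype("Int64")
        ),
    )
)
```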

As you can see, the lambda uses .removesuffix() to update the film_length column by generating a new Series based on the data from the original film_length column, minus the "mins" string at the end of each value. To make sure you can use the column’s data as numbers, you use .astype("Int64").

As before, test your code with the .info() and .head() methods that you used earlier. You should see the film_length column now has a more useful Int64 data type, and you’ve removed "mins" .

In addition to the problems with the financial data, you also noticed that the release_date column was being treated as a string. To convert its data into datetime format, you can use pd.to_datetime() .

To use to_datetime() , you pass the Series data["release_date"] into it, not forgetting to specify a format string to allow the date values to be interpreted correctly. Each date here is of the form June, 1962 , so in your code, you use %B followed by a comma and space to denote the position of the month names, then %Y to denote the four-digit years.

You also take the opportunity to create a new column in your DataFrame named release_year for storing the year portion of your updated data["release_date"] column data. The code to access this value is data["release_date"].dt.year . You figure that having each year separate may be useful for future analysis and even perhaps a future DataFrame index.

Using the above information, go ahead and see if you can update the release_date column to the correct type, and also create a new release_year column containing the year that each movie came out. As before, you can achieve both with .assign() and lambdas, and again as before, remember to test your efforts.

Adjusting Dates

In the code below, you use the earlier code with the addition of lambdas to update the release_date column’s data type and create a new column containing the release year:
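
A sketch with the two date-related lambdas added to the same .assign() call:

```python
data = (
    james_bond_data.rename(columns=new_column_names)
    .combine_first(the_fixes)
    .assign(
        # ...the financial and film_length lambdas from the earlier steps stay here...
        release_date=lambda data: pd.to_datetime(
            data["release_date"], format="%B, %Y"
        ),
        release_year=lambda data: data["release_date"].dt.year,
    )
)
```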

As you can see, the lambda assigned to release_date updates the release_date column, while the lambda assigned to release_year creates a new release_year column containing the year part of the dates from the release_date column.

As always, don’t forget to test your efforts.

Now that you’ve resolved these initial issues, you rerun data.info() to verify that you’ve fixed all of your initial concerns:

As you can see, the original twenty-seven entries now all have data in them. The release_date column has a datetime64 format, and the three financial columns and the film_length column all have numeric types. There’s even a new release_year column in your DataFrame as well. Of course, this check wasn’t really necessary because, like all good secret agents, you already checked your code as you wrote it.

You may also have noticed that the column order has changed. This has happened as a result of your earlier use of combine_first() . In this analysis, the column order doesn’t matter because you never need to display the DataFrame. If necessary, you can specify a column order by using square brackets, as in data[["column_1", ...]] .

At this point, you’ve made sure nothing is missing from your data and that it’s all of the correct type. Next you turn your attention to the actual data itself.

While updating the movie_budget column label earlier, you may have noticed that its numbers appear small compared to the other financial columns. The reason is that its data is in thousands, whereas the other columns are actual figures. You decide to do something about this because it could cause problems if you compared this data with the other financial columns that you worked on.

You might be tempted to write another lambda and pass it into .assign() using the movie_budget parameter. Unfortunately, this won’t work because you can’t use the same parameter twice in the same function. You could revisit the movie_budget parameter and add functionality to multiply its result by 1000 , or you could create yet another column based on the movie_budget column values. Alternatively, you could create a separate .assign() call.

Each of these options would work, but multiplying the existing values is probably the simplest. Go ahead and see if you can multiply the results of your earlier movie_budget lambda by 1000 .

Adjusting Quantities

The code below is similar to the earlier version, with the lambda’s result multiplied by 1000:
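
One possible form, adjusting the existing movie_budget lambda within the same .assign() call:

```python
data = (
    james_bond_data.rename(columns=new_column_names)
    .combine_first(the_fixes)
    .assign(
        # ...the other lambdas from the earlier steps stay here unchanged...
        movie_budget=lambda data: (
            data["movie_budget"].replace("[$,]", "", regex=True).astype("Float64")
            * 1000
        ),
    )
)
```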

The movie_budget lambda fixes the currency values, and you’ve adjusted it to multiply those values by 1000. All financial columns are now in the same units, making comparisons possible.

You can use the techniques that you used earlier to view the values in movie_budget and confirm that you’ve correctly adjusted them.

Now that you’ve sorted out some formatting issues, it’s time for you to move on and do some other checks.

One of the most difficult data cleansing tasks is checking for typos because they can appear anywhere. As a consequence, you’ll often not encounter them until late in your analysis and, indeed, may never notice them at all.

In this exercise, you’ll look for typos in the names of the actors who played Bond and in the car manufacturers’ names. This is relatively straightforward to do because both of these columns contain data items from a finite set of allowable values:
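
For example, to count the Bond actor values:

```python
data["bond_actor"].value_counts()
```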

The .value_counts() method allows you to quickly obtain a count of each element within a pandas Series. Here you use it to help you find possible typos in the bond_actor column. As you can see, one instance of Sean Connery and one of Roger Moore contain typos.

To fix these with string replacement, you use the .str.replace() method of a pandas data Series. In its simplest form, you only need to pass it the original string and the string that you want to replace it with. In this case, you can replace both typos at the same time by chaining two calls to .str.replace() .

Using the above information, go ahead and see if you can correct the typos in the bond_actor column. As before, you can achieve this with a lambda.

Fixing The Actors' Names

In the updated code, you’ve fixed the actors’ names:
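
A sketch of the new bond_actor lambda, again inside the same .assign() call:

```python
data = (
    james_bond_data.rename(columns=new_column_names)
    .combine_first(the_fixes)
    .assign(
        # ...the lambdas from the earlier steps stay here unchanged...
        bond_actor=lambda data: (
            data["bond_actor"]
            .str.replace("Shawn", "Sean")
            .str.replace("MOORE", "Moore")
        ),
    )
)
```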

As you can see, a new lambda updates both typos in the bond_actor column. The first .str.replace() changes all instances of Shawn to Sean, while the second one fixes the MOORE instances.

You can test that these changes have been made by rerunning the .value_counts() method once more.

As an exercise, why don’t you analyze the car manufacturers’ names and see if you can spot any typos? If there are any, use the techniques shown above to fix them.

Checking The Car Names For Typos

Once again, you use value_counts() to analyze the car_manufacturer column:
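
For example:

```python
data["car_manufacturer"].value_counts()
```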

This time, there are two rogue entries for a car named Astin Martin . These are incorrect and need fixing:
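
A sketch of the fix, described in the next paragraph, adds one more lambda to the same .assign() call:

```python
data = (
    james_bond_data.rename(columns=new_column_names)
    .combine_first(the_fixes)
    .assign(
        # ...the lambdas from the earlier steps stay here unchanged...
        car_manufacturer=lambda data: (
            data["car_manufacturer"].str.replace("Astin", "Aston")
        ),
    )
)
```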

To fix the typo, you use the same technique as earlier, only this time you replace "Astin" with "Aston" in the car_manufacturer column. A new lambda achieves this.

Before you go any further, you should, of course, rerun the .value_counts() method against your data to validate your updates.

With the typos fixed, next you’ll see if you can find any suspicious-looking data.

The next check that you’ll perform is verifying that the numerical data is in the correct range. This again requires careful thought because an unusually large or small data point could be a genuine outlier, so you may need to recheck your source. Some values, however, may simply be incorrect.

In this example, you’ll investigate the martinis that Bond consumed in each movie, as well as the length of each movie, to make sure their values are within a sensible range. There are several ways that you could analyze numerical data to check for outliers. A quick way is to use the .describe() method:
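
For example, calling it on the two columns of interest:

```python
data[["film_length", "martinis_consumed"]].describe()
```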

When you use .describe() on either a pandas Series or a DataFrame, it gives you a set of statistical measures relating to the Series or DataFrame’s numerical values. As you can see, .describe() has given you a range of statistical data relating to each of the two columns of the DataFrame that you called it on. These also reveal some probable errors.

Looking at the film_length column, the quartile figures reveal that most movies are around 130 minutes long, yet the mean is almost 170 minutes. The mean has been skewed by the maximum, which is a whopping 1200 minutes.

Depending on the nature of the analysis, you’d probably want to recheck your source to find out if this maximum value is indeed incorrect. In this scenario, having a movie lasting twenty hours clearly indicates a typo. After verifying your original dataset , you find 120 to be the correct value.

Turning next to the number of martinis that Bond drank during each movie, the minimum figure of -6 simply doesn’t make sense. As before, you recheck the source and find that this should be 6.

You can fix both of these errors using the .replace() method introduced earlier. For example data["martinis_consumed"].replace(-6, 6) will update the martini figures, and you can use a similar technique for the film duration. As before, you can do both using lambdas within .assign() , so why not give it a try?

You can reveal the updated cleansing code, including these latest additions, below:

Fixing Invalid Outliers

Now you’ve added in the two additional lambdas:
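
A sketch of both fixes; the 1200 replacement is folded into the existing film_length lambda, and martinis_consumed gets a lambda of its own:

```python
data = (
    james_bond_data.rename(columns=new_column_names)
    .combine_first(the_fixes)
    .assign(
        # ...the other lambdas from the earlier steps stay here unchanged...
        film_length=lambda data: (
            data["film_length"]
            .str.removesuffix("mins")
            .astype("Int64")
            .replace(1200, 120)
        ),
        martinis_consumed=lambda data: (
            data["martinis_consumed"].replace(-6, 6)
        ),
    )
)
```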

Earlier, you used a lambda to remove a "mins" string from the film_length column entries. You can’t, therefore, create a separate lambda within the same .assign() to replace the incorrect film length, because doing so would mean passing a second parameter with the same name into .assign(). This, of course, is illegal.

However, there’s an alternative solution that requires some lateral thinking. You could’ve created a separate .assign() call, but it’s probably more readable to keep all changes to the same column in the same .assign().

To perform the replacement, you adjusted the existing film_length lambda to replace the invalid 1200 with 120. You fixed the martinis_consumed column with a new lambda that replaces -6 with 6.

As ever, you should test these updates once more using the describe() method. You should now see sensible values for the maximum film_length and the minimum martinis_consumed columns.

Your data is almost cleansed. There is just one more thing to check and fix, and that’s the possibility that drinking too many vodka martinis has left you seeing double.

The final issue that you’ll check for is whether any of the rows of data have been duplicated. It’s usually good practice to leave this step until last because it’s possible that your earlier changes could cause duplicate data to occur. This most commonly happens when you fix strings within data because often it’s different variants of the same string that cause unwanted duplicates to occur in the first place.

The easiest way to detect duplicates is to use the DataFrame’s .duplicated() method:
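
Combined with .loc[], as described below, the check looks like this:

```python
data.loc[data.duplicated(keep=False)]
```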

By setting keep=False , the .duplicated() method will return a Boolean Series with duplicate rows marked as True . As you saw earlier, when you pass this Boolean Series into data.loc[] , the duplicate DataFrame rows are revealed to you. In your data, two rows have been duplicated. So your next step is to get rid of one instance of each row.

To get rid of duplicate rows, you call the .drop_duplicates() method on the data DataFrame that you’re building up. As its name suggests, this method will look through the DataFrame and remove any duplicate rows that it finds, leaving only one. To reindex the DataFrame sequentially, you set ignore_index=True .

See if you can figure out where to insert .drop_duplicates() in your code. You don’t use a lambda, but duplicates are removed after the call to .assign() has finished. Test your effort to make sure that you’ve indeed removed the duplicates.

Removing Duplicates

In the updated code, you’ve dropped the duplicate row:
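
Pulling everything together, a sketch of the complete cleansing pipeline (still using the illustrative new_column_names and the_fixes from earlier) might look like this:

```python
data = (
    james_bond_data.rename(columns=new_column_names)
    .combine_first(the_fixes)
    .assign(
        income_usa=lambda data: (
            data["income_usa"].replace("[$,]", "", regex=True).astype("Float64")
        ),
        income_world=lambda data: (
            data["income_world"].replace("[$,]", "", regex=True).astype("Float64")
        ),
        movie_budget=lambda data: (
            data["movie_budget"].replace("[$,]", "", regex=True).astype("Float64")
            * 1000
        ),
        film_length=lambda data: (
            data["film_length"]
            .str.removesuffix("mins")
            .astype("Int64")
            .replace(1200, 120)
        ),
        release_date=lambda data: pd.to_datetime(
            data["release_date"], format="%B, %Y"
        ),
        release_year=lambda data: data["release_date"].dt.year,
        bond_actor=lambda data: (
            data["bond_actor"]
            .str.replace("Shawn", "Sean")
            .str.replace("MOORE", "Moore")
        ),
        car_manufacturer=lambda data: (
            data["car_manufacturer"].str.replace("Astin", "Aston")
        ),
        martinis_consumed=lambda data: (
            data["martinis_consumed"].replace(-6, 6)
        ),
    )
    .drop_duplicates(ignore_index=True)
)
```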

As you can see, you’ve placed .drop_duplicates() at the end of the chain, after the .assign() method has finished adjusting and creating columns.

If you rerun data.loc[data.duplicated(keep=False)] , it won’t return any rows. Each row is now unique.

You’ve now successfully identified several flaws with your data and used various techniques to cleanse them. Keep in mind that if your analysis highlights new flaws, then you may need to revisit the cleansing phase once more. On this occasion, this isn’t necessary.

With your data suitably cleansed, you might be tempted to jump in and start your analysis. But just before you start, don’t forget that other very important task that you still have left to do!

As part of your training, you’ve learned that you should save your cleansed DataFrame to a fresh file. Other analysts can then use this file without having to recleanse the same issues, while the original file remains available in case they need it for reference. The .to_csv() method allows you to perform this good practice:
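
A one-line sketch of that step:

```python
data.to_csv("james_bond_data_cleansed.csv", index=False)
```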

You write your cleansed DataFrame out to a CSV file named james_bond_data_cleansed.csv . By setting index=False , you’re not writing the index, only the pure data. This file will be useful to future analysts.

Note: Earlier you saw how you can source data from a variety of different file types, including Excel spreadsheets and JSON. You won’t be surprised to learn that pandas also allows you to write DataFrame content back out to these files. To do this, you use an appropriate method like .to_excel() or .to_parquet() . The input/output section of the pandas documentation contains the details.

Parquet is a great format for storing your intermediate files because Parquet files are compressed and preserve the different data types that you’re working with. Its biggest disadvantage is that not all tools support it.

Before moving on, take a moment to reflect on what you’ve achieved up to this point. You’ve cleansed your data such that it’s now structurally sound with nothing missing, no duplicates, and no invalid data types or outliers. You’ve also removed spelling errors and inconsistencies between similar data values.

Your great effort so far not only allows you to analyze your data with confidence, but by highlighting these issues, it may be possible for you to revisit the data source and fix those issues there as well. Indeed, you can perhaps prevent similar issues from reappearing in future if you’ve highlighted a flaw in the processes for acquiring the original data.

Data cleansing really is worth putting time and effort into, and you’ve reached an important milestone. Now that you’ve tidied up and stored your data, it’s time to move on to the main part of your mission. It’s time to start meeting your objectives.

Performing Data Analysis Using Python

Data analysis is a huge topic and requires extensive study to master. However, there are four major types of analysis:

Descriptive analysis uses previous data to explain what’s happened in the past . Common examples include identifying sales trends or your customers’ behaviors.

Diagnostic analysis takes things a stage further and tries to find out why those events have happened . For example, why did the sales trend occur? And why exactly did your customers do what they did?

Predictive analysis builds on the previous analysis and uses techniques to try and predict what might happen in the future . For example, what do you expect future sales trends to do? Or what do you expect your customers to do next?

Prescriptive analysis takes everything discovered by the earlier analysis types and uses that information to formulate a future strategy . For example, you might want to implement measures to prevent sales trend predictions from falling or to prevent your customers from purchasing elsewhere.

In this tutorial, you’ll use Python to perform some descriptive analysis techniques on your james_bond_data_cleansed.csv data file to answer the questions that your boss asked earlier. It’s time to dive in and see what you can find.

The purpose of the analysis stage in the workflow diagram that you saw at the start of this tutorial is for you to process your cleansed data and extract insights and relationships from it that are of use to other interested parties. Although it’s probably your conclusions that others will be interested in, if you’re ever challenged on how you arrived at them, you have the source data to support your claims.

To complete the remainder of this tutorial, you’ll need to install both the matplotlib and scikit-learn libraries. You can do this by using python -m pip install matplotlib scikit-learn , but don’t forget to prefix it with ! if you’re using it from within a Jupyter Notebook.

During your analysis, you’ll be drawing some plots of your data. To do this, you’ll use the plotting capabilities of the Matplotlib library.

In addition, you’ll be performing a regression analysis , so you’ll need to use some tools from the scikit-learn library.

Your data contains reviews from both Rotten Tomatoes and IMDb . Your first objective is to find out if there’s a relationship between the Rotten Tomatoes ratings and those from IMDb. To do this, you’ll use a regression analysis to see if the two rating sets are related.

When performing a regression analysis, a good first step is to draw a scatterplot of the two sets of data that you’re analyzing. The shape of this plot gives you a quick visual clue as to the presence of any relationship between them, and if so, whether it’s linear , quadratic or exponential .

The code below sets you up to eventually produce a scatterplot of both ratings sets:
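
A sketch of the setup code, assuming the cleansed file sits alongside your notebook:

```python
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("james_bond_data_cleansed.csv")
```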

To begin with, you import the pandas library to allow you to read your shiny new james_bond_data_cleansed.csv into a DataFrame. You also import the matplotlib.pyplot library, which you’ll use to create the actual scatterplot.

You then use the following code to actually create the scatterplot:
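
A sketch of the plotting code; the title and axis label strings are illustrative:

```python
fig, ax = plt.subplots()
ax.scatter(data["imdb"], data["rotten_tomatoes"])
ax.set_title("Scatter Plot of Ratings")
ax.set_xlabel("Average IMDb Rating")
ax.set_ylabel("Average Rotten Tomatoes Rating")
fig.show()
```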

Calling the subplots() function sets up an infrastructure that allows you to add one or more plots into the same figure. This won’t concern you here because you’ll only have one, but its capabilities are worth investigating.

To create the initial scatterplot, you specify the horizontal Series as the imdb column of your data and the vertical Series as the rotten_tomatoes column. The order is arbitrary here because it’s the relationship between them that interests you.

To help readers understand your plot, you next give your plot a title and then provide sensible labels for both axes. The fig.show() code is optional in a Jupyter Notebook but may be needed in other environments to display your plot.

In Jupyter Notebooks, your plot should look like this:

scatterplot of both sets of rating data

The scatterplot shows a distinct slope upwards from left to right. This means that as one set of ratings increases, the other set does as well. To dig deeper and find a mathematical relationship that will allow you to estimate one set based on the other, you need to perform a regression analysis. This means that you need to expand your previous code as follows:
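
A sketch of the additional setup:

```python
from sklearn.linear_model import LinearRegression

x = data.loc[:, ["imdb"]]
y = data.loc[:, "rotten_tomatoes"]
```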

First of all, you import LinearRegression . As you’ll see shortly, you’ll need this to perform the linear regression calculation. You then create a pandas DataFrame and a pandas Series. Your x is a DataFrame that contains the imdb column’s data, while y is a Series that contains the rotten_tomatoes column’s data. You could potentially regress on several features, which is why x is defined as a DataFrame with a list of columns.

You now have everything you need to perform the linear regression calculations:
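
A sketch of the calculation, with the string formatting chosen here only for illustration:

```python
model = LinearRegression()
model.fit(x, y)

r_squared = f"R-Squared: {model.score(x, y):.2f}"
best_fit = f"y = {model.coef_[0]:.4f}x{model.intercept_:+.4f}"
y_pred = model.predict(x)
```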

First of all, you create a LinearRegression instance and pass in both data sets to it using .fit() . This will perform the actual calculations for you. By default, it uses ordinary least squares (OLS) to do so.

Once you’ve created and populated the LinearRegression instance, its .score() method calculates the R-squared, or coefficient of determination, value. This measures how well the best-fit line explains the actual values. In your analysis, the R-squared value of 0.79 means that the regression accounts for roughly 79 percent of the variation in the Rotten Tomatoes ratings. You convert it to a string named r_squared for plotting later, rounding the value for neatness.

To construct a string of the equation of the best-fit straight line, you use your LinearRegression object’s .coef_ attribute to get its gradient, and its .intercept_ attribute to find the y -intercept. The equation is stored in a variable named best_fit so that you can plot it later.

Note: You may be wondering why both model.coef_ and model.intercept_ have underscore suffixes. This is a scikit-learn convention to indicate attributes that contain values estimated from the data.

To get the various y coordinates that the model predicts for each given value of x , you use your model’s .predict() method and pass it the x values. You store these values in a variable named y_pred , again to allow you to plot the line later.

Finally, you produce your scatterplot:
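
A sketch of the final plot; the text coordinates are chosen by eye for this data and may need adjusting:

```python
fig, ax = plt.subplots()
ax.scatter(data["imdb"], data["rotten_tomatoes"])
ax.text(7.25, 5.5, r_squared)  # annotation positions are illustrative
ax.text(7.25, 7.0, best_fit)
ax.plot(x, y_pred, color="red")  # the best-fit line, in red
ax.set_title("Scatter Plot of Ratings")
ax.set_xlabel("Average IMDb Rating")
ax.set_ylabel("Average Rotten Tomatoes Rating")
fig.show()
```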

The additional lines add the best-fit line and its annotations onto the scatterplot. The .text() calls place the r_squared and best_fit strings at the coordinates passed to them, while the .plot() call adds the best-fit line, in red, to the scatterplot. As before, fig.show() isn’t needed in a Jupyter Notebook.

The Jupyter Notebook result of all of this is shown below:

screenshot of a scatterplot with linear regression line

Now that you’ve completed your regression analysis, you can use its equation to estimate one rating from the other, with the model explaining approximately 79 percent of the variation between them.

Your data includes information on the running times of each of the different Bond movies. Your second objective asks you to find out if there are any insights to glean from analyzing the lengths of the movies. To do this, you’ll create a bar plot of movie timings and see if it reveals anything interesting:
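
A sketch of the bar plot code; the title and label text are illustrative, and .sort_index() is added so the bins appear in ascending order of film length:

```python
fig, ax = plt.subplots()
length_counts = data["film_length"].value_counts(bins=7).sort_index()
length_counts.plot.bar(
    ax=ax,
    title="Film Length Distribution",
    xlabel="Film Length (minutes)",
    ylabel="Count",
)
fig.show()
```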

This time, you create a bar plot using the plotting capabilities of pandas. While these aren’t as extensive as Matplotlib’s, they do use some of Matplotlib’s underlying functionality. You create a Series consisting of the data from the film_length column of your data. You then use .value_counts() to create a Series containing the count of each movie length. Finally, you group the lengths into seven ranges by passing in bins=7.

Once you’ve created the Series, you can quickly plot it using .plot.bar() . This allows you to define a title and axis labels for your plot as shown. The resulting plot reveals a very common statistical distribution:

screenshot showing a normal distribution of movie lengths

As you can see from the plot, the movie lengths resemble a normal distribution . The mean movie time sits between 122 minutes and 130 minutes, a little over two hours.

Note that neither the fig, ax = plt.subplots() nor the fig.show() code is necessary in a Jupyter Notebook, but some other environments may need them to display the plot.

You can find more specific statistical values if you wish:
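
For example:

```python
data["film_length"].agg(["min", "max", "mean", "std"])
```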

Each pandas data Series has a useful .agg() method that allows you to pass in a list of functions. Each of these is then applied to the data in the Series. As you can see, the mean is indeed in the 122 to 130 minutes range. The standard deviation is small, meaning there isn’t much spread in the range of movie times. The minimum and maximum are 106 minutes and 163 minutes, respectively.

In this final analysis, you’ve been asked to investigate whether or not there’s any relationship between a movie’s user rating and the number of kills that Bond achieves in it.

You decide to proceed along similar lines as you did when you analyzed the relationship between the two different ratings sets. You start with a scatterplot:
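
A sketch of the plot; the kills column name used here (bond_kills) and the label strings are assumptions, so substitute the actual column name from your DataFrame:

```python
fig, ax = plt.subplots()
ax.scatter(data["imdb"], data["bond_kills"])  # "bond_kills" is an assumed column name
ax.set_title("Scatter Plot of Kills vs Ratings")
ax.set_xlabel("Average IMDb Rating")
ax.set_ylabel("Kills by Bond")
fig.show()
```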

The code is virtually identical to what you used in your earlier scatterplot. You decided to use the IMDb data in the analysis, but you could’ve used the Rotten Tomatoes data instead. You’ve already established that there’s a close relationship between the two, so it doesn’t matter which you choose.

This time, when you draw the scatterplot, it looks like this:

screenshot showing scatterplot of kills vs movie ratings

As you can see, the scatterplot shows that the data points are scattered with no discernible pattern. This suggests that there’s no relationship between a movie’s rating and the number of Bond kills. Whether the victim wound up on the wrong side of a Walther PPK, got sucked out of a plane, or was left to drift off into space, Bond movie fans don’t seem to care much about the number of bad guys that Bond eliminates.

When analyzing data, it’s important to realize that you may not always find something useful. Indeed, one of the pitfalls that you must avoid when performing data analysis is introducing your own bias into your data before analyzing it, and then using it to justify your preconceived conclusions. Sometimes there’s simply nothing to conclude.

At this point, you’re happy with your findings. It’s time for you to communicate them back to your bosses.

Once your data modeling is complete and you’ve obtained useful information from it, the next stage is to communicate your findings to other interested parties. After all, they’re not For Your Eyes Only . You could do this using a report or presentation. You’ll likely discuss your data sources and analysis methodology before stating your conclusions. Having the data and methodology behind your conclusions gives them authority.

You may find that once you’ve presented your findings, questions will come up that require future analysis. Once more, you may need to set additional objectives and work through the entire workflow process to resolve these new points. Look back at the diagram, and you’ll see that there’s a possible cyclic, as well as iterative, nature to a data analysis workflow.

In some cases, you may reuse your analysis methods. If so, you may consider writing some scripts that read future versions of the data, cleanse it, and analyze it in the same way that you just have. This will allow future results to be compared to yours and will add scalability to your efforts. By repeating your analysis in the future, you can monitor your original findings to see how well they stand up in the face of future data.

Alternatively, you may discover a flaw in your methodology and need to reanalyze your data differently. Again, the workflow diagram notes this possibility as well.

As you analyzed the dataset, you may have noticed that one of the James Bond movies is missing. Take a look back and see if you can figure out which one it is. You can reveal the answer below, but no peeking! Also, if you run data["bond_actor"].value_counts(), you may be surprised to find that Sean Connery played Bond only six times to Roger Moore’s seven. Or did he?

Bond Is Back!

The dataset that you’re using in this tutorial doesn’t include Never Say Never Again . This movie wasn’t considered an official part of the James Bond franchise. However, it did star Sean Connery in the title role. So technically, both Connery and Moore have played Bond 007 times each.

That’s it, your mission is complete. M is very pleased. As a reward, he instructs Q to give you a pen that turns into a helicopter. Always a handy tool to have for tracking down future data for analysis.

You’ve now gained experience in using a data analysis workflow to analyze some data and draw conclusions from your findings. You understand the main stages in a data analysis workflow and the reasons for following them. As you learn more advanced analysis techniques in the future, you can still use the key skills that you learned here to make sure your future data analysis projects progress thoroughly and efficiently.

In this tutorial, you’ve learned:

  • The importance of a data analysis workflow
  • The purpose of the main stages in a data analysis workflow
  • Common techniques for cleansing data
  • How to use some common data analysis methods to meet objectives
  • How to display the results of a data analysis graphically.

You should consider learning more data analysis techniques and practicing your skills using them. If you do any further analysis on the James Bond data used here, then feel free to share your interesting findings. In fact, try finding something to share that’s shocking. Positively shocking.

