Getuplearn

Data Representation in Computer: Number Systems, Characters, Audio, Image and Video

  • Post author: Anuj Kumar
  • Post published: 16 July 2021
  • Post category: Computer Science

What is Data Representation in Computer?

A computer uses a fixed number of bits to represent a piece of data which could be a number, a character, image, sound, video, etc. Data representation is the method used internally to represent data in a computer. Let us see how various types of data can be represented in computer memory.

Before discussing data representation of numbers, let us see what a number system is.

Number Systems

A number system is a technique for representing numbers in computer system architecture. Every value that you save to or read from computer memory is expressed in a defined number system.

A number is a mathematical object used to count, label, and measure. A number system is a systematic way to represent numbers. The number system we use in our day-to-day life is the decimal number system that uses 10 symbols or digits.

The number 289 is pronounced as two hundred and eighty-nine and it consists of the symbols 2, 8, and 9. Similarly, there are other number systems. Each has its own symbols and method for constructing a number.

A number system has a unique base, which depends upon the number of symbols. The number of symbols used in a number system is called the base or radix of a number system.

Let us discuss some of the number systems. Computer architecture supports the following number systems:

  • Binary number system
  • Octal number system
  • Decimal number system
  • Hexadecimal number system

The binary number system has only two digits, 0 and 1. Every number (value) is represented using 0 and 1 in this number system. The base of the binary number system is 2 because it has only two digits.

The octal number system has only eight (8) digits, from 0 to 7. Every number (value) is represented using 0, 1, 2, 3, 4, 5, 6 and 7 in this number system. The base of the octal number system is 8 because it has only 8 digits.

The decimal number system has only ten (10) digits, from 0 to 9. Every number (value) is represented using 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9 in this number system. The base of the decimal number system is 10 because it has only 10 digits.

The hexadecimal number system has sixteen (16) alphanumeric values, 0 to 9 and A to F. Every number (value) is represented using 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E and F in this number system. The base of the hexadecimal number system is 16 because it has 16 alphanumeric values.

Here A is 10, B is 11, C is 12, D is 13, E is 14 and F is 15.
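As a quick check of the four bases, the sketch below writes one value (289, the number used earlier) in each system with Python's built-in `format`. The helper name `in_all_bases` is made up for this illustration.

```python
def in_all_bases(n: int) -> dict:
    """Return the binary, octal, decimal and hexadecimal spellings of n."""
    return {
        "binary": format(n, "b"),        # base 2
        "octal": format(n, "o"),         # base 8
        "decimal": format(n, "d"),       # base 10
        "hexadecimal": format(n, "X"),   # base 16, upper-case A to F
    }


for name, digits in in_all_bases(289).items():
    print(f"{name:12} {digits}")
```

Going the other way, `int("121", 16)` reads a string of digits in the given base and returns the same value, 289.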

Data Representation of Characters

There are different methods to represent characters. Some of them are discussed below:

The code called ASCII (pronounced "AS-key"), which stands for American Standard Code for Information Interchange, uses 7 bits to represent each character in computer memory. The ASCII representation has been adopted as a standard by the U.S. government and is widely accepted.

A unique integer is assigned to each character. This number, called the ASCII code of that character, is converted into binary for storing in memory. For example, the ASCII code of A is 65, and its 7-bit binary equivalent is 1000001.

Since there are exactly 128 unique combinations of 7 bits, this 7-bit code can represent only 128 characters. Another version, ASCII-8 (also called extended ASCII), uses 8 bits for each character and can represent 256 different characters.

For example, the letter A is represented by 01000001, B by 01000010 and so on. ASCII code is enough to represent all of the standard keyboard characters.
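The mapping from a character to its bit pattern can be sketched in a few lines. This is only an illustration: `ascii_bits` is a hypothetical helper, and it relies on Python's `ord`, which returns the same numbers as ASCII for the first 128 characters.

```python
def ascii_bits(ch: str, width: int = 7) -> str:
    """Return the ASCII code of ch as a binary string of the given width."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} is not a 7-bit ASCII character")
    return format(code, f"0{width}b")


print(ord("A"))            # the ASCII code of A
print(ascii_bits("A"))     # 7-bit form
print(ascii_bits("A", 8))  # 8-bit form with a leading zero
print(ascii_bits("B", 8))
```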

EBCDIC stands for Extended Binary Coded Decimal Interchange Code. It is similar to ASCII and is an 8-bit code used in computers manufactured by International Business Machines (IBM). It is capable of encoding 256 characters.

If ASCII-coded data is to be used in a computer that uses EBCDIC representation, it is necessary to transform ASCII code to EBCDIC code. Similarly, if EBCDIC coded data is to be used in an ASCII computer, EBCDIC code has to be transformed to ASCII.

ISCII stands for Indian Standard Code for Information Interchange or Indian Script Code for Information Interchange. It is an encoding scheme for representing various writing systems of India. ISCII uses 8 bits for data representation.

It was developed by a standardization committee under the Department of Electronics during 1986-88 and adopted by the Bureau of Indian Standards (BIS). Nowadays ISCII has been replaced by Unicode.

Using 8-bit ASCII we can represent only 256 characters. This is not enough to represent all the characters of the world's written languages and other symbols. Unicode was developed to resolve this problem. It aims to provide a standard character encoding scheme that is universal and efficient.

It provides a unique number for every character, regardless of language and platform. Unicode originally used 16 bits, which can represent up to 65,536 characters. It is maintained by a non-profit organization called the Unicode Consortium.

The Consortium first published version 1.0.0 in 1991 and continues to develop standards based on that original work. Nowadays Unicode uses more than 16 bits and hence it can represent more characters. Unicode can represent characters in almost all written languages of the world.
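To see the "unique number for every character" idea in practice, the sketch below prints the Unicode number (code point) of three characters and how many bytes each one takes when stored. It assumes UTF-8 as the storage encoding, which is one common choice and is not discussed in the text above; `describe` is a hypothetical helper.

```python
def describe(ch: str) -> tuple:
    """Return (code point, number of bytes in UTF-8) for one character."""
    return ord(ch), len(ch.encode("utf-8"))


samples = {
    "Latin A": "A",
    "Devanagari A": "\u0905",
    "Emoji": "\U0001F600",
}

for name, ch in samples.items():
    code_point, size = describe(ch)
    print(f"{name}: U+{code_point:04X}, {size} byte(s) in UTF-8")
```

The emoji's code point is larger than 65,535, so it could not be numbered with 16 bits alone.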

Data Representation of Audio, Image and Video

In most cases, we may have to represent and process data other than numbers and characters. This may include audio data, images, and videos. We can see that like numbers and characters, the audio, image, and video data also carry information.

We will see different file formats for storing sound, image, and video.

Multimedia data such as audio, image, and video are stored in different types of files. The variety of file formats is due to the fact that there are quite a few approaches to compressing the data and a number of different ways of packaging the data.

For example, images are most popularly stored in the Joint Photographic Experts Group (JPEG) file format. An image file consists of two parts: header information and image data. Information such as the name of the file, size, modified date, file format, etc. is stored in the header part.

The intensity value of all pixels is stored in the data part of the file. The data can be stored uncompressed or compressed to reduce the file size. Normally, the image data is stored in compressed form. Let us understand what compression is.

Take a simple example of a pure black image of size 400×400 pixels. We can repeat the information black, black, …, black in all 160,000 (400×400) pixels. This is the uncompressed form, while in the compressed form black is stored only once, along with the information to repeat it 160,000 times.
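The "store the value once plus a repeat count" idea is known as run-length encoding. The sketch below is a minimal illustration of that one technique, with made-up function names; it is not how JPEG or any specific file format works internally.

```python
def rle_encode(pixels):
    """Collapse consecutive equal values into (value, count) pairs."""
    runs = []
    for p in pixels:
        if runs and runs[-1][0] == p:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([p, 1])       # start a new run
    return [(value, count) for value, count in runs]


def rle_decode(runs):
    """Expand (value, count) pairs back into the full pixel list."""
    pixels = []
    for value, count in runs:
        pixels.extend([value] * count)
    return pixels


image = ["black"] * (400 * 400)       # the pure black 400x400 image
compressed = rle_encode(image)
print(len(image), "pixels stored as", compressed)
```

Decoding the pairs gives back the original pixel list, so nothing is lost. Images with few repeated runs gain little from this scheme.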

Numerous such techniques are used to achieve compression. Depending on the application, images are stored in various file formats such as bitmap file format (BMP), Tagged Image File Format (TIFF), Graphics Interchange Format (GIF), and Portable Network Graphics (PNG).

What we said about header information and compression also applies to audio and video files. Digital audio data can be stored in different file formats like WAV, MP3, MIDI, AIFF, etc. An audio file format, sometimes referred to as the 'container format', describes how digital audio data is stored.

For example, the WAV file format typically contains uncompressed sound, and MP3 files typically contain compressed audio data. Synthesized music data is stored in MIDI (Musical Instrument Digital Interface) files.

Similarly, video is also stored in different file formats such as AVI (Audio Video Interleave), a file format designed to store both audio and video data in a standard package that allows synchronous audio-with-video playback, as well as MP4, MPEG-2, WMV, etc.

Data representation

Computers use binary - the digits 0 and 1 - to store data. A binary digit, or bit, is the smallest unit of data in computing. It is represented by a 0 or a 1. Binary numbers are made up of binary digits (bits), eg the binary number 1001. The circuits in a computer's processor are made up of billions of transistors. A transistor is a tiny switch that is activated by the electronic signals it receives. The digits 1 and 0 used in binary reflect the on and off states of a transistor. Computer programs are sets of instructions. Each instruction is translated into machine code - simple binary codes that activate the CPU. Programmers write computer code and this is converted by a translator into binary instructions that the processor can execute. All software, music, documents, and any other information that is processed by a computer, is also stored using binary. [1]

How a file is stored on a computer

How an image is stored in a computer

The way in which data is represented in the computer

To include strings, integers, characters and colours. This should include considering the space taken by data, for instance the relation between the hexadecimal representation of colours and the number of colours available [3].

This helpful material is used with gratitude from a computer science wiki under a Creative Commons Attribution 3.0 License [4].

Sound

  • Let's look at an oscilloscope
  • The BBC has an excellent article on how computers represent sound

Standards

  • Outline the way in which data is represented in the computer.

References

  • [1] http://www.bbc.co.uk/education/guides/zwsbwmn/revision/1
  • [2] https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
  • [3] IBO Computer Science Guide, First exams 2014
  • [4] https://compsci2014.wikispaces.com/2.1.10+Outline+the+way+in+which+data+is+represented+in+the+computer

Data Representation 5.3. Numbers

In this section, we will look at how computers represent numbers. To begin with, we'll revise how the base 10 number system that we use every day works, and then look at binary, which is base 2. After that, we'll look at some other characteristics of numbers that computers must deal with, such as negative numbers and numbers with decimal points.

The number system that humans normally use is in base 10 (also known as decimal). It's worth revising quickly, because binary numbers use the same ideas as decimal numbers, just with fewer digits!

In decimal, the value of each digit in a number depends on its place in the number. For example, in $123, the 3 represents $3, whereas the 1 represents $100. Each place value in a number is worth 10 times more than the place value to its right, i.e. there are the "ones", the "tens", the "hundreds", the "thousands", the "ten thousands", the "hundred thousands", the "millions", and so on. Also, there are 10 different digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9) that can be at each of those place values.

If you were only able to use one digit to represent a number, then the largest number would be 9. After that, you need a second digit, which goes to the left, giving you the next ten numbers (10, 11, 12... 19). It's because we have 10 digits that each one is worth 10 times as much as the one to its right.

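You may have encountered different ways of expressing numbers using "expanded form". For example, if you want to write the number 90328 in expanded form you might have written it as:

90328 = 90000 + 300 + 20 + 8

A more sophisticated way of writing it is:

90328 = (9 × 10000) + (0 × 1000) + (3 × 100) + (2 × 10) + (8 × 1)

If you've learnt about exponents, you could write it as:

90328 = (9 × 10^4) + (0 × 10^3) + (3 × 10^2) + (2 × 10^1) + (8 × 10^0)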

The key ideas to notice from this are:

  • Decimal has 10 digits – 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.
  • A place is the place in the number that a digit is, i.e. ones, tens, hundreds, thousands, and so on. For example, in the number 90328, 3 is in the "hundreds" place, 2 is in the "tens" place, and 9 is in the "ten thousands" place.
  • Numbers are made with a sequence of digits.
  • The right-most digit is the one that's worth the least (in the "ones" place).
  • The left-most digit is the one that's worth the most.
  • Because we have 10 digits, the digit at each place is worth 10 times as much as the one immediately to the right of it.

All this probably sounds really obvious, but it is worth thinking about consciously, because binary numbers have the same properties.
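The place-value ideas above can be sketched as a small function that splits a number into digits and the worth of each place. The name `expanded_form` and the `base` parameter are illustrative choices; the same code handles base 2 simply by passing a different base.

```python
def expanded_form(n: int, base: int = 10) -> list:
    """Return (digit, place_value) pairs for n >= 0, most significant first."""
    digits = []
    while n > 0:
        n, d = divmod(n, base)        # peel off the right-most digit
        digits.append(d)
    digits = digits[::-1] or [0]      # zero still has one digit
    top = len(digits) - 1
    return [(d, base ** (top - i)) for i, d in enumerate(digits)]


pairs = expanded_form(90328)
print(" + ".join(f"({d} x {place})" for d, place in pairs))
print(sum(d * place for d, place in pairs))   # adds back up to 90328
```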

As discussed earlier, computers can only store information using bits, which have 2 possible states. This means that they cannot represent base 10 numbers using digits 0 to 9, the way we write down numbers in decimal. Instead, they must represent numbers using just 2 digits – 0 and 1.

Binary works in a very similar way to decimal, even though it might not initially seem that way. Because there are only 2 digits, this means that each digit is 2 times the value of the one immediately to the right.

The base 10 (decimal) system is sometimes called denary, which is more consistent with the name binary for the base 2 system. The word "denary" also refers to the Roman denarius coin, which was worth ten asses (an "as" was a copper or bronze coin). The term "denary" seems to be used mainly in the UK; in the US, Australia and New Zealand the term "decimal" is more common.

The interactive below illustrates how this binary number system represents numbers. Have a play around with it to see what patterns you can see.

Find the representations of 4, 7, 12, and 57 using the interactive.

What is the largest number you can make with the interactive? What is the smallest? Is there any integer value in between the biggest and the smallest that you can’t make? Are there any numbers with more than one representation? Why/ why not?

  • 000000 in binary, 0 in decimal is the smallest number.
  • 111111 in binary, 63 in decimal is the largest number.
  • All the integer values (0, 1, 2... 63) in the range can be represented (and there is a unique representation for each one). This is exactly the same as decimal!

You have probably noticed from the interactive that when set to 1, the leftmost bit (the "most significant bit") adds 32 to the total, the next adds 16, and then the rest add 8, 4, 2, and 1 respectively. When set to 0, a bit does not add anything to the total. So the idea is to make numbers by adding some or all of 32, 16, 8, 4, 2, and 1 together, and each of those numbers can only be included once.
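Adding up the place values of the bits that are set to 1 can be written out directly. This is a sketch with a hypothetical function name; Python's built-in `int(bits, 2)` does the same job.

```python
def bits_to_int(bits: str) -> int:
    """Add up the place values (1, 2, 4, 8, ...) of the bits that are 1."""
    total = 0
    place = 1                      # value of the right-most bit
    for bit in reversed(bits):
        if bit == "1":
            total += place
        place *= 2                 # each place is worth twice the one to its right
    return total


for pattern in ("000000", "100000", "010111", "111111"):
    print(pattern, "=", bits_to_int(pattern))
```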

Choose a number less than 61 (perhaps your house number, your age, a friend's age, or the day of the month you were born on), set all the binary digits to zero, and then start with the left-most digit (32), trying out if it should be zero or one. See if you can find a method for converting the number without too much trial and error. Try different numbers until you find a quick way of doing this.

Figure out the binary representation for 23 without using the interactive? What about 4, 0, and 32? Check all your answers using the interactive to verify they are correct.

Can you figure out a systematic approach to counting in binary? That is, start with the number 0, then increment it to 1, then 2, then 3, and so on, all the way up to the highest number that can be made with the 6 bits. Try counting from 0 to 16, and see if you can detect a pattern. Hint: Think about how you add 1 to a number in base 10, e.g. how do you work out 7 + 1, 38 + 1, 19 + 1, 99 + 1, 230899999 + 1, etc.? Can you apply that same idea to binary?

Using your new knowledge of the binary number system, can you figure out a way to count to higher than 10 using your 10 fingers? What is the highest number you can represent using your 10 fingers? What if you included your 10 toes as well (so you have 20 fingers and toes to count with).

A binary number can be incremented by starting at the right and flipping each bit in turn, stopping as soon as a bit flips from 0 to 1 (which will happen on the very first bit half of the time).
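The increment rule can be sketched as follows, working on a string of 0s and 1s. The function name is illustrative, and letting the number grow by one digit when every bit is 1 is a choice made for this sketch (a fixed-width counter would wrap around to zero instead).

```python
def increment(bits: str) -> str:
    """Add 1 by flipping bits from the right until a 0 becomes a 1."""
    out = list(bits)
    for i in range(len(out) - 1, -1, -1):
        if out[i] == "0":
            out[i] = "1"           # a 0 flipped to 1: nothing more to carry
            return "".join(out)
        out[i] = "0"               # a 1 flipped to 0: carry on to the left
    return "1" + "".join(out)      # every bit was 1, so a new digit is needed


value = "00000"
for _ in range(16):                # count from 0 up to 16
    print(value)
    value = increment(value)
print(value)
```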

Counting on fingers in binary means that you can count to 31 on 5 fingers, and 1023 on 10 fingers. There are a number of videos on YouTube of people counting in binary on their fingers. One twist is to wear white gloves with the numbers 16, 8, 4, 2, 1 on the 5 fingers respectively, which makes it easy to work out the value of having certain fingers raised.

The interactive used exactly 6 bits. In practice, we can use as many or as few bits as we need, just like we do with decimal. For example, with 5 bits, the place values would be 16, 8, 4, 2 and 1, so the largest value is 11111 in binary, or 31 in decimal. Representing 14 with 5 bits would give 01110.

Write representations for the following. If it is not possible to do the representation, put "Impossible".

  • Represent 101 with 7 bits
  • Represent 28 with 10 bits
  • Represent 7 with 3 bits
  • Represent 18 with 4 bits
  • Represent 28232 with 16 bits

The answers are (spaces are added to make the answers easier to read, but are not required).

  • 101 with 7 bits is: 110 0101
  • 28 with 10 bits is: 00 0001 1100
  • 7 with 3 bits is: 111
  • 18 with 4 bits is: Impossible (not enough bits to represent value)
  • 28232 with 16 bits is: 0110 1110 0100 1000
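One systematic way to get these answers is to walk through the place values from the largest down, writing a 1 whenever the place value still fits into what is left. The sketch below uses a hypothetical `to_bits` helper and returns the string "Impossible" when the value does not fit, mirroring the exercise.

```python
def to_bits(n: int, width: int) -> str:
    """Represent n with exactly `width` bits, or report that it cannot fit."""
    if n < 0 or n >= 2 ** width:
        return "Impossible"
    bits = ""
    for i in range(width - 1, -1, -1):
        place = 2 ** i             # 2^(width-1) down to 1
        if n >= place:
            bits += "1"
            n -= place
        else:
            bits += "0"
    return bits


for number, width in [(101, 7), (28, 10), (7, 3), (18, 4), (28232, 16)]:
    print(f"{number} with {width} bits: {to_bits(number, width)}")
```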

An important concept with binary numbers is the range of values that can be represented using a given number of bits. When we have 8 bits the binary numbers start to get useful – they can represent values from 0 to 255, so it is enough to store someone's age, the day of the month, and so on.

Groups of 8 bits are so useful that they have their own name: a byte. Computer memory and disk space are usually divided up into bytes, and bigger values are stored using more than one byte. For example, two bytes (16 bits) are enough to store numbers from 0 to 65,535. Four bytes (32 bits) can store numbers up to 4,294,967,295. You can check these numbers by working out the place values of the bits. Every bit that's added will double the range of the number.
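The ranges quoted above follow from one formula: n bits give 2^n different patterns, so the largest unsigned value is 2^n - 1. A short sketch, with an illustrative function name:

```python
def max_unsigned(bits: int) -> int:
    """Largest value that fits in the given number of bits (unsigned)."""
    return 2 ** bits - 1


for bits in (5, 6, 8, 10, 16, 32):
    print(f"{bits:2} bits: 0 to {max_unsigned(bits):,}")
```

Each extra bit doubles the number of patterns, which is why 10 fingers reach 1,023 while 5 fingers reach only 31.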

In practice, computers store numbers with either 16, 32, or 64 bits. This is because these are full numbers of bytes (a byte is 8 bits), and makes it easier for computers to know where each number starts and stops.

Candles on birthday cakes use the base 1 numbering system, where each place is worth 1 more than the one to its right. For example, the number 3 is 111, and 10 is 1111111111. This can cause problems as you get older – if you've ever seen a cake with 100 candles on it, you'll be aware that it's a serious fire hazard.

Luckily it's possible to use binary notation for birthday candles – each candle is either lit or not lit. For example, if you are 18, the binary notation is 10010, and you need 5 candles (with only two of them lit).

There's a video on using binary notation for counting up to 1023 on your hands, as well as using it for birthday cakes.

It's a lot smarter to use binary notation on candles for birthdays as you get older, as you don't need as many candles.

Most of the time binary numbers are stored electronically, and we don't need to worry about making sense of them. But sometimes it's useful to be able to write down and share numbers, such as the unique identifier assigned to each digital device (MAC address), or the colours specified in an HTML page.

Writing out long binary numbers is tedious – for example, suppose you need to copy down the 16-bit number 0101001110010001. A widely used shortcut is to break the number up into 4-bit groups (in this case, 0101 0011 1001 0001), and then write down the digit that each group represents (giving 5391). There's just one small problem: each group of 4 bits can go up to 1111, which is 15, and the digits only go up to 9.

The solution is simple: we introduce symbols for the digits from 1010 (10) to 1111 (15), which are just the letters A to F. So, for example, the 16-bit binary number 1011 1000 1110 0001 can be written more concisely as B8E1. The "B" represents the binary 1011, which is the decimal number 11, and the E represents binary 1110, which is decimal 14.

Because we now have 16 digits, this representation is base 16, and known as hexadecimal (or hex for short). Converting between binary and hexadecimal is very simple, and that's why hexadecimal is a very common way of writing down large binary numbers.

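Here's a full table of all the 4-bit numbers and their hexadecimal digit equivalent:

Binary  Hexadecimal
0000    0
0001    1
0010    2
0011    3
0100    4
0101    5
0110    6
0111    7
1000    8
1001    9
1010    A
1011    B
1100    C
1101    D
1110    E
1111    F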

For example, the largest 8-bit binary number is 11111111. This can be written as FF in hexadecimal. Both of those representations mean 255 in our conventional decimal system (you can check that by converting the binary number to decimal).
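Because each hexadecimal digit stands for exactly 4 bits, conversion in either direction is a matter of handling one group at a time. The sketch below shows this with two hypothetical helpers; it pads short inputs with leading zeros, which is a choice made here rather than something the text specifies.

```python
HEX_DIGITS = "0123456789ABCDEF"


def bits_to_hex(bits: str) -> str:
    """Convert a binary string to hexadecimal, one 4-bit group at a time."""
    padded = bits.zfill((len(bits) + 3) // 4 * 4)    # pad to a multiple of 4
    groups = [padded[i:i + 4] for i in range(0, len(padded), 4)]
    return "".join(HEX_DIGITS[int(group, 2)] for group in groups)


def hex_to_bits(hex_str: str) -> str:
    """Convert hexadecimal back to binary, 4 bits per digit."""
    return "".join(format(HEX_DIGITS.index(h), "04b") for h in hex_str.upper())


print(bits_to_hex("0101001110010001"))
print(bits_to_hex("1011100011100001"))
print(bits_to_hex("11111111"), int("11111111", 2))
print(hex_to_bits("B8E1"))
```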

Which notation you use will depend on the situation; binary numbers represent what is actually stored, but can be confusing to read and write; hexadecimal numbers are a good shorthand of the binary; and decimal numbers are used if you're trying to understand the meaning of the number or doing normal math. All three are widely used in computer science.

It is important to remember though, that computers only represent numbers using binary. They cannot represent numbers directly in decimal or hexadecimal.

A common place that numbers are stored on computers is in spreadsheets or databases. These can be entered either through a spreadsheet program or database program, through a program you or somebody else wrote, or through additional hardware such as sensors, collecting data such as temperatures, air pressure, or ground shaking.

Some of the things that we might think of as numbers, such as the telephone number (03) 555-1234, aren't actually stored as numbers, as they contain important characters (like dashes and spaces) as well as the leading 0 which would be lost if it was stored as a number (the above number would come out as 35551234, which isn't quite right). These are stored as text, which is discussed in the next section.

On the other hand, things that don't look like a number (such as "30 January 2014") are often stored using a value that is converted to a format that is meaningful to the reader (try typing two dates into Excel, and then subtract one from the other – the result is a useful number). In the underlying representation, a number is used. Program code is used to translate the underlying representation into a meaningful date on the user interface.

The difference between two dates in Excel is the number of days between them; the date itself (as in many systems) is stored as the amount of time elapsed since a fixed date (such as 1 January 1900). You can test this by typing a date like "1 January 1850" – chances are that it won't be formatted as a normal date. Likewise, a date sufficiently in the future may behave strangely due to the limited number of bits available to store the date.
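The idea of storing a date as a count of days from a fixed starting date can be sketched with Python's `datetime` module. The starting date of 1 January 1900 is taken from the example in the text; this is an illustration of the idea only and does not reproduce Excel's exact numbering.

```python
from datetime import date, timedelta

EPOCH = date(1900, 1, 1)   # assumed fixed starting date for this sketch


def to_day_number(day: date) -> int:
    """Number of days elapsed since EPOCH (negative for earlier dates)."""
    return (day - EPOCH).days


def from_day_number(n: int) -> date:
    """Translate a stored day number back into a calendar date."""
    return EPOCH + timedelta(days=n)


a = to_day_number(date(2014, 1, 1))
b = to_day_number(date(2014, 1, 30))
print(b - a)                               # subtracting dates is plain arithmetic
print(from_day_number(b))                  # back to a readable date
print(to_day_number(date(1850, 1, 1)))     # before the starting date: negative
```

A system that only allows non-negative day numbers has no way to store the 1850 date, which is one plausible reason such a date is not treated as a normal date.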

Numbers are used to store things as diverse as dates, student marks, prices, statistics, scientific readings, sizes and dimensions of graphics.

The following issues need to be considered when storing numbers on a computer:

  • What range of numbers should be able to be represented?
  • How do we handle negative numbers?
  • How do we handle decimal points or fractions?

In practice, we need to allocate a fixed number of bits to a number, before we know how big the number is. This is often 32 bits or 64 bits, although can be set to 16 bits, or even 128 bits, if needed. This is because a computer has no way of knowing where a number starts and ends, otherwise.

Any system that stores numbers needs to make a compromise between the number of bits allocated to store the number, and the range of values that can be stored.

In some systems (like the Java and C programming languages and databases) it's possible to specify how accurately numbers should be stored; in others it is fixed in advance (such as in spreadsheets).

Some are able to work with arbitrarily large numbers by increasing the space used to store them as necessary (e.g. integers in the Python programming language). However, it is likely that these are still working with a multiple of 32 bits (e.g. 64 bits, 96 bits, 128 bits, 160 bits, etc). Once the number is too big to fit in 32 bits, the computer would reallocate it to have up to 64 bits.

In some programming languages there isn't a check for when a number gets too big (overflows). For example, if you have an 8-bit number using two's complement, then 01111111 is the largest number (127), and if you add one without checking, it will change to 10000000, which happens to be the number -128. (Don't worry about two's complement too much, it's covered later in this section.) This can cause serious problems if not checked for, and is behind a variant of the Y2K problem, called the Year 2038 problem, involving a 32-bit number overflowing for dates on Tuesday, 19 January 2038.
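Python integers do not overflow, so the sketch below imitates an 8-bit two's complement value by keeping only the low 8 bits. The second part assumes that the 32-bit number in question is a signed count of seconds since 1 January 1970 (Unix time), which the text does not spell out; `wrap_int8` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def wrap_int8(n: int) -> int:
    """Interpret the low 8 bits of n as an 8-bit two's complement number."""
    n &= 0xFF                          # keep only 8 bits, as the hardware would
    return n - 256 if n >= 128 else n  # patterns starting with 1 are negative


print(wrap_int8(127))                  # largest 8-bit value
print(wrap_int8(127 + 1))              # one more wraps around to -128

# Last moment a signed 32-bit seconds counter can hold (assumed Unix time).
last_moment = datetime(1970, 1, 1) + timedelta(seconds=2 ** 31 - 1)
print(last_moment, last_moment.strftime("%A"))
```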

On tiny computers, such as those embedded inside your car, washing machine, or a tiny sensor that is barely larger than a grain of sand, we might need to specify more precisely how big a number needs to be. While computers prefer to work with chunks of 32 bits, we could write a program (as an example for an earthquake sensor) that knows the first 7 bits are the latitude, the next 7 bits are the longitude, the next 10 bits are the depth, and the last 8 bits are the amount of force.
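The 7 + 7 + 10 + 8 layout adds up to exactly 32 bits, so the four readings fit in one 32-bit word. The sketch below packs and unpacks them with shifts and masks; the function names are made up, and the fields are treated as plain unsigned integers (how a real sensor would scale latitude or depth into those ranges is not covered here).

```python
FIELDS = (("latitude", 7), ("longitude", 7), ("depth", 10), ("force", 8))


def pack(latitude: int, longitude: int, depth: int, force: int) -> int:
    """Pack the four readings into one 32-bit word, first field on the left."""
    word = 0
    for (name, width), value in zip(FIELDS, (latitude, longitude, depth, force)):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value     # make room, then drop the value in
    return word


def unpack(word: int) -> tuple:
    """Split a 32-bit word back into (latitude, longitude, depth, force)."""
    force = word & 0xFF                    # last 8 bits
    depth = (word >> 8) & 0x3FF            # next 10 bits
    longitude = (word >> 18) & 0x7F        # next 7 bits
    latitude = (word >> 25) & 0x7F         # first 7 bits
    return latitude, longitude, depth, force


word = pack(41, 100, 512, 200)
print(format(word, "032b"))
print(unpack(word))
```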

Even on standard computers, it is important to think carefully about the number of bits you will need. For example, if you have a field in your database that could be either "0", "1", "2", or "3" (perhaps representing the four bases that can occur in a DNA sequence), and you used a 64 bit number for every one, that will add up as your database grows. If you have 10,000,000 items in your database, you will have wasted 62 bits for each one (only 2 bits are needed to represent the 4 numbers in the example), a total of 620,000,000 bits, which is 77,500,000 bytes (around 74 MiB). If you are doing this a lot in your database, that will really add up – human DNA has about 3 billion base pairs in it, so it's incredibly wasteful to use more than 2 bits for each one.
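The arithmetic behind the wasted-space figure can be checked with a few lines. The function name is illustrative, and the 3 billion figure for DNA is the approximate count quoted in the text.

```python
def wasted_bytes(items: int, bits_used: int, bits_needed: int) -> float:
    """Bytes wasted when each item uses more bits than it needs."""
    return items * (bits_used - bits_needed) / 8


waste = wasted_bytes(10_000_000, bits_used=64, bits_needed=2)
print(f"{waste:,.0f} bytes, about {waste / 2 ** 20:.1f} MiB")

# Storing roughly 3 billion DNA bases at 2 bits each:
dna_bytes = 3_000_000_000 * 2 / 8
print(f"{dna_bytes:,.0f} bytes for the whole sequence")
```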

And for applications such as Google Maps, which are storing an astronomical amount of data, wasting space is not an option at all!

It is really useful to know roughly how many bits you will need to represent a certain value. Have a think about the following scenarios, and choose the best number of bits out of the options given. You want to ensure that the largest possible number will fit within the number of bits, but you also want to ensure that you are not wasting space.

  • Storing the day of the week: a) 1 bit, b) 4 bits, c) 8 bits, d) 32 bits
  • Storing the number of people in the world: a) 16 bits, b) 32 bits, c) 64 bits, d) 128 bits
  • Storing the number of roads in New Zealand: a) 16 bits, b) 32 bits, c) 64 bits, d) 128 bits
  • Storing the number of stars in the universe: a) 16 bits, b) 32 bits, c) 64 bits, d) 128 bits

The answers, in the same order, are:

  • b (actually, 3 bits is enough as it gives 8 values, but amounts that fit evenly into 8-bit bytes are easier to work with)
  • c (32 bits is slightly too small, so you will need 64 bits)
  • b (This is a challenging question, but one a database designer would have to think about. There's about 94,000 km of roads in New Zealand, so if the average length of a road was 1 km, there would be too many roads for 16 bits. Either way, 32 bits would be a safe bet.)
  • d (Even 64 bits is not enough, but 128 bits is plenty! Remember that 128 bits isn't twice the range of 64 bits.)
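A rough way to reason about these questions is to find the minimum number of bits for the largest expected value and then round up to a convenient size. The sketch below does this with Python's `int.bit_length`; the function names, the list of "standard" sizes, and the figure of 8 billion people are assumptions made for illustration.

```python
def bits_for_max(max_value: int) -> int:
    """Minimum number of bits that can hold every value from 0 to max_value."""
    return max(1, max_value.bit_length())


def storage_size(max_value: int, sizes=(8, 16, 32, 64, 128)) -> int:
    """Smallest of the given sizes that is large enough for max_value."""
    needed = bits_for_max(max_value)
    for size in sizes:
        if size >= needed:
            return size
    raise ValueError("value too large for the sizes given")


print(bits_for_max(6))                       # days of the week numbered 0 to 6
print(bits_for_max(8_000_000_000), storage_size(8_000_000_000))
print(bits_for_max(94_000), storage_size(94_000))
print(2 ** 128 == (2 ** 64) ** 2)            # 128 bits squares the range of 64 bits
```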

The binary number representation we have looked at so far allows us to represent positive numbers only. In practice, we will want to be able to represent negative numbers as well, such as when the balance of an account goes to a negative amount, or the temperature falls below zero. In our normal representation of base 10 numbers, we represent negative numbers by putting a minus sign in front of the number. But in binary, is it this simple?

We will look at two possible approaches: Adding a simple sign bit, much like we do for decimal, and then a more useful system called two's complement.

Using a simple sign bit

On a computer we don't have minus signs for numbers (using the text-based one doesn't work well, because you can't do arithmetic on characters), but we can allocate one extra bit, called a sign bit, to represent the minus sign. Just like with decimal numbers, we put the negative indicator on the left of the number: when the sign bit is set to "0", the number is positive, and when the sign bit is set to "1", the number is negative (just as if there were a minus sign in front of it).

For example, if we wanted to represent the number 41 using 7 bits along with an additional bit that is the sign bit (to give a total of 8 bits), we would represent it by 00101001. The first bit is a 0, meaning the number is positive, then the remaining 7 bits give 41, meaning the number is +41. If we wanted to make -59, this would be 10111011. The first bit is a 1, meaning the number is negative, and then the remaining 7 bits represent 59, meaning the number is -59.

Using 8 bits as described above (one for the sign, and 7 for the actual number), what would be the binary representations for 1, -1, -8, 34, -37, -88, and 102?

The spaces are not necessary, but are added to make reading the binary numbers easier.

  • 1 is 0000 0001
  • -1 is 1000 0001
  • -8 is 1000 1000
  • 34 is 0010 0010
  • -37 is 1010 0101
  • -88 is 1101 1000
  • 102 is 0110 0110
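The encoding used for these answers can be sketched in a few lines. This is only an illustration, assuming a hypothetical function name `to_sign_magnitude` and inputs limited to the range -127 to 127.

```cpp
#include <cstdint>

// Encode a value in the range -127..127 as an 8-bit sign-and-magnitude pattern:
// bit 7 is the sign bit, bits 6..0 hold the magnitude.
// Out-of-range input is not handled in this sketch.
constexpr std::uint8_t to_sign_magnitude(int value) {
    const int magnitude = value < 0 ? -value : value;
    const int sign_bit = value < 0 ? 0x80 : 0x00;
    return static_cast<std::uint8_t>(sign_bit | (magnitude & 0x7F));
}

// The worked examples from the text.
static_assert(to_sign_magnitude(41) == 0b0010'1001, "+41");
static_assert(to_sign_magnitude(-59) == 0b1011'1011, "-59");
static_assert(to_sign_magnitude(-88) == 0b1101'1000, "-88");
```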

Going the other way is just as easy. If we have the binary number 10010111 , we know it is negative because the first digit is a 1. The number part is the next 7 bits 0010111 , which is 23 . This means the number is -23 .

What would the decimal values be for the following, assuming that the first bit is a sign bit?

  • 00010011 is 19
  • 10000110 is -6
  • 10100011 is -35
  • 01111111 is 127
  • 11111111 is -127

But what about 10000000? That converts to -0 . And 00000000 is +0 . Since -0 and +0 are both just 0, it is very strange to have two different representations for the same number.
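Decoding goes the other way, and it makes the double zero easy to see. The sketch below assumes a hypothetical function name `from_sign_magnitude`.

```cpp
#include <cstdint>

// Decode an 8-bit sign-and-magnitude pattern: the top bit is the sign,
// the low 7 bits are the magnitude.
constexpr int from_sign_magnitude(std::uint8_t pattern) {
    const int magnitude = pattern & 0x7F;
    return (pattern & 0x80) ? -magnitude : magnitude;
}

static_assert(from_sign_magnitude(0b1001'0111) == -23, "worked example from the text");
static_assert(from_sign_magnitude(0b1111'1111) == -127, "most negative value");
// Two different bit patterns decode to the same number, zero.
static_assert(from_sign_magnitude(0b1000'0000) == from_sign_magnitude(0b0000'0000),
              "-0 and +0 are both 0");
```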

This is one of the reasons that we don't use a simple sign bit in practice. Instead, computers usually use a more sophisticated representation for negative binary numbers called two's complement .

Two's complement

There's an alternative representation called two's complement , which avoids having two representations for 0, and more importantly, makes it easier to do arithmetic with negative numbers.

Representing positive numbers with two's complement

Representing positive numbers is the same as the method you have already learnt. Using 8 bits, the leftmost bit is a zero and the other 7 bits are the usual binary representation of the number; for example, 1 would be 00000001, and 50 would be 00110010.

Representing negative numbers with two's complement

This is where things get more interesting. To convert a negative number to its two's complement representation, use the following process.

  1. Convert the number to binary (don't use a sign bit, and pretend it is a positive number).
  2. Invert all the digits (i.e. change 0's to 1's and 1's to 0's).
  3. Add 1 to the result. (Adding 1 is easy in binary; you could do it by converting to decimal first, but think carefully about what happens when a binary number is incremented by 1 by trying a few; there are more hints below.)

For example, assume we want to convert -118 to its two's complement representation. We would use the process as follows.

  1. The binary number for 118 is 01110110.
  2. 01110110 with the digits inverted is 10001001.
  3. 10001001 + 1 is 10001010.

Therefore, the two's complement representation for -118 is 10001010 .
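The invert-then-add-one process translates directly into code. The following is a minimal sketch for 8 bits; the function name `twos_complement_of_negative` is made up for illustration, and it takes the magnitude (the positive version) of the number to be negated.

```cpp
#include <cstdint>

// Given the magnitude of a negative number (for -118, pass 118), return its
// 8-bit two's complement pattern by inverting the digits and adding 1.
constexpr std::uint8_t twos_complement_of_negative(std::uint8_t magnitude) {
    const std::uint8_t inverted = static_cast<std::uint8_t>(~magnitude);  // step 2: invert
    return static_cast<std::uint8_t>(inverted + 1);                       // step 3: add 1
}

// 118 is 01110110; inverted it is 10001001; adding 1 gives 10001010.
static_assert(twos_complement_of_negative(0b0111'0110) == 0b1000'1010, "-118");
```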

The rule for adding one to a binary number is pretty simple, so we'll let you figure it out for yourself. First, if a binary number ends with a 0 (e.g. 1101010), how would the number change if you replace the last 0 with a 1? Now, if it ends with 01, how much would it increase if you change the 01 to 10? What about ending with 011? 011111?

The method for adding is so simple that it's easy to build computer hardware to do it very quickly.

What would be the two's complement representation for the following numbers, using 8 bits ? Follow the process given in this section, and remember that you do not need to do anything special for positive numbers.

  • 19 in binary is 0001 0011 , which is the two's complement for a positive number.
  • For -19, we take the binary of the positive, which is 0001 0011 (above), invert it to 1110 1100, and add 1, giving a representation of 1110 1101 .
  • 107 in binary is 0110 1011 , which is the two's complement for a positive number.
  • For -107, we take the binary of the positive, which is 0110 1011 (above), invert it to 1001 0100, and add 1, giving a representation of 1001 0101 .
  • For -92, we take the binary of the positive, which is 0101 1100, invert it to 1010 0011, and add 1, giving a representation of 1010 0100 . (If you have this incorrect, double check that you incremented by 1 correctly).

Converting a two's complement number back to decimal

In order to reverse the process, we need to know whether the number we are looking at is positive or negative. For positive numbers, we can simply convert the binary number back to decimal. But for negative numbers, we first need to convert it back to a normal binary number.

So how do we know if the number is positive or negative? It turns out (for reasons you will understand later in this section) that two's complement numbers that are negative always start with a 1, and positive numbers always start with a 0. Have a look back at the previous examples to double check this.

So, if the number starts with a 1, use the following process to convert the number back to a negative decimal number.

  • Subtract 1 from the number.
  • Invert all the digits.
  • Convert the resulting binary number to decimal.
  • Add a minus sign in front of it.

So if we needed to convert 11100010 back to decimal, we would do the following.

  • Subtract 1 from 11100010 , giving 11100001 .
  • Invert all the digits, giving 00011110 .
  • Convert 00011110 to decimal, giving 30.
  • Add a negative sign, giving -30 .

Convert the following two's complement numbers to decimal.

  • 10001100 -> (-1) 10001011 -> (inverted) 01110100 -> (to decimal) 116 -> (negative sign added) -116
  • 10111111 -> (-1) 10111110 -> (inverted) 01000001 -> (to decimal) 65 -> (negative sign added) -65
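The reverse process can be sketched the same way. This example assumes 8-bit patterns and a hypothetical function name `from_twos_complement`; it follows the subtract, invert, convert, and negate steps literally rather than relying on the language's own signed types.

```cpp
#include <cstdint>

// Decode an 8-bit two's complement pattern following the steps in the text.
constexpr int from_twos_complement(std::uint8_t pattern) {
    if ((pattern & 0x80) == 0) {
        return pattern;  // starts with 0: positive, read as ordinary binary
    }
    const std::uint8_t minus_one = static_cast<std::uint8_t>(pattern - 1);  // subtract 1
    const std::uint8_t inverted = static_cast<std::uint8_t>(~minus_one);    // invert digits
    return -static_cast<int>(inverted);                                     // add minus sign
}

// 11100010 -> 11100001 -> 00011110 -> 30 -> -30
static_assert(from_twos_complement(0b1110'0010) == -30, "worked example from the text");
```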

How many numbers can be represented using two's complement?

While it might initially seem that there is no bit allocated as the sign bit, the left-most bit behaves like one. With 8 bits, you can still only make 256 possible patterns of 0's and 1's. If you attempted to use 8 bits to represent positive numbers up to 255, and negative numbers down to -255, you would quickly realise that some numbers were mapped onto the same pattern of bits. Obviously, this will make it impossible to know what number is actually being represented!

In practice, numbers within the following ranges can be represented. Unsigned Range is how many numbers you can represent if you only allow positive numbers (no sign is needed), and two's complement Range is how many numbers you can represent if you require both positive and negative numbers. You can work these out because the range of 8-bit values if they are stored using unsigned numbers will be from 00000000 to 11111111 (i.e. 0 to 255 in decimal), while the signed two's complement range is from 10000000 (the lowest number, -128 in decimal) to 01111111 (the highest number, 127 in decimal). This might seem a bit weird, but it works out really well because normal binary addition can be used if you use this representation even if you're adding a negative number.
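The ranges for other bit widths follow the same pattern as the 8-bit case. A rough sketch, assuming hypothetical helper names and widths between 1 and 62 bits so that the intermediate values fit in a 64-bit signed integer:

```cpp
#include <cstdint>

// Range limits for a given number of bits (valid here for 1 <= bits <= 62).
constexpr std::int64_t unsigned_max(int bits) { return (std::int64_t{1} << bits) - 1; }
constexpr std::int64_t signed_min(int bits) { return -(std::int64_t{1} << (bits - 1)); }
constexpr std::int64_t signed_max(int bits) { return (std::int64_t{1} << (bits - 1)) - 1; }

// 8 bits: 0 to 255 unsigned, -128 to 127 in two's complement.
static_assert(unsigned_max(8) == 255, "unsigned 8-bit maximum");
static_assert(signed_min(8) == -128 && signed_max(8) == 127, "signed 8-bit range");
// Both schemes use exactly the same number of bit patterns (256 for 8 bits).
static_assert(signed_max(8) - signed_min(8) + 1 == unsigned_max(8) + 1, "256 patterns");
```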

Adding negative binary numbers

Before adding negative binary numbers, we'll look at adding positive numbers. It's basically the same as the addition methods used on decimal numbers, except the rules are way simpler because there are only two different digits that you might add!

You've probably learnt about column addition. For example, the following column addition would be used to do 128 + 255 .

When you go to add 5 + 8, the result is higher than 9, so you put the 3 in the one's column, and carry the 1 to the 10's column. Binary addition works in exactly the same way.

Adding positive binary numbers

If you wanted to add two positive binary numbers, such as 00001111 and 11001110, you would follow a similar process to the column addition. You only need to know 0+0, 0+1, 1+0, 1+1, and 1+1+1. The first three are just what you might expect. Adding 1+1 causes a carry digit, since in binary 1+1 = 10, which translates to "0, carry 1" when doing column addition. The last one, 1+1+1, adds up to 11 in binary, which we can express as "1, carry 1". For our two example numbers, working from the rightmost column to the leftmost gives 00001111 + 11001110 = 11011101 (in decimal, 15 + 206 = 221).

Remember that the digits can be only 1 or 0. So you will need to carry a 1 to the next column if the total you get for a column is (decimal) 2 or 3.
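Column addition with carries can be written as a loop over the bit positions. This is a sketch for illustration only (real hardware and the `+` operator do the same job far more efficiently); the name `add_by_columns` is an assumption, and `bits` must be at most 32.

```cpp
#include <cstdint>

// Add two binary numbers one column at a time, from the rightmost column to
// the leftmost, carrying a 1 whenever a column totals 2 or 3.
constexpr std::uint32_t add_by_columns(std::uint32_t a, std::uint32_t b, int bits) {
    std::uint32_t result = 0;
    unsigned carry = 0;
    for (int column = 0; column < bits; ++column) {
        const unsigned total = ((a >> column) & 1u) + ((b >> column) & 1u) + carry;  // 0..3
        result |= static_cast<std::uint32_t>(total & 1u) << column;  // digit for this column
        carry = total >> 1;                                          // 1 if total was 2 or 3
    }
    return result;  // a carry out of the last column is discarded
}

// The example from the text: 00001111 + 11001110.
static_assert(add_by_columns(0b0000'1111, 0b1100'1110, 8) == 0b1101'1101, "15 + 206 = 221");
```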

Adding negative numbers with a simple sign bit

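With negative numbers using sign bits like we did before, this does not work. If you wanted to add +11 (01011) and -7 (10111), you would expect to get an answer of +4 (00100). Instead, plain column addition gives 01011 + 10111 = 100010, which, reading the leftmost bit as the sign bit, is -2.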

One way we could solve the problem is to use column subtraction instead. But this would require giving the computer a hardware circuit which could do this. Luckily this is unnecessary, because addition with negative numbers works automatically using two's complement!

Adding negative numbers with two's complement

For the above addition (+11 + -7), we can start by converting the numbers to their 5-bit two's complement form. Because 01011 (+11) is a positive number, it does not need to be changed. But for the negative number, 00111 (-7) (sign bit from before removed as we don't use it for two's complement), we need to invert the digits and then add 1, giving 11001 .

Adding these two numbers, 01011 + 11001, gives 100100. Any extra bits to the left (beyond what we are using, in this case 5 bits) are truncated. This leaves 00100, which is 4, like we were expecting.

We can also use this for subtraction. If we are subtracting a positive number from a positive number, we would need to convert the number we are subtracting to a negative number. Then we should add the two numbers. This is the same as for decimal numbers, for example 5 - 2 = 3 is the same as 5 + (-2) = 3.

This property of two's complement is very useful. It means that positive numbers and negative numbers can be handled by the same computer circuit, and addition and subtraction can be treated as the same operation.
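To illustrate how one addition circuit can serve both purposes, here is a small sketch for 5-bit numbers. The helper names `negate_5bit` and `add_5bit` are assumptions made for this example; the mask `0x1F` keeps only the low 5 bits, which is the truncation step described above.

```cpp
// Two's complement of a 5-bit pattern: invert the digits, add 1, keep 5 bits.
constexpr unsigned negate_5bit(unsigned pattern) {
    return (~pattern + 1u) & 0x1Fu;
}

// Ordinary binary addition, with any bits beyond the fifth truncated.
constexpr unsigned add_5bit(unsigned a, unsigned b) {
    return (a + b) & 0x1Fu;
}

// -7 is 00111 inverted (11000) plus 1, which is 11001.
static_assert(negate_5bit(0b00111) == 0b11001, "-7 in 5-bit two's complement");
// +11 + -7 = 01011 + 11001 = (1)00100, truncated to 00100, which is 4.
static_assert(add_5bit(0b01011, negate_5bit(0b00111)) == 0b00100, "11 + -7 = 4");
```

Subtraction such as 5 - 2 needs no separate function: it is `add_5bit(5, negate_5bit(2))`.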

The idea of using a "complementary" number to change subtraction to addition can be seen by doing the same in decimal. The complement of a decimal digit is the digit that adds up to 10; for example, the complement of 4 is 6, and the complement of 8 is 2. (The word "complement" comes from the root "complete" – it completes it to a nice round number.)

Subtracting 2 from 6 is the same as adding the complement, and ignoring the extra 1 digit on the left. The complement of 2 is 8, so we add 8 to 6, giving (1)4.

For larger numbers (such as subtracting the two 3-digit numbers 255 - 128), the complement is the number that adds up to the next power of 10, i.e. 1000 - 128 = 872. Adding 872 to 255 gives 1127, and ignoring the extra 1 digit on the left leaves 127, the same result as subtracting 128.

Working out complements in binary is way easier because there are only two digits to work with, but working them out in decimal may help you to understand what is going on.

Using sign bits vs using two's complement

We have now looked at two different ways of representing negative numbers on a computer. In practice, a simple sign bit is rarely used, because of having two different representations of zero, and requiring a different computer circuit to handle negative and positive numbers, and to do addition and subtraction.

Two's complement is widely used, because it only has one representation for zero, and it allows positive numbers and negative numbers to be treated in the same way, and addition and subtraction to be treated as one operation.

There are other systems such as "One's Complement" and "Excess-k", but two's complement is by far the most widely used in practice.

types of data representation in computer science

  • Data representation
  • Bytes of memory
  • Abstract machine
  • Unsigned integer representation
  • Signed integer representation
  • Pointer representation
  • Array representation
  • Compiler layout
  • Array access performance
  • Collection representation
  • Consequences of size and alignment rules
  • Uninitialized objects
  • Pointer arithmetic
  • Undefined behavior
  • Computer arithmetic
  • Arena allocation

This course is about learning how computers work, from the perspective of systems software: what makes programs work fast or slow, and how properties of the machines we program impact the programs we write. We want to communicate ideas, tools, and an experimental approach.

The course divides into six units:

  • Data representation
  • Assembly & machine programming
  • Storage & caching
  • Kernel programming
  • Process management
  • Concurrency

The first unit, data representation , is all about how different forms of data can be represented in terms the computer can understand.

Computer memory is kind of like a Lite Brite.

A Lite Brite is a big black backlit pegboard coupled with a supply of colored pegs, in a limited set of colors. You can plug in the pegs to make all kinds of designs. A computer's memory is like a vast pegboard where each slot holds one of 256 different colors. The colors are numbered 0 through 255, so each slot holds one byte. (A byte is a number between 0 and 255, inclusive.)

A slot of computer memory is identified by its address. On a computer with M bytes of memory, and therefore M slots, you can think of the address as a number between 0 and M−1. My laptop has 16 gibibytes of memory, so \(M = 16\times2^{30} = 2^{34} = 17{,}179{,}869{,}184\) = 0x4'0000'0000, a very large number!

The problem of data representation is the problem of representing all the concepts we might want to use in programming—integers, fractions, real numbers, sets, pictures, texts, buildings, animal species, relationships—using the limited medium of addresses and bytes.

Powers of ten and powers of two. Digital computers love the number two and all powers of two. The electronics of digital computers are based on the bit, the smallest unit of storage, which is a base-two digit: either 0 or 1. More complicated objects are represented by collections of bits. This choice has many scale and error-correction advantages. It also refracts upwards to larger choices, and even into terminology. Memory chips, for example, have capacities based on large powers of two, such as \(2^{30}\) bytes. Since \(2^{10} = 1024\) is pretty close to 1,000, \(2^{20} = 1{,}048{,}576\) is pretty close to a million, and \(2^{30} = 1{,}073{,}741{,}824\) is pretty close to a billion, it's common to refer to \(2^{30}\) bytes of memory as "a gigabyte," even though that actually means \(10^9 = 1{,}000{,}000{,}000\) bytes. But for greater precision, there are terms that explicitly signal the use of powers of two. \(2^{30}\) bytes is a gibibyte: the "-bi-" component means "binary."

Virtual memory. Modern computers actually abstract their memory spaces using a technique called virtual memory. The lowest-level kind of address, called a physical address, really does take on values between 0 and M−1. However, even on a 16GiB machine like my laptop, the addresses we see in programs can take on values like 0x7ffe'ea2c'aa67 that are much larger than M−1 = 0x3'ffff'ffff. The addresses used in programs are called virtual addresses. They're incredibly useful for protection: since different running programs have logically independent address spaces, it's much less likely that a bug in one program will crash the whole machine. We'll learn about virtual memory in much more depth in the kernel unit; the distinction between virtual and physical addresses is not as critical for data representation.

Most programming languages prevent their users from directly accessing memory. But not C and C++! These languages let you access any byte of memory with a valid address. This is powerful; it is also very dangerous. But it lets us get a hands-on view of how computers really work.

C++ programs accomplish their work by constructing, examining, and modifying objects . An object is a region of data storage that contains a value, such as the integer 12. (The standard specifically says “a region of data storage in the execution environment, the contents of which can represent values”.) Memory is called “memory” because it remembers object values.

In this unit, we often use functions called hexdump to examine memory. These functions are defined in hexdump.cc . hexdump_object(x) prints out the bytes of memory that comprise an object named x , while hexdump(ptr, size) prints out the size bytes of memory starting at a pointer ptr .

For example, in datarep1/add.cc , we might use hexdump_object to examine the memory used to represent some integers:

This display reports that a , b , and c are each four bytes long; that a , b , and c are located at different, nonoverlapping addresses (the long hex number in the first column); and shows us how the numbers 1, 2, and 3 are represented in terms of bytes. (More on that later.)

The compiler, hardware, and standard together define how objects of different types map to bytes. Each object uses a contiguous range of addresses (and thus bytes), and objects never overlap (objects that are active simultaneously are always stored in distinct address ranges).

Since C and C++ are designed to help software interface with hardware devices, their standards are transparent about how objects are stored. A C++ program can ask how big an object is using the sizeof keyword. sizeof(T) returns the number of bytes in the representation of an object of type T, and sizeof(x) returns the size of object x. The result of sizeof is a value of type size_t, which is an unsigned integer type large enough to hold any representable size. On 64-bit architectures, such as x86-64 (our focus in this course), size_t can hold numbers between 0 and \(2^{64}-1\).

Qualitatively different objects may have the same data representation. For example, the following three objects have the same data representation on x86-64, which you can verify using hexdump :

In C and C++, you can’t reliably tell the type of an object by looking at the contents of its memory. That’s why tricks like our different addf*.cc functions work.

An object can have many names. For example, here, local and *ptr refer to the same object:

The different names for an object are sometimes called aliases .
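A minimal sketch of aliasing, assuming a made-up function name `write_through_alias`: a write through one name is visible through the other, because both names refer to the same region of storage.

```cpp
// `local` and `*ptr` are two names for one object.
constexpr int write_through_alias() {
    int local = 1;
    int* ptr = &local;  // *ptr is now an alias for local
    *ptr = 2;           // modifies the single underlying object
    return local;       // the change is visible through the other name
}

static_assert(write_through_alias() == 2, "both names refer to the same object");
```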

There are five objects here:

  • ch1 , a global variable
  • ch2 , a constant (non-modifiable) global variable
  • ch3 , a local variable
  • ch4 , a local variable
  • the anonymous storage allocated by new char and accessed by *ch4

Each object has a lifetime , which is called storage duration by the standard. There are three different kinds of lifetime.

  • static lifetime: The object lasts as long as the program runs. ( ch1 , ch2 )
  • automatic lifetime: The compiler allocates and destroys the object automatically as the program runs, based on the object’s scope (the region of the program in which it is meaningful). ( ch3 , ch4 )
  • dynamic lifetime: The programmer allocates and destroys the object explicitly. ( *ch4 )

Objects with dynamic lifetime aren’t easy to use correctly. Dynamic lifetime causes many serious problems in C programs, including memory leaks, use-after-free, double-free, and so forth. Those serious problems cause undefined behavior and play a “disastrously central role” in “our ongoing computer security nightmare” . But dynamic lifetime is critically important. Only with dynamic lifetime can you construct an object whose size isn’t known at compile time, or construct an object that outlives the function that created it.

The compiler and operating system work together to put objects at different addresses. A program’s address space (which is the range of addresses accessible to a program) divides into regions called segments . Objects with different lifetimes are placed into different segments. The most important segments are:

  • Code (also known as text or read-only data ). Contains instructions and constant global objects. Unmodifiable; static lifetime.
  • Data . Contains non-constant global objects. Modifiable; static lifetime.
  • Heap . Modifiable; dynamic lifetime.
  • Stack . Modifiable; automatic lifetime.

The compiler decides on a segment for each object based on its lifetime. The final compiler phase, which is called the linker , then groups all the program’s objects by segment (so, for instance, global variables from different compiler runs are grouped together into a single segment). Finally, when a program runs, the operating system loads the segments into memory. (The stack and heap segments grow on demand.)

We can use a program to investigate where objects with different lifetimes are stored. (See cs61-lectures/datarep2/mexplore0.cc .) This shows address ranges like this:

Constant global data and global data have the same lifetime, but are stored in different segments. The operating system uses different segments so it can prevent the program from modifying constants. It marks the code segment, which contains functions (instructions) and constant global data, as read-only, and any attempt to modify code-segment memory causes a crash (a “Segmentation violation”).

An executable is normally at least as big as the static-lifetime data (the code and data segments together). Since all that data must be in memory for the entire lifetime of the program, it’s written to disk and then loaded by the OS before the program starts running. There is an exception, however: the “bss” segment is used to hold modifiable static-lifetime data with initial value zero. Such data is common, since all static-lifetime data is initialized to zero unless otherwise specified in the program text. Rather than storing a bunch of zeros in the object files and executable, the compiler and linker simply track the location and size of all zero-initialized global data. The operating system sets this memory to zero during the program load process. Clearing memory is faster than loading data from disk, so this optimization saves both time (the program loads faster) and space (the executable is smaller).

Abstract machine and hardware

Programming involves turning an idea into hardware instructions. This transformation happens in multiple steps, some you control and some controlled by other programs.

First you have an idea , like “I want to make a flappy bird iPhone game.” The computer can’t (yet) understand that idea. So you transform the idea into a program , written in some programming language . This process is called programming.

A C++ program actually runs on an abstract machine . The behavior of this machine is defined by the C++ standard , a technical document. This document is supposed to be so precisely written as to have an exact mathematical meaning, defining exactly how every C++ program behaves. But the document can’t run programs!

C++ programs run on hardware (mostly), and the hardware determines what behavior we see. Mapping abstract machine behavior to instructions on real hardware is the task of the C++ compiler (and the standard library and operating system). A C++ compiler is correct if and only if it translates each correct program to instructions that simulate the expected behavior of the abstract machine.

This same rough series of transformations happens for any programming language, although some languages use interpreters rather than compilers.

A bit is the fundamental unit of digital information: it’s either 0 or 1.

C++ manages memory in units of bytes —8 contiguous bits that together can represent numbers between 0 and 255. C’s unit for a byte is char : the abstract machine says a byte is stored in char . That means an unsigned char holds values in the inclusive range [0, 255].

The C++ standard actually doesn't require that a byte hold 8 bits, and on some crazy machines from decades ago, bytes could hold nine bits! (!?)

But larger numbers, such as 258, don’t fit in a single byte. To represent such numbers, we must use multiple bytes. The abstract machine doesn’t specify exactly how this is done—it’s the compiler and hardware’s job to implement a choice. But modern computers always use place–value notation , just like in decimal numbers. In decimal, the number 258 is written with three digits, the meanings of which are determined both by the digit and by their place in the overall number:

\[ 258 = 2\times10^2 + 5\times10^1 + 8\times10^0 \]

The computer uses base 256 instead of base 10. Two adjacent bytes can represent numbers between 0 and \(255\times256+255 = 65535 = 2^{16}-1\) , inclusive. A number larger than this would take three or more bytes.

\[ 258 = 1\times256^1 + 2\times256^0 \]

On x86-64, the ones place, the least significant byte, is on the left, at the lowest address in the contiguous two-byte range used to represent the integer. This is the opposite of how decimal numbers are written: decimal numbers put the most significant digit on the left. The representation choice of putting the least-significant byte in the lowest address is called little-endian representation. x86-64 uses little-endian representation.

Some computers actually store multi-byte integers the other way, with the most significant byte stored in the lowest address; that’s called big-endian representation. The Internet’s fundamental protocols, such as IP and TCP, also use big-endian order for multi-byte integers, so big-endian is also called “network” byte order.

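The C++ standard defines five fundamental unsigned integer types, along with relationships among their sizes: unsigned char, unsigned short, unsigned int, unsigned long, and unsigned long long. On x86-64 their sizes are 1, 2, 4, 8, and 8 bytes respectively, so their largest values are 255, 65,535, 4,294,967,295, and (for both 8-byte types) 18,446,744,073,709,551,615.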

Other architectures and operating systems implement different ranges for these types. For instance, on IA32 machines like Intel’s Pentium (the 32-bit processors that predated x86-64), sizeof(long) was 4, not 8.

Note that all values of a smaller unsigned integer type can fit in any larger unsigned integer type. When a value of a larger unsigned integer type is placed in a smaller unsigned integer object, however, not every value fits; for instance, the unsigned short value 258 doesn't fit in an unsigned char x. When this occurs, the C++ abstract machine requires that the smaller object's value equals the least-significant bits of the larger value (so x will equal 2).
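The truncation rule can be seen in a short sketch. It uses the fixed-width types `std::uint16_t` and `std::uint8_t` as stand-ins for unsigned short and unsigned char, and the function name `narrow_to_byte` is invented for the example.

```cpp
#include <cstdint>

// Converting to a smaller unsigned type keeps only the least-significant bits,
// which is the same as taking the value modulo 256 for an 8-bit target.
constexpr std::uint8_t narrow_to_byte(std::uint16_t value) {
    return static_cast<std::uint8_t>(value);
}

// 258 is 0x0102; only the low byte 0x02 survives.
static_assert(narrow_to_byte(258) == 2, "258 mod 256 == 2");
static_assert(narrow_to_byte(0x1234) == 0x34, "the high byte is discarded");
```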

In addition to these types, whose sizes can vary, C++ has integer types whose sizes are fixed. uint8_t , uint16_t , uint32_t , and uint64_t define 8-bit, 16-bit, 32-bit, and 64-bit unsigned integers, respectively; on x86-64, these correspond to unsigned char , unsigned short , unsigned int , and unsigned long .

This general procedure is used to represent a multi-byte integer in memory.

  • Write the large integer in hexadecimal format, including all leading zeros required by the type size. For example, the unsigned value 65534 would be written 0x0000FFFE . There will be twice as many hexadecimal digits as sizeof(TYPE) .
  • Divide the integer into its component bytes, which are its digits in base 256. In our example, they are, from most to least significant, 0x00, 0x00, 0xFF, and 0xFE.

In little-endian representation, the bytes are stored in memory from least to most significant. If our example was stored at address 0x30, then address 0x30 would hold 0xFE, 0x31 would hold 0xFF, and 0x32 and 0x33 would each hold 0x00.

In big-endian representation, the bytes are stored in the reverse order.
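The procedure above can be sketched arithmetically, without inspecting real memory, so it gives the same answer on any machine. The helper names below are assumptions for illustration; `offset` counts bytes from the start address of a 4-byte integer.

```cpp
#include <cstdint>

// Base-256 digit number `index` of a 32-bit value (0 = least significant).
constexpr std::uint8_t base256_digit(std::uint32_t value, int index) {
    return static_cast<std::uint8_t>((value >> (8 * index)) & 0xFFu);
}

// Byte found `offset` bytes after the start address in each byte order.
constexpr std::uint8_t little_endian_byte(std::uint32_t value, int offset) {
    return base256_digit(value, offset);      // least significant byte first
}
constexpr std::uint8_t big_endian_byte(std::uint32_t value, int offset) {
    return base256_digit(value, 3 - offset);  // most significant byte first
}

// 65534 is 0x0000FFFE.
static_assert(little_endian_byte(65534, 0) == 0xFE, "lowest address holds 0xFE");
static_assert(little_endian_byte(65534, 1) == 0xFF, "next address holds 0xFF");
static_assert(little_endian_byte(65534, 3) == 0x00, "highest address holds 0x00");
static_assert(big_endian_byte(65534, 0) == 0x00, "big-endian starts with 0x00");
static_assert(big_endian_byte(65534, 3) == 0xFE, "big-endian ends with 0xFE");
```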

Computers are often fastest at dealing with fixed-length numbers, rather than variable-length numbers, and processor internals are organized around a fixed word size. A word is the natural unit of data used by a processor design. In most modern processors, this natural unit is 8 bytes or 64 bits, because this is the power-of-two number of bytes big enough to hold those processors' memory addresses. Many older processors could access less memory and had correspondingly smaller word sizes, such as 4 bytes (32 bits).

The best representation for signed integers—and the choice made by x86-64, and by the C++20 abstract machine—is two’s complement . Two’s complement representation is based on this principle: Addition and subtraction of signed integers shall use the same instructions as addition and subtraction of unsigned integers.

To see what this means, let's think about what -x should mean when x is an unsigned integer. Wait, negative unsigned?! This isn't an oxymoron because C++ uses modular arithmetic for unsigned integers: the result of an arithmetic operation on unsigned values is always taken modulo \(2^B\), where B is the number of bits in the unsigned value type. Thus, on x86-64, where unsigned has 32 bits, 0xFFFFFFFFU + 0x00000002U == 0x00000001U, because \(2^{32} - 1 + 2 \equiv 1 \pmod{2^{32}}\).

-x is simply the number that, when added to x, yields 0 (mod \(2^B\)). For example, when unsigned x = 0xFFFFFFFFU, then -x == 1U, since x + -x equals zero (mod \(2^{32}\)).

To obtain -x, we flip all the bits in x (an operation written ~x) and then add 1. To see why, consider the bit representations. What is x + (~x + 1)? Well, \(({\sim}x)_i\) (the i-th bit of ~x) is 1 whenever \(x_i\) is 0, and vice versa. That means that every bit of x + ~x is 1 (there are no carries), and x + ~x is the largest unsigned integer, with value \(2^B - 1\). If we add 1 to this, we get \(2^B\). Which is 0 (mod \(2^B\))! The highest "carry" bit is dropped, leaving zero.
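The flip-and-add-one rule can be checked with a short sketch. It assumes `std::uint32_t` is the 32-bit unsigned type (as it is on x86-64), and the function name `negate` is chosen for this example.

```cpp
#include <cstdint>

// Negation in modular arithmetic: flip every bit, then add 1.
constexpr std::uint32_t negate(std::uint32_t x) {
    return ~x + 1u;
}

// The example from the text: -0xFFFFFFFF is 1, because their sum wraps to 0.
static_assert(negate(0xFFFFFFFFu) == 1u, "-x == 1 when x == 0xFFFFFFFF");
static_assert(0xFFFFFFFFu + negate(0xFFFFFFFFu) == 0u, "x + -x == 0 (mod 2^32)");
// The result agrees with subtracting from zero in unsigned arithmetic.
static_assert(negate(5u) == 0u - 5u, "same as 0 - 5 modulo 2^32");
```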

Two’s complement arithmetic uses half of the unsigned integer representations for negative numbers. A two’s-complement signed integer with B bits has the following values:

  • If the most-significant bit is 1, the represented number is negative. Specifically, the represented number is – (~x + 1) , where the outer negative sign is mathematical negation (not computer arithmetic).
  • If every bit is 0, the represented number is 0.
  • If the most-significant bit is 0 but some other bit is 1, the represented number is positive.

The most significant bit is also called the sign bit , because if it is 1, then the represented value depends on the signedness of the type (and that value is negative for signed types).

Another way to think about two's complement is that, for B-bit integers, the most-significant bit has place value \(2^{B-1}\) in unsigned arithmetic and \(-2^{B-1}\) in signed arithmetic. All other bits have the same place values in both kinds of arithmetic.

The two's-complement bit pattern for x + y is the same whether x and y are considered as signed or unsigned values. For example, in 4-bit arithmetic, 5 has representation 0b0101, while the representation 0b1100 represents 12 if unsigned and -4 if signed (~0b1100 + 1 == 0b0011 + 1 == 4). Adding those bit patterns gives 0b0101 + 0b1100 = 0b10001, and dropping the carry out of the fourth bit leaves 0b0001, which is 1.

Note that this is the right answer for both signed and unsigned arithmetic : 5 + 12 = 17 = 1 (mod 16), and 5 + -4 = 1.
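The same check can be written as a sketch that models 4-bit arithmetic with masks. The helper names are assumptions; `as_signed4` applies the place-value rule, giving the top bit the value -8 instead of +8.

```cpp
// 4-bit addition: add normally and keep only the low 4 bits.
constexpr unsigned add4(unsigned a, unsigned b) {
    return (a + b) & 0xFu;
}

// Read a 4-bit pattern as an unsigned value (top bit worth +8).
constexpr int as_unsigned4(unsigned pattern) {
    return static_cast<int>(pattern & 0xFu);
}

// Read a 4-bit pattern as a two's complement value (top bit worth -8).
constexpr int as_signed4(unsigned pattern) {
    return static_cast<int>(pattern & 0x7u) - static_cast<int>(pattern & 0x8u);
}

static_assert(as_unsigned4(0b1100) == 12 && as_signed4(0b1100) == -4, "one pattern, two readings");
static_assert(add4(0b0101, 0b1100) == 0b0001, "the carry out of bit 3 is dropped");
static_assert(as_unsigned4(add4(0b0101, 0b1100)) == (5 + 12) % 16, "unsigned: 17 mod 16");
static_assert(as_signed4(add4(0b0101, 0b1100)) == 5 + -4, "signed: 5 + -4");
```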

Subtraction and multiplication also produce the same results for unsigned arithmetic and signed two’s-complement arithmetic. (For instance, 5 * 12 = 60 = 12 (mod 16), and 5 * -4 = -20 = -4 (mod 16).) This is not true of division. (Consider dividing the 4-bit representation 0b1110 by 2. In signed arithmetic, 0b1110 represents -2, so 0b1110/2 == 0b1111 (-1); but in unsigned arithmetic, 0b1110 is 14, so 0b1110/2 == 0b0111 (7).) And, of course, it is not true of comparison. In signed 4-bit arithmetic, 0b1110 < 0 , but in unsigned 4-bit arithmetic, 0b1110 > 0 . This means that a C compiler for a two’s-complement machine can use a single add instruction for either signed or unsigned numbers, but it must generate different instruction patterns for signed and unsigned division (or less-than, or greater-than).
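A rough sketch of why the compiler needs separate division instructions, again modelling 4-bit arithmetic with masks. All helper names here are made up for the example; the signed versions convert the pattern to its signed value, operate, and convert back to a 4-bit pattern.

```cpp
// Read a 4-bit pattern as a two's complement value (top bit worth -8).
constexpr int as_signed4(unsigned pattern) {
    return static_cast<int>(pattern & 0x7u) - static_cast<int>(pattern & 0x8u);
}

// Convert a (possibly negative) value back to its 4-bit pattern.
constexpr unsigned to_pattern4(int value) {
    return static_cast<unsigned>(value) & 0xFu;  // conversion to unsigned is modular
}

constexpr unsigned unsigned_mul4(unsigned p, unsigned q) { return (p * q) & 0xFu; }
constexpr unsigned signed_mul4(unsigned p, unsigned q) {
    return to_pattern4(as_signed4(p) * as_signed4(q));
}
constexpr unsigned unsigned_div4(unsigned p, unsigned d) { return ((p & 0xFu) / d) & 0xFu; }
constexpr unsigned signed_div4(unsigned p, int d) { return to_pattern4(as_signed4(p) / d); }

// Multiplication yields the same bit pattern under either interpretation.
static_assert(unsigned_mul4(0b0101, 0b1100) == signed_mul4(0b0101, 0b1100), "5*12 and 5*-4 agree");
// Division does not: 0b1110 / 2 is 0b0111 (14/2 == 7) or 0b1111 (-2/2 == -1).
static_assert(unsigned_div4(0b1110, 2) == 0b0111, "unsigned division");
static_assert(signed_div4(0b1110, 2) == 0b1111, "signed division");
```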

There are a couple of quirks with C signed arithmetic. First, in two's complement, there are more negative numbers than positive numbers. The representation whose sign bit is 1 and whose other bits are all 0 has no positive counterpart at the same bit width: for this number, -x == x. (In 4-bit arithmetic, -0b1000 == ~0b1000 + 1 == 0b0111 + 1 == 0b1000.) Second, and far worse, is that arithmetic overflow on signed integers is undefined behavior.
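Both quirks can be illustrated in a short sketch. The helper names are assumptions; `add_would_overflow` shows one common way to test for overflow before adding, so that the undefined operation is never performed.

```cpp
#include <limits>

// Quirk 1, modelled with 4-bit masks: the pattern 0b1000 negates to itself.
constexpr unsigned negate4(unsigned pattern) {
    return (~pattern + 1u) & 0xFu;
}
static_assert(negate4(0b1000) == 0b1000, "-x == x for the most negative value");

// Quirk 2: check whether a + b would overflow int without computing a + b.
constexpr bool add_would_overflow(int a, int b) {
    return (b > 0 && a > std::numeric_limits<int>::max() - b)
        || (b < 0 && a < std::numeric_limits<int>::min() - b);
}
static_assert(add_would_overflow(std::numeric_limits<int>::max(), 1), "INT_MAX + 1 overflows");
static_assert(!add_would_overflow(std::numeric_limits<int>::max(), 0), "adding 0 is safe");
```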

The C++ abstract machine requires that signed integers have the same sizes as their unsigned counterparts.

We distinguish pointers , which are concepts in the C abstract machine, from addresses , which are hardware concepts. A pointer combines an address and a type.

The memory representation of a pointer is the same as the representation of its address value. The size of that integer is the machine’s word size; for example, on x86-64, a pointer occupies 8 bytes, and a pointer to an object located at address 0x400abc would be stored as:

The C++ abstract machine defines an unsigned integer type uintptr_t that can hold any address. (You have to #include <inttypes.h> or <cinttypes> to get the definition.) On most machines, including x86-64, uintptr_t is the same as unsigned long . Cast a pointer to an integer address value with syntax like (uintptr_t) ptr ; cast back to a pointer with syntax like (T*) addr . Casts between pointer types and uintptr_t are information preserving, so this assertion will never fail:
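A sketch of such a round trip, wrapped in a hypothetical helper (`roundtrip_preserves` is an illustrative name, not a library function):

```cpp
#include <cassert>
#include <inttypes.h>   // uintptr_t

// Convert a pointer to its integer address value and back again.
// Returns true if the pointer survives the round trip unchanged.
template <typename T>
bool roundtrip_preserves(T* ptr) {
    uintptr_t addr = (uintptr_t) ptr;   // pointer -> address
    T* back = (T*) addr;                // address -> pointer
    return back == ptr;
}
```

Calling `assert(roundtrip_preserves(&x))` for any object `x` should never fail.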

Since it is a 64-bit architecture, the size of an x86-64 address is 64 bits (8 bytes). That’s also the size of x86-64 pointers.

To represent an array of integers, C++ and C allocate the integers next to each other in memory, in sequential addresses, with no gaps or overlaps. Here, we put the integers 0, 1, and 258 next to each other, starting at address 1008:

Say that you have an array of N integers, and you access each of those integers in order, accessing each integer exactly once. Does the order matter?

Computer memory is random-access memory (RAM), which means that a program can access any bytes of memory in any order—it’s not, for example, required to read memory in ascending order by address. But if we run experiments, we can see that even in RAM, different access orders have very different performance characteristics.

Our arraysum program sums up all the integers in an array of N integers, using an access order based on its arguments, and prints the resulting delay. Here’s the result of a couple experiments on accessing 10,000,000 items in three orders, “up” order (sequential: elements 0, 1, 2, 3, …), “down” order (reverse sequential: N , N −1, N −2, …), and “random” order (as it sounds).

Wow! Down order is just a bit slower than up, but random order seems about 40 times slower. Why?

Random order is defeating many of the internal architectural optimizations that make memory access fast on modern machines. Sequential order, since it’s more predictable, is much easier to optimize.

Foreshadowing. This part of the lecture is a teaser for the Storage unit, where we cover access patterns and caching, including the processor caches that explain this phenomenon, in much more depth.

The C++ programming language offers several collection mechanisms for grouping subobjects together into new kinds of object. The collections are arrays, structs, and unions. (Classes are a kind of struct. All library types, such as vectors, lists, and hash tables, use combinations of these collection types.) The abstract machine defines how subobjects are laid out inside a collection. This is important, because it lets C/C++ programs exchange messages with hardware and even with programs written in other languages: messages can be exchanged only when both parties agree on layout.

Array layout in C++ is particularly simple: The objects in an array are laid out sequentially in memory, with no gaps or overlaps. Assume a declaration like T x[N], where x is an array of N objects of type T, and say that the address of x is a. Then the address of element x[i] equals a + i * sizeof(T), and sizeof(x) == N * sizeof(T).
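The array rule can be checked directly. This is a small sketch; `follows_array_rule` is a hypothetical helper introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <inttypes.h>   // uintptr_t

// Check the array layout rule for an array `T x[N]`: element i lives at
// address a + i * sizeof(T), where a is the address of the array itself,
// and the whole array occupies N * sizeof(T) bytes.
template <typename T, std::size_t N>
bool follows_array_rule(T (&x)[N]) {
    uintptr_t a = (uintptr_t) &x;
    for (std::size_t i = 0; i != N; ++i) {
        if ((uintptr_t) &x[i] != a + i * sizeof(T)) {
            return false;
        }
    }
    return sizeof(x) == N * sizeof(T);
}
```

Passing the array by reference keeps its size `N` visible to the helper; passing it as a plain pointer would lose that information.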

Sidebar: Vector representation

The C++ library type std::vector defines an array that can grow and shrink. For instance, this function creates a vector containing the numbers 0 up to N in sequence:
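A sketch of such a function. The name `make_sequence` is hypothetical, and the sketch assumes "up to N" excludes N itself.

```cpp
#include <cassert>
#include <vector>

// Build a vector holding 0, 1, ..., N-1.
// (Assumption: "up to N" is read as excluding N.)
std::vector<int> make_sequence(int N) {
    std::vector<int> v;          // v itself has automatic lifetime
    for (int i = 0; i < N; ++i) {
        v.push_back(i);          // the contents are stored elsewhere
    }
    return v;
}
```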

Here, v is an object with automatic lifetime. This means its size (in the sizeof sense) is fixed at compile time. Remember that the sizes of static- and automatic-lifetime objects must be known at compile time; only dynamic-lifetime objects can have varying size based on runtime parameters. So where and how are v ’s contents stored?

The C++ abstract machine requires that v ’s elements are stored in an array in memory. (The v.data() method returns a pointer to the first element of the array.) But it does not define std::vector ’s layout otherwise, and C++ library designers can choose different layouts based on their needs. We found these to hold for the std::vector in our library:

sizeof(v) == 24 for any vector of any type, and the address of v is a stack address (i.e., v is located in the stack segment).

The first 8 bytes of the vector hold the address of the first element of the contents array—call it the begin address . This address is a heap address, which is as expected, since the contents must have dynamic lifetime. The value of the begin address is the same as that of v.data() .

Bytes 8–15 hold the address just past the contents array—call it the end address . Its value is the same as &v.data()[v.size()] . If the vector is empty, then the begin address and the end address are the same.

Bytes 16–23 hold an address greater than or equal to the end address. This is the capacity address. As a vector grows, it will sometimes outgrow its current location and move its contents to new memory addresses. To reduce the number of copies, vectors usually request more memory from the operating system than they immediately need; this additional space, which is called “capacity,” supports cheap growth. Often the capacity doubles on each growth spurt, since this allows operations like v.push_back() to execute in O(1) time on average.
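The growth behavior can be observed through standard member functions alone, without relying on any particular library's byte layout. In this sketch, `count_capacity_changes` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Count how many times a vector's capacity changes while it grows to n
// elements. Each change corresponds to a new, larger contents allocation.
std::size_t count_capacity_changes(std::size_t n) {
    std::vector<int> v;
    std::size_t changes = 0;
    for (std::size_t i = 0; i != n; ++i) {
        std::size_t before = v.capacity();
        v.push_back(static_cast<int>(i));
        assert(v.size() <= v.capacity());   // capacity never lags behind size
        if (v.capacity() != before) {
            ++changes;
        }
    }
    return changes;
}
```

Because capacity grows geometrically in common implementations, the number of changes stays far below the number of `push_back` calls.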

Compilers must also decide where different objects are stored when those objects are not part of a collection. For instance, consider this program:

The abstract machine says these objects cannot overlap, but does not otherwise constrain their positions in memory.

On Linux, GCC will put all these variables into the stack segment, which we can see using hexdump . But it can put them in the stack segment in any order , as we can see by reordering the declarations (try declaration order i1 , c1 , i2 , c2 , c3 ), by changing optimization levels, or by adding different scopes (braces). The abstract machine gives the programmer no guarantees about how object addresses relate. In fact, the compiler may move objects around during execution, as long as it ensures that the program behaves according to the abstract machine. Modern optimizing compilers often do this, particularly for automatic objects.

But what order does the compiler choose? With optimization disabled, the compiler appears to lay out objects in decreasing order by declaration, so the first declared variable in the function has the highest address. With optimization enabled, the compiler follows roughly the same guideline, but it also rearranges objects by type—for instance, it tends to group char s together—and it can reuse space if different variables in the same function have disjoint lifetimes. The optimizing compiler tends to use less space for the same set of variables. This is because it’s arranging objects by alignment.

The C++ compiler and library restrict the addresses at which some kinds of data appear. In particular, the address of every int value is always a multiple of 4, whether it’s located on the stack (automatic lifetime), the data segment (static lifetime), or the heap (dynamic lifetime).

A bunch of observations will show you these rules:

These are the alignment restrictions for an x86-64 Linux machine.

These restrictions hold for most x86-64 operating systems, except that on Windows, the long type has size and alignment 4. (The long long type has size and alignment 8 on all x86-64 operating systems.)

Just like every type has a size, every type has an alignment. The alignment of a type T is a number a ≥ 1 such that the address of every object of type T must be a multiple of a. Every object with type T has size sizeof(T) (it occupies sizeof(T) contiguous bytes of memory) and alignment alignof(T) (the address of its first byte is a multiple of alignof(T)). You can also say sizeof(x) and alignof(x) where x is the name of an object or another expression, although alignof applied to an expression is a GCC and Clang extension rather than standard C++.
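A short sketch that checks both properties for arbitrary types. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <inttypes.h>   // uintptr_t

// True if sizeof(T) is a multiple of alignof(T).
template <typename T>
constexpr bool size_is_multiple_of_alignment() {
    return sizeof(T) % alignof(T) == 0;
}

// True if the object's address is a multiple of its type's alignment.
template <typename T>
bool is_aligned(const T& obj) {
    return (uintptr_t) &obj % alignof(T) == 0;
}
```

Both checks should hold for every object the compiler or allocator creates, whatever its lifetime.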

Alignment restrictions can make hardware simpler, and therefore faster. For instance, consider cache blocks. CPUs access memory through a transparent hardware cache. Data moves from primary memory, or RAM (which is large—a couple gigabytes on most laptops—and uses cheaper, slower technology) to the cache in units of 64 or 128 bytes. Those units are always aligned: on a machine with 128-byte cache blocks, the bytes with memory addresses [127, 128, 129, 130] live in two different cache blocks (with addresses [0, 127] and [128, 255]). But the 4 bytes with addresses [4n, 4n+1, 4n+2, 4n+3] always live in the same cache block. (This is true for any small power of two: the 8 bytes with addresses [8n,…,8n+7] always live in the same cache block.) In general, it’s often possible to make a system faster by leveraging restrictions—and here, the CPU hardware can load data faster when it can assume that the data lives in exactly one cache line.

The compiler, library, and operating system all work together to enforce alignment restrictions.

On x86-64 Linux, alignof(T) == sizeof(T) for all fundamental types (the types built in to C: integer types, floating point types, and pointers). But this isn’t always true; on x86-32 Linux, double has size 8 but alignment 4.

It’s possible to construct user-defined types of arbitrary size, but the largest alignment required by a machine is fixed for that machine. C++ lets you find the maximum alignment for a machine with alignof(std::max_align_t) ; on x86-64, this is 16, the alignment of the type long double (and the alignment of some less-commonly-used SIMD “vector” types ).

We now turn to the abstract machine rules for laying out all collections. The sizes and alignments for user-defined types—arrays, structs, and unions—are derived from a couple simple rules or principles. Here they are. The first rule applies to all types.

1. First-member rule. The address of the first member of a collection equals the address of the collection.

Thus, the address of an array is the same as the address of its first element. The address of a struct is the same as the address of the first member of the struct.

The next three rules depend on the class of collection. Every C abstract machine enforces these rules.

2. Array rule. Arrays are laid out sequentially as described above.

3. Struct rule. The second and subsequent members of a struct are laid out in order, with no overlap, subject to alignment constraints.

4. Union rule. All members of a union share the address of the union.

In C, every struct follows the struct rule, but in C++, only simple structs follow the rule. Complicated structs, such as structs with some public and some private members, or structs with virtual functions, can be laid out however the compiler chooses. The typical situation is that C++ compilers for a machine architecture (e.g., “Linux x86-64”) will all agree on a layout procedure for complicated structs. This allows code compiled by different compilers to interoperate.

The next rule defines the operation of the malloc library function.

5. Malloc rule. Any non-null pointer returned by malloc has alignment appropriate for any type. In other words, assuming the allocated size is adequate, the pointer returned from malloc can safely be cast to T* for any T .

Oddly, this holds even for small allocations. The C++ standard (the abstract machine) requires that malloc(1) return a pointer whose alignment is appropriate for any type, including types that don’t fit.
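The malloc rule can be spot-checked by testing a returned address against the maximum fundamental alignment. This is a sketch; `malloc_result_is_max_aligned` is a hypothetical helper, and a passing check on one machine does not prove the rule in general.

```cpp
#include <cassert>
#include <cstddef>    // std::max_align_t
#include <cstdlib>    // std::malloc, std::free
#include <inttypes.h> // uintptr_t

// Allocate `size` bytes and report whether the result is aligned for
// std::max_align_t, the most strictly aligned fundamental type.
bool malloc_result_is_max_aligned(std::size_t size) {
    void* p = std::malloc(size);
    if (!p) {
        return true;    // the rule only constrains non-null results
    }
    bool ok = (uintptr_t) p % alignof(std::max_align_t) == 0;
    std::free(p);
    return ok;
}
```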

And the final rule is not required by the abstract machine, but it’s how sizes and alignments on our machines work.

6. Minimum rule. The sizes and alignments of user-defined types, and the offsets of struct members, are minimized within the constraints of the other rules.

The minimum rule, and the sizes and alignments of basic types, are defined by the x86-64 Linux “ABI” —its Application Binary Interface. This specification standardizes how x86-64 Linux C compilers should behave, and lets users mix and match compilers without problems.

Consequences of the size and alignment rules

From these rules we can derive some interesting consequences.

First, the size of every type is a multiple of its alignment .

To see why, consider an array with two elements. By the array rule, these elements have addresses a and a+sizeof(T) , where a is the address of the array. Both of these addresses contain a T , so they are both a multiple of alignof(T) . That means sizeof(T) is also a multiple of alignof(T) .

We can also characterize the sizes and alignments of different collections .

  • The size of an array of N elements of type T is N * sizeof(T) : the sum of the sizes of its elements. The alignment of the array is alignof(T) .
  • The size of a union is at least the maximum of the sizes of its components (because the union can only hold one component at a time); it is rounded up, if necessary, to a multiple of the union’s alignment. Its alignment is the maximum of the alignments of its components.
  • The size of a struct is at least as big as the sum of the sizes of its components. Its alignment is the maximum of the alignments of its components.

Thus, the alignment of every collection equals the maximum of the alignments of its components.
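These characterizations can be checked with `sizeof` and `alignof`. The types below are hypothetical examples, and the equalities on alignment reflect the minimum rule on common ABIs rather than a guarantee of the language standard.

```cpp
#include <algorithm>   // std::max
#include <cassert>
#include <cstddef>

// Hypothetical collections used only for illustration.
struct pair_struct { char c; double d; };
union  pair_union  { char c; double d; };
using  int_array = int[4];

// Largest alignment among the two component types used above.
constexpr std::size_t max_component_alignment() {
    return std::max(alignof(char), alignof(double));
}
```

On x86-64 Linux, `pair_struct` has size 16 (1 byte for `c`, 7 bytes of padding, 8 bytes for `d`), which is larger than the sum of its component sizes.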

It’s also true that the alignment equals the least common multiple of the alignments of its components. You might have thought lcm was a better answer, but the max is the same as the lcm for every architecture that matters, because all fundamental alignments are powers of two.

The size of a struct might be larger than the sum of the sizes of its components, because of alignment constraints. Since the compiler must lay out struct components in order, and it must obey the components’ alignment constraints, and it must ensure different components occupy disjoint addresses, it must sometimes introduce extra space in structs. Here’s an example: the struct will have 3 bytes of padding after char c , to ensure that int i2 has the correct alignment.
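A sketch of a struct with this shape. The member names `c` and `i2` follow the text; the struct name, the first member, and the helper are assumptions.

```cpp
#include <cassert>
#include <cstddef>   // offsetof, std::size_t

// Hypothetical struct: an int, then a char, then another int.
struct padded_example {
    int i1;
    char c;
    int i2;
};

// Number of padding bytes between the end of `c` and the start of `i2`.
constexpr std::size_t padding_after_c() {
    return offsetof(padded_example, i2)
           - (offsetof(padded_example, c) + sizeof(char));
}
```

With 4-byte, 4-aligned `int`, `padding_after_c()` is 3 and `sizeof(padded_example)` is 12 rather than 9.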

Thanks to padding, reordering struct components can sometimes reduce the total size of a struct. Padding can happen at the end of a struct as well as the middle. Padding can never happen at the start of a struct, however (because of Rule 1).
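A sketch of the reordering effect, using two hypothetical structs with the same members in different orders. The sizes in the comments assume a 4-byte, 4-aligned `int`.

```cpp
#include <cassert>
#include <cstddef>

// Same members, different order.
struct loose_order { char c1; int i; char c2; };   // typically 12 bytes
struct tight_order { int i; char c1; char c2; };   // typically 8 bytes

// Bytes saved by placing the most strictly aligned member first.
constexpr std::size_t bytes_saved() {
    return sizeof(loose_order) - sizeof(tight_order);
}
```

`tight_order` still ends with padding (2 bytes under the assumption above) so that its size remains a multiple of its alignment.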

The rules also imply that the offset of any struct member —which is the difference between the address of the member and the address of the containing struct— is a multiple of the member’s alignment .

To see why, consider a struct s with member m at offset o . The malloc rule says that any pointer returned from malloc is correctly aligned for s . Every pointer returned from malloc is maximally aligned, equalling 16*x for some integer x . The struct rule says that the address of m , which is 16*x + o , is correctly aligned. That means that 16*x + o = alignof(m)*y for some integer y . Divide both sides by a = alignof(m) and you see that 16*x/a + o/a = y . But 16/a is an integer—the maximum alignment is a multiple of every alignment—so 16*x/a is an integer. We can conclude that o/a must also be an integer!

Finally, we can also derive the necessity for padding at the end of structs. (How?)

What happens when an object is uninitialized? The answer depends on its lifetime.

  • static lifetime (e.g., int global; at file scope): The object is initialized to 0.
  • automatic or dynamic lifetime (e.g., int local; in a function, or int* ptr = new int ): The object is uninitialized and reading the object’s value before it is assigned causes undefined behavior.

Compiler hijinks

In C++, most dynamic memory allocation uses special language operators, new and delete , rather than library functions.

Though this seems more complex than the library-function style, it has advantages. A C compiler cannot tell what malloc and free do (especially when they are redefined to debugging versions, as in the problem set), so a C compiler cannot necessarily optimize calls to malloc and free away. But the C++ compiler may assume that all uses of new and delete follow the rules laid down by the abstract machine. That means that if the compiler can prove that an allocation is unnecessary or unused, it is free to remove that allocation!

For example, we compiled this program in the problem set environment (based on test003.cc ):

The optimizing C++ compiler removes all calls to new and delete , leaving only the call to m61_printstatistics() ! (For instance, try objdump -d testXXX to look at the compiled x86-64 instructions.) This is valid because the compiler is explicitly allowed to eliminate unused allocations, and here, since the ptrs variable is local and doesn’t escape main , all allocations are unused. The C compiler cannot perform this useful transformation. (But the C compiler can do other cool things, such as unroll the loops .)

One of C’s more interesting choices is that it explicitly relates pointers and arrays. Although arrays are laid out in memory in a specific way, they generally behave like pointers when they are used. This property probably arose from C’s desire to explicitly model memory as an array of bytes, and it has beautiful and confounding effects.

We’ve already seen one of these effects. The hexdump function has this signature (arguments and return type):

But we can just pass an array as argument to hexdump :

When used in an expression like this—here, as an argument—the array magically changes into a pointer to its first element. The above call has the same meaning as this:

C programmers transition between arrays and pointers very naturally.

A confounding effect is that unlike all other types, in C arrays are passed to functions by reference rather than by value. C is a call-by-value language except for arrays. This means that all function arguments and return values are copied, so that parameter modifications inside a function do not affect the objects passed by the caller, except for arrays. For instance:

```cpp
#include <cstdio>

void f(int a[2]) {
    a[0] = 1;
}

int main() {
    int x[2] = {100, 101};
    f(x);
    printf("%d\n", x[0]);   // prints 1!
}
```

If you don’t like this behavior, you can get around it by using a struct or a C++ std::array.

```cpp
#include <array>
#include <cstdio>

struct array1 {
    int a[2];
};

void f1(array1 arg) {
    arg.a[0] = 1;
}

void f2(std::array<int, 2> a) {
    a[0] = 1;
}

int main() {
    array1 x = {{100, 101}};
    f1(x);
    printf("%d\n", x.a[0]);   // prints 100

    std::array<int, 2> x2 = {100, 101};
    f2(x2);
    printf("%d\n", x2[0]);    // prints 100
}
```

C++ extends the logic of this array–pointer correspondence to support arithmetic on pointers as well.

Pointer arithmetic rule. In the C abstract machine, arithmetic on pointers produces the same result as arithmetic on the corresponding array indexes.

Specifically, consider an array T a[n] and pointers T* p1 = &a[i] and T* p2 = &a[j] . Then:

Equality : p1 == p2 if and only if (iff) p1 and p2 point to the same address, which happens iff i == j .

Inequality : Similarly, p1 != p2 iff i != j .

Less-than : p1 < p2 iff i < j .

Also, p1 <= p2 iff i <= j ; and p1 > p2 iff i > j ; and p1 >= p2 iff i >= j .

Pointer difference : What should p1 - p2 mean? Using array indexes as the basis, p1 - p2 == i - j . (But the type of the difference is always ptrdiff_t , which on x86-64 is long , the signed version of size_t .)

Addition : p1 + k (where k is an integer type) equals the pointer &a[i + k] . ( k + p1 returns the same thing.)

Subtraction : p1 - k equals &a[i - k] .

Increment and decrement : ++p1 means p1 = p1 + 1 , which means p1 = &a[i + 1] . Similarly, --p1 means p1 = &a[i - 1] . (There are also postfix versions, p1++ and p1-- , but C++ style prefers the prefix versions.)

No other arithmetic operations on pointers are allowed. You can’t multiply pointers, for example. (You can multiply addresses by casting the pointers to the address type, uintptr_t —so (uintptr_t) p1 * (uintptr_t) p2 —but why would you?)
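The correspondence between pointer arithmetic and index arithmetic can be checked mechanically. In this sketch, `pointer_arithmetic_matches_indexes` is a hypothetical helper; it forms pointers with `a + i` so that the one-past-the-end position `i == n` is allowed.

```cpp
#include <cassert>
#include <cstddef>

// Check the pointer arithmetic rule on an n-element array for indexes
// 0 <= i, j <= n (n itself names the one-past-the-end position).
bool pointer_arithmetic_matches_indexes(int* a, std::ptrdiff_t n,
                                        std::ptrdiff_t i, std::ptrdiff_t j) {
    assert(0 <= i && i <= n && 0 <= j && j <= n);
    int* p1 = a + i;
    int* p2 = a + j;
    return (p1 == p2) == (i == j)      // equality
        && (p1 != p2) == (i != j)      // inequality
        && (p1 < p2) == (i < j)        // less-than
        && (p1 - p2) == (i - j)        // pointer difference
        && (p1 + (j - i)) == p2;       // addition of an integer
}
```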

From pointers to iterators

Let’s write a function that can sum all the integers in an array.

This function can compute the sum of the elements of any int array. But because of the pointer–array relationship, its a argument is really a pointer . That allows us to call it with subarrays as well as with whole arrays. For instance:
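A minimal sketch of such a function; the exact signature is an assumption, since only its behavior is described here.

```cpp
#include <cassert>

// Sum the first `size` elements of the array starting at `a`.
// Although the parameter is written as an array, its type is really int*.
int sum(int a[], int size) {
    int total = 0;
    for (int i = 0; i != size; ++i) {
        total += a[i];
    }
    return total;
}
```

Given `int a[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}`, the call `sum(a, 10)` covers the whole array, `sum(&a[0], 5)` covers the first half, and `sum(a + 5, 5)` covers the second half.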

This way of thinking about arrays naturally leads to a style that avoids sizes entirely, using instead a sentinel or boundary argument that defines the end of the interesting part of the array.

These expressions compute the same sums as the above:

Note that the data from first to last forms a half-open range. In mathematical notation, we care about elements in the range [first, last): the element pointed to by first is included (if it exists), but the element pointed to by last is not. Half-open ranges give us a simple and clear way to describe empty ranges, such as zero-element arrays: if first == last, then the range is empty.
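A sketch of the boundary-pointer style, assuming parameters named `first` and `last` as in the text:

```cpp
#include <cassert>

// Sum the elements in the half-open range [first, last).
int sum(const int* first, const int* last) {
    int total = 0;
    while (first != last) {
        total += *first;   // `last` itself is never dereferenced
        ++first;
    }
    return total;
}
```

For a ten-element array `a`, `sum(a, a + 10)` sums everything, `sum(a, a + 5)` and `sum(a + 5, a + 10)` sum the two halves, and `sum(a, a)` sums an empty range.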

Note that given a ten-element array a , the pointer a + 10 can be formed and compared, but must not be dereferenced—the element a[10] does not exist. The C/C++ abstract machines allow users to form pointers to the “one-past-the-end” boundary elements of arrays, but users must not dereference such pointers.

So in C, two pointers naturally express a range of an array. The C++ standard template library, or STL, brilliantly abstracts this pointer notion to allow two iterators , which are pointer-like objects, to express a range of any standard data structure—an array, a vector, a hash table, a balanced tree, whatever. This version of sum works for any container of int s; notice how little it changed:

Some example uses:
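A sketch of an iterator-based version and how it might be called. The template parameter name `It` is an assumption; the loop body is the same as in the pointer version.

```cpp
#include <cassert>
#include <list>
#include <vector>

// Sum the ints in the half-open range [first, last), where `It` is any
// iterator type whose elements convert to int (raw pointers included).
template <typename It>
int sum(It first, It last) {
    int total = 0;
    while (first != last) {
        total += *first;
        ++first;
    }
    return total;
}
```

The same function accepts `v.begin(), v.end()` for a `std::vector<int>`, `l.begin(), l.end()` for a `std::list<int>`, and `a, a + n` for a plain array.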

Addresses vs. pointers

What’s the difference between these expressions? (Again, a is an array of type T , and p1 == &a[i] and p2 == &a[j] .)

The first expression is defined analogously to index arithmetic, so d1 == i - j . But the second expression performs the arithmetic on the addresses corresponding to those pointers. We will expect d2 to equal sizeof(T) * d1 . Always be aware of which kind of arithmetic you’re using. Generally arithmetic on pointers should not involve sizeof , since the sizeof is included automatically according to the abstract machine; but arithmetic on addresses almost always should involve sizeof .
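The two kinds of difference can be compared side by side. In this sketch the helper names are hypothetical, and the address version is only exercised with `p1` at or after `p2` so that the unsigned subtraction does not wrap.

```cpp
#include <cassert>
#include <cstddef>    // std::ptrdiff_t
#include <inttypes.h> // uintptr_t

// Difference in elements: arithmetic on pointers.
template <typename T>
std::ptrdiff_t element_difference(T* p1, T* p2) {
    return p1 - p2;
}

// Difference in bytes: arithmetic on the corresponding addresses.
// Assumes p1 is at or after p2.
template <typename T>
std::ptrdiff_t address_difference(T* p1, T* p2) {
    return (std::ptrdiff_t) ((uintptr_t) p1 - (uintptr_t) p2);
}
```

For `double a[8]`, the element difference between `&a[5]` and `&a[2]` is 3, while the address difference is `3 * sizeof(double)`.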

Although C++ is a low-level language, the abstract machine is surprisingly strict about which pointers may be formed and how they can be used. Violate the rules and you’re in hell because you have invoked the dreaded undefined behavior .

Given an array a[N] of N elements of type T :

Forming a pointer &a[i] (or a + i ) with 0 ≤ i ≤ N is safe.

Forming a pointer &a[i] with i < 0 or i > N causes undefined behavior.

Dereferencing a pointer &a[i] with 0 ≤ i < N is safe.

Dereferencing a pointer &a[i] with i < 0 or i ≥ N causes undefined behavior.

(For the purposes of these rules, objects that are not arrays count as single-element arrays. So given T x , we can safely form &x and &x + 1 and dereference &x .)

What “undefined behavior” means is horrible. A program that executes undefined behavior is erroneous. But the compiler need not catch the error. In fact, the abstract machine says anything goes : undefined behavior is “behavior … for which this International Standard imposes no requirements.” “Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).” Other possible behaviors include allowing hackers from the moon to steal all of a program’s data, take it over, and force it to delete the hard drive on which it is running. Once undefined behavior executes, a program may do anything, including making demons fly out of the programmer’s nose.

Pointer arithmetic, and even pointer comparisons, are also affected by undefined behavior. It’s undefined to go beyond an array’s bounds using pointer arithmetic. And pointers may be compared for equality or inequality even if they point to different arrays or objects, but if you try to compare different arrays via less-than, like this:

that causes undefined behavior.

If you really want to compare pointers that might be to different arrays—for instance, you’re writing a hash function for arbitrary pointers—cast them to uintptr_t first.
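A minimal sketch of that cast-first approach; `address_less` is a hypothetical helper name.

```cpp
#include <cassert>
#include <inttypes.h>   // uintptr_t

// Order two arbitrary pointers by their integer address values instead of
// applying `<` to the pointers themselves.
bool address_less(const void* p, const void* q) {
    return (uintptr_t) p < (uintptr_t) q;
}
```

Because the comparison happens on integers, it is meaningful even when `p` and `q` point into unrelated objects.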

Undefined behavior and optimization

A program that causes undefined behavior is not a C++ program . The abstract machine says that a C++ program, by definition, is a program whose behavior is always defined. The C++ compiler is allowed to assume that its input is a C++ program. (Obviously!) So the compiler can assume that its input program will never cause undefined behavior. Thus, since undefined behavior is “impossible,” if the compiler can prove that a condition would cause undefined behavior later, it can assume that condition will never occur.

Consider this program:

If we supply a value equal to (char*) -1 , we’re likely to see output like this:

with no assertion failure! But that’s an apparently impossible result. The printout can only happen if x + 1 > x (otherwise, the assertion will fail and stop the printout). But x + 1 , which equals 0 , is less than x , which is the largest 8-byte value!

The impossible happens because of undefined behavior reasoning. When the compiler sees an expression like x + 1 > x (with x a pointer), it can reason this way:

“Ah, x + 1 . This must be a pointer into the same array as x (or it might be a boundary pointer just past that array, or just past the non-array object x ). This must be so because forming any other pointer would cause undefined behavior.

“The pointer comparison is the same as an index comparison. x + 1 > x means the same thing as &x[1] > &x[0] . But that holds iff 1 > 0 .

“In my infinite wisdom, I know that 1 > 0 . Thus x + 1 > x always holds, and the assertion will never fail.

“My job is to make this code run fast. The fastest code is code that’s not there. This assertion will never fail—might as well remove it!”

Integer undefined behavior

Arithmetic on signed integers also has important undefined behaviors. Signed integer arithmetic must never overflow. That is, the compiler may assume that the mathematical result of any signed arithmetic operation, such as x + y (with x and y both int ), can be represented inside the relevant type. It causes undefined behavior, therefore, to add 1 to the maximum positive integer. (The ubexplore.cc program demonstrates how this can produce impossible results, as with pointers.)

Arithmetic on unsigned integers is much safer with respect to undefined behavior. Unsigned integers are defined to perform arithmetic modulo 2^B, where B is the number of bits in the type. This means that if you add 1 to the maximum unsigned integer, the result will always be zero.
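The contrast can be sketched as follows: unsigned increment may simply wrap, while signed addition has to be checked before it is performed. Both helper names are hypothetical.

```cpp
#include <cassert>
#include <climits>   // INT_MAX, INT_MIN, UINT_MAX

// Unsigned addition wraps around, so this is always well defined.
unsigned wrapping_increment(unsigned x) {
    return x + 1u;
}

// Signed addition must not overflow, so test the operands first.
// Returns true and stores the sum in *result if x + y fits in an int.
bool checked_add(int x, int y, int* result) {
    if ((y > 0 && x > INT_MAX - y) || (y < 0 && x < INT_MIN - y)) {
        return false;   // the mathematical sum is not representable
    }
    *result = x + y;
    return true;
}
```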

Dividing an integer by zero causes undefined behavior whether or not the integer is signed.

Sanitizers, which in our makefiles are turned on by supplying SAN=1 , can catch many undefined behaviors as soon as they happen. Sanitizers are built in to the compiler itself; a sanitizer involves cooperation between the compiler and the language runtime. This has the major performance advantage that the compiler introduces exactly the required checks, and the optimizer can then use its normal analyses to remove redundant checks.

That said, undefined behavior checking can still be slow. Undefined behavior allows compilers to make assumptions about input values, and those assumptions can directly translate to faster code. Turning on undefined behavior checking can make some benchmark programs run 30% slower.

Signed integer undefined behavior

File cs61-lectures/datarep5/ubexplore2.cc contains the following program.

What will be printed if we run the program with ./ubexplore2 0x7ffffffe 0x7fffffff ?

0x7fffffff is the largest positive value that can be represented by type int. Adding one to this value yields 0x80000000. In two's complement representation this is the smallest negative number represented by type int.

Assuming that the program behaves this way, then the loop exit condition i > n2 can never be met, and the program should run (and print out numbers) forever.

However, if we run the optimized version of the program, it prints only two numbers and exits:

The unoptimized program does print forever and never exits.

What’s going on here? We need to look at the compiled assembly of the program with and without optimization (via objdump -S ).

The unoptimized version basically looks like this:

This is a pretty direct translation of the loop.

The optimized version, though, does it differently. As always, the optimizer has its own ideas. (Your compiler may produce different results!)

The compiler changed the source’s less than or equal to comparison, i <= n2, into a not equal to comparison in the executable, i != n2 + 1 (in both cases using signed computer arithmetic, i.e., modulo 2^32)! The comparison i <= n2 will always return true when n2 == 0x7FFFFFFF, the maximum signed integer, so the loop goes on forever. But the i != n2 + 1 comparison does not always return true when n2 == 0x7FFFFFFF: when i wraps around to 0x80000000 (the smallest negative integer), then i equals n2 + 1 (which also wrapped), and the loop stops.

Why did the compiler make this transformation? In the original loop, the step-6 jump is immediately followed by another comparison and jump in steps 1 and 2. The processor jumps all over the place, which can confuse its prediction circuitry and slow down performance. In the transformed loop, the step-7 jump is never followed by a comparison and jump; instead, step 7 goes back to step 4, which always prints the current number. This more streamlined control flow is easier for the processor to make fast.

But the streamlined control flow is only a valid substitution under the assumption that the addition n2 + 1 never overflows . Luckily (sort of), signed arithmetic overflow causes undefined behavior, so the compiler is totally justified in making that assumption!

Programs based on ubexplore2 have demonstrated undefined behavior differences for years, even as the precise reasons why have changed. In some earlier compilers, we found that the optimizer just upgraded the int s to long s—arithmetic on long s is just as fast on x86-64 as arithmetic on int s, since x86-64 is a 64-bit architecture, and sometimes using long s for everything lets the compiler avoid conversions back and forth. The ubexplore2l program demonstrates this form of transformation: since the loop variable is added to a long counter, the compiler opportunistically upgrades i to long as well. This transformation is also only valid under the assumption that i + 1 will not overflow—which it can’t, because of undefined behavior.

Using an unsigned type prevents all this undefined behavior, because arithmetic overflow on unsigned integers is well defined in C/C++. The ubexplore2u.cc file uses an unsigned loop index and comparison, and ./ubexplore2u and ./ubexplore2u.noopt behave exactly the same (though you have to give arguments like ./ubexplore2u 0xfffffffe 0xffffffff to see the overflow).

Computer arithmetic and bitwise operations

Basic bitwise operators.

Computers offer not only the usual arithmetic operators like + and - , but also a set of bitwise operators. The basic ones are & (and), | (or), ^ (xor/exclusive or), and the unary operator ~ (complement). In truth table form:

In C or C++, these operators work on integers. But they work bitwise: the result of an operation is determined by applying the operation independently at each bit position. Here’s how to compute 12 & 4 in 4-bit unsigned arithmetic:
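A sketch of these operators restricted to 4 bits, written as 12 = 0b1100 and 4 = 0b0100. The helper names are hypothetical; each one masks the result back to 4 bits.

```cpp
#include <cassert>

// Each helper applies one bitwise operator and keeps only the low 4 bits.
constexpr unsigned and4(unsigned a, unsigned b) { return (a & b) & 0xFu; }
constexpr unsigned or4(unsigned a, unsigned b)  { return (a | b) & 0xFu; }
constexpr unsigned xor4(unsigned a, unsigned b) { return (a ^ b) & 0xFu; }
constexpr unsigned not4(unsigned a)             { return ~a & 0xFu; }
```

Bit by bit, `0b1100 & 0b0100` keeps only the positions where both inputs have a 1, giving `0b0100`, which is 4.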

These basic bitwise operators simplify certain important arithmetic operations. For example, (x & (x - 1)) == 0 tests whether x is zero or a power of 2.
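A small sketch of that test, assuming a 32-bit unsigned argument (the function name is ours). Subtracting 1 flips the lowest set bit and every bit below it, so the AND is zero only when at most one bit was set.

```cpp
#include <cstdint>

// True when x is zero or has exactly one bit set (that is, a power of 2).
bool zero_or_power_of_two(std::uint32_t x) {
    return (x & (x - 1u)) == 0u;
}
```

For x = 12 (1100), x - 1 is 11 (1011) and the AND is 1000, which is nonzero, so 12 is rejected.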

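Negation of signed integers can also be expressed using a bitwise operator: -x == ~x + 1. This is in fact how we define two's complement representation. We can verify that x and (-x) do add up to zero under this representation: x and ~x have opposite bits in every position, so x + ~x is all 1 bits, and adding 1 to that carries out of the most significant bit and leaves 0.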
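The identity can be checked with a short sketch. It uses 32-bit unsigned integers so that the wraparound is well defined; the function names are illustrative assumptions.

```cpp
#include <cstdint>

// Two's complement negation expressed with the complement operator.
std::uint32_t negate(std::uint32_t x) {
    return ~x + 1u;
}

// x + (~x + 1) wraps around to zero for every 32-bit x.
bool sums_to_zero(std::uint32_t x) {
    return x + negate(x) == 0u;
}
```

For example, negate(1) is 0xffffffff, the bit pattern that a 32-bit two's complement int interprets as -1.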

Bitwise "and" ( & ) can help with modular arithmetic. For example, for unsigned x, x % 32 == (x & 31). We essentially "mask off", or clear, higher order bits to do arithmetic modulo a power of 2. The analogous trick works in any base. For example, in decimal, the fastest way to compute x % 100 is to take just the two least significant digits of x.
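A minimal sketch of the masking trick, assuming an unsigned argument (for a negative signed x, the % operator in C++ yields a non-positive result, so the two expressions would differ). The name mod32 is ours.

```cpp
#include <cstdint>

// Keeps only the five lowest bits of x, which is x modulo 32.
std::uint32_t mod32(std::uint32_t x) {
    return x & 31u;
}
```

Here 31 is 11111 in binary, so the mask clears every bit that represents a multiple of 32.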

Bitwise shift of unsigned integer

x << i appends i zero bits starting at the least significant bit of x. High order bits that don't fit in the integer are thrown out. For example, assuming 4-bit unsigned integers, 0110 << 1 is 1100, and 0110 << 2 is 1000 (the leading 1 no longer fits and is lost).

Similarly, x >> i inserts i zero bits at the most significant end of x. Lower bits are thrown out. With 4-bit unsigned integers, 0110 >> 1 is 0011, and 0110 >> 2 is 0001.
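Both shifts can be tried out on 4-bit values with std::bitset<4>, which discards bits shifted past either end. The helper names shl4 and shr4 are assumptions for this sketch.

```cpp
#include <bitset>
#include <string>

// Left shift of a 4-bit value: zeros enter at the low end, high bits fall off.
std::string shl4(unsigned x, unsigned i) {
    return (std::bitset<4>(x) << i).to_string();
}

// Right shift of a 4-bit value: zeros enter at the high end, low bits fall off.
std::string shr4(unsigned x, unsigned i) {
    return (std::bitset<4>(x) >> i).to_string();
}
```

With x = 6 (0110), shl4(6, 2) gives "1000" and shr4(6, 2) gives "0001".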

Bitwise shift helps with division and multiplication. For example, for unsigned x, x << 3 computes x * 8 and x >> 3 computes x / 8 (rounding down).

A modern compiler can optimize y = x * 66 into y = (x << 6) + (x << 1) .
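A sketch of these identities for 32-bit unsigned integers (function names are ours). Since 66 = 64 + 2 = 2^6 + 2^1, multiplying by 66 is the sum of two shifts.

```cpp
#include <cstdint>

// x * 66 written as shifts and an add: 66 = 2^6 + 2^1.
std::uint32_t times66(std::uint32_t x) {
    return (x << 6) + (x << 1);
}

// Multiplication and division by 8 = 2^3 as shifts.
std::uint32_t times8(std::uint32_t x) { return x << 3; }
std::uint32_t div8(std::uint32_t x) { return x >> 3; }
```

There is normally no need to write this by hand; the point is that the compiler is free to pick whichever form is faster on the target processor.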

Bitwise operations also allow us to treat bits within an integer separately. This can be useful for "options".

For example, when we call a function to open a file, we have a lot of options:

  • Open for reading?
  • Open for writing?
  • Read from the end?
  • Optimize for writing?

We have a lot of true/false options.

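One bad way to implement this is to have the function take a bunch of arguments, one argument for each option. Each call then becomes a long run of bare true/false values, along the lines of open_file(name, true, false, false, true) (a hypothetical signature, used here only for illustration).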

The long list of arguments slows down the function call, and one can also easily lose track of the meaning of the individual true/false values passed in.

A cheaper way to achieve this is to use a single integer to represent all the options. Have each option defined as a power of 2, and simply | (or) them together and pass them as a single integer.

Flags are usually defined as powers of 2 so we set one bit at a time for each flag. It is less common but still possible to define a combination flag that is not a power of 2, so that it sets multiple bits in one go.
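A minimal sketch of the flag approach follows. The flag names and values are hypothetical and chosen only to mirror the options listed above; the POSIX open() call uses the same pattern, for instance O_WRONLY | O_CREAT | O_TRUNC.

```cpp
#include <cstdint>

// Hypothetical option flags: each is a distinct power of 2, so each owns one bit.
constexpr std::uint32_t OPEN_READ   = 1u << 0;
constexpr std::uint32_t OPEN_WRITE  = 1u << 1;
constexpr std::uint32_t OPEN_APPEND = 1u << 2;

// A combination flag that is not a power of 2: it sets two bits in one go.
constexpr std::uint32_t OPEN_READWRITE = OPEN_READ | OPEN_WRITE;

// True when every bit of `flag` is set in `flags`.
bool has_flag(std::uint32_t flags, std::uint32_t flag) {
    return (flags & flag) == flag;
}

// Returns `flags` with the bits of `flag` cleared.
std::uint32_t clear_flag(std::uint32_t flags, std::uint32_t flag) {
    return flags & ~flag;
}
```

A caller would pass something like OPEN_READ | OPEN_APPEND as a single argument, and the callee would test individual options with has_flag.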

File cs61-lectures/datarep5/mb-driver.cc contains a memory allocation benchmark. The core of the benchmark looks like this:

The benchmark tests the performance of the memnode_arena::allocate() and memnode_arena::deallocate() functions. In the handout code, these functions do the same thing as new memnode and delete memnode: they are wrappers for malloc and free. The benchmark allocates 4096 memnode objects, then frees and reallocates them noperations times, and finally frees all of them.

We only allocate memnodes, and all memnodes are of the same size, so we don't need metadata that keeps track of the size of each allocation. Furthermore, since all dynamically allocated data is freed at the end of the function, we don't really need to return memory to the system allocator on each individual memnode_free() call. We can simply reuse this memory during the function and return all of it to the system at once when the function exits.

If we run the benchmark with 100000000 allocations, and use the system malloc() and free() functions to implement the memnode allocator, the benchmark finishes in 0.908 seconds.

Our alternative implementation of the allocator can finish in 0.355 seconds, beating the heavily optimized system allocator by a factor of more than 2.5. We will reveal how we achieved this in the next lecture.

We continue our exploration with the memnode allocation benchmark introduced in the last lecture.

File cs61-lectures/datarep6/mb-malloc.cc contains a version of the benchmark using the system new and delete operators.

In this function we allocate an array of 4096 pointers to memnodes, which occupies 2^3 * 2^12 = 2^15 bytes on the stack. We then allocate 4096 memnodes. Our memnode is defined like this:

Each memnode contains a std::string object and an unsigned integer. Each std::string object internally contains a pointer that points to a character array in the heap. Therefore, every time we create a new memnode, we need 2 allocations: one to allocate the memnode itself, and another one performed internally by the std::string object when we initialize/assign a string value to it.

Every time we deallocate a memnode by calling delete, we also delete the std::string object, and the string object knows that it should deallocate the heap character array it internally maintains. So there are also 2 deallocations occurring each time we free a memnode.

We make the benchmark return a seemingly meaningless result to prevent an aggressive compiler from optimizing everything away. We also use this result to check that our subsequent optimizations to the allocator are correct: they must generate the same result.

This version of the benchmark, using the system allocator, finishes in 0.335 seconds. Not bad at all.

Spoiler alert: We can do 15x better than this.

1st optimization: std::string

We only deal with one file name, namely "datarep/mb-filename.cc", which is constant throughout the program for all memnodes. It's also a string literal, which means it is a constant string with static lifetime. Why can't we simply use a const char* in place of the std::string and let the pointer point to the static constant string? This saves us the internal allocation/deallocation performed by std::string every time we initialize/delete a string.

The fix is easy, we simply change the memnode definition:

This version of the benchmark now finishes in 0.143 seconds, a 2x improvement over the original benchmark. This 2x improvement is consistent with the 2x reduction in the number of allocations/deallocations mentioned earlier.

You may ask why people still use std::string if it involves an additional allocation and is slower than const char*, as shown in this benchmark. std::string is much more flexible in that it also deals with data that doesn't have static lifetime, such as input from a user or data the program receives over the network. In short, when the program deals with strings that are not constant, heap data is likely to be very useful, and std::string provides facilities to conveniently handle on-heap data.

2nd optimization: the system allocator

We still use the system allocator to allocate/deallocate memnodes. The system allocator is a general-purpose allocator, which means it must handle allocation requests of all sizes. Such general-purpose designs usually come with a compromise on performance. Since we are only allocating memnodes, which are fairly small objects (and all have the same size), we can build a special-purpose allocator just for them.

In cs61-lectures/datarep5/mb2.cc , we actually implement a special-purpose allocator for memnode s:

This allocator maintains a free list (a C++ vector) of freed memnodes. allocate() simply pops a memnode off the free list if there is one, and deallocate() simply puts the memnode on the free list. This free list serves as a buffer between the system allocator and the benchmark function, so that the system allocator is invoked less frequently. In fact, in the benchmark, the system allocator is only invoked 4096 times, when it initializes the pointer array. That's a huge reduction because all 10 million "recycle" operations in the middle now don't involve the system allocator.
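The course's actual code is not reproduced here, so the following is only a sketch of an allocator with the behavior just described. The memnode fields (file, line) are assumed names for the const char* and unsigned members mentioned earlier.

```cpp
#include <vector>

// Simplified stand-in for the benchmark's memnode; field names are assumptions.
struct memnode {
    const char* file;
    unsigned line;
};

// Sketch of an arena that recycles freed memnodes through a std::vector free list.
// Like the version described above, it never calls delete, so it leaks.
struct memnode_arena {
    std::vector<memnode*> free_list;

    memnode* allocate() {
        if (free_list.empty()) {
            return new memnode;          // fall back to the system allocator
        }
        memnode* n = free_list.back();   // reuse the most recently freed node
        free_list.pop_back();
        return n;
    }

    void deallocate(memnode* n) {
        free_list.push_back(n);          // keep the node for later reuse
    }
};
```

Once the free list is warm, a deallocate() followed by an allocate() hands back the same node without touching the system allocator.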

With this special-purpose allocator we can finish the benchmark in 0.057 seconds, another 2.5x improvement.

However this allocator now leaks memory: it never actually calls delete ! Let's fix this by letting it also keep track of all allocated memnode s. The modified definition of memnode_arena now looks like this:

With the updated allocator we simply need to invoke arena.destroy_all() at the end of the function to fix the memory leak. And we don't even need to invoke this method manually! We can use the C++ destructor for the memnode_arena struct, defined as ~memnode_arena() in the code above, which is automatically called when our arena object goes out of scope. We simply make the destructor invoke the destroy_all() method, and we are all set.
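One way this could look is sketched below, under the same assumptions as before (the memnode field names are ours, and this is not the course's actual definition). Copying is disabled so that two arenas can never delete the same nodes.

```cpp
#include <vector>

// Simplified stand-in for the benchmark's memnode; field names are assumptions.
struct memnode {
    const char* file;
    unsigned line;
};

struct memnode_arena {
    std::vector<memnode*> allocated;   // every node obtained from the system
    std::vector<memnode*> free_list;   // nodes currently available for reuse

    memnode_arena() = default;
    memnode_arena(const memnode_arena&) = delete;
    memnode_arena& operator=(const memnode_arena&) = delete;

    // Runs automatically when the arena goes out of scope.
    ~memnode_arena() {
        destroy_all();
    }

    // Returns every node to the system allocator exactly once.
    void destroy_all() {
        for (memnode* n : allocated) {
            delete n;
        }
        allocated.clear();
        free_list.clear();
    }

    memnode* allocate() {
        if (free_list.empty()) {
            memnode* n = new memnode;
            allocated.push_back(n);    // remember it so destroy_all() can free it
            return n;
        }
        memnode* n = free_list.back();
        free_list.pop_back();
        return n;
    }

    void deallocate(memnode* n) {
        free_list.push_back(n);
    }
};
```

Because destroy_all() clears both lists, calling it manually and then letting the destructor run does not free anything twice.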

Fixing the leak doesn't appear to affect performance at all. This is because the overhead added by tracking the allocated list and calling delete only affects the initial allocation of the 4096 memnode* pointers in the array, plus the cleanup at the very end. These 8192 additional operations are a relatively small number compared to the 10 million recycle operations, so the added overhead is hardly noticeable.

Spoiler alert: We can improve this by another factor of 2.

3rd optimization: std::vector

In our special-purpose allocator memnode_arena, we maintain an allocated list and a free list, both using C++ std::vectors. std::vectors are dynamic arrays, and like std::string they involve an additional level of indirection and store the actual array in the heap. We don't access the allocated list during the "recycling" part of the benchmark (which takes the bulk of the benchmark time, as we showed earlier), so the allocated list is probably not our bottleneck. We do, however, add and remove elements from the free list for each recycle operation, and the indirection introduced by the std::vector here may actually be our bottleneck. Let's find out.

Instead of using a std::vector, we could use a linked list of all free memnodes for the actual free list. We will need to include some extra metadata in the memnode to store pointers for this linked list. However, unlike in the debugging allocator pset, in a free list we don't need to store this metadata in addition to actual memnode data: the memnode is free, and not in use, so we can reuse its memory, using a union:

We then maintain the free list like this:

Compared to the std::vector free list, this free list always points directly to an available memnode when it is not empty (free_list != nullptr), without going through any indirection. In the std::vector free list one would first have to go into the heap to access the actual array containing pointers to free memnodes, and then access the memnode itself.
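A sketch of a union-based free list along these lines is shown below. The type name freeable_memnode, the member next_free, and the memnode fields are assumptions made for illustration, not the course's definitions.

```cpp
#include <vector>

// Simplified stand-in for the benchmark's memnode; field names are assumptions.
struct memnode {
    const char* file;
    unsigned line;
};

// A slot holds either a live memnode or, while free, the link to the next free slot.
union freeable_memnode {
    memnode n;
    freeable_memnode* next_free;
};

struct memnode_arena {
    std::vector<freeable_memnode*> allocated;  // all slots, for final cleanup
    freeable_memnode* free_list = nullptr;     // head of the linked free list

    memnode_arena() = default;
    memnode_arena(const memnode_arena&) = delete;
    memnode_arena& operator=(const memnode_arena&) = delete;
    ~memnode_arena() { destroy_all(); }

    void destroy_all() {
        for (freeable_memnode* f : allocated) {
            delete f;
        }
        allocated.clear();
        free_list = nullptr;
    }

    memnode* allocate() {
        freeable_memnode* f;
        if (free_list) {
            f = free_list;               // pop the head of the free list
            free_list = f->next_free;
        } else {
            f = new freeable_memnode;    // no free slot: ask the system
            allocated.push_back(f);
        }
        f->n = memnode{};                // make the memnode the active union member
        return &f->n;
    }

    void deallocate(memnode* n) {
        // A union and its members share an address, so the slot can be recovered.
        freeable_memnode* f = reinterpret_cast<freeable_memnode*>(n);
        f->next_free = free_list;        // push the slot onto the free list
        free_list = f;
    }
};
```

The free list costs no extra memory here: the link pointer lives in the same bytes that hold the memnode while it is in use.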

With this change we can now finish the benchmark in under 0.03 seconds! Another 2x improvement over the previous one!

Compared to the benchmark with the system allocator (which finished in 0.335 seconds), we managed to achieve a speedup of nearly 15x with arena allocation.


Data Representation in Computer Science

Dive deep into the realm of Computer Science with this comprehensive guide about data representation. Data representation, a fundamental concept in computing, refers to the various ways that information can be expressed digitally. The interpretation of this data plays a critical role in decision-making procedures in businesses and scientific research. Gain an understanding of binary data representation, the backbone of digital computing.

Binary data representation uses a system of numerical notation that has just two possible states represented by 0 and 1 (also known as 'binary digits' or 'bits'). Grasp the practical applications of binary data representation and explore its benefits.

Finally, explore the vast world of data model representation. Different types of data models offer a variety of ways to organise data in databases. Understand the strategic role of data models in data representation, and explore how they are used to design efficient database systems. This comprehensive guide positions you at the heart of data representation in Computer Science.

Understanding Data Representation in Computer Science

In the realm of Computer Science, data representation plays a paramount role. It refers to the methods or techniques used to represent, or express information in a computer system. This encompasses everything from text and numbers to images, audio, and beyond.

Basic Concepts of Data Representation

Data representation in computer science is about how a computer interprets and functions with different types of information. Different information types require different representation techniques. For instance, a video will be represented differently than a text document.

When working with various forms of data, it is important to grasp a fundamental understanding of:

  • Binary system
  • Bits and Bytes
  • Number systems: decimal, hexadecimal
  • Character encoding: ASCII, Unicode

Data in a computer system is represented in binary format, as a sequence of 0s and 1s, denoting 'off' and 'on' states respectively. The smallest component of this binary representation is known as a bit, which stands for 'binary digit'.

A byte, on the other hand, generally consists of 8 bits. The decimal and hexadecimal number systems, together with character encodings like ASCII and Unicode, are essential for expressing numbers and text in a computer system.

Role of Data Representation in Computer Science

Data Representation is the foundation of computing systems and affects both hardware and software designs. It enables both logic and arithmetic operations to be performed in the binary number system, on which computers are based.

An illustrative example of the importance of data representation is when you write a text document. The characters you type are represented in ASCII code - a set of binary numbers. Each number is sent to the memory, represented as electrical signals; everything you see on your screen is a representation of the underlying binary data.

Computing operations and functions, like searching, sorting or adding, rely heavily on appropriate data representation for efficient execution. Also, computer programming languages and compilers require a deep understanding of data representation to successfully interpret and execute commands.

As technology evolves, so too do our data representation techniques. Quantum computing, for example, uses quantum bits or "qubits". A qubit can represent 0, 1, or a combination of both at once, thanks to the phenomenon of quantum superposition.

Types of Data Representation

In computer systems, various types of data representation techniques are utilized:

Numbers can be represented in real, integer, and rational formats. Text is represented by using different types of encodings, such as ASCII or Unicode. Images can be represented in various formats like JPG, PNG, or GIF, each having its specific rendering algorithm and compression techniques.

Tables are another important way of data representation, especially in the realm of databases.

This approach is particularly effective in storing structured data, making information readily accessible and easy to handle. By understanding the principles of data representation, you can better appreciate the complexity and sophistication behind our everyday interactions with technology.

Data Representation and Interpretation

To delve deeper into the world of Computer Science, it is essential to study the intricacies of data representation and interpretation. While data representation is about the techniques through which data are expressed or encoded in a computer system, data interpretation refers to the computing machines' ability to understand and work with these encoded data.

Basics of Data Representation and Interpretation

The core of data representation and interpretation is founded on the binary system. Represented by 0s and 1s, the binary system signifies the 'off' and 'on' states of electric current, seamlessly translating them into a language comprehensible to computing hardware.

For instance, \[ 1101 \, \text{in binary is equivalent to} \, 13 \, \text{in decimal} \] This interpretation happens consistently in the background during all of your interactions with a computer system.

Now, try imagining a vast array of these binary numbers. It could get overwhelming swiftly. To bring order and efficiency to this chaos, binary digits (or bits) are grouped into larger sets like bytes, kilobytes, and so on. A single byte, the most commonly used set, contains eight bits. Here's a simplified representation of how bits are grouped:

  • 1 bit = Binary Digit
  • 8 bits = 1 byte
  • 1024 bytes = 1 kilobyte (KB)
  • 1024 KB = 1 megabyte (MB)
  • 1024 MB = 1 gigabyte (GB)
  • 1024 GB = 1 terabyte (TB)

However, the binary system isn't the only number system pivotal for data interpretation. Both decimal (base 10) and hexadecimal (base 16) systems play significant roles in processing numbers and text data. Moreover, translating human-readable language into computer interpretable format involves character encodings like ASCII (American Standard Code for Information Interchange) and Unicode.

These systems interpret alphabetic characters, numerals, punctuation marks, and other common symbols into binary code. For example, the ASCII value for capital 'A' is 65, which corresponds to \(01000001\) in binary.
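The mapping from a character to its bit pattern can be inspected with a short C++ sketch. It assumes an ASCII-compatible execution character set (true on mainstream platforms), and the helper name char_bits is ours.

```cpp
#include <bitset>
#include <string>

// Returns the 8-bit pattern stored for a character, most significant bit first.
std::string char_bits(char c) {
    return std::bitset<8>(static_cast<unsigned char>(c)).to_string();
}
```

Under that assumption, char_bits('A') yields "01000001", the binary form of 65.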

In the world of images, different encoding schemes interpret pixel data; JPG, PNG, and GIF are common examples of such encoded formats. Similarly, audio files utilise encoding formats like MP3 and WAV to store sound data.

Importance of Data Interpretation in Computer Science

Understanding data interpretation in computer science is integral to unlocking the potential of any computing process or system. When coded data is input into a system, your computer must interpret this data accurately to make it usable.

Consider typing a document in a word processor like Microsoft Word. As you type, each keystroke is translated into a character code such as ASCII. Stored as binary, these codes are passed to the active word processing software. The word processor interprets these codes back into alphabetic characters, enabling the correct letters to appear on your screen, as per your keystrokes.

Data interpretation is not just an isolated occurrence, but a recurring necessity - needed every time a computing process must deal with data. This is no different when you're watching a video, browsing a website, or even when the computer boots up.

Rendering images and videos is an ideal illustration of the importance of data interpretation.

Digital photos and videos are composed of tiny dots, or pixels, each encoded with specific numbers to denote colour composition and intensity. Every time you view a photo or play a video, your computer interprets the underlying data and reassembles the pixels to form a comprehensible image or video sequence on your screen.

Data interpretation further extends to more complex territories like facial recognition, bioinformatics, data mining, and even artificial intelligence. In these applications, data from various sources is collected, converted into machine-acceptable format, processed, and interpreted to provide meaningful outputs.

In summary, data interpretation is vital for the functionality, efficiency, and progress of computer systems and the services they provide. Understanding the basics of data representation and interpretation, thereby, forms the backbone of computer science studies.

Delving into Binary Data Representation

Binary data representation is the most fundamental and elementary form of data representation in computing systems. At the lowermost level, every piece of information processed by a computer is converted into a binary format.

Understanding Binary Data Representation

Binary data representation is based on the binary numeral system. This system, also known as the base-2 system, uses only two digits, 0 and 1, to represent all kinds of data. The concept dates back to early 18th-century mathematics and has since found its place as the bedrock of modern computers. In computing, the binary system's digits are called bits (short for 'binary digit'), and they are the smallest unit of data.

Each bit can be in one of two states representing 0 ('off') or 1 ('on'). Formally, the binary number \( b_n b_{n-1} ... b_2 b_1 b_0 \), is interpreted using the formula: \[ B = b_n \times 2^n + b_{n-1} \times 2^{n-1} + ... + b_2 \times 2^2 + b_1 \times 2^1 + b_0 \times 2^0 \] Where \( b_i \) are the binary digits and \( B \) is the corresponding decimal number.

For example, for the binary number 1011, the process will look like this: \[ B = 1 \times 2^3 + 0 \times 2^2 + 1 \times 2^1 + 1 \times 2^0 = 8 + 0 + 2 + 1 = 11 \]
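The positional formula can be turned into a few lines of C++. This sketch reads the digits from left to right and doubles the running total at each step, which gives the same result as summing \( b_i \times 2^i \); the function name is an assumption, and any character other than '1' is treated as 0.

```cpp
#include <cstdint>
#include <string>

// Converts a string of '0' and '1' characters (most significant bit first)
// to its numeric value. Inputs longer than 64 bits would overflow.
std::uint64_t binary_to_decimal(const std::string& bits) {
    std::uint64_t value = 0;
    for (char c : bits) {
        value = value * 2 + (c == '1' ? 1 : 0);
    }
    return value;
}
```

For "1011" the running total goes 1, 2, 5, 11, matching the hand calculation.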

This mathematical translation makes it possible for computing machines to perform complex operations even though they understand only the simple language of 'on' and 'off' signals.

When representing character data, computing systems use binary-encoded formats. ASCII and Unicode are common examples. In ASCII, each character is assigned a unique 7-bit binary code. For example, the 7-bit code for the uppercase letter 'A' is 1000001 (decimal 65), which is usually stored in an 8-bit byte as 01000001. Interpreting such encoded data back to a human-readable format is a core responsibility of computing systems and forms the basis for the exchange of digital information globally.

Practical Application of Binary Data Representation

Binary data representation is used across every single aspect of digital computing. From simple calculations performed by a digital calculator to the complex animations rendered in a high-definition video game, binary data representation is at play in the background.

Consider a simple calculation like 7+5. When you input this into a digital calculator, the numbers and the operation get converted into their binary equivalents. The microcontroller inside the calculator processes these binary inputs, performs the sum operation in binary, and finally, returns the result as a binary output. This binary output is then converted back into a decimal number which you see displayed on the calculator screen.
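As a rough illustration of adding in binary (a sketch of schoolbook carry addition on bit strings, not a description of any particular calculator's circuitry), 7 + 5 becomes 111 + 101 = 1100. The function name add_binary is ours.

```cpp
#include <algorithm>
#include <string>

// Adds two binary numbers given as strings of '0'/'1' (most significant bit first),
// working from the rightmost digit and carrying into the next column.
std::string add_binary(const std::string& a, const std::string& b) {
    std::string result;
    int carry = 0;
    auto i = a.rbegin();
    auto j = b.rbegin();
    while (i != a.rend() || j != b.rend() || carry != 0) {
        int sum = carry;
        if (i != a.rend()) sum += *i++ - '0';
        if (j != b.rend()) sum += *j++ - '0';
        result.push_back(static_cast<char>('0' + (sum & 1)));  // digit for this column
        carry = sum >> 1;                                      // carry to the next column
    }
    std::reverse(result.begin(), result.end());
    return result;
}
```

The result "1100" is 8 + 4 = 12 in decimal, which is what the display would show.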

When it comes to text files, every character typed into the document is converted to its binary equivalent using a character encoding system, typically ASCII or Unicode. It is then saved onto your storage device as a sequence of binary digits.

Similarly, for image files, every pixel is represented as a binary number that specifies its colour and intensity. The resulting grid of pixel values is called a 'bitmap'. When you open the image file, the computer reads the binary data and presents it on your screen as a colourful, coherent image. The concept extends even further into the internet and network communications, data encryption, data compression, and more.

When you are downloading a file over the internet, it is sent to your system as a stream of binary data. The web browser on your system receives this data, recognizes the type of file and accordingly interprets the binary data back into the intended format.

In essence, every operation that you can perform on a computer system, no matter how simple or complex, essentially boils down to large-scale manipulation of binary data. And that sums up the practical application and universal significance of binary data representation in digital computing.

Binary Tree Representation in Data Structures

Binary trees occupy a central position in data structures, especially in algorithms and database designs. As a non-linear data structure, a binary tree is essentially a tree-like model where each node has a maximum of two children, often distinguished as 'left child' and 'right child'.

Fundamentals of Binary Tree Representation

A binary tree is a tree data structure where each parent node has no more than two children, typically referred to as the left child and the right child. Each node in the binary tree contains:

  • A data element
  • Pointer or link to the left child
  • Pointer or link to the right child

The topmost node of the tree is known as the root. The nodes without any children, usually dwelling at the tree's last level, are known as leaf nodes or external nodes. Binary trees are fundamentally differentiated by their properties and the relationships among the elements. Some types include:

  • Full Binary Tree: A binary tree where every node has 0 or 2 children.
  • Complete Binary Tree: A binary tree where all levels are completely filled except possibly the last level, which is filled from left to right.
  • Perfect Binary Tree: A binary tree where all internal nodes have two children and all leaves are at the same level.
  • Skewed Binary Tree: A binary tree where every node has only left child or only right child.

In a binary tree, the maximum number of nodes \( N \) at any level \( L \) (counting the root as level 1) can be calculated using the formula \( N = 2^{L-1} \). Conversely, a tree with \( N \) nodes needs at least \( \lceil \log_2(N+1) \rceil \) levels; the maximum number of levels is \( N \), which is reached by a skewed tree.
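These two quantities are easy to compute with integer arithmetic. In the sketch below (function names are ours), \( \lceil \log_2(N+1) \rceil \) is evaluated as the number of bits needed to write \( N \) in binary; it is the smallest number of levels that can hold \( N \) nodes.

```cpp
#include <cstdint>

// Maximum number of nodes on level L, counting the root as level 1: 2^(L-1).
// Valid for levels 1 through 64.
std::uint64_t max_nodes_at_level(unsigned level) {
    return std::uint64_t{1} << (level - 1);
}

// Fewest levels that can hold n nodes: ceil(log2(n + 1)),
// computed as the number of binary digits of n.
unsigned min_levels(std::uint64_t n) {
    unsigned levels = 0;
    while (n > 0) {
        ++levels;
        n >>= 1;
    }
    return levels;
}
```

For instance, 7 nodes fit in 3 levels (1 + 2 + 4), while an eighth node forces a fourth level.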

Binary tree representation employs arrays and linked lists. Sometimes, an implicit array-based representation suffices, especially for complete binary trees. The root is stored at index 0, while for each node at index \( i \), the left child is stored at index \( 2i + 1 \), and the right child at \( 2i + 2 \).
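The index arithmetic can be sketched as follows, with a complete binary tree stored level by level in a std::vector. The helper names and the in-order traversal are illustrative assumptions.

```cpp
#include <cstddef>
#include <vector>

// Index arithmetic for a binary tree stored in an array with the root at index 0.
std::size_t left_child(std::size_t i)  { return 2 * i + 1; }
std::size_t right_child(std::size_t i) { return 2 * i + 2; }
std::size_t parent(std::size_t i)      { return (i - 1) / 2; }  // requires i > 0

// In-order traversal (left subtree, node, right subtree) of the array-backed tree.
// Indices past the end of the vector are treated as missing children.
void in_order(const std::vector<int>& tree, std::size_t i, std::vector<int>& out) {
    if (i >= tree.size()) {
        return;
    }
    in_order(tree, left_child(i), out);
    out.push_back(tree[i]);
    in_order(tree, right_child(i), out);
}
```

With the array {4, 2, 6, 1, 3, 5, 7}, node 4 is the root, 2 and 6 are its children, and an in-order traversal visits 1 through 7 in order. No pointers are stored at all.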

However, the most common representation is the linked-node representation that utilises a node-based structure. Each node in the binary tree is a data structure that contains a data field and two pointers pointing to its left and right child nodes.

Usage of Binary Tree in Data Structures

Binary trees are typically used for expressing hierarchical relationships, and thus find application across various areas in computer science. In mathematical applications, binary trees are ideal for expressing certain elements' relationships.

For example, binary trees are used to represent expressions in arithmetic and Boolean algebra.

Consider an arithmetic expression like (4 + 5) * 6. This can be represented using a binary tree where the operators are parent nodes, and the operands are children. The expression gets evaluated by performing operations in a specific tree traversal order.
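A minimal sketch of such an expression tree, using a linked-node representation, is given below. The Node layout and the helper names leaf, binary and evaluate are assumptions for illustration; evaluation is a post-order traversal, since both children are computed before the operator is applied.

```cpp
#include <memory>
#include <utility>

// A node is either a leaf holding a number (op == 0) or an operator with two children.
struct Node {
    char op = 0;
    int value = 0;
    std::unique_ptr<Node> left;
    std::unique_ptr<Node> right;
};

std::unique_ptr<Node> leaf(int v) {
    auto n = std::make_unique<Node>();
    n->value = v;
    return n;
}

std::unique_ptr<Node> binary(char op, std::unique_ptr<Node> l, std::unique_ptr<Node> r) {
    auto n = std::make_unique<Node>();
    n->op = op;
    n->left = std::move(l);
    n->right = std::move(r);
    return n;
}

// Post-order evaluation: evaluate both subtrees, then combine them.
// Supports '+', '-', '*' and '/' (integer division); other operators yield 0.
int evaluate(const Node& n) {
    if (n.op == 0) {
        return n.value;
    }
    int l = evaluate(*n.left);
    int r = evaluate(*n.right);
    switch (n.op) {
        case '+': return l + r;
        case '-': return l - r;
        case '*': return l * r;
        case '/': return l / r;
        default:  return 0;
    }
}
```

Building binary('*', binary('+', leaf(4), leaf(5)), leaf(6)) and evaluating it gives 54; the parentheses of the written expression are encoded by the shape of the tree.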

Among the more complex usages, binary search trees, a variant of binary trees, are employed in database engines and file systems.

  • Binary Heaps, a type of binary tree, are used as an efficient priority queue in many algorithms like Dijkstra's algorithm and the Heap Sort algorithm.
  • Binary trees are also used in creating binary space partition trees, which are used for quickly finding objects in games and 3D computer graphics.
  • Syntax trees used in compilers are a direct application of binary trees. They help translate high-level language expressions into machine code.
  • Huffman Coding Trees, which are used in data compression algorithms, are another variant of binary trees.

The theoretical underpinnings of all these binary tree applications are the traversal methods and operations, such as insertion and deletion, which are intrinsic to the data structure.

Binary trees are also used in machine-learning algorithms. A decision tree uses a tree-like model of decisions, and many implementations build it as a binary tree in which each internal node tests a single condition. It is one of the most widely used supervised learning methods in data mining and machine learning.

The advantages of binary trees lie in their efficient organisation and quick data access, making them a cornerstone of many complex data structures and algorithms. Understanding the workings and fundamentals of binary tree representation will give you a stronger footing in the world of data structures and computer science in general.

Grasping Data Model Representation

When dealing with vast amounts of data, organising and understanding the relationships between different pieces of data is of utmost importance. This is where data model representation comes into play in computer science. A data model provides an abstract, simplified view of real-world data. It defines the data elements and the relationships among them, providing an organised and consistent representation of data.

Exploring Different Types of Data Models

Understanding the intricacies of data models will equip you with a solid foundation in making sense of complex data relationships. Some of the most commonly used data models include:

  • Hierarchical Model
  • Network Model
  • Relational Model
  • Entity-Relationship Model
  • Object-Oriented Model
  • Semantic Model

The Hierarchical Model presents data in a tree-like structure, where each record has one parent record and many children. This model is largely applied in file systems and XML documents. The limitations are that this model does not allow a child to have multiple parents, thus limiting its real-world applications.

The Network Model, an enhancement of the hierarchical model, allows a child node to have multiple parent nodes, resulting in a graph structure. This model is suitable for representing complex relationships but comes with its own challenges such as iteration and navigation, which can be intricate.

The Relational Model, created by E.F. Codd, uses a tabular structure to depict data and their relationships. Each row represents a collection of related data values, and each column represents a particular attribute. This is the most widely used model due to its simplicity and flexibility.

The Entity-Relationship Model illustrates the conceptual view of a database. It uses three basic concepts: Entities, Attributes (the properties of these entities), and Relationships among entities. This model is most commonly used in database design.

The Object-Oriented Model goes a step further and adds methods (functions) to the entities besides attributes. This data model integrates the data and the operations applicable to the data into a single component known as an object. Such an approach enables encapsulation, a significant characteristic of object-oriented programming.

The Semantic Model aims to capture more meaning of data by defining the nature of data and the relationships that exist between them. This model is beneficial in representing complex data interrelations and is used in expert systems and artificial intelligence fields.

The Role of Data Models in Data Representation

Data models provide a method for the efficient representation and interaction of data elements, thus forming an integral part of any database system. They provide the theoretical foundation for designing databases, thereby playing an essential role in the development of applications.

A data model is a set of concepts and rules for formally describing and representing real-world data. It serves as a blueprint for designing and implementing databases and assists communication between system developers and end-users.

Databases serve as vast repositories, storing a plethora of data. Such vast data needs effective organisation and management for optimal access and usage. Here, data models come into play, providing a structural view of data, thereby enabling the efficient organisation, storage and retrieval of data.

Consider a library system. The system needs to record data about books, authors, publishers, members, and loans. All these items represent different entities. Relationships exist between these entities. For example, a book is published by a publisher, an author writes a book, or a member borrows a book. Using an Entity-Relationship Model, we can effectively represent all these entities and relationships, aiding the library system's development process.
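As a rough illustration, the library entities and relationships above could be mapped onto relational tables. The sketch below uses Python's built-in sqlite3 module. The table names, column names and sample rows are assumptions made for this example only, and authors are left out to keep it short.

```python
import sqlite3


def build_library_db():
    """Create an in-memory database with a few hypothetical library tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE publisher (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE book (
            id INTEGER PRIMARY KEY,
            title TEXT NOT NULL,
            publisher_id INTEGER REFERENCES publisher(id)  -- book "is published by"
        );
        CREATE TABLE member (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE loan (                                -- member "borrows" book
            member_id INTEGER REFERENCES member(id),
            book_id INTEGER REFERENCES book(id)
        );
        """
    )
    # Made-up sample data, for illustration only.
    conn.execute("INSERT INTO publisher VALUES (1, 'Example Press')")
    conn.execute("INSERT INTO book VALUES (1, 'Data Models 101', 1)")
    conn.execute("INSERT INTO book VALUES (2, 'Binary Basics', 1)")
    conn.execute("INSERT INTO member VALUES (1, 'Asha')")
    conn.execute("INSERT INTO loan VALUES (1, 1)")
    conn.commit()
    return conn


def books_borrowed_by(conn, member_name):
    """Follow the member -> loan -> book relationships with a join."""
    rows = conn.execute(
        """
        SELECT book.title
        FROM loan
        JOIN member ON member.id = loan.member_id
        JOIN book ON book.id = loan.book_id
        WHERE member.name = ?
        ORDER BY book.title
        """,
        (member_name,),
    ).fetchall()
    return [title for (title,) in rows]
```

Each entity becomes a table, each attribute a column, and the "borrows" relationship becomes its own table linking members to books.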

Designing such a model requires careful consideration of what data is required to be stored and how different data elements relate to each other. Depending on their specific requirements, database developers can select the most suitable data model representation. This choice can significantly affect the functionality, performance, and scalability of the resulting databases.

From decision-support systems and expert systems to distributed databases and data warehouses, data models find a place in various applications.

Modern NoSQL databases often use several models simultaneously to meet their needs. For example, a document-based model for unstructured data and a column-based model for analyzing large data sets. In this way, data models continue to evolve and adapt to the burgeoning needs of the digital world.

Therefore, acquiring a strong understanding of data model representations and their roles forms an integral part of the database management and design process. It empowers you with the ability to handle large volumes of diverse data efficiently and effectively.

Data Representation - Key takeaways

  • Data representation refers to techniques used to express information in computer systems, encompassing text, numbers, images, audio, and more.
  • Data Representation is about how computers interpret and function with different information types, including binary systems, bits and bytes, number systems (decimal, hexadecimal) and character encoding (ASCII, Unicode).
  • Binary Data Representation is the conversion of all kinds of information processed by a computer into binary format.
  • Express hierarchical relationships across various areas in computer science.
  • Represent relationships in mathematical applications, used in database engines, file systems, and priority queues in algorithms.
  • Data Model Representation is an abstract, simplified view of real-world data that defines the data elements, and their relationships and provides a consistently organised way of representing data.

Frequently Asked Questions about Data Representation in Computer Science

--> What is data representation?

Data representation is the method used to encode information into a format that can be used and understood by computer systems. It involves the conversion of real-world data, such as text, images, sounds, numbers, into forms like binary or hexadecimal which computers can process. The choice of representation can affect the quality, accuracy and efficiency of data processing. Precisely, it's how computer systems interpret and manipulate data.

--> What does data representation mean?

Data representation refers to the methods or techniques used to express, display or encode data in a readable format for a computer or a user. This could be in different forms such as binary, decimal, or alphabetic forms. It's crucial in computer science since it links the abstract world of thought and concept to the concrete domain of signals, signs and symbols. It forms the basis of information processing and storage in contemporary digital computing systems.

--> Why is data representation important?

Data representation is crucial as it allows information to be processed, transferred, and interpreted in a meaningful way. It helps in organising and analysing data effectively, providing insights for decision-making processes. Moreover, it facilitates communication between the computer system and the real world, enabling computing outcomes to be understood by users. Finally, accurate data representation ensures integrity and reliability of the data, which is vital for effective problem solving.

--> How to make a graphical representation of data?

To create a graphical representation of data, first collect and organise your data. Choose a suitable form of data representation such as bar graphs, pie charts, line graphs, or histograms depending on the type of data and the information you want to display. Use a data visualisation tool or software such as Excel or Tableau to help you generate the graph. Always remember to label your axes and provide a title and legend if necessary.

--> What is data representation in statistics?

Data representation in statistics refers to the various methods used to display or present data in meaningful ways. This often includes the use of graphs, charts, tables, histograms or other visual tools that can help in the interpretation and analysis of data. It enables efficient communication of information and helps in drawing statistical conclusions. Essentially, it's a way of providing a visual context to complex datasets, making the data easily understandable.

Test your knowledge with multiple choice flashcards

What is data representation in computer science?

Data representation in computer science refers to the methods used to express information in a computer system. It's how a computer interprets and functions with different information types, ranging from text and numbers to images, audio, and beyond.

What are some of the fundamental concepts to understand when dealing with data representation?

When dealing with data representation, one should understand the binary system, bits and bytes, number systems like decimal and hexadecimal, and character encoding such as ASCII and Unicode.

Why is data representation crucial in computer science?

Data representation forms the foundation of computer systems and affects hardware and software designs. It enables logic and arithmetic operations to be performed in the binary number system, and is integral to computer programming languages and compilers.

What is the relationship between data representation and the binary system in computer systems?

The core of data representation in computer systems is based on the binary system, which uses 0s and 1s, representing 'off' and 'on' states of electric current. These translate into a language that computer hardware can understand.

What are the larger sets into which binary digits or bits are grouped for efficiency and order?

Binary digits or bits are grouped into larger sets like bytes, kilobytes, megabytes, gigabytes and terabytes. For instance, 8 bits make up a byte, and 1024 bytes make up a kilobyte.

How does data interpretation contribute to the functionality of computer systems and services?

Data interpretation is vital as it allows coded data to be accurately translated into a usable format for any computer process or system. It is a recurring necessity whenever a computing process has to deal with data.


Table of contents.

  • Introduction to Functional Computer
  • Fundamentals of Architectural Design

Data Representation

  • Instruction Set Architecture : Instructions and Formats
  • Instruction Set Architecture : Design Models
  • Instruction Set Architecture : Addressing Modes
  • Performance Measurements and Issues
  • Computer Architecture Assessment 1
  • Fixed Point Arithmetic : Addition and Subtraction
  • Fixed Point Arithmetic : Multiplication
  • Fixed Point Arithmetic : Division
  • Floating Point Arithmetic
  • Arithmetic Logic Unit Design
  • CPU's Data Path
  • CPU's Control Unit
  • Control Unit Design
  • Concepts of Pipelining
  • Computer Architecture Assessment 2
  • Pipeline Hazards
  • Memory Characteristics and Organization
  • Cache Memory
  • Virtual Memory
  • I/O Communication and I/O Controller
  • Input/Output Data Transfer
  • Direct Memory Access controller and I/O Processor
  • CPU Interrupts and Interrupt Handling
  • Computer Architecture Assessment 3

Course Computer Architecture

Digital computers store and process information in binary form, because digital logic has only two values: "1" and "0", also described as "True or False" or "ON or OFF". This system is called radix 2. We humans generally deal with radix 10, i.e. decimal. As a matter of convenience, there are other representations such as Octal (radix 8), Hexadecimal (radix 16) and Binary Coded Decimal (BCD).

Every computer's CPU has a width measured in bits, such as an 8-bit CPU, 16-bit CPU, 32-bit CPU, etc. Similarly, each memory location can store a fixed number of bits, which is called the memory width. Given the size of the CPU and memory, it is for the programmer to handle the data representation. As most readers will know, 4 bits form a nibble and 8 bits form a byte. The word length is defined by the Instruction Set Architecture of the CPU and may be equal to the width of the CPU.

The memory simply stores information as a binary pattern of 1's and 0's. What the content of a memory location means is a matter of interpretation. If the CPU is in the Fetch cycle, it interprets the fetched memory content as an instruction and decodes it based on the instruction format. In the Execute cycle, the information from memory is treated as data. As ordinary computer users, we think computers handle English or other alphabets, special characters or numbers. A programmer considers memory content to be the data types of the programming language he uses. Now recall figures 1.2 and 1.3 of chapter 1 to reinforce the idea that conversion happens between the computer user interface and the internal representation and storage.

  • Data Representation in Computers

Information handled by a computer is classified as instruction and data. A broad overview of the internal representation of the information is illustrated in figure 3.1. Whether the data is numeric or non-numeric, everything is internally represented in binary. It is up to the programmer to handle the interpretation of the binary pattern, and this interpretation is called Data Representation. These data representation schemes are all standardized by international organizations.

Choice of Data representation to be used in a computer is decided by

  • The number types to be represented (integer, real, signed, unsigned, etc.)
  • Range of values likely to be represented (maximum and minimum to be represented)
  • The Precision of the numbers i.e. maximum accuracy of representation (floating point single precision, double precision etc)
  • If non-numeric i.e. character, character representation standard to be chosen. ASCII, EBCDIC, UTF are examples of character representation standards.
  • The hardware support in terms of word width, instruction.

Before we go into the details, let us take an example of interpretation. Say a byte in memory has the value "0011 0001". Although many interpretations are possible, as in figure 3.2, the program has only one interpretation, as decided by the programmer and declared in the program.
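To make the idea concrete, here is a minimal Python sketch that reads the same byte in a few different ways. The three interpretations chosen here (unsigned integer, ASCII character, packed BCD) are assumptions picked for illustration and are not necessarily the ones shown in figure 3.2.

```python
def interpretations(byte_value):
    """Return several possible readings of one 8-bit pattern."""
    bits = format(byte_value, "08b")
    return {
        "bits": bits[:4] + " " + bits[4:],
        # Read as a plain unsigned binary number.
        "unsigned integer": byte_value,
        # Read as an ASCII character code.
        "ASCII character": chr(byte_value),
        # Read as two decimal digits, one per nibble (valid only if both nibbles are 0-9).
        "packed BCD": (byte_value >> 4) * 10 + (byte_value & 0x0F),
    }


result = interpretations(0b00110001)
```

For the pattern 0011 0001 this gives the unsigned integer 49, the character "1" and the BCD value 31: the same bits, three different meanings.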

  • Fixed point Number Representation

Fixed point numbers are also known as whole numbers or Integers. The number of bits used in representing the integer also implies the maximum number that can be represented in the system hardware. However for the efficiency of storage and operations, one may choose to represent the integer with one Byte, two Bytes, Four bytes or more. This space allocation is translated from the definition used by the programmer while defining a variable as integer short or long and the Instruction Set Architecture.

In addition to the bit length definition for integers, we also have a choice to represent them as below:

  • Unsigned Integer : Zero and positive numbers can be represented in this format. All the allotted bits are used to define the number. So if 8 bits are used to represent the unsigned integer, 2^8 = 256 distinct values can be represented, i.e. "0" to "255". If 16 bits are used, there are 2^16 = 65536 values, i.e. "0" to "65535".
  • Signed Integer : In this format negative numbers, zero, and positive numbers can be represented. A sign bit indicates whether the magnitude is positive or negative. There are three possible representations for signed integers: Sign Magnitude format, 1's Complement format and 2's Complement format.
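The unsigned ranges quoted above follow directly from the bit count. A small sketch (the function name is an assumption for this example):

```python
def unsigned_range(n_bits):
    """Smallest and largest values of an n-bit unsigned integer."""
    # n bits give 2**n distinct patterns, numbered 0 .. 2**n - 1.
    return 0, 2 ** n_bits - 1


eight_bit = unsigned_range(8)
sixteen_bit = unsigned_range(16)
```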

Signed Integer – Sign Magnitude format: Most Significant Bit (MSB) is reserved for indicating the direction of the magnitude (value). A "0" on MSB means a positive number and a "1" on MSB means a negative number. If n bits are used for representation, n-1 bits indicate the absolute value of the number.

Examples for n=8:

0010 1111 = + 47 Decimal (Positive number)

1010 1111 = - 47 Decimal (Negative Number)

0111 1110 = +126 (Positive number)

1111 1110 = -126 (Negative Number)

0000 0000 = + 0 (Positive Number)

1000 0000 = - 0 (Negative Number)

Although this method is easy to understand, Sign Magnitude representation has several shortcomings like

  • Zero can be represented in two ways causing redundancy and confusion.
  • With n bits, only n-1 bits carry the magnitude, so the largest magnitude is 2^(n-1) - 1 rather than 2^n - 1.
  • The separate sign bit makes the addition and subtraction more complicated. Also, comparing two numbers is not straightforward.
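The sign-magnitude examples above can be reproduced with a short sketch. The function names are assumptions, and the bit patterns are handled as strings purely for readability.

```python
def to_sign_magnitude(value, n_bits=8):
    """Encode an integer as an n-bit sign-magnitude bit string."""
    magnitude = abs(value)
    if magnitude > 2 ** (n_bits - 1) - 1:
        raise ValueError("magnitude does not fit in n-1 bits")
    sign = "1" if value < 0 else "0"          # MSB is the sign bit
    return sign + format(magnitude, f"0{n_bits - 1}b")


def from_sign_magnitude(bits):
    """Decode a sign-magnitude bit string back to an integer."""
    magnitude = int(bits[1:], 2)
    return -magnitude if bits[0] == "1" else magnitude
```

Decoding both "0000 0000" and "1000 0000" (written without the space) yields 0, which is the redundancy mentioned in the first shortcoming.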

Signed Integer – 1’s Complement format: In this format too, the MSB is reserved as the sign bit. The difference is that, for negative numbers, the bits of the magnitude are inverted, hence the name 1’s Complement form. Positive numbers are represented as they are in binary. Let us see some examples to better our understanding.

1101 0000 = - 47 Decimal (Negative Number)

1000 0001 = -126 (Negative Number)

1111 1111 = - 0 (Negative Number)
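A minimal sketch of the 1's complement rule, assuming 8-bit values and hypothetical function names: a negative number is encoded by inverting every bit of the corresponding positive pattern.

```python
def to_ones_complement(value, n_bits=8):
    """Encode an integer as an n-bit 1's complement bit string."""
    if abs(value) > 2 ** (n_bits - 1) - 1:
        raise ValueError("value does not fit")
    mask = 2 ** n_bits - 1
    # Negative numbers: invert all bits of the positive pattern.
    pattern = value if value >= 0 else ~(-value) & mask
    return format(pattern, f"0{n_bits}b")


def from_ones_complement(bits):
    """Decode an n-bit 1's complement bit string."""
    mask = 2 ** len(bits) - 1
    pattern = int(bits, 2)
    if bits[0] == "0":
        return pattern
    # MSB set: invert to recover the magnitude, then negate.
    return -(~pattern & mask)
```

Note that Python has no negative zero for integers, so the pattern 1111 1111 simply decodes to 0; the sketch shows that two different patterns map to the same value.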

  • Converting a given binary number to its 2's complement form

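Step 1. -x = x' + 1, where x' is the one's complement of x.

Step 2. To extend the data width of the number, fill the added bits by sign extension, i.e. the MSB is copied into the new high-order bits.

Example: -47 decimal in an 8-bit representation. +47 is 0010 1111; its one's complement is 1101 0000; adding 1 gives 1101 0001, which is -47 in 2's complement form.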

In 2’s complement form, zero is not represented redundantly: there is only one way of representing zero. The other problem, the complexity of arithmetic operations, is also eliminated in 2’s complement representation, because subtraction is done as addition.
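The two steps and the "subtraction as addition" property can be sketched as follows. This is an illustrative sketch with assumed function names, working on bit strings rather than modelling any particular hardware.

```python
def to_twos_complement(value, n_bits=8):
    """Encode an integer as an n-bit 2's complement bit string."""
    if not -(2 ** (n_bits - 1)) <= value <= 2 ** (n_bits - 1) - 1:
        raise ValueError("value does not fit")
    return format(value & (2 ** n_bits - 1), f"0{n_bits}b")


def negate(bits):
    """Step 1: -x = x' + 1 (invert every bit, then add one)."""
    mask = 2 ** len(bits) - 1
    inverted = ~int(bits, 2) & mask
    return format((inverted + 1) & mask, f"0{len(bits)}b")


def sign_extend(bits, new_width):
    """Step 2: widen the number by copying the MSB into the new bits."""
    return bits[0] * (new_width - len(bits)) + bits


def subtract(a_bits, b_bits):
    """Compute a - b as a + (-b), discarding any carry out of the MSB."""
    mask = 2 ** len(a_bits) - 1
    total = (int(a_bits, 2) + int(negate(b_bits), 2)) & mask
    return format(total, f"0{len(a_bits)}b")
```

Negating 0000 0000 gives 0000 0000 again, so there is only one zero, and 50 - 47 is obtained by adding the pattern for -47 to the pattern for 50.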

More exercises on number conversion are left to the self-interest of readers.

  • Floating Point Number system

The largest whole number that can be represented with n bits is 2^n - 1. In the scientific world, we come across numbers such as the mass of an electron, 9.10939 x 10^-31 kg, and the velocity of light, 2.99792458 x 10^8 m/s. Imagine writing such a number on a piece of paper without an exponent and converting it into binary for computer representation. Sure you are tired!! It makes no sense to write a number in a non-readable or non-processible form. Hence we write such large or small numbers using an exponent and a mantissa. This is called Floating Point representation or real number representation. The real number system has infinitely many values between 0 and 1.

Representation in computer

Unlike the two's complement representation for integer numbers, a Floating Point number uses Sign and Magnitude representation for the mantissa, while the exponent is commonly stored in a biased form, as in IEEE 754. In the number 9.10939 x 10^31, in decimal form, +31 is the Exponent and 9.10939 is known as the Fraction. Mantissa, Significand and Fraction are synonymously used terms. In the computer, the representation is binary and the binary point is not fixed. For example, a number, say, 23.345 can be written as 2.3345 x 10^1 or 0.23345 x 10^2 or 2334.5 x 10^-2. The representation 2.3345 x 10^1 is said to be in normalised form.

Floating-point numbers usually use multiple words in memory, as we need to allot a sign bit, a few bits for the exponent and many bits for the mantissa. There are standards for such allocation, which we will see shortly.

  • IEEE 754 Floating Point Representation

IEEE 754 defines two widely used formats known as Single Precision and Double Precision. These standards enable portability among different computers. Figure 3.3 shows Single Precision while figure 3.4 shows Double Precision. Single Precision uses a 32-bit format (1 sign bit, 8 exponent bits, 23 mantissa bits) while Double Precision has a 64-bit word length (1 sign bit, 11 exponent bits, 52 mantissa bits). As the name suggests, Double Precision can represent fractions with greater accuracy. In both cases, the MSB is the sign bit for the mantissa, followed by the exponent and then the mantissa. The exponent does not have a separate sign bit; it is stored in biased form, i.e. the stored value is the true exponent plus a bias of 127 in Single Precision and 1023 in Double Precision.

It is to be noted that in Single Precision, normalised numbers can have an exponent in the range -126 to +127. It is possible that, as a result of arithmetic operations, the resulting exponent does not fit. This situation is called overflow when the exponent is too large and underflow when it is too small (too negative). The Double Precision format has 11 bits for the exponent, so exponents from -1022 to +1023 can be represented for normalised numbers. The programmer has to make a choice between Single Precision and Double Precision declaration using his knowledge about the data being handled.
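To see the three fields of a Single Precision number, the following sketch uses Python's standard struct module to obtain the 32-bit pattern and then splits it into sign, exponent and fraction. The function names are assumptions for this example, and the exponent helper is only meaningful for normalised numbers.

```python
import struct


def decode_single(x):
    """Split an IEEE 754 single-precision value into its three bit fields."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an integer.
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    sign = raw >> 31                      # 1 bit
    biased_exponent = (raw >> 23) & 0xFF  # 8 bits, stored with a bias of 127
    fraction = raw & 0x7FFFFF             # 23 bits, the part after the leading "1."
    return sign, biased_exponent, fraction


def true_exponent(biased_exponent):
    """Remove the single-precision bias (normalised numbers only)."""
    return biased_exponent - 127
```

For example, -6.25 is -1.1001 x 2^2 in binary, so the sign is 1, the stored exponent is 2 + 127 = 129 and the fraction bits begin 1001.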

Floating Point operations on a regular CPU without dedicated floating-point hardware are very slow. Generally, a special-purpose processor known as a Co-processor is used. This Co-processor works in tandem with the main CPU. The programmer should use the float declaration only if the data is in real number form. Float declarations are not to be used generously.

  • Decimal Numbers Representation

Decimal numbers (radix 10) are represented and processed in the system with the support of additional hardware. We deal with numbers in decimal format in everyday life. Some machines implement decimal arithmetic in hardware too, just as they do for floating-point arithmetic. In such a case, the CPU uses decimal numbers in BCD (binary coded decimal) form and does BCD arithmetic operations. BCD operates on radix 10. This hardware operates without conversion to pure binary. In packed BCD form, it uses a nibble to represent each decimal digit. BCD operations require not only special hardware but also a decimal instruction set.
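A minimal software sketch of packed BCD, where each nibble holds one decimal digit and so each byte holds two. The function names are assumptions, and real BCD hardware also handles signs and arithmetic, which are not modelled here.

```python
def to_packed_bcd(number):
    """Encode a non-negative integer as packed BCD bytes (two digits per byte)."""
    if number < 0:
        raise ValueError("this sketch handles non-negative numbers only")
    digits = str(number)
    if len(digits) % 2:
        digits = "0" + digits  # pad to an even number of digits
    return bytes(
        (int(digits[i]) << 4) | int(digits[i + 1])
        for i in range(0, len(digits), 2)
    )


def from_packed_bcd(data):
    """Decode packed BCD bytes back to an integer."""
    value = 0
    for byte in data:
        high, low = byte >> 4, byte & 0x0F
        if high > 9 or low > 9:
            raise ValueError("nibble is not a decimal digit")
        value = value * 100 + high * 10 + low
    return value
```

The decimal number 1995 becomes the two bytes 0001 1001 and 1001 0101, which read as "19" and "95" in hexadecimal notation.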

  • Exceptions and Error Detection

All of us know that when we do arithmetic operations, we can get answers which have more digits than the operands (e.g. 8 x 2 = 16). This happens in computer arithmetic operations too. When the result size exceeds the allotted size of the variable or the register, it results in an error or exception. The exception conditions associated with numbers and number operations are Overflow, Underflow, Truncation, Rounding and Multiple Precision. These are detected by the associated hardware in the arithmetic unit. These exceptions apply to both Fixed Point and Floating Point operations. Each of these exceptional conditions has a flag bit assigned in the Processor Status Word (PSW). We may discuss more in detail in the later chapters.
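Overflow detection can be imitated in software for a fixed register width. The sketch below assumes an 8-bit register and hypothetical function names. It returns the stored result together with a flag, loosely mirroring the flag bits in a status word, and is not a model of any specific CPU.

```python
def add_unsigned(a, b, n_bits=8):
    """Add two unsigned values; the flag is set when a carry leaves the register."""
    mask = 2 ** n_bits - 1
    total = a + b
    return total & mask, total > mask


def add_signed(a, b, n_bits=8):
    """Add two 2's complement values; the flag is set on signed overflow."""
    mask = 2 ** n_bits - 1
    raw = (a + b) & mask
    # Reinterpret the stored pattern as a signed number.
    result = raw - 2 ** n_bits if raw & (1 << (n_bits - 1)) else raw
    # Overflow occurred if the stored result differs from the true sum.
    return result, result != a + b
```

Adding 100 + 100 in an 8-bit signed register stores -56 and raises the overflow flag, because the true sum 200 lies outside the range -128 to 127.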

  • Character Representation

Another data type is non-numeric and consists largely of character sets. We use a human-understandable character set to communicate with the computer, i.e. for both input and output. Standard character sets like EBCDIC and ASCII are chosen to represent alphabets, numbers and special characters. Nowadays the Unicode standard is also in use, covering languages such as Chinese, Hindi and Spanish in addition to English. These code tables are available on the internet. Interested readers may access them and learn more.
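The same character can have different byte values under different standards. The sketch below uses Python's built-in codecs. Here "cp037" is one common EBCDIC code page, chosen only as an example, and the function name is an assumption.

```python
def encode_character(ch):
    """Show how one character is numbered and encoded under several standards."""
    encodings = {"unicode code point": ord(ch), "utf-8": ch.encode("utf-8")}
    # ASCII and EBCDIC cover only a limited character set, so encoding may fail.
    for name in ("ascii", "cp037"):
        try:
            encodings[name] = ch.encode(name)
        except UnicodeEncodeError:
            encodings[name] = None
    return encodings


latin_a = encode_character("A")
hindi_a = encode_character("\u0905")  # DEVANAGARI LETTER A
```

The letter "A" is 0x41 in ASCII but 0xC1 in this EBCDIC code page, while the Devanagari letter has no ASCII or EBCDIC encoding and needs three bytes in UTF-8.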


National Academies Press: OpenBook

Computer Science: Reflections on the Field, Reflections from the Field (2004)

Chapter: 5 Data, Representation, and Information

The preceding two chapters address the creation of models that capture phenomena of interest and the abstractions both for data and for computation that reduce these models to forms that can be executed by computer. We turn now to the ways computer scientists deal with information, especially in its static form as data that can be manipulated by programs.

Gray begins by narrating a long line of research on databases—storehouses of related, structured, and durable data. We see here that the objects of research are not data per se but rather designs of “schemas” that allow deliberate inquiry and manipulation. Gray couples this review with introspection about the ways in which database researchers approach these problems.

Databases support storage and retrieval of information by defining—in advance—a complex structure for the data that supports the intended operations. In contrast, Lesk reviews research on retrieving information from documents that are formatted to meet the needs of applications rather than predefined schematized formats.

Interpretation of information is at the heart of what historians do, and Ayers explains how information technology is transforming their paradigms. He proposes that history is essentially model building—constructing explanations based on available information—and suggests that the methods of computer science are influencing this core aspect of historical analysis.

DATABASE SYSTEMS: A TEXTBOOK CASE OF RESEARCH PAYING OFF

Jim Gray, Microsoft Research

A small research investment helped produce U.S. market dominance in the $14 billion database industry. Government and industry funding of a few research projects created the ideas for several generations of products and trained the people who built those products. Continuing research is now creating the ideas and training the people for the next generation of products.

Industry Profile

The database industry generated about $14 billion in revenue in 2002 and is growing at 20 percent per year, even though the overall technology sector is almost static. Among software sectors, the database industry is second only to operating system software. Database industry leaders are all U.S.-based corporations: IBM, Microsoft, and Oracle are the three largest. There are several specialty vendors: Tandem sells over $1 billion/ year of fault-tolerant transaction processing systems, Teradata sells about $1 billion/year of data-mining systems, and companies like Information Resources Associates, Verity, Fulcrum, and others sell specialized data and text-mining software.

In addition to these well-established companies, there is a vibrant group of small companies specializing in application-specific databases—for text retrieval, spatial and geographical data, scientific data, image data, and so on. An emerging group of companies offer XML-oriented databases. Desktop databases are another important market focused on extreme ease of use, small size, and disconnected (offline) operation.

Historical Perspective

Companies began automating their back-office bookkeeping in the 1960s. The COBOL programming language and its record-oriented file model were the workhorses of this effort. Typically, a batch of transactions was applied to the old-tape-master, producing a new-tape-master and printout for the next business day. During that era, there was considerable experimentation with systems to manage an online database that could capture transactions as they happened. At first these systems were ad hoc, but late in that decade network and hierarchical database products emerged. A COBOL subcommittee defined a network data model standard (DBTG) that formed the basis for most systems during the 1970s. Indeed, in 1980 DBTG-based Cullinet was the leading software company.

However, there were some problems with DBTG. DBTG uses a low-level, record-at-a-time procedural language to access information. The programmer has to navigate through the database, following pointers from record to record. If the database is redesigned, as often happens over a decade, then all the old programs have to be rewritten.

The relational data model, enunciated by IBM researcher Ted Codd in a 1970 Communications of the Association for Computing Machinery article, 1 was a major advance over DBTG. The relational model unified data and metadata so that there was only one form of data representation. It defined a non-procedural data access language based on algebra or logic. It was easier for end users to visualize and understand than the pointers-and-records-based DBTG model.

The research community (both industry and university) embraced the relational data model and extended it during the 1970s. Most significantly, researchers showed that a non-procedural language could be compiled to give performance comparable to the best record-oriented database systems. This research produced a generation of systems and people that formed the basis for products from IBM, Ingres, Oracle, Informix, Sybase, and others. The SQL relational database language was standardized by ANSI/ISO between 1982 and 1986. By 1990, virtually all database systems provided an SQL interface (including network, hierarchical, and object-oriented systems).

Meanwhile the database research agenda moved on to geographically distributed databases and to parallel data access. Theoretical work on distributed databases led to prototypes that in turn led to products. Today, all the major database systems offer the ability to distribute and replicate data among nodes of a computer network. Intense research on data replication during the late 1980s and early 1990s gave rise to a second generation of replication products that are now the mainstays of mobile computing.

Research of the 1980s showed how to execute each of the relational data operators in parallel, giving hundred-fold and thousand-fold speedups. The results of this research began to appear in the products of several major database companies. With the proliferation of data mining in the 1990s, huge databases emerged. Interactive access to these databases requires that the system use multiple processors and multiple disks to read all the data in parallel. In addition, these problems require near-linear time search algorithms. University and industrial research of the previous decade had solved these problems and forms the basis of the current VLDB (very large database) data-mining systems.

Rollup and drilldown data reporting systems had been a mainstay of decision-support systems ever since the 1960s. In the middle 1990s, the research community really focused on data-mining algorithms. They invented very efficient data cube and materialized view algorithms that form the basis for the current generation of business intelligence products.

The most recent round of government-sponsored research creating a new industry comes from the National Science Foundation’s Digital Libraries program, which spawned Google. It was founded by a group of “database” graduate students who took a fresh look at how information should be organized and presented in the Internet era.

Current Research Directions

There continues to be active and valuable research on representing and indexing data, adding inference to data search, compiling queries more efficiently, executing queries in parallel, integrating data from heterogeneous data sources, analyzing performance, and extending the transaction model to handle long transactions and workflow (transactions that involve human as well as computer steps). The availability of huge volumes of data on the Internet has prompted the study of data integration, mediation, and federation in which a portal system presents a unification of several data sources by pulling data on demand from different parts of the Internet.

In addition, there is great interest in unifying object-oriented concepts with the relational model. New data types (image, document, and drawing) are best viewed as the methods that implement them rather than by the bytes that represent them. By adding procedures to the database system, one gets active databases, data inference, and data encapsulation. This object-oriented approach is an area of active research and ferment both in academe and industry. It seems that in 2003, the research prototypes are mostly done and this is an area that is rapidly moving into products.

The Internet is full of semi-structured data—data that has a bit of schema and metadata, but is mostly a loose collection of facts. XML has emerged as the standard representation of semi-structured data, but there is no consensus on how such data should be stored, indexed, or searched. There have been intense research efforts to answer these questions. Prototypes have been built at universities and industrial research labs, and now products are in development.

The database research community now has a major focus on stream data processing. Traditionally, databases have been stored locally and are updated by transactions. Sensor networks, financial markets, telephone calls, credit card transactions, and other data sources present streams of data rather than a static database. The stream data processing researchers are exploring languages and algorithms for querying such streams and providing approximate answers.

Now that nearly all information is online, data security and data privacy are extremely serious and important problems. A small, but growing, part of the database community is looking at ways to protect people’s privacy by limiting the ways data is used. This work also has implications for protecting intellectual property (e.g., digital rights management, watermarking) and protecting data integrity by digitally signing documents and then replicating them so that the documents cannot be altered or destroyed.

Case Histories

The U.S. government funded many database research projects from 1972 to the present. Projects at the University of California at Los Angeles gave rise to Teradata and produced many excellent students. Projects at Computer Corp. of America (SDD-1, Daplex, Multibase, and HiPAC) pioneered distributed database technology and object-oriented database technology. Projects at Stanford University fostered deductive database technology, data integration technology, query optimization technology, and the popular Yahoo! and Google Internet sites. Work at Carnegie Mellon University gave rise to general transaction models and ultimately to the Transarc Corporation. There have been many other successes from AT&T, the University of Texas at Austin, Brown and Harvard Universities, the University of Maryland, the University of Michigan, Massachusetts Institute of Technology, Princeton University, and the University of Toronto among others. It is not possible to enumerate all the contributions here, but we highlight three representative research projects that had a major impact on the industry.

Project INGRES

Project Ingres started at the University of California at Berkeley in 1972. Inspired by Codd’s paper on relational databases, several faculty members (Stonebraker, Rowe, Wong, and others) started a project to design and build a relational system. Incidental to this work, they invented a query language (QUEL), relational optimization techniques, a language binding technique, and interesting storage strategies. They also pioneered work on distributed databases.

The Ingres academic system formed the basis for the Ingres product now owned by Computer Associates. Students trained on Ingres went on to start or staff all the major database companies (AT&T, Britton Lee, HP, Informix, IBM, Oracle, Tandem, Sybase). The Ingres project went on to investigate distributed databases, database inference, active databases, and extensible databases. It was rechristened Postgres, which is now the basis of the digital library and scientific database efforts within the University of California system. Recently, Postgres spun off to become the basis for a new object-relational system from the start-up Illustra Information Technologies.

Project System R

Codd’s ideas were inspired by seeing the problems IBM and its customers were having with IBM’s IMS product and the DBTG network data model. His relational model was at first very controversial; people thought that the model was too simplistic and that it could never give good performance. IBM Research management took a gamble and chartered a small (10-person) systems effort to prototype a relational system based on Codd’s ideas. That system produced a prototype that eventually grew into the DB2 product series. Along the way, the IBM team pioneered ideas in query optimization, data independence (views), transactions (logging and locking), and security (the grant-revoke model). In addition, the SQL query language from System R was the basis for the ANSI/ISO standard.

The System R group went on to investigate distributed databases (project R*) and object-oriented extensible databases (project Starburst). These research projects have pioneered new ideas and algorithms. The results appear in IBM’s database products and those of other vendors.

Project Gamma

Not all research ideas work out. During the 1970s there was great enthusiasm for database machines—special-purpose computers that would be much faster than general-purpose operating systems running conventional database systems. These research projects were often based on exotic hardware like bubble memories, head-per-track disks, or associative RAM. The problem was that general-purpose systems were improving at 50 percent per year, so it was difficult for exotic systems to compete with them. By 1980, most researchers realized the futility of special-purpose approaches and the database-machine community switched to research on using arrays of general-purpose processors and disks to process data in parallel.

The University of Wisconsin hosted the major proponents of this idea in the United States. Funded by the government and industry, those researchers prototyped and built a parallel database machine called Gamma. That system produced ideas and a generation of students who went on to staff all the database vendors. Today the parallel systems from IBM, Tandem, Oracle, Informix, Sybase, and Microsoft all have a direct lineage from the Wisconsin research on parallel database systems. The use of parallel database systems for data mining is the fastest-growing component of the database server industry.

The Gamma project evolved into the Exodus project at Wisconsin (focusing on an extensible object-oriented database). Exodus has now evolved to the Paradise system, which combines object-oriented and parallel database techniques to represent, store, and quickly process huge Earth-observing satellite databases.

And Then There Is Science

In addition to creating a huge industry, database theory, science, and engineering constitute a key part of computer science today. Representing knowledge within a computer is one of the central challenges of computer science (Box 5.1). Database research has focused primarily on this fundamental issue. Many universities have faculty investigating these problems and offer classes that teach the concepts developed by this research program.

COMPUTER SCIENCE IS TO INFORMATION AS CHEMISTRY IS TO MATTER

Michael Lesk, Rutgers University

In other countries computer science is often called “informatics” or some similar name. Much computer science research derives from the need to access, process, store, or otherwise exploit some resource of useful information. Just as chemistry is driven to large extent by the need to understand substances, computing is driven by a need to handle data and information. As an example of the way chemistry has developed, see Oliver Sacks’s book Uncle Tungsten: Memories of a Chemical Boyhood (Vintage Books, 2002). He describes his explorations through the different metals, learning the properties of each, and understanding their applications. Similarly, in the history of computer science, our information needs and our information capabilities have driven parts of the research agenda. Information retrieval systems take some kind of information, such as text documents or pictures, and try to retrieve topics or concepts based on words or shapes. Deducing the concept from the bytes can be difficult, and the way we approach the problem depends on what kind of bytes we have and how many of them we have.

Our experimental method is to see if we can build a system that will provide some useful access to information or service. If it works, those algorithms and that kind of data become a new field: look at areas like geographic information systems. If not, people may abandon the area until we see a new motivation to exploit that kind of data. For example, face-recognition algorithms have received a new impetus from security needs, speeding up progress in the last few years. An effective strategy to move computer science forward is to provide some new kind of information and see if we can make it useful.

Chemistry, of course, involves a dichotomy between substances and reactions. Just as we can (and frequently do) think of computer science in terms of algorithms, we can talk about chemistry in terms of reactions. However, chemistry has historically focused on substances: the encyclopedias and indexes in chemistry tend to be organized and focused on compounds, with reaction names and schemes getting less space on the shelf. Chemistry is becoming more balanced as we understand reactions better; computer science has always been more heavily oriented toward algorithms, but we cannot ignore the driving force of new kinds of data.

The history of information retrieval, for example, has been driven by the kinds of information we could store and use. In the 1960s, for example, storage was extremely expensive. Research projects were limited to text materials. Even then, storage costs meant that a research project could just barely manage to have a single ASCII document available for processing. For example, Gerard Salton’s SMART system, one of the leading text retrieval systems for many years (see Salton’s book, The SMART Automatic Retrieval System, Prentice-Hall, 1971), did most of its processing on collections of a few hundred abstracts. The only collections of “full documents” were a collection of 80 extended abstracts, each a page or two long, and a collection of under a thousand stories from Time Magazine, each less than a page in length. The biggest collection was 1400 abstracts in aeronautical engineering. With this data, Salton was able to experiment on the effectiveness of retrieval methods using suffixing, thesauri, and simple phrase finding. Salton also laid down the standard methodology for evaluating retrieval systems, based on Cyril Cleverdon’s measures of “recall” (percentage of the relevant material that is retrieved in response to a query) and “precision” (the percentage of the material retrieved that is relevant). A system with perfect recall finds all relevant material, making no errors of omission and leaving out nothing the user wanted. In contrast, a system with perfect precision finds only relevant material, making no errors of commission and not bothering the user with stuff of no interest. The SMART system produced these measures for many retrieval experiments and its methodology was widely used, making text retrieval one of the earliest areas of computer science with agreed-on evaluation methods. Salton was not able to do anything with image retrieval at the time; there were no such data available for him.
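
The two measures are simple enough to state in a few lines of code. The sketch below follows the definitions given above and is not SMART's implementation. It assumes documents are identified by string IDs and that the set of relevant documents for a query is already known. The function name and the convention of returning 0.0 for an empty set are assumptions.

```python
from typing import Iterable, Tuple


def precision_recall(retrieved: Iterable[str],
                     relevant: Iterable[str]) -> Tuple[float, float]:
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set
    # Assumed convention: report 0.0 when a denominator would be zero.
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall
```

For instance, if a system retrieves `["d1", "d2", "d3", "d4"]` and the relevant documents are `["d2", "d4", "d7"]`, two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents were found (recall two-thirds).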

Another idea shaped by the amount of information available was “relevance feedback,” the idea of identifying useful documents from a first retrieval pass in order to improve the results of a later retrieval. With so few documents, high precision seemed like an unnecessary goal. It was simply not possible to retrieve more material than somebody could look at. Thus, the research focused on high recall (also stimulated by the insistence by some users that they had to have every single relevant document). Relevance feedback helped recall. By contrast, the use of phrase searching to improve precision was tried but never got much attention simply because it did not have the scope to produce much improvement in the running systems.

The basic problem is that we wish to search for concepts, and what we have in natural language are words and phrases. When our documents are few and short, the main problem is not to miss any, and the research at the time stressed algorithms that found related words via associations or improved recall with techniques like relevance feedback.

Then, of course, several other advances (computer typesetting and word processing to generate material, and cheap disks to hold it) led to much larger text collections. Figure 5.1 shows the decline in the price of disk space since the first disks in the mid-1950s, generally following the cost-performance trends of Moore’s law.

FIGURE 5.1 Decline in the price of disk space, 1950 to 2004.

Cheaper storage led to larger and larger text collections online. Now there are many terabytes of data on the Web. These vastly larger volumes mean that precision has now become more important, since a common problem is to wade through vastly too many documents. Not surprisingly, in the mid-1980s efforts started on separating the multiple meanings of words like “bank” or “pine” and became the research area of “sense disambiguation.” 2 With sense disambiguation, it is possible to imagine searching for only one meaning of an ambiguous word, thus avoiding many erroneous retrievals.

Large-scale research on text processing took off with the availability of the TREC (Text REtrieval Conference) data. Thanks to the National Institute of Standards and Technology, several hundred megabytes of text were provided (in each of several years) for research use. This stimulated more work on query analysis, text handling, searching algorithms, and related areas; see the series titled TREC Conference Proceedings, edited by Donna Harman of NIST.

Document clustering appeared as an important way to shorten long search results. Clustering enables a system to report not, say, 5000 documents but rather 10 groups of 500 documents each, and the user can then explore the group or groups that seem relevant. Salton anticipated the future possibility of such algorithms, as did others. 3 Until we got large collections, though, clustering did not find application in the document retrieval world. Now one routinely sees search engines using these techniques, and faster clustering algorithms have been developed.
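
As a rough illustration of how a long result list can be collapsed into groups, the sketch below uses a single-pass "leader" clustering over word overlap. This is a deliberately simple stand-in, not the method Salton proposed or the one any search engine uses. The function names, the Jaccard similarity measure, and the threshold value are all assumptions made for the example.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_results(docs: Dict[str, str], threshold: float = 0.2) -> List[List[str]]:
    """Group documents with a single-pass ("leader") clustering.

    Each document joins the first cluster whose leader (its first member)
    is at least `threshold` similar; otherwise it starts a new cluster.
    """
    leaders: List[Set[str]] = []
    clusters: List[List[str]] = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        for leader, members in zip(leaders, clusters):
            if jaccard(words, leader) >= threshold:
                members.append(doc_id)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            leaders.append(words)
            clusters.append([doc_id])
    return clusters


results = {
    "a": "river bank erosion and flooding",
    "b": "bank loan interest rates",
    "c": "flooding along the river bank",
    "d": "interest rates on a bank loan",
}
groups = cluster_results(results)  # [["a", "c"], ["b", "d"]]
```

The four results collapse into a "river" group and a "finance" group, so the user can pick a group instead of scanning every hit. A production system would need a better similarity measure and a faster algorithm than this one-pass comparison against leaders.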

Thus the algorithms explored switched from recall aids to precision aids as the quantity of available data increased. Manual thesauri, for example, have dropped out of favor for retrieval, partly because of their cost but also because their goal is to increase recall, which is not today’s problem. In terms of finding the concepts hinted at by words and phrases, our goals now are to sharpen rather than broaden these concepts: thus disambiguation and phrase matching, and not as much work on thesauri and term associations.

Again, multilingual searching started to matter, because multilingual collections became available. Multilingual research shows a more precise example of particular information resources driving research. The Canadian government made its Parliamentary proceedings (called Hansard) available in both French and English, with paragraph-by-paragraph translation. This data stimulated a number of projects looking at how to handle bilingual material, including work on automatic alignment of the parallel texts, automatic linking of similar words in the two languages, and so on. 4

A similar effect was seen with the Brown corpus of tagged English text, where the part of speech of each word (e.g., whether a word is a noun or a verb) was identified. This produced a few years of work on algorithms that learned how to assign parts of speech to words in running text based on statistical techniques, such as the work by Garside. 5

One might see an analogy to various new fields of chemistry. The recognition that pesticides like DDT were environmental pollutants led to a new interest in biodegradability, and the Freon propellants used in aerosol cans stimulated research in reactions in the upper atmosphere. New substances stimulated a need to study reactions that previously had not been a top priority for chemistry and chemical engineering.

As storage became cheaper, image storage was now as practical as text storage had been a decade earlier. Starting in the 1980s we saw the IBM QBIC project demonstrating that something could be done to retrieve images directly, without having to index them by text words first. 6 Projects like this were stimulated by the availability of “clip art” such as the COREL image disks. Several different projects were driven by the easy access to images in this way, with technology moving on from color and texture to more accurate shape processing. At Berkeley, for example, the “Blobworld” project made major improvements in shape detection and recognition, as described in Carson et al. 7 These projects demonstrated that retrieval could be done with images as well as with words, and that properties of images could be found that were usable as concepts for searching.

Another new kind of data that became feasible to process was sound, in particular human speech. Here it was the Defense Advanced Research Projects Agency (DARPA) that took the lead, providing the SWITCHBOARD corpus of spoken English. Again, the availability of a substantial file of tagged information helped stimulate many research projects that used this corpus and developed much of the technology that eventually went into the commercial speech recognition products we now have. As with the TREC contests, the competitions run by DARPA based on its spoken language data pushed the industry and the researchers to new advances. National needs created a new technology; one is reminded of the development of synthetic rubber during World War II or the advances in catalysis needed to make explosives during World War I.

Yet another kind of new data was geo-coded data, introducing a new set of conceptual ideas related to place. Geographical data started showing up in machine-readable form during the 1980s, especially with the release of the Dual Independent Map Encoding (DIME) files after the 1980 census and the Topologically Integrated Geographic Encoding and Referencing (TIGER) files from the 1990 census. The availability, free of charge, of a complete U.S. street map stimulated much research on systems to display maps, to give driving directions, and the like. 8 When aerial photographs also became available, there was the triumph of Microsoft’s “Terraserver,” which made it possible to look at a wide swath of the world from the sky along with correlated street and topographic maps. 9

More recently, in the 1990s, we have started to look at video search and retrieval. After all, if a CD-ROM contains about 300,000 times as many bytes per pound as a deck of punched cards, and a digitized video has about 500,000 times as many bytes per second as the ASCII script it comes from, we should be about where we were in the 1960s with video today. And indeed there are a few projects, most notably the Informedia project at Carnegie Mellon University, that experiment with video signals; they do not yet have ways of searching enormous collections, but they are developing algorithms that exploit whatever they can find in the video: scene breaks, closed-captioning, and so on.

Again, there is the problem of deducing concepts from a new kind of information. We started with the problem of words in one language needing to be combined when synonymous, picked apart when ambiguous, and moved on to detecting synonyms across multiple languages and then to concepts depicted in pictures and sounds. Now we see research such as that by Jezekiel Ben-Arie associating words like “run” or “hop” with video images of people doing those actions. In the same way we get again new chemistry when molecules like “buckyballs” are created and stimulate new theoretical and reaction studies.

Defining concepts for search can be extremely difficult. For example, despite our abilities to parse and define every item in a computer language, we have made no progress on retrieval of software; people looking for search or sort routines depend on metadata or comments. Some areas seem more flexible than others: text and naturalistic photograph processing software tends to be very general, while software to handle CAD diagrams and maps tends to be more specific. Algorithms are sometimes portable; both speech processing and image processing need Fourier transforms, but the literature is less connected than one might like (partly because of the difference between one-dimensional and two-dimensional transforms).

There are many other examples of interesting computer science research stimulated by the availability of particular kinds of information. Work on string matching today is often driven by the need to align sequences in either protein or DNA data banks. Work on image analysis is heavily influenced by the need to deal with medical radiographs. And there are many other interesting projects specifically linked to an individual data source. Among examples:

The British Library scanning of the original manuscript of Beowulf in collaboration with the University of Kentucky, working on image enhancement until the result of the scanning is better than reading the original;

The Perseus project, demonstrating the educational applications possible because of the earlier Thesaurus Linguae Graecae project, which digitized all the classical Greek authors;

The work in astronomical analysis stimulated by the Sloan Digital Sky Survey;

The creation of the field of “forensic paleontology” at the University of Texas as a result of doing MRI scans of fossil bones;

And, of course, the enormous amount of work on search engines stimulated by the Web.

When one of these fields takes off, and we find wide usage of some online resource, it benefits society. Every university library gained readers as their catalogs went online and became accessible to students in their dorm rooms. Third World researchers can now access large amounts of technical content their libraries could rarely acquire in the past.

In computer science, and in chemistry, there is a tension between the algorithm/reaction and the data/substance. For example, should one look up an answer or compute it? Once upon a time logarithms were looked up in tables; today we also compute them on demand. Melting points and other physical properties of chemical substances are looked up in tables; perhaps with enough quantum mechanical calculation we could predict them, but it’s impractical for most materials. Predicting tomorrow’s weather might seem a difficult choice. One approach is to measure the current conditions, take some equations that model the atmosphere, and calculate forward a day. Another is to measure the current conditions, look in a big database for the previous day most similar to today, and then take the day after that one as the best prediction for tomorrow. However, so far the meteorologists feel that calculation is better. Another complicated example is chess: given the time pressure of chess tournaments against speed and storage available in computers, chess programs do the opening and the endgame by looking in tables of old data and calculate for the middle game.
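
The look-it-up approach to weather described above can be sketched in a few lines. This is only an illustration of the idea, not a real forecasting method. It assumes each day is a tuple of numeric measurements (here, temperature and pressure), that the history is in chronological order, and that plain squared distance is a reasonable notion of "most similar". The function name and the sample data are hypothetical.

```python
from typing import Sequence


def analog_forecast(history: Sequence[Sequence[float]],
                    today: Sequence[float]) -> Sequence[float]:
    """Predict tomorrow by finding the most similar past day
    and returning the day that followed it."""
    if len(history) < 2:
        raise ValueError("need at least two days of history")

    def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # The last recorded day has no "day after", so it cannot be the analog.
    candidates = range(len(history) - 1)
    best = min(candidates, key=lambda i: squared_distance(history[i], today))
    return history[best + 1]


# Made-up (temperature in C, pressure in hPa) records, oldest first.
past_days = [
    (15.0, 1012.0),
    (17.0, 1008.0),
    (12.0, 1001.0),
    (14.0, 1010.0),
    (18.0, 1015.0),
]
prediction = analog_forecast(past_days, today=(14.8, 1011.5))  # (17.0, 1008.0)
```

Here the closest past day is the first one, so the forecast is whatever followed it. The trade-off is the one the text describes: the lookup needs no model of the atmosphere, but it is only as good as the database and the similarity measure behind it.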

To conclude, a recipe for stimulating advances in computer science is to make some data available and let people experiment with it. With the incredibly cheap disks and scanners available today, this should be easier than ever. Unfortunately, what we gain with technology we are losing to law and economics. Many large databases are protected by copyright; few motion pictures, for example, are old enough to have gone out of copyright. Content owners generally refuse to grant permission for wide use of their material, whether out of greed or fear: they may have figured out how to get rich off their files of information or they may be afraid that somebody else might have. Similarly it is hard to get permission to digitize in-copyright books, no matter how long they have been out of print. Jim Gray once said to me, “May all your problems be technical.” In the 1960s I was paying people to key in aeronautical abstracts. It never occurred to us that we should be asking permission of the journals involved (I think what we did would qualify as fair use, but we didn’t even think about it). Today I could scan such things much more easily, but I would not be able to get permission. Am I better off or worse off?

There are now some 22 million chemical substances in the Chemical Abstracts Service Registry and 7 million reactions. New substances continue to intrigue chemists and cause research on new reactions, with of course enormous interest in biochemistry both for medicine and agriculture. Similarly, we keep adding data to the Web, and new kinds of information (photographs of dolphins, biological flora, and countless other things) can push computer scientists to new algorithms. In both cases, synthesis of specific instances into concepts is a crucial problem. As we see more and more kinds of data, we learn more about how to extract meaning from it, and how to present it, and we develop a need for new algorithms to implement this knowledge. As the data gets bigger, we learn more about optimization. As it gets more complex, we learn more about representation. And as it gets more useful, we learn more about visualization and interfaces, and we provide better service to society.

HISTORY AND THE FUNDAMENTALS OF COMPUTER SCIENCE

Edward L. Ayers, University of Virginia

We might begin with a thought experiment: What is history? Many people, I’ve discovered, think of it as books and the things in books. That’s certainly the explicit form in which we usually confront history. Others, thinking less literally, might think of history as stories about the past; that would open us to oral history, family lore, movies, novels, and the other forms in which we get most of our history.

All these images are wrong, of course, in the same way that images of atoms as little solar systems are wrong, or pictures of evolution as profiles of ever taller and more upright apes and people are wrong. They are all models, radically simplified, that allow us to think about such things in the exceedingly small amounts of time that we allot to these topics.

The same is true for history, which is easiest to envision as technological progress, say, or westward expansion, or the emergence of freedom, or else as increasing alienation, exploitation of the environment, or the growth of intrusive government.

Those of us who think about specific aspects of society or nature for a living, of course, are never satisfied with the stories that suit the purposes of everyone else so well.

We are troubled by all the things that don’t fit, all the anomalies, variance, and loose ends. We demand more complex measurement and description, and fewer smoothing metaphors and lowest common denominators.

Thus, to scientists, atoms appear as clouds of probability; evolution appears as a branching, labyrinthine bush in which some branches die out and others diversify. It can certainly be argued that past human experience is as complex as anything in nature and likely much more so, if by complexity we mean numbers of components, variability of possibilities, and unpredictability of outcomes.

Yet our means of conveying that complexity remain distinctly analog: the story, the metaphor, the generalization. Stories can be wonderfully complex, of course, but they are complex in specific ways: of implication, suggestion, evocation. That’s what people love and what they remember.

But maybe there is a different way of thinking about the past: as information. In fact, information is all we have. Studying the past is like studying scientific processes for which you have the data but cannot run the experiment again, in which there is no control, and in which you can never see the actual process you are describing and analyzing. All we have is information in various forms: words in great abundance, billions of numbers, millions of images, some sounds and buildings, artifacts.

The historian’s goal, it seems to me, should be to account for as much of the complexity embedded in that information as we can. That, it appears, is what scientists do, and it has served them well.

And how has science accounted for ever-increasing amounts of complexity in the information they use? Through ever more sophisticated instruments. The connection between computer science and history could be analogous to that between telescopes and stars, microscopes and cells. We could be on the cusp of a new understanding of the patterns of complexity in human behavior of the past.

The problem may be that there is too much complexity in that past, or too much static, or too much silence. In the sciences, we’ve learned how to filter, infer, use indirect evidence, and fill in the gaps, but we have a much more literal approach to the human past.

We have turned to computer science for tasks of more elaborate description, classification, representation. The digital archive my colleagues and I have built, the Valley of the Shadow Project, permits the manipulation of millions of discrete pieces of evidence about two communities in the era of the American Civil War. It uses sorting mechanisms, hypertextual display, animation, and the like to allow people to handle the evidence of this part of the past for themselves. This isn’t cutting-edge computer science, of course, but it’s darned hard and deeply disconcerting to some, for it seems to abdicate responsibility, to undermine authority, to subvert narrative, to challenge story.

Now, we’re trying to take this work to the next stage, to analysis. We have composed a journal article that employs an array of technologies, especially geographic information systems and statistical analysis in the creation of the evidence. The article presents its argument, evidence, and historiographical context as a complex textual, tabular, and graphical representation. XML offers a powerful means to structure text and XSL an even more powerful means to transform it and manipulate its presentation. The text is divided into sections called “statements,” each supported with “explanation.” Each explanation, in turn, is supported by evidence and connected to relevant historiography.

Linkages, forward and backward, between evidence and narrative are central. The historiography can be automatically sorted by author, date, or title; the evidence can be arranged by date, topic, or type. Both evidence and historiographical entries are linked to the places in the analysis where they are invoked. The article is meant to be used online, but it can be printed in a fixed format with all the limitations and advantages of print.
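
One way to picture the structure just described is as a small data model. The sketch below is an assumption-laden illustration, not the article's actual XML schema. The class names, field names, and the use of a publication year to stand in for "date" are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Source:
    """A historiography entry or piece of evidence (fields are assumed)."""
    author: str
    title: str
    year: int
    cited_in: List[str] = field(default_factory=list)  # back-links to statement ids


@dataclass
class Statement:
    """One section of the article: a claim, its explanation, and its support."""
    id: str
    text: str
    explanation: str
    supports: List[Source] = field(default_factory=list)

    def cite(self, source: Source) -> None:
        # Record the link in both directions so either side can be navigated.
        self.supports.append(source)
        source.cited_in.append(self.id)


def sort_sources(sources: List[Source], key: str) -> List[Source]:
    """Return sources ordered by 'author', 'year', or 'title'."""
    if key not in {"author", "year", "title"}:
        raise ValueError(f"unsupported sort key: {key}")
    return sorted(sources, key=lambda s: getattr(s, key))
```

Storing each link on both the statement and the source is what allows a reader to move from an argument to its evidence and back again. Sorting then becomes a matter of presentation rather than a change to the underlying record.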

So, what are the implications of thinking of the past in the hardheaded sense of admitting that all we really have of the past is information? One implication might be great humility, since all we have for most of the past are the fossils of former human experience, words frozen in ink and images frozen in line and color. Another implication might be hubris: if we suddenly have powerful new instruments, might we be on the threshold of a revolution in our understanding of the past? We’ve been there before.

A connection between history and social science was tried before, during the first days of accessible computers. Historians taught themselves statistical methods and even programming languages so that they could adopt the techniques, models, and insights of sociology and political science. In the 1950s and 1960s the creators of the new political history called on historians to emulate the precision, explicitness, replicability, and inclusivity of the quantitative social sciences. For two decades that quantitative history flourished, promising to revolutionize the field. And to a considerable extent it did: it changed our ideas of social mobility, political identification, family formation, patterns of crime, economic growth, and the consequences of ethnic identity. It explicitly linked the past to the present and held out a history of obvious and immediate use.

But that quantitative social science history collapsed suddenly, the victim of its own inflated claims, limited method and machinery, and changing academic fashion. By the mid-1980s, history, along with many of the humanities and social sciences, had taken the linguistic turn. Rather than software manuals and codebooks, graduate students carried books of French philosophy and German literary interpretation. The social science of choice shifted from sociology to anthropology; texts replaced tables. A new generation defined itself in opposition to social scientific methods just as energetically as an earlier generation had seen in those methods the best means of writing a truly democratic history. The first computer revolution largely failed.

The first effort at that history fell into decline in part because historians could not abide the distance between their most deeply held beliefs and what the statistical machinery permitted, the abstraction it imposed. History has traditionally been built around contingency and particularity, but the most powerful tools of statistics are built on sampling and extrapolation, on generalization and tendency. Older forms of social history talked about vague and sometimes dubious classifications in part because that was what the older technology of tabulation permitted us to see. It has become increasingly clear across the social sciences that such flat ways of describing social life are inadequate; satisfying explanations must be dynamic, interactive, reflexive, and subtle, refusing to reify structures of social life or culture. The new technology permits a new cross-fertilization.

Ironically, social science history faded just as computers became widely available, just as new kinds of social science history became feasible. No longer is there any need for white-coated attendants at huge mainframes and expensive proprietary software. Rather than reducing people to rows and columns, searchable databases now permit researchers to maintain the identities of individuals in those databases and to represent entire populations rather than samples. Moreover, the record can now include things social science history could only imagine before the Web: completely indexed newspapers, with the original readable on the screen; completely searchable letters and diaries by the thousands; and interactive maps with all property holders identified and linked to other records. Visualization of patterns in the data, moreover, far outstrips the possibilities of numerical calculation alone. Manipulable histograms, maps, and time lines promise a social history that is simultaneously sophisticated and accessible. We have what earlier generations of social science historians dreamed of: a fast and widely accessible network linked to cheap and powerful computers running common software with well-established standards for the handling of numbers, texts, and images. New possibilities of collaboration and cumulative research beckon. Perhaps the time is right to reclaim a worthy vision of a disciplined and explicit social scientific history that we abandoned too soon.

What does this have to do with computer science? Everything, it seems to me. If you want hard problems, historians have them. And what’s the hardest problem of all right now? The capture of the very information that is history. Can computer science imagine ways to capture historical information more efficiently? Can it offer ways to work with the spotty, broken, dirty, contradictory, nonstandardized information we work with?

The second hard problem is the integration of this disparate evidence in time and space, offering new precision, clarity, and verifiability, as well as opening new questions and new ways of answering them.

If we can think of these ways, then we face virtually limitless possibilities. Is there a more fundamental challenge or opportunity for computer science than helping us to figure out human society over human time?

Computer Science: Reflections on the Field, Reflections from the Field provides a concise characterization of key ideas that lie at the core of computer science (CS) research. The book offers a description of CS research recognizing the richness and diversity of the field. It brings together two dozen essays on diverse aspects of CS research, their motivation and results. By describing in accessible form computer science’s intellectual character, and by conveying a sense of its vibrancy through a set of examples, the book aims to prepare readers for what the future might hold and help to inspire CS researchers in its creation.

Praxis Core Math

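What are data representations?

Data representations are useful for interpreting data and identifying trends and relationships. They help answer questions such as: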

  • How much of the data falls within a specified category or range of values?
  • What is a typical value of the data?
  • How much spread is in the data?
  • Is there a trend in the data over time?
  • Is there a relationship between two variables?

What skills are tested?

  • Matching a data set to its graphical representation
  • Matching a graphical representation to a description
  • Using data representations to solve problems

How are qualitative data displayed?

  • A vertical bar chart lists the categories of the qualitative variable along the horizontal axis and uses the heights of the bars, read against the vertical axis, to show the values of the quantitative variable. A horizontal bar chart lists the categories along the vertical axis and uses the lengths of the bars, read against the horizontal axis, to show the values of the quantitative variable. This display draws attention to how the categories rank according to the amount of data within each. Example: the heights of the bars show the number of students who want to study each language. Using the bar chart, we can conclude that the greatest number of students want to study Mandarin and the fewest want to study Latin.
  • A pictograph is like a horizontal bar chart but uses pictures instead of the lengths of bars to represent the values of the quantitative variable. Each picture represents a certain quantity, and each category can have multiple pictures. Pictographs are visually interesting, but require us to use the legend to convert the number of pictures to quantitative values. Example: each picture represents 40 students. The number of pictures shows the number of students who want to study each language. Using the pictograph, we can conclude that twice as many students want to study French as want to study Latin.
  • A circle graph (or pie chart) is a circle that is divided into as many sections as there are categories of the qualitative variable. The area of each section represents, for each category, the value of the quantitative data as a fraction of the sum of values. The fractions sum to 1. Sometimes the section labels include both the category and the associated value or percent value for that category. Example: the area of each section represents the fraction of students who want to study that language. Using the circle graph, we can conclude that just under 1/2 of the students want to study Mandarin and about 1/3 want to study Spanish.
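
The arithmetic behind a circle graph and a pictograph can be sketched in a few lines of Python. The survey counts below are invented for illustration (the lesson's figures are not reproduced here), and the helper names `category_fractions` and `pictograph_row` are hypothetical.

```python
from fractions import Fraction

# Hypothetical survey counts, invented for this sketch.
counts = {"Mandarin": 240, "Spanish": 160, "French": 80, "Latin": 40}


def category_fractions(counts):
    """Return each category's share of the total as an exact fraction."""
    total = sum(counts.values())
    return {name: Fraction(n, total) for name, n in counts.items()}


def pictograph_row(count, per_picture, symbol="*"):
    """Build one pictograph row; each symbol stands for `per_picture` items."""
    return symbol * (count // per_picture)


shares = category_fractions(counts)

# The sections of a circle graph always add up to the whole circle.
assert sum(shares.values()) == 1

for name, n in counts.items():
    print(f"{name:9s} {pictograph_row(n, 40):7s} {float(shares[name]):.2f}")
```

With these made-up counts, Mandarin's share is 6/13 (just under 1/2) and French gets twice as many pictures as Latin, which mirrors the kind of conclusion the examples above draw.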

How are quantitative data displayed?

  • Dotplots use one dot for each data point. The dots are plotted above their corresponding values on a number line. The number of dots above each specific value represents the count of that value. Dotplots show the value of each data point and are practical for small data sets. Example: each dot represents the typical travel time to school for one student. Using the dotplot, we can conclude that the most common travel time is 10 minutes. We can also see that the values for travel time range from 5 to 35 minutes.
  • Histograms divide the horizontal axis into equal-sized intervals and use the heights of the bars to show the count or percent of data within each interval. By convention, each interval includes the lower boundary but not the upper one. Histograms show only totals for the intervals, not specific data points. Example: the height of each bar represents the number of students having a typical travel time within the corresponding interval. Using the histogram, we can conclude that the most common travel time is between 10 and 15 minutes and that all typical travel times are between 5 and 40 minutes.
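
To make the interval convention concrete, here is a minimal sketch that counts the same data two ways: once per value (a dotplot) and once per interval (a histogram). The travel times are made up for illustration, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical travel times in minutes, invented for this sketch.
times = [5, 10, 10, 10, 15, 15, 20, 25, 35]


def dotplot_counts(values):
    """Count how many dots sit above each distinct value."""
    return Counter(values)


def histogram_counts(values, start, width, n_bins):
    """Count values per interval [lower, upper): lower bound included, upper excluded."""
    counts = [0] * n_bins
    for v in values:
        index = (v - start) // width
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


dots = dotplot_counts(times)
bins = histogram_counts(times, start=5, width=5, n_bins=7)

print(dots.most_common(1))  # the tallest column of dots
print(bins)                 # counts for [5,10), [10,15), ..., [35,40)
```

Note that a travel time of exactly 10 minutes is counted in the interval from 10 to 15, not in the interval from 5 to 10, because each interval excludes its upper boundary.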

How are trends over time displayed?

How are relationships between variables displayed?

Things to remember

  • When matching data to a representation, check that the values are graphed accurately for all categories.
  • When reporting data counts or fractions, be clear whether a question asks about data within a single category or a comparison between categories.
  • When finding the number or fraction of the data meeting a criterion, watch for key words such as "or", "and", "less than", and "more than".

Types of Data Science

In the digital age, the importance of data cannot be overstated. It has become the lifeblood of organizations, driving strategic decisions, operational efficiencies, and technological innovations. This is where data science steps in – a field that blends statistical techniques, algorithmic design, and technology to analyze and interpret complex data. Data science is not a monolith; it encompasses various disciplines, each with its unique focus and methodologies. From understanding past behaviors to predicting future trends and automating decision-making processes, data science offers a comprehensive toolkit for navigating the complexities of modern-day data. In this article, we'll explore its various types, shedding light on how they contribute to extracting value from data and why individuals and organizations alike are increasingly adopting data science practices.

Why Choose Data Science?

The allure of data science lies in its capacity to turn vast amounts of raw data into actionable insights. In a world where data is continuously generated at an unprecedented rate, the ability to sift through this data, identify patterns, and make informed decisions is invaluable. Data science equips individuals and organizations with the analytical tools required to address complex problems, optimize operations, and foster innovation. Whether it’s improving customer experience, enhancing operational efficiency, or driving product development, data science plays a pivotal role in helping businesses gain a competitive edge in their respective industries.

Popular Data Science Types

Descriptive Analytics

Descriptive analytics acts as the foundation of data science. It focuses on summarizing historical data to understand what has happened. Through the use of statistical methods and visualization techniques, descriptive analytics provides a clear picture of past behaviors and trends, enabling organizations to grasp the essence of their data at a glance.
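
As a small illustration of summarizing historical data, the sketch below computes a few descriptive statistics with Python's standard `statistics` module. The monthly sales figures and the `describe` helper are hypothetical.

```python
import statistics

# Hypothetical monthly sales figures, invented for this sketch.
monthly_sales = [120, 135, 150, 110, 160, 155]


def describe(values):
    """Summarize a list of numbers with basic descriptive statistics."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.pstdev(values),  # population standard deviation
    }


summary = describe(monthly_sales)
for name, value in summary.items():
    print(f"{name:6s} {value}")
```

A summary like this answers "what happened" at a glance; it does not by itself explain why the numbers moved.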

Diagnostic Analytics

Where descriptive analytics outlines the ‘what,’ diagnostic analytics delves into the ‘why.’ It involves a deeper analysis of data to understand the causes behind observed phenomena. Diagnostic analytics employs techniques such as correlation analysis and root cause analysis to identify the factors driving outcomes, offering insights into the underlying reasons for past performance.
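
Correlation analysis, one of the techniques mentioned above, can be sketched with a hand-written Pearson coefficient. The study-hours data and the `pearson` function are hypothetical, and a high coefficient only suggests where to look for a cause; it does not establish one.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: hours studied versus test score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]

r = pearson(hours, scores)
print(f"r = {r:.3f}")  # close to 1 means a strong positive linear relationship
```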

Predictive Analytics

Predictive analytics looks to the future, using historical data to make predictions about unknown future events. It incorporates statistical models and machine learning algorithms to forecast trends, behaviors, and outcomes. This type of analytics is instrumental in decision-making processes, allowing businesses to anticipate market shifts, consumer behavior, and potential risks.
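
One of the simplest statistical models used for forecasting is a straight line fitted by least squares. The sketch below is only an illustration under that assumption; the data and the names `fit_line` and `forecast` are hypothetical, and real predictive work would validate the model on held-out data.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(model, x):
    """Predict y for a new x using a fitted (slope, intercept) pair."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical history: month number versus units sold.
months = [1, 2, 3, 4]
units = [3, 5, 7, 9]

model = fit_line(months, units)
print(forecast(model, 5))  # extrapolates the historical trend one month ahead
```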

Prescriptive Analytics

Prescriptive analytics goes a step further by not only predicting future outcomes but also recommending actions to achieve desired results. It uses optimization and simulation algorithms to provide guidance on decision-making, offering solutions to complex problems and strategies for navigating future challenges.

Machine Learning and Artificial Intelligence (AI)

Machine learning and AI represent the cutting-edge of data science, focusing on the development of algorithms that improve automatically through experience. These technologies enable machines to mimic human intelligence, learn from data patterns, and make decisions with minimal human intervention. Machine learning and AI are revolutionizing industries by enhancing predictive analytics, automating complex processes, and driving innovation.

Big Data Analytics

Big data analytics refers to the processing and analysis of vast datasets that are too large or complex for traditional data-processing software. It leverages advanced analytics techniques to uncover hidden patterns, correlations, and insights, enabling organizations to make sense of massive volumes of data from various sources.

Data Engineering

Data engineering provides the infrastructure and tools necessary for collecting, storing, and analyzing data. It focuses on the practical aspects of data preparation and architecture, ensuring that data is accessible, reliable, and in a format suitable for analysis. Data engineers play a crucial role in building and maintaining the backbone of data science projects.

Natural Language Processing (NLP)

NLP is a branch of data science that enables computers to understand, interpret, and generate human language. It applies algorithms to text and speech data to facilitate communication between humans and machines, enabling applications such as sentiment analysis, chatbots, and language translation.

Deep Learning

Deep learning is a subset of machine learning that utilizes neural networks with multiple layers to model complex patterns in data. It excels at identifying patterns in unstructured data sets, such as images, sound, and text, driving advancements in fields like computer vision and speech recognition.

Computer Vision

Computer vision focuses on enabling machines to interpret and understand the visual world. It applies deep learning models to analyze images and videos, allowing computers to recognize objects, faces, and scenes. Computer vision technologies are transforming industries by enabling new capabilities in areas such as automated inspection, surveillance, and augmented reality.

FAQ – Types of Data Science

How do I start a career in data science?

Begin by learning the basics of programming (Python or R), statistics, and machine learning. Online courses, bootcamps, and degree programs can provide a structured learning path.

What industries use data science?

Virtually every industry can benefit from data science, including finance, healthcare, retail, technology, manufacturing, and more.

Is data science only for big companies?

No, businesses of all sizes can leverage data science to gain insights, improve decision-making, and enhance competitiveness.

How does data science differ from data analytics?

Data science encompasses a broader spectrum, including creating algorithms, predictive models, and working with big data, whereas data analytics often focuses more on extracting insights from existing data sets.

In conclusion, data science is a multifaceted field that offers powerful tools for analyzing and making sense of data. From understanding past events to predicting future trends and automating decision-making, data science types cater to a broad range of needs and applications. As data continues to grow in volume, variety, and velocity, the role of data science in unlocking its value cannot be overstated. Whether you’re an individual looking to dive into this dynamic field or an organization aiming to harness the power of data, understanding the different types of data science is a crucial first step towards achieving your objectives.

COMMENTS

  1. How do computers represent data?

    How do computers store and process information? In this article, you will learn about the basic units of digital data, such as bits and bytes, and how they are used to represent different types of information. This is a foundational topic for anyone who wants to learn about computer science and programming. Khan Academy is a free online platform that offers courses in various subjects for ...

  2. Data Representation in Computer: Number Systems, Characters

    A computer uses a fixed number of bits to represent a piece of data which could be a number, a character, image, sound, video, etc. Data representation is the method used internally to represent data in a computer. Let us see how various types of data can be represented in computer memory. Before discussing data representation of numbers, let ...

  3. What are the different ways of Data Representation?

    ... Data arranged in ascending or descending order is known as arrayed data. Types of Graphical Data Representation. Bar Chart.

  4. Data representation

    Data representation. Computers use binary - the digits 0 and 1 - to store data. A binary digit, or bit, is the smallest unit of data in computing. It is represented by a 0 or a 1. Binary numbers are made up of binary digits (bits), eg the binary number 1001. The circuits in a computer's processor are made up of billions of transistors.

  5. PDF Number Systems and Number Representation

    Computer Science 217: Introduction to Programming Systems. Q: Why do computer programmers ... • Finite representation of rational (floating-point) numbers. Why? • A power programmer must know number systems and data representation to fully understand C's primitive data types: primitive values and the operations on them. Agenda. Number ...

  6. Data (computer science)

    Various types of data which can be visualized through a computer device. In computer science, data (treated as singular, plural, or as a mass noun) is any sequence of one or more symbols; datum is a single symbol of data. Data requires interpretation to become information. Digital data is data that is represented using the binary number system of ones (1) and zeros (0), instead of analog ...

  7. Data Representation

    We also cover the basics of digital circuits and logic gates, and explain how they are used to represent and process data in computer systems. Our guide includes real-world examples and case studies to help you master data representation principles and prepare for your computer science exams. Check out the links below:

  8. Numbers

    Numbers - Data Representation - Computer Science Field Guide. In this section, we will look at how computers represent numbers. To begin with, we'll revise how the base 10 number system that we use every day works, and then look at binary, which is base 2. After that, we'll look at some other characteristics of numbers that computers must deal ...

  9. PDF Data Representation

    Data Representation. Eric Roberts, CS 106A, February 10, 2016. Claude Shannon was one of the pioneers who shaped computer science in its early years. In his master's thesis, Shannon showed how it was possible to use Boolean logic and switching circuits to perform arithmetic calculations. That work led

  10. PDF Lecture Notes on Data Representation

    The constructor for elements of recursive types is fold, while unfold destructs elements.

    $$\frac{\Gamma \vdash e : [\rho\alpha.\tau/\alpha]\tau}{\Gamma \vdash \mathsf{fold}\; e : \rho\alpha.\tau} \qquad \frac{\Gamma \vdash e : \rho\alpha.\tau}{\Gamma \vdash \mathsf{unfold}\; e : [\rho\alpha.\tau/\alpha]\tau}$$

    This "unfolding" of the recursion seems like a strange operation, and it is. For example, for all other data constructors the components have a smaller

  11. Decoding Computation Through Data Representation

    Primitive data types: Computers deal with binary data at the most basic level. In most programming languages, integers, floating-point numbers, characters, and Booleans are foundational data types. Their representation involves bit patterns in memory, with specifics such as endian-ness, precision, and overflow/underflow considerations.

  12. Data representation

    The problem of data representation is the problem of representing all the concepts we might want to use in programming—integers, fractions, real numbers, sets, pictures, texts, buildings, animal species, relationships—using the limited medium of addresses and bytes. Powers of ten and powers of two.

  13. Data Representation in Computer Science

    Data representation in computer science refers to the methods used to express information in a computer system. It's how a computer interprets and functions with different information types, ranging from text and numbers to images, audio, and ...

  14. PDF Data Representation

    A computer is a tool that processes information. We know from experience that it can handle text documents, music, video, and a wide variety of other types of files. The way that the computer handles all these disparate formats is to store them all in a common encoding that the computer can work with efficiently. This encoding, which is the native

  15. PDF Number Systems and Number Representation

    Unsigned Data Types: Java vs. C. Java has type int, which can represent signed integers. C has types signed int (can represent signed integers), int (same as signed int), and unsigned int (can represent only unsigned integers). To understand C, one must consider the representation of both unsigned and signed integers.

  16. Data Representation

    Mantissa, significand and fraction are terms used synonymously. In the computer, the representation is binary and the binary point is not fixed. For example, a number, say, 23.345 can be written as 2.3345 × 10^1 or 0.23345 × 10^2 or 2334.5 × 10^-2. The representation 2.3345 × 10^1 is said to be in normalised form.

  17. 5 Data, Representation, and Information

    The preceding two chapters address the creation of models that capture phenomena of interest and the abstractions both for data and for computation that reduce these models to forms that can be executed by computer. We turn now to the ways computer scientists deal with information, especially in its static form as data that can be manipulated by programs.

  18. PDF Data Representation

    events, things, and ideas. Data can be a name, a number, the colors in a photograph, or the notes in a musical composition. • Data Representation refers to the form in which data is stored, processed, and transmitted. • Devices such as smartphones, iPods, and computers store data in digital formats that can be handled by electronic circuitry.

  19. PDF Chapter 9 Graphs: Definition, Applications, Representation

    Representation. 9.1 Graphs and Relations. Graphs (sometimes referred to as networks) offer a way of expressing relationships between pairs of items, and are one of the most important abstractions in computer science. Question 9.1. What makes graphs so special? What makes graphs special is that they represent relationships. As you will (or might have)

  20. Data Representation: Definition, Types, Examples

    Data Representation: Data representation is a technique for analysing numerical data. The relationship between facts, ideas, information, and concepts is depicted in a diagram via data representation. It is a fundamental learning strategy that is simple and easy to understand. It is always determined by the data type in a specific domain.

  21. Data representations

    Data representations are useful for interpreting data and identifying trends and relationships. When working with data representations, pay close attention to both the data values and the key words in the question. When matching data to a representation, check that the values are graphed accurately for all categories.

  22. Data Types in Programming

    Data types are the basis of programming languages. Different kinds of data types exist for the different kinds of data a program handles. Primitive data types: int, float, char, bool. Composite data types: string, array, pointers.

  23. Data Representation in Computer Organization

    Data can be anything like a number, a name, notes in a musical composition, or the color in a photograph. Data representation refers to the form in which data is stored, processed, and transmitted. Devices such as computers, smartphones, and iPads store data in digital format.

  24. Types of Data Science

    Data science encompasses a broader spectrum, including creating algorithms, predictive models, and working with big data, whereas data analytics often focuses more on extracting insights from existing data sets. In conclusion, data science is a multifaceted field that offers powerful tools for analyzing and making sense of data.
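
Several of the summaries above touch on how numbers are actually laid out in bits: the binary number 1001, signed versus unsigned integers, endianness, and the normalised form of 23.345. The sketch below illustrates those four ideas in Python. All function names are hypothetical, the signed interpretation assumes two's complement (which is what `int.from_bytes(..., signed=True)` implements), and the normalisation works on decimal digits for readability, whereas hardware floating point normalises in binary.

```python
from decimal import Decimal


def to_binary(n):
    """Convert a non-negative integer to a bit string by repeated division by 2."""
    if n == 0:
        return "0"
    bits = []
    while n > 0:
        bits.append(str(n % 2))  # the remainder is the next bit, lowest first
        n //= 2
    return "".join(reversed(bits))


def from_binary(bits):
    """Read a bit string such as '1001' back into an integer."""
    value = 0
    for bit in bits:
        value = value * 2 + int(bit)
    return value


def signed_and_unsigned(raw):
    """Interpret the same bytes as a two's-complement and as an unsigned integer."""
    return (
        int.from_bytes(raw, "big", signed=True),
        int.from_bytes(raw, "big", signed=False),
    )


def byte_orders(value, width=4):
    """Return the big-endian and little-endian byte layouts of an unsigned integer."""
    return value.to_bytes(width, "big"), value.to_bytes(width, "little")


def normalise(text):
    """Split a decimal string into (mantissa, exponent), one digit before the point."""
    number = Decimal(text)
    exponent = number.adjusted()         # power of ten of the leading digit
    mantissa = number.scaleb(-exponent)  # shift the point next to that digit
    return mantissa, exponent


print(to_binary(9), from_binary("1001"))
print(signed_and_unsigned(b"\xff\xff\xff\xff"))
print([b.hex() for b in byte_orders(0x12345678)])
print(normalise("23.345"))
```

The same four bytes read as -1 or as 4294967295 depending only on the interpretation chosen, which is the reason the signed/unsigned distinction matters when working with C's integer types.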