1 Introduction to C

1.3 Data types and variables

1.3.1 Symbols and keywords

From the code listing, we see examples of C syntax. The following table lists some basic symbols.

Basic symbols

Table 1: Basic symbols
Symbol Description
// line comment, compiler will ignore the line
/* ... */ block comment, compiler will ignore what is in between
# start of preprocessor element
; statement terminator
, list separator
() parentheses enclosing a function parameter/argument list or grouping an algebraic expression
{} scope of a program block

A program block consists of a sequence of statements. A statement can be a declaration, an assignment, a function call, a selection, or a repetition statement.

Keywords (reserved words)

C has 32 keywords. We put them into five categories:

Table 2: Keywords
Category Keyword
Basic data types char, int, float, double, short, long, signed, unsigned, void
Define data types typedef, struct, union, enum
Modifiers const, auto, static, extern, volatile, register
Flow control if, else, switch, case, default, goto, for, while, do, break, continue
Function return, sizeof

1.3.2 Data types

There are four basic and build-up concepts on data representation in programming: data type -> variable -> data structure -> algorithm. Any programming language has to address these concepts. We will study these concepts in C. First, we look into data type.

What is a data type?

A data type (or simply type) defines

  • how a certain type of data values is represented in programs,
  • the number of bytes used to represent data values and how a data value is represented in binary format and stored in memory, and
  • what operations are applied to the data values and how data values are operated in the operations.

How is a data type used?

Programmers use data types to tell the compiler what kind of data is stored, how much memory it requires, and what operations can be applied to it. The compiler then generates instructions to allocate memory blocks for the data. At runtime, these instructions instantiate the memory blocks and write data values to, or read data values from, them.

Data types in C

C provides a set of basic (also called primitive, primary, or fundamental) data types, specified by keywords: char for character type, int for integer type, float for single precision floating number type, and double for double precision floating number type. Arithmetic operations can be applied to the basic data types. Some of these operations are supported by processor instructions.

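C uses the keywords short, long, signed, and unsigned as modifiers for the int type, and signed and unsigned as modifiers for the char type, to form more basic data types. For example, long int represents a long integer type. The keyword void indicates the absence of a type or value; for example, a function that returns nothing has return type void. A pointer of type void * can point to data of any type.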

C provides methods to define extended (also referred to as non-primitive, secondary, or derived) data types using keywords typedef, struct, union, enum together with pointers and arrays. A hierarchy of extended data types can be built bottom up starting from the basic data types. We will learn the methods of constructing the extended data types in Lesson 3.

Data type size

Each data type has a size, i.e., the number of bytes needed to store a data value of the type in memory. Each addressable memory cell holds 1 byte, consisting of 8 bits. A bit is the unit of data storage; a byte (8 bits) is the unit of addressable data storage, i.e., the size of a memory cell. A word is the unit of data transfer between CPU registers and memory, i.e., the size of a register.

A data object of a certain type consists of one or more bytes in a region of data storage in the execution environment. The contents of data objects represent the data values of the type.

The keyword sizeof is an operator that gives the size of a data type in bytes. For example, sizeof(char) is always 1, meaning that the char type uses 1 byte, and sizeof(float) is 4 on most platforms, meaning that the float type uses 4 bytes.

Because a data type has a size, the number of different values it can represent is limited. For example, the single character type char has 1 byte (8 bits), so it can represent at most 2^8 = 256 different values. A data type therefore always has a limited range of data values.

Memory block

When a data type has more than one byte, it needs contiguous memory cells to store a value of the type. For example, float has 4 bytes, so it uses 4 memory cells with consecutive addresses. A group of contiguous memory cells is referred to as a memory block. A memory block is usually defined by the lowest address of its memory cells and the number of cells (number of bytes).

Basic data types in C

Table 3 shows the commonly used basic data types and their sizes and ranges. Note that the size of int is system/platform dependent. It is typically 2 on 16-bit systems and 4 on 32-bit and 64-bit systems. In this course, we will use 4 as the default size for the int type.

Table 3: Examples of basic data types
Data type /Keyword   Size in bytes   Range
char                 1               -128 to 127
unsigned char        1               0 to 255
int                  4               -2^31 to 2^31-1
unsigned int         4               0 to 2^32-1
float                4               about 1.2E-38 to 3.4E+38 (magnitude)
double               8               about 2.2E-308 to 1.8E+308 (magnitude)

Let’s look into the char and int types.

char type

The char type has 1 byte. As a signed type it represents integers from -128 to 127 (whether plain char is signed or unsigned depends on the platform). C uses ASCII (American Standard Code for Information Interchange) to map the integers from 0 to 127 to characters. The following is the ASCII table, in which each row gives the code of a character in Dec (decimal), Oct (octal), Hex (hexadecimal), and Bin (binary), followed by the character and its description.

ASCII table
Dec Oct Hex Bin         Char    Description
  0 000 00  00000000    NUL     "null" character
  1 001 01  00000001    SOH     start of header
  2 002 02  00000010    STX     start of text
  3 003 03  00000011    ETX     end of text
  4 004 04  00000100    EOT     end of transmission
  5 005 05  00000101    ENQ     enquiry
  6 006 06  00000110    ACK     acknowledgment
  7 007 07  00000111    BEL     bell
  8 010 08  00001000    BS      backspace
  9 011 09  00001001    HT      horizontal tab
 10 012 0A  00001010    LF      line feed
 11 013 0B  00001011    VT      vertical tab
 12 014 0C  00001100    FF      form feed
 13 015 0D  00001101    CR      carriage return
 14 016 0E  00001110    SO      shift out
 15 017 0F  00001111    SI      shift in
 16 020 10  00010000    DLE     data link escape
 17 021 11  00010001    DC1     device control 1 (XON)
 18 022 12  00010010    DC2     device control 2
 19 023 13  00010011    DC3     device control 3 (XOFF)
 20 024 14  00010100    DC4     device control 4
 21 025 15  00010101    NAK     negative acknowledgement
 22 026 16  00010110    SYN     synchronous idle
 23 027 17  00010111    ETB     end of transmission block
 24 030 18  00011000    CAN     cancel
 25 031 19  00011001    EM      end of medium
 26 032 1A  00011010    SUB     substitute
 27 033 1B  00011011    ESC     escape
 28 034 1C  00011100    FS      file separator
 29 035 1D  00011101    GS      group separator
 30 036 1E  00011110    RS      request to send/record separator
 31 037 1F  00011111    US      unit separator
 32 040 20  00100000    SP      space
 33 041 21  00100001    !       exclamation mark
 34 042 22  00100010    "       double quote
 35 043 23  00100011    #       number sign
 36 044 24  00100100    $       dollar sign
 37 045 25  00100101    %       percent
 38 046 26  00100110    &       ampersand
 39 047 27  00100111    '       single quote
 40 050 28  00101000    (       left/opening parenthesis
 41 051 29  00101001    )       right/closing parenthesis
 42 052 2A  00101010    *       asterisk
 43 053 2B  00101011    +       plus
 44 054 2C  00101100    ,       comma
 45 055 2D  00101101    -       minus or dash
 46 056 2E  00101110    .       dot
 47 057 2F  00101111    /       forward slash
 48 060 30  00110000    0    
 49 061 31  00110001    1    
 50 062 32  00110010    2    
 51 063 33  00110011    3    
 52 064 34  00110100    4    
 53 065 35  00110101    5    
 54 066 36  00110110    6    
 55 067 37  00110111    7    
 56 070 38  00111000    8    
 57 071 39  00111001    9    
 58 072 3A  00111010    :       colon
 59 073 3B  00111011    ;       semi-colon
 60 074 3C  00111100    <       less than
 61 075 3D  00111101    =       equal sign
 62 076 3E  00111110    >       greater than
 63 077 3F  00111111    ?       question mark
 64 100 40  01000000    @       "at" symbol
 65 101 41  01000001    A    
 66 102 42  01000010    B    
 67 103 43  01000011    C    
 68 104 44  01000100    D    
 69 105 45  01000101    E    
 70 106 46  01000110    F    
 71 107 47  01000111    G    
 72 110 48  01001000    H    
 73 111 49  01001001    I    
 74 112 4A  01001010    J    
 75 113 4B  01001011    K    
 76 114 4C  01001100    L    
 77 115 4D  01001101    M    
 78 116 4E  01001110    N    
 79 117 4F  01001111    O    
 80 120 50  01010000    P    
 81 121 51  01010001    Q    
 82 122 52  01010010    R    
 83 123 53  01010011    S    
 84 124 54  01010100    T    
 85 125 55  01010101    U    
 86 126 56  01010110    V    
 87 127 57  01010111    W    
 88 130 58  01011000    X    
 89 131 59  01011001    Y    
 90 132 5A  01011010    Z    
 91 133 5B  01011011    [       left/opening bracket
 92 134 5C  01011100    \       back slash
 93 135 5D  01011101    ]       right/closing bracket
 94 136 5E  01011110    ^       caret/circumflex
 95 137 5F  01011111    _       underscore
 96 140 60  01100000    `    
 97 141 61  01100001    a    
 98 142 62  01100010    b    
 99 143 63  01100011    c    
100 144 64  01100100    d    
101 145 65  01100101    e    
102 146 66  01100110    f    
103 147 67  01100111    g    
104 150 68  01101000    h    
105 151 69  01101001    i    
106 152 6A  01101010    j    
107 153 6B  01101011    k    
108 154 6C  01101100    l    
109 155 6D  01101101    m    
110 156 6E  01101110    n    
111 157 6F  01101111    o    
112 160 70  01110000    p    
113 161 71  01110001    q    
114 162 72  01110010    r    
115 163 73  01110011    s    
116 164 74  01110100    t    
117 165 75  01110101    u    
118 166 76  01110110    v    
119 167 77  01110111    w    
120 170 78  01111000    x    
121 171 79  01111001    y    
122 172 7A  01111010    z    
123 173 7B  01111011    {       left/opening brace
124 174 7C  01111100    |       vertical bar
125 175 7D  01111101    }       right/closing brace
126 176 7E  01111110    ~       tilde
127 177 7F  01111111    DEL     delete 

For example, character A has ASCII code value 65 in decimal (Dec), 101 in octal (Oct), 41 in hexadecimal (Hex), and 1000001 in binary (Bin).

We see that character a has ASCII code 97. The difference between the ASCII code values of a and A is 32. So we can get the code of a by adding 32 to the code of A, i.e., 97 = 65+32. Similarly, we can get the code of A by subtracting 32 from the code of a, i.e., 65 = 97-32. This is how letter case conversion works.

You should know the conversions between decimal, octal, hexadecimal, and binary representations. The following example shows the conversions of decimal 65 to binary, octal and hexadecimal representations.

Conversions of number representation in different bases:

65 (decimal, base 10) = 6*10^1 + 5*10^0

1000001 (binary, base 2) = 1*2^6 + 0*2^5 + 0*2^4 + 0*2^3 + 0*2^2 + 0*2^1 + 1*2^0 = 65

101 (octal, base 8) = 1*8^2 + 0*8^1 + 1*8^0 = 65

41 (hexadecimal, base 16) = 4*16^1 + 1*16^0 = 65

Decimal to binary conversion:

Quotient  remainder
65 / 2    1   the least significant bit, or right most bit
32 / 2    0
16 / 2    0
8  / 2    0
4  / 2    0
2  / 2    0
1  / 2    1   the most significant bit, or left most bit
0
=> 1000001  

Binary to octal:

Starting from the right side, each group of three binary bits converts to one octal digit.
1 000 001   
1   0   1

Binary to hexadecimal:

Starting from the right side, each group of four binary bits converts to one hexadecimal digit.
100 0001 
  4    1


C supports the four character value representations in programming. For example, you can write A in source code programs as follows.
'A'     -- character `A`
65      -- Dec
0101    -- Oct
0x41    -- Hex

The compiler converts any of these char representations to the same binary value in machine representation. For example, in C programming, the statements char c = 'A';, char c = 65;, char c = 0101;, and char c = 0x41; are equivalent. Note that an octal constant starts with 0 and its digits have to be 0 to 7. For example, 08 is not a valid octal constant. Similarly, a hexadecimal constant starts with 0x (or 0X) and its digits have to be 0, 1, …, 9, A (or a), B (or b), C (or c), D (or d), E (or e), F (or f). For example, 0x4Aa is a valid hexadecimal constant and 0x4AG is not. Standard C before C23 does not support binary constants in source code, although some compilers accept a 0b prefix as an extension.

int type

The int data type is platform/system dependent. It typically has 2 bytes on 16-bit systems and 4 bytes on 32-bit and 64-bit systems. We use a 32-bit system as our default, so the int data type has 4 bytes and represents integers from -2^31 to 2^31-1. The first bit from the left is the sign bit, 0 for non-negative and 1 for negative. Negative numbers are stored in two's complement form: the bit pattern of -n is obtained by inverting every bit of n and adding 1. For example, decimal integer 1890259661 is represented in 4 bytes as 01110000 10101011 00010010 11001101, and decimal integer -1890259661 is represented in 4 bytes as 10001111 01010100 11101101 00110011.

Little-endian and Big-endian

How do you store the 4 bytes of an int type number in a memory block of size 4? There are two ways to place the 4 bytes: little-endian and big-endian. Little-endian stores the least significant byte in the memory cell with the smallest address. Big-endian stores the most significant byte in the memory cell with the smallest address. The little-endian method is commonly used.

Figure 1 shows the little-endian and big-endian arrangements of 01110000 10101011 00010010 11001101 in memory, where the smallest memory cell address of the memory block is 1000.

Figure 1: Little-endian and big-endian

C supports decimal, octal and hexadecimal expressions for int type value representation in programming. The following example shows the conversion and the representations of integer 1890259661.

1890259661                                   (decimal)     C representation: 1890259661
= 01110000 10101011 00010010 11001101        (binary)
= 01 110 000 101 010 110 001 001 011 001 101 (binary)
=  1   6   0   5   2   6   1   1   3   1   5 (octal)       C representation: 016052611315
= 0111 0000 1010 1011 0001 0010 1100 1101    (binary)
=    7    0    A    B    1    2    C    D    (hexadecimal) C representation: 0x70AB12CD

In C programming, statements int a = 1890259661;, int a = 016052611315;, and int a = 0x70AB12CD; are equivalent. That is, the compiler will convert the right side number to the same binary number to store in the allocated memory block.

float and double types

C supports floating-point (real) numbers with two types, float and double. The float type has 4 bytes and holds single precision floating-point numbers. The double type has 8 bytes and holds double precision floating-point numbers. On nearly all platforms, both formats are defined by the IEEE 754 standard. In this course, you are not required to know the detailed bit patterns and operations of the float and double types.

In C programming, a floating-point number can be written in either decimal format or scientific format. For example, 31.4 is in decimal format; it can also be written in scientific format as 0.314e2, meaning 0.314*10^2. Statements float f = 31.4; and float f = 0.314e2; are equivalent.

1.3.3 Variables

The concepts of variables

A variable is a name (also called an identifier) used in programming to represent a data object, which stores data values at runtime. A variable is allocated a memory block at compile time and is instantiated as an object in memory at runtime.

Specifically, a variable is a name used in source code, referring to a data object of a certain type. It tells the compiler to allocate a memory block for the variable, and to use it to set/get values to/from its memory block. The variable's memory block is instantiated at an absolute location in computer memory at runtime.

C variable and type

C is a typed programming language. That means a C variable must have a data type. A variable has to be declared a type and initialized (i.e., assigned a value) before it can be used. The variable declaration tells the compiler to assign a relative memory block for the variable. A relative memory block is represented as an offset from a starting point (the scope of the variable) and size (number of bytes) of the block. The value assignment statement tells the compiler to generate instructions to write a value to the memory block. At runtime, the instructions write the value of the data at the absolute memory locations when they are executed.

For example, to use an integer value 10 in a program, we can declare an int variable and initialize it to 10 with the statement int x = 10;. This statement first declares a variable named x and then sets its value to 10. It tells the compiler to allocate a relative memory block for x and generate instructions to write value 10 to the memory block. The compiler uses a table to remember variable name x and its relative memory block. If a later statement uses x, for example printf("%d", x);, the compiler gets the relative memory location of x from the table and generates instructions to read the value from the memory block.

Example:

int a;    // let compiler allocate 4 bytes memory for variable a
char c;   // let compiler allocate 1 byte memory for variable c
float f;  // let compiler allocate 4 bytes memory for variable f
a = 2;    // let compiler generate instructions to store value 2 to variable a
c = 'a';  // let compiler generate instructions to store 97 (01100001) to variable c
f = 1.41; // let compiler convert 1.41 to 32 bits single precision float number and store it to f

C allows declaring and/or initializing several variables of the same type in one statement, separated by commas. For example, int a=1,b=2,c; is a valid statement.

Scope

A scope consists of a sequence of statements, in which identifiers (variable/function names) are declared and used. Generally, a scope consists of a sequence of statements enclosed by a pair of block symbols {}, i.e., starting from { and ending at }. In particular, the global scope is the whole program, not enclosed in {}. Scopes can be separate, i.e., {…}…{…}, or nested, i.e., {…{…}…}.

Each variable has a scope. A variable can only be used/accessed by statements after the variable declaration within the scope, including its nested scopes.

A local variable is a variable declared within a code block enclosed by {}, and can be used in the block after its declaration within the block. For example, variables declared in a function are local variables, which can only be used in the function.

A global variable is a variable declared not in any function, so it can be used by any function.

Compilers bind a variable to its scope. The relative memory location of a variable is relative to the beginning of its scope. Two variables in separate scopes can use the same variable name. For example, in the code listing, variable a is a global variable and is accessed in the main function by assigning value 1 to a. Variable b in the main function is a local variable. Variable names x and y are local variables used in both the add and minus functions.

Variable name convention

C has restrictions on variable naming. C variable names must start with a letter or an underscore, followed by letters, underscores and digits, and must not be keywords. Variable names are case sensitive. For program readability, C programming has two naming convention styles: underscore_style and camelCaseStyle. The camelCaseStyle is also used in C++ and Java programming. The underscore_style is used in classic C programming. We will use the underscore_style in course code examples.

There are different C programming styles used by C programmers, communities and organizations. For example, C Style and Coding Standards.

1.3.4 Constants

Constants are fixed data values in programs. For example, 3.1415926 is the constant Pi for computing circumference and area of circles. Constants can be directly used in source code programs. However, if a constant is used many times in source code, it is not convenient to type the constant value every time it is used. It’s better to use a simple name to represent such a constant, and replace it by an actual constant later on.

Define constants by macro

C preprocessor provides a method to define a constant by name and to replace the name by the constant in the preprocessing step.

Example:

#define PI 3.1415926 // this defines macro PI as 3.1415926, then PI can be used in statements.
float r = 4; 
float cf = 2*PI*r; 
float area = PI*r*r; 

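In the preprocessing step of compiling, every appearance of PI in the program is replaced by 3.1415926. In the compilation step, each occurrence of 3.1415926 is then converted to its binary floating-point representation. A constant without a suffix has type double, and the result of the expression is converted to single precision when it is stored in a float variable.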

Define constants by read-only variables

C provides an alternative method to define constants, known as constant variables or read-only variables. It uses the keyword const in variable declarations. Such a constant variable must be declared and initialized in one statement. Its value cannot be changed by assignment later on.

Example:

const float pi = 3.1415926;
float r = 4;
float cf = 2*pi*r;
float area = pi*r*r;

In the above example, variable pi is declared as a constant (read-only) variable. It is not allowed to change the value of pi in the program. For example, if we add the statement pi = 3.14; to the program, it will fail to compile. Using this method, the value 3.1415926 is converted to single precision representation only once during compilation.

1.3.5 Exercises

Self-quiz

Take a moment to review what you have read above. When you feel ready, take this self-quiz which does not count towards your grade but will help you to gauge your learning. Answer the questions posed below.
