# How C# data types are stored in memory

Time: 2021-6-11

In C#, every variable is either a value type or a reference type. The conversion between the two is hard to understand without knowing how each is stored in memory. This article walks through that process by way of memory, the stack, the heap, value types, reference types, and the boxing operations that conversion between them produces, setting aside the layers of technical fog to get at what actually happens. If you have questions or spot mistakes in the description, corrections in the comments area are welcome. Let’s learn and make progress together!

## Memory

### Physical structure of memory

Before talking about data structures, let’s review the physical structure of memory, which is relatively simple. Most people have seen memory modules:

Abstract memory module model:

Memory is actually an electronic component called a memory IC. A memory IC has a large number of pins (IC pins) for power, address signals, data signals, and control signals, and input and output take place through these pins. VCC and GND are the power pins, A0 to A9 are the address signal pins, D0 to D7 are the data signal pins, and RD and WR are the control signal pins. Once power is connected to VCC and GND, a signal of 0 or 1 can be sent to the other pins. In most cases, a DC voltage of +5 V means 1 and 0 V means 0.

How much data can this memory IC store? There are eight data signal pins (D0 to D7), so 8 bits (= 1 byte) of data can be input or output at a time. There are also 10 address signal pins (A0 to A9), so 1024 addresses, from 0000000000 to 1111111111, can be specified. An address identifies where a piece of data is stored, so the memory IC can hold 1024 one-byte values. Because 1024 = 1K, the capacity of the memory IC is 1 KB.

The computers we use now have at least 512 MB of memory. That is equivalent to about 512K (512 MB ÷ 1 KB = 512K) of these 1 KB memory ICs. Of course, it is not practical to put that many memory ICs into a computer. The memory ICs used in real computers have more address signal pins, so a single IC can store tens of megabytes of data, and a capacity of 512 MB can be reached with only a few of them. As shown in the figure above, a 1 GB memory module has more pins.

Writing to memory

Let’s continue with the 1 KB memory IC. Suppose we want to write 1 byte of data to it. To do so, connect VCC to a +5 V supply and GND to 0 V, use the address signals A0 to A9 to specify where the data is to be stored, put the value of the data on the data signals D0 to D7, and set the WR (short for write) signal to 1. After these operations the data is written into the memory IC (as shown in figure a).

Reading from memory

To read the data back, we only need to specify its storage location through the address signals A0 to A9 and then set the RD (short for read) signal to 1. After these operations, the data stored at the specified address is output on the data signal pins D0 to D7 (as shown in figure b).

Signals such as WR and RD, which enable operations on the IC, are called control signals. When WR and RD are both 0, neither a write nor a read can be performed.

In short, a memory IC contains many locations that can each store 8 bits of data. Each location is specified by an address, and data is read from or written to it.
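The read and write procedure can be mimicked in software. The following is a minimal sketch, not a model of any real chip: the class name `MemoryIc` and its methods are hypothetical, and the WR and RD control signals are reduced to two method calls.

```csharp
using System;

// Hypothetical model of the 1 KB memory IC described above:
// 10 address pins (A0 to A9) and 8 data pins (D0 to D7).
public sealed class MemoryIc
{
    private const int AddressPins = 10;
    private readonly byte[] cells = new byte[1 << AddressPins]; // 2^10 = 1024 cells

    public int Capacity => cells.Length;

    // Corresponds to setting WR to 1: store the value on the data pins at the address.
    public void Write(int address, byte data)
    {
        CheckAddress(address);
        cells[address] = data;
    }

    // Corresponds to setting RD to 1: put the stored value on the data pins.
    public byte Read(int address)
    {
        CheckAddress(address);
        return cells[address];
    }

    private void CheckAddress(int address)
    {
        if (address < 0 || address >= cells.Length)
            throw new ArgumentOutOfRangeException(nameof(address));
    }
}

public static class MemoryIcDemo
{
    public static void Main()
    {
        var ic = new MemoryIc();
        ic.Write(0b11_1111_1111, 123);               // highest address, 1023
        Console.WriteLine(ic.Capacity);              // 1024
        Console.WriteLine(ic.Read(0b11_1111_1111));  // 123
    }
}
```

An address outside the 10-bit range is rejected, which mirrors the fact that the IC has no pins with which to express it.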

### Logical model of memory

The logical model of memory can be pictured as a building in which data is stored on each floor. One floor stores one byte of data, and the floor number is the address. At this level we no longer need to think about the power supply and control signals of the memory IC. A 1 KB memory is a building with 1024 floors (the space taken by a programming language’s data types is not shown here).

In a program, the data type determines how much space a value occupies (the number of floors). The program below defines three variables; the space they actually occupy in memory is as follows:

```csharp
// Define variables
byte a = 123;
short b = 123;
int c = 123;
```

A byte is a unit of binary data. The commonly used byte is 8 bits, that is, a binary number with 8 binary digits, so 4 bytes is 32 bits. A signed byte can represent 2^7 = 128 negative values and 128 non-negative values (-128 to 127), and an unsigned byte can represent 2^8 = 256 values (0 to 255).
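These sizes and ranges can be checked directly in C#. The short sketch below uses only the built-in `sizeof` operator and the `MinValue`/`MaxValue` constants; the class name `TypeSizes` is just a placeholder. Note that C#’s `byte` is the unsigned 8-bit type and `sbyte` is the signed one.

```csharp
using System;

public static class TypeSizes
{
    public static void Main()
    {
        // sizeof on built-in types is a compile-time constant, in bytes.
        Console.WriteLine(sizeof(byte));   // 1
        Console.WriteLine(sizeof(short));  // 2
        Console.WriteLine(sizeof(int));    // 4

        // byte is unsigned; sbyte is the signed 8-bit type.
        Console.WriteLine($"{byte.MinValue}..{byte.MaxValue}");   // 0..255
        Console.WriteLine($"{sbyte.MinValue}..{sbyte.MaxValue}"); // -128..127
    }
}
```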

An address marks the location of a storage resource and is represented in the computer as a string of binary digits. There are two kinds:

1. Physical address: the address loaded into the memory address register, that is, the real address of a memory cell. The memory addresses transmitted on the front-side bus are all physical addresses, numbered from 0 up to the top of available physical memory. These numbers are mapped to the actual memory modules by the northbridge chip. A physical address is the specific number that is ultimately used on the bus, with no translation, no paging, and no privilege-level checking.

2. Logical address: the address generated by the CPU. Logical addresses are used inside the CPU and by programs, and are not unique. How they are generated depends on the compiler (which compiles source code into machine code).

The figure below shows the basic architecture of the CPU and the computer. We use it to illustrate how logical and physical addresses are resolved:

• All arithmetic logic units in the CPU see logical addresses.

• When the CPU needs to read from or write to memory, the MMU converts the logical address to the corresponding physical address.

• The control logic sends the data, the operation request, and the physical address to the bus. Requests are either reads or writes (a write request writes data to memory; a read request reads data from memory and sends it to the CPU).

• The MMU is responsible for the conversion between logical and physical addresses.

• The operating system is responsible for establishing the mapping between logical and physical addresses.
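The division of labor in the list above (the operating system sets up the mapping, the MMU applies it) can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyMmu`, its dictionary-based page table, and the 4 KB page size are all hypothetical choices for illustration, and a real MMU performs the translation in hardware.

```csharp
using System;
using System.Collections.Generic;

// Toy model of logical-to-physical address translation with 4 KB pages.
public sealed class ToyMmu
{
    private const int PageSize = 4096;

    // Page table filled in by the "operating system": logical page -> physical frame.
    private readonly Dictionary<int, int> pageTable = new Dictionary<int, int>();

    public void Map(int logicalPage, int physicalFrame) => pageTable[logicalPage] = physicalFrame;

    // The "MMU" step: split the logical address, look up the frame, rebuild the address.
    public int Translate(int logicalAddress)
    {
        int page = logicalAddress / PageSize;
        int offset = logicalAddress % PageSize;
        if (!pageTable.TryGetValue(page, out int frame))
            throw new InvalidOperationException($"Page {page} is not mapped.");
        return frame * PageSize + offset;
    }
}

public static class ToyMmuDemo
{
    public static void Main()
    {
        var mmu = new ToyMmu();
        mmu.Map(logicalPage: 0, physicalFrame: 7);
        Console.WriteLine(mmu.Translate(100)); // 7 * 4096 + 100 = 28772
    }
}
```

The program only ever works with the logical address 100; where that lands physically depends entirely on the table the operating system installed.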

Windows uses a virtual addressing system to map the memory addresses available to a program onto actual addresses in hardware memory. This work is managed entirely by Windows in the background. The practical result is that each process on a 32-bit processor sees 4 GB of address space, of which 2 GB is user-mode memory, no matter how much physical memory or disk space the computer has (on 64-bit processors the figure is larger). This 4 GB, with its 2 GB of user-mode memory, is called the virtual address space, or virtual memory.

This also involves a great deal of memory management knowledge, such as contiguous memory allocation and non-contiguous memory allocation (virtual memory) as ways of managing and optimizing memory. Interested readers can explore further.

### How a program is allocated in memory

When an EXE program (whose contents are relocation information, a group of variables, and a group of functions) is launched, it is loaded into virtual memory, and its virtual memory addresses are translated into actual memory addresses. In virtual memory, two additional groups are created for the program: the stack and the heap.

The stack is an array of memory that works as a last in, first out (LIFO) data structure (a first in, first out structure is called a queue). It is also called the thread stack. Each running program corresponds to a process (or several, but a process belongs to only one application). A process contains one or more threads, and each thread has its own “private space” called the thread stack, which is 1 MB in size by default. The stack stores the following kinds of data:

• Values of certain types of variables (value types)
• The current execution environment of the program
• Parameters passed to the method

Allocating and releasing this memory area requires no management by the programmer.
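The last in, first out behavior can be seen with the `Stack<T>` and `Queue<T>` collections from the .NET class library. This is only an analogy for how frames are pushed and popped on the thread stack, with method names standing in for frames; it does not inspect the real thread stack.

```csharp
using System;
using System.Collections.Generic;

public static class LifoDemo
{
    public static void Main()
    {
        // Each call pushes a frame; each return pops the most recent one.
        var stack = new Stack<string>();
        stack.Push("Main");
        stack.Push("MethodA");
        stack.Push("MethodB");
        Console.WriteLine(stack.Pop()); // MethodB
        Console.WriteLine(stack.Pop()); // MethodA

        // A queue, by contrast, is first in, first out.
        var queue = new Queue<string>();
        queue.Enqueue("first");
        queue.Enqueue("second");
        Console.WriteLine(queue.Dequeue()); // first
    }
}
```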

The heap is an area of memory in which large blocks can be allocated to store certain types of data objects. In C# it is called the managed heap and is managed by the CLR. Unlike the stack, memory in the heap can be allocated and released in any order. Because of this, the used space in the heap becomes fragmented, which is something the GC has to deal with.

The EXE file itself contains no stack or heap groups. The memory needed for the stack and the heap is allocated when the EXE file is loaded into memory and starts to run. A program in memory therefore consists of four parts: memory for variables (global variables, static variables, constants), memory for functions, memory for the stack, and memory for the heap. The memory into which Windows or another operating system is itself loaded is a separate matter.

As shown in the figure below:

The stack and the heap are alike in that their memory is allocated when the program runs. They differ in how the memory is used. The code that stores data on the stack and discards it is generated automatically by the compiler, so the programmer does not need to take part. Each time a function is called, memory for its stack data is allocated, and it is released automatically when the function finishes. Heap memory, by contrast, is allocated and released explicitly by the programmer’s code. How that code is written depends on the programming language.

For example, C requires the programmer to call functions (malloc() and free()) to allocate and release heap memory by hand, and C++ uses operators (new and delete). The difficulty in C and C++ is that if the program does not explicitly release heap memory, that memory stays allocated even after the program has finished using it. This is called a memory leak. Both C# and Java use automatic garbage collection (GC) to handle this, so the programmer does not have to manage the heap by hand.
https://docs.microsoft.com/zh-cn/dotnet/standard/garbage-collection/
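As a small illustration of automatic collection, the sketch below allocates many short-lived arrays without ever releasing them explicitly and then checks that at least one generation 0 collection has run. The class `GcDemo` and its method are placeholders; the call to `GC.Collect()` is there only to make the demonstration deterministic, since normally the CLR decides when to collect.

```csharp
using System;

public static class GcDemo
{
    // Allocates short-lived arrays on the managed heap; none is kept alive.
    public static void AllocateGarbage(int count)
    {
        for (int i = 0; i < count; i++)
        {
            var buffer = new byte[1024];
            buffer[0] = 1;
        }
    }

    public static void Main()
    {
        int before = GC.CollectionCount(0);
        AllocateGarbage(100_000);
        GC.Collect(); // Forced for demonstration only.
        int after = GC.CollectionCount(0);
        Console.WriteLine(after > before); // True
    }
}
```

There is no counterpart to free() or delete anywhere in the code: the arrays become unreachable at the end of each loop iteration and the collector reclaims them.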

Now let’s take a concrete look at how data is allocated and stored in the stack and heap

## Stack and heap

Why use the heap when memory already has a stack? The stack works by releasing variables in the reverse of the order in which they were allocated (first in, last out), and it fills from top to bottom (from high memory addresses to low ones). But many variables do not exist on their own: they may be nested inside or shared with other variables, so their lifetimes do not fit this strict ordering. To solve this, the heap is designed to allocate from bottom to top, so that the stack’s first in, last out rule does not conflict with the lifetimes of these variables. And why is the stack still needed when the heap resolves the conflict? Because storing every variable on the heap would degrade the performance of the application.

Before describing how data is stored on the stack and the heap, we need to introduce the data types in C#.

### C# data type hierarchy

The data types are as follows:

Data types are divided into value types and reference types. A value-type local variable is stored directly on the stack. A reference-type variable first stores a reference (an address) on the stack, and the data itself is stored in space allocated on the heap (the managed heap).

```csharp
// Create an object
Student st;         // Declare a reference to a Student object
st = new Student(); // Allocate the instance on the heap and assign its address to st
```

When the reference st is declared, space for the reference is set aside on the stack (4 bytes in a 32-bit process; it holds no address yet, because the instance has not been created). Only the reference is stored here, not the Student object itself. When the second line runs, memory for the Student object is allocated on the heap. Suppose an instance of Student takes 32 bytes: the CLR has to find an unused, contiguous block of 32 bytes (32 × 8 bits) to store the instance, and the address of the allocated instance is then assigned to the st variable. If no such block is available, a garbage collection is triggered. If there is still not enough space after the collection, an OutOfMemoryException is thrown.

This example shows that creating an object through a reference is more involved than creating a value variable, and the performance cost cannot be avoided. To improve performance for simple, commonly used types, the CLR provides a lightweight kind of type called a “value type”. A variable of a value type does not contain a pointer to an instance; the variable contains the fields of the instance itself. Because of that, working with those fields does not require following a pointer. A declaration such as ``Int32 a = new Int32();`` shows that value types have instances too. Instances of value types are not managed by the garbage collector. Using value types therefore relieves pressure on the managed heap, reduces the number of garbage collections over the lifetime of the application, and improves performance.

Here is an example to illustrate the difference between reference type and value type:

```csharp
using System;

// Reference type: a class
class StudentRef
{
    public int age;
}

// Value type: a struct
struct StudentVal
{
    public int age;
}

static class Demo
{
    static void ValueTypeDemo()
    {
        StudentRef r1 = new StudentRef(); // Allocated on the heap
        StudentVal v1 = new StudentVal(); // Allocated on the stack
        r1.age = 18;                      // Follows the pointer to the heap object
        v1.age = 18;                      // Modified directly on the stack
        Console.WriteLine(r1.age);        // Displays "18"
        Console.WriteLine(v1.age);        // Also displays "18"

        // ****** Separator ******
        StudentRef r2 = r1;               // Copies only the reference
        StudentVal v2 = v1;               // Allocates on the stack and copies the members
        r1.age = 20;                      // Both r1.age and r2.age are now 20
        v1.age = 21;                      // v1.age becomes 21; v2.age is still 18
    }
}
```

As shown in the figure (the type object pointer associates an object with its type object, and the sync block index is used for synchronization, such as thread synchronization):

Let’s look at the storage of more complex objects

```csharp
// Define a base class for a student
class Student
{
    public void Study()
    {
        Console.WriteLine("learn!");
    }

    public virtual int Credit(int x, int y)
    {
        Console.WriteLine($"Total credits: {x + y}, compulsory: {x}, elective: {y}");
        return x + y;
    }

    public static void Play(string s)
    {
        Console.WriteLine("play: " + s);
    }
}

// Freshman
class Freshman : Student
{
    // Override the credit method
    public override int Credit(int x, int y)
    {
        Console.WriteLine($"Total credits of freshman: {x + y}");
        return x + y;
    }
}

// Instantiate the Xiao Ming object:
public void XiaoMing()
{
    int score;                    // 1. Total credits
    // Xiao Ming is a freshman
    Student xm = new Freshman();  // 2. Instantiate the object
    score = xm.Credit(30, 5);     // 3. Call the virtual method; the subclass override is what actually runs
    xm.Study();                   // 4. Call the instance method
    Student.Play("game");         // 5. Call the static method through the type name. No matter how many instances exist, there is only one copy of a static member
    // The second instance object: Xiao Hua
    // Student xh = new Freshman(); // Instance 2 of Freshman shown in the figure
}
```

As shown in the figure:

Explanation:

• The process starts, the CLR is loaded into it, the managed heap is initialized, and a thread is created (with 1 MB of stack space). The thread has already executed some code and now calls the XiaoMing() method. As the JIT compiler converts the method’s IL code into native CPU instructions, it notes the types referenced in the method and makes sure they are loaded. The CLR then extracts the relevant type data into memory and creates and initializes data structures that represent the types themselves.

• By the time XiaoMing() executes, the type objects for its parameter and variable types have been created (commonly used types are loaded first).

• The type object pointer and the sync block index are members of every object. Each type object contains a method table, and the method table has one entry for each method the type defines.

The Student type has three method entries, while the Freshman type has only one. Because of the inheritance relationship, the Freshman type object has a field that refers to its base type (the same is true of every type), which lets the JIT compiler trace the class hierarchy all the way back to Object.
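The link from an object to its type object, and from a type to its base type, can be observed through reflection. The sketch below uses simplified stand-ins for the Student and Freshman classes (with a parameterless `Credit` method, which is an assumption made to keep the example short) and relies only on `GetType()` and `Type.BaseType`.

```csharp
using System;

public class Student
{
    public virtual string Credit() => "Student.Credit";
}

public class Freshman : Student
{
    public override string Credit() => "Freshman.Credit";
}

public static class TypeObjectDemo
{
    public static void Main()
    {
        Student xm = new Freshman();

        // The variable is declared as Student, but the object's type object is Freshman.
        Console.WriteLine(xm.GetType().Name); // Freshman

        // A virtual call is dispatched on the object's actual type.
        Console.WriteLine(xm.Credit());       // Freshman.Credit

        // Each type refers to its base type, all the way back to Object.
        Console.WriteLine(typeof(Freshman).BaseType == typeof(Student)); // True
        Console.WriteLine(typeof(Student).BaseType == typeof(object));   // True
    }
}
```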

#### Boxing and unboxing

##### Boxing

Value types are “lighter” than reference types because they are not allocated as objects on the managed heap, are not collected by the GC, and are not referenced through pointers. In many cases, however, you need a reference to an instance of a value type. For example:

```csharp
using System;
using System.Collections;

// Value type
struct Point
{
    public Int32 x, y;
}

public sealed class Program
{
    public static void Main()
    {
        ArrayList a = new ArrayList();
        Point p;                  // Allocate a Point (not on the heap)
        for (Int32 i = 0; i < 10; i++)
        {
            p.x = p.y = i;        // Initialize the members of the value type
            a.Add(p);             // Box the value and add the reference to the ArrayList
        }
    }
}
```

Each iteration initializes the fields of a Point and stores the Point in the ArrayList. The Add method of ArrayList is declared as follows:

``public virtual Int32 Add(Object value);``

The Add method takes an Object parameter; that is, it expects a reference (or pointer) to an object on the managed heap. But the code above passes p, a Point, which is a value type. For the code to work, the Point value must be converted into a real object on the managed heap, and a reference to that object must be obtained.

To convert a value type to a reference type, the boxing mechanism is used. What happens during boxing is summarized as follows:

• Memory is allocated on the managed heap. The amount is the memory required by the fields of the value type plus the memory required by the two additional members that every object on the managed heap has (the type object pointer and the sync block index).
• The fields of the value type are copied into the newly allocated heap memory.
• The address of the object is returned. This address is now an object reference; the value type has become a reference type.
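The copy made in the second step is easy to observe. In the minimal sketch below (the class name `BoxingDemo` is a placeholder), the boxed object keeps the value it was given even after the original variable changes, and boxing the same variable twice produces two distinct heap objects.

```csharp
using System;

public static class BoxingDemo
{
    public static void Main()
    {
        int i = 5;
        object boxed = i;   // Boxing: the value is copied into a new heap object.
        i = 6;              // Changes only the stack copy.
        Console.WriteLine(boxed);                 // 5
        Console.WriteLine(boxed.GetType().Name);  // Int32

        object boxedAgain = i;  // Each boxing operation allocates a separate object.
        Console.WriteLine(object.ReferenceEquals(boxed, boxedAgain)); // False
    }
}
```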

Note: because ArrayList treats every inserted item as Object, code that processes data with ArrayList may fail at run time with a type mismatch error; in other words, ArrayList is not type safe. C# generic collections can be used to get type safety.
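The difference can be sketched as follows. `TypeSafetyDemo` and `FailsAtRunTime` are hypothetical names used for illustration; `ArrayList`, `List<T>`, and `InvalidCastException` are the standard .NET types.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public static class TypeSafetyDemo
{
    // Returns true if reading element 0 as an int fails at run time.
    public static bool FailsAtRunTime(ArrayList list)
    {
        try
        {
            _ = (int)list[0];
            return false;
        }
        catch (InvalidCastException)
        {
            return true;
        }
    }

    public static void Main()
    {
        var untyped = new ArrayList();
        untyped.Add("not a number");                 // Compiles: every element is just Object.
        Console.WriteLine(FailsAtRunTime(untyped));  // True

        var typed = new List<int>();
        typed.Add(42);                               // Stored as an int, with no boxing.
        // typed.Add("not a number");                // Would not compile.
        int first = typed[0];                        // No cast and no unboxing.
        Console.WriteLine(first);                    // 42
    }
}
```

With the generic list the mistake is caught by the compiler, and value types are stored without being boxed.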

##### Unboxing

After boxing, the value is unboxed with code such as the following:

``Point p=(Point) a[0];``

This takes the reference (or pointer) contained in element 0 of the ArrayList and tries to put it into p, an instance of the Point value type. To do so, all the fields of the boxed Point object must be copied into the value-type variable p, which is on the thread stack. The CLR completes the copy in two steps:

• It obtains the address of the Point fields inside the boxed Point object. This step is called unboxing.
• It copies the values of those fields from the heap to the stack-based value-type instance.

Unboxing is not simply boxing in reverse. The cost of unboxing is much lower than that of boxing. Unboxing is really just the act of obtaining a pointer to the raw value type (the data fields) contained in an object; the pointer refers to the unboxed portion of the boxed instance. So, unlike boxing, unboxing itself does not require copying any bytes in memory, although it is usually followed immediately by a field copy.
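A short sketch of what this looks like from C# code follows. The `Point` struct here is a simplified stand-in with `int` fields named `X` and `Y`, and `UnboxingDemo` is a placeholder name. The example also shows that a cast from Object must name the exact type that was boxed.

```csharp
using System;

public struct Point
{
    public int X, Y;
}

public static class UnboxingDemo
{
    public static void Main()
    {
        Point p;
        p.X = 1;
        p.Y = 2;
        object boxed = p;            // Boxing: copy p to the heap.

        Point copy = (Point)boxed;   // Unboxing, followed by a field copy into 'copy'.
        copy.X = 100;                // Modifies only the stack copy.
        Console.WriteLine(((Point)boxed).X); // 1

        object boxedInt = 42;
        try
        {
            long wrong = (long)boxedInt; // Fails: the boxed type is Int32, not Int64.
            Console.WriteLine(wrong);
        }
        catch (InvalidCastException)
        {
            Console.WriteLine("InvalidCastException");
        }
    }
}
```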

Boxing and unboxing clearly hurt an application’s speed and memory consumption, so you should know when the compiler generates code that performs these operations automatically, and try to write your code so that they happen as little as possible.