MemoryMap.h
/*
 * Copyright (C) 2010 Regents of the University of Michigan
 *
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program. If not, see <http://www.gnu.org/licenses/>.
 */

#ifndef __MEMORYMAP_H
#define __MEMORYMAP_H
#include <sys/types.h>
#include <fcntl.h>

#if defined(_WIN32)
#include <windows.h>
#endif

///
/// There are two related data structures in the operating system, and
/// also a few simple algorithms, that explain why your processes are
/// waiting forever.
///
/// The symptom you have is that they are getting little or no CPU time,
/// as shown in the command 'top'. The machine will appear to have
/// available CPU time (look at the Cpu(s): parameter - if less than 100%,
/// you have available CPU). The real key, however, is to look at the
/// 'top' column with the label 'S' - that is the status of the process,
/// and crucial to understanding what is going on.
///
/// In your instance, the 'S' column for your karma jobs is 'D', which
/// means it is waiting for data. This is because the process is doing
/// something that is waiting for the filesystem to return data to it.
/// Usually, this is because of a C call like read() or write(), but it
/// also happens in large processes where memory was copied to disk and
/// re-used for other purposes (this is called paging).
///
/// So, a bit of background on the operating system... there is a CPU
/// scheduler that takes a list of waiting processes, and picks one to
/// run - if the job is waiting for the disk, there is no point in picking
/// it to run, since it is blocked, waiting for the disk to return data.
/// The scheduler marks the process with 'D' and moves on to the next
/// process to schedule.
///
/// In terms of data structures, there are two that we care about for
/// this example. First is a linear list of disk buffers that are stored
/// in RAM and controlled by the operating system. This is usually called
/// the disk buffer pool. Usually, when a program asks for data from the
/// disk, this list can be scanned quickly to see if the data is already
/// in RAM - if so, no disk operation needs to take place.
///
/// Now in the case of the normal Unix read() and write() calls, when the
/// operating system is done finding the page, it copies the data into a
/// buffer to be used by the process that requested it (in the case of a
/// read() - a write() is the opposite). This copy operation is slow and
/// inefficient, but gets the job done.
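///
/// As a rough sketch of that path (plain POSIX calls, not this class's
/// API; the file name is hypothetical), the copy lands in a buffer that
/// the caller supplies:
/// \code
/// char buffer[4096];
/// int fd = open("data.bin", O_RDONLY);
/// ssize_t n = read(fd, buffer, sizeof(buffer));
/// // the kernel copies the bytes from the disk buffer pool into 'buffer'
/// \endcode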
///
/// So overall, you gain some efficiency in a large memory system by
/// having this disk buffer pool data structure, since you aren't
/// re-reading the disk over and over to get the same data that you
/// already have in RAM. However, it is less efficient than it might be
/// because of the extra buffer copying.
///
/// Now we come to memory mapped files, and karma. The underlying system
/// call of interest to us is mmap(), and it is used in MemoryMap.cpp.
/// What it does and how it works are important to understanding the
/// benefits of it, and frankly, most people don't care about it because
/// it is seemingly complex.
///
/// Two things are important to know: firstly, there is a data structure
/// in the CPU called the page table, which is mostly contained in the CPU
/// hardware itself. All memory accesses for normal user processes like
/// karma go through this hardware page table. Secondly, it is very fast
/// for the operating system to put together a page table that 'connects'
/// a bunch of memory locations in your user program's address space to
/// the disk buffer pool pages.
///
/// The combination of those two facts means that you can implement a
/// 'zero copy' approach to reading data, which means that the data that
/// is in the disk buffer pool is directly readable by the program without
/// the operating system ever having to actually copy the data, like it
/// does for read() or write().
///
/// So the benefit of mmap() is that when the underlying disk pages are
/// already in the disk buffer pool, a hardware data structure gets built,
/// then the program returns, and the data is available at full processor
/// speed with no intervening copy of the data, or waiting for disk or
/// anything else. It is as near to instantaneous as you can possibly
/// get. This works whether it is 100 bytes or 100 gigabytes.
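///
/// Again as a rough sketch of the underlying POSIX call (MemoryMap::open()
/// below wraps this when useMemoryMapFlag is set; the file name is
/// hypothetical), a read-only mapping looks something like:
/// \code
/// int fd = open("data.bin", O_RDONLY);
/// struct stat st;
/// fstat(fd, &st);
/// void *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
/// // 'p' now addresses the file contents via the page table; no copy
/// // into a user buffer is made, unlike read()
/// \endcode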
///
/// So, the last part of the puzzle is why your program winds up in 'D'
/// (data wait), and what to do about it.
///
/// The disk buffer pool is a linear list of blocks ordered by the time
/// and date of access. A process runs every once in a while to take the
/// oldest of those pages and free them, during which it also has to
/// update the hardware page tables of any processes referencing them.
///
/// So on wonderland, most file access (wget, copy, md5sum, anything else)
/// is constantly putting new fresh pages at the front of the list, and
/// karma index files, having been opened a while ago, are prime
/// candidates for being paged out. The reason they get paged out, as far
/// as I know, is that in any given second of execution, nowhere near the
/// entire index is getting accessed... so at some point, at least one
/// page gets sent back to disk (well, flushed from RAM). Once that
/// happens, a cascading effect happens, where the longer it waits, the
/// older the other pages get, then the more that get reclaimed, and the
/// slower it gets, until karma is at a standstill, waiting for pages to
/// be brought back into RAM.
///
/// Now in an ideal world, karma would rapidly recover, and it can...
/// sometimes. The problem is that your karma job is accessing data all
/// over that index, and it essentially looks like pure random I/O to the
/// underlying filesystem. There is about a 10 to 1 performance
/// difference between accessing the disk sequentially as compared to
/// randomly.
///
/// So to make karma work better, the first thing I do when starting karma
/// is force it to read all of the disk pages in order. This causes the
/// entire index to be forced into memory in order, so it is forcing
/// sequential reads, which is the best case possible. There are
/// problems, for example if three karma jobs start at once, the disk I/O
/// is no longer as purely sequential as we would like. Also, if the
/// filesystem is busy taking care of other programs, even if karma thinks
/// it is forcing sequential I/O, the net result looks more random. This
/// happens when the system is starting to break down (thrashing) and it
/// will certainly stall, or look very very slow, or crash.
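///
/// A minimal sketch of that warm-up pass (this is essentially what
/// MemoryMap::prefetch() below is for, though the real implementation
/// may differ):
/// \code
/// volatile char sink = 0;
/// for (size_t i = 0; i < mapped_length; i += page_size)
///     sink += ((char *) data)[i];   // touch one byte per page, in order
/// \endcode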
///
/// The upshot of all of this is that when a single reference is shared,
/// it is more likely that all the pages will be in the disk buffer pool
/// to begin with, and thereby reduce startup time to nearly zero. It is
/// also the ideal situation in terms of sharing the same reference among,
/// say, 24 copies of karma on wonderland - the only cost is the hardware
/// page table that gets set up to point to all of the disk buffers.
///
/// As I mentioned a paragraph back, the pages can still get swapped out,
/// even with dozens of karma jobs running. A workaround I created is a
/// program in utilities called mapfile - it simply repeatedly accesses
/// the data in sequential order to help ensure that all of the pages are
/// at the head of the disk buffer pool, and therefore less likely to get
/// swapped out.
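///
/// Conceptually (this is not the actual mapfile source, and the interval
/// and the 'map' object are made up), that utility is just the same
/// sequential touch, repeated:
/// \code
/// while (true)
/// {
///     map.prefetch();   // touch every page in order
///     sleep(60);        // then wait before the next pass
/// }
/// \endcode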
///
/// The benefit of such a program (mapfile) is greater on wonderland,
/// where a lot of processes are competing for memory and disk buffers.
///
///
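/// A minimal usage sketch of this class (the file name is hypothetical,
/// and assumes an index previously written with MemoryMap::create()):
/// \code
/// MemoryMap map;
/// map.useMemoryMap(true);       // must be set before open()
/// map.open("reference.idx");    // opens read-only by default
/// map.prefetch();               // touch the pages sequentially
/// char c = map[0];              // contents are addressable like an array
/// map.close();
/// \endcode
///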
class MemoryMap
{
#if defined(_WIN32)
    HANDLE file_handle;
    HANDLE map_handle;
    DWORD page_size;
#else
    int fd;
    size_t page_size;
#endif
    off_t offset;
    size_t mapped_length;
    size_t total_length;
    bool useMemoryMapFlag;
public:

    void *data;

    MemoryMap();

    virtual ~MemoryMap();

    void debug_print();

    void constructor_clear();

    void destructor_clear();

    virtual bool allocate();

    /// open a previously created mapped vector
    ///
    /// useMemoryMapFlag will determine whether it
    /// uses mmap() or malloc()/read() to populate
    /// the memory
    virtual bool open(const char * file, int flags = O_RDONLY);

    /// create the memory mapped file on disk
    ///
    /// a file will be created on disk with the header
    /// filled in. The caller must now populate elements
    /// using (*this).set(index, value).
    //
    virtual bool create(const char * file, size_t size);

    /// store in allocated memory (malloc), not mmap:
    ///
    /// This is for code that needs to more flexibly handle
    /// the case when an mmap() file _might_ be available,
    /// but if it is not, we want to load it as a convenience
    /// to the user. GenomeSequence::populateDBSNP does exactly this.
    //
    virtual bool create(size_t size);

    bool close();
    void test();
    size_t length()
    {
        return mapped_length;
    }

    char operator[](unsigned int index)
    {
        return ((char *)data)[index];
    }
    int prefetch(); // force pages into RAM

    //
    // set or unset use of the mmap() call in ::open().
    // This flag must be set before ::open() is called;
    // if it is set afterwards, it has no effect.
    //
    void useMemoryMap(bool flag=true)
    {
        useMemoryMapFlag = flag;
    }
};

#endif