Newbie Learning Linux, Note #111: Memory

It is recommended to read the original post (the formatting of this copied version may be off)

The original post is by Winthcloud; link:

Overview

Memory subsystem components

Memory improvements

Viewing system calls

Strategies for using memory

Tuning page allocation

Tuning overcommit

Slab cache

ARP cache

Page cache

Tuning strategies (official Red Hat Enterprise Linux 6 Performance Tuning Guide)

Tuning related to interprocess communication

Memory subsystem components

slab allocator

buddy system

kswapd

pdflush

mmu

Virtualized environments

PA --> HA --> MA

Virtual machine address translation: PA --> HA

GuestOS, OS

Shadow PT

Memory improvements

TLB (Translation Lookaside Buffer)

Hugetlbfs

Check whether Hugetlbfs is enabled

cat /proc/meminfo | grep Huge

Enable huge pages (persistent across reboots)

/etc/sysctl.conf

Add vm.nr_hugepages = n

Enable immediately

sysctl -w vm.nr_hugepages=n

Configure hugetlbfs if needed by application

Create a hugepages mount point and mount it

mkdir /hugepages

mount -t hugetlbfs none /hugepages
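To keep the mount across reboots, an /etc/fstab entry can be added; a minimal sketch, assuming the /hugepages mount point created above:

# /etc/fstab entry for the hugetlbfs mount (assumes /hugepages exists)
none  /hugepages  hugetlbfs  defaults  0 0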

Viewing system calls

Trace every system call made by a program

strace -o /tmp/strace.out -p PID

grep mmap /tmp/strace.out

Summarize system calls

strace -c -p PID or

strace -c COMMAND
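As a concrete sketch of both forms (the PID 1234 and the ls command are only placeholders):

strace -o /tmp/strace.out -p 1234    # attach to a running process (placeholder PID)
grep mmap /tmp/strace.out            # look for memory-mapping calls
strace -c ls /tmp                    # run a command and print a per-syscall summary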

Strategies for using memory

Reduce overhead for tiny memory objects

Slab cache

cat /proc/slabinfo

Reduce or defer service time for slower subsystems

Filesystem metadata: buffer cache(slab cache)

Disk IO: page cache

Interprocess communications: shared memory

Network IO: buffer cache, arp cache, connection tracking

Considerations when tuning memory

How should pages be reclaimed to avoid pressure?

Larger writes are usually more efficient due to re-sorting

Tuning page allocation

Set using 

vm.min_free_kbytes

Tuning vm.min_free_kbytes should only be necessary when an application regularly

needs to allocate a large block of memory, then frees that same memory

It may well be the case that the system has too little disk bandwidth, too

little CPU power, or too little memory to handle its load.

Consequences

Reduces service time for demand paging

Memory is not available for other usage

Can cause pressure on ZONE_NORMAL

Exhausting memory can crash the system
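A minimal sketch of checking and raising the reserve; the value 65536 (64 MB) is an illustrative assumption, not a recommendation:

sysctl vm.min_free_kbytes                               # view the current reserve
sysctl -w vm.min_free_kbytes=65536                      # raise it at runtime (assumed value)
echo "vm.min_free_kbytes = 65536" >> /etc/sysctl.conf   # make it persistent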

Tuning overcommit

Set using

cat /proc/sys/vm/overcommit_memory

vm.overcommit_memory

0 = heuristic overcommit

1 = always overcommit

2 = never overcommit; the commit limit is swap plus vm.overcommit_ratio percent of physical RAM (the ratio may be > 100)

vm.overcommit_ratio

Specifies the percentage of physical RAM that is added to the swap size to

form the commit limit when vm.overcommit_memory is set to 2

View Committed_AS in /proc/meminfo

An estimate of how much RAM is required to avoid an out of memory (OOM)

condition for the current workload on a system

OOM

Out Of Memory
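A hedged example of switching to strict accounting and then watching the commit figures in /proc/meminfo (the ratio of 80 is an assumed value):

sysctl -w vm.overcommit_memory=2    # strict accounting: commit limit = swap + ratio% of RAM
sysctl -w vm.overcommit_ratio=80    # include 80% of RAM in the limit (assumed value)
grep -i commit /proc/meminfo        # CommitLimit vs. Committed_AS shows limit vs. demand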

Slab cache

Tiny kernel objects are stored in slab

Extra overhead of tracking is better than using 1 page/object

Example: filesystem metadata(dentry and inode caches)

Monitoring

/proc/slabinfo

slabtop

vmstat -m

Tuning a particular slab cache

echo "cache_name limit batchcount shared" > /proc/slabinfo

limit the maximum number of objects that will be cached for each CPU

batchcount the maximum number of global cache objects that will be 

  transferred to the per-CPU cache when it becomes empty

shared the sharing behavior for Symmetric MultiProcessing(SMP) systems
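A sketch of the echo form above, using the dentry cache as an example; the numbers are placeholders, and this write interface only applies when the kernel uses the SLAB allocator:

grep dentry /proc/slabinfo                  # inspect current dentry-cache usage
echo "dentry 4096 64 8" > /proc/slabinfo    # limit=4096 batchcount=64 shared=8 (placeholder values)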

ARP cache

ARP entries map protocol (IP) addresses to hardware (MAC) addresses

cached in /proc/net/arp

By default, the cache is limited to 512 entries as a soft limit 

and 1024 entries as a hard limit

Garbage collection removes stale or older entries

Insufficient ARP cache leads to

Intermittent timeouts between hosts

ARP thrashing

Too much ARP cache puts pressure on ZONE_NORMAL

List entries

ip neighbor list

Flush cache

ip neighbor flush dev ethX

Tuning ARP cache

Number of entries below which garbage collection leaves the ARP table alone

net.ipv4.neigh.default.gc_thresh1

default 128

Soft upper limit

net.ipv4.neigh.default.gc_thresh2

default 512

Becomes hard limit after 5 seconds

Hard upper limit

net.ipv4.neigh.default.gc_thresh3

Garbage collection frequency in seconds

net.ipv4.neigh.default.gc_interval
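A hedged sketch of raising the thresholds for a network with many neighbors; the values are assumptions, not recommendations:

sysctl -w net.ipv4.neigh.default.gc_thresh1=256     # leave the table alone below this size (assumed)
sysctl -w net.ipv4.neigh.default.gc_thresh2=1024    # soft limit (assumed)
sysctl -w net.ipv4.neigh.default.gc_thresh3=2048    # hard limit (assumed)
ip neighbor list                                    # verify the current entries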

Page cache

A large percentage of paging activity is due to I/O requests

File reads: each page of file read from disk into memory

These pages form the page cache

Page cache is always checked for IO requests

Directory reads

Reading and writing regular files

Reading and writing via block device files, DISK IO

Accessing memory mapped files, mmap

Accessing swapped out pages

Pages in the page cache are associated with file data

Tuning page cache

View page cache allocation in /proc/meminfo

Tune length/size of memory

vm.lowmem_reserve_ratio

vm.vfs_cache_pressure

Tune arrival/completion rate

vm.page-cluster

vm.zone_reclaim_mode

vm.lowmem_reserve_ratio

For some specialised workloads on highmem machines it is dangerous for the 

kernel to allow process memory to be allocated from the "lowmem" zone

Linux page allocator has a mechanism which prevents allocations which could

use highmem from using too much lowmem

The 'lowmem_reserve_ratio' tunable determines how aggressive the kernel is 

in defending these lower zones

If you have a machine which uses highmem or ISA DMA and your applications

are using mlock(), or if you are running with no swap then you probably 

should change the lowmem_reserve_ratio setting
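The tunable is exposed as a list of ratios, one per lower zone; a quick way to view it (sketch, output values vary by system):

cat /proc/sys/vm/lowmem_reserve_ratio    # e.g. "256 256 32"; a smaller ratio means a larger reserve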

vfs_cache_pressure

Controls the tendency of the kernel to reclaim the memory which is used for

caching of directory and inode objects

At the default value of vfs_cache_pressure=100 the kernel will attempt to 

reclaim dentries and inodes at a "fair" rate with respect to pagecache and 

swapcache reclaim

Decreasing vfs_cache_pressure causes the kernel to prefer to retain dentry 

and inode caches

When vfs_cache_pressure=0, the kernel will never reclaim dentries and 

inodes due to memory pressure and this can easily lead to out-of-memory

conditions

Increasing vfs_cache_pressure beyond 100 causes the kernel to prefer to 

reclaim dentries and inodes.
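For example, a hedged sketch of favoring dentry/inode caches on a metadata-heavy workload (50 is an assumed value):

sysctl -w vm.vfs_cache_pressure=50    # below 100: prefer to keep dentry/inode caches (assumed value)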

page-cluster

page-cluster controls the number of pages which are written to swap in a

single attempt

It is a logarithmic value: setting it to zero means "1 page", setting it to

1 means "2 pages", setting it to 2 means "4 pages", etc

The default value is three (eight pages at a time)

There may be some small benefits in tuning this to a different value if 

your workload is swap-intensive
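A hedged one-liner, in case a swap-heavy workload warrants experimenting (the value 4 is an assumption):

sysctl -w vm.page-cluster=4    # 2^4 = 16 pages per swap attempt (assumed value)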

zone_reclaim_mode

Zone_reclaim_mode allows someone to set more or less aggressive approaches

to reclaim memory when a zone runs out of memory

If it is set to zero then no zone reclaim occurs

Allocations will be satisfied from other zones/nodes in the system

The value is a bitmask formed by ORing together:

1 = Zone reclaim on

2 = Zone reclaim writes dirty pages out

4 = Zone reclaim swaps pages
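Since the value is a bitmask, combinations are formed by OR-ing the bits; a minimal sketch:

sysctl -w vm.zone_reclaim_mode=0    # never reclaim from the local zone; allocate from other zones/nodes
sysctl -w vm.zone_reclaim_mode=3    # 1|2: zone reclaim on, and dirty pages may be written out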

Anonymous pages

Anonymous pages can be another large consumer of memory

Are not associated with a file, but instead contain:

Program data: arrays, heap allocations, etc

Anonymous memory regions

Dirty memory mapped process private pages

IPC shared memory region pages

View summary usage

grep Anon /proc/meminfo

cat /proc/PID/statm

Anonymous pages = RSS - Shared

Anonymous pages are eligible for swap
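Following the RSS - Shared formula above, a rough per-process estimate can be read from /proc/PID/statm, whose second and third fields are resident and shared pages; the PID 1234 is only a placeholder:

awk '{ print ($2 - $3) * 4096 / 1024 " kB anonymous" }' /proc/1234/statm    # assumes 4 KiB pages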

Tuning strategies

Hardware tuning: hardware selection

Software tuning: kernel tuning via /proc and /sys

  Application tuning

Kernel tuning areas

1. Process management, CPU

2. Memory tuning

3. I/O tuning

4. Filesystems

5. Network subsystem

Tuning approach

1. Examine the performance metrics and locate the bottleneck

2. Tune accordingly

Red Hat provides an official document, the Red Hat Enterprise Linux 6 Performance Tuning Guide, which can be found with a web search

Tuning related to interprocess communication

ipcs (interprocess communication facilities)

Commands for managing interprocess communication

ipcs

ipcrm
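Typical usage of the two commands (the segment id 131072 is a placeholder):

ipcs -m            # list shared memory segments
ipcs -q            # list message queues
ipcs -l            # show the current system-wide IPC limits
ipcrm -m 131072    # remove a shared memory segment by id (placeholder id)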

shared memory

kernel.shmmni

Specifies the maximum number of shared memory segments 

system-wide, default = 4096

kernel.shmall

Specifies the total amount of shared memory, in pages, that

can be used at one time on the system, default=2097152

This should be at least kernel.shmmax/PAGE_SIZE

kernel.shmmax

Specifies the maximum size of a shared memory segment that 

can be created
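A hedged /etc/sysctl.conf sketch for the three parameters; the sizes are illustrative assumptions, and kernel.shmall is counted in pages, so it should be at least kernel.shmmax / PAGE_SIZE:

kernel.shmmni = 4096          # max number of segments (default shown above)
kernel.shmmax = 4294967296    # max single segment of 4 GiB, in bytes (assumed)
kernel.shmall = 1048576       # 4 GiB / 4096-byte pages = 1048576 pages (assumed)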

messages

kernel.msgmnb

Specifies the maximum number of bytes in a single message 

queue, default = 16384

kernel.msgmni

Specifies the maximum number of message queue identifiers, 

default=16

kernel.msgmax

Specifies the maximum size of a message that can be passed 

between processes

This memory cannot be swapped, default=8192
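And a matching sketch for the message queue parameters (the values are assumptions, not recommendations):

kernel.msgmni = 512      # max number of message queue identifiers (assumed)
kernel.msgmnb = 65536    # max bytes held in a single queue (assumed)
kernel.msgmax = 65536    # max size of a single message, in bytes (assumed)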