[PATCH 0/4] powernv: kvm: numa fault improvement
Liu ping fan
kernelfans at gmail.com
Wed Feb 26 14:09:19 EST 2014
Sorry for the late update. It took a long time to get access to a test
machine, and then I hit a series of other bugs which I could not resolve
easily. For now I have some higher-priority tasks, and I will come
back to this topic when time is available.
Besides this, I did some basic tests on an HV guest with and without
numa-fault; they show a 10% drop in performance
when numa-fault is on. (Tested with $pg_access_random 60 4 200; the
guest has 10GB of mlocked pages.)
I think this is caused by the following factors: cache misses,
TLB misses, guest->host exits, and the hw-threads cooperating to exit
from guest state. I hope my patches help to reduce the cost of
guest->host exits and of the hw-threads cooperating to exit.
My test case launches 4 threads in the guest (as 4 hw-threads), and each
of them makes random accesses to a page-aligned area.
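As a side note, here is a minimal sketch of how a page-aligned random offset could be picked when the region size is not a power of two (200 MB is not one). The helper name random_page_offset is hypothetical and is not part of the test case below; it uses rand() only to stay within ISO C, and it assumes the region is at least one page long.

```c
#include <stddef.h>
#include <stdlib.h>

#define PAGE_SIZE (1 << 12)

/* Hypothetical helper: return a page-aligned byte offset in [0, len).
 * Assumes len >= PAGE_SIZE. Picking a page index with a modulo keeps
 * every page reachable even when len is not a power of two. */
static size_t random_page_offset(size_t len)
{
	size_t pages = len / PAGE_SIZE;        /* whole pages in the region */
	size_t page = (size_t)rand() % pages;  /* index of the chosen page */

	return page * PAGE_SIZE;               /* convert back to a byte offset */
}
```

The modulo introduces a small bias when the page count does not divide the range of rand(), which should not matter for a fault-rate test like this one.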
I hope to get some suggestions about the test case, so that when I have
time, I can improve and finish the test.
Thanks,
Fan
--- test case: usage: pg_random_access secs fork_num mem_size---
#include <ctype.h>
#include <errno.h>
#include <libgen.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/timerfd.h>
#include <stdint.h> /* Definition of uint64_t */
#include <poll.h>
#define CMD_STOP 0x1234
#define SHM_FNAME "/numafault_shm"
#define PAGE_SIZE (1<<12)
/* the protocol defined on the shm */
#define SHM_CMD_OFF 0x0
#define SHM_CNT_OFF 0x1
#define SHM_MESSAGE_OFF 0x2
#define handle_error(msg) \
	do { perror(msg); exit(EXIT_FAILURE); } while (0)
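/* write to one randomly chosen page of the region; len must be at
   least PAGE_SIZE */
static inline void random_access(void *region_start, size_t len)
{
	volatile int *p;
	size_t num;

	/* pick a page index, then turn it into a page-aligned offset;
	   this also works when len is not a power of two */
	num = (size_t)random() % (len / PAGE_SIZE);
	num *= PAGE_SIZE;
	p = (volatile int *)((char *)region_start + num);
	*p = 0x654321;
}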
static int numafault_body(int size_MB)
{
	/* a size in MB is always a multiple of PAGE_SIZE, so it is ok to
	   test faults page by page */
	size_t size = (size_t)size_MB * 1024 * 1024;
	void *region_start = malloc(size);
	unsigned long *pmap;
	int shm_fid;
	unsigned long cnt = 0;
	pid_t pid = getpid();
	char *dst;
	char buf[128];

	if (!region_start) {
		printf("child fail to allocate %i MB\n", size_MB);
		return -1;
	}
	/* the parent has already created and sized the shm object */
	shm_fid = shm_open(SHM_FNAME, O_RDWR, S_IRUSR | S_IWUSR);
	if (shm_fid == -1)
		handle_error("shm_open");
	pmap = mmap(NULL, PAGE_SIZE, PROT_WRITE | PROT_READ,
			MAP_SHARED, shm_fid, 0);
	if (pmap == MAP_FAILED) {
		printf("child fail to setup mmap of shm\n");
		return -1;
	}
	/* atomic load, so the command word is re-read on every pass */
	while (__atomic_load_n(pmap + SHM_CMD_OFF, __ATOMIC_RELAXED) != CMD_STOP) {
		random_access(region_start, size);
		cnt++;
	}
	__atomic_fetch_add((pmap+SHM_CNT_OFF), cnt, __ATOMIC_SEQ_CST);
	dst = (char *)(pmap+SHM_MESSAGE_OFF);
	//tofix, need lock
	snprintf(buf, sizeof(buf), "child [%i] cnt=%lu\n", (int)pid, cnt);
	strcat(dst, buf);
	munmap(pmap, PAGE_SIZE);
	/* the parent unlinks the shm object after it has read the results */
	fprintf(stdout, "[%i] cnt=%lu\n", (int)pid, cnt);
	fflush(stdout);
	exit(0);
}
int main(int argc, char **argv)
{
	int i;
	pid_t pid;
	int shm_fid;
	unsigned long *pmap;
	int fork_num;
	int size;
	char *dst_info;
	struct itimerspec new_value;
	int fd;
	struct timespec now;
	struct pollfd pfd;
	int elapsed;

	if (argc != 4){
		fprintf(stderr, "%s wait-secs [secs elapsed before parent "
			"asks the children to exit]\n"
			" fork-num [child num]\n"
			" size [memory region covered by each child in MB]\n",
			argv[0]);
		exit(EXIT_FAILURE);
	}
	elapsed = atoi(argv[1]);
	fork_num = atoi(argv[2]);
	size = atoi(argv[3]);
	if (fork_num <= 0 || size <= 0) {
		fprintf(stderr, "fork-num and size must be positive\n");
		exit(EXIT_FAILURE);
	}
	printf("fork %i child process to test mem %i MB for a period: %i sec\n",
		fork_num, size, elapsed);
	/* flush now, so the children do not inherit buffered output */
	fflush(stdout);

	fd = timerfd_create(CLOCK_REALTIME, 0);
	if (fd == -1)
		handle_error("timerfd_create");
	shm_fid = shm_open(SHM_FNAME, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
	if (shm_fid == -1)
		handle_error("shm_open");
	if (ftruncate(shm_fid, PAGE_SIZE) == -1)
		handle_error("ftruncate");
	pmap = mmap(NULL, PAGE_SIZE, PROT_WRITE | PROT_READ,
			MAP_SHARED, shm_fid, 0);
	if (pmap == MAP_FAILED) {
		printf("fail to setup mmap of shm\n");
		return -1;
	}
	/* clear the command, the counter and the message area */
	memset(pmap, 0, PAGE_SIZE);
	//wmb();
	for (i = 0; i < fork_num; i++){
		switch (pid = fork())
		{
		case 0: /* child */
			numafault_body(size);
			exit(0);
		case -1: /* error */
			fprintf(stderr, "fork failed: %s\n", strerror(errno));
			break;
		default: /* parent */
			printf("fork child [%i]\n", (int)pid);
		}
	}
	if (clock_gettime(CLOCK_REALTIME, &now) == -1)
		handle_error("clock_gettime");
	/* Create a one-shot CLOCK_REALTIME absolute timer that expires
	   wait-secs seconds from now */
	new_value.it_value.tv_sec = now.tv_sec + elapsed;
	new_value.it_value.tv_nsec = now.tv_nsec;
	new_value.it_interval.tv_sec = 0;
	new_value.it_interval.tv_nsec = 0;
	if (timerfd_settime(fd, TFD_TIMER_ABSTIME, &new_value, NULL) == -1)
		handle_error("timerfd_settime");
	pfd.fd = fd;
	pfd.events = POLLIN;
	pfd.revents = 0;
	/* -1: infinite wait */
	poll(&pfd, 1, -1);
	/* ask children to stop and get back cnt */
	__atomic_store_n(pmap + SHM_CMD_OFF, CMD_STOP, __ATOMIC_SEQ_CST);
	/* wait for every child, not only the first one to exit */
	while (wait(NULL) > 0)
		;
	dst_info = (char *)(pmap + SHM_MESSAGE_OFF);
	printf("%s", dst_info);
	printf("total cnt:%lu\n", *(pmap + SHM_CNT_OFF));
	munmap(pmap, PAGE_SIZE);
	shm_unlink(SHM_FNAME);
	return 0;
}
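To make the shm protocol used by the test case easier to follow, here is a rough sketch of the same layout written as a struct. The struct name numafault_shm and its field names are hypothetical; the test case itself indexes an unsigned long array with SHM_CMD_OFF, SHM_CNT_OFF and SHM_MESSAGE_OFF, and I assume the message area runs to the end of the one-page mapping.

```c
#include <stddef.h>

#define PAGE_SIZE (1 << 12)

/* Hypothetical struct view of the shared page. */
struct numafault_shm {
	unsigned long cmd;  /* word 0: the parent stores CMD_STOP here */
	unsigned long cnt;  /* word 1: each child adds its access count */
	/* word 2 onward: the children append their text reports */
	char message[PAGE_SIZE - 2 * sizeof(unsigned long)];
};
```

With this view, pmap + SHM_MESSAGE_OFF corresponds to the start of the message field, and the whole struct covers exactly the page mapped by the parent.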
On Mon, Jan 20, 2014 at 10:48 PM, Alexander Graf <agraf at suse.de> wrote:
>
> On 15.01.2014, at 07:36, Liu ping fan <kernelfans at gmail.com> wrote:
>
>> On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf <agraf at suse.de> wrote:
>>>
>>> On 11.12.2013, at 09:47, Liu Ping Fan <kernelfans at gmail.com> wrote:
>>>
>>>> This series is based on Aneesh's series "[PATCH -V2 0/5] powerpc: mm: Numa faults support for ppc64"
>>>>
>>>> For this series, I apply the same idea as in the previous thread "[PATCH 0/3] optimize for powerpc _PAGE_NUMA"
>>>> (for which I am still trying to get a machine to show numbers)
>>>>
>>>> But for this series, I think that I have a good justification: the well-known heavy cost of switching context between guest and host.
>>>
>>> This cover letter isn't really telling me anything. Please put a proper description of what you're trying to achieve, why you're trying to achieve what you're trying and convince your readers that it's a good idea to do it the way you do it.
>>>
>> Sorry for the unclear message. After the introduction of _PAGE_NUMA,
>> kvmppc_do_h_enter() cannot fill in the hpte for the guest. Instead, it
>> has to rely on the host's kvmppc_book3s_hv_page_fault() to call
>> do_numa_page() to do the numa fault check. This incurs the overhead
>> of exiting from rmode to vmode. My idea is that kvmppc_do_h_enter()
>> does a quick check: if the page is correctly placed,
>> there is no need to exit to vmode (i.e. saving htab, slab switching).
>>
>>>> If my assumption is correct, I will CC kvm at vger.kernel.org from the next version.
>>>
>>> This translates to me as "This is an RFC"?
>>>
>> Yes, I am not quite sure about it. I have no bare-metal machine to verify it.
>> So I hope that, at least in theory, it is correct.
>
> Paul, could you please give this some thought and maybe benchmark it?
>
>
> Alex
>