Age    Commit message    Author
2015-04-22Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc  Linus Torvalds
Pull ARM SoC fixes from Olof Johansson: "Here's the usual "low-priority fixes that didn't make it into the last few -rcs, with a twist: We had a fixes pull request that I didn't send in time to get into 4.0, so we'll send some of them to Greg for -stable as well. Contents here is as usual not all that controversial: - a handful of randconfig fixes from Arnd, in particular for older Samsung platforms - Exynos fixes, !SMP building, DTS updates for MMC and lid switch - Kbuild fix to create output subdirectory for DTB files - misc minor fixes for OMAP" * tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (23 commits) ARM: at91/dt: sama5d3 xplained: add phy address for macb1 kbuild: Create directory for target DTB ARM: mvebu: Disable CPU Idle on Armada 38x ARM: DRA7: Enable Cortex A15 errata 798181 ARM: dts: am57xx-beagle-x15: Add thermal map to include fan and tmp102 ARM: dts: DRA7: Add bandgap and related thermal nodes bus: ocp2scp: SYNC2 value should be changed to 0x6 ARM: dts: am4372: Add "ti,am437x-ocp2scp" as compatible string for OCP2SCP ARM: OMAP2+: remove superfluous NULL pointer check ARM: EXYNOS: Fix build breakage cpuidle on !SMP ARM: dts: fix lid and power pin-functions for exynos5250-spring ARM: dts: fix mmc node updates for exynos5250-spring ARM: OMAP4: remove dead kconfig option OMAP4_ERRATA_I688 MAINTAINERS: add OMAP defconfigs under OMAP SUPPORT ARM: OMAP1: PM: fix some build warnings on 1510-only Kconfigs ARM: cns3xxx: don't export static symbol ARM: S3C24XX: avoid a Kconfig warning ARM: S3C24XX: fix header file inclusions ARM: S3C24XX: fix building without PM_SLEEP ARM: S3C24XX: use SAMSUNG_WAKEMASK for s3c2416 ...
2015-04-22rbd: rbd_wq comment is obsoleteIlya Dryomov
After the switch to blk-mq rbd_wq processes requests, not devices. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-04-22libceph: announce support for straw2 bucketsIlya Dryomov
Sync up feature bits and enable CEPH_FEATURE_CRUSH_V4. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-04-22crush: straw2 bucket type with an efficient 64-bit crush_ln()Ilya Dryomov
This is an improved straw bucket that correctly avoids any data movement between items A and B when neither A nor B's weights are changed. Said differently, if we adjust the weight of item C (including adding it anew or removing it completely), we will only see inputs move to or from C, never between other items in the bucket. Notably, there is no intermediate scaling factor that needs to be calculated. The mapping function is a simple function of the item weights. The below commits were squashed together into this one (mostly to avoid adding and then yanking ~6000 lines' worth of crush_ln_table):
- crush: add a straw2 bucket type
- crush: add crush_ln to calculate nature log efficently
- crush: improve straw2 adjustment slightly
- crush: change crush_ln to provide 32 more digits
- crush: fix crush_get_bucket_item_weight and bucket destroy for straw2
- crush/mapper: fix divide-by-0 in straw2 (with div64_s64() for draw = ln / w and INT64_MIN -> S64_MIN - need to create a proper compat.h in ceph.git)
Reflects ceph.git commits 242293c908e923d474910f2b8203fa3b41eb5a53, 32a1ead92efcd351822d22a5fc37d159c65c1338, 6289912418c4a3597a11778bcf29ed5415117ad9, 35fcb04e2945717cf5cfe150b9fa89cb3d2303a1, 6445d9ee7290938de1e4ee9563912a6ab6d8ee5f, b5921d55d16796e12d66ad2c4add7305f9ce2353. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
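For readers unfamiliar with straw2, a rough C sketch of the selection loop follows (not the actual kernel code; crush_hash32_3(), crush_ln() and the fixed-point offset follow the shape described above, and the struct fields are assumptions):

    /* Sketch: pick the item with the largest weight-scaled log-draw.
     * Because each item's draw depends only on its own weight, changing
     * item C's weight can never reorder A relative to B. */
    static int straw2_choose_sketch(const struct crush_bucket_straw2 *b, int x, int r)
    {
            __s64 draw, high_draw = S64_MIN;
            int i, high = 0;

            for (i = 0; i < b->h.size; i++) {
                    __u64 u;

                    if (!b->item_weights[i])
                            continue;
                    u = crush_hash32_3(b->h.hash, x, b->h.items[i], r) & 0xffff;
                    /* crush_ln() gives a fixed-point log; subtracting the constant
                     * turns it into the log of a (0,1] uniform draw (assumed scaling) */
                    draw = div64_s64(crush_ln(u) - 0x1000000000000LL,
                                     b->item_weights[i]);
                    if (i == 0 || draw > high_draw) {
                            high = b->h.items[i];
                            high_draw = draw;
                    }
            }
            return high;
    }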
2015-04-22crush: ensuring at most num-rep osds are selectedIlya Dryomov
Crush temporary buffers are allocated as per the replica size configured by the user. When there are more final osds (to be selected as per the rule) than replicas, the buffer overflows and causes a crash. Now we ensure that at most num-rep osds are selected, even if the rule would allow more. Reflects ceph.git commits 6b4d1aa99718e3b367496326c1e64551330fabc0, 234b066ba04976783d15ff2abc3e81b6cc06fb10. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
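The essence of the fix, as a hedged sketch (variable names are illustrative, not the exact crush_do_rule() code): clamp the number of copied results to the replica count the caller allocated buffers for:

    /* never emit more osds than the caller's num-rep sized buffers can hold */
    if (osize > numrep)
            osize = numrep;
    memcpy(result, out, osize * sizeof(__s32));
    result_len = osize;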
2015-04-22crush: drop unnecessary include from mapper.cIlya Dryomov
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-04-22ceph: fix uninline data functionYan, Zheng
For CEPH_OSD_CMPXATTR_MODE_U64, OSD expects the u64 to be encoded as string in object's xattr. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-04-22ceph: rename snapshot supportYan, Zheng
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-04-22ceph: fix null pointer dereference in send_mds_reconnect()Yan, Zheng
sb->s_root can be null when umounting Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-04-22Merge tag 'kvm-arm-for-4.1-take2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master  Paolo Bonzini
KVM/ARM changes for v4.1, take #2: Rather small this time: - a fix for a nasty bug with virtual IRQ injection - a fix for irqfd
2015-04-22KVM: arm/arm64: check IRQ number on userland injectionAndre Przywara
When userland injects a SPI via the KVM_IRQ_LINE ioctl we currently only check it against a fixed limit, which historically is set to 127. With the new dynamic IRQ allocation the effective limit may actually be smaller (64). So when now a malicious or buggy userland injects a SPI in that range, we spill over on our VGIC bitmaps and bytemaps memory. I could trigger a host kernel NULL pointer dereference with current mainline by injecting some bogus IRQ number from a hacked kvmtool: ----------------- .... DEBUG: kvm_vgic_inject_irq(kvm, cpu=0, irq=114, level=1) DEBUG: vgic_update_irq_pending(kvm, cpu=0, irq=114, level=1) DEBUG: IRQ #114 still in the game, writing to bytemap now... Unable to handle kernel NULL pointer dereference at virtual address 00000000 pgd = ffffffc07652e000 [00000000] *pgd=00000000f658b003, *pud=00000000f658b003, *pmd=0000000000000000 Internal error: Oops: 96000006 [#1] PREEMPT SMP Modules linked in: CPU: 1 PID: 1053 Comm: lkvm-msi-irqinj Not tainted 4.0.0-rc7+ #3027 Hardware name: FVP Base (DT) task: ffffffc0774e9680 ti: ffffffc0765a8000 task.ti: ffffffc0765a8000 PC is at kvm_vgic_inject_irq+0x234/0x310 LR is at kvm_vgic_inject_irq+0x30c/0x310 pc : [<ffffffc0000ae0a8>] lr : [<ffffffc0000ae180>] pstate: 80000145 ..... So this patch fixes this by checking the SPI number against the actual limit. Also we remove the former legacy hard limit of 127 in the ioctl code. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> CC: <stable@vger.kernel.org> # 4.0, 3.19, 3.18 [maz: wrap KVM_ARM_IRQ_GIC_MAX with #ifndef __KERNEL__, as suggested by Christopher Covington] Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
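A hedged sketch of the added range check, assuming (as the commit text implies) that the dynamically allocated SPI count is available as kvm->arch.vgic.nr_irqs:

    /* reject SPIs the VGIC never allocated state for */
    if (irq_num < VGIC_NR_PRIVATE_IRQS || irq_num >= kvm->arch.vgic.nr_irqs)
            return -EINVAL;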
2015-04-22KVM: arm: irqfd: fix value returned by kvm_irq_map_gsiEric Auger
irqfd/arm currently does not support routing. kvm_irq_map_gsi is supposed to return all the routing entries associated with the provided gsi and return the number of those entries. We should return 0 at this point. Signed-off-by: Eric Auger <eric.auger@linaro.org> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
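A minimal sketch of the corrected stub (the signature follows the generic kvm_irq_map_gsi() hook; without routing support there are simply no entries to report):

    int kvm_irq_map_gsi(struct kvm *kvm,
                        struct kvm_kernel_irq_routing_entry *entries, int gsi)
    {
            return 0;       /* no routing entries exist for any gsi on arm irqfd yet */
    }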
2015-04-22watchdog: stmp3xxx_rtc_wdt: fix broken email addressWolfram Sang
My Pengutronix address is not valid anymore, redirect people to the Pengutronix kernel team. Reported-by: Harald Geyer <harald@ccbib.org> Signed-off-by: Wolfram Sang <wsa@the-dreams.de> Acked-by: Robert Schwebel <r.schwebel@pengutronix.de> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
2015-04-22watchdog: pnx4008_wdt: fix broken email addressWolfram Sang
My Pengutronix address is not valid anymore, redirect people to the Pengutronix kernel team. Reported-by: Harald Geyer <harald@ccbib.org> Signed-off-by: Wolfram Sang <wsa@the-dreams.de> Acked-by: Robert Schwebel <r.schwebel@pengutronix.de> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
2015-04-22watchdog: octeon: use fixed length string for register namesAaro Koskinen
Use fixed length string for register names. This saves 416 bytes in text size. Signed-off-by: Aaro Koskinen <aaro.koskinen@iki.fi> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
2015-04-22watchdog: octeon: fix some trivial coding style issuesAaro Koskinen
Fix some trivial coding style issues to reduce noise from static analyzers. Signed-off-by: Aaro Koskinen <aaro.koskinen@iki.fi> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
2015-04-22watchdog: octeon: convert to WATCHDOG_CORE APIAaro Koskinen
Convert OCTEON watchdog to WATCHDOG_CORE API. This enables support for multiple watchdogs on OCTEON boards. Signed-off-by: Aaro Koskinen <aaro.koskinen@iki.fi> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
2015-04-22watchdog: cadence: Remove Kconfig dependency on ARCHMichal Simek
Remove Kconfig dependency and enable driver for all ARCHs. Signed-off-by: Michal Simek <michal.simek@xilinx.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
2015-04-22ARM: msm: add watchdog entries to DT timer binding docMathieu Olivari
The watchdog has been reworked to use the same DT node as the timer. This change is updating the device tree doc accordingly. Signed-off-by: Mathieu Olivari <mathieu@codeaurora.org> Acked-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
2015-04-22ARM: qcom: add description of KPSS WDT for IPQ8064Mathieu Olivari
Add the watchdog related entries to the Krait Processor Sub-system (KPSS) timer IPQ8064 devicetree section. Also, add a fixed-clock description of SLEEP_CLK, which will do for now. Signed-off-by: Josh Cartwright <joshc@codeaurora.org> Signed-off-by: Mathieu Olivari <mathieu@codeaurora.org> Reviewed-by: Stephen Boyd <sboyd@codeaurora.org> Acked-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
2015-04-22watchdog: qcom: use timer devicetree bindingMathieu Olivari
MSM watchdog configuration happens in the same register block as the timer, so we'll use the same binding as the existing timer. The qcom-wdt will now be probed when devicetree has an entry compatible with "qcom,kpss-timer" or "qcom,scss-timer". Signed-off-by: Mathieu Olivari <mathieu@codeaurora.org> Reviewed-by: Stephen Boyd <sboyd@codeaurora.org> Acked-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
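A sketch of what the driver's OF match table might look like with the shared timer compatibles (strings taken from the commit text; the table name is illustrative):

    static const struct of_device_id qcom_wdt_of_table[] = {
            { .compatible = "qcom,kpss-timer" },
            { .compatible = "qcom,scss-timer" },
            { },
    };
    MODULE_DEVICE_TABLE(of, qcom_wdt_of_table);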
2015-04-22watchdog: bcm281xx: Remove use of seq_printf return valueJoe Perches
The seq_printf return value, because it's frequently misused, will eventually be converted to void. See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to seq_has_overflowed() and make public") Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
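The kind of rewrite this implies in a seq_file show() callback, sketched (variable names illustrative): call seq_printf() for its side effect instead of propagating its return value:

    /* before: leans on the soon-to-be-void return value */
    return seq_printf(s, "%d\n", status);

    /* after: print, then report success explicitly */
    seq_printf(s, "%d\n", status);
    return 0;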
2015-04-22modpost: don't emit section mismatch warnings for compiler optimizationsPaul Gortmaker
Currently an allyesconfig build [gcc-4.9.1] can generate the following: WARNING: vmlinux.o(.text.unlikely+0x3864): Section mismatch in reference from the function cpumask_empty.constprop.3() to the variable .init.data:nmi_ipi_mask which comes from the cpumask_empty usage in arch/x86/kernel/nmi_selftest.c. Normally we would not see a symbol entry for cpumask_empty since it is: static inline bool cpumask_empty(const struct cpumask *srcp) however in this case, the variant of the symbol gets emitted when GCC does constant propagation optimization. Fix things up so that any locally optimized constprop variants don't warn when accessing variables that live in the __init sections. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2015-04-22modpost: expand pattern matching to support substring matchesPaul Gortmaker
Currently the match() function supports a leading * to match any prefix and a trailing * to match any suffix. However, there is currently no combination of both that can be used to target matches of whole families of functions that share a common substring. Here we expand the *foo and foo* matches to also support *foo*, with the goal of targeting compiler-generated symbol names that contain strings like ".constprop." and ".isra." Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
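A hedged sketch of the new case in match() (the buffer size and surrounding structure are assumptions, not the actual modpost code):

    /* "*foo*": the symbol matches if it contains "foo" anywhere */
    size_t l = strlen(pat);

    if (l > 2 && pat[0] == '*' && pat[l - 1] == '*') {
            char needle[256];

            if (l - 2 >= sizeof(needle))
                    return 0;
            memcpy(needle, pat + 1, l - 2);
            needle[l - 2] = '\0';
            return strstr(sym, needle) != NULL;
    }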
2015-04-22modpost: do not try to match the SHT_NUL section.Quentin Casasnovas
Trying to match the SHT_NUL section isn't useful and causes build failures on parisc and mn10300 since the addition of section strict white-listing and __ex_table sanitizing. Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com> Reported-by: Guenter Roeck <linux@roeck-us.net> Fixes: 050e57fd5936 ("modpost: add strict white-listing when referencing....") Fixes: 52dc0595d540 ("modpost: handle relocations mismatch in __ex_table.") Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2015-04-22modpost: fix extable entry size calculation.Quentin Casasnovas
As Guenter pointed out, we were never really calculating the extable entry size because the pointer arithmetic was simply wrong. We want to check we're handling the second relocation in __ex_table to infer an entry size, but we were using (void*) pointers instead of Elf_Rel[a]* ones. This fixes the problem by moving that check in the caller (since we can deal with different types of relocations) and add is_second_extable_reloc() to make the whole thing more readable. Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com> Reported-by: Guenter Roeck <linux@roeck-us.net> CC: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2015-04-22modpost: fix inverted logic in is_extable_fault_address().Quentin Casasnovas
As Guenter pointed out, we want to assert that extable_entry_size has been discovered and not the other way around. Moreover, this sanity check is only valid when we're not dealing with the first relocation in __ex_table, since we have not discovered the extable entry size at that point. This was leading to a divide-by-zero on some architectures and make the build fail. Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com> Reported-by: Guenter Roeck <linux@roeck-us.net> CC: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2015-04-22modpost: handle -ffunction-sectionsRusty Russell
52dc0595d540 introduced OTHER_TEXT_SECTIONS for identifying what sections could validly have __ex_table entries. Unfortunately, it wasn't tested with -ffunction-sections, which some architectures use. Reported-by: kbuild test robot <fengguang.wu@intel.com> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2015-04-22modpost: Whitelist .text.fixup and .exception.textThierry Reding
32-bit and 64-bit ARM use these sections to store executable code, so they must be whitelisted in modpost's table of valid text sections. Signed-off-by: Thierry Reding <treding@nvidia.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2015-04-22dmaengine: dw: don't prompt for DW_DMAC_COREVinod Koul
DW_DMAC_CORE is selected by the PCI or platform driver, so this symbol shouldn't be user selectable; remove the prompt. Signed-off-by: Vinod Koul <vinod.koul@intel.com>
2015-04-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparcLinus Torvalds
Pull sparc fixes from David Miller: 1) ldc_alloc_exp_dring() can be called from softints, so use GFP_ATOMIC. From Sowmini Varadhan. 2) Some minor warning/build fixups for the new iommu-common code on certain archs and with certain debug options enabled. Also from Sowmini Varadhan. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc: sparc: Use GFP_ATOMIC in ldc_alloc_exp_dring() as it can be called in softirq context sparc64: Use M7 PMC write on all chips T4 and onward. iommu-common: rename iommu_pool_hash to iommu_hash_common iommu-common: fix x86_64 compiler warnings
2015-04-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: "Just a few fixes trickling in at this point. 1) If we see an attached socket on an skb in the ipv4 forwarding path, bail. This can happen due to races with FIB rule addition and deletion, and we should just drop such frames. From Sebastian Pöhn. 2) pppoe receive should only accept packets destined for this host's MAC address. From Joakim Tjernlund. 3) Handle checksum unwrapping properly in ppp receive when it's encapsulated in UDP in some way, fix from Tom Herbert. 4) Fix some bugs in the mv88e6xxx DSA driver resulting from the conversion from register offset constants to mnemonic macros. From Vivien Didelot. 5) Fix handling of HCA max message size in mlx4 adapters, from Eran Ben ELisha" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: net/mlx4_core: Fix reading HCA max message size in mlx4_QUERY_DEV_CAP tcp: add memory barriers to write space paths altera tse: Error-Bit on tx-avalon-stream always set. net: dsa: mv88e6xxx: use PORT_DEFAULT_VLAN net: dsa: mv88e6xxx: fix setup of port control 1 ppp: call skb_checksum_complete_unset in ppp_receive_frame net: add skb_checksum_complete_unset pppoe: Lacks DST MAC address check ip_forward: Drop frames with attached skb->sk
2015-04-21Merge tag 'omap-for-v4.1/prcm-dts-mfd-syscon-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into next/late  Olof Johansson
Merge "urgent omap boot fix for v4.1 if MFD_SYSCON is not set" from Tony Lindgren: Urgent pull request for v4.1 to fix booting for custom kernel .config files that do not have MFD_SYSCON set. Omaps now have a dependency on MFD_SYSCON for the system control module generic register area and some clocks with the changes done in the omap-for-v4.1/prcm-dts branch. This can be pulled on top of omap-for-v4.1/prcm-dts, or into fixes for v4.1. We already do have a slight MFD_SYSCON dependency for REGULATOR_PBIAS for dual voltage MMC cards on the first MMC bus for many devices, so from that point of view this can also be merged separately from omap-for-v4.1/prcm-dts. * tag 'omap-for-v4.1/prcm-dts-mfd-syscon-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap: ARM: OMAP2+: Fix booting with configs that don't have MFD_SYSCON Signed-off-by: Olof Johansson <olof@lixom.net>
2015-04-22ACPI / EC: fix NULL pointer dereference in acpi_ec_remove_query_handler()Chris Bainbridge
Use list_for_each_entry_safe for iterating because handler may be freed in the loop. BUG: unable to handle kernel NULL pointer dereference at 000000000000002c IP: [<ffffffff814d69c8>] acpi_ec_put_query_handler+0x7/0x1a Call Trace: acpi_ec_remove_query_handler+0x87/0x97 acpi_smbus_hc_remove+0x2a/0x44 [sbshc] acpi_device_remove+0x7b/0x9a __device_release_driver+0x7e/0x110 driver_detach+0xb0/0xc0 bus_remove_driver+0x54/0xe0 driver_unregister+0x2b/0x60 acpi_bus_unregister_driver+0x10/0x12 acpi_smb_hc_driver_exit+0x10/0x12 [sbshc] SyS_delete_module+0x1b8/0x210 system_call_fastpath+0x12/0x6a Signed-off-by: Chris Bainbridge <chris.bainbridge@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
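The fix pattern, sketched (list and field names are assumptions based on the trace above): iterate with the _safe variant so that dropping the last reference, which may free the handler, cannot poison the loop cursor:

    struct acpi_ec_query_handler *handler, *tmp;

    mutex_lock(&ec->mutex);
    list_for_each_entry_safe(handler, tmp, &ec->list, node) {
            if (query_bit == handler->query_bit) {
                    list_del(&handler->node);
                    acpi_ec_put_query_handler(handler);     /* may kfree(handler) */
            }
    }
    mutex_unlock(&ec->mutex);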
2015-04-21Merge branch 'parisc-4.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux  Linus Torvalds
Pull parisc fixes from Helge Deller: "The patch by Guenter Roeck fixes the build on parisc which got broken because of commit f24ffde43237 ("parisc: expose number of page table levels on Kconfig level") and the patch from Matthew Wilcox converts our code to use the generic scatterlist.h header file" * 'parisc-4.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux: parisc: Replace PT_NLEVELS with CONFIG_PGTABLE_LEVELS parisc: Eliminate sg_virt_addr() and private scatterlist.h
2015-04-22md/raid5: don't do chunk aligned read on degraded array.Eric Mei
When the array is degraded, a read whose data lands on a failed drive results in reading the rest of the data in that stripe, so a single sequential read ends up reading the same data twice. This patch avoids chunk-aligned reads for a degraded array. The downside is that it involves the stripe cache, which means associated CPU overhead and an extra memory copy.
Test results: the following tests were done on an enterprise storage node with Seagate 6T SAS drives and a Xeon E5-2648L CPU (10 cores, 1.9 GHz), 10 disks in an MD RAID6 8+2, chunk size 128 KiB. FIO was used with direct-io, various bs sizes and enough queue depth, testing sequential and 100% random reads against 3 array configs: 1) optimal, as baseline; 2) degraded; 3) degraded with this patch. Kernel version is 4.0-rc3. Each individual test was only run once, so there may be some variation, but we just focus on the big trend.
Sequential Read:
bs=(KiB)  optimal(MiB/s)  degraded(MiB/s)  degraded-with-patch(MiB/s)
1024      1608            656              995
512       1624            710              956
256       1635            728              980
128       1636            771              983
64        1612            1119             1000
32        1580            1420             1004
16        1368            688              986
8         768             647              953
4         411             413              850
Random Read:
bs=(KiB)  optimal(IOPS)  degraded(IOPS)  degraded-with-patch(IOPS)
1024      163            160             156
512       274            273             272
256       426            428             424
128       576            592             591
64        726            724             726
32        849            848             837
16        900            970             971
8         927            940             929
4         948            940             955
Some notes: * In sequential + optimal, as the bs gets smaller, the FIO thread becomes CPU bound. * In sequential + degraded, there is a big increase when bs is 64K and 32K; I don't have an explanation. * In sequential + degraded-with-patch, the MD thread mostly becomes CPU bound. Specific data points can be discussed if needed, but in general it seems that with this patch we have more predictable and in most cases significantly better sequential read performance when the array is degraded, and almost no noticeable impact on random reads. Performance is a complicated thing; the patch works well for this particular configuration but may not be universal. For example, testing on an all-SSD array may give very different results. But in most cases IO bandwidth is a scarcer resource than CPU. Signed-off-by: Eric Mei <eric.mei@seagate.com> Signed-off-by: NeilBrown <neilb@suse.de>
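The gate itself is small; a hedged sketch of the condition in raid5 make_request() (the exact surrounding code is assumed, not quoted from the patch):

    /* only take the fast chunk-aligned read path on a healthy array */
    if (rw == READ && mddev->degraded == 0 &&
        mddev->reshape_position == MaxSector &&
        chunk_aligned_read(mddev, bi))
            return;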
2015-04-22md/raid5: allow the stripe_cache to grow and shrink.NeilBrown
The default setting of 256 stripe_heads is probably much too small for many configurations. So it is best to make it auto-configure. Shrinking the cache under memory pressure is easy. The only interesting part here is that we put a fairly high cost ('seeks') on shrinking the cache, as the cost is greater than just having to read more data: it reduces parallelism. Growing the cache on demand needs to be done carefully. If we allow fast growth, that can upset the memory balance, as lots of dirty memory can quickly turn into lots of memory queued in the stripe_cache. It is important for the raid5 block device to appear congested to allow write-throttling to work. So we only add stripes slowly. We set a flag when an allocation fails because all stripes are in use, allocate at a convenient time when that flag is set, and don't allow it to be set again until at least one stripe_head has been released for re-use. This means that a spurt of requests will only cause one stripe_head to be allocated, but a steady stream of requests will slowly increase the cache size - until memory pressure puts it back again. It could take hours to reach a steady state. The value written to, and displayed in, stripe_cache_size is used as a minimum. The cache can grow above this and shrink back down to it. The actual size is not directly visible, though it can be deduced to some extent by watching stripe_cache_active. Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22md/raid5: change ->inactive_blocked to a bit-flag.NeilBrown
This allows us to easily add more (atomic) flags. Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripe  NeilBrown
Rather than adjusting max_nr_stripes whenever {grow,drop}_one_stripe() succeeds, do it inside the functions. Also choose the correct hash to handle next inside the functions. This removes duplication and will help with future new uses of {grow,drop}_one_stripe. This also fixes a minor bug where the "md/raid:%md: allocate XXkB" message always said "0kB". Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22md/raid5: pass gfp_t arg to grow_one_stripe()NeilBrown
This is needed for future improvement to stripe cache management. Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22md/raid5: introduce configuration option rmw_levelMarkus Stockhausen
Depending on the available coding we allow optimized rmw logic for write operations. To support easier testing this patch allows manual control of the rmw/rcw decision through the interface /sys/block/mdX/md/rmw_level. The configuration can handle three levels of control.
rmw_level=0: Disable rmw for all RAID types. Hardware assisted P/Q calculation has no implementation path yet to factor in/out chunks of a syndrome. Enforcing this level can be beneficial for slow CPUs with hardware syndrome support and fast SSDs.
rmw_level=1: Estimate rmw IOs and rcw IOs. Execute rmw only if we will save IOs. This equals the "old" unpatched behaviour and will be the default.
rmw_level=2: Execute rmw even if the calculated IOs for rmw and rcw are equal. We might have higher CPU consumption because of calculating the parity twice, but it can be beneficial otherwise, e.g. RAID4 with a fast dedicated parity disk/SSD.
The option is implemented just to be forward-looking and will ONLY work with this patch! Signed-off-by: Markus Stockhausen <stockhausen@collogia.de> Signed-off-by: NeilBrown <neilb@suse.de>
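A hedged sketch of how the knob could feed the rmw/rcw choice (constant names and the rmw/rcw IO counts are illustrative, not quoted from the patch):

    /* rmw_level: 0 = never rmw, 1 = rmw only when it saves IOs, 2 = rmw on ties too */
    if (conf->rmw_level == PARITY_DISABLE_RMW)
            prefer_rmw = false;
    else if (rmw_ios < rcw_ios)
            prefer_rmw = true;
    else if (rmw_ios == rcw_ios && conf->rmw_level == PARITY_PREFER_RMW)
            prefer_rmw = true;
    else
            prefer_rmw = false;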
2015-04-22md/raid5: activate raid6 rmw featureMarkus Stockhausen
Glue it all together. The raid6 rmw path should work the same as the already existing raid5 logic. So emulate the prexor handling/flags and split functions as needed.
1) Enable xor_syndrome() in the async layer.
2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome at the start of a rmw run as we did it before for the single parity.
3) Take care of the rmw run in ops_run_reconstruct6(). Again process only the changed pages to get the syndrome back into sync.
4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw run. The lower layers will calculate start & end pages from that and call xor_syndrome() correspondingly.
5) Adapt the several places where we ignored Q handling up to now.
Performance numbers for a single E5630 system with a mix of 10 7200k desktop/server disks. 300 seconds random write with 8 threads onto a 3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4):
bsize   rmw_level=1   rmw_level=0   rmw_level=1   rmw_level=0
        skip_copy=1   skip_copy=1   skip_copy=0   skip_copy=0
4K      115 KB/s      141 KB/s      165 KB/s      140 KB/s
8K      225 KB/s      275 KB/s      324 KB/s      274 KB/s
16K     434 KB/s      536 KB/s      640 KB/s      534 KB/s
32K     751 KB/s      1,051 KB/s    1,234 KB/s    1,045 KB/s
64K     1,339 KB/s    1,958 KB/s    2,282 KB/s    1,962 KB/s
128K    2,673 KB/s    3,862 KB/s    4,113 KB/s    3,898 KB/s
256K    7,685 KB/s    7,539 KB/s    7,557 KB/s    7,638 KB/s
512K    19,556 KB/s   19,558 KB/s   19,652 KB/s   19,688 KB/s
Signed-off-by: Markus Stockhausen <stockhausen@collogia.de> Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22md/raid6 algorithms: xor_syndrome() for SSE2Markus Stockhausen
The second and (last) optimized XOR syndrome calculation. This version supports right and left side optimization. All CPUs with architecture older than Haswell will benefit from it. It should be noted that SSE2 movntdq kills performance for memory areas that are read and written simultaneously in chunks smaller than cache line size. So use movdqa instead for P/Q writes in sse21 and sse22 XOR functions. Signed-off-by: Markus Stockhausen <stockhausen@collogia.de> Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22md/raid6 algorithms: xor_syndrome() for generic intMarkus Stockhausen
Start the algorithms with the very basic one. It is left and right optimized. That means we can avoid all calculations for unneeded pages above the right stop offset. For pages below the left start offset we still need the syndrome multiplication but without reading data pages. Signed-off-by: Markus Stockhausen <stockhausen@collogia.de> Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22md/raid6 algorithms: improve test programMarkus Stockhausen
It is always helpful to have a test tool in place if we implement new data critical algorithms. So add some test routines to the raid6 checker that can prove if the new xor_syndrome() works as expected. Run through all permutations of start/stop pages per algorithm and simulate a xor_syndrome() assisted rmw run. After each rmw check if the recovery algorithm still confirms that the stripe is fine. Signed-off-by: Markus Stockhausen <stockhausen@collogia.de> Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22md/raid6 algorithms: delta syndrome functionsMarkus Stockhausen
v3: s-o-b comment, explanation of performance and decision for the start/stop implementation. Implementing rmw functionality for RAID6 requires optimized syndrome calculation. Up to now we can only generate a complete syndrome. The target P/Q pages are always overwritten. With this patch we provide a framework for in-place P/Q modification. In the first place simply fill those functions with NULL values. xor_syndrome() has two additional parameters: start & stop. These indicate the first and last page that are changing during a rmw run. That makes it possible to avoid several unnecessary loops and speed up calculation. The caller needs to implement the following logic to make the functions work.
1) xor_syndrome(disks, start, stop, ...): "Remove" all data of source blocks inside P/Q between (and including) start and end.
2) Modify any block with start <= block <= stop.
3) xor_syndrome(disks, start, stop, ...): "Reinsert" all data of source blocks into P/Q between (and including) start and end.
Pages between start and stop that won't be changed should be filled with a pointer to the kernel zero page. The reasons for not taking NULL pages are: 1) Algorithms cross the whole source data line by line. Thus avoid additional branches. 2) Having a NULL page avoids calculating the XOR P parity but still needs calculation steps for the Q parity. Depending on the algorithm unrolling that might be only a difference of 2 instructions per loop. The benchmark numbers of the gen_syndrome() functions are displayed in the kernel log. Do the same for the xor_syndrome() functions. This will help to analyze performance problems and give a rough estimate how well the algorithm works. The choice of the fastest algorithm will still depend on the gen_syndrome() performance. With the start/stop page implementation the speed can vary a lot in real life. E.g. a change of page 0 & page 15 on a stripe will be harder to compute than the case where page 0 & page 1 are XOR candidates. To not be too enthusiastic about the expected speeds, we run a worst-case test that simulates a change on the upper half of the stripe. So we do: 1) calculation of P/Q for the upper pages 2) continuation of Q for the lower (empty) pages Signed-off-by: Markus Stockhausen <stockhausen@collogia.de> Signed-off-by: NeilBrown <neilb@suse.de>
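The three-step calling sequence from the text, written out as a sketch (raid6_call.xor_syndrome is the hook this series adds; the page-pointer setup is assumed):

    /* 1) back the current contents of pages [start, stop] out of P/Q */
    raid6_call.xor_syndrome(disks, start, stop, PAGE_SIZE, ptrs);

    /* 2) caller rewrites the data blocks in [start, stop]; slots that
     *    do not change point at the kernel zero page (ZERO_PAGE(0)) */

    /* 3) fold the new contents of [start, stop] back into P/Q */
    raid6_call.xor_syndrome(disks, start, stop, PAGE_SIZE, ptrs);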
2015-04-22raid5: handle expansion/resync case with stripe batchingshli@kernel.org
expansion/resync can grab a stripe while the stripe is in the batch list. Since all stripes in a batch list must be in the same state, we can't allow some of them to run into expansion/resync. So we delay expansion/resync for stripes in a batch list. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22raid5: handle io error of batch listshli@kernel.org
If an IO error happens in any stripe of a batch list, the batch list is split and normal processing then runs for the stripes in the list. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22RAID5: batch adjacent full stripe writeshli@kernel.org
The stripe cache is 4k in size, so even adjacent full stripe writes are handled in 4k units. Ideally we should use a bigger size for adjacent full stripe writes. A bigger stripe cache size means fewer stripes running in the state machine, which reduces cpu overhead, and a bigger size also means bigger IOs dispatched to the underlying disks. With this patch, we automatically batch adjacent full stripe writes together. Such stripes are added to the batch list. Only the first stripe of the list is put on handle_list and so runs handle_stripe(). Some steps of handle_stripe() are extended to cover all stripes of the list, including ops_run_io, ops_run_biodrain and so on. With this patch, we have fewer stripes running in handle_stripe() and we send the IO of the whole stripe list together to increase IO size. Stripes added to a batch list have some limitations: a batch list can only include full stripe writes and can't cross a chunk boundary, to make sure the stripes have the same parity disks. Stripes in a batch list must be in the same state (no written, toread and so on). If a stripe is in a batch list, all new reads/writes to add_stripe_bio are blocked due to overlap conflict until the batch list is handled. These limitations make sure stripes in a batch list stay in exactly the same state over their life cycle. I did a test running 160k randwrite in a RAID5 array with 32k chunk size and 6 PCIe SSDs. This patch improves performance by around 30% and the IO size to the underlying disks is exactly 32k. I also ran a 4k randwrite test on the same array to make sure the performance isn't changed by the patch. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22raid5: track overwrite disk countshli@kernel.org
Track overwrite disk count, so we can know if a stripe is a full stripe write. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>