summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2015-02-02udpv6: Add lockless sendmsg() supportVlad Yasevich
This commit adds the same functionaliy to IPv6 that commit 903ab86d195cca295379699299c5fc10beba31c7 Author: Herbert Xu <herbert@gondor.apana.org.au> Date: Tue Mar 1 02:36:48 2011 +0000 udp: Add lockless transmit path added to IPv4. UDP transmit path can now run without a socket lock, thus allowing multiple threads to send to a single socket more efficiently. This is only used when corking/MSG_MORE is not used. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-02ipv6: Introduce udpv6_send_skb()Vlad Yasevich
Now that we can individually construct IPv6 skbs to send, add a udpv6_send_skb() function to populate the udp header and send the skb. This allows udp_v6_push_pending_frames() to re-use this function as well as enables us to add lockless sendmsg() support. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-02ipv6: introduce ipv6_make_skbVlad Yasevich
This commit is very similar to commit 1c32c5ad6fac8cee1a77449f5abf211e911ff830 Author: Herbert Xu <herbert@gondor.apana.org.au> Date: Tue Mar 1 02:36:47 2011 +0000 inet: Add ip_make_skb and ip_finish_skb It adds IPv6 version of the helpers ip6_make_skb and ip6_finish_skb. The job of ip6_make_skb is to collect messages into an ipv6 packet and poplulate ipv6 eader. The job of ip6_finish_skb is to transmit the generated skb. Together they replicated the job of ip6_push_pending_frames() while also provide the capability to be called independently. This will be needed to add lockless UDP sendmsg support. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-02ipv6: Append sending data to arbitrary queueVlad Yasevich
Add the ability to append data to arbitrary queue. This will be needed later to implement lockless UDP sends. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-02ipv6: pull cork initialization into its own function.Vlad Yasevich
Pull IPv6 cork initialization into its own function that can be re-used. IPv6 specific cork data did not have an explicit data structure. This patch creats eone so that just ipv6 cork data can be as arguemts. Also, since IPv6 tries to save the flow label into inet_cork_full tructure, pass the full cork. Adjust ip6_cork_release() to take cork data structures. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-02net: dctcp: loosen requirement to assert ECT(0) during 3WHSFlorian Westphal
One deployment requirement of DCTCP is to be able to run in a DC setting along with TCP traffic. As Glenn Judd's NSDI'15 paper "Attaining the Promise and Avoiding the Pitfalls of TCP in the Datacenter" [1] (tba) explains, one way to solve this on switch side is to split DCTCP and TCP traffic in two queues per switch port based on the DSCP: one queue soley intended for DCTCP traffic and one for non-DCTCP traffic. For the DCTCP queue, there's the marking threshold K as explained in commit e3118e8359bb ("net: tcp: add DCTCP congestion control algorithm") for RED marking ECT(0) packets with CE. For the non-DCTCP queue, there's f.e. a classic tail drop queue. As already explained in e3118e8359bb, running DCTCP at scale when not marking SYN/SYN-ACK packets with ECT(0) has severe consequences as for non-ECT(0) packets, traversing the RED marking DCTCP queue will result in a severe reduction of connection probability. This is due to the DCTCP queue being dominated by ECT(0) traffic and switches handle non-ECT traffic in the RED marking queue after passing K as drops, where K is usually a low watermark in order to leave enough tailroom for bursts. Splitting DCTCP traffic among several queues (ECN and non-ECN queue) is being considered a terrible idea in the network community as it splits single flows across multiple network paths. Therefore, commit e3118e8359bb implements this on Linux as ECT(0) marked traffic, as we argue that marking all packets of a DCTCP flow is the only viable solution and also doesn't speak against the draft. However, recently, a DCTCP implementation for FreeBSD hit also their mainline kernel [2]. In order to let them play well together with Linux' DCTCP, we would need to loosen the requirement that ECT(0) has to be asserted during the 3WHS as not implemented in FreeBSD. This simplifies the ECN test and lets DCTCP work together with FreeBSD. Joint work with Daniel Borkmann. [1] https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/judd [2] https://github.com/freebsd/freebsd/commit/8ad879445281027858a7fa706d13e458095b595f Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Glenn Judd <glenn.judd@morganstanley.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-02net-timestamp: no-payload only sysctlWillem de Bruijn
Tx timestamps are looped onto the error queue on top of an skb. This mechanism leaks packet headers to processes unless the no-payload options SOF_TIMESTAMPING_OPT_TSONLY is set. Add a sysctl that optionally drops looped timestamp with data. This only affects processes without CAP_NET_RAW. The policy is checked when timestamps are generated in the stack. It is possible for timestamps with data to be reported after the sysctl is set, if these were queued internally earlier. No vulnerability is immediately known that exploits knowledge gleaned from packet headers, but it may still be preferable to allow administrators to lock down this path at the cost of possible breakage of legacy applications. Signed-off-by: Willem de Bruijn <willemb@google.com> ---- Changes (v1 -> v2) - test socket CAP_NET_RAW instead of capable(CAP_NET_RAW) (rfc -> v1) - document the sysctl in Documentation/sysctl/net.txt - fix access control race: read .._OPT_TSONLY only once, use same value for permission check and skb generation. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-02net-timestamp: no-payload optionWillem de Bruijn
Add timestamping option SOF_TIMESTAMPING_OPT_TSONLY. For transmit timestamps, this loops timestamps on top of empty packets. Doing so reduces the pressure on SO_RCVBUF. Payload inspection and cmsg reception (aside from timestamps) are no longer possible. This works together with a follow on patch that allows administrators to only allow tx timestamping if it does not loop payload or metadata. Signed-off-by: Willem de Bruijn <willemb@google.com> ---- Changes (rfc -> v1) - add documentation - remove unnecessary skb->len test (thanks to Richard Cochran) Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-02NFC: nci: Change NCI state machine to LISTEN_ACTIVEChristophe Ricard
When receiving an interface activation notification, if the RF interface is NCI_RF_INTERFACE_NFCEE_DIRECT, we need to ignore the following parameters and change the NCI state machine to NCI_LISTEN_ACTIVE. According to the NCI specification, the parameters should be 0 and shall be ignored. Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-02-02NFC: nci: Add RF NFCEE action notification supportChristophe Ricard
The NFCC sends an NCI_OP_RF_NFCEE_ACTION_NTF notification to the host (DH) to let it know that for example an RF transaction with a payment reader is done. For now the notification handler is empty. Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-02-02NFC: Forward NFC_EVT_TRANSACTION to user spaceChristophe Ricard
NFC_EVT_TRANSACTION is sent through netlink in order for a specific application running on a secure element to notify userspace of an event. Typically the secure element application counterpart on the host could interpret that event and act upon it. Forwarded information contains: - SE host generating the event - Application IDentifier doing the operation - Applications parameters Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-02-02NFC: nci: Add HCI over NCI protocol supportChristophe Ricard
According to the NCI specification, one can use HCI over NCI to talk with specific NFCEE. The HCI network is viewed as one logical NFCEE. This is needed to support secure element running HCI only firmwares embedded on an NCI capable chipset, like e.g. the st21nfcb. There is some duplication between this piece of code and the HCI core code, but the latter would need to be abstracted even more to be able to use NCI as a logical transport for HCP packets. Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-02-02NFC: nci: Support logical connections managementChristophe Ricard
In order to communicate with an NFCEE, we need to open a logical connection to it, by sending the NCI_OP_CORE_CONN_CREATE_CMD command to the NFCC. It's left up to the drivers to decide when to close an already opened logical connection. Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-02-02NFC: nci: Add NFCEE enabling and disabling supportChristophe Ricard
NFCEEs can be enabled or disabled by sending the NCI_OP_NFCEE_MODE_SET_CMD command to the NFCC. This patch provides an API for drivers to enable and disable e.g. their NCI discoveredd secure elements. Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-02-02NFC: nci: Add NFCEE discover supportChristophe Ricard
NFCEEs (NFC Execution Environment) have to be explicitly discovered by sending the NCI_OP_NFCEE_DISCOVER_CMD command. The NFCC will respond to this command by telling us how many NFCEEs are connected to it. Then the NFCC sends a notification command for each and every NFCEE connected. Here we implement support for sending NCI_OP_NFCEE_DISCOVER_CMD command, receiving the response and the potential notifications. Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-02-02NFC: nci: Add dynamic logical connections supportChristophe Ricard
The current NCI core only support the RF static connection. For other NFC features such as Secure Element communication, we may need to create logical connections to the NFCEE (Execution Environment. In order to track each logical connection ID dynamically, we add a linked list of connection info pointers to the nci_dev structure. Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-02-02Bluetooth: Remove mgmt_rp_read_local_oob_ext_data structJohan Hedberg
This extended return parameters struct conflicts with the new Read Local OOB Extended Data command definition. To avoid the conflict simply rename the old "extended" version to the normal one and update the code appropriately to take into account the two possible response PDU sizes. Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-02-02Merge branch 'locks-3.20' of git://git.samba.org/jlayton/linux into for-3.20J. Bruce Fields
Christoph's block pnfs patches have some minor dependencies on these lock patches.
2015-02-02Bluetooth: Add restarting to service discoveryJakub Pawlowski
When using LE_SCAN_FILTER_DUP_ENABLE, some controllers would send advertising report from each LE device only once. That means that we don't get any updates on RSSI value, and makes Service Discovery very slow. This patch adds restarting scan when in Service Discovery, and device with filtered uuid is found, but it's not in RSSI range to send event yet. This way if device moves into range, we will quickly get RSSI update. Signed-off-by: Jakub Pawlowski <jpawlowski@google.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-02-02Bluetooth: Add le_scan_restart work for LE scan restartingJakub Pawlowski
Currently there is no way to restart le scan, and it's needed in service scan method. The way it work: it disable, and then enable le scan on controller. During the restart, we must remember when the scan was started, and it's duration, to later re-schedule the le_scan_disable work, that was stopped during the stop scan phase. Signed-off-by: Jakub Pawlowski <jpawlowski@google.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-02-01bridge: offload bridge port attributes to switch asic if feature flag setRoopa Prabhu
This patch adds support to set/del bridge port attributes in hardware from the bridge driver. With this, when the user sends a bridge setlink message with no flags or master flags set, - the bridge driver ndo_bridge_setlink handler sets settings in the kernel - calls the swicthdev api to propagate the attrs to the switchdev hardware You can still use the self flag to go to the switch hw or switch port driver directly. With this, it also makes sure a notification goes out only after the attributes are set both in the kernel and hw. The patch calls switchdev api only if BRIDGE_FLAGS_SELF is not set. This is because the offload cases with BRIDGE_FLAGS_SELF are handled in the caller (in rtnetlink.c). Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-01swdevice: add new apis to set and del bridge port attributesRoopa Prabhu
This patch adds two new api's netdev_switch_port_bridge_setlink and netdev_switch_port_bridge_dellink to offload bridge port attributes to switch port (The names of the apis look odd with 'switch_port_bridge', but am more inclined to change the prefix of the api to something else. Will take any suggestions). The api's look at the NETIF_F_HW_SWITCH_OFFLOAD feature flag to pass bridge port attributes to the port device. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-01bridge: add flags argument to ndo_bridge_setlink and ndo_bridge_dellinkRoopa Prabhu
bridge flags are needed inside ndo_bridge_setlink/dellink handlers to avoid another call to parse IFLA_AF_SPEC inside these handlers This is used later in this series Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-01ipv4: tcp: get rid of ugly unicast_sockEric Dumazet
In commit be9f4a44e7d41 ("ipv4: tcp: remove per net tcp_sock") I tried to address contention on a socket lock, but the solution I chose was horrible : commit 3a7c384ffd57e ("ipv4: tcp: unicast_sock should not land outside of TCP stack") addressed a selinux regression. commit 0980e56e506b ("ipv4: tcp: set unicast_sock uc_ttl to -1") took care of another regression. commit b5ec8eeac46 ("ipv4: fix ip_send_skb()") fixed another regression. commit 811230cd85 ("tcp: ipv4: initialize unicast_sock sk_pacing_rate") was another shot in the dark. Really, just use a proper socket per cpu, and remove the skb_orphan() call, to re-enable flow control. This solves a serious problem with FQ packet scheduler when used in hostile environments, as we do not want to allocate a flow structure for every RST packet sent in response to a spoofed packet. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-01Bluetooth: Fix OOB data present for BR/EDR Secure Connections Only modeMarcel Holtmann
When using Secure Connections Only mode, then only P-256 OOB data is valid and should be provided. In case userspace provides P-192 and P-256 OOB data, then the P-192 values will be set to zero. However the present value of the IO capability exchange still mentioned that both values would be available. Fix this by telling the controller clearly that only the P-256 OOB data is present. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-02-01Bluetooth: Expose remote OOB information as debugfs entryMarcel Holtmann
For debugging purposes it is good to know which OOB data is actually currently loaded for each controller. So expose that list via debugfs. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-02-01Bluetooth: Expose hardware error code as debugfs entryMarcel Holtmann
When the Hardware Error event is send by the controller, the Bluetooth core stores the error code. Expose it via debugfs so it can be retrieved later on. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-02-01Bluetooth: Expose debug keys usage setting via debugfsMarcel Holtmann
To allow easier debugging when debug keys are generated, provide debugfs entry for checking the setting of debug keys usage. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-02-01Bluetooth: Track changes from HCI Write Simple Pairing Debug Mode commandMarcel Holtmann
When the HCI Write Simple Pairing Debug Mode command has been issued, the result needs to be tracked and stored. The hdev->ssp_debug_mode variable is already present, but was never updated when the mode in the controller was actually changed. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-02-01Bluetooth: Expose Secure Simple Pairing debug mode setting in debugfsMarcel Holtmann
The value of the ssp_debug_mode should be accessible via debugfs to be able to determine if a BR/EDR controller generates debugs keys or not. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-01-31ipv4: icmp: use percpu allocationEric Dumazet
Get rid of nr_cpu_ids and use modern percpu allocation. Note that the sockets themselves are not yet allocated using NUMA affinity. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-31tcp: use SACK RTTs for CCKenneth Klette Jonassen
Current behavior only passes RTTs from sequentially acked data to CC. If sender gets a combined ACK for segment 1 and SACK for segment 3, then the computed RTT for CC is the time between sending segment 1 and receiving SACK for segment 3. Pass the minimum computed RTT from any acked data to CC, i.e. time between sending segment 3 and receiving SACK for segment 3. Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-31Bluetooth: Allow remote OOB data to only provide P-192 or P-256 valuesMarcel Holtmann
In case the remote only provided P-192 or P-256 data for OOB pairing, then make sure that the data value pointers are correctly set. That way the core can provide correct information when remote OOB data present information have to be communicated. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-01-31Bluetooth: Fix OOB data present value for SMP pairingMarcel Holtmann
Before setting the OOB data present flag with SMP pairing, check the newly introduced present tracking that actual OOB data values have been provided. The existence of remote OOB data structure does not actually mean that the correct data values are available. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-01-31Bluetooth: Fix OOB data present value for BR/EDR Secure ConnectionsMarcel Holtmann
When BR/EDR Secure Connections has been enabled, the OOB data present value can take 2 additional values. The host has to clearly provide details about if P-192 OOB data, P-256 OOB data or a combination of P-192 and P-256 OOB data is present. In case BR/EDR Secure Connections is not enabled or not supported, then check that P-192 OOB data is actually present and return the correct value based on that. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-01-31Bluetooth: Store OOB data present value for each set of remote OOB dataMarcel Holtmann
Instead of doing complex calculation every time the OOB data is used, just calculate the OOB data present value and store it with the OOB data raw values. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-01-30irda: use msecs_to_jiffies for conversionsNicholas Mc Guire
This is only an API consolidation and should make things more readable it replaces var * HZ / 1000 constructs by msecs_to_jiffies(var). Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-30net: Fix vlan_get_protocol for stacked vlanToshiaki Makita
vlan_get_protocol() could not get network protocol if a skb has a 802.1ad vlan tag or multiple vlans, which caused incorrect checksum calculation in several drivers. Fix vlan_get_protocol() to retrieve network protocol instead of incorrect vlan protocol. As the logic is the same as skb_network_protocol(), create a common helper function __vlan_get_protocol() and call it from existing functions. Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-30net: mark some potential candidates __read_mostlyDaniel Borkmann
They are all either written once or extremly rarely (e.g. from init code), so we can move them to the .data..read_mostly section. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-30net: sctp: fix passing wrong parameter header to param_type2af in ↵Saran Maruti Ramanara
sctp_process_param When making use of RFC5061, section 4.2.4. for setting the primary IP address, we're passing a wrong parameter header to param_type2af(), resulting always in NULL being returned. At this point, param.p points to a sctp_addip_param struct, containing a sctp_paramhdr (type = 0xc004, length = var), and crr_id as a correlation id. Followed by that, as also presented in RFC5061 section 4.2.4., comes the actual sctp_addr_param, which also contains a sctp_paramhdr, but this time with the correct type SCTP_PARAM_IPV{4,6}_ADDRESS that param_type2af() can make use of. Since we already hold a pointer to addr_param from previous line, just reuse it for param_type2af(). Fixes: d6de3097592b ("[SCTP]: Add the handling of "Set Primary IP Address" parameter to INIT") Signed-off-by: Saran Maruti Ramanara <saran.neti@telus.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-30netlink: fix wrong subscription bitmask to group mapping inPablo Neira
The subscription bitmask passed via struct sockaddr_nl is converted to the group number when calling the netlink_bind() and netlink_unbind() callbacks. The conversion is however incorrect since bitmask (1 << 0) needs to be mapped to group number 1. Note that you cannot specify the group number 0 (usually known as _NONE) from setsockopt() using NETLINK_ADD_MEMBERSHIP since this is rejected through -EINVAL. This problem became noticeable since 97840cb ("netfilter: nfnetlink: fix insufficient validation in nfnetlink_bind") when binding to bitmask (1 << 0) in ctnetlink. Reported-by: Andre Tomt <andre@tomt.net> Reported-by: Ivan Delalande <colona@arista.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-30netfilter: nft_lookup: add missing attribute validation for NFTA_LOOKUP_SET_IDPatrick McHardy
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-01-30netfilter: nft_compat: add ebtables supportArturo Borrero
This patch extends nft_compat to support ebtables extensions. ebtables verdict codes are translated to the ones used by the nf_tables engine, so we can properly use ebtables target extensions from nft_compat. This patch extends previous work by Giuseppe Longo <giuseppelng@gmail.com>. Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-01-30netfilter: nf_tables: fix leaks in error path of nf_tables_newchain()Pablo Neira Ayuso
Release statistics and module refcount on memory allocation problems. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-01-30xprtrdma: Update the GFP flags used in xprt_rdma_allocate()Chuck Lever
Reflect the more conservative approach used in the socket transport's version of this transport method. An RPC buffer allocation should avoid forcing not just FS activity, but any I/O. In particular, two recent changes missed updating xprtrdma: - Commit c6c8fe79a83e ("net, sunrpc: suppress allocation warning ...") - Commit a564b8f03986 ("nfs: enable swap on NFS") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30xprtrdma: Clean up after adding regbuf managementChuck Lever
rpcrdma_{de}register_internal() are used only in verbs.c now. MAX_RPCRDMAHDR is no longer used and can be removed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30xprtrdma: Allocate zero pad separately from rpcrdma_bufferChuck Lever
Use the new rpcrdma_alloc_regbuf() API to shrink the amount of contiguous memory needed for a buffer pool by moving the zero pad buffer into a regbuf. This is for consistency with the other uses of internally registered memory. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30xprtrdma: Allocate RPC/RDMA receive buffer separately from struct rpcrdma_repChuck Lever
The rr_base field is currently the buffer where RPC replies land. An RPC/RDMA reply header lands in this buffer. In some cases an RPC reply header also lands in this buffer, just after the RPC/RDMA header. The inline threshold is an agreed-on size limit for RDMA SEND operations that pass from server and client. The sum of the RPC/RDMA reply header size and the RPC reply header size must be less than this threshold. The largest RDMA RECV that the client should have to handle is the size of the inline threshold. The receive buffer should thus be the size of the inline threshold, and not related to RPCRDMA_MAX_SEGS. RPC replies received via RDMA WRITE (long replies) are caught in rq_rcv_buf, which is the second half of the RPC send buffer. Ie, such replies are not involved in any way with rr_base. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30xprtrdma: Allocate RPC/RDMA send buffer separately from struct rpcrdma_reqChuck Lever
The rl_base field is currently the buffer where each RPC/RDMA call header is built. The inline threshold is an agreed-on size limit to for RDMA SEND operations that pass between client and server. The sum of the RPC/RDMA header size and the RPC header size must be less than or equal to this threshold. Increasing the r/wsize maximum will require MAX_SEGS to grow significantly, but the inline threshold size won't change (both sides agree on it). The server's inline threshold doesn't change. Since an RPC/RDMA header can never be larger than the inline threshold, make all RPC/RDMA header buffers the size of the inline threshold. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30xprtrdma: Allocate RPC send buffer separately from struct rpcrdma_reqChuck Lever
Because internal memory registration is an expensive and synchronous operation, xprtrdma pre-registers send and receive buffers at mount time, and then re-uses them for each RPC. A "hardway" allocation is a memory allocation and registration that replaces a send buffer during the processing of an RPC. Hardway must be done if the RPC send buffer is too small to accommodate an RPC's call and reply headers. For xprtrdma, each RPC send buffer is currently part of struct rpcrdma_req so that xprt_rdma_free(), which is passed nothing but the address of an RPC send buffer, can find its matching struct rpcrdma_req and rpcrdma_rep quickly via container_of / offsetof. That means that hardway currently has to replace a whole rpcrmda_req when it replaces an RPC send buffer. This is often a fairly hefty chunk of contiguous memory due to the size of the rl_segments array and the fact that both the send and receive buffers are part of struct rpcrdma_req. Some obscure re-use of fields in rpcrdma_req is done so that xprt_rdma_free() can detect replaced rpcrdma_req structs, and restore the original. This commit breaks apart the RPC send buffer and struct rpcrdma_req so that increasing the size of the rl_segments array does not change the alignment of each RPC send buffer. (Increasing rl_segments is needed to bump up the maximum r/wsize for NFS/RDMA). This change opens up some interesting possibilities for improving the design of xprt_rdma_allocate(). xprt_rdma_allocate() is now the one place where RPC send buffers are allocated or re-allocated, and they are now always left in place by xprt_rdma_free(). A large re-allocation that includes both the rl_segments array and the RPC send buffer is no longer needed. Send buffer re-allocation becomes quite rare. Good send buffer alignment is guaranteed no matter what the size of the rl_segments array is. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>