irdma RDMA FreeBSD* driver for Intel(R) Ethernet Controller E810
================================================================
January 14, 2022

Contents
========

- Prerequisites
- Building and Installation
- Testing
- Configuration
- Interoperability
- Known Issues
- Support

================================================================================


Prerequisites
-------------

- FreeBSD version 12.2, 13.0 or later.
- Kernel configuration:
    Please add the following kernel configuration options:
        include GENERIC
        options OFED
        options OFED_DEBUG_INIT
        options COMPAT_LINUXKPI
        options SDP
        options IPOIB_CM

        nodevice ice
- For the irdma driver to work, an if_ice module with the RDMA interface
  is required. The interface is available in if_ice version 0.28.2 or later.
  The RDMA interface may be turned on or off with the if_ice module tunable:
    hw.ice.irdma
  To change it, add:
    hw.ice.irdma=1
  to the /boot/loader.conf file. A reboot is needed for the change to take
  effect. The RDMA interface is turned on by default (the value is 1).
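
To see which value of hw.ice.irdma a loader.conf file would set, a small
helper along the following lines may be useful. This is only a sketch: the
function name loader_tunable is hypothetical and not part of the driver, and
the default of 1 is taken from the description above.

```shell
#!/bin/sh
# Hypothetical helper (not part of irdma or if_ice): print the value that a
# loader.conf-style file assigns to a tunable, or a default if it is not set.
# Usage: loader_tunable <file> <name> <default>
loader_tunable() {
    awk -F= -v name="$2" -v def="$3" '
        {
            key = $1
            gsub(/[ \t]/, "", key)
        }
        key == name {
            val = $2
            sub(/#.*/, "", val)       # drop a trailing comment
            gsub(/[ \t"]/, "", val)   # drop whitespace and quotes
            found = 1                 # the last assignment in the file wins
        }
        END { print (found ? val : def) }
    ' "$1"
}
```

For example, "loader_tunable /boot/loader.conf hw.ice.irdma 1" prints the
configured value. On a running system, "sysctl hw.ice.irdma" reports the
value actually in effect.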

Building and Installation
-------------------------

1. Untar ice-<version>.tar.gz and irdma-<version>.tar.gz
    tar -xf ice-<version>.tar.gz
    tar -xf irdma-<version>.tar.gz
2. Install the if_ice driver:
    cd ice-<version>/
    make
    make install
3. Install the irdma driver:
    cd irdma-<version>/src/
    make clean
    make ICE_DIR=$PATH_TO_ICE/ice-<version>/
    make install
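
The steps above can be collected into one command sequence. The sketch below
only prints the commands so they can be reviewed first. The function name
irdma_build_cmds and its arguments are hypothetical, and it assumes both
tarballs sit in the same working directory.

```shell
#!/bin/sh
# Hypothetical helper: print the build and install commands for the given
# versions. Review the output, then pipe it to sh as root to run it.
# Usage: irdma_build_cmds <ice_version> <irdma_version> <workdir>
irdma_build_cmds() {
    ice_ver=$1 irdma_ver=$2 workdir=$3
    ice_dir="${workdir}/ice-${ice_ver}"
    cat <<EOF
set -e
cd "${workdir}"
tar -xf ice-${ice_ver}.tar.gz
tar -xf irdma-${irdma_ver}.tar.gz
(cd "${ice_dir}" && make && make install)
(cd "${workdir}/irdma-${irdma_ver}/src" && make clean && make ICE_DIR="${ice_dir}/" && make install)
EOF
}
```

Passing an absolute <workdir> keeps ICE_DIR valid after the change into
the irdma source directory.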

Testing
-------
1. To load the irdma driver, call:
     kldload irdma
   If if_ice is not already loaded, the system will load it automatically.
   If the irdma driver does not load, check that the value of
     sysctl hw.ice.irdma
   is 1. To change the value, add:
     hw.ice.irdma=1
   to the /boot/loader.conf file and reboot.
2. To validate the load of the driver, check:
     sysctl -a | grep infiniband
   A number of sys.class.infiniband entries should appear.
3. Each interface of the card may work in either iWARP or RoCEv2 mode.
   To enable RoCEv2 compatibility, add the following line to
   /boot/loader.conf:
     dev.irdma<interface_number>.roce_enable=1
   where <interface_number> is the number of the ice interface on which
   the RoCEv2 protocol needs to be enabled.

   For instance:
     dev.irdma0.roce_enable=0
     dev.irdma1.roce_enable=1
   will select iWARP mode on ice0 and RoCEv2 mode on ice1.
   The RoCEv2 mode is the default.

   To check the irdma roce_enable status, execute the command:
     sysctl dev.irdma<interface_number>.roce_enable
   for instance:
     sysctl dev.irdma2.roce_enable
   A returned value of '0' indicates iWARP mode, and a value of '1'
   indicates RoCEv2 mode.

   Note: An interface configured in one mode will not be able to connect
   to a node configured in another mode.
   Note: RoCEv2 requires a proper configuration of DCB in order to ensure
   lossless Ethernet. A missing or improperly configured DCB may lead to
   significant performance loss or connectivity issues. See the DCB
   Configuration in FreeBSD section for an example of how to configure DCB.
4. Enable flow control in the ice driver:
     sysctl dev.ice.<interface_num>.fc=3
   Enable flow control on the switch your system is connected to. See your
   switch documentation for details.
   Note: FC and PFC settings are mutually exclusive. If both are set, only
   one of them will actually work.
5. The source code for the krping software is provided with the kernel in
   /usr/src/sys/contrib/rdma/krping/. To compile the software, change
   directory to /usr/src/sys/modules/rdma/krping/ and invoke the following:
     make clean
     make
     make install
6. Start krping server on one machine:
    echo size=64,count=1,port=6601,addr=100.0.0.189,server > /dev/krping
7. Connect client from another machine:
    echo size=64,count=1,port=6601,addr=100.0.0.189,client > /dev/krping
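
The server and client strings in steps 6 and 7 differ only in the final
keyword, so they can be generated from one place. The sketch below assumes a
hypothetical helper named krping_opts. Its defaults (port 6601, size 64,
count 1) are simply the values from the example above.

```shell
#!/bin/sh
# Hypothetical helper: build the option string written to /dev/krping.
# Usage: krping_opts <server|client> <addr> [port] [size] [count]
krping_opts() {
    role=$1 addr=$2 port=${3:-6601} size=${4:-64} count=${5:-1}
    case $role in
        server|client) ;;
        *) echo "krping_opts: role must be server or client" >&2; return 1 ;;
    esac
    echo "size=${size},count=${count},port=${port},addr=${addr},${role}"
}
```

Usage on the server would then be:
    krping_opts server 100.0.0.189 > /dev/krping
and on the client:
    krping_opts client 100.0.0.189 > /dev/krping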

==============================================================================


Configuration
-------------

The following sysctl options are available:
- dev.irdma<interface_number>.debug
    defines level of debug messages.
    Typical value: 1 for errors only, 0x7fffffff for full debug.
- dev.irdma<interface_number>.roce_enable
    enables RoCEv2 protocol usage on the <interface_number> interface.
    By default, the RoCEv2 protocol is used.
- dev.irdma<interface_number>.dcqcn_enable
    enables the DCQCN algorithm for RoCEv2.
    Note: "roce_enable" must also be set, for this sysctl to take effect.
    Note: The change may be set at any time, but it will be applied only to
          newly created QPs.
- dev.irdma<interface_number>.dcqcn_cc_cfg_valid
    indicates that all DCQCN parameters are valid and should be updated
    in registers or QP context.
    Note: "roce_enable" must also be set, for this tunable to take effect.
- dev.irdma<interface_number>.dcqcn_min_dec_factor
    The minimum factor by which the current transmit rate can be
    changed when processing a CNP. Value is given as a percentage
    (1-100).
    Note: "roce_enable" and "dcqcn_cc_cfg_valid" must also be set, for this
          tunable to take effect.
- dev.irdma<interface_number>.dcqcn_min_rate_MBps
    The minimum value, in Mbits per second, for rate to limit.
    Note: "roce_enable" and "dcqcn_min_dec_factor" must also be set, for this
          tunable to take effect.
- dev.irdma<interface_number>.dcqcn_F
    The number of times to stay in each stage of bandwidth recovery.
    Note: "roce_enable" and "dcqcn_min_dec_factor" must also be set, for this
          tunable to take effect.
- dev.irdma<interface_number>.dcqcn_T
    The number of microseconds that should elapse before increasing the
    CWND in DCQCN mode.
    Note: "roce_enable" and "dcqcn_min_dec_factor" must also be set, for this
          tunable to take effect.
- dev.irdma<interface_number>.dcqcn_B
    The number of bytes to transmit before updating CWND in DCQCN mode.
    Note: "roce_enable" and "dcqcn_min_dec_factor" must also be set, for this
          tunable to take effect.
- dev.irdma<interface_number>.dcqcn_rai_factor
    The number of MSS to add to the congestion window in additive
    increase mode.
    Note: "roce_enable" and "dcqcn_min_dec_factor" must also be set, for this
          tunable to take effect.
- dev.irdma<interface_number>.dcqcn_hai_factor
    The number of MSS to add to the congestion window in hyperactive
    increase mode.
    Note: "roce_enable" and "dcqcn_min_dec_factor" must also be set, for this
          tunable to take effect.
- dev.irdma<interface_number>.dcqcn_rreduce_mperiod
    The minimum time between 2 consecutive rate reductions for a single
    flow. Rate reduction will occur only if a CNP is received during
    the relevant time interval.
    Note: "roce_enable" and "dcqcn_min_dec_factor" must also be set, for this
          tunable to take effect.
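
As an illustration of the roce_enable option, the sketch below prints
/boot/loader.conf lines for a list of per-interface modes. The function name
roce_loader_lines is hypothetical. The mapping of 0 to iWARP and 1 to RoCEv2
follows the Testing section.

```shell
#!/bin/sh
# Hypothetical helper: print loader.conf lines selecting a mode for
# irdma0, irdma1, ... in argument order. Each argument is "iwarp" or "roce".
roce_loader_lines() {
    idx=0
    for mode in "$@"; do
        case $mode in
            iwarp) val=0 ;;
            roce)  val=1 ;;
            *) echo "roce_loader_lines: unknown mode: $mode" >&2; return 1 ;;
        esac
        echo "dev.irdma${idx}.roce_enable=${val}"
        idx=$((idx + 1))
    done
}
```

For example, "roce_loader_lines iwarp roce" prints the two lines used in the
Testing section example. Append the output to /boot/loader.conf and reboot.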


DCB Configuration in FreeBSD
----------------------------
    For RoCEv2 traffic to work without problems, DCB should be configured
    in the ice driver.
    DCB allows for a host to accept configuration from its link partner
    (willing mode), or for the host to set its own configuration (non-willing).

    DCB on E810 devices is intended to be used with a switch or non-willing
    partner with DCBX/LLDP that will handle DCB configuration.

    Note: E810 family devices do not support FW DCB combined with non-willing
          mode (i.e. the firmware will not try to configure the partner).

    Note: FreeBSD 13.0 and earlier do not have SW DCB support, so SW DCB
          cannot currently be combined with willing mode: there is no software
          to accept the configuration and handle negotiation and adapter
          configuration.

    Currently, the driver supports "willing" mode when the FW LLDP agent
    (DCBX) is enabled, and "non-willing" mode otherwise. The DCB
    configuration may be limited in the latter case.

    Note: The kernel needs to have https://reviews.freebsd.org/D31485
          (iflib: Allow drivers to determine which queue to TX on) applied in
          order to support the DCB.

    Configuration of FW LLDP Agent:
        sysctl dev.ice.<iface_num>.fw_lldp_agent=1

        1 enables FW-LLDP, 0 disables FW-LLDP.
        For the ice driver to be able to send LLDP packets, you need to disable
        the LLDP filter:
            kenv hw.ice.debug.enable_tx_lldp_filter=0

    View/Edit DCB ETS Settings:
        sysctl dev.ice.<iface_num>.ets_min_rate

            In "willing" mode (fw_lldp_agent=1), displays the current ETS
            bandwidth table. In "non-willing" mode, displays and allows setting
            the table.
            The sysctl accepts an input that consists of a comma-separated list
            of numbers [0-100], that must all add up to 100. These correspond
            to the minimum bandwidth allocations allowed for each traffic
            class.

            For instance:
                sysctl dev.ice.<iface_num>.ets_min_rate=30,10,10,10,10,10,10,10
            This configures every traffic class but TC 0 to a minimum of 10%
            bandwidth; TC 0 instead has 30% minimum bandwidth.

            Note: When setting ets_min_rate, only non-0 values are allowed for
            TCs that are in use in up2tc_map. Therefore, the up2tc_map setting
            shall be done before setting ets_min_rate.
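
The format rules for ets_min_rate (comma-separated numbers in the range
0-100 that add up to 100) can be checked before calling sysctl. The sketch
below uses a hypothetical function name, ets_rates_valid, and checks only
the rules stated in this section.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the argument is a comma-separated list of
# integers in [0,100] whose sum is exactly 100.
# Usage: ets_rates_valid <list>
ets_rates_valid() {
    echo "$1" | awk -F, '
        {
            sum = 0
            for (i = 1; i <= NF; i++) {
                if ($i !~ /^[0-9]+$/ || $i + 0 > 100) exit 1
                sum += $i
            }
            exit (sum == 100 ? 0 : 1)
        }'
}
```

For example:
    rates=30,10,10,10,10,10,10,10
    ets_rates_valid "$rates" && sysctl dev.ice.<iface_num>.ets_min_rate="$rates"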

    User Priority to Traffic Class Mapping:
        sysctl dev.ice.<iface_num>.up2tc_map

            In "willing" mode (fw_lldp_agent=1), displays the current ETS
            priority assignment table. In "non-willing" mode, displays and
            allows setting the table.
            Input must be in this format: 0,1,2,3,4,5,6,7
            Where the first number is the TC for UP0, second number is the TC
            for UP1, etc.
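
Because ets_min_rate allows non-0 values only for TCs that are in use in
up2tc_map, the two settings can be cross-checked before they are applied.
The sketch below uses a hypothetical function name, ets_matches_up2tc. It
assumes that the first ets_min_rate value belongs to TC 0, as in the example
in this section.

```shell
#!/bin/sh
# Hypothetical helper: succeed if every TC with a non-zero minimum rate is
# referenced by at least one user priority in the up2tc_map value.
# Usage: ets_matches_up2tc <up2tc_map> <ets_min_rate>
ets_matches_up2tc() {
    awk -v map="$1" -v rates="$2" 'BEGIN {
        n = split(map, tc, ",")
        for (i = 1; i <= n; i++) used[tc[i] + 0] = 1   # TCs in use
        m = split(rates, r, ",")
        for (i = 1; i <= m; i++)                        # position i is TC i-1
            if (r[i] + 0 > 0 && !((i - 1) in used)) exit 1
        exit 0
    }'
}
```

For example, "ets_matches_up2tc 0,0,0,0,1,1,1,1 60,40,0,0,0,0,0,0" succeeds,
while the same rates with an all-zero up2tc_map fail because TC 1 is unused.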

    Priority Flow Control Configuration:
        sysctl dev.ice.<iface_num>.pfc

            In "willing" mode (fw_lldp_agent=1), displays the current Priority
            Flow Control configuration. In "non-willing" mode, displays and
            allows setting the configuration.

            Input/Output is in this format: 0xff
            Where each bit of the hexadecimal number indicates whether PFC is
            enabled for the corresponding Traffic Class.
            For instance:
                sysctl dev.ice.<iface_num>.pfc=0x81
            indicates the PFC is enabled on TC0 and TC7.
            Settings for disabled TCs with this sysctl are ignored.

            This sysctl only writes the new configuration when the adapter is
            in non-willing mode.
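
To read a PFC value at a glance, the bitmask can be expanded into the list
of traffic classes it enables. The function name pfc_enabled_tcs in the
sketch below is hypothetical. Bit N of the mask is taken to correspond to
TC N, in line with the 0x81 example for TC0 and TC7.

```shell
#!/bin/sh
# Hypothetical helper: print the traffic classes whose bit is set in a PFC
# bitmask, lowest TC first, separated by spaces.
# Usage: pfc_enabled_tcs <mask>    (for example 0x81)
pfc_enabled_tcs() {
    mask=$(($1))
    out=""
    tc=0
    while [ "$tc" -lt 8 ]; do
        if [ $(( (mask >> tc) & 1 )) -eq 1 ]; then
            out="${out:+$out }$tc"
        fi
        tc=$((tc + 1))
    done
    echo "$out"
}
```

For example, "pfc_enabled_tcs 0x81" prints "0 7".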

    Debug sysctls:
        sysctl dev.ice.<iface_num>.debug.local_dcbx_cfg
        sysctl dev.ice.<iface_num>.debug.remote_dcbx_cfg
        sysctl dev.ice.<iface_num>.debug.pf_vsi_cfg

        Print out more information when the ICE_DBG_DCB flag is set
        (debug_mask=0x400) in the ice driver.


Interoperability
----------------

Known Issues
------------
- krping is unable to bind to an address belonging to a VLAN interface.
  This appears to be a problem in rdma_copy_addr of ib_addr.c.

Support
-------
For general information, go to the Intel support website at:
www.intel.com/support/

or the Intel Wired Networking project hosted by SourceForge at:
http://sourceforge.net/projects/e1000

If an issue is identified with the released source code on a supported
kernel with a supported adapter, email the specific information related to the
issue to e1000-rdma@lists.sourceforge.net



================================================================================


License
-------

This software is available to you under a choice of one of two
licenses. You may choose to be licensed under the terms of the GNU
General Public License (GPL) Version 2, available from the file
COPYING in the main directory of this source tree, or the
OpenFabrics.org BSD license below:

  Redistribution and use in source and binary forms, with or
  without modification, are permitted provided that the following
  conditions are met:

  - Redistributions of source code must retain the above
    copyright notice, this list of conditions and the following
    disclaimer.

  - Redistributions in binary form must reproduce the above
    copyright notice, this list of conditions and the following
    disclaimer in the documentation and/or other materials
    provided with the distribution.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================================================


Trademarks
----------

(c) Intel Corporation. Intel, the Intel logo, and other Intel marks are
trademarks of Intel Corporation or its subsidiaries. 
Other names and brands may be claimed as the property of others. 


