Setting up

Installation

Clone the LHAT repository locally from https://github.com/openearth/lhat

>> git clone https://github.com/openearth/lhat.git

Navigate to the directory where you cloned the repository and create a conda environment from the yml file.

>> conda env create -f environment.yml

Once the environment is created, activate it and import lhat. Make sure your working directory is the root folder of the cloned repository.

Activate the conda environment

>> conda activate lhat

Import LHAT in a Python session:

>>> import lhat

Run the example script in your command line

>> python example.py

Parameterising LHAT

The LHAT tool requires some parameters. The following arguments are necessary:

  • Name of project

  • Coordinate Reference System (crs)

  • Path to where your landslide point dataset is (accepts JSON or .shp format)

  • A random state (necessary for reproducibility of results)

  • Bounding box for clipping public assets

  • inputs (dictionary)

  • no_data values (can be a list or single value)

  • Pixel resolution (important for the retrieval of online datasets)

  • Kernel size (default 3x3): necessary for defining an area as ‘landslide’, since a landslide affects an area rather than a single point.

Note

Not all input data have an online source. For those that do not, using the ‘online’ option will return nothing.

The following code snippet can be used for the initial parameterisation. It is also available in example.py, located in the root of the lhat repository.

Example of parameterising inputs
from lhat import IO as io

project = io.inputs(

    # Define a project name. This will be the name of the folder in which
    # your results are stored.
    project_name = 'jamaica_test',

    # The crs defined here will dictate which crs your input data is reprojected
    # to, as well as your final result.
    crs = 'epsg:3450',

    # Provide a path to your landslide points. This is COMPULSORY for the model
    # to work.
    landslide_points = './Projects/jamaica-test/Input/dummy-landslides.json',

    # Defining a random state (any integer) allows results to be reproducible
    random_state = 101,

    # A bounding box is required when taking inputs from online sources such as
    # geoservers. Use EPSG:4326 coordinates.
    bbox = [[-77.73174142, 18.02046626],
            [-77.1858101, 18.02046626],
            [-77.1858101, 18.34868174],
            [-77.73174142, 18.34868174],
            [-77.73174142, 18.02046626]],

    # The following are inputs that are possible to use within LHAT.
    # 3 choices for filepaths are: your_file_path, 'online', None.
    #       your_file_path = path to the respective file in string
    #       'online'       = an online, typically global source is relied on instead.
    #                        For datasets that are calculated from another dataset
    #                        such as slope/aspect/roughness, leave as 'online'.
    #       None           = None as an argument means that the dataset is NOT
    #                        considered as input into the model.
    #
    # Data type is critical to define as categorical and numerical data undergo
    # different data treatments.
    #
    # For 'reference', take care that if an online dataset is used as the reference,
    # bbox arguments define the grid extent, while the pixel_size argument below
    # defines the resolution of your reference (and therefore, your output) dataset.
    inputs = {
        'dem': {'filepath': 'online',
                'data_type': 'numerical'},
        'slope': {'filepath': 'online',
                  'data_type': 'numerical'},
        'aspect': {'filepath': 'online',
                    'data_type': 'numerical'},
        'lithology': {'filepath': 'online',
                        'data_type': 'categorical'},
        'prox_road': {'filepath': './Projects/jamaica-test/Input/prox_roads.tif',
                      'data_type': 'numerical'},
        'prox_river': {'filepath': './Projects/jamaica-test/Input/prox_rivers.tif',
                        'data_type': 'numerical'},
        'reference': 'dem'
        },

    no_data = -9999,  # Optional argument to define no_data value. Propagates
                        # to all processing of input files.

    pixel_size = 1000,    # Optional argument to define pixel size.
                          # Pixel size is only important for online datasets

    kernel_size = 3     # Define kernel size. Take into consideration pixel size
                        # and full extent of landslide-prone areas.
    )
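The bbox in the example is a closed ring of [longitude, latitude] pairs in EPSG:4326: lower-left, lower-right, upper-right, upper-left, and back to lower-left. As a minimal sketch, such a ring could be built from the corner coordinates as shown here. The helper name bbox_ring is hypothetical and is not part of LHAT.

```python
def bbox_ring(min_lon, min_lat, max_lon, max_lat):
    """Return a closed ring of [lon, lat] pairs, starting at the lower-left corner.

    Hypothetical helper for illustration only; it is not part of LHAT.
    """
    if min_lon >= max_lon or min_lat >= max_lat:
        raise ValueError("minimum coordinates must be smaller than maximum coordinates")
    return [
        [min_lon, min_lat],
        [max_lon, min_lat],
        [max_lon, max_lat],
        [min_lon, max_lat],
        [min_lon, min_lat],  # repeat the first vertex to close the ring
    ]


# Reproduces the ring used in the parameterisation example.
bbox = bbox_ring(-77.73174142, 18.02046626, -77.1858101, 18.34868174)
```

The resulting list can be passed as the bbox argument in place of the hand-written coordinates.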

Array harmonisation

Once the inputs have been defined, the tool harmonises all input datasets into a stack of arrays by reprojecting and resampling them onto the same grid. Resampling uses the nearest neighbour method, and all datasets are reprojected into the crs defined in io.inputs(). Subsequently, any pixel that has no data in any input dataset is masked across the entire stack, so the final output contains only pixels for which every input dataset has valid data.
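The masking step can be illustrated with a small NumPy sketch. It assumes two made-up layers that already share the same grid, and it is not LHAT's internal implementation; the array values and variable names are for illustration only.

```python
import numpy as np

NO_DATA = -9999

# Two hypothetical 3x3 layers, already resampled to the same grid.
dem = np.array([[10.0, 12.0, NO_DATA],
                [11.0, 13.0, 15.0],
                [9.0, 14.0, 16.0]])
slope = np.array([[1.0, NO_DATA, 3.0],
                  [2.0, 4.0, 5.0],
                  [2.5, 3.5, 4.5]])

stack = np.stack([dem, slope])  # shape: (layers, rows, cols)

# A pixel is invalid if ANY layer holds the no-data value at that position.
invalid = (stack == NO_DATA).any(axis=0)  # shape: (rows, cols)

# Apply the shared mask to every layer in the stack.
mask = np.repeat(invalid[np.newaxis], stack.shape[0], axis=0)
masked = np.ma.masked_array(stack, mask=mask)

valid_pixels = int((~invalid).sum())
print(valid_pixels)
```

A pixel that is missing in one layer is masked in all layers, so every remaining pixel has a value in each input dataset.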

Data engineering step

Once the valid set of arrays has been generated, the pixels that intersect with the landslide points are selected, together with a 3x3 kernel window around each pixel. These pixels are marked as landslide areas, selected across the arrays, and flattened into a single dimension for each input dataset. The same number of non-landslide points is then randomly selected from the stack of arrays and flattened in the same way. The flattened data, in the form of a pandas.DataFrame object, serves as input for the next step, i.e. machine learning. The generate_xy() method returns two dataframes: the first contains the flattened pixel values from each input dataset that coincide with the landslide points and the kernel windows around them, and the second contains the landslide classes, where 0 indicates no landslide and 1 indicates landslide.

Generating inputs for model training
######   Data Engineering Stage   ######
# The user has a choice to further refine the input data prior to running the
# model.
x, y = project.generate_xy()
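To make the kernel window and the balanced sampling more concrete, here is a minimal NumPy sketch. It is not the code behind generate_xy(); the function name mark_landslide_pixels, the grid size, and the point locations are all assumptions made for illustration.

```python
import numpy as np


def mark_landslide_pixels(shape, points, kernel_size=3):
    """Return a boolean grid that is True inside a kernel window around each point."""
    half = kernel_size // 2
    labels = np.zeros(shape, dtype=bool)
    for row, col in points:
        # Clip the window at the grid edges.
        r0, r1 = max(row - half, 0), min(row + half + 1, shape[0])
        c0, c1 = max(col - half, 0), min(col + half + 1, shape[1])
        labels[r0:r1, c0:c1] = True
    return labels


rng = np.random.default_rng(101)  # fixed seed, analogous to random_state
landslide = mark_landslide_pixels((10, 10), points=[(2, 2), (7, 8)])

# Flatten the grid and draw as many non-landslide pixels as landslide pixels.
flat = landslide.ravel()
positive_idx = np.flatnonzero(flat)
negative_idx = rng.choice(np.flatnonzero(~flat), size=positive_idx.size, replace=False)

# Class labels: 1 for landslide pixels, 0 for non-landslide pixels.
y = np.concatenate([np.ones(positive_idx.size, dtype=int),
                    np.zeros(negative_idx.size, dtype=int)])
```

With two landslide points and a 3x3 kernel, 18 pixels are labelled as landslide and 18 non-landslide pixels are drawn, which gives a balanced set of 36 samples.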

During the parameterisation stage, the data type of each input dataset had to be declared. When the input data is numerical (e.g. elevation data), no additional data treatment is needed other than masking. If the data is categorical, however, a dummy variable needs to be generated for each category in the form of a binary variable (0s and 1s). Because the data types are defined in the parameterisation stage, dummy variables are created automatically, named with the input data name as a prefix followed by the category value.
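The naming scheme for dummy variables can be sketched with pandas. This assumes pandas.get_dummies as a stand-in, since the encoding routine LHAT uses internally is not shown here, and the sample values are made up.

```python
import pandas as pd

# Hypothetical flattened pixel values: one numerical and one categorical input.
x = pd.DataFrame({
    "dem": [120.5, 300.0, 87.2, 410.9],
    "lithology": [3, 7, 3, 12],
})

# One binary column per category, named <input name>_<category value>.
x_encoded = pd.get_dummies(x, columns=["lithology"], prefix="lithology", dtype=int)

print(list(x_encoded.columns))
```

The numerical column is left untouched, while the categorical column is replaced by one 0/1 column per lithology class.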

generate_xy() is a separate step, created specifically to allow further refinement by the user. If the user is satisfied with the input data for training the model, the landslide ID column can be dropped directly before proceeding to run the model.

Dropping landslide ID and preparing for model training
x = x.drop(columns=['landslide_ids'])

Running the model(s)

Running the model requires defining the model choice. In the LHAT tool, the user can choose from three different machine learning methods:

  • Support Vector Machine

  • Random Forest

  • Logistic Regression

For each of the models, model parameterisation is performed automatically using the GridSearch module. In LHAT, each model is parameterised with the combination of parameters that produces the highest accuracy. In future developments, we would like to refine this so that model parameterisation is based on other criteria, as ranking on accuracy alone runs the risk of overfitting the model. Within the lhat.Model module, the input data is split into an 80% training set and a 20% test set.
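The combination of an 80/20 split and an accuracy-ranked grid search can be sketched with scikit-learn. This is an illustration under assumptions rather than the contents of lhat.Model: the synthetic data, the parameter grid, and the number of cross-validation folds are all made up.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the x and y dataframes.
x, y = make_classification(n_samples=200, n_features=6, random_state=101)

# 80% training set, 20% test set.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=101
)

# Hypothetical parameter grid; the grid LHAT searches is not shown here.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=101),
    param_grid,
    scoring="accuracy",  # rank parameter combinations on accuracy
    cv=3,
)
search.fit(x_train, y_train)

print(search.best_params_)
print(search.score(x_test, y_test))
```

In a setup like this, changing the scoring argument (for example to "f1" or "roc_auc") is one way to rank parameter combinations on a criterion other than accuracy.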

An example of running (all) models is shown in example.py.

Example of how to run the machine learning model
# As an example
for m in ['SVM', 'RF', 'LR']:
    project.run_model(
        x = x,
        y = y,
        model = m,
        modelExist = False
        )

Although LHAT is capable of rapid risk assessments, model runtimes can vary depending on several factors:

  • The bounding box of the area

  • The pixel resolution

  • The number of input datasets

Once the modelling is complete, the results are exported as GeoTIFF files in the ‘Output’ folder of the project (within ‘Projects’). The random state defined in the tool makes the results reproducible, should somebody wish to replicate the modelling.