Saturday, January 21, 2017

Using SAP HANA Cloud Platform from ABAP to find root causes for runtime bottlenecks

Hi guys,

continuing my previous experiments, I have set aside AWS Machine Learning for a while and turned to the SAP HANA Cloud Platform (SAP HCP). SAP HCP also offers machine learning capabilities under the label "Predictive Analysis". While reading up on it, I found that it goes beyond prediction: it offers introspection mechanisms on the models you create within SAP HCP Predictive Analysis, such as finding the key influencing factors for a target figure in your data. Amazon's Machine Learning does not offer this so far, so it got me excited.

Goals

The goal of this experiment is to enable every ABAP developer to use SAP HCP Predictive Analysis via a native API layer, without the hassle of having to understand the architecture, the infrastructure and the details of the implementation. This brings the capabilities much closer to where the hidden data treasure lies, waiting for insights: the ERP system - even if it is not S/4 HANA and does not even sit on top of a HANA database itself.

This ultimately enables ABAP developers to extend existing business functionality with Predictive Analysis capabilities.




Use Case

I decided to keep the topic the same as in my previous articles: predicting the runtime of the SNP System Scan. This time, however, I wanted to take advantage of the introspection capabilities and ask which factors influence the runtime the most.

I am going in with the expectation that the runtime will most likely depend on the database, its vendor and its version. Of course the version of the scan itself should also play a significant role, as we are constantly trying to improve performance. However, this may be in a battle with the features we are adding. The industry could also be of interest, because different processes being used may lead to different areas of the database being populated with different amounts of data. Let's see what HCP finds out.

Preparation

In order to make use of SAP HCP Predictive Services you have to fulfill some prerequisites. Rather than describing them completely, I will reference the tutorials I followed to do the same within my company. If you have already set up an SAP HCP account and the SAP HANA Cloud Connector, you only need to perform steps 2, 4 and 5.
  1. Create an SAP HCP trial account.
  2. Deploy SAP HANA Cloud Platform predictive services on your trial account (see this tutorial).
  3. Install and set up the SAP HANA Cloud Connector in your corporate system landscape.
  4. Configure access in the SAP HANA Cloud Connector to the HANA database you have set up in step 2. This makes your HCP database appear to be in your own network.
  5. Configure that database as an ADBC resource in transaction DBCO of the SAP NetWeaver system on which you will later execute your ABAP code.
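
Once step 5 is done, a quick smoke test can confirm that the ABAP system really reaches the HCP database through the Cloud Connector. The following is a minimal sketch using the standard ADBC classes; the connection name 'HCP' is an assumption and has to match whatever entry you maintained in DBCO.

```abap
FORM check_hcp_connection.
*"--- DATA DEFINITION -------------------------------------------------
  "assumption: the DBCO entry was named 'HCP'
  CONSTANTS: lc_con_name TYPE dbcon_name VALUE 'HCP'.

  DATA: lo_con    TYPE REF TO cl_sql_connection.
  DATA: lo_stmt   TYPE REF TO cl_sql_statement.
  DATA: lo_result TYPE REF TO cl_sql_result_set.
  DATA: lr_user   TYPE REF TO data.
  DATA: lv_user   TYPE string.
  DATA: lr_ex     TYPE REF TO cx_root.
  DATA: lv_msg    TYPE string.

*"--- PROCESSING LOGIC ------------------------------------------------
  TRY.
      "open the secondary database connection maintained in DBCO
      lo_con  = cl_sql_connection=>get_connection( lc_con_name ).
      lo_stmt = lo_con->create_statement( ).

      "DUMMY is the single-row system table of HANA
      lo_result = lo_stmt->execute_query( `SELECT CURRENT_USER FROM DUMMY` ).

      "bind the output variable and fetch the one result row
      GET REFERENCE OF lv_user INTO lr_user.
      lo_result->set_param( lr_user ).
      IF lo_result->next( ) > 0.
        WRITE: / 'Connected to HCP HANA as', lv_user.
      ENDIF.

      lo_result->close( ).
      lo_con->close( ).

    CATCH cx_sql_exception cx_parameter_invalid INTO lr_ex.

      "output errors, e.g. an unknown connection name or a closed tunnel
      lv_msg = lr_ex->get_text( ).
      MESSAGE lv_msg TYPE 'S' DISPLAY LIKE 'E'.

  ENDTRY.

ENDFORM.
```

If this fails, the problem is in the infrastructure (DBCO entry, Cloud Connector mapping, database user) and not in the predictive part, which saves some guessing later on.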

Architecture

After having set up the infrastructure, let's think about the architecture of the application. It is not one of the typical extension patterns that SAP foresees: the application logic resides on the ABAP system and makes use of an application in the cloud, rather than a new application sitting in SAP HCP that treats your on-premise SAP systems as data sources.



Example Implementation

So by this time all necessary prerequisites are fulfilled and it's time to have some fun with ABAP - hard to believe ;-) But as long as you can build a REST client, you can extend your core with basically anything you can imagine nowadays. So here we go:

FORM main.
*"--- DATA DEFINITION -------------------------------------------------
  DATA: lr_scan_data TYPE REF TO data.
  DATA: lr_prepared_data TYPE REF TO data.
  DATA: lr_ml TYPE REF TO /snp/hcp01_cl_ml.
  DATA: lv_dataset_id TYPE i.
  DATA: lr_ex TYPE REF TO cx_root.
  DATA: lv_msg TYPE string.

  FIELD-SYMBOLS: <lt_data> TYPE table.

*"--- PROCESSING LOGIC ------------------------------------------------
  TRY.
      "fetch the data into an internal table
      PERFORM get_system_scan_data CHANGING lr_scan_data.
      ASSIGN lr_scan_data->* TO <lt_data>.

      "prepare data (e.g. convert, select features)
      PERFORM prepare_data USING <lt_data> CHANGING lr_prepared_data.
      ASSIGN lr_prepared_data->* TO <lt_data>.

      "create a dataset (called a model on other platforms like AWS)
      CREATE OBJECT lr_ml.
      PERFORM create_dataset USING lr_ml <lt_data> CHANGING lv_dataset_id.

      "check if...
      IF lr_ml->is_ready( lv_dataset_id ) = abap_true.

        "...creation was successful
        PERFORM find_key_influencers USING lr_ml lv_dataset_id.

      ELSEIF lr_ml->is_failed( lv_dataset_id ) = abap_true.

        "...creation failed
        lv_msg = /snp/cn00_cl_string_utils=>text( iv_text = 'Model &1 has failed' iv_1 = lv_dataset_id ).
        MESSAGE lv_msg TYPE 'S' DISPLAY LIKE 'E'.

      ENDIF.

    CATCH cx_root INTO lr_ex.

      "output errors
      lv_msg = lr_ex->get_text( ).
      PERFORM display_lines USING lv_msg.

  ENDTRY.

ENDFORM.
This is basically the same procedure as last time, when connecting to AWS Machine Learning. Again I fetched the data via a REST service from the SNP Data Cockpit instance that I use to keep statistics on all executed SNP System Scans. However, you can fetch the data that will serve as the data source for your model in any way you like. Most probably you will be using Open SQL SELECTs. Just as a reminder, the results looked somewhat like this:


Prepare the Data

This is the raw data and it's not perfect! In the shape it is in, the data quality is not good enough. According to this article, there are some improvements I need to make in order to raise its quality.

  1. Normalizing values (e.g. lower casing, mapping values or clustering values). E.g.
    • Combining the database vendor and the major version of the database, because those two values only make sense when treated in combination and not individually
    • Clustering the database size into 1.5 TB chunks, as these values are easier to guess when executing predictions
    • Clustering the runtime into exponentially increasing categories does not work with HCP Predictive Services, because so far it only solves regression problems, which rely on numeric target values
  2. Filling up empty values with reasonable defaults. E.g.
    • treating all unknown SAP client types as test clients
  3. Making values and field names more human readable. This is not necessary for the machine learning algorithms, but it makes manual interpretation of the results easier
  4. Removing fields that do not make good features, like
    • IDs
    • fields that cannot be provided for later predictions, because values cannot be determined easily or intuitively
  5. Removing records that still do not have good data quality. E.g. missing values in
    • database vendors
    • SAP system types
    • customer industry
  6. Removing records that are not representative. E.g.
    • scans with exceptionally short runtimes, probably due to an intentionally limited scope
    • small database sizes, probably due to non-productive systems
So the resulting coding to do this preparation and data cleansing looks almost the same as in the AWS Example:

FORM prepare_data USING it_data TYPE table CHANGING rr_data TYPE REF TO data.
*"--- DATA DEFINITION -------------------------------------------------
  DATA: lr_q TYPE REF TO /snp/cn01_cl_itab_query.

*"--- PROCESSING LOGIC ------------------------------------------------
  CREATE OBJECT lr_q.

  "selecting the fields that make good features
  lr_q->select( iv_field = 'COMP_VERSION'       iv_alias = 'SAP_SYSTEM_TYPE' ).
  lr_q->select( iv_field = 'DATABASE'           iv_uses_fields = 'NAME,VERSION' iv_cb_program = sy-repid iv_cb_form = 'ON_VIRTUAL_FIELD' ).
  lr_q->select( iv_field = 'DATABASE_SIZE'      iv_uses_fields = 'DB_USED' iv_cb_program = sy-repid iv_cb_form = 'ON_VIRTUAL_FIELD' ).
  lr_q->select( iv_field = 'OS'                 iv_alias = 'OPERATING_SYSTEM' ).
  lr_q->select( iv_field = 'SAP_CLIENT_TYPE'    iv_uses_fields = 'CCCATEGORY' iv_cb_program = sy-repid iv_cb_form = 'ON_VIRTUAL_FIELD'  ).
  lr_q->select( iv_field = 'COMPANY_INDUSTRY1'  iv_alias = 'INDUSTRY' ).
  lr_q->select( iv_field = 'IS_UNICODE'         iv_cb_program = sy-repid iv_cb_form = 'ON_VIRTUAL_FIELD' ).
  lr_q->select( iv_field = 'SCAN_VERSION' ).
  lr_q->select( iv_field = 'RUNTIME_MINUTES'    iv_ddic_type = 'INT4' ). "make sure this column is converted into a number

  "perform the query on the defined internal table
  lr_q->from( it_data ).

  "filter records that are not good for results
  lr_q->filter( iv_field = 'DATABASE'           iv_filter = '-' ). "no empty values in the database
  lr_q->filter( iv_field = 'SAP_SYSTEM_TYPE'    iv_filter = '-' ). "no empty values in the SAP System Type
  lr_q->filter( iv_field = 'INDUSTRY'           iv_filter = '-' ). "no empty values in the Industry
  lr_q->filter( iv_field = 'RUNTIME_MINUTES'    iv_filter = '>=10' ). "Minimum of 10 minutes runtime
  lr_q->filter( iv_field = 'DATABASE_GB_SIZE'   iv_filter = '>=50' ). "Minimum of 50 GB database size

  "sort by runtime
  lr_q->sort( 'RUNTIME_MINUTES' ).

  "execute the query
  rr_data = lr_q->run( ).

ENDFORM.
Basically the magic is done using the /SNP/CN01_CL_ITAB_QUERY class, which is part of the SNP Transformation Backbone framework. It enables SQL-like query capabilities on ABAP internal tables. This includes transforming field values, which is done using callback mechanisms.

FORM on_virtual_field USING iv_field is_record TYPE any CHANGING cv_value TYPE any.
*"--- DATA DEFINITION -------------------------------------------------
  DATA: lv_database TYPE string.
  DATA: lv_database_version TYPE string.
  DATA: lv_tmp TYPE string.
  DATA: lv_int TYPE i.
  DATA: lv_p(16) TYPE p DECIMALS 1.

  FIELD-SYMBOLS: <lv_value> TYPE any.

*"--- MACRO DEFINITION ------------------------------------------------
  DEFINE mac_get_field.
    clear: &2.
    assign component &1 of structure is_record to <lv_value>.
    if sy-subrc = 0.
      &2 = <lv_value>.
    else.
      return.
    endif.
  END-OF-DEFINITION.

*"--- PROCESSING LOGIC ------------------------------------------------
  CASE iv_field.
    WHEN 'DATABASE'.

      "combine database name and major version to one value
      mac_get_field 'NAME' lv_database.
      mac_get_field 'VERSION' lv_database_version.
      SPLIT lv_database_version AT '.' INTO lv_database_version lv_tmp.
      CONCATENATE lv_database lv_database_version INTO cv_value SEPARATED BY space.

    WHEN 'DATABASE_SIZE'.

      "categorize the database size into 1.5 TB chunks (e.g. "up to 4.5 TB")
      mac_get_field 'DB_USED' cv_value.
      lv_p = ( floor( cv_value / 1500 ) + 1 ) * '1.5'. "simple round to full 1.5TB chunks
      cv_value = /snp/cn00_cl_string_utils=>text( iv_text = 'up to &1 TB' iv_1 = lv_p ).
      TRANSLATE cv_value USING ',.'. "translate commas to dots so the CSV does not get confused

    WHEN 'SAP_CLIENT_TYPE'.

      "fill up the client category type with a default value
      mac_get_field 'CCCATEGORY' cv_value.
      IF cv_value IS INITIAL.
        cv_value = 'T'. "default to (T)est SAP client
      ENDIF.

    WHEN 'IS_UNICODE'.

      "convert the unicode flag into more human readable values
      IF cv_value = abap_true.
        cv_value = 'unicode'.
      ELSE.
        cv_value = 'non-unicode'.
      ENDIF.

  ENDCASE.

ENDFORM.
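
To make the size bucketing in the DATABASE_SIZE branch easier to follow, here is the same formula in isolation as a small worked example. It is only a sketch: the input value of 3200 is made up, and it assumes (as the divisor 1500 suggests) that DB_USED is delivered in GB.

```abap
FORM demo_size_bucket.
*"--- DATA DEFINITION -------------------------------------------------
  "made-up example value, assumed to be in GB
  DATA: lv_db_used_gb(16) TYPE p DECIMALS 1 VALUE '3200.0'.
  DATA: lv_upper_tb(16)   TYPE p DECIMALS 1.

*"--- PROCESSING LOGIC ------------------------------------------------
  "3200 / 1500 = 2.13 -> floor = 2 -> ( 2 + 1 ) * 1.5 = 4.5
  lv_upper_tb = ( floor( lv_db_used_gb / 1500 ) + 1 ) * '1.5'.

  "the decimal separator of the output depends on the user settings,
  "which is why the callback replaces commas with dots afterwards
  WRITE: / 'up to', lv_upper_tb, 'TB'.

ENDFORM.
```

Note that the label is an exclusive upper bound: a database of exactly 1500 GB gives floor( 1.0 ) = 1 and therefore lands in the 3.0 TB bucket, not in the 1.5 TB one.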
After the preparation, the data looks nice and clean, like this:


Creating the Dataset

In SAP HANA Cloud Platform Predictive Services you create a dataset rather than a model. This is basically split up into:

  1. Creating a Database Table in a HANA Database 
  2. Uploading Data into that Database Table
  3. Registering the dataset 

This mainly corresponds to creating a datasource with the AWS Machine Learning API. However, you do not explicitly train the dataset or create a model. This is done implicitly - maybe. We'll discover more about that soon.

FORM create_dataset USING ir_hcp_machine_learning TYPE REF TO /snp/hcp01_cl_ml
                          it_table TYPE table
                 CHANGING rv_dataset_id.

  rv_dataset_id = ir_hcp_machine_learning->create_dataset(

    "...by creating a temporary table in a HCP HANA database
    "   instance from an internal table
    "   and inserting the records, so it can be used
    "   as a machine learning data set
    it_table = it_table

  ).

ENDFORM.

My API creates a temporary table for each internal table you create a dataset on. It is a column table without a primary key. All column types are determined automatically using runtime type inspection. If columns of the internal table are strings, I determine the length by scanning the content rather than creating CLOBs, which are not well suited for Predictive Services.
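
The type determination described above could look roughly like the following. This is a minimal sketch and not the code inside /snp/hcp01_cl_ml: the FORM name is made up, only integer and character-like columns are handled, and the internal table is assumed to have a flat structured line type.

```abap
FORM build_create_table USING it_data       TYPE STANDARD TABLE
                              iv_table_name TYPE string
                     CHANGING cv_ddl        TYPE string.
*"--- DATA DEFINITION -------------------------------------------------
  DATA: lo_table   TYPE REF TO cl_abap_tabledescr.
  DATA: lo_struct  TYPE REF TO cl_abap_structdescr.
  DATA: lv_type    TYPE string.
  DATA: lv_column  TYPE string.
  DATA: lv_columns TYPE string.
  DATA: lv_len     TYPE i.
  DATA: lv_max     TYPE i.
  DATA: lv_max_c   TYPE string.

  FIELD-SYMBOLS: <ls_comp>  TYPE abap_compdescr.
  FIELD-SYMBOLS: <ls_row>   TYPE any.
  FIELD-SYMBOLS: <lv_value> TYPE clike.

*"--- PROCESSING LOGIC ------------------------------------------------
  CLEAR cv_ddl.

  "describe the line type of the internal table at runtime
  lo_table  ?= cl_abap_typedescr=>describe_by_data( it_data ).
  lo_struct ?= lo_table->get_table_line_type( ).

  LOOP AT lo_struct->components ASSIGNING <ls_comp>.
    CASE <ls_comp>-type_kind.
      WHEN cl_abap_typedescr=>typekind_int.
        lv_type = 'INTEGER'.

      WHEN cl_abap_typedescr=>typekind_char
        OR cl_abap_typedescr=>typekind_string.
        "scan the content for the longest value instead of using a CLOB
        lv_max = 1.
        LOOP AT it_data ASSIGNING <ls_row>.
          ASSIGN COMPONENT <ls_comp>-name OF STRUCTURE <ls_row> TO <lv_value>.
          lv_len = strlen( <lv_value> ).
          IF lv_len > lv_max.
            lv_max = lv_len.
          ENDIF.
        ENDLOOP.
        lv_max_c = lv_max.
        CONDENSE lv_max_c.
        CONCATENATE 'NVARCHAR(' lv_max_c ')' INTO lv_type.

      WHEN OTHERS.
        "other types are not covered by this sketch
        CLEAR cv_ddl.
        RETURN.
    ENDCASE.

    "append the quoted column name followed by its type
    CONCATENATE '"' <ls_comp>-name '"' INTO lv_column.
    CONCATENATE lv_column lv_type INTO lv_column SEPARATED BY space.
    IF lv_columns IS INITIAL.
      lv_columns = lv_column.
    ELSE.
      CONCATENATE lv_columns ',' INTO lv_columns.
      CONCATENATE lv_columns lv_column INTO lv_columns SEPARATED BY space.
    ENDIF.
  ENDLOOP.

  "a column table without a primary key
  CONCATENATE 'CREATE COLUMN TABLE "' iv_table_name '" (' INTO cv_ddl.
  CONCATENATE cv_ddl lv_columns ')' INTO cv_ddl SEPARATED BY space.

ENDFORM.
```

The resulting statement can then be passed to execute_ddl of a cl_sql_statement instance created on the HCP connection. Scanning every string column costs one extra pass over the data per column, but it keeps the columns as narrow as the content allows.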

Please note that upload speed suffers significantly if you insert the content line by line, which is the case if cl_sql_statement does not support set_param_table on your release. This was the case on my system, so I had to build that functionality myself.
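
On releases where set_param_table is available, a mass insert can be sketched as follows. The FORM name and its parameters are made up for illustration; the components of the internal table are bound to the "?" placeholders in order, so they have to match the column order of the target table.

```abap
FORM mass_insert USING io_con          TYPE REF TO cl_sql_connection
                       iv_table_name   TYPE string
                       iv_column_count TYPE i
                       it_data         TYPE STANDARD TABLE
              CHANGING cv_rows         TYPE i
              RAISING  cx_sql_exception cx_parameter_invalid.
*"--- DATA DEFINITION -------------------------------------------------
  DATA: lo_stmt         TYPE REF TO cl_sql_statement.
  DATA: lr_data         TYPE REF TO data.
  DATA: lv_placeholders TYPE string.
  DATA: lv_sql          TYPE string.

*"--- PROCESSING LOGIC ------------------------------------------------
  "build one placeholder per column, e.g. ?,?,?
  DO iv_column_count TIMES.
    IF lv_placeholders IS INITIAL.
      lv_placeholders = '?'.
    ELSE.
      CONCATENATE lv_placeholders ',?' INTO lv_placeholders.
    ENDIF.
  ENDDO.

  CONCATENATE 'INSERT INTO "' iv_table_name '" VALUES (' lv_placeholders ')'
         INTO lv_sql.

  "bind the whole internal table once instead of one row at a time
  GET REFERENCE OF it_data INTO lr_data.
  lo_stmt = io_con->create_statement( ).
  lo_stmt->set_param_table( lr_data ).

  "a single call transfers all rows and returns the number processed
  cv_rows = lo_stmt->execute_update( lv_sql ).
  io_con->commit( ).

ENDFORM.
```

Without set_param_table, the equivalent is a loop that binds each field with set_param and calls execute_update per row, which is exactly the slow path described above.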

After that, it is finally time to find the key influencers that affect the runtime of the SNP System Scan the most...

FORM find_key_influencers USING ir_ml TYPE REF TO /snp/hcp01_cl_ml
                                iv_dataset_id TYPE i.
*"--- DATA DEFINITION -------------------------------------------------
  DATA: lt_key_influencers TYPE /snp/hcp00_tab_key_influencer.

*"--- PROCESSING LOGIC ------------------------------------------------
  "...introspect the model, e.g. finding the features (=columns) that influence
  "   a given figure (=target column) in descending order
  lt_key_influencers = ir_ml->get_key_influencers(

    "which dataset should be inspected
    iv_dataset_id = iv_dataset_id

    "what is the target column, for which the key influencers
    "should be calculated
    iv_target_column = 'RUNTIME_MINUTES'

    "how many key influencers should be calculated?
    iv_number_of_influencers = 5

  ).

  "DATABASE_SIZE:    37% Influence
  "DATABASE:         23% Influence (e.g. ORACLE 12, SAP HANA DB etc.)
  "SCAN_VERSION:     15% Influence
  "OPERATING_SYSTEM: 10% Influence
  "SAP_SYSTEM_TYPE:   5% Influence (e.g. SAP R/3 4.6c; SAP ECC 6.0; S/4 HANA 16.10 etc.)

  "...remove dataset afterwards
  ir_ml->remove_dataset( iv_dataset_id ).

ENDFORM.


As mentioned above, database size, database vendor and scan version were not a surprise. I didn't think that the operating system would have such a big influence, as SAP NetWeaver is abstracting that away. I expected the SAP system type to have more of an influence, as I figured that different data models would have a bigger impact on performance. So all in all not many surprises, but then again, that makes the model trustworthy...

Challenges

Along the way I have found some challenges.


  • Authentication: I always seem to have a problem finding the simplest things, like which flags to set in the authentication mechanism. Just make sure to switch on "Trusted SAML 2.0 identity provider", "User name and password", "Client Certificate" and "Application-to-Application SSO" on the FORM card of the Authentication Configuration and do not waste hours like me.
  • Upload Speed: As stated above, if you insert the contents of your internal table line by line, performance suffers significantly. On the other hand, inserting several hundred thousand records was not much of a problem once you tap into mass insert/update. It may not be available in your ADBC implementation, depending on the version of your SAP NetWeaver stack, so consider backporting it from a newer system. It's definitely worth it.
  • Table creation: I am a big fan of dynamic programming despite the performance penalties it sometimes has. However, when you create database tables to persist your dataset in your HCP HANA database, you have to make sure that the columns are as narrow as possible for good performance or even for the relevance of your results.

Features of SAP HCP Predictive Analysis


  • Key Influencers: This is the use case that I have shown in this article
  • Scoring Equation: You can get the code that calculates the predictions either as an SQL query executable on any HANA database or as a score card. The former is basically a decision tree, which can easily be transpiled into other languages and thereby be used for on-premise deliveries. On the other hand, this shows that the mechanics underneath the SAP HCP Predictive Analysis application are currently quite simple, which I will dig into more in the conclusion below
  • Forecast: based on a time based column you can predict future values
  • Outliers: You can find exceptions in your dataset. While key influencers focus on the structure, as they represent the columns that influence a result, outliers show the rows that are exceptional with respect to the calculated equation.
  • WhatIf: Simulates a planned action and returns the significant changes resulting from it
  • Recommendations: Finding interesting products based on a purchase history by product and/or user. This can also be transferred to other recommendation scenarios.


Conclusion

So after this rather lengthy article I have come to a conclusion about SAP HCP Predictive Services, especially compared to AWS Machine Learning capabilities:

Pros
  • Business oriented scenarios: You definitely do not have to think as much about good use cases. The API presents them to you, as shown in the "Features" section above.
  • Fast: Calculation is really fast. Predictions are available almost instantaneously, especially when benchmarked against AWS, where building a datasource and training a model took a good 10 minutes. But do not overestimate this, as the Cons will show you.
  • Introspection: Many services are about looking into the scenario. AWS is just about prediction at the moment. This transparency about the dependencies inside the dataset was the most interesting part for me.
  • On-premise delivery of models via decision trees: The fact that models are exposed as executable decision trees that can easily be transpiled into any other programming language makes on-premise delivery possible. Prediction is basically effortless after doing so. But then again, you have to manage the model life cycle and how updates to it are rolled out.
Cons
  • Higher Cost of Infrastructure: At least on the fixed cost side, a productive HCP account is not cheap. But then again, there is no variable cost if you are able to deploy your models on premise.
  • Only Regression: Currently target figures have to be numeric. So only regression problems can be solved, no classification problems. Of course HANA also has natural language processing on board, but this is not available for machine learning purposes per se.
  • Few options for manipulating learning: You just register a model. Nothing is said about how training and validation are to be performed, how to normalize data on the platform and so on
  • Trading speed for quality: As stated, registering a model is fast, and introspecting it is also very fast. But then again, I got different results from the same dataset when I sorted it differently. Consistently. And not just offsetting the model by 2%, but big time. For example, when sorting my dataset differently, the key influencers turned out to be completely different ones. This is actually quite concerning. Maybe I am missing something, but maybe this is why training AWS models takes significantly longer: they scramble datasets and run multiple passes over them to determine the best and most stable model.
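
The sort-order effect from the last point can be checked with a small experiment built from the same API calls used earlier in this article. This is a minimal sketch: the FORM name is made up, the is_ready check from the main routine is left out for brevity, and comparing the two result tables for full equality is a deliberately strict assumption (it also reacts to small shifts in the percentages).

```abap
FORM check_order_stability USING ir_ml   TYPE REF TO /snp/hcp01_cl_ml
                        CHANGING ct_data TYPE STANDARD TABLE.
*"--- DATA DEFINITION -------------------------------------------------
  DATA: lt_asc        TYPE /snp/hcp00_tab_key_influencer.
  DATA: lt_desc       TYPE /snp/hcp00_tab_key_influencer.
  DATA: lv_dataset_id TYPE i.
  DATA: lv_column     TYPE string VALUE 'RUNTIME_MINUTES'.

*"--- PROCESSING LOGIC ------------------------------------------------
  "first pass: records sorted ascending by runtime
  SORT ct_data BY (lv_column) ASCENDING.
  lv_dataset_id = ir_ml->create_dataset( it_table = ct_data ).
  lt_asc = ir_ml->get_key_influencers(
    iv_dataset_id            = lv_dataset_id
    iv_target_column         = 'RUNTIME_MINUTES'
    iv_number_of_influencers = 5
  ).
  ir_ml->remove_dataset( lv_dataset_id ).

  "second pass: exactly the same records, sorted descending
  SORT ct_data BY (lv_column) DESCENDING.
  lv_dataset_id = ir_ml->create_dataset( it_table = ct_data ).
  lt_desc = ir_ml->get_key_influencers(
    iv_dataset_id            = lv_dataset_id
    iv_target_column         = 'RUNTIME_MINUTES'
    iv_number_of_influencers = 5
  ).
  ir_ml->remove_dataset( lv_dataset_id ).

  "a stable model should not care about the order of its input rows
  IF lt_asc = lt_desc.
    MESSAGE 'Key influencers are stable under re-sorting' TYPE 'S'.
  ELSE.
    MESSAGE 'Key influencers depend on the sort order' TYPE 'S' DISPLAY LIKE 'W'.
  ENDIF.

ENDFORM.
```

Running such a check before trusting a result is cheap, because registering and introspecting a dataset only takes moments.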
While SAP HCP Predictive Services looks very promising, has good use cases and is appealing especially for its transparency, its stability and reliability have to improve before it is safe to rely on it for business decisions. Then again, I only have an HCP trial account at the moment, so maybe this is intentional. Let's see how predictive services on-premise on a HANA 2.0 database are doing...
