Thursday, January 05, 2017

Creating AWS Machine Learning Models from ABAP

Hi guys,

Extending my previous article about "Using AWS Machine Learning from ABAP to predict runtimes", I have now been able to enhance the ABAP-based API so that it can create models from ABAP internal tables (which are like collections of records, for the non-ABAPers ;-).

This basically enables ABAP developers to use Machine Learning through the full cycle without ever having to leave their home turf or worry about the specifics of the AWS Machine Learning implementation.

My use case is still the same: predicting runtimes of the SNP System Scan based on well-known parameters like database vendor (e.g. Oracle, MaxDB), database size, SNP System Scan version and others. But since my first model was not quite meeting my expectations, I wanted to be able to play around easily, adding and removing attributes from the model with a nice ABAP-centric workflow. Such a workflow is probably also the most effective way for other ABAP developers to utilize Machine Learning. So let's take a look at the basic structure of the example program:

REPORT /snp/aws01_ml_create_model.

START-OF-SELECTION.
  PERFORM main.

FORM main.
*"--- DATA DEFINITION -------------------------------------------------
  DATA: lr_scan_data TYPE REF TO data.
  DATA: lr_prepared_data TYPE REF TO data.
  DATA: lr_ml TYPE REF TO /snp/aws00_cl_ml.
  DATA: lv_model_id TYPE string.
  DATA: lr_ex TYPE REF TO cx_root.
  DATA: lv_msg TYPE string.

  FIELD-SYMBOLS: <lt_data> TYPE table.

*"--- PROCESSING LOGIC ------------------------------------------------
  TRY.
      "fetch the data into an internal table
      PERFORM get_system_scan_data CHANGING lr_scan_data.
      ASSIGN lr_scan_data->* TO <lt_data>.

      "prepare data (e.g. convert, select features)
      PERFORM prepare_data USING <lt_data> CHANGING lr_prepared_data.
      ASSIGN lr_prepared_data->* TO <lt_data>.

      "create a model
      CREATE OBJECT lr_ml.
      PERFORM create_model USING lr_ml <lt_data> CHANGING lv_model_id.

      "check if...
      IF lr_ml->is_ready( lv_model_id ) = abap_true.

        "...creation was successful
        lv_msg = /snp/cn00_cl_string_utils=>text( iv_text = 'Model &1 is ready' iv_1 = lv_model_id ).
        MESSAGE lv_msg TYPE 'S'.

      ELSEIF lr_ml->is_failed( lv_model_id ) = abap_true.

        "...creation failed
        lv_msg = /snp/cn00_cl_string_utils=>text( iv_text = 'Model &1 has failed' iv_1 = lv_model_id ).
        MESSAGE lv_msg TYPE 'S' DISPLAY LIKE 'E'.

      ENDIF.

    CATCH cx_root INTO lr_ex.

      "output errors
      lv_msg = lr_ex->get_text( ).
      PERFORM display_lines USING lv_msg.

  ENDTRY.

ENDFORM.

And now let's break it down into its individual parts:

Fetch Data into an Internal Table

In my particular case I was fetching the data via a REST service from the SNP Data Cockpit instance I am using to keep statistics on all executed SNP System Scans. However, you can fetch the data that will serve as the data source for your model in any way you like. Most probably you will be using Open SQL SELECTs. The resulting data looks somewhat like this:

Prepare Data

This is the raw data, and it's not perfect! In the shape that it's in, the data quality is not quite good enough. According to this article, there are some improvements I need to make in order to raise its quality:
  • Normalizing values (e.g. lower casing, mapping values or clustering values). E.g.
    • Combining the database vendor and the major version of the database, because those two values only make sense when treated in combination and not individually
    • Clustering the database size into 1.5 TB chunks, as these values can be guessed more easily when executing predictions
    • Clustering the runtime into exponentially increasing categories (although this may also hurt accuracy...)
  • Filling up empty values with reasonable defaults. E.g.
    • Treating all unknown SAP client types as test clients
  • Making values and field names more human-readable. This is not necessary for the machine learning algorithms, but it makes manual interpretation of the results easier
  • Removing fields that do not make good features, like
    • IDs
    • Fields that cannot be provided for later predictions, because their values cannot be determined easily or intuitively
  • Removing records that still do not have good data quality. E.g. missing values in
    • database vendors
    • SAP system types
    • customer industry
  • Removing records that are not representative. E.g.
    • Scans with exceptionally short runtimes, probably due to an intentionally limited scope
    • Small database sizes, which probably belong to non-productive systems

FORM prepare_data USING it_data TYPE table CHANGING rr_data TYPE REF TO data.
*"--- DATA DEFINITION -------------------------------------------------
  DATA: lr_q TYPE REF TO /snp/cn01_cl_itab_query.

*"--- PROCESSING LOGIC ------------------------------------------------
  CREATE OBJECT lr_q.

  "selecting the fields that make good features
  lr_q->select( iv_field = 'COMP_VERSION'      iv_alias = 'SAP_SYSTEM_TYPE' ).
  lr_q->select( iv_field = 'DATABASE'          iv_uses_fields = 'NAME,VERSION' iv_cb_program = sy-repid iv_cb_form = 'ON_VIRTUAL_FIELD' ).
  lr_q->select( iv_field = 'DATABASE_SIZE'     iv_uses_fields = 'DB_USED' iv_cb_program = sy-repid iv_cb_form = 'ON_VIRTUAL_FIELD' ).
  lr_q->select( iv_field = 'OS'                iv_alias = 'OPERATING_SYSTEM' ).
  lr_q->select( iv_field = 'SAP_CLIENT_TYPE'   iv_uses_fields = 'CCCATEGORY' iv_cb_program = sy-repid iv_cb_form = 'ON_VIRTUAL_FIELD' ).
  lr_q->select( iv_field = 'COMPANY_INDUSTRY1' iv_alias = 'INDUSTRY' ).
  lr_q->select( iv_field = 'IS_UNICODE'        iv_cb_program = sy-repid iv_cb_form = 'ON_VIRTUAL_FIELD' ).
  lr_q->select( iv_field = 'SCAN_VERSION' ).
  lr_q->select( iv_field = 'RUNTIME'           iv_uses_fields = 'RUNTIME_HOURS' iv_cb_program = sy-repid iv_cb_form = 'ON_VIRTUAL_FIELD' ).

  "perform the query on the defined internal table
  lr_q->from( it_data ).

  "filter records that are not good for results
  lr_q->filter( iv_field = 'DATABASE'         iv_filter = '-' ). "no empty values in the database
  lr_q->filter( iv_field = 'SAP_SYSTEM_TYPE'  iv_filter = '-' ). "no empty values in the SAP System Type
  lr_q->filter( iv_field = 'INDUSTRY'         iv_filter = '-' ). "no empty values in the Industry
  lr_q->filter( iv_field = 'RUNTIME_MINUTES'  iv_filter = '>=10' ). "Minimum of 10 minutes runtime
  lr_q->filter( iv_field = 'DATABASE_GB_SIZE' iv_filter = '>=50' ). "Minimum of 50 GB database size

  "sort by runtime
  lr_q->sort( 'RUNTIME_MINUTES' ).

  "execute the query
  rr_data = lr_q->run( ).

ENDFORM.

Basically the magic is done using the /SNP/CN01_CL_ITAB_QUERY class, which is part of the SNP Transformation Backbone framework. It provides SQL-like query capabilities on ABAP internal tables. This includes transforming field values, which is done using callback mechanisms.


FORM on_virtual_field USING iv_field is_record TYPE any CHANGING cv_value TYPE any.

  "...

  CASE iv_field.
    WHEN 'DATABASE'.

      "combine database name and major version to one value
      mac_get_field 'NAME' lv_database.
      mac_get_field 'VERSION' lv_database_version.
      SPLIT lv_database_version AT '.' INTO lv_database_version lv_tmp.
      CONCATENATE lv_database lv_database_version INTO cv_value SEPARATED BY space.

    WHEN 'DATABASE_SIZE'.

      "categorize the database size into 1.5 TB chunks (e.g. "up to 4.5 TB")
      mac_get_field 'DB_USED' cv_value.
      lv_p = ( floor( cv_value / 1500 ) + 1 ) * '1.5'. "simple round to full 1.5TB chunks
      cv_value = /snp/cn00_cl_string_utils=>text( iv_text = 'up to &1 TB' iv_1 = lv_p ).
      TRANSLATE cv_value USING ',.'. "translate commas to dots so the CSV does not get confused

    WHEN 'SAP_CLIENT_TYPE'.

      "fill up the client category type with a default value
      mac_get_field 'CCCATEGORY' cv_value.
      IF cv_value IS INITIAL.
        cv_value = 'T'. "default to (T)est SAP client
      ENDIF.

    WHEN 'IS_UNICODE'.

      "convert the unicode flag into more human readable values
      IF cv_value = abap_true.
        cv_value = 'unicode'.
      ELSE.
        cv_value = 'non-unicode'.
      ENDIF.

    WHEN 'RUNTIME'.

      "categorize the runtime into human readable chunks
      mac_get_field 'RUNTIME_HOURS' lv_int.
      IF lv_int <= 1.
        cv_value = 'up to 1 hour'.
      ELSEIF lv_int <= 2.
        cv_value = 'up to 2 hours'.
      ELSEIF lv_int <= 3.
        cv_value = 'up to 3 hours'.
      ELSEIF lv_int <= 4.
        cv_value = 'up to 4 hours'.
      ELSEIF lv_int <= 5.
        cv_value = 'up to 5 hours'.
      ELSEIF lv_int <= 6.
        cv_value = 'up to 6 hours'.
      ELSEIF lv_int <= 12.
        cv_value = 'up to 12 hours'.
      ELSEIF lv_int <= 24.
        cv_value = 'up to 1 day'.
      ELSEIF lv_int <= 48.
        cv_value = 'up to 2 days'.
      ELSEIF lv_int <= 72.
        cv_value = 'up to 3 days'.
      ELSE.
        cv_value = 'more than 3 days'.
      ENDIF.

  ENDCASE.

ENDFORM.
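
How the query class invokes such a form internally is not shown here. Since the callback is registered by program and form name, a dynamic PERFORM is one standard ABAP way to do it. The following is only a minimal sketch of the calling side under that assumption: the report name, the record layout and the sample values are made up for illustration, and the callback is reduced to a single branch.

```abap
REPORT zdemo_dynamic_callback.

* Hypothetical record layout; only the field names NAME and VERSION
* are borrowed from the example above, the values are made up
TYPES: BEGIN OF ty_record,
         name    TYPE string,
         version TYPE string,
       END OF ty_record.

START-OF-SELECTION.
  PERFORM main.

FORM main.
  DATA: ls_record  TYPE ty_record,
        lv_value   TYPE string,
        lv_program TYPE sy-repid,
        lv_form    TYPE c LENGTH 30 VALUE 'ON_VIRTUAL_FIELD'.

  ls_record-name    = 'Oracle'.
  ls_record-version = '11.2.0.4'.
  lv_program        = sy-repid.

  "call the form by name at runtime; the name has to be in upper case
  "and IF FOUND skips the call if the form does not exist
  PERFORM (lv_form) IN PROGRAM (lv_program) IF FOUND
    USING 'DATABASE' ls_record
    CHANGING lv_value.

  WRITE: / lv_value.
ENDFORM.

FORM on_virtual_field USING iv_field  TYPE csequence
                            is_record TYPE any
                   CHANGING cv_value  TYPE any.
  FIELD-SYMBOLS: <lv_name> TYPE any.

  CASE iv_field.
    WHEN 'DATABASE'.
      "read a component of the generically typed record by its name
      ASSIGN COMPONENT 'NAME' OF STRUCTURE is_record TO <lv_name>.
      IF sy-subrc = 0.
        cv_value = <lv_name>.
      ENDIF.
  ENDCASE.
ENDFORM.
```

Because the form is resolved by name at runtime, the callback can live in the calling report, which is presumably why sy-repid is passed as the callback program in the query definition.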

After running all those preparations, the data is transformed into a record set that looks like this:


Create a Model

OK, preparing data for a model is something the developer has to do for each individual problem to be solved. But I guess this is done better in a well-known environment. After all, this is the whole purpose of the ABAP API. Now we get to the part that's easy, as creating the model from the internal table we have prepared so far is fully automated. As a developer you are completely relieved of the following tasks:

  • Converting the internal table into CSV
  • Uploading it into an AWS S3 bucket and assigning the correct privileges, so it can be used for machine learning
  • Creating a data source based on the just uploaded AWS S3 object and providing the input schema (e.g. which fields are category fields, which ones are numeric etc.). This information can be derived automatically from DDIC information
  • Creating a model from the datasource
  • Training the model
  • Creating a URL endpoint so the model can be used for predictions, as seen in the previous article.
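
To give an impression of what just the first of these steps involves when done by hand, here is a rough sketch of turning a generic internal table into CSV lines using RTTI. It is not the implementation behind the API. The report, the line type and the sample record are hypothetical, the sketch assumes a flat line type with elementary fields, and it does not quote or escape values that contain commas.

```abap
REPORT zdemo_itab_to_csv.

TYPE-POOLS: abap. "only needed on older releases

* Hypothetical line type and sample values, for illustration only
TYPES: BEGIN OF ty_scan,
         db_name TYPE string,
         db_size TYPE string,
         runtime TYPE string,
       END OF ty_scan.

START-OF-SELECTION.
  PERFORM main.

FORM main.
  DATA: lt_scan TYPE STANDARD TABLE OF ty_scan WITH DEFAULT KEY,
        ls_scan TYPE ty_scan,
        lt_csv  TYPE string_table,
        lv_line TYPE string.

  ls_scan-db_name = 'Oracle 11'.
  ls_scan-db_size = 'up to 4.5 TB'.
  ls_scan-runtime = 'up to 6 hours'.
  APPEND ls_scan TO lt_scan.

  PERFORM itab_to_csv USING lt_scan CHANGING lt_csv.

  LOOP AT lt_csv INTO lv_line.
    WRITE: / lv_line.
  ENDLOOP.
ENDFORM.

FORM itab_to_csv USING it_table TYPE ANY TABLE
              CHANGING ct_csv   TYPE string_table.
  DATA: lr_table  TYPE REF TO cl_abap_tabledescr,
        lr_struct TYPE REF TO cl_abap_structdescr,
        lv_line   TYPE string,
        lv_sep    TYPE string,
        lv_value  TYPE string.
  FIELD-SYMBOLS: <ls_comp>   TYPE abap_compdescr,
                 <ls_record> TYPE any,
                 <lv_field>  TYPE any.

  "determine the columns of the line type at runtime
  "(the cast fails if the line type is not a structure)
  lr_table  ?= cl_abap_typedescr=>describe_by_data( it_table ).
  lr_struct ?= lr_table->get_table_line_type( ).

  "header line with the field names
  LOOP AT lr_struct->components ASSIGNING <ls_comp>.
    CONCATENATE lv_line lv_sep <ls_comp>-name INTO lv_line.
    lv_sep = ','.
  ENDLOOP.
  APPEND lv_line TO ct_csv.

  "one CSV line per record
  LOOP AT it_table ASSIGNING <ls_record>.
    CLEAR: lv_line, lv_sep.
    LOOP AT lr_struct->components ASSIGNING <ls_comp>.
      ASSIGN COMPONENT <ls_comp>-name OF STRUCTURE <ls_record> TO <lv_field>.
      lv_value = <lv_field>.
      CONCATENATE lv_line lv_sep lv_value INTO lv_line.
      lv_sep = ','.
    ENDLOOP.
    APPEND lv_line TO ct_csv.
  ENDLOOP.
ENDFORM.
```

The same component list also carries a type kind per field, which is the kind of information an input schema (categorical versus numeric) could be derived from.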

That's quite a lot of stuff that you do not need to do anymore. Doing all this is just one API call away:

FORM create_model USING ir_aws_machine_learning TYPE REF TO /snp/aws00_cl_ml
                        it_table TYPE table
               CHANGING rv_model_id.

  rv_model_id = ir_aws_machine_learning->create_model(

    "...by creating a CSV file from an internal table
    "  and upload it to AWS S3, so it can be used
    "  as a machine learning data source
    it_table = it_table

    "...by defining a target field that is used
    iv_target_field = 'RUNTIME'

    "...(optional) by defining a title
    iv_title = 'Model for SNP System Scan Runtimes'

    "...(optional) to create an endpoint, so the model
    "  can be used for predictions. This defaults to
    "  true, but you may want to switch it off

    " IV_CREATE_ENDPOINT = ABAP_FALSE

    "...(optional) by defining fields that should be
    "  treated as text rather than as a category.
    "  By default all character based fields are treated
    "  as categorical fields

    " IV_TEXT_FIELDS = 'COMMA,SEPARATED,LIST,OF,FIELDNAMES'

    "...(optional) by defining fields that should be
    "  treated as numerical fields rather than categorical
    "  fields. By default the type will be derived from the
    "  underlying data type, but for convenience reasons
    "  you may want to use this instead of creating and
    "  filling a completely new structure

    " IV_NUMERIC_FIELDS = 'COMMA,SEPARATED,LIST,OF,FIELDNAMES'

    "...(optional) by defining if you want to create the model
    "  synchronously or asynchronously. By default the
    "  datasource, model, evaluation and endpoint are created
    "  synchronously so that after returning from the method call
    "  you can immediately start with predictions.

    " IV_WAIT = ABAP_TRUE by default
    " IV_SHOW_PROGRESS = ABAP_TRUE by default
    " IV_REFRESH_RATE_IN_SECS = 5 seconds by default

  ).

ENDFORM.

As you see, most stuff is optional. Sane default values are provided that assume everything happens synchronously: uploading the data and creating the datasource, the model, the training and the endpoint. So you can directly perform predictions afterwards. Creating all of this in an asynchronous fashion is also possible, in case you do not need to perform predictions right away. After all, the whole process takes 10 to 15 minutes, which is why showing progress becomes important, especially since you do not want to run into timeout situations when doing this in online mode with a GUI connected.
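
An asynchronous variant could look roughly like the following sketch. It only uses the parameters and methods that appear in the listings above (iv_wait, is_ready, is_failed) and assumes that create_model returns the model ID right away when iv_wait is switched off. The form name, the polling interval and the upper limit of polls are made up for illustration.

```abap
FORM create_model_async USING ir_ml       TYPE REF TO /snp/aws00_cl_ml
                              it_table    TYPE table
                     CHANGING cv_model_id TYPE string.

  "hypothetical limit: 90 polls x 10 seconds = 15 minutes
  CONSTANTS: lc_max_polls TYPE i VALUE 90.

  "start the creation without blocking (assumed behaviour of iv_wait)
  cv_model_id = ir_ml->create_model(
    it_table        = it_table
    iv_target_field = 'RUNTIME'
    iv_wait         = abap_false ).

  "...do other work here...

  "poll until the model is ready, has failed, or the limit is reached
  DO lc_max_polls TIMES.
    IF ir_ml->is_ready( cv_model_id ) = abap_true
    OR ir_ml->is_failed( cv_model_id ) = abap_true.
      EXIT.
    ENDIF.
    WAIT UP TO 10 SECONDS.
  ENDDO.

ENDFORM.
```

In a background job the polling loop is harmless. In online mode it would run into the same timeout concerns as the synchronous call, so there it makes more sense to store the model ID and check it again later.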

The Result

After all is done, you can perform predictions. Right, let's just hop over to the AWS Machine Learning console and see the results:

A CSV file was created in an AWS S3 bucket...


...then a datasource, ML model and an evaluation for training the model were created (also an endpoint, but the screenshot does not show it) ...


...and finally we can inspect the model performance.

Conclusion

This is a big step towards making Machine Learning available to many without the explicit need to cope with vendor-specific aspects. However, understanding the principles of machine learning is still a requirement, especially with regard to the kinds of problems you can apply it to and what good data quality means for good predictions.
