Introduction

Last updated: 2020-07-31

Although machine learning algorithms are widely used in extremely diverse situations, in practice one or more major limitations almost invariably appear and significantly constrain successful applications. Frequently, these problems are associated with large increases in the rate at which data is generated, the quantity of data, and the number of attributes (variables) to be processed. Increasingly, the data situation is beyond the capabilities of conventional data mining methods.

The term “Real Time” describes how well a machine learning algorithm can accommodate an ever-increasing data load instantaneously. Such real-time problems are usually closely coupled with the fact that conventional algorithms operate in batch mode, which requires having all of the relevant data at once. As a real-time machine learning toolbox, Xarang has the following characteristics, independent of the amount of data involved:

  1. Incremental learning (Learn): Immediately updating a model with each new observation without the necessity of pooling new data with old data.
  2. Decremental learning (Forget): Immediately updating a model by excluding observations identified as adversely affecting model performance without forming a new dataset omitting this data and returning to the model formulation step.
  3. Variable addition (Grow): Adding a new attribute (variable) on the fly, without the necessity of pooling new data with old data.
  4. Variable deletion (Shrink): Immediately discontinuing use of an attribute identified as adversely affecting model performance.
  5. Distributed processing: Separately processing distributed data or segments of large data (that may be located in diverse geographic locations) and re-combining the results to obtain a single model.
  6. Parallel processing: Carrying out parallel processing extremely rapidly from multiple conventional processing units (multi-threads, multi-processors or a specialized chip).
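
The learn, forget, and re-combine operations listed above can be illustrated with running summary statistics for a single numeric attribute. This is a minimal sketch under stated assumptions, not Xarang's implementation: the class name `RunningStats` and its methods are hypothetical, and the manual does not describe which internal statistics Xarang actually maintains.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Hypothetical summary (count, sum, sum of squares) of one numeric attribute."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def learn(self, x: float) -> None:
        # Incremental step: fold one new observation into the summary.
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def forget(self, x: float) -> None:
        # Decremental step: remove one previously learned observation.
        self.n -= 1
        self.total -= x
        self.total_sq -= x * x

    def merge(self, other: "RunningStats") -> "RunningStats":
        # Distributed step: combine summaries built on separate data segments.
        return RunningStats(self.n + other.n,
                            self.total + other.total,
                            self.total_sq + other.total_sq)

    @property
    def mean(self) -> float:
        return self.total / self.n

    @property
    def variance(self) -> float:
        # Sample variance (n - 1 in the denominator).
        return (self.total_sq - self.total ** 2 / self.n) / (self.n - 1)


# Segment A learns four values, then forgets an outlier.
segment_a = RunningStats()
for value in [1.0, 2.0, 3.0, 100.0]:
    segment_a.learn(value)
segment_a.forget(100.0)

# Segment B is processed separately and merged afterwards.
segment_b = RunningStats()
for value in [4.0, 5.0]:
    segment_b.learn(value)

combined = segment_a.merge(segment_b)
```

None of these operations revisits the raw data, which is the property the list above describes. The sum-of-squares form is chosen here for brevity; it can lose precision on large or widely spread values.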

Project

The Project tab is the first screen of Xarang. We can create a new project, open or save an existing project, or merge multiple projects. Projects are stored as a set of files.

Creating a new project

  1. To create a new project, click the "New" project icon.
  2. Select a folder.
  3. Enter project name in the "File name" box.
  4. Click "Save".

Open an existing project

  1. To open an existing project, click on the Open icon.
  2. Select the project file (e.g., GSE73002.ilm).
  3. Click the "Open" button.

Save As project

  1. To save an existing project under a different name, click on the Save icon.
  2. Type a new project name (e.g., NewModel) and click OK.

Merge projects

  1. To merge two or more existing projects, click on the Merge icon.
  2. Click on the Open Folder icon to open the "Browse For Folder" dialog box.
  3. Select the folder that contains the projects you would like to merge and click OK.

Data

The Data tab allows you to load data from local files, databases, or the Cloud.

Load a single local flat file

  1. To load a local dataset, click the Open File button.
  2. Select the data file.
  3. Click Open.

Load multiple files from a folder

  1. To load more than one dataset, click the Open Folder button.
  2. Click the Open button and select the folder that contains the data files.
  3. Select one or more files.
  4. Click OK.

Load from a database

  1. To load data from a database, click the Open Database button.
  2. Enter the DSN, User ID/ Password (if needed) in the related fields and click
  3. Select one or more tables from the list. All tables must have the same schema.
  4. Click OK.

Load from the cloud

  1. To load data from the cloud, click the Open Cloud button.
  2. Enter the AWS Access Key and AWS Secret Key.
  3. Click the Connect button.
  4. Select a Bucket.
  5. Select a Folder.
  6. Select one or more files. All files must have the same schema.
  7. Click OK.

Learner

  1. Select the "Target" variable.
  2. Click to guess the type of each variable. You can change the type of a variable by right-clicking on that variable.
  3. Select "Ignore" if you would like to exclude a variable from the learning process.
  4. You can also load a saved schema file by clicking on .
  5. Click the Start Learner button.
  6. Stop the process at any time by clicking the Stop button.
  7. You can roll back the learning (called forgetting) by checking the Decremental checkbox and then clicking the Start button again.

Explorer

The Explorer tab allows you to explore the data using statistical analysis and visualization techniques.

  1. Univariate (descriptive statistics)
  2. Bivariate (inferential statistics)
  3. Multivariate (factor analysis)

Explorer - Univariate

Univariate data analysis explores attributes (variables) one by one using statistical analysis. Attributes are either numerical or categorical (encoded to binary).

  1. Select the Univariate tab.
  2. Select one or more variables from the Variable 1 list.
  3. To view only Binary or Numeric variables, click All, then Binary or Numeric.

Explorer - Bivariate

Bivariate analysis is the simultaneous analysis of two variables (attributes). It explores the concept of relationship between two variables, whether there exists an association and the strength of this association, or whether there are differences between two variables and the significance of these differences. There are four types of bivariate analysis.

  1. Correlation
  2. Hypothesis Testing
  3. ANOVA
  4. Test of Independence

Correlation

Linear correlation quantifies the strength of a linear relationship between two numerical variables. When there is no correlation between two variables, there is no tendency for the values of one quantity to increase or decrease with the values of the second quantity.

  1. Select the Correlation tab.
  2. Select one or more variables from the Variable 1 list.
  3. Select the second variable from the Variable 2 list.
  4. Click to visualize the result.
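
As an illustration of what a linear correlation measures, the Pearson correlation coefficient can be computed by hand as follows. This is a minimal sketch: the function name `pearson_r` and the sample values are made up for the example, and the manual does not state which correlation formula Xarang uses.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of the two variables around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Variation of each variable around its own mean.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r_up = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])    # perfectly increasing
r_down = pearson_r([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])  # perfectly decreasing
```

A value near +1 or -1 indicates a strong linear relationship, and a value near 0 indicates none.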

Hypothesis Testing

  1. Select the Hypothesis Testing tab.
  2. Select a numerical variable from the Variable 2 list. Note: To view only Binary or Numeric variables, click All, then Binary or Numeric.
  3. Select Binary Variable 1 and, if necessary, Binary Variable 2 from the drop-down lists.
  4. Click the Z Test, T Test, or F Test button. The related result will be displayed.
  5. Click to visualize the result.

Z Test

The Z test assesses whether the difference between the averages of two attributes is statistically significant. This analysis is appropriate for comparing the average of a numerical attribute with a known average, or for comparing two conditional averages of a numerical attribute given two binary attributes (two categories of the same categorical attribute).
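
The comparison of two conditional averages can be sketched from summary statistics alone. The function name `z_test_two_means` and the numbers are illustrative assumptions, and the two-sided p-value is one common convention that the manual does not confirm for Xarang.

```python
import math
from statistics import NormalDist


def z_test_two_means(mean1, var1, n1, mean2, var2, n2):
    """Two-sample Z statistic and two-sided p-value from group summaries."""
    # Standard error of the difference between the two averages.
    std_error = math.sqrt(var1 / n1 + var2 / n2)
    z = (mean1 - mean2) / std_error
    # Probability of a difference at least this large under the null hypothesis.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical groups: 100 observations each, variance 100, means 52 and 50.
z, p_value = z_test_two_means(52.0, 100.0, 100, 50.0, 100.0, 100)
```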

T Test

Like the Z test, the T test assesses whether the averages of two numerical attributes are statistically different from each other, and it is used when the number of data points is less than 30. The T test is appropriate for comparing the average of a numerical attribute with a known average, or for comparing two conditional averages of a numerical attribute given two binary attributes (two categories of the same categorical attribute).
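
For small samples, the test statistic can be sketched as below. This example assumes the pooled-variance (Student) form of the two-sample T test; the manual does not say which variant Xarang applies, and `pooled_t_statistic` is a hypothetical name. The statistic is then compared against a t distribution with the returned degrees of freedom.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(sample1, sample2):
    """Two-sample t statistic with pooled variance, plus degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    df = n1 + n2 - 2
    # Weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / df
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean(sample1) - mean(sample2)) / std_error
    return t, df


t, df = pooled_t_statistic([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```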

F Test

The F test is used to compare the variances of two attributes. It can be used for comparing the variance of a numerical attribute with a known variance, or for comparing two conditional variances of a numerical attribute given two binary attributes (two categories of the same categorical attribute).
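
A minimal sketch of the variance comparison follows. The function name `f_statistic` is hypothetical, and placing the larger variance in the numerator is a common convention assumed here rather than something the manual specifies.

```python
from statistics import variance


def f_statistic(sample1, sample2):
    """Ratio of two sample variances with numerator and denominator degrees of freedom."""
    var1, var2 = variance(sample1), variance(sample2)
    n1, n2 = len(sample1), len(sample2)
    # Put the larger variance on top so that F >= 1.
    if var1 >= var2:
        return var1 / var2, n1 - 1, n2 - 1
    return var2 / var1, n2 - 1, n1 - 1


f_value, df_num, df_den = f_statistic([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

An F value close to 1 suggests similar variances; how far from 1 counts as significant depends on the two degrees of freedom.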

ANOVA

ANOVA (Analysis of Variance) assesses whether the averages of more than two groups are statistically different from each other, under the assumption that the corresponding populations are normally distributed. ANOVA is useful for comparing averages of two or more numerical attributes or two or more conditional averages of a numerical attribute given two or more binary attributes (two or more categories of the same categorical attribute).

  1. Select the ANOVA tab.
  2. Select a numerical variable from the Variable 2 list. Note: To view only Binary or Numeric variables, click All, then Binary or Numeric.
  3. Select the Binary Variables from the Binary Variables list.
  4. Click the ANOVA button. The ANOVA table will be displayed.
  5. Click the Stats radio button to view the Count, Mean, and Variance.
  6. Click to visualize the result.
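
The F value in an ANOVA table comes from splitting the total variation into a between-group part and a within-group part. The following is a minimal sketch of a one-way ANOVA; the function name `one_way_anova` and the three small groups are invented for illustration and say nothing about how Xarang lays out its table.

```python
def one_way_anova(groups):
    """One-way ANOVA: returns (F, df_between, df_within) for a list of groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


f_ratio, df_between, df_within = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```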

Test of Independence

The Chi2 test can be used to determine the association between categorical (binary) attributes. It is based on the difference between the expected frequencies and the observed frequencies in one or more categories in the frequency table. The Chi2 distribution returns a probability for the computed Chi2 and the degrees of freedom. A probability of zero shows complete dependency between two categorical attributes, and a probability of one means that the two categorical attributes are completely independent.

  1. Select the Test of Independence tab.
  2. Select Binary variables in Rows and Binary variables in Columns.
  3. Click the Chi2 button.
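
For two binary attributes, the frequency table is 2x2 and the test has one degree of freedom. The sketch below computes Chi2 from observed and expected frequencies; the function name `chi2_independence_2x2` and the counts are illustrative assumptions, and no continuity correction is applied. For one degree of freedom the probability has a closed form through the complementary error function.

```python
import math


def chi2_independence_2x2(table):
    """Chi2 statistic and probability for a 2x2 table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Frequency expected if the two attributes were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # Upper-tail probability of the Chi2 distribution with 1 degree of freedom.
    probability = math.erfc(math.sqrt(chi2 / 2))
    return chi2, probability


chi2_dep, p_dep = chi2_independence_2x2([[10, 20], [20, 10]])  # associated
chi2_ind, p_ind = chi2_independence_2x2([[10, 10], [10, 10]])  # independent
```

In the second table the observed and expected frequencies coincide, so Chi2 is zero and the probability is one.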

Explorer - Factor Analysis

Factor analysis is a statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. For example, people may respond similarly to questions about income, education, and occupation, which are all associated with the latent variable socioeconomic status. The relationship of each variable to the underlying factor is expressed by the so-called factor loading. In the output of a simple factor analysis, the first numbers underneath each factor are the "eigenvalue" and the "percentage of variance explained".

Extraction Methods:

Xarang supports seven extraction methods:

  1. Alpha Factoring
  2. Generalized Least Squares
  3. Image Factoring
  4. Iterative Principal Axis
  5. Maximum Likelihood
  6. Principal Components Analysis (PCA)
  7. Unweighted Least Squares

PCA is the most popular extraction method. However, information on the relative strengths and weaknesses of these techniques is not well known. In general, Maximum Likelihood or Iterative Principal Axis will give you the best results, depending on whether your data are generally normally distributed or significantly non-normal, respectively.

Number of Factors:

After extraction you must decide how many factors to retain for rotation. Both over-extraction and under-extraction of factors retained for rotation can have damaging effects on the results. The default in most statistical software packages is to retain all factors with eigenvalues greater than 1.0. Alternative tests for factor retention include the scree test, which involves examining the graph of the eigenvalues and looking for the natural bend or break point in the data where the curve flattens out. The number of data points above the “break” (i.e., not including the point at which the break occurs) is usually the number of factors to retain.
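
The eigenvalue-greater-than-1.0 rule can be sketched as a simple filter over the extracted eigenvalues. The function name `retained_factors` and the eigenvalues are hypothetical; in practice the values would come from the extraction step.

```python
def retained_factors(eigenvalues, threshold=1.0):
    """Count factors above the threshold and the percentage of variance each explains."""
    total = sum(eigenvalues)
    kept = [ev for ev in eigenvalues if ev > threshold]
    explained = [100 * ev / total for ev in kept]
    return len(kept), explained


# Hypothetical eigenvalues for four variables, in descending order.
n_factors, pct_explained = retained_factors([2.0, 1.2, 0.5, 0.3])
```

With these values, two factors are retained, and together they account for roughly 80 percent of the variance.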

Rotation Methods:

An important feature of factor analysis is that the axes of the factors can be rotated within the multidimensional variable space. Rotations that allow for correlation are called oblique rotations; rotations that assume the factors are not correlated are called orthogonal rotations.

Varimax is the most popular orthogonal rotation, and Promax is the only oblique rotation method supported by Xarang. The supported rotation methods are:

  1. Equamax
  2. Promax
  3. Quartimax
  4. Varimax

Modeler

The Modeler constructs two types of predictive models:

  1. Classification
  2. Regression

Binary Classification

Classification refers to the data mining task of attempting to build a predictive model when the target is categorical. If the number of unique values is just two (0, 1), it is called Binary Classification. The main goal of classification is to divide a dataset into mutually exclusive groups such that the members of each group are as close as possible to one another, and different groups are as far as possible from one another.

  1. Select the Classification tab on the bottom left of the window.
  2. Select the Classification target from the dropdown list. Only binary variables will be displayed.
  3. Select the input variables from the Inputs list.
  4. Click to save the selected variables.
  5. Click to open the selected variable list.
  6. Click the Model button to build the model.
  7. To avoid attributes that do not contribute significantly to model prediction, you can use the Reducer function. You can also adjust the Delta value and the number of Iterations to influence the outcome of the Reducer. The Delta is the contribution threshold that a variable must provide to the model in order to be selected by the Reducer.
  8. If you did not use the Reducer, you will need to select one or more Input variables and build the model by clicking the Model button.
  9. To change input variables in real time, check the OnTheFly checkbox. You can now select or unselect variables to instantly change and build the model.
  10. To generate a script for the model, click the Script button. The Model Script window will open with the required scripts (Equation, SQL Script, VB Code and Java Code).

Regression

Regression refers to the data mining problem of attempting to build a predictive model when the target is numerical. The simplest form of regression, simple linear regression, fits a line to a set of data.

  1. Select the Regression tab on the bottom left of the window.
  2. Select the Regression target from the dropdown list. You can select either a numeric or binary variable.
  3. Select the input variables from the Inputs list.
  4. Click to save the selected variables.
  5. Click to open the selected variable list.
  6. Click the Model button to build the model.
  7. To avoid attributes that do not contribute significantly to model prediction, you can use the Reducer function. You can also adjust the Delta value and the number of Iterations to influence the outcome of the Reducer. The Delta is the contribution threshold that a variable must provide to the model in order to be selected by the Reducer.
  8. If you did not use the Reducer, you will need to select one or more Input variables and build the model by clicking the Model button.
  9. To change input variables in real time, check the OnTheFly checkbox. You can now select or unselect variables to instantly change and build the model.
  10. To generate a script for the model, click the Script button. The Model Script window will open with the required scripts (Equation, SQL Script, VB Code and Java Code).
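
Simple linear regression, the simplest form mentioned at the start of this section, fits a line by least squares. The sketch below shows the closed-form solution for one input variable; the function name `fit_line` and the data are illustrative, and this is not a reproduction of the scripts that the Model Script window generates.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = co-variation of x and y divided by variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
prediction = slope * 5 + intercept  # predicted target for a new input of 5
```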

Predictor

The Predictor uses a new dataset and a model for prediction in four steps:

  1. Select a dataset
  2. Select a model
  3. Predict using a model
  4. Evaluate prediction result

Predictor - Data

The Predictor tab opens to the Data tab. Load the dataset that you would like to use to make predictions from a local drive, a database, or a Cloud service.

Predictor - Model

  1. Click the Model tab.
  2. Select either Classification, Regression, or MultiClass from the Model drop-down list.
  3. Select one or more Input variables and a Target variable.
  4. You can also append other variables to the output file by selecting them from the Key list.

Predictor - Predict

To begin the Predictor, click the Start Predictor button. The results will be displayed and an output file will be created. Learn more about LDA, MLR and model evaluation

Predictor - Evaluate

If you have used a Classification Model, click on to view more evaluation charts.

An Error Histogram is shown for a regression model.

Deploy

Models can be deployed to a remote server using FTP.

  1. Click on "Connect" to connect to the remote server.
  2. Click to select a saved model file.
  3. Click to deploy the model to the remote server.
  4. The models on the remote server can be renamed or deleted.

Scorecard

  1. Click "Models" to refresh the list.
  2. Select a model from the list and click .
  3. Fill in the scorecard or check "Avg" to fill the scorecard with the related average values.
  4. Click to predict using the model.
  5. Click or to browse the current session queries.

A/B Test

A/B testing (also known as Multivariate testing) is a method of comparing two versions of a model against each other to determine which one performs better.

  1. Click to refresh the list of models.
  2. You can compare up to 5 models by selecting them from the corresponding list.
  3. Assign a percentage between 1 and 100 to each model. The percentages must total 100.
  4. Select a dataset from the list.
  5. Click
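
The percentage split described in the steps above can be sketched as a weighted random assignment of records to models. This is an assumption about how such a split might work, not a description of Xarang's internals; the function name `assign_models`, the model names, and the seed are hypothetical.

```python
import random


def assign_models(shares, n_records, seed=0):
    """Assign each record to a model according to percentage shares."""
    if not 1 <= len(shares) <= 5:
        raise ValueError("compare between 1 and 5 models")
    if not all(1 <= pct <= 100 for pct in shares.values()):
        raise ValueError("each percentage must be between 1 and 100")
    if sum(shares.values()) != 100:
        raise ValueError("percentages must total 100")

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    names = list(shares)
    weights = [shares[name] for name in names]
    return rng.choices(names, weights=weights, k=n_records)


assignment = assign_models({"ModelA": 70, "ModelB": 30}, n_records=1000)
```

Each model then scores only the records assigned to it, and the evaluation results are compared across models.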