Run SQLFlow Using Docker

SQLFlow releases several Docker images that contains the SQLFlow server, MySQL server, sample datasets, Jupyter Notebook server, and the SQLFlow plugin for Jupyter.

You can use these Docker images for either local trying out or production deployment.

Preparation

Install Docker Community Edition on your PC/Macbook/Server.
Pull the latest SQLFlow Docker images. Or you can also build the Docker image from source code following this guide.
```
docker pull sqlflow/sqlflow
docker pull sqlflow/sqlflow:mysql
docker pull sqlflow/sqlflow:jupyter
```

Try Out SQLFlow Using Notebook

Type the below command to start three containers to start a MySQL server, SQLFlow server and a Jupyter notebook server.

docker run --name=sqlflow-mysql -d -p 8888:8888 sqlflow/sqlflow:mysql
docker run --net=container:sqlflow-mysql -d sqlflow/sqlflow:latest sqlflowserver
docker run --net=container:sqlflow-mysql -d sqlflow/sqlflow:jupyter

You can also use a specified version (e.g. v0.4.0) of the SQLFlow server by changing the second line above to docker run --net=container:sqlflow-mysql -d sqlflow/sqlflow:v0.4.0 sqlflowserver.
Open a web browser, go to localhost:8888, open any tutorial notebook like iris-dnn.ipynb file, then you can follow the tutorial and run the SQL statements to run the training and prediction.

XGBoost on SQLFlow Tutorial

This is a tutorial on train/predict XGBoost model in SQLFLow, you can find more SQLFlow usage from the Language Guide, in this tutorial you will learn how to:

Train a XGBoost model to fit the boston housing dataset; and
Predict the housing price using the trained model;

The Dataset

This tutorial would use the Boston Housing as the demonstration dataset. The database contains 506 lines and 14 columns, the meaning of each column is as follows:

Column	Explain
crim	per capita crime rate by town.
zn	proportion of residential land zoned for lots over 25,000 sq.ft.
indus	proportion of non-retail business acres per town.
chas	Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox	nitrogen oxides concentration (parts per 10 million).
rm	average number of rooms per dwelling.
age	proportion of owner-occupied units built prior to 1940.
dis	weighted mean of distances to five Boston employment centres.
rad	index of accessibility to radial highways.
tax	full-value property-tax rate per $10,000.
ptratio	pupil-teacher ratio by town.
black	1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat	lower status of the population (percent).
medv	median value of owner-occupied homes in $1000s.

We separated the dataset into train/test dataset, which is used to train/predict our model. SQLFlow would automatically split the training dataset into train/validation dataset while training progress.

In [1]:

 
%%sqlflow
describe boston.train;
​

Out[1]:

	Field	Type	Null	Default
0	crim	float	YES	None
1	zn	float	YES	None
2	indus	float	YES	None
3	chas	int(11)	YES	None
4	nox	float	YES	None
5	rm	float	YES	None
6	age	float	YES	None
7	dis	float	YES	None
8	rad	int(11)	YES	None
9	tax	int(11)	YES	None
10	ptratio	float	YES	None
11	b	float	YES	None
12	lstat	float	YES	None
13	medv	float	YES	None

In [2]:

 
%%sqlflow
describe boston.test;
​

Out[2]:

	Field	Type	Null	Default
0	crim	float	YES	None
1	zn	float	YES	None
2	indus	float	YES	None
3	chas	int(11)	YES	None
4	nox	float	YES	None
5	rm	float	YES	None
6	age	float	YES	None
7	dis	float	YES	None
8	rad	int(11)	YES	None
9	tax	int(11)	YES	None
10	ptratio	float	YES	None
11	b	float	YES	None
12	lstat	float	YES	None
13	medv	float	YES	None

Fit Boston Housing Dataset

First, let's train an XGBoost regression model to fit the boston housing dataset, we prefer to train the model for 30 rounds, and using squarederror loss function that the SQLFLow extended SQL can be like:

TO TRAIN xgboost.gbtree
WITH
    train.num_boost_round=30,
    objective="reg:squarederror"

xgboost.gbtree is the estimator name, gbtree is one of the XGBoost booster, you can find more information from here.

We can specify the training data columns in COLUMN clause, and the label by LABEL keyword:

COLUMN crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat
LABEL medv

To save the trained model, we can use INTO clause to specify a model name:

INTO sqlflow_models.my_xgb_regression_model

Second, let's use a standard SQL to fetch the training data from table boston.train:

SELECT * FROM boston.train

Finally, the following is the SQLFlow Train statement of this regression task, you can run it in the cell:

In [3]:

 
%%sqlflow
SELECT * FROM boston.train
TO TRAIN xgboost.gbtree
WITH
    objective="reg:squarederror",
    train.num_boost_round = 30
COLUMN crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat
LABEL medv
INTO sqlflow_models.my_xgb_regression_model;
​

Start training XGBoost model...

[08:08:18] 496x13 matrix with 6448 entries loaded from train.txt_0

[0]	train-rmse:17.1167

[1]	train-rmse:12.3003

[2]	train-rmse:8.95505

[3]	train-rmse:6.57081

[4]	train-rmse:4.8891

[5]	train-rmse:3.69686

[6]	train-rmse:2.88164

[7]	train-rmse:2.30785

[8]	train-rmse:1.92918

[9]	train-rmse:1.62107

[10]	train-rmse:1.42948

[11]	train-rmse:1.30048

[12]	train-rmse:1.19441

[13]	train-rmse:1.09655

[14]	train-rmse:1.03565

[15]	train-rmse:0.981582

[16]	train-rmse:0.932085

[17]	train-rmse:0.867206

[18]	train-rmse:0.840945

[19]	train-rmse:0.799894

[20]	train-rmse:0.766308

[21]	train-rmse:0.712008

[22]	train-rmse:0.677878

[23]	train-rmse:0.629921

[24]	train-rmse:0.609242

[25]	train-rmse:0.585225

[26]	train-rmse:0.536009

[27]	train-rmse:0.493594

[28]	train-rmse:0.47246

[29]	train-rmse:0.444385

Evaluation result: {'train': {'rmse': [17.116695, 12.300258, 8.955046, 6.570806, 4.889098, 3.696858, 2.881636, 2.307854, 1.929176, 1.621072, 1.429485, 1.300483, 1.194412, 1.096546, 1.035647, 0.981582, 0.932085, 0.867206, 0.840945, 0.799894, 0.766308, 0.712008, 0.677878, 0.629921, 0.609242, 0.585225, 0.536009, 0.493594, 0.47246, 0.444385]}}

Predict the Housing Price

After training the regression model, let's predict the house price using the trained model.

First, we can specify the trained model by USING clause:

USING sqlflow_models.my_xgb_regression_model

Than, we can specify the prediction result table by TO PREDICT clause:

TO PREDICT boston.predict.medv

And using a standard SQL to fetch the prediction data:

SELECT * FROM boston.test

Finally, the following is the SQLFLow Prediction statement:

In [4]:

 
%%sqlflow
SELECT * FROM boston.test
TO PREDICT boston.predict.medv
USING sqlflow_models.my_xgb_regression_model;
​

rank=0 nworkers=1

2021-09-28 08:08:37.383197 Start predicting XGBoost model...

[08:08:37] SparsePageSource::CreateRowPage Finished writing to predict.txt_0.cache

[08:08:37] 10x13 matrix with 120 entries loaded from predict.txt_0#predict.txt_0.cache

2021-09-28 08:08:37.573467 Done predicting. Predict table: boston.predict

Let's have a glance at prediction results.

In [5]:

 
%%sqlflow
SELECT * FROM boston.predict;
​

Out[5]:

	crim	indus	nox	rm	age	dis	rad	tax	ptratio	b	lstat	medv
0	0.28960	9.69	0.585	5.390	72.9	2.7986	6	391	19.2	396.90	21.14	20.6996
1	0.26838	9.69	0.585	5.794	70.6	2.8927	6	391	19.2	396.90	14.10	21.9199
2	0.23912	9.69	0.585	6.019	65.3	2.4091	6	391	19.2	396.90	12.92	21.5170
3	0.17783	9.69	0.585	5.569	73.5	2.3999	6	391	19.2	395.77	15.10	21.1983
4	0.22438	9.69	0.585	6.027	79.7	2.4982	6	391	19.2	396.90	14.33	20.9567
5	0.06263	11.93	0.573	6.593	69.1	2.4786	1	273	21.0	391.99	9.67	27.7089
6	0.04527	11.93	0.573	6.120	76.7	2.2875	1	273	21.0	396.90	9.08	25.9830
7	0.06076	11.93	0.573	6.976	91.0	2.1675	1	273	21.0	396.90	5.64	30.3944
8	0.10959	11.93	0.573	6.794	89.3	2.3889	1	273	21.0	393.45	6.48	28.3626
9	0.04741	11.93	0.573	6.030	80.8	2.5050	1	273	21.0	396.90	7.88	25.4340

本文摘自：https://www.cnblogs.com/

docker run sqlfow xgboost demo2021-09-28 16:51:42