small test_set xgb predict - xgboost

I would like to ask about a problem I have had for the last couple of days.
First of all, I am a beginner in machine learning and this is my first time using the XGBoost algorithm, so excuse any mistakes I have made.
I trained my model to predict whether a log file is malicious or not. After I save and reload my model in a different session, the predict function seems to work normally (with a few deviations in the probabilities, but that is another topic; I have seen it discussed in another question).
The problem is this: sometimes when I try to predict a "small" CSV file after loading the model, it seems to be broken, predicting only the zero label, even for rows that were classified correctly before.
For example, I load a dataset containing 20,000 rows and predict() works. I keep only the first 5 of these rows using pandas drop, and it still works. But if I save those 5 rows to a different CSV and reload it, it no longer works. The same error happens if I just remove all the other rows (19,995) by hand and save the file with only the 5 remaining.
I would bet it is a file-size problem, but when I drop the rows from the dataframe through pandas it seems to work.
Also, the number 5 (of rows) is just for example purposes; the same happens if I delete any large portion of the dataset.
I first came across this problem while trying to verify by hand some completely new logs, which are classified correctly if appended to the big CSV file, but not in a new file of their own.
Here is my load-and-predict code:
df = pd.read_csv('big_test.csv')
df3 = pd.read_csv('small_test.csv')

# This one is necessary for the loaded_model
class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, column_list):
        self.column_list = column_list

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        if len(self.column_list) == 1:
            return x[self.column_list[0]].values
        return x[self.column_list].to_dict(orient='records')

loaded_model = joblib.load('finalized_model.sav')
result = loaded_model.predict(df)
result2 = loaded_model.predict(df2)   # df2: the 5 rows kept via pandas drop (definition not shown in the post)
result3 = loaded_model.predict(df3)
The results I get are these:
[1 0 1 ... 0 0 0]
[1 0 1 0 1]
[0 0 0 0 0]
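The post doesn't include the CSV files, but one hypothetical culprit worth ruling out is pandas dtype inference: a tiny file can be parsed with different column dtypes than the big one, which silently changes what the pipeline sees. A quick sketch of the effect (the file contents and the column name `code` here are made up for illustration):

```python
import io

import pandas as pd

# A stray NA (or any non-numeric token) in the big file makes the column
# float; a small extract without it parses as int.
big = pd.read_csv(io.StringIO("code\n200\n404\nNA\n"))
small = pd.read_csv(io.StringIO("code\n200\n404\n"))
print(big["code"].dtype, small["code"].dtype)
```

Comparing `df.dtypes` of the big and small frames right before `predict()` would confirm or rule this out.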
I can provide any code, even from training, or my dataset, if necessary.
*EDIT: I use a pipeline for my data. I tried to reproduce the error after fitting xgb on the iris data and I could not. Maybe there is something wrong with my pipeline? The code is below:
df = pd.read_csv('big_test.csv')

# Split Dataset
attributes = ['uri', 'code', 'r_size', 'DT_sec', 'Method', 'http_version',
              'PenTool', 'has_referer', 'Lang', 'LangProb', 'GibberFlag']
x_train, x_test, y_train, y_test = train_test_split(
    df[attributes], df['Scan'], test_size=0.2,
    stratify=df['Scan'], random_state=0)
x_train, x_dev, y_train, y_dev = train_test_split(
    x_train, y_train, test_size=0.2,
    stratify=y_train, random_state=0)
# print('Train:', len(y_train), 'Dev:', len(y_dev), 'Test:', len(y_test))

# set up graph function
def plot_precision_recall_curve(y_true, y_pred_scores):
    precision, recall, thresholds = precision_recall_curve(y_true, y_pred_scores)
    return ggplot(aes(x='recall', y='precision'),
                  data=pd.DataFrame({"precision": precision, "recall": recall})) + geom_line()

# XGBClassifier
class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, column_list):
        self.column_list = column_list

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        if len(self.column_list) == 1:
            return x[self.column_list[0]].values
        return x[self.column_list].to_dict(orient='records')

count_vectorizer = CountVectorizer(analyzer='char', ngram_range=(1, 2), min_df=10)
dict_vectorizer = DictVectorizer()
xgb = XGBClassifier(seed=0)

pipeline = Pipeline([
    ("feature_union", FeatureUnion([
        ('text_features', Pipeline([
            ('selector', ColumnSelector(['uri'])),
            ('count_vectorizer', count_vectorizer)
        ])),
        ('categorical_features', Pipeline([
            ('selector', ColumnSelector(['code', 'r_size', 'DT_sec', 'Method',
                                         'http_version', 'PenTool', 'has_referer',
                                         'Lang', 'LangProb', 'GibberFlag'])),
            ('dict_vectorizer', dict_vectorizer)
        ]))
    ])),
    ('xgb', xgb)
]), y_train)

filename = 'finalized_model.sav'
joblib.dump(pipeline, filename)
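Since the pipeline feeds the categorical columns through DictVectorizer, any dtype drift between the two CSVs matters: DictVectorizer keeps a numeric value as a single column but one-hot encodes a string value, so the same logical data can land in a completely different feature space. A small demonstration of that behavior (the column name `code` is reused from the attributes above; the values are made up):

```python
from sklearn.feature_extraction import DictVectorizer

# Same logical data, different dtypes -> different feature spaces.
num = DictVectorizer(sparse=False).fit([{"code": 200}, {"code": 404}])
txt = DictVectorizer(sparse=False).fit([{"code": "200"}, {"code": "404"}])
print(num.feature_names_)  # one numeric column
print(txt.feature_names_)  # one-hot columns, one per distinct string
```

If the small CSV parses a column as strings where the big one parsed numbers (or vice versa), the transformed matrix no longer lines up with what the XGBoost model was trained on.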


multiprocessing in python for images

I want to use multiprocessing in Python.
With the func_process code, I extract patches from an image and feed them to a trained network to predict the output.
In the main code, imagine I have an image; in a for loop over the rows, I select patches, build a matrix of these patches, and feed it to the network. As output from func_process, we get a vector of 0s and 1s, like pclass = [1 1 0 0 0 0 1 1 0 ... 0 0 1 0], as prediction results.
I need to gather these vectors for each row of the image and save them to outputimage_class to make the final mask.
I think that, since the rows are independent of each other, I can use multiprocessing. I have written the code, but the problem is that I eventually get a black image: I can see nonzero values in pclass, but the final result is all zeros!
Can you please tell me where the problem is with this code?
from joblib import Parallel, delayed
import multiprocessing

def func_process(outputimage_class, fname, image, hwsize, rowi):
    patches = []  # create a set of patches, operate on a per-column basis
    for coli in xrange(33, 1000):
        ...  # build the patch for column coli and append it to patches
    prediction = net.predict(patches)  # predict the output
    pclass = prediction.argmax(axis=1)  # get the argmax
    outputimage_class[rowi, hwsize+1:image.shape[1]-hwsize] = pclass  # make the mask
    return outputimage_class

if __name__ == "__main__":
    ...  # load the trained network
    for fname in sorted(glob.glob(IMAGE_DIR + "*.tiff")):  # get all of the files
        newfname_class = "%s/%s_class.png" % (OUTPUT_DIR, base_fname)  # create the new files
        outputimage = np.zeros(shape=(10, 10))
        scipy.misc.imsave(newfname_class, outputimage)  # save a file to let potential other workers know that this file is being worked on and it should be skipped
        image = ...  # load the image to test
        outputimage_class = np.zeros(shape=(image.shape[0], image.shape[1]))
        # use multiprocessing
        num_cores = multiprocessing.cpu_count()
        outputimage_class = Parallel(n_jobs=num_cores)(
            delayed(func_process)(outputimage_class, fname, image, hwsize, rowi)
            for rowi in xrange(50, 80))
        outputimage_class = outputimage_class[hwsize:-hwsize, hwsize:-hwsize]
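One detail worth noting (an observation about joblib/multiprocessing semantics, not from the original post): each worker runs in a separate process and mutates its own copy of outputimage_class, so the parent's array is never touched, and Parallel returns a list of the workers' return values rather than a single array. A minimal sketch of returning just the per-row result and reassembling the mask in the parent, using the stdlib multiprocessing.Pool and a fake predictor in place of net.predict:

```python
import numpy as np
import multiprocessing as mp

NCOLS = 8

def predict_row(rowi):
    # stand-in for patch extraction + net.predict for one image row
    return (np.arange(NCOLS) + rowi) % 2

# the "fork" start method keeps this runnable as a plain script on Unix
with mp.get_context("fork").Pool(2) as pool:
    rows = pool.map(predict_row, range(4))  # a list of 1-D arrays, one per row
mask = np.vstack(rows)  # reassemble the mask in the parent process
print(mask.shape)
```

The same pattern applies to joblib: have func_process return only pclass, then stack the returned list into the final mask.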

How do I get CSV files into an Estimator in Tensorflow 1.6

I am new to TensorFlow (and this is my first question on StackOverflow).
As a learning tool, I am trying to do something simple. (4 days later, I am still confused.)
I have one CSV file with 36 columns (3500 records) of 0s and 1s.
I am envisioning this file as a flattened 6x6 matrix.
I have another CSV file with 1 column of ground truth, 0 or 1 (3500 records), which indicates whether at least 4 of the 6 elements on the 6x6 matrix's diagonal are 1s.
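As a sanity check on the two CSVs, the labeling rule itself is easy to reproduce outside TensorFlow. A sketch with numpy (the trace of a 0/1 matrix counts the ones on its diagonal):

```python
import numpy as np

def has_diag(row36):
    """Label rule: 1 if at least 4 of the 6 diagonal elements are 1."""
    m = np.asarray(row36).reshape(6, 6)
    return int(np.trace(m) >= 4)

# identity matrix: all 6 diagonal elements are 1 -> label 1
print(has_diag(np.eye(6).ravel()))  # 1
# all zeros -> label 0
print(has_diag(np.zeros(36)))       # 0
```

Running this over the feature file and comparing against the ground-truth file would verify the two CSVs actually agree before any model gets involved.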
I am not sure I have processed the CSV files correctly.
I am confused as to how I create the features dictionary and labels, and how that fits into the DNNClassifier.
I am using TensorFlow 1.6 and Python 3.6.
Below is the small amount of code I have so far.
import tensorflow as tf
import os

def x_map(line):
    rDefaults = [[] for cl in range(36)]
    x_row = tf.decode_csv(line, record_defaults=rDefaults)
    return x_row

def y_map(line):
    line = tf.string_to_number(line, out_type=tf.int32)
    y_row = tf.one_hot(line, depth=2)
    return y_row

x_path_file = os.path.join('D:', 'Diag', '6x6_train.csv')
y_path_file = os.path.join('D:', 'Diag', 'HasDiag_train.csv')

filenames = [x_path_file]
x_dataset =
x_dataset =
x_dataset = x_dataset.batch(1)
x_iter = x_dataset.make_one_shot_iterator()
x_next_el = x_iter.get_next()

filenames = [y_path_file]
y_dataset =
y_dataset =
y_dataset = y_dataset.batch(1)
y_iter = y_dataset.make_one_shot_iterator()
y_next_el = y_iter.get_next()

init = tf.global_variables_initializer()
with tf.Session() as sess:
    x_el =
    y_el =
The output for x_el is:
(array([1.], dtype=float32), array([1.], dtype=float32), array([1.], dtype=float32), array([1.], dtype=float32), array([1.], dtype=float32), array([0.] ... it goes on...
The output for y_el is:
[[1. 0.]]
You're pretty much there for a minimal working model. The main issue I see is that tf.decode_csv returns a tuple of tensors, whereas I expect you want a single tensor with all values. Easy fix:
x_row = tf.stack(tf.decode_csv(line, record_defaults=rDefaults))
That should work... but it fails to take advantage of many of the awesome things the API has to offer, like shuffling, parallel threading etc. For example, if you shuffle each dataset, those shuffling operations won't be consistent. This is because you've created two separate datasets and manipulated them independently. If you create them independently, zip them together then manipulate, those manipulations will be consistent.
Try something along these lines:
def get_inputs(
        count=None, shuffle=True, buffer_size=1000, batch_size=32,
        num_parallel_calls=8, x_paths=[x_path_file], y_paths=[y_path_file]):
    """
    Get x, y inputs.

    Args:
        count: number of epochs. None indicates infinite epochs.
        shuffle: whether or not to shuffle the dataset
        buffer_size: used in shuffle
        batch_size: size of batch. See outputs below
        num_parallel_calls: used in map. Note if > 1, intra-batch ordering
            will be shuffled
        x_paths: list of paths to x-value files.
        y_paths: list of paths to y-value files.

    Returns:
        x: (batch_size, 6, 6) tensor
        y: (batch_size, 2) tensor of 1-hot labels
    """
    n_dims = 6  # side length of the 6x6 grid

    def x_map(line):
        rDefaults = [[] for cl in range(n_dims**2)]
        x_row = tf.stack(tf.decode_csv(line, record_defaults=rDefaults))
        return x_row

    def y_map(line):
        line = tf.string_to_number(line, out_type=tf.int32)
        y_row = tf.one_hot(line, depth=2)
        return y_row

    def xy_map(x, y):
        return x_map(x), y_map(y)

    x_ds =
    y_ds =
    combined =, y_ds))
    combined = combined.repeat(count=count)
    if shuffle:
        combined = combined.shuffle(buffer_size)
    combined =, num_parallel_calls=num_parallel_calls)
    combined = combined.batch(batch_size)
    x, y = combined.make_one_shot_iterator().get_next()
    return x, y
To experiment/debug:
x, y = get_inputs()
with tf.Session() as sess:
    xv, yv =[x, y])
    print(xv.shape, yv.shape)
For use in an estimator, pass the function itself.
estimator.train(get_inputs, max_steps=10000)

def get_eval_inputs():
    return get_inputs(count=1, shuffle=False)
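One more piece the question asked about: a canned DNNClassifier expects the features half of each (features, labels) pair to be a dict keyed by feature-column name, and by default it wants integer class ids rather than one-hot labels, so the x above would typically be wrapped as a dict with a matching tf.feature_column.numeric_column. The shapes involved, sketched framework-free with numpy (the feature name "x" is just an example, not from the original code):

```python
import numpy as np

batch_size = 32
x = np.zeros((batch_size, 6, 6), dtype=np.float32)   # what get_inputs yields
y_one_hot = np.zeros((batch_size, 2), dtype=np.float32)
y_one_hot[:, 0] = 1.0                                # pretend labels

features = {"x": x}                # dict keyed by feature-column name
labels = y_one_hot.argmax(axis=1)  # back to integer class ids, shape (batch_size,)
print(features["x"].shape, labels.shape)
```

In other words, the input_fn passed to estimator.train would return ({"x": x}, class_id_labels) rather than the bare (x, one_hot_y) pair above.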

Creating a .CSV file from a Lua table

I am trying to create a .csv file from a Lua table. I've read some of the documentation online and on this forum... but can't seem to get it. I think it's because of the format of the Lua table; take a look for yourselves.
This script is all from a great open-source piece of software called NeuralTalk2. The main point of the software is to caption images. You can read more about it on that page.
Anyway, let me introduce the first piece of code: a function that takes the Lua table and writes it to a .json file. This is how it looks:
function utils.write_json(path, j)
  -- API reference
  cjson.encode_sparse_array(true, 2, 10)
  local text = cjson.encode(j)
  local file =, 'w')
  file:write(text)
  file:close()
end
Once the code runs, the .json file looks like this:
[{"caption":"a view of a UNK UNK in a cloudy sky","image_id":"0001"},{"caption":"a view of a UNK UNK in a cloudy sky","image_id":"0002"}]
It goes on much longer, but generally there is a "caption" followed by some text, and an "image_id" followed by the image ID.
When I print the table onto the terminal, it looks like this:
1681 :
caption : "a person holding a cell phone in their hand"
image_id : "1681"
1682 :
caption : "a person is taking a picture of a mirror"
image_id : "1682"
It has entries before and after it... I am just showing you the general format of the table.
You may wonder how the table is defined... I am not sure there is a very clear definition of it inside the script. I will share it just so you can see; the file where it is defined depends on so many other files that it's messy.
I am hoping that from the terminal output you can understand the general structure of the table. I want to output it to a .csv file that will look like this:
image_id captions
1 xxxx
2 xxxx
3 xxxx
How can I do this? I'm not sure, given the format of the Lua table...
Here is the script where it is defined. Specifically, it is defined at the end, but again, I'm not sure it'll be too much help.
require 'torch'
require 'nn'
require 'nngraph'
-- exotics
require 'loadcaffe'
-- local imports
local utils = require 'misc.utils'
require 'misc.DataLoader'
require 'misc.DataLoaderRaw'
require 'misc.LanguageModel'
local net_utils = require 'misc.net_utils'
local csv_utils = require 'misc.csv_utils'

-- Input arguments and options
cmd = torch.CmdLine()
cmd:text('Train an Image Captioning model')
-- Input paths
cmd:option('-model','','path to model to evaluate')
-- Basic options
cmd:option('-batch_size', 1, 'if > 0 then overrule, otherwise load from checkpoint.')
cmd:option('-num_images', 100, 'how many images to use when periodically evaluating the loss? (-1 = all)')
cmd:option('-language_eval', 0, 'Evaluate language as well (1 = yes, 0 = no)? BLEU/CIDEr/METEOR/ROUGE_L? requires coco-caption code from Github.')
cmd:option('-dump_images', 1, 'Dump images into vis/imgs folder for vis? (1=yes,0=no)')
cmd:option('-dump_json', 1, 'Dump json with predictions into vis folder? (1=yes,0=no)')
cmd:option('-dump_path', 0, 'Write image paths along with predictions into vis json? (1=yes,0=no)')
-- Sampling options
cmd:option('-sample_max', 1, '1 = sample argmax words. 0 = sample from distributions.')
cmd:option('-beam_size', 2, 'used when sample_max = 1, indicates number of beams in beam search. Usually 2 or 3 works well. More is not better. Set this to 1 for faster runtime but a bit worse performance.')
cmd:option('-temperature', 1.0, 'temperature when sampling from distributions (i.e. when sample_max = 0). Lower = "safer" predictions.')
-- For evaluation on a folder of images:
cmd:option('-image_folder', '', 'If this is nonempty then will predict on the images in this folder path')
cmd:option('-image_root', '', 'In case the image paths have to be preprended with a root path to an image folder')
-- For evaluation on MSCOCO images from some split:
cmd:option('-input_h5','','path to the h5file containing the preprocessed dataset. empty = fetch from model checkpoint.')
cmd:option('-input_json','','path to the json file containing additional info and vocab. empty = fetch from model checkpoint.')
cmd:option('-split', 'test', 'if running on MSCOCO images, which split to use: val|test|train')
cmd:option('-coco_json', '', 'if nonempty then use this file in DataLoaderRaw (see docs there). Used only in MSCOCO test evaluation, where we have a specific json file of only test set images.')
-- misc
cmd:option('-backend', 'cudnn', 'nn|cudnn')
cmd:option('-id', 'evalscript', 'an id identifying this run/job. used only if language_eval = 1 for appending to intermediate files')
cmd:option('-seed', 123, 'random number generator seed to use')
cmd:option('-gpuid', 0, 'which gpu to use. -1 = use CPU')

-- Basic Torch initializations
local opt = cmd:parse(arg)
torch.setdefaulttensortype('torch.FloatTensor') -- for CPU
if opt.gpuid >= 0 then
  require 'cutorch'
  require 'cunn'
  if opt.backend == 'cudnn' then require 'cudnn' end
  cutorch.setDevice(opt.gpuid + 1) -- note +1 because lua is 1-indexed
end

-- Load the model checkpoint to evaluate
assert(string.len(opt.model) > 0, 'must provide a model')
local checkpoint = torch.load(opt.model)
-- override and collect parameters
if string.len(opt.input_h5) == 0 then opt.input_h5 = checkpoint.opt.input_h5 end
if string.len(opt.input_json) == 0 then opt.input_json = checkpoint.opt.input_json end
if opt.batch_size == 0 then opt.batch_size = checkpoint.opt.batch_size end
local fetch = {'rnn_size', 'input_encoding_size', 'drop_prob_lm', 'cnn_proto', 'cnn_model', 'seq_per_img'}
for k,v in pairs(fetch) do
  opt[v] = checkpoint.opt[v] -- copy over options from model
end
local vocab = checkpoint.vocab -- ix -> word mapping

-- Create the Data Loader instance
local loader
if string.len(opt.image_folder) == 0 then
  loader = DataLoader{h5_file = opt.input_h5, json_file = opt.input_json}
else
  loader = DataLoaderRaw{folder_path = opt.image_folder, coco_json = opt.coco_json}
end

-- Load the networks from model checkpoint
local protos = checkpoint.protos
protos.expander = nn.FeatExpander(opt.seq_per_img)
protos.crit = nn.LanguageModelCriterion()
protos.lm:createClones() -- reconstruct clones inside the language model
if opt.gpuid >= 0 then for k,v in pairs(protos) do v:cuda() end end

-- Evaluation fun(ction)
local function eval_split(split, evalopt)
  local verbose = utils.getopt(evalopt, 'verbose', true)
  local num_images = utils.getopt(evalopt, 'num_images', true)

  loader:resetIterator(split) -- rewind iterator back to first datapoint in the split
  local n = 0
  local loss_sum = 0
  local loss_evals = 0
  local predictions = {}
  while true do
    -- fetch a batch of data
    local data = loader:getBatch{batch_size = opt.batch_size, split = split, seq_per_img = opt.seq_per_img}
    data.images = net_utils.prepro(data.images, false, opt.gpuid >= 0) -- preprocess in place, and don't augment
    n = n + data.images:size(1)

    -- forward the model to get loss
    local feats = protos.cnn:forward(data.images)

    -- evaluate loss if we have the labels
    local loss = 0
    if data.labels then
      local expanded_feats = protos.expander:forward(feats)
      local logprobs = protos.lm:forward{expanded_feats, data.labels}
      loss = protos.crit:forward(logprobs, data.labels)
      loss_sum = loss_sum + loss
      loss_evals = loss_evals + 1
    end

    -- forward the model to also get generated samples for each image
    local sample_opts = { sample_max = opt.sample_max, beam_size = opt.beam_size, temperature = opt.temperature }
    local seq = protos.lm:sample(feats, sample_opts)
    local sents = net_utils.decode_sequence(vocab, seq)
    for k=1,#sents do
      local entry = {image_id = data.infos[k].id, caption = sents[k]}
      if opt.dump_path == 1 then
        entry.file_name = data.infos[k].file_path
      end
      table.insert(predictions, entry)
      if opt.dump_images == 1 then
        -- dump the raw image to vis/ folder
        local cmd = 'cp "' .. path.join(opt.image_root, data.infos[k].file_path) .. '" vis/imgs/img' .. #predictions .. '.jpg' -- bit gross
        os.execute(cmd) -- dont think there is cleaner way in Lua
      end
      if verbose then
        print(string.format('image %s: %s', entry.image_id, entry.caption))
      end
    end

    -- if we wrapped around the split or used up val imgs budget then bail
    local ix0 = data.bounds.it_pos_now
    local ix1 = math.min(data.bounds.it_max, num_images)
    if verbose then
      print(string.format('evaluating performance... %d/%d (%f)', ix0-1, ix1, loss))
    end

    if data.bounds.wrapped then break end -- the split ran out of data, lets break out
    if num_images >= 0 and n >= num_images then break end -- we've used enough images
  end

  local lang_stats
  if opt.language_eval == 1 then
    lang_stats = net_utils.language_eval(predictions,
  end

  return loss_sum/loss_evals, predictions, lang_stats
end

local loss, split_predictions, lang_stats = eval_split(opt.split, {num_images = opt.num_images})
print('loss: ', loss)
if lang_stats then
  print(lang_stats)
end
if opt.dump_json == 1 then
  -- dump the json
  utils.write_json('vis/vis.json', split_predictions)
  csv_utils.write('vis/vis.csv', split_predictions, ";")
end
1681 :
caption : "a person holding a cell phone in their hand"
image_id : "1681"
1682 :
caption : "a person is taking a picture of a mirror"
image_id : "1682"
Every {} denotes a table. The number or text in front of the colon is a key, and the stuff behind the colon is the value stored in the table under that key.
Let's create a table structure that would result in an output like the one above:
local myTable = {}
myTable[1681] = {caption = "a person holding a cell phone in their hand",
image_id = "1681"}
myTable[1682] = {caption = "a person is taking a picture of a mirror",
image_id = "1682"}
Not sure what your problem is here. I think creating the desired CSV file is rather trivial. All you need is a loop that creates a new line for each table entry and adds the respective value's image_id (or key) and caption.
One line could look like this:
local nextLine = myTable[1681].image_id .. "," .. myTable[1681].caption .. "\n"
Of course this is not very beautiful, and you would use a loop to get all the elements of that table, but I think I should leave some work for you as well ;)
If anyone is wondering, I figured out the solution a long time ago.
function nt2_write(path, data, sep)
  sep = sep or ','
  local file = assert(, "w"))
  file:write('Image ID' .. sep .. 'Caption' .. "\n")
  for k, v in pairs(data) do
    file:write(v["image_id"] .. sep .. v["caption"] .. "\n")
  end
  file:close()
end
Of course, you may need to change the string values, but yeah. Happy programming.

How to combine FCNN and RNN in Tensorflow?

I want to make a neural network which would have recurrency (for example, LSTM) at some layers and normal (fully connected) connections at others.
I cannot find a way to do it in TensorFlow.
It works if I have only FC layers, but I don't see how to add just one recurrent layer properly.
I create the network in the following way:
with tf.variable_scope("autoencoder_variables", reuse=None) as scope:
    for i in xrange(self.__num_hidden_layers + 1):
        # Train weights
        name_w = self._weights_str.format(i + 1)
        w_shape = (self.__shape[i], self.__shape[i + 1])
        a = tf.multiply(4.0, tf.sqrt(6.0 / (w_shape[0] + w_shape[1])))
        w_init = tf.random_uniform(w_shape, -1 * a, a)
        self[name_w] = tf.Variable(w_init, trainable=True, name=name_w)
        # Train biases
        name_b = self._biases_str.format(i + 1)
        b_shape = (self.__shape[i + 1],)
        b_init = tf.zeros(b_shape)
        self[name_b] = tf.Variable(b_init, trainable=True, name=name_b)

        if i+1 == self.__recurrent_layer:
            # Create an LSTM cell
            lstm_size = self.__shape[self.__recurrent_layer]
            self['lstm'] = tf.contrib.rnn.BasicLSTMCell(lstm_size)
It should process the batches in sequential order. I have a function for processing just one time-step, which will be called later by a function which processes the whole sequence:
def single_run(self, input_pl, state, just_middle = False):
    """Get the output of the autoencoder for a single batch.

    Args:
        input_pl: tf placeholder for ae input data of size [batch_size, DoF]
        state: current state of LSTM memory units
        just_middle: will indicate if we want to extract only the middle layer of the network

    Returns:
        Tensor of output
    """
    last_output = input_pl

    # Pass through the network
    for i in xrange(self.num_hidden_layers+1):
        w = self._w(i + 1)
        b = self._b(i + 1)
        last_output = self._activate(last_output, w, b)
        last_output, state = self['lstm'](last_output, state)

    return last_output
The following function should take a sequence of batches as input and produce a sequence of batches as output:
def process_sequences(self, input_seq_pl, dropout, just_middle = False):
    """Get the output of the autoencoder.

    Args:
        input_seq_pl: input data of size [batch_size, sequence_length, DoF]
        dropout: dropout rate
        just_middle: indicate if we want to extract only the middle layer of the network

    Returns:
        Tensor of output
    """
    if(~just_middle): # if not middle layer
        numb_layers = self.__num_hidden_layers+1
    else:
        numb_layers = FLAGS.middle_layer

    with tf.variable_scope("process_sequence", reuse=None) as scope:
        # Initial state of the LSTM memory.
        state = initial_state = self['lstm'].zero_state(FLAGS.batch_size, tf.float32)
        tf.get_variable_scope().reuse_variables() # THIS IS IMPORTANT LINE

        # First - Apply Dropout
        the_whole_sequences = tf.nn.dropout(input_seq_pl, dropout)

        # Take batches for every time step and run them through the network
        # Stack all their outputs
        with tf.control_dependencies([tf.convert_to_tensor(state, name='state')]): # do not let it parallelize the loop
            stacked_outputs = tf.stack(
                [self.single_run(the_whole_sequences[:, time_st, :], state, just_middle)
                 for time_st in range(self.sequence_length)])

        # Transpose output from the shape [sequence_length, batch_size, DoF] into [batch_size, sequence_length, DoF]
        output = tf.transpose(stacked_outputs, perm=[1, 0, 2])

    return output
The issue is with the variable scopes and their "reuse" property.
If I run this code as it is, I get the following error:
' Variable Train/process_sequence/basic_lstm_cell/weights does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=None in VarScope? '
If I comment out the line which tells it to reuse variables (tf.get_variable_scope().reuse_variables()), I get the following error:
'Variable Train/process_sequence/basic_lstm_cell/weights already exists, disallowed. Did you mean to set reuse=True in VarScope?'
It seems that we need reuse=None for the weights of the LSTM cell to be initialized, and we need reuse=True in order to call the LSTM cell.
Please help me figure out the way to do this properly.
I think the problem is that you're creating variables with tf.Variable. Please use tf.get_variable instead -- does this solve your issue?
It seems that I have solved this issue using the hack from the official TensorFlow RNN example, with the following code:
with tf.variable_scope("RNN"):
    for time_step in range(num_steps):
        if time_step > 0: tf.get_variable_scope().reuse_variables()
        (cell_output, state) = cell(inputs[:, time_step, :], state)
The hack is that when we run the LSTM for the first time, tf.get_variable_scope().reuse is set to False, so that a new LSTM cell is created. The next time we run it, we set tf.get_variable_scope().reuse to True, so that we use the LSTM cell which was already created.

diverging results from weka training and java training

I'm trying to create an "automated training" using Weka's Java API, but I guess I'm doing something wrong. Whenever I test my ARFF file via Weka's interface using MultilayerPerceptron with 10-fold Cross-Validation or a 66% Percentage Split, I get satisfactory results (around 90%), but when I try to test the same file via Weka's API, every test returns basically a 0% match (every row returns false).
here's the output from weka's gui:
=== Evaluation on test split ===
=== Summary ===
Correctly Classified Instances 78 91.7647 %
Incorrectly Classified Instances 7 8.2353 %
Kappa statistic 0.8081
Mean absolute error 0.0817
Root mean squared error 0.24
Relative absolute error 17.742 %
Root relative squared error 51.0603 %
Total Number of Instances 85
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.885 0.068 0.852 0.885 0.868 0.958 1
0.932 0.115 0.948 0.932 0.94 0.958 0
Weighted Avg. 0.918 0.101 0.919 0.918 0.918 0.958
=== Confusion Matrix ===
a b <-- classified as
23 3 | a = 1
4 55 | b = 0
And here's the code I've been using in Java (actually it's .NET using IKVM):
var classifier = new weka.classifiers.functions.MultilayerPerceptron();
classifier.setOptions(weka.core.Utils.splitOptions("-L 0.7 -M 0.3 -N 75 -V 0 -S 0 -E 20 -H a")); // these are the same options (the default options) when the test is run under weka gui

string trainingFile = Properties.Settings.Default.WekaTrainingFile; // the path to the same file I use to test on weka explorer
weka.core.Instances data = null;
data = new weka.core.Instances(new; // loads the file
data.setClassIndex(data.numAttributes() - 1); // set the last column as the class attribute

var tmp = System.IO.Path.GetTempFileName(); // creates a temp file to create an arff file with a single row with the instance I want to test taken from the arff file loaded previously
using (var f = System.IO.File.CreateText(tmp))
{
    // long code to read data from db and regenerate the line, simulating data coming from the source I really want to test
}

var dataToTest = new weka.core.Instances(new;
dataToTest.setClassIndex(dataToTest.numAttributes() - 1);

double prediction = 0;
for (int i = 0; i < dataToTest.numInstances(); i++)
{
    weka.core.Instance curr = dataToTest.instance(i);
    weka.core.Instance inst = new weka.core.Instance(data.numAttributes());
    for (int n = 0; n < data.numAttributes(); n++)
    {
        weka.core.Attribute att = dataToTest.attribute(data.attribute(n).name());
        if (att != null)
        {
            if (att.isNominal())
            {
                if ((data.attribute(n).numValues() > 0) && (att.numValues() > 0))
                {
                    String label = curr.stringValue(att);
                    int index = data.attribute(n).indexOfValue(label);
                    if (index != -1)
                        inst.setValue(n, index);
                }
            }
            else if (att.isNumeric())
            {
                inst.setValue(n, curr.value(att));
            }
            else
            {
                throw new InvalidOperationException("Unhandled attribute type!");
            }
        }
    }
    prediction += classifier.classifyInstance(inst);
}
// prediction is always 0 here, my ARFF file has two classes: 0 and 1, 92 zeroes and 159 ones
It's funny, because if I change the classifier to, say, NaiveBayes, the results match the test made via Weka's GUI.
You are using a deprecated way of reading in ARFF files. See this documentation. Try this instead:
import weka.core.converters.ConverterUtils.DataSource;
DataSource source = new DataSource("/some/where/data.arff");
Instances data = source.getDataSet();
Note that that documentation also shows how to connect to a database directly, and bypass the creation of temporary ARFF files. You could, additionally, read from the database and manually create instances to populate the Instances object with.
Finally, if simply changing the classifier type at the top of the code to NaiveBayes solved the problem, then check the options in your weka gui for MultilayerPerceptron, to see if they are different from the defaults (different settings can cause the same classifier type to produce different results).
Update: it looks like you're using different test data in your code than in your weka GUI (from a database vs a fold of the original training file); it might also be the case that the particular data in your database actually does look like class 0 to the MLP classifier. To verify whether this is the case, you can use the weka interface to split your training arff into train/test sets, and then repeat the original experiment in your code. If the results are the same as the gui, there's a problem with your data. If the results are different, then we need to look more closely at the code. The function you would call is this (from the Doc):
public Instances trainCV(int numFolds, int numFold)
I had the same problem.
Weka gave me different results in the Explorer compared to a cross-validation in Java.
Something that helped:
Instances dataSet = ...;
dataSet.stratify(numOfFolds); // use this before splitting the dataset into train and test set!