Hi, friends!

Please help with advice or a link.

How do I choose the right features for SVM classification? Do I need to normalize their numerical values?

The task is to use SVM to learn to separate the wheat from the chaff.

The grains have certain characteristic features by which they can be distinguished, but which features should I take?

Here is an example. Say the feature is the grain's weight in milligrams. Chaff also has weight, but its average differs from that of grain. Can I take the weight itself as a feature, or should I take the logarithm of the weight, since some grains are very small and some are very large?

How do I choose the right ratio of grain to chaff in the training set? What should it be? 50/50? Or taken from real life: harvest the grain, take a handful of it, and make that the sample (i.e. a ratio close to the real one)?

What if the number of grains in reality (and in the training set) relates to the amount of chaff as 1/200? Does that spoil the training sample?

After all, it is the grain that needs to be isolated: the grains are what matter, and there are very few of them.

Is there any "SVM for dummies" manual that covers these questions simply, on one's fingers, without solving complex systems of equations?

## 1 Answer

First of all, there is no need to fixate on SVM: it is just one of many classification methods. Yes, SVM has its own specifics (as do the other methods), but at this stage you can use the common data-preprocessing techniques.

> which features should I take?

This is called feature selection and feature extraction.
In simple terms, the process looks like this:
1. Make a list of the available features.
2. Add various functions of the features (such as the logarithm of the weight you mentioned) and combinations of different features (e.g. length\*width\*height), and so on. What to combine and which transformations to use should be guided by knowledge of the task and common sense. This step is feature extraction.
3. Define an error function, i.e. decide how classification accuracy will be assessed. For example, it may be the ratio of correctly recognized examples to their total number. It is useful to read about precision and recall here.
4. Move one level of abstraction higher. Imagine a black box containing a classifier together with its training and test samples. The input of the box is a binary vector indicating which features the classifier should use; the output is the classification error rate on the test set.
Thus, feature selection reduces to an optimization problem: find the input vector for which the output of the box (the classification error) is minimal. You can, for example, add features one at a time, starting with those that improve the result the most (cf. gradient descent). Or you can use something more serious, like genetic algorithms.
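The "add features one at a time" strategy is greedy forward selection. A minimal sketch, assuming a hypothetical `evaluate` callable that plays the role of the black box (train the classifier on the given feature subset, return test error):

```python
def forward_selection(all_features, evaluate):
    """Greedy forward selection: repeatedly add the single feature that
    most reduces the error reported by the black-box `evaluate(subset)`,
    stopping when no addition strictly improves the error."""
    selected = []
    best_error = evaluate(selected)
    improved = True
    while improved:
        improved = False
        for feat in set(all_features) - set(selected):
            err = evaluate(selected + [feat])
            if err < best_error:
                best_error, best_feature = err, feat
                improved = True
        if improved:
            selected.append(best_feature)
    return selected, best_error
```

This is a local search, so like gradient descent it can get stuck in a local optimum; genetic algorithms trade more compute for a broader search.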
> Do I need to normalize the numerical values of these features?

It strongly depends on the specific task and the features. For SVM in particular, distance-based kernels are sensitive to feature scale, so bringing features to a comparable range usually helps.
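Whether to take log(weight) is exactly the kind of feature-extraction decision discussed above; for the normalization itself, here is a minimal z-score sketch in plain Python (the function name is hypothetical; population standard deviation is used for simplicity):

```python
import math

def zscore(values):
    """Standardize one feature column to zero mean and unit variance.
    Uses the population standard deviation (divide by n)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(var) or 1.0  # guard against a constant column
    return [(v - mean) / std for v in values]
```

Remember to compute the mean and standard deviation on the training set only, and reuse those same values when scaling the test set.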
> What if the number of grains in reality (and in the training set) relates to the amount of chaff as 1/200? Does that spoil the training sample?

In general, yes, it spoils it: if examples of one class are much scarcer than the other, there is a risk that the classifier will simply memorize the examples from the training set and fail to recognize other similar examples (over-fitting).
Besides, if you use a simple error function (correct answers / total sample size), a pessimistically tuned classifier can always answer "chaff" and be right in 99.5% of cases (200/201 ≈ 99.5%) :)
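The answer does not prescribe a fix, but one standard remedy for such imbalance is to reweight or resample the minority class (scikit-learn's `SVC`, for instance, accepts `class_weight='balanced'`). A minimal oversampling sketch in plain Python, with hypothetical names:

```python
import random

def oversample_minority(samples, labels, seed=0):
    """Duplicate minority-class samples at random until every class
    has as many examples as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        picks = xs + [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(picks)
        out_y.extend([y] * target)
    return out_x, out_y
```

Oversampling does not add new information about grains, so evaluate on an untouched, realistically imbalanced test set using precision and recall.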
