Rev | Author | Line No. | Line |
---|---|---|---|
20 | reyssat | 1 | |
2 | !set p2=!item 2 of $special_parm |
||
3 | !if $p2!=$empty |
||
4 | !if $p2=list |
||
5 | !read help/symtext/list.phtml |
||
6 | !if $stylecnt > 0 |
||
7 | !exit |
||
8 | !endif |
||
9 | !endif |
||
10 | !set test=!defof style_exists in symtext/$module_language/$p2/def |
||
11 | !if $test=yes |
||
12 | !changeto symtext/$module_language/$p2/help.phtml |
||
13 | !endif |
||
14 | !endif |
||
15 | |||
6133 | bpr | 16 | <h2>Symtext documentation</h2> |
20 | reyssat | 17 | |
18 | !href cmd=help&special_parm=symtext,list List of symtext styles |
||
6250 | bpr | 19 | . |
20 | <p> |
||
20 | reyssat | 21 | |
22 | Symtext is a natural language parsing syntax. It is designed to make it |
||
23 | easier to identify different ways to say the same thing in natural language, |
||
24 | and its main purpose is for the recognition of freely typed or composed |
||
25 | short text answers to exercises. |
||
26 | <p> |
||
27 | |||
28 | Recognition of free text answers is difficult due to the following facts. |
||
29 | <ul> |
||
30 | |||
31 | <li>Different contexts require different tolerance and precision. A language |
||
32 | exercise cannot tolerate spelling or grammar errors, which may not be the |
||
33 | case for a mathematical exercise. |
||
6250 | bpr | 34 | </li> |
20 | reyssat | 35 | <li>Natural language often allows many different ways to say the same thing, |
36 | for example "A or B" and "B or A", "Paul is older than Bill" and "Bill is |
||
37 | younger than Paul", "x and y are similar" and "x is similar to y", or even |
||
38 | "this costs too much" and "it is too expensive". |
||
6250 | bpr | 39 | </li> |
20 | reyssat | 40 | <li>Typing errors are common in freely typed text. In many cases, typing |
41 | errors should be tolerated. But when faced with an unknown word, it is difficult for |
||
42 | the software to tell whether it is a typing error or a bad answer. |
||
6250 | bpr | 43 | </li> |
20 | reyssat | 44 | </ul> |
45 | |||
46 | In view of the above, the design of symtext has incorporated the following |
||
47 | features. |
||
48 | |||
49 | <ul> |
||
50 | <li>A nestable syntax allowing the identification of various language |
||
51 | alternatives (different ways to say the same thing). |
||
6250 | bpr | 52 | </li> |
20 | reyssat | 53 | <li>Macro dictionaries can be defined to help improve the human readability |
54 | of the matching rules. |
||
6250 | bpr | 55 | </li> |
20 | reyssat | 56 | <li>User-definable multiple dictionaries that can be used for various text |
57 | analysis purposes. |
||
6250 | bpr | 58 | </li> |
20 | reyssat | 59 | <li>Designated portions of the text can be output for further processing. |
6250 | bpr | 60 | </li> |
61 | <li> |
||
62 | It is based on user-definable styles, with different styles defining |
||
20 | reyssat | 63 | different dictionaries and macros, so they can be used to deal with
64 | different contexts. |
||
6250 | bpr | 65 | </li> |
20 | reyssat | 66 | <li>Language scope can be delimited by declaring the list of allowed words. |
67 | Text containing words not in the list can be considered out of scope and |
||
68 | be sent back for rephrasing, instead of being rejected as a bad answer. A |
||
69 | correct use of this feature can solve most of the problems related to typing |
||
70 | errors and unexpected answers. |
||
6250 | bpr | 71 | </li> |
20 | reyssat | 72 | </ul> |
73 | |||
6250 | bpr | 74 | <hr /><h3> |
75 | How it works</h3> |
||
76 | <p> |
||
20 | reyssat | 77 | Symtext deals with the problem of comparing two sentences. The first is the |
78 | <em>sample</em> which is typically the answer given to an exercise. It is |
||
79 | compared to the second sentence, the <em>tester</em>, which is typically the |
||
80 | good answer as declared by the author of the exercise. |
||
6250 | bpr | 81 | </p><p> |
20 | reyssat | 82 | |
83 | The sample must be plain text in natural language, while the tester may |
||
84 | contain <em>symtext rules</em> allowing it to <em>match</em> various samples |
||
85 | that are considered to have the same meaning. Such various ways to say the |
||
86 | same thing are alternatives in the natural language. The scope of the |
||
87 | acceptable alternatives depends on the context of the exercise, therefore |
||
88 | must be precisely defined by the author. Symtext is designed to allow |
||
89 | authors to make such definitions. |
||
6250 | bpr | 90 | </p><p> |
20 | reyssat | 91 | |
92 | Symtext rules are word based, that is, they only compare words. A word is a |
||
93 | chain of alphabetic characters or digits delimited by spaces or special |
||
94 | symbols (parentheses, quotes, punctuation etc.). Any special symbol is |
||
4427 | bpr | 95 | considered as a word by itself. And symtext does not count the number of |
20 | reyssat | 96 | space characters between two words: any chain of consecutive space |
97 | characters will be reduced to one space. |
||
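The word-splitting rule just described can be sketched in Python. This is an illustration only, not the code of the symtext program: the name split_words and the regular expression are assumptions, and "alphabetic characters or digits" is approximated by Unicode word characters other than the underscore.

```python
import re

# A word is either a run of letters/digits or one single non-space symbol.
# Assumption: the underscore counts as a special symbol here.
WORD_RE = re.compile(r"[^\W_]+|[^\s]")

def split_words(text):
    """Split plain text into words; runs of spaces simply disappear."""
    return WORD_RE.findall(text)

print(split_words("A,   B and C"))
# ['A', ',', 'B', 'and', 'C']
```

Because words are compared one by one, "A,   B and C" and "A , B and C" yield the same list under this sketch.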
6250 | bpr | 98 | </p><p> |
20 | reyssat | 99 | |
100 | A set of basic <em>builtin rules</em> is defined in the symtext syntax. For |
||
5903 | bpr | 101 | example, the rule <span class="tt">[Iperm:x,and,y]</span> matches both samples "x and y" |
20 | reyssat | 102 | and "y and x". Rules can be nested: |
103 | <pre>neither [Aperm:[Alt:I,me,we,us],nor,our teacher]</pre> |
||
2000 | bpr | 104 | matches the following 8 cases. |
6250 | bpr | 105 | </p><p> |
20 | reyssat | 106 | |
107 | "neither I nor our teacher", "neither our teacher nor I", "neither me nor |
||
108 | our teacher", "neither our teacher nor me", "neither we nor our teacher", |
||
109 | "neither our teacher nor we", "neither us nor our teacher", "neither our |
||
110 | teacher nor us". |
||
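To make the nesting concrete, the following Python sketch enumerates the sentences produced by an alternative nested inside a three-parameter swap. It is an illustration under assumptions: the helpers alt and iperm are hypothetical names, the swap follows the behavior described for Iperm in the rules table, and whether the real program expands the nested example the same way is not verified here.

```python
from itertools import product

def alt(*choices):
    """[Alt:a,b,...]: any one of the parameters."""
    return list(choices)

def iperm(left, middle, right):
    """[Iperm:l,m,r]: 'l m r' or 'r m l'. Each argument is a list of variants."""
    out = []
    for l, m, r in product(left, middle, right):
        out.append(f"{l} {m} {r}")
        out.append(f"{r} {m} {l}")
    return out

cases = ["neither " + s
         for s in iperm(alt("I", "me", "we", "us"), ["nor"], ["our teacher"])]
for case in cases:
    print(case)
```

Four alternatives times two orders give the eight sentences listed above.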
6250 | bpr | 111 | </p><p> |
20 | reyssat | 112 | |
113 | In general applications, a context <em>style</em> can be declared before |
||
114 | making the comparison. A style is a set of dictionaries and options. These |
||
115 | include pre-transformation dictionaries that can be used for example to |
||
116 | identify singular and plural words before comparison, a macro dictionary |
||
117 | that can simplify the writing of tester rules and make them more readable, and |
||
118 | user-definable dictionaries for various other purposes. |
||
119 | !href cmd=help&special_parm=symtext,list List of styles |
||
120 | . |
||
6250 | bpr | 121 | </p><p> |
20 | reyssat | 122 | |
5903 | bpr | 123 | For example, a <em>positional macro</em> <span class="tt">_divides</span> can be defined in |
124 | the macro dictionary, so that the tester <span class="tt">x _divides [y + z]</span> will |
||
20 | reyssat | 125 | match the following samples. |
6250 | bpr | 126 | </p><p> |
20 | reyssat | 127 | |
128 | "x divides y + z", "x is a factor of y + z", "y + z is divisible by x", "y + |
||
129 | z is a multiple of x". |
||
6250 | bpr | 130 | </p><p> |
20 | reyssat | 131 | |
132 | Note here that such a macro is positional, so that the string "y + z" must |
||
133 | be enclosed in a pair of brackets to make it look like one word for the |
||
134 | macro. Otherwise it will instead match things like "y is a multiple of x + |
||
135 | z", which clearly is wrong. |
||
6250 | bpr | 136 | </p><p> |
20 | reyssat | 137 | |
138 | This example shows that the final power of the syntax depends primarily on the |
||
139 | construction of the macro dictionary (which will vary from style to style). |
||
6250 | bpr | 140 | </p><p> |
20 | reyssat | 141 | |
142 | The tester is a text string containing ordinary words, matching rules and |
||
143 | positional macros. An ordinary word is simply compared with the word at the |
||
144 | corresponding position in the sample, while matching rules and macros can |
||
145 | match multiple possibilities in the sample. |
||
6250 | bpr | 146 | </p><p> |
20 | reyssat | 147 | |
148 | Before comparison takes place, words in both the sample and the tester may |
||
149 | first be transformed in order to identify small differences that one wants |
||
150 | to ignore, such as upper and lower cases, singular and plural nouns etc. |
||
6250 | bpr | 151 | </p><p> |
20 | reyssat | 152 | |
153 | Unlike a regular expression, a symtext match occurs only if the tester matches |
||
154 | the whole sample. No match occurs if the tester only matches a part of |
||
155 | the sample. However, wildcard rules can be included in the tester if part of |
||
156 | the sample needs to be ignored. |
||
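The difference between whole-sample and partial matching can be illustrated by analogy with Python's re module. This is only an analogy: symtext does not use regular expressions, and the pattern ( \w+)* merely stands in for a wildcard rule such as [**].

```python
import re

tester = r"x and y"
sample = "x and y are similar"

# Partial matching (the usual regular-expression search) succeeds.
print(bool(re.search(tester, sample)))                    # True
# Whole-sample matching, which is how symtext behaves, fails.
print(bool(re.fullmatch(tester, sample)))                 # False
# Adding a wildcard for the trailing words makes the whole sample match.
print(bool(re.fullmatch(tester + r"( \w+)*", sample)))    # True
```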
6250 | bpr | 157 | </p> |
20 | reyssat | 158 | |
6250 | bpr | 159 | <hr /><h3>Details of the syntax</h3> |
20 | reyssat | 160 | |
161 | <b>Definitions</b>. <ul> |
||
162 | <li>A <em>tstring</em> is a succession of <em>atoms</em>. |
||
6250 | bpr | 163 | </li><li>An <em>atom</em> is either a <em>word</em>, a <em>bracket block</em> or |
20 | reyssat | 164 | a positional macro name. |
6250 | bpr | 165 | </li><li>A <em>word</em> is either a list of consecutive alphanumerical |
20 | reyssat | 166 | characters or a single special character. In the first case, the word is |
167 | delimited by either spaces or non-alphanumerical characters. |
||
6250 | bpr | 168 | </li><li>A <em>bracket block</em> is a substring enclosed by a pair of brackets. |
20 | reyssat | 169 | It can be either a tstring, or a <em>matching rule</em>. |
6250 | bpr | 170 | </li><li>A <em>positional macro</em> is a word (macro name) preceded by the |
20 | reyssat | 171 | underline character. The macro name must be defined in the macro dictionary, |
172 | otherwise the whole atom will be treated as an ordinary word. |
||
6250 | bpr | 173 | </li></ul> |
174 | <p> |
||
20 | reyssat | 175 | A <em>matching rule</em> may be either builtin or defined in the style macro |
176 | dictionary. It must be enclosed by a pair of brackets, and the first |
||
177 | character must be alphabetic. If the first character is upper-case, it is |
||
178 | builtin; otherwise it is a macro. |
||
6250 | bpr | 179 | </p><p> |
20 | reyssat | 180 | |
5903 | bpr | 181 | Syntax of the matching rule: <span class="tt">[rule_name:parameters]</span>. |
20 | reyssat | 182 | <em>rule_name</em> must start with the first character of the block, it must |
183 | be a valid rule name, and the colon must immediately follow the name (no |
||
184 | spaces inserted). Otherwise the block will be treated as a normal tstring |
||
6250 | bpr | 185 | rather than a rule. |
186 | </p><p> |
||
20 | reyssat | 187 | |
188 | <em>Parameters</em> is a comma-separated list of strings. Each parameter can |
||
189 | be a tstring itself (hence can contain nested subrules), except in some |
||
190 | special cases of builtin rules where some of the parameters have a special |
||
191 | meaning, e.g. the first parameter of the rule <em>Pick</em> must be a |
||
192 | positive integer. |
||
6250 | bpr | 193 | </p><p> |
20 | reyssat | 194 | |
195 | There are also two special bracket blocks that are in fact simplifications |
||
196 | of builtin matching rules: |
||
5903 | bpr | 197 | <ul><li> |
198 | <span class="tt">[A|B|C]</span> is equivalent to <span class="tt">[Alt:A,B,C]</span>. For this |
||
199 | reason, the character <span class="tt">|</span> is reserved. To have it matched, write |
||
200 | <span class="tt">[|]</span> (or <span class="tt">[Alt:|]</span>). |
||
20 | reyssat | 201 | |
202 | |||
5903 | bpr | 203 | </li><li> |
204 | <span class="tt">[**]</span> is equivalent to <span class="tt">[Wild:**]</span>, <span class="tt">[* *]</span> is |
||
205 | equivalent to <span class="tt">[Wild:* *]</span>, etc. A block falls into this category if |
||
206 | the first character is a '<span class="tt">*</span>'. |
||
20 | reyssat | 207 | |
208 | </ul> |
||
209 | |||
6250 | bpr | 210 | <hr /><h3>Builtin rules</h3> |
211 | <p> |
||
20 | reyssat | 212 | A builtin rule is a matching rule where the first character of the name is |
6250 | bpr | 213 | upper-case. |
214 | </p><p> |
||
20 | reyssat | 215 | |
216 | Any parameter may include the comma character, as long as it is enclosed by |
||
6250 | bpr | 217 | a pair of parentheses or brackets. |
218 | </p> |
||
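The rules for recognizing a matching rule and splitting its parameters can be sketched as follows. This is a simplified illustration, not the parser of the symtext program: the function names are hypothetical, the set of characters allowed in a rule name is an assumption, the name is not checked against the list of valid rules, and the shortcuts [A|B|C] and [**] are not handled.

```python
import re

def split_params(text):
    """Split on commas that are not enclosed in (...) or [...]."""
    params, depth, current = [], 0, []
    for ch in text:
        if ch in "([":
            depth += 1
        elif ch in ")]":
            depth -= 1
        if ch == "," and depth == 0:
            params.append("".join(current))
            current = []
        else:
            current.append(ch)
    params.append("".join(current))
    return params

def classify_block(block):
    """Return (kind, name, params) for a bracket block such as '[Alt:a,b]'."""
    inner = block[1:-1]
    # The name must start the block and be followed at once by the colon.
    m = re.match(r"([A-Za-z][A-Za-z0-9_]*):", inner)
    if not m:
        return ("tstring", None, [inner])
    name = m.group(1)
    kind = "builtin" if name[0].isupper() else "macro"
    return (kind, name, split_params(inner[m.end():]))

print(classify_block("[Aperm:[,],and,A,B,C]"))
# ('builtin', 'Aperm', ['[,]', 'and', 'A', 'B', 'C'])
```

A block such as [y + z], or one with a space before the colon, falls through to the plain tstring case.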
20 | reyssat | 219 | |
220 | |||
221 | |||
222 | !read tabletheme |
||
223 | !set wims_backslash_insmath=yes |
||
224 | $table_header |
||
225 | $table_hdtr |
||
226 | <th>name</th> |
||
6249 | bpr | 227 | <th><small>Number of<br />parameters</small></th> |
20 | reyssat | 228 | <th>Effect</th> |
229 | <th>Detail</th> |
||
6249 | bpr | 230 | </tr> |
20 | reyssat | 231 | $table_tr |
6249 | bpr | 232 | <td>Alt</td> |
233 | <td align=middle>\(>= 1)</td> |
||
234 | <td>Matches any one of the parameters.</td> |
||
235 | <td><span class="tt">[Alt:a,b,c d]</span> matches "a", "b" or "c d".</td> |
||
236 | </tr> |
||
20 | reyssat | 237 | $table_tr |
6249 | bpr | 238 | <td>Aperm</td> |
239 | <td align=middle>\(>= 3)</td> |
||
240 | <td>"And" styled permutation.</td> |
||
5903 | bpr | 241 | <td><span class="tt">[Aperm:[,],and,A,B,C]</span> matches "A, B and C", "B, A and C", etc. |
2000 | bpr | 242 | The order of parameters 3 and up is arbitrary, and the first two parameters |
20 | reyssat | 243 | are used to insert between them: parameter 1 is inserted except for the |
244 | last insertion where parameter 2 is inserted. |
||
6249 | bpr | 245 | </td></tr> |
20 | reyssat | 246 | $table_tr |
6249 | bpr | 247 | <td>Apick</td> |
248 | <td align=middle>\(>= 4)</td> |
||
249 | <td>"And" styled arbitrary selection.</td> |
||
5903 | bpr | 250 | <td><span class="tt">[Apick:3,[,],and,A,B,C,D,E]</span> matches "B, E and A", "C, A and E", |
20 | reyssat | 251 | etc. Parameter 1 must be an integer and gives the number of items to pick. |
6249 | bpr | 252 | </td></tr> |
20 | reyssat | 253 | |
254 | $table_tr |
||
6249 | bpr | 255 | <td>Dic</td> |
256 | <td align=middle>\(1)</td> |
||
257 | <td>Dictionary check</td> |
||
5903 | bpr | 258 | <td><span class="tt">[Dic:wordtype transitive verb]</span> matches any word or group of |
20 | reyssat | 259 | words that is defined in the dictionary "wordtype", with a definition that |
6249 | bpr | 260 | contains an item "transitive verb". |
20 | reyssat | 261 | |
262 | Note. No word transformation is performed on the parameter of this rule. |
||
6249 | bpr | 263 | </td></tr> |
20 | reyssat | 264 | $table_tr |
6249 | bpr | 265 | <td>Dperm</td> |
266 | <td align=middle>\(4)</td> |
||
267 | <td>Dependent permutation: parameters to match depend on the sample.</td> |
||
5903 | bpr | 268 | <td><span class="tt">[Dperm:a,b,c,d]</span> matches either "a b c" or "c d a", but nothing |
5768 | bpr | 269 | else. For example, <br/> |
5903 | bpr | 270 | <span class="tt">[Dperm:x,beats,y,is beaten by]</span> matches either "x beats y" or "y is |
20 | reyssat | 271 | beaten by x". Or in French, |
5768 | bpr | 272 | <br/> |
5903 | bpr | 273 | <span class="tt">il [Dperm:,y,est allé,à Paris]</span> matches either "il y est allé" or |
20 | reyssat | 274 | "il est allé à Paris". |
6249 | bpr | 275 | </td></tr> |
20 | reyssat | 276 | |
277 | $table_tr |
||
6249 | bpr | 278 | <td>Ins</td> |
279 | <td align=middle>\(>= 3)</td> |
||
280 | <td>Arbitrary insertion of parameter 1.</td> |
||
5903 | bpr | 281 | <td><span class="tt">[Ins:A,B,C,D,E]</span> matches "B A C D E", "B C A D E", "B C D A E". |
20 | reyssat | 282 | Parameter 2 and up must be matched in the given order, while parameter 1 may |
283 | find its place anywhere between them. <p> |
||
284 | |||
285 | To match "A B C D", "B A C D", "B C A D" and "B C D A", put two empty |
||
5903 | bpr | 286 | parameters: <span class="tt">[Ins:A,,B,C,D,]</span>. |
6249 | bpr | 287 | </p></td></tr> |
20 | reyssat | 288 | $table_tr |
6249 | bpr | 289 | <td>Iperm</td> |
290 | <td align=middle>\(3)</td> |
||
291 | <td>Inter-permutation.</td> |
||
5903 | bpr | 292 | <td><span class="tt">[Iperm:Bill,and,Alice]</span> matches "Bill and Alice" and "Alice and |
20 | reyssat | 293 | Bill". But not the three words in any other order. |
6249 | bpr | 294 | </td></tr> |
20 | reyssat | 295 | $table_tr |
6249 | bpr | 296 | <td>M</td> |
297 | <td align=middle>\(1)</td> |
||
298 | <td>Shared macro.</td> |
||
20 | reyssat | 299 | <td>The content (any tstring) of the macro can be shared with other calls (with |
300 | the same content). This is mainly designed for the macros file, with the aim of |
||
301 | reducing the size of compiled ruleset. Moreover, Shared macros can be self-nested |
||
302 | (while non-shared ones cannot). |
||
6249 | bpr | 303 | </td></tr> |
20 | reyssat | 304 | $table_tr |
6249 | bpr | 305 | <td>Neg</td> |
306 | <td align=middle>\(1)</td> |
||
307 | <td>Logical match negation.</td> |
||
20 | reyssat | 308 | <td>This rule returns match if the sample does not match its parameter, and |
309 | vice versa. <p> |
||
310 | In the first case, the rule matches the empty string in the sample. |
||
6249 | bpr | 311 | </p></td></tr> |
20 | reyssat | 312 | $table_tr |
6249 | bpr | 313 | <td>Nomatch</td> |
314 | <td align=middle>\(0)</td> |
||
315 | <td>This is a synonym of <span class="tt">None</span>.</td> |
||
20 | reyssat | 316 | <td> |
6249 | bpr | 317 | </td></tr> |
20 | reyssat | 318 | $table_tr |
6249 | bpr | 319 | <td>None</td> |
320 | <td align=middle>\(0)</td> |
||
321 | <td>Matches nothing.</td> |
||
322 | <td><span class="tt">[None:]</span> always returns no match.</td> |
||
323 | </tr> |
||
20 | reyssat | 324 | $table_tr |
6249 | bpr | 325 | <td>Not</td> |
326 | <td align=middle>\(1)</td> |
||
327 | <td>This is a synonym of <span class="tt">Neg</span>.</td> |
||
20 | reyssat | 328 | <td> |
6249 | bpr | 329 | </td></tr> |
20 | reyssat | 330 | $table_tr |
6249 | bpr | 331 | <td>Opick</td> |
332 | <td align=middle>\(>= 2)</td> |
||
333 | <td>Matches an ordered subset of given number of parameters.</td> |
||
5903 | bpr | 334 | <td>This rule works like <span class="tt">Pick</span>, except that it only matches subsets that
20 | reyssat | 335 | are in the same order as that given in the parameters. |
6249 | bpr | 336 | </td></tr> |
20 | reyssat | 337 | $table_tr |
6249 | bpr | 338 | <td>Out</td> |
339 | <td align=middle>\(2)</td> |
||
340 | <td>Match plus output</td> |
||
20 | reyssat | 341 | <td>The first parameter is a variable name, and the second parameter can be |
342 | any combination of words, subrules and macros. If match occurs for the |
||
343 | second parameter, the matching text will be put as a value of the variable |
||
344 | and output. <p> |
||
345 | |||
5903 | bpr | 346 | Example. <span class="tt">[Out:myvar,[*]]</span> matches any single word, and if the |
20 | reyssat | 347 | matched word is "myword" (in the sample), the match output contains a string |
348 | "myvar=myword" that can be parsed to know what word the user has entered in |
||
349 | this location. |
||
6249 | bpr | 350 | </p> |
351 | </td></tr> |
||
20 | reyssat | 352 | $table_tr |
6249 | bpr | 353 | <td>Perm</td> |
354 | <td align=middle>\(>= 2)</td> |
||
355 | <td>Matches all the parameters in arbitrary order.</td> |
||
5903 | bpr | 356 | <td><span class="tt">[Perm:x,y,z]</span> matches "x y z", "y x z", "z x y" etc. |
6249 | bpr | 357 | </td></tr> |
20 | reyssat | 358 | $table_tr |
6249 | bpr | 359 | <td>Pick</td> |
360 | <td align=middle>\(>= 2)</td> |
||
361 | <td>Matches a subset of given number of parameters in any order.</td> |
||
20 | reyssat | 362 | <td>The first parameter must be a positive integer n. The rule matches any |
5768 | bpr | 363 | subset of n parameters within the rest, in any order. <br/> |
5903 | bpr | 364 | Example: <span class="tt">[Pick:2,a,b,c,d]</span> matches "a b", "d b", "c a" etc. <br/> |
365 | <span class="tt">[Pick:3,x,y,z]</span> is equivalent to <span class="tt">[Perm:x,y,z]</span>. <br/> |
||
366 | <span class="tt">[Pick:1,a,b,c,d]</span> is equivalent to <span class="tt">[Alt:a,b,c,d]</span>. |
||
20 | reyssat | 367 | |
368 | <p> |
||
5903 | bpr | 369 | Extensions: <span class="tt">[Pick:+2,...]</span> matches any subset of at least 2 |
5768 | bpr | 370 | parameters. <br/> |
5903 | bpr | 371 | <span class="tt">[Pick:-3,...]</span> matches any subset of at most 3 parameters (including |
20 | reyssat | 372 | the empty subset). |
6249 | bpr | 373 | </p> |
20 | reyssat | 374 | <p> |
375 | Known bug: repetition of the same parameter is not recognized. |
||
5903 | bpr | 376 | <span class="tt">[Pick:2,a,b,c,d]</span> does not match "a c c". |
6249 | bpr | 377 | </p> |
378 | </td></tr> |
||
20 | reyssat | 379 | $table_tr |
6249 | bpr | 380 | <td>Rep</td> |
381 | <td align=middle>\(>= 1)</td> |
||
20 | reyssat | 382 | <td>Matches an arbitrary number (at least one) of parameters in any order and |
6249 | bpr | 383 | with any repetition.</td> |
5903 | bpr | 384 | <td><span class="tt">[Rep:0,1]</span> matches "0 1", "1", "0 1 0 0 1 1 0", etc. |
6249 | bpr | 385 | </td></tr> |
20 | reyssat | 386 | $table_tr |
6249 | bpr | 387 | <td>W</td> |
388 | <td align=middle>\(0 or 1)</td> |
||
389 | <td>Matches words in a list.</td> |
||
20 | reyssat | 390 | <td>This rule matches the next word if it appears somewhere in the tester or |
391 | if it is a word given in the parameter. <p> |
||
392 | |||
393 | If this rule is put in the last tester line, words in all the tester lines |
||
394 | will count.</p></td></tr> |
||
395 | |||
396 | $table_tr |
||
6249 | bpr | 397 | <td>Wild</td> |
398 | <td align=middle>\(1)</td> |
||
399 | <td>Wildcard word match.</td> |
||
20 | reyssat | 400 | <td>The unique parameter must be composed of words "*", "**", and/or "**n" |
401 | where n is a positive number. The first matches any single word, the second |
||
402 | matches 0 or any number of words, and the third matches from 0 to n |
||
5768 | bpr | 403 | arbitrary words. For example, <br/> |
5903 | bpr | 404 | <span class="tt">[Wild:* * **3]</span> matches between 2 and 5 words (inclusive).
6249 | bpr | 405 | </td></tr> |
20 | reyssat | 406 | $table_end |
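The equivalences stated for Pick, Perm and Alt in the table can be checked with a small enumeration. This Python sketch only lists the word sequences that the basic form with a plain integer would accept. The helper names are hypothetical, and the +n and -n extensions are not modeled.

```python
from itertools import permutations

def pick(n, *items):
    """Every ordering of n distinct items, as space-joined strings."""
    return {" ".join(p) for p in permutations(items, n)}

def perm(*items):
    """All the items, in arbitrary order."""
    return pick(len(items), *items)

print(sorted(pick(2, "a", "b", "c", "d")))
print(pick(3, "x", "y", "z") == perm("x", "y", "z"))   # True
```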
407 | |||
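The number of words accepted by a Wild parameter follows from adding up its parts. The sketch below computes that range in Python. It is an illustration with a hypothetical function name, and it assumes the parts of the parameter are separated by spaces.

```python
import math
import re

def wild_range(parameter):
    """Return (min_words, max_words) for a Wild parameter such as '* * **3'."""
    lo, hi = 0, 0
    for token in parameter.split():
        if token == "*":                      # exactly one word
            lo, hi = lo + 1, hi + 1
        elif token == "**":                   # zero or any number of words
            hi = math.inf
        elif re.fullmatch(r"\*\*[1-9][0-9]*", token):
            hi += int(token[2:])              # from zero to n words
        else:
            raise ValueError(f"not a wildcard word: {token!r}")
    return lo, hi

print(wild_range("* * **3"))   # (2, 5)
```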
6249 | bpr | 408 | <hr /><h3>Construction of styles</h3> |
409 | <p> |
||
20 | reyssat | 410 | A style corresponds to a directory and its contents. Under WIMS, the style |
411 | can either be shared among all modules in the public_html/scripts/symtext |
||
412 | directory, or be special to one module, in the module's directory. |
||
6249 | bpr | 413 | </p><p> |
20 | reyssat | 414 | |
5903 | bpr | 415 | The style must contain an index file, named <span class="tt">def</span>. It defines the |
20 | reyssat | 416 | basic configuration choices of the style. Every line of the file is a |
6249 | bpr | 417 | definition under the format <span class="tt">name=value</span>. |
418 | </p><p> |
||
20 | reyssat | 419 | |
5903 | bpr | 420 | The <span class="tt">def</span> file must contain a definition <span class="tt">style_exists=yes</span>, |
20 | reyssat | 421 | otherwise the existence of the style will not be recognized. All the rest is |
6249 | bpr | 422 | optional. |
423 | </p><p> |
||
20 | reyssat | 424 | |
425 | It may contain a definition of <em>option</em>, that lists option words that |
||
6249 | bpr | 426 | will always be activated for the style. |
427 | </p><p> |
||
20 | reyssat | 428 | |
429 | It can also define general dictionaries using the name |
||
430 | <em>dictionaries</em>. The value must be a list of words, each corresponding |
||
431 | to a dictionary file in the style. The number of general dictionaries is |
||
432 | limited. |
||
433 | |||
6249 | bpr | 434 | </p><p> |
435 | |||
20 | reyssat | 436 | For each general dictionary, a variable NAME_unknown can be defined (where |
437 | NAME should be replaced by the dictionary name), which tells how a word |
||
438 | should be treated if it is not found in the dictionary (unknown). The value |
||
5903 | bpr | 439 | may be <span class="tt">delete</span> (default) which means the unknown word should be |
440 | replaced by an empty string; <span class="tt">leave</span> which will return the unknown |
||
20 | reyssat | 441 | word unchanged; or anything else. In the last case, the value will be used |
442 | to replace the unknown word. |
||
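As an illustration of the def file and of the NAME_unknown setting, here is a Python sketch that reads name=value lines and applies the policy for unknown words. The function names, the dictionary name wordtype and the sample file content are assumptions made for the example. The real program may parse the file differently.

```python
def parse_def(text):
    """Parse 'name=value' lines into a dict; lines without '=' are skipped."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            settings[name.strip()] = value.strip()
    return settings

def style_is_recognized(settings):
    """A style exists only if its def file says style_exists=yes."""
    return settings.get("style_exists") == "yes"

def treat_unknown(word, settings, dictionary):
    """Apply the NAME_unknown policy to a word not found in the dictionary."""
    policy = settings.get(dictionary + "_unknown", "delete")
    if policy == "delete":
        return ""
    if policy == "leave":
        return word
    return policy          # any other value replaces the unknown word

example = """style_exists=yes
option=nocase
dictionaries=wordtype
wordtype_unknown=leave"""

settings = parse_def(example)
print(style_is_recognized(settings), treat_unknown("foo", settings, "wordtype"))
```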
443 | |||
6249 | bpr | 444 | </p><p> |
445 | |||
20 | reyssat | 446 | There may also be three dictionary files with reserved names: |
5903 | bpr | 447 | <span class="tt">suffix</span>, <span class="tt">trans</span> and <span class="tt">macros</span>. All dictionaries are |
448 | line dictionaries, with each line in the format <span class="tt">name:definition</span>. Names |
||
449 | must be sorted (using the special program <span class="tt">dicsort</span> in the WIMS |
||
20 | reyssat | 450 | package). All of the dictionaries are optional. |
451 | |||
6249 | bpr | 452 | </p><p> |
453 | |||
20 | reyssat | 454 | Both name and definition may contain space characters. However, except for macro
455 | definitions, there is no transformation after the dictionary is read, so only |
||
456 | single space characters should be used. The name field should start and end |
||
457 | with non-space characters. Multiple definitions with the same name will give |
||
458 | unpredictable results. |
||
459 | |||
6249 | bpr | 460 | </p><p> |
461 | |||
20 | reyssat | 462 | The <em>suffix</em> dictionary is a very special one, that is used to |
463 | transform word suffixes before any other transformation. It is easy to |
||
464 | understand except that in the name field, the suffixes are defined in |
||
465 | reverse order. |
||
466 | |||
6249 | bpr | 467 | </p><p> |
468 | |||
20 | reyssat | 469 | The <em>trans</em> dictionary is used for word replacements after suffix |
470 | transformation. Both dictionaries will be consulted before any string |
||
471 | comparison takes place. For example, if we want to identify nouns under |
||
472 | singular and plural forms, we can first use the <em>suffix</em> dictionary |
||
473 | to transform plural nouns into singular form if they obey a general suffix |
||
474 | rule; then for nouns with special plural forms, the <em>trans</em> |
||
475 | dictionary can be used to transform them. |
||
476 | |||
6249 | bpr | 477 | </p><p> |
478 | |||
20 | reyssat | 479 | Both the <em>suffix</em> and <em>trans</em> dictionaries must be constructed |
480 | to be <em>order 1 stable</em>, that is, if an already transformed string is |
||
481 | resubmitted to the dictionary, no further transformation will take place. |
||
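Order 1 stability can be checked mechanically. The Python sketch below models a trans dictionary as a plain word-to-word mapping, which is a simplifying assumption (the suffix dictionary is not modeled), and tests that applying it twice gives the same result as applying it once.

```python
def apply_trans(word, trans):
    """Replace the word if it has an entry, otherwise leave it unchanged."""
    return trans.get(word, word)

def is_order1_stable(trans):
    """True if no replacement is itself replaced by something else."""
    return all(apply_trans(value, trans) == value for value in trans.values())

good = {"mice": "mouse", "feet": "foot"}
bad = {"mice": "mouse", "mouse": "rodent"}   # "mouse" would be changed again

print(is_order1_stable(good), is_order1_stable(bad))   # True False
```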
482 | |||
6249 | bpr | 483 | </p><p> |
484 | |||
20 | reyssat | 485 | The <em>macros</em> dictionary contains both definitions for positional |
486 | macros and macro rules. The former have names starting with the underline |
||
487 | character, while the latter start with lower case letters. |
||
488 | |||
6249 | bpr | 489 | </p><p> |
490 | |||
20 | reyssat | 491 | The definition of a macro is a tstring that will be used to replace the |
492 | macro. Macros can be nested, that is, the definition of a macro may contain |
||
493 | calls to other macros, in any order. However, infinite nesting loops will |
||
494 | result in an error. |
||
495 | |||
6249 | bpr | 496 | </p><p> |
497 | |||
20 | reyssat | 498 | In order to preserve consistency for positional macros, the definition of |
499 | any macro must be composed of exactly one atom. |
||
500 | |||
6249 | bpr | 501 | </p><p> |
502 | |||
20 | reyssat | 503 | Macro definitions may contain parameters. For this purpose, the character |
5903 | bpr | 504 | <span class="tt">@</span> has a special meaning in a macro definition. When invoked, it |
20 | reyssat | 505 | must be followed by an integer. The character together with the following
506 | integer will be replaced by a macro parameter during macro expansion. |
||
507 | |||
6249 | bpr | 508 | </p><p> |
509 | |||
5903 | bpr | 510 | For a rule macro, <span class="tt">@1</span> means the first parameter, <span class="tt">@2</span> means |
20 | reyssat | 511 | the second parameter, etc. It is an error if the macro is invoked in a |
512 | tstring without giving enough parameters. |
||
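Parameter substitution in a rule macro can be sketched as a simple textual replacement. This is an illustration only: the function name and the macro body are hypothetical, and the real expansion may differ in detail.

```python
import re

def expand_rule_macro(definition, params):
    """Replace @1, @2, ... in a macro definition by the given parameters."""
    def substitute(match):
        index = int(match.group(1))
        if index < 1 or index > len(params):
            raise ValueError(f"macro needs parameter @{index}")
        return params[index - 1]
    return re.sub(r"@(\d+)", substitute, definition)

# Hypothetical macro body, made of exactly one atom (a bracket block).
definition = "[Alt:both @1 and @2,@1 as well as @2]"
print(expand_rule_macro(definition, ["x", "y"]))
# [Alt:both x and y,x as well as y]
```

Calling it with too few parameters raises an error, mirroring the rule stated above.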
513 | |||
6249 | bpr | 514 | </p><p> |
515 | |||
5903 | bpr | 516 | For a positional macro, <span class="tt">@1</span> designates the first atom following the |
517 | macro, while <span class="tt">@-1</span> designates the first atom preceding the macro, etc. |
||
20 | reyssat | 518 | It is also an error if a positional macro is inserted in a tstring without |
519 | enough atoms before or after it as required by its definition. |
||
520 | |||
6249 | bpr | 521 | </p><p> |
522 | |||
20 | reyssat | 523 | Care must be taken because a macro parameter may result in several
524 | atoms after expansion. This is not a problem unless the macro definition |
||
525 | contains a positional macro. In cases where it is necessary to ensure the |
||
526 | position of parameters, one can enclose the parameter in a pair of brackets, |
||
5903 | bpr | 527 | such as <span class="tt">[@1]</span>. |
20 | reyssat | 528 | |
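The effect of brackets on a positional macro can be seen in a small Python sketch. Everything here is an assumption made for illustration: the helper names, the body given to _divides, and the idea that the macro consumes the neighboring atoms it refers to.

```python
import re

def split_atoms(tstring):
    """Split a tstring into atoms: bracket blocks, _macro names, words."""
    atoms, i = [], 0
    while i < len(tstring):
        ch = tstring[i]
        if ch.isspace():
            i += 1
        elif ch == "[":
            depth, j = 0, i
            while j < len(tstring):
                if tstring[j] == "[":
                    depth += 1
                elif tstring[j] == "]":
                    depth -= 1
                j += 1
                if depth == 0:
                    break
            atoms.append(tstring[i:j])
            i = j
        else:
            m = re.match(r"_?[^\W_]+|.", tstring[i:])
            atoms.append(m.group(0))
            i += len(m.group(0))
    return atoms

def expand_positional(tstring, name, definition):
    """Replace the macro and the atoms it refers to by its expanded body."""
    atoms = split_atoms(tstring)
    pos = atoms.index(name)
    def substitute(match):
        offset = int(match.group(1))
        target = pos + offset
        if offset == 0 or not 0 <= target < len(atoms):
            raise ValueError("not enough atoms around " + name)
        return atoms[target]
    offsets = [int(o) for o in re.findall(r"@(-?\d+)", definition)]
    body = re.sub(r"@(-?\d+)", substitute, definition)
    first, last = pos + min(offsets + [0]), pos + max(offsets + [0])
    return " ".join(atoms[:first] + [body] + atoms[last + 1:])

# Hypothetical body for _divides, written as a single atom.
DIVIDES = "[Alt:@-1 divides @1,@1 is a multiple of @-1]"
print(expand_positional("x _divides [y + z]", "_divides", DIVIDES))
print(expand_positional("x _divides y + z", "_divides", DIVIDES))
```

Without the brackets, only y is taken as the atom following the macro, so the expansion ends with "+ z" outside the rule.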
6249 | bpr | 529 | </p><p> |
530 | |||
20 | reyssat | 531 | A general dictionary has the same syntax as the reserved ones. In this case, |
532 | the definition field can be a comma-separated list of items. These |
||
5903 | bpr | 533 | dictionaries are used via the <span class="tt">Dic</span> builtin rule, which gives a match |
20 | reyssat | 534 | if one of the items in the definition is equal to the value given in the |
535 | parameter of the rule. |
||
536 | |||
6249 | bpr | 537 | </p> |
20 | reyssat | 538 | |
6249 | bpr | 539 | <hr /><h3>The command line program</h3> |
20 | reyssat | 540 | |
1996 | bpr | 541 | The command line program <em>symtext</em> is specially built for WIMS, so that |
20 | reyssat | 542 | all the input data are sent through environment variables. It can also be |
543 | used as a standalone program, but in this case it is better that a wrapper |
||
6249 | bpr | 544 | script be used to put the input-output into a more *nix flavor. |
20 | reyssat | 545 | |
546 | $table_header |
||
547 | <caption>List of environment parameters</caption> |
||
548 | $table_hdtr<th>Name</th> |
||
549 | <th>Value</th> |
||
550 | <th>Comments</th> |
||
551 | </tr> |
||
552 | $table_tr |
||
553 | <td>wims_exec_parm</td> |
||
554 | <td>The main data input</td> |
||
6249 | bpr | 555 | <td>A multi-line string. |
556 | <ul> |
||
557 | <li>Line 1: command followed by options. Valid commands: |
||
558 | <ul> |
||
5903 | bpr | 559 | <li><span class="tt">match</span> check matching. |
6249 | bpr | 560 | </li> |
5903 | bpr | 561 | <li><span class="tt">debug</span> check matching with debug information. |
6249 | bpr | 562 | </li> |
20 | reyssat | 563 | </ul> |
6249 | bpr | 564 | </li> |
565 | <li>Line 2: The sample.</li> |
||
566 | <li>Line 3 and up: Each line is a tester.</li> |
||
20 | reyssat | 567 | </ul> |
6249 | bpr | 568 | <p> |
20 | reyssat | 569 | Lines can also be delimited by the semi-colon. For this reason, semi-colons |
570 | must be protected by parentheses in both the sample and the tester. |
||
6249 | bpr | 571 | |
572 | </p><p> |
||
20 | reyssat | 573 | Options have the same syntax as in the style option definition. With one |
5903 | bpr | 574 | more possible definition here: <span class="tt">style=[the_name_of_style]</span>. |
20 | reyssat | 575 | </td></tr> |
576 | |||
577 | $table_tr |
||
578 | <td>module_dir</td> |
||
579 | <td>Directory to current module</td> |
||
580 | <td>Automatically defined if called by WIMS. If this variable is undefined, |
||
581 | then w_symtext must give the complete path of the style. |
||
582 | </td></tr> |
||
583 | |||
584 | $table_tr |
||
585 | <td>w_module_language</td> |
||
586 | <td>Language</td> |
||
587 | <td>Only used when called by WIMS. Can be overridden by the "language=" option. |
||
588 | </td></tr> |
||
589 | |||
590 | $table_end |
||
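For standalone use, a wrapper has to assemble the environment described in the table above. The Python sketch below builds the value of wims_exec_parm and an environment for the program. It is a sketch under assumptions: the function names are hypothetical, options are assumed to be separated by spaces on the first line, and the path to the symtext binary is left to the caller.

```python
import os

def build_exec_parm(command, sample, testers, options=()):
    """Assemble the multi-line value of wims_exec_parm."""
    if command not in ("match", "debug"):
        raise ValueError("command must be 'match' or 'debug'")
    for text in [sample, *testers]:
        if "\n" in text:
            raise ValueError("sample and testers must be single lines")
    first_line = " ".join([command, *options])
    # Line 1: command and options. Line 2: sample. Line 3 and up: testers.
    return "\n".join([first_line, sample, *testers])

def build_env(exec_parm, language="en"):
    """Copy the current environment and add the symtext input variables."""
    env = dict(os.environ)
    env["wims_exec_parm"] = exec_parm
    env["w_module_language"] = language
    return env

parm = build_exec_parm("match", "x and y", ["[Iperm:y,and,x]"], ["nocase"])
env = build_env(parm)
print(parm)
# A wrapper could then run the program with this environment, for example
# subprocess.run([path_to_symtext], env=env, capture_output=True, text=True)
# where path_to_symtext is a placeholder for the local installation path.
```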
591 | <p> |
||
592 | |||
593 | Options have two origins: either from the environment variable |
||
594 | <em>w_symtext_option</em> or from the <em>def</em> file of the style. The two |
||
595 | have the same syntax. |
||
596 | |||
6249 | bpr | 597 | </p> |
598 | |||
!set option_data=!trim \
alnumly,word,Transform every character that is neither a letter nor a digit into a space.\
alphaonly,word,Transform every non-alphabetic character into a space.\
deaccent,word,Remove accents from letters before comparison.\
debug,word,Output debug information to stderr.\
language,value,A two-letter language code.\
matchall,word,Match every line of the tester, instead of stopping after the first match.\
nocase,word,Fold both texts to lower case before comparison.\
nocs,word,Replace computer-oriented characters by spaces (<span class="tt">_&$#\\@~</span>)\
nomath,word,Replace mathematical operators by spaces (<span class="tt">+-*/=|%<>()_</span>)\
noparentheses,word,Replace parentheses by spaces (<span class="tt">()[]{}</span>)\
nopunct,word,Replace punctuation characters by spaces (<span class="tt">.,;:?!"</span>) except the dot as a decimal point.\
noquote,word,Replace quoting characters by spaces (<span class="tt">`'"</span>)\
reaccent,word,Allow composition of accented letters using special characters.\
style,value,The style,, only valid in the environment parameter.\


$table_header
<caption>List of options</caption>
$table_hdtr
<th>Name</th>
<th>Nature</th>
<th>Meaning</th>
</tr>

!set n=!linecnt $option_data
!for i=1 to $n
!set l=!line $i of $option_data
!distribute item $l into name,nature,meaning
$table_tr
<td>$name</td>
<td>$nature</td>
<td>$meaning</td>
</tr>
!next i

$table_end

</p><p>

<b>Program output</b>. The output is empty if no match is found.


!set error_data=!nonempty lines \
bad_command,Invalid command in the input.\
bad_dictionary,Nonexistent dictionary specified.\
bad_macro,Bad macro name.\
bad_macro_position,Positional macro placed in the tester where pre- or post-parameters cannot be found.\
bad_pickcnt,Invalid first parameter for Pick.\
block_overflow,Too many rules and parameters defined in the tester (before or after macro expansion).\
duplication_in_dictionary,A name is defined twice in the indicated dictionary (in the style).\
file_too_long,File size exceeds the limit.\
level_overflow,Too much nesting; probably an internal bug.\
list_overflow,A rule contains too many parameters.\
macro_level_overflow,Too many recursive macro definitions. This is usually caused by an infinite loop in the macro dictionary.\
name_too_long,Macro or variable name exceeds the length limit.\
string_too_long,String length limit exceeded.\
style_not_found,Nonexistent style specified.\
syntax_error,Syntax error in a macro or rule.\
tag_overflow,Tester expansion is too complicated.\
too_many_dictionaries,The number of dictionaries declared in the style exceeds the limit.\
unknown_cmd,Unknown matching rule name.\
unmatched_parentheses,Unmatched parentheses or brackets.\
unsorted_dictionary,The indicated dictionary (in the style) is not sorted correctly.\
wrong_parmcnt,A matching rule has a number of parameters that does not match its definition.\


$table_header
<caption>Error messages</caption>
$table_hdtr
<th>Message</th>
<th>Meaning</th>
</tr>

!set n=!linecnt $error_data
!for i=1 to $n
!set l=!line $i of $error_data
!distribute items $l into msg,mean
$table_tr
<td class="tt">$msg</td>
<td>$mean</td>
</tr>
!next i
$table_end
