WIMS' search engine and als
===========================

WIMS' search engine works in two stages:

1) update of the index files when server data changes (module added...),
   typically once a day.
2) use of the index files at each user request to find activities.

Here are some details:

1) update of index files
===========================

A series of scripts creates a set of auxiliary files (generally
stored in ~/public_html/bases/sys/, see description further down) and
a list of "keywords" (stored in ~/public_html/bases/site/).

The scripts must be run in the order given here, as some files
created in earlier stages are used in subsequent stages. In general
the whole process is run by the script ~/bin/mkindex.

* Firstly, a series of 3 perl scripts (mkdomain, mkwgrp, modindclass)
  that ~/bin/mkindex.sh calls via ~/public_html/bases/sys/mkindex.sh:

  - the program ~/public_html/bases/sys/mkdomain.pl creates the lists
    of domains from the graph in domain/domain, with its translations
    (domain/domain.$lang) and in json format (English), to be used for
    completion in the modtool properties.

  - the perl program ~/public_html/bases/sys/mkwgrp.pl reads the INDEX
    files of all the modules on the site and generates

      - keywords (in .json format) to be used for completion in the
        search engine
      - the files in wgrp

    using keywords and keywords_$lang from the INDEX files, according
    to this rule: take keywords_$lang if it exists, otherwise keywords
    (whether or not the module is a $lang-module).

    Some files are created in keywords, such as keywords/algebra.fr.tmp,
    but they are not used for the moment. The keywords in these
    "keywords files" are exactly those in the variable keywords_$lang if
    it exists, otherwise those in keywords (same rule as above).

  - the program ~/public_html/bases/sys/modindclass.pl creates the lists
    of keywords coming from the example classes in
    ~/public_html/bases/class, as well as the files author,
    description, language, level, title (no ranking is done).

* Secondly, the binary program "modind" (compiled from ~/src/Misc/modind.c) reads

  -- the INDEX files of all the modules on the site
  -- the auxiliary files in ~/public_html/bases/sys/ (see description
     below)

  and produces keyword lists stored in ~wims/public_html/bases/site:
  they contain the words (or groups of words) coming from the variable
  keywords of the INDEX file, but also words from the title and
  description (small words are deleted).

  "modind" also creates a serial list of all the modules available
  on the site, see ~/public_html/bases/site/serial, and calculates the
  ranking of the site's modules. The modules are classified according
  to their types: A=all (except sheets and classes), D=document, O=OEF,
  X=exercise, T=tool, R=recreation, M=data module.

  To do that, "modind" uses some dictionaries such as
  suffix.$search_lang. --> MC: I would simply say it uses most of the
  files in ~/public_html/bases/sys/, e.g. wgrp

  -- separately, "modind" also reads the files in
     ~/public_html/bases/sys/sheet and does the same type of work.

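The selection rule applied by mkwgrp.pl in stage 1 (keywords_$lang if it exists, otherwise keywords) can be sketched as follows. This is only a minimal Python illustration, not the actual perl code: the INDEX parsing and the names parse_index and keywords_for are assumptions made for the example, and an empty keywords_$lang is treated here as if it did not exist.

```python
def parse_index(text):
    """Parse the 'name=value' lines of an INDEX file into a dict (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '='
            fields[name.strip()] = value.strip()
    return fields


def keywords_for(fields, lang):
    """Return keywords_<lang> if present and nonempty, otherwise keywords."""
    raw = fields.get("keywords_" + lang) or fields.get("keywords", "")
    return [k.strip() for k in raw.split(",") if k.strip()]


index_text = """keywords=vector, linear combination
keywords_it=vettore, combinazione lineare,bersaglio"""
fields = parse_index(index_text)
print(keywords_for(fields, "it"))  # uses keywords_it
print(keywords_for(fields, "fr"))  # no keywords_fr, falls back to keywords
```

The rule does not look at the module's own language, which matches the remark "whether or not the module is a $lang-module".
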
2) use of index files
===========================

The script ~/public_html/modules/home/search.proc (called by the
"Search" form) reads the lists above, does the actual search in these
lists and displays the modules found. It also reads the files of
~/public_html/bases/sys/class and ~/public_html/bases/sys/sheets


More technical details about both stages
========================================

In both stages, files in the directory ~/public_html/bases/sys/ (see comments
below; suffix.$lang for example, but see the remark above) are used to
process the keywords present in the modules' INDEX files. Each
"search language" has its own series of files.

?? For any module, in any language, the keywords

The contents of these INDEX files should be checked by developers and
translators, to improve the behaviour of the search engine.

The files in the directory ~/public_html/bases/sys/
are automatically generated (on install)
from the corresponding ".src" file in the "src" subdirectory.

If any of the files described below is omitted, then the corresponding
feature in the corresponding language is disabled. E.g. the files
words.fr/words.fr.src and suffix.fr/suffix.fr.src will be/have been
deleted in order to make the search engine work correctly.

(Remark: I delete the files words.fr.src and suffix.fr.src by
renaming them xx_orig for the moment, so they are not used, but on a
public server the file suffix.fr.src must be deleted by
hand.

Rmk: (bpr) I deliberately deleted suffix.fr as it is
incompatible with the list of words shown by completion (for example,
"loi normale" was translated into "loi norm??", I do not remember exactly;
it is impossible to write such things for completion, and "loi normale"
was not found). suffix.en should also be deleted.

This will be done by the script in the stable release if we are OK.)

Syntax: the lines of most of these files are of the form

==
givenword:substitute
==

=============================================================

Files
=====

words.$search_lang : correct misprints in the search words
(used both by "mkindex" and "search.proc").

E.g. if the file words.en contains the line

==
analytical:analytic
==

then the word "analytical" is considered a misprint and any occurrence
of the string "analytical" is replaced in the search by the string
"analytic" (for the language "en").

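The effect of a words.$search_lang file on a search string can be sketched as below. This is a hedged Python illustration only: the function names are hypothetical, and replacing whole whitespace-separated words (rather than arbitrary substrings) is an assumption of the sketch, not something the WIMS sources are known to guarantee.

```python
def load_substitutions(text):
    """Read 'givenword:substitute' lines into a dict (hypothetical helper)."""
    table = {}
    for line in text.splitlines():
        given, sep, substitute = line.partition(":")
        if sep and given.strip():
            table[given.strip()] = substitute.strip()
    return table


def correct_words(query, table):
    """Replace every word of the query that has an entry in the table."""
    return " ".join(table.get(word, word) for word in query.split())


words_en = load_substitutions("analytical:analytic")
print(correct_words("analytical geometry", words_en))
```

Because the same table is applied both when indexing and when searching, a user typing "analytical" and a module declaring "analytic" end up on the same index entry.
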
Note: words.fr was deleted because it caused the search engine not to
work properly. The site manager can reactivate the functionality by
adding the file again (?? how to get the "original" files from the
svn?).

Note: the file words.en is used by the module tool/wcalc.en (see
~/public_html/modules/tool/wcalc.en/dic )

=====================

suffix.$search_lang : process common suffixes in the search words
(used both by "mkindex" and "search.proc").

E.g. if the file suffix.en contains the line

==
ertem:meter
==

then any word ending in "metre" ("ertem" read backwards) is
replaced by the corresponding one ending in "meter" (kilometre -->
kilometer).

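The reversed-suffix convention can be illustrated with a short Python sketch. The function names are hypothetical, and stopping at the first matching rule is an assumption of this example; the actual order and matching strategy of the WIMS programs are not described here.

```python
def load_suffix_rules(text):
    """Read 'reversedsuffix:replacement' lines; the suffix is stored un-reversed."""
    rules = []
    for line in text.splitlines():
        reversed_suffix, sep, replacement = line.partition(":")
        if sep and reversed_suffix.strip():
            rules.append((reversed_suffix.strip()[::-1], replacement.strip()))
    return rules


def apply_suffix_rules(word, rules):
    """Rewrite the ending of word using the first rule whose suffix matches."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


rules = load_suffix_rules("ertem:meter\nsr:r")
print(apply_suffix_rules("kilometre", rules))
print(apply_suffix_rules("vectors", rules))
```

The second rule, "sr:r", is the directive quoted in the vecshoot example, where "vectors" is indexed as "vector".
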
Note: suffix.fr was deleted because it caused the search engine/the
keyword completion not to work properly. The site manager can
reactivate the functionality by adding the file again.

=====================

wgrp/wgrp.$search_lang : groups of words
(these files are automatically generated, and used by "mkindex")

E.g. if the file wgrp/wgrp.en contains the line

==
affine geometry:affine geometry,
==

then the search matches the group of words "affine geometry" as a
whole: if the user searches for "affine geometry", the search
engine returns only the modules having the exact string
"affine geometry" as a keyword (if this line were not present, the
search engine would return both the modules containing the word
"affine" and the modules containing the word "geometry").

The "wgrp" files are now generated from the modules' keywords by the
script ~/public_html/bases/sys/mkwgrp.pl: whenever a module contains
multi-word keywords, such keywords are added to the wgrp files.

E.g. tool/algebra/smallgroup.fr/INDEX contains the line

keywords=group, finite group, order, subgroup, conjugacy class, center, normal subgroup, subgroup lattice

so for each multi-word keyword between two commas the
corresponding group of words is created

finite group
conjugacy class
normal subgroup
subgroup lattice

(in the corresponding language file)

NOTE: there are problems when a string contains the apostrophe "'"
(e.g. "algorithme d'euclide")

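How the groups of words are extracted from a keywords line can be sketched in Python as follows. This only illustrates the rule stated above (keep the comma-separated keywords made of several words); it is not the code of mkwgrp.pl, and the function name word_groups is hypothetical.

```python
def word_groups(keywords_value):
    """Return the multi-word keywords of a comma-separated keywords value."""
    groups = []
    for keyword in keywords_value.split(","):
        keyword = " ".join(keyword.split())  # trim and collapse inner spaces
        if " " in keyword:                   # keep only groups of words
            groups.append(keyword)
    return groups


line = ("group, finite group, order, subgroup, conjugacy class, "
        "center, normal subgroup, subgroup lattice")
for group in word_groups(line):
    print(group)
```

Single-word keywords such as "group" or "order" need no wgrp entry, since they are already matched word by word.
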
=====================

indignore.$search_lang : ignored words
(used by "mkindex")

All the words listed in the file are ignored by the search engine.

=====================

abuse.$search_lang : swearwords to be ignored by the search engine
(used by ??)

=====================

andor.$search_lang : conjunctions ("and", "or") to be ignored by the
search engine

The file andor.xx is mentioned in src/insmath.c (processing logic
statements in math formulas) but for the moment it is used by no
module (to be used, a module must have insmath_logic=yes, which does
not exist in any public module as far as I know).

=====================

keywords.fr : ??
(used by ??) should be deleted

=======================================================


Some indexing examples
======================

U1/algebra/vecshoot.en

As this is an exercise module it is indexed in the lists A.$lang (All)
and X.$lang (eXercise).

This is a multilanguage module (main language "en", translation
language "it").

The index file contains the following (nonempty) lines

title=Vector shoot
description=click on a linear combination of 2D vectors.
language=en
category=exercise
domain=algebra, linear algebra
level=H4,H5,H6,U1,U2
keywords=vector, linear combination
scoring=yes
copyright=© 1998- (<a href="COPYING">GNU GPL</a>) 2013
author=XIAO,Gang
address=xiao@unice.fr
version=2.20
wims_version=4.05a
translation_language=it
title_it=Colpisci i vettori
description_it=individuare una combinazione lineare di vettori 2D.
keywords_it=vettore, combinazione lineare,bersaglio
translator_it=Anna, Lucci
translator_address_it=anna.lucci@gmail.it

In stage 1 the module is given a serial number (depending on the
modules actually available on each site; on my site the serial number
is "1003"). As the distribution also includes the modules
U1/algebra/vecshoot.cn (1002) and U1/algebra/vecshoot.fr (1004), which
are translations of this module into "cn" and "fr"
respectively, the files A.cn/X.cn and A.fr/X.fr contain no reference to
this module (1003) but only references to the corresponding
translated module (1002 resp. 1004). --> HELP there is no A.cn file!!

The file A.en contains the following lines related to this module.

?? (...?2 is the ranking, why do we sometimes have ....?4 )

2d:1003?2                      from description and description_it
algebra:1003?2                 from domain
bersaglio:1003?2               from keywords_it
click:1003?2                   from description
combination:1003?2             from description (_not_ from keywords)
combinazione:1003?2            from description_it
combinazione lineare:1003?2    from keywords + wgrp.en
gang:1003?2                    from author
levelh4:1003?2                 from level=h4 (and so on)
levelh5:1003?2
levelh6:1003?2
levelu1:1003?2
levelu2:1003?2
linear:1003?2                  from description
linear algebra:1003?2          from keywords
linear combination:1003?2      from keywords
lineare:1003?2                 from description_it
shoot:1003?4                   from title
vector:1003?4                  from title + description
                               (vectors --> vector because of
                               directive "sr:r" in suffix.en)
vettore:1003?2                 from keywords_it
xiao:1003?2                    from author

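The simplified lines shown above have the shape keyword:serial?number. A hedged Python sketch for reading them is given below. It assumes one module reference per line, as in this trimmed listing (the real files list several modules per keyword, which this sketch does not handle), and it assumes, as the question above suggests but does not confirm, that the number after "?" is the ranking. The function name parse_index_line is hypothetical.

```python
def parse_index_line(line):
    """Split 'keyword:serial?rank' into (keyword, serial, rank)."""
    keyword, _, reference = line.partition(":")
    serial, _, rank = reference.partition("?")
    return keyword.strip(), int(serial), int(rank)


sample = ["linear combination:1003?2", "shoot:1003?4", "vector:1003?4"]
for entry in sample:
    print(parse_index_line(entry))
```

In this listing the words carrying the value 4 ("shoot", "vector") are exactly those that occur in the title.
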
The file A.it contains the following lines related to this module.

(NOTE: the only difference is that in A.it there is the keyword
"vectors"; there is no other difference in keywords, the only other
difference is in the list of modules, a list that I omitted to clarify
this example)

2d:1003?2
algebra:1003?2
bersaglio:1003?2
click:1003?2
combination:1003?2
combinazione:1003?2
combinazione lineare:1003?2
gang:1003?2
levelh4:1003?2
levelh5:1003?2
levelh6:1003?2
levelu1:1003?2
levelu2:1003?2
linear:1003?2
linear algebra:1003?2
linear combination:1003?2
lineare:1003?2
shoot:1003?4
vector:1003?4
vectors:1003?2                 no counterpart in A.en because
                               of the directive in suffix.en
vettore:1003?2
xiao:1003?2

NOTE: title_it is missing from the index: you cannot find the module
by searching for its Italian title

The files A.$lang for languages other than the above also contain
lines related to this module.

E.g. A.nl

2d:
algebraisch:                   directive "algebra:algebraisch" in words.nl
bersaglio:
clicking:                      directive "click:clicking" in words.nl
combinaison:                   "combination:combinaison" in words.nl
combinazione:
combinazione lineare:
gang:
levelh4:
levelh5:
levelh6:
levelu1:
levelu2:
lineare:
linearly:                      "linear:linearly" in words.nl
niet:                          "on:niet" in words.nl
ofwel:                         "of:ofwel"
shooting:                      "shoot:shooting"
vector:
vettore:
xiao:

The wgrp groups "linear algebra" and "linear combination" are missing
because of the directive "linear:linearly" in words.nl, which is
executed before wgrp (?? check).

Note: ?? words.nl contains both the line algebra:algebraisch and
algebraisch:algebra ?? (and more similar pairs)

E.g. A.de

Almost the same as A.en except for the lines "vectors" (suffix.en) and
"vector shoot" (WHY??). There is no "wgrp.de" file.

2d:
algebra:
bersaglio:
click:
combination:
combinazione:
combinazione lineare:
gang:
levelh4:
levelh5:
levelh6:
levelu1:
levelu2:
linear:
linear algebra:
linear combination:
lineare:
shoot:
vector:
vector shoot:                  WHY???
vectors:                       cfr. A.it
vettore:
xiao:



====================================

In popup.fr, I also changed the way the keywords are used, for an
analogous reason (I have not done it in popup.$lang for $lang != fr).

The file suffix.fr was also used by wcalc.fr; for compatibility
with popup on the external web pages, I keep it (so I copied it
into the wcalc.fr module).

Be careful (MC: I know, I hope it is better now with the example): "keywords" has two meanings here:
- the perl script takes only the words in the variable keywords
  (so only these are in the completion list)
- modind.c creates the files A.$lang etc., which are based on the words of
  keywords, title and description. Not all of them are in the
  "completion list", but they can be typed and found by the search engine.