WIMS' search engine and als
===========================

WIMS' search engine works in two stages:

1) update of index files when server data is changed (module added...),
   typically once a day.
2) use of index files at each user's request to find some activities.


Here are some details:

1) update of index files
===========================
A series of scripts creates a set of auxiliary files (generally
stored in ~/public_html/bases/sys/, see description further down) and
a list of "keywords" (stored in ~/public_html/bases/site/).

The scripts must be run in the order given here, as some files
created in earlier stages are used in subsequent stages. In general
the whole process is run by the script ~/bin/mkindex.

* Firstly a series of 3 Perl scripts (mkdomain, mkwgrp, modindclass),
  that ~/bin/mkindex.sh calls via ~/public_html/bases/sys/mkindex.sh :

  - the program ~/public_html/bases/sys/mkdomain.pl creates the lists
    of domains from the graph in domain/domain, with its translations
    (domain/domain.$lang), and in JSON format (English) to be used for
    completion in the modtool properties.

  - the Perl program ~/public_html/bases/sys/mkwgrp.pl reads the INDEX
    files of all the modules on the site and generates

    - keywords (in .json format, to be used for completion in the search
      engine)
    - the files in wgrp

    using the keywords and keywords_$lang variables of the INDEX files,
    according to this rule: take keywords_$lang if it exists, otherwise
    keywords (whether or not the module is a $lang module).

    Some files are created in keywords, such as keywords/algebra.fr.tmp,
    but they are not used for the moment. The keywords in these "keywords
    files" are exactly those selected by the rule above.
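
As an illustration only, the selection rule above can be sketched in Python.
The real work is done by the Perl script mkwgrp.pl; the function name
pick_keywords and the dictionary holding the name=value pairs of an INDEX
file are assumptions made for this sketch.

```python
def pick_keywords(index_fields, lang):
    """Return the keywords of a module for the search language `lang`.

    `index_fields` is a hypothetical dict holding the name=value pairs
    of one INDEX file.
    """
    key = f"keywords_{lang}"
    # Rule from the text: keywords_$lang if it exists, otherwise keywords.
    raw = index_fields[key] if key in index_fields else index_fields.get("keywords", "")
    return [k.strip() for k in raw.split(",") if k.strip()]


fields = {
    "keywords": "vector, linear combination",
    "keywords_it": "vettore, combinazione lineare,bersaglio",
}
print(pick_keywords(fields, "it"))  # Italian keywords
print(pick_keywords(fields, "fr"))  # falls back to the generic keywords
```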

  - the program ~/public_html/bases/sys/modindclass.pl creates the lists
    of keywords coming from the example classes in
    ~/public_html/bases/class, as well as the files author,
    description, language, level, title (no ranking is done).

* Secondly the binary program "modind" (compiled from ~/src/Misc/modind.c) reads

  -- the INDEX files of all the modules on the site
  -- the auxiliary files in ~/public_html/bases/sys/ (see description
     below)

  and produces keyword lists stored in ~wims/public_html/bases/site :
  they contain the words (or groups of words) coming from the variable
  keywords of the INDEX, but also words of the title and description
  (small words are deleted).

  "modind" also creates a serial list of all the modules available
  on the site, see ~/public_html/bases/site/serial, and calculates the
  ranking of the site's modules. The modules are classified according
  to their types: A=all (except sheets and classes), D=document, O=OEF,
  X=exercise, T=tool, R=recreation, M=data module.

  To do that, "modind" uses some dictionaries in
  ~/public_html/bases/sys/ (such as suffix.$search_lang, wgrp, ...).

  -- separately, "modind" also reads the files in
     ~/public_html/bases/sys/sheet and does the same kind of work.


2) use of index files
===========================
The script ~/public_html/modules/home/search.proc (called by the
"Search" form) reads the lists above, performs the actual search in these
lists and displays the modules found. It also reads the files of
~/public_html/bases/sys/class and ~/public_html/bases/sys/sheets.


More technical details about both stages
========================================

In both stages the files in the directory ~/public_html/bases/sys/ (see
comments below) are used to process the keywords present in the
modules' INDEX files. Each "search language" has its own series of
files.

The contents of the files in ~/public_html/bases/sys/ and of the
modules' INDEX files should be checked by developers and translators,
to improve the behaviour of the search engine.

The files in the directory ~/public_html/bases/sys/ are automatically
generated (on install) from the corresponding ".src" file in the "src"
subdirectory, if it exists.

If any of the files described below is omitted, then the corresponding
feature in the corresponding language is disabled. E.g. the files
words.fr/words.fr.src and suffix.fr/suffix.fr.src will be/have been
deleted in order to make the search engine work correctly.

(Remark: for the moment I have disabled the files words.fr.src and
suffix.fr.src by renaming them to xx_orig, so they are not used; but on a
public server the file suffix.fr.src must be deleted by hand.

Rmk (bpr): I deliberately deleted suffix.fr, as it is
incompatible with a list of words shown by completion (for example,
"loi normale" was translated into "loi norm??", I do not remember exactly;
it is impossible to offer such strings for completion, and "loi normale"
was not found). suffix.en should also be deleted.

This will be done by the script in the stable release if we are OK.)

Syntax: the lines for most of these files are in the form

==
givenword:substitute
==

=============================================================

Files
=====

words.$search_lang : corrects misprints in the search words
(used both by "mkindex" and "search.proc").

E.g. if the file words.en contains the line

==
analytical:analytic
==

then the word "analytical" is considered a misprint and any occurrence
of the string "analytical" is replaced in the search by the string
"analytic" (for the language "en").
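
A minimal Python sketch of how such "givenword:substitute" lines could be
loaded and applied to a search string. This is an illustration, not the
code used by mkindex or search.proc: the function names are hypothetical,
and the sketch assumes that the substitution is done word by word on
lower-cased input.

```python
def load_rules(lines):
    """Parse "givenword:substitute" lines into a dict."""
    rules = {}
    for line in lines:
        line = line.strip()
        if ":" not in line:
            continue  # skip blank or malformed lines
        given, substitute = line.split(":", 1)
        rules[given.strip()] = substitute.strip()
    return rules


def correct_words(query, rules):
    """Replace every word of `query` that has an entry in `rules`."""
    return " ".join(rules.get(word, word) for word in query.lower().split())


words_en = load_rules(["analytical:analytic"])
print(correct_words("Analytical geometry", words_en))
```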

Note: words.fr was deleted because it caused the search engine not to
work properly. The site manager can reactivate the functionality by
adding the file again (?? how to get the "original" files from the
svn?).

Note: the file words.en is used by the module tool/wcalc.en (see
~/public_html/modules/tool/wcalc.en/dic).

=====================

suffix.$search_lang : processes common suffixes in the search words
(used both by "mkindex" and "search.proc").

E.g. if the file suffix.en contains the line

==
ertem:meter
==

then any word ending in "metre" ("ertem" read backwards) is
replaced by the corresponding word ending in "meter" (kilometre -->
kilometer).
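
The reversed-suffix convention can be sketched in Python as follows. This
is only an illustration under assumptions: the function name is
hypothetical, and the sketch supposes that the first matching rule wins.
The second rule, "sr:r", is the suffix.en directive quoted in the indexing
example further down.

```python
def apply_suffix_rules(word, rules):
    """Rewrite the ending of `word` using reversed-suffix rules.

    `rules` maps a suffix written backwards to its replacement,
    e.g. {"ertem": "meter"}.
    """
    for reversed_suffix, replacement in rules.items():
        suffix = reversed_suffix[::-1]  # "ertem" -> "metre"
        if suffix and word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


suffix_en = {"ertem": "meter", "sr": "r"}
print(apply_suffix_rules("kilometre", suffix_en))
print(apply_suffix_rules("vectors", suffix_en))
```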

Note: suffix.fr was deleted because it caused the search engine/the
keyword completion not to work properly. The site manager can
reactivate the functionality by adding the file again.

=====================

wgrp/wgrp.$search_lang : groups of words
(these files are automatically generated, and used by "mkindex")

E.g. if the file wgrp/wgrp.en contains the line

==
affine geometry:affine geometry,
==

then the search matches the group of words "affine geometry" as a
whole: if the user searches for "affine geometry" the search
engine returns only the modules having the exact string
"affine geometry" as a keyword (if this line were not present, the search
engine would return both the modules containing the word "affine" and the
modules containing the word "geometry").

The "wgrp" files are now generated from the modules' keywords by the
script ~/public_html/bases/sys/mkwgrp.pl : whenever a module contains
multi-word keywords, these keywords are added to the wgrp files.

E.g. tool/algebra/smallgroup.fr/INDEX contains the line

keywords=group, finite group, order, subgroup, conjugacy class, center, normal subgroup, subgroup lattice

so for each multi-word keyword between two commas the
corresponding group of words is created

  finite group
  conjugacy class
  normal subgroup
  subgroup lattice

(in the corresponding language file).
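
The extraction of the groups of words from a keywords line can be sketched
in Python as below. This is not the code of mkwgrp.pl (a Perl script); the
function name word_groups is hypothetical, and the sketch only assumes
that a group of words is a comma-separated keyword containing a space.

```python
def word_groups(keywords_line):
    """Return the multi-word keywords of a "keywords=..." INDEX line."""
    _, _, value = keywords_line.partition("=")
    keywords = [k.strip() for k in value.split(",")]
    return [k for k in keywords if " " in k]


line = ("keywords=group, finite group, order, subgroup, conjugacy class, "
        "center, normal subgroup, subgroup lattice")
for group in word_groups(line):
    print(group)
```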

NOTE: there are problems when the string contains the apostrophe "'"
(e.g. "algorithme d'euclide").

=====================

indignore.$search_lang : ignored words
(used by "mkindex")

All the words listed in the file are ignored by the search engine.
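
As a rough Python illustration of the effect of such a list: the words of
a description are extracted and the ignored ones are dropped. The content
of ignored_en is invented for the example (it is not the real
indignore.en), the function name is hypothetical, and the tokenizer is
ASCII-only for brevity.

```python
import re


def index_words(text, ignored):
    """Split `text` into lower-case words and drop the ignored ones."""
    words = re.findall(r"[a-z0-9]+", text.lower())  # ASCII-only, for brevity
    return [w for w in words if w not in ignored]


ignored_en = {"a", "of", "on"}  # hypothetical content of indignore.en
print(index_words("click on a linear combination of 2D vectors.", ignored_en))
```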

=====================

abuse.$search_lang : swearwords to be ignored by the search engine
(used by ??)

=====================

andor.$search_lang : conjunctions ("and", "or") to be ignored by the
search engine

The file andor.xx is mentioned in src/insmath.c (processing of logic
statements in math formulas), but for the moment it is used by no
module (to be used, a module must have insmath_logic=yes, which does not
exist in any public module as far as I know).

=====================

keywords.fr : ??
(used by ??) should be deleted

=======================================================


Some indexing examples
======================

U1/algebra/vecshoot.en

As this is an exercise module it is indexed in the lists A.$lang (All)
and X.$lang (eXercise).

This is a multilanguage module (main language "en", translation
language "it").

The INDEX file contains the following (nonempty) lines

title=Vector shoot
description=click on a linear combination of 2D vectors.
language=en
category=exercise
domain=algebra, linear algebra
level=H4,H5,H6,U1,U2
keywords=vector, linear combination
scoring=yes
copyright=© 1998- (<a href="COPYING">GNU GPL</a>) 2013
author=XIAO,Gang
address=xiao@unice.fr
version=2.20
wims_version=4.05a
translation_language=it
title_it=Colpisci i vettori
description_it=individuare una combinazione lineare di vettori 2D.
keywords_it=vettore, combinazione lineare,bersaglio
translator_it=Anna, Lucci
translator_address_it=anna.lucci@gmail.it

In stage 1 the module is given a serial number (which depends on the
modules actually available on each site; on my site the serial number
is "1003"). As the distribution also includes the modules
U1/algebra/vecshoot.cn (1002) and U1/algebra/vecshoot.fr (1004), which
are the translations of this module into "cn" and "fr"
respectively, A.cn/X.cn and A.fr/X.fr contain no reference to this
module (1003) but only references to the corresponding
translated module (1002 resp. 1004). --> HELP there is no A.cn file!!

The file A.en contains the following lines related to this module.

?? (...?2 is the ranking, why do we sometimes have ....?4 )
(ER: It is a weight -- see the name of the variable in modind.c -- giving more importance to the title words: 4 if the word appears in the module title, 2 otherwise)

2d:1003?2                     from description and description_it
algebra:1003?2                from domain
bersaglio:1003?2              from keywords_it
click:1003?2                  from description
combination:1003?2            from description (_not_ from keywords)
combinazione:1003?2           from description_it
combinazione lineare:1003?2   from keywords + wgrp.en
gang:1003?2                   from author
levelh4:1003?2                from level=h4 (and so on)
levelh5:1003?2
levelh6:1003?2
levelu1:1003?2
levelu2:1003?2
linear:1003?2                 from description
linear algebra:1003?2         from keywords
linear combination:1003?2     from keywords
lineare:1003?2                from description_it
shoot:1003?4                  from title
vector:1003?4                 from title + description
                              (vectors --> vector because of
                              directive "sr:r" in suffix.en)
vettore:1003?2                from keywords_it
xiao:1003?2                   from author
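
The entries listed above have the shape word:serial?weight. A small Python
sketch for reading one such entry is given here as an illustration. The
function name is hypothetical, and the sketch handles only the form shown
in this example: in the real files a word is followed by a list of
modules, which is omitted in this document, so its exact layout is not
assumed here.

```python
def parse_entry(line):
    """Split "word:serial?weight" into (word, serial, weight)."""
    word, _, ref = line.partition(":")
    serial, _, weight = ref.partition("?")
    return word, int(serial), int(weight)


print(parse_entry("vector:1003?4"))
print(parse_entry("linear combination:1003?2"))
```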

The file A.it contains the following lines related to this module.

(NOTE: the only difference in the keywords is that A.it also contains
"vectors"; the other difference is in the list of modules for each
keyword, which I omitted to keep this example clear)

2d:1003?2
algebra:1003?2
bersaglio:1003?2
click:1003?2
combination:1003?2
combinazione:1003?2
combinazione lineare:1003?2
gang:1003?2
levelh4:1003?2
levelh5:1003?2
levelh6:1003?2
levelu1:1003?2
levelu2:1003?2
linear:1003?2
linear algebra:1003?2
linear combination:1003?2
lineare:1003?2
shoot:1003?4
vector:1003?4
vectors:1003?2                no corresponding line in A.en because
                              of the directive in suffix.en
vettore:1003?2
xiao:1003?2

NOTE: title_it is missing from the index: you cannot find the module
by searching for its Italian title.

The files A.$lang for languages other than the above also contain lines
related to this module.

E.g. A.nl

2d:
algebraisch:             directive "algebra:algebraisch" in words.nl
bersaglio:
clicking:                directive "click:clicking" in words.nl
combinaison:             "combination:combinaison" in words.nl
combinazione:
combinazione lineare:
gang:
levelh4:
levelh5:
levelh6:
levelu1:
levelu2:
lineare:
linearly:                "linear:linearly" in words.nl
niet:                    "on:niet" in words.nl
ofwel:                   "of:ofwel"
shooting:                "shoot:shooting"
vector:
vettore:
xiao:

The wgrp groups "linear algebra" and "linear combination" are missing
because of the directive "linear:linearly" in words.nl, which is
executed before wgrp (?? check).

Note: ?? words.nl contains both the line algebra:algebraisch and
algebraisch:algebra ?? (and more similar pairs)

E.g. A.de

Almost the same as A.en except for the lines "vectors" (suffix.en) and
"vector shoot" (WHY??). There is no "wgrp.de" file.

2d:
algebra:
bersaglio:
click:
combination:
combinazione:
combinazione lineare:
gang:
levelh4:
levelh5:
levelh6:
levelu1:
levelu2:
linear:
linear algebra:
linear combination:
lineare:
shoot:
vector:
vector shoot:            WHY???
vectors:                 cfr. A.it
vettore:
xiao:


====================================

In popup.fr, I also changed the way the keywords are used, for an analogous
reason (I have not done it in popup.$lang for $lang != fr).

The file suffix.fr was also used by wcalc.fr; for compatibility
with popup on the external web pages, I keep it (so it is copied
into the wcalc.fr module).

Be careful (MC: I know, I hope it is better now with the example): "keywords" has two meanings here:
- the Perl script takes only the words in the variable keywords
  (so only these are in the completion list)
- modind.c creates the files A.$lang etc., which are based on the words of keywords,
  title, description. Not all of them are in the "completion list",
  but they can be typed and found by the search engine.



Technical things about modind.c (ER: just to avoid forgetting work in progress)
===============================

The tasks done are, in order:

- prep() : * replaces, if possible, the default language list (defined at the top of the file)
             by the list of languages installed on the server.
           * gets the list of all modules prepared by a previous script
           * opens the files bases/site2/author|description|language|...

- modules() : for each language{for each module{extract information}}.

- clean() : closes the files bases/site2/author|description|language|...

- sprep(),sheets() : idem for sheets.


Extracting information from one module for a given language (function onemodule):

- writes the author, description, language, etc. information to each corresponding file
  bases/site2/author|description|language|...

- normalizes the data (removes uppercase, accents, apostrophes, plurals)
  according to the dictionaries, to get a normalized author, description, title, etc.
  This is done in the loop for(i=0;i<trcnt;i++){...}

- transforms the (normalized) title into words (commas are changed to spaces)
  and appends each word with weight 4 using the function appenditem.
  The arguments are the word itself, the language currently being processed, the serial number of the module,
  the weight=4, and the module language.

- puts all the information other than the title (description, keywords, foreign titles, author...)
  in a buffer, transforms it into words and appends them as above, except that weight=2.

  BUG ? : in this process, i_keywords_fr is used twice; probably the first one should be i_keywords_en, to be checked.

- the 2 preceding points (treatment of the title and of the other information) are repeated, with the difference
  that the transformation into words is replaced by a translation:
  the commas are kept, but some usual words are deleted.
  BUG ? : Another difference is that part of the "information other than title" is missing,
  for instance the foreign titles, require, author.

ER : I don't know why the process is repeated: one should look at appenditem to see where it is appended; maybe the second time it goes somewhere else.
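
The weighting described above (4 for title words, 2 for the rest) can be
sketched in Python as follows. This is a simplified illustration, not a
transcription of modind.c: the function name weighted_words is
hypothetical, the normalization step, the word groups and the second
(translation) pass are left out, and 1003 is the serial number taken from
the vecshoot example.

```python
import re

TITLE_WEIGHT = 4   # word appears in the module title
OTHER_WEIGHT = 2   # word appears only in the other fields


def weighted_words(title, other_texts):
    """Map each word to its weight: 4 for title words, 2 otherwise."""
    weights = {}
    for text in other_texts:
        for word in re.findall(r"\w+", text.lower()):
            weights[word] = OTHER_WEIGHT
    # Title words are processed last so that they keep the higher weight.
    for word in re.findall(r"\w+", title.lower()):
        weights[word] = TITLE_WEIGHT
    return weights


weights = weighted_words(
    "Vector shoot",
    ["click on a linear combination of 2D vectors.", "vector, linear combination"],
)
for word in ("shoot", "vector", "click"):
    print(f"{word}:1003?{weights[word]}")
```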


===============================