WIMS' search engine and als
===========================

WIMS' search engine works in two stages:

1) update of the index files when server data changes (module added...),
   typically once a day.
2) use of the index files at each user's request to find some activities


Here are some details:

1) update of index files
===========================
A series of scripts creates a set of auxiliary files (generally
stored in ~/public_html/bases/sys/, see description further down) and
a list of "keywords" (stored in ~/public_html/bases/site/).

The scripts must be run in the order given here, as some files
created in earlier stages are used in subsequent stages. In general
the whole process is run by the script ~/bin/mkindex.

* Firstly a series of 3 perl scripts (mkdomain, mkwgrp, modindclass),
  that ~/bin/mkindex.sh calls via ~/public_html/bases/sys/mkindex.sh:

- the program ~/public_html/bases/sys/mkdomain.pl creates the lists
  of domains from the graph in domain/domain with its translations
  (domain/domain.$lang) and in json format (english) to be used for
  completion in modtool properties

- the perl program ~/public_html/bases/sys/mkwgrp.pl reads the INDEX
  files of all the modules on the site and generates

  - keywords (in .json format, to be used for completion in the search
    engine)
  - the files in wgrp

  (using keywords and keywords_$lang in the INDEX files, according
  to this rule: take keywords_$lang if it exists, otherwise keywords,
  whether or not it is a $lang-module).

  Some files are created in keywords as keywords/algebra.fr.tmp, but
  are not used for the moment. The keywords in these "keywords files"
  are exactly those in the variable keywords_$lang if it exists,
  otherwise those in keywords (same rule as above).

- the program ~/public_html/bases/sys/modindclass.pl creates the lists
  of keywords coming from the example classes in
  ~/public_html/bases/class as well as the files author,
  description, language, level, title (no ranking is done).
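
The rule "keywords_$lang if it exists, otherwise keywords" can be
sketched as follows. This is only an illustration in Python (mkwgrp.pl
itself is a perl script and is not reproduced here); the function name
pick_keywords and the dictionary of INDEX fields are assumptions made
for the example.

```python
def pick_keywords(index_fields, lang):
    """Return the keywords for `lang`: keywords_<lang> if present, else keywords."""
    key = "keywords_" + lang
    raw = index_fields[key] if key in index_fields else index_fields.get("keywords", "")
    # Keywords are separated by commas; surrounding blanks are dropped.
    return [item.strip() for item in raw.split(",") if item.strip()]


# Hypothetical excerpt of an INDEX file, already parsed into a dictionary.
index_fields = {
    "keywords": "vector, linear combination",
    "keywords_it": "vettore, combinazione lineare,bersaglio",
}
print(pick_keywords(index_fields, "it"))  # Italian keywords exist, so they are used
print(pick_keywords(index_fields, "fr"))  # no keywords_fr, so keywords is used
```

The same lookup applies whatever the language of the module itself is.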

* Secondly the binary program "modind" (compiled from ~/src/Misc/modind.c) reads

  -- the INDEX files of all the modules on the site
  -- the auxiliary files in ~/public_html/bases/sys/ (see description
     below)

  and produces keyword lists stored in ~wims/public_html/bases/site:
  they contain the words (or groups of words) coming from the variable
  keywords of the INDEX but also words of the title and description
  (small words are deleted).

  "modind" also creates a serial list of all the modules available
  on the site, see ~/public_html/bases/site/serial, and calculates the
  ranking of the site's modules. The modules are classified according
  to their types: A=all (except sheets and classes), D=document, O=OEF,
  X=exercise, T=tool, R=recreation, M=data module.

  To do that, "modind" uses some dictionaries in
  ~/public_html/bases/sys/ (such as suffix.$search_lang, wgrp, ...)

  -- separately, "modind" also reads the files in
  ~/public_html/bases/sys/sheet and does the same kind of work


2) use of index files
===========================
The script ~/public_html/modules/home/search.proc (called by the
"Search" form) reads the lists above, performs the actual search in
these lists and displays the modules found. It also reads the files in
~/public_html/bases/sys/class and ~/public_html/bases/sys/sheets


More technical details about both stages
========================================

In both stages, the files in the directory ~/public_html/bases/sys/ (see
comments below) are used to process the keywords present in the
modules' INDEX files.  Each "search language" has its own series of
files.

The contents of the files in ~/public_html/bases/sys/ and of the
modules' INDEX files should be checked by developers and translators,
to improve the behaviour of the search engine.

The files in the directory ~/public_html/bases/sys/ are automatically
generated (on install) from the corresponding ".src" file in the "src"
subdirectory, if it exists.

If any of the files described below is omitted, then the corresponding
feature in the corresponding language is disabled. E.g. the files
words.fr/words.fr.src and suffix.fr/suffix.fr.src will be/have been
deleted in order to make the search engine work correctly.

  (Remark: for the moment I have disabled the files words.fr.src and
  suffix.fr.src by renaming them xx_orig, so they are not used; but on a
  public server the file suffix.fr.src must be deleted by hand. This
  will be done by the script in the stable release if we are OK.

  Rmk (bpr): I deliberately deleted suffix.fr as it is
  incompatible with the list of words shown by completion (for example,
  "loi normale" was translated into "loi norm??", I do not remember
  exactly; it is impossible to offer such things for completion, and
  "loi normale" was not found).  suffix.en should also be deleted.)

Syntax: the lines for most of these files are in the form

==
givenword:substitute
==

=============================================================

Files
=====

words.$search_lang : correct misprints in the search words
(used both by "mkindex" and "search.proc").

E.g. if the file words.en contains the line

==
analytical:analytic
==

then the word "analytical" is considered a misprint and any occurrence
of the string "analytical" is replaced in the search by the string
"analytic" (for the language "en")
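
A minimal sketch of how such "givenword:substitute" lines could be
applied to a search string, in Python. The function names load_rules
and apply_words are assumptions made for this example, and matching on
whole words separated by blanks is also an assumption: the actual
behaviour of "mkindex" and "search.proc" may differ.

```python
def load_rules(text):
    """Parse 'givenword:substitute' lines into a dict, skipping malformed lines."""
    rules = {}
    for line in text.splitlines():
        given, sep, substitute = line.strip().partition(":")
        if sep and given and substitute:
            rules[given] = substitute
    return rules


def apply_words(rules, query):
    """Lowercase the query and replace every word that has a rule."""
    return " ".join(rules.get(word, word) for word in query.lower().split())


rules = load_rules("analytical:analytic\n")
print(apply_words(rules, "Analytical geometry"))  # analytic geometry
```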

Note: words.fr was deleted because it caused the search engine not to
work properly. The site manager can reactivate the functionality by
adding the file again (?? how to get the "original" files from the
svn?).

Note: the file words.en is used by the module tool/wcalc.en (see
~/public_html/modules/tool/wcalc.en/dic )

=====================

suffix.$search_lang : process common suffixes in the search words
(used both by "mkindex" and "search.proc").

E.g. if the file suffix.en contains the line

==
ertem:meter
==

then any word ending in "metre" ("ertem" read backwards) is
replaced by the corresponding word ending in "meter" (kilometre -->
kilometer)
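
The reversed-suffix convention can be illustrated by the following
Python sketch. The function name apply_suffix is an assumption made for
this example, as is the choice to apply only the first matching rule;
the left-hand side is read backwards and the right-hand side forwards,
as in the "ertem:meter" example.

```python
def apply_suffix(rules, word):
    """Apply the first matching rule; rule keys are suffixes written backwards."""
    for reversed_suffix, replacement in rules:
        suffix = reversed_suffix[::-1]  # "ertem" -> "metre"
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


# Two directives in the form used by suffix.en.
rules = [("ertem", "meter"), ("sr", "r")]
print(apply_suffix(rules, "kilometre"))  # kilometer
print(apply_suffix(rules, "vectors"))    # vector
```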

Note: suffix.fr was deleted because it caused the search engine/the
keyword completion not to work properly. The site manager can
reactivate the functionality by adding the file again.

=====================

wgrp/wgrp.$search_lang : groups of words
(these files are automatically generated, and used by "mkindex")

E.g. if the file wgrp/wgrp.en contains the line

==
affine geometry:affine geometry,
==

then the search matches the group of words "affine geometry" as a
whole: if the user searches for "affine geometry" the search
engine returns only the modules containing as keyword the exact string
"affine geometry" (if this line were not present the search engine
would return both the modules containing the word "affine" and the
modules containing the word "geometry").

The "wgrp" files are now generated from the modules' keywords by the
script ~/public_html/bases/sys/mkwgrp.pl: whenever a module contains
multiple-word keywords, such keywords are added to the wgrp files.

E.g. tool/algebra/smallgroup.fr/INDEX contains the line

keywords=group, finite group, order, subgroup, conjugacy class, center, normal subgroup, subgroup lattice

so for each multiple-word keyword between two commas the
corresponding group of words is created

finite group
conjugacy class
normal subgroup
subgroup lattice

(in the corresponding language file)
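
The extraction of word groups from a keywords line can be sketched in
Python as follows. The function name word_groups is an assumption made
for this example, and the printed "group:group," form simply imitates
the wgrp.en line quoted above; this is not the code of mkwgrp.pl.

```python
def word_groups(keywords_value):
    """Return the comma-separated keywords that contain more than one word."""
    groups = []
    for item in keywords_value.split(","):
        words = item.split()
        if len(words) > 1:
            groups.append(" ".join(words))
    return groups


line = ("group, finite group, order, subgroup, conjugacy class, center, "
        "normal subgroup, subgroup lattice")
for group in word_groups(line):
    print(f"{group}:{group},")
```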

NOTE: there are problems when the strings contain an apostrophe "'"
(e.g. "algorithme d'euclide")

=====================

indignore.$search_lang : ignored words
(used by "mkindex")

All the words listed in the file are ignored by the search engine.

=====================

abuse.$search_lang : swearwords to be ignored by the search engine
(used by ??)

=====================

andor.$search_lang : conjunctions ("and", "or") to be ignored by the
search engine

The file andor.xx is mentioned in src/insmath.c (processing logic
statements in math formulas) but for the moment it is used by no
module (to be used, one must have insmath_logic=yes, which does not
exist in any public module as far as I know).


=====================

keywords.fr : ??
(used by ??) should be deleted

=======================================================


Some indexing examples
======================

U1/algebra/vecshoot.en

As this is an exercise module it is indexed in the lists A.$lang (All)
and X.$lang (eXercise).

This is a multilanguage module (main language "en", translation
language "it").

The INDEX file contains the following (nonempty) lines

  title=Vector shoot
  description=click on a linear combination of 2D vectors.
  language=en
  category=exercise
  domain=algebra, linear algebra
  level=H4,H5,H6,U1,U2
  keywords=vector, linear combination
  scoring=yes
  copyright=&copy; 1998- (<a href="COPYING">GNU GPL</a>) 2013
  author=XIAO,Gang
  address=xiao@unice.fr
  version=2.20
  wims_version=4.05a
  translation_language=it
  title_it=Colpisci i vettori
  description_it=individuare una combinazione lineare di vettori 2D.
  keywords_it=vettore, combinazione lineare,bersaglio
  translator_it=Anna, Lucci
  translator_address_it=anna.lucci@gmail.it

In stage 1 the module is given a serial number (depending on the
modules actually available on each site; on my site the serial number
is "1003"). As the distribution also includes the modules
U1/algebra/vecshoot.cn (1002) and U1/algebra/vecshoot.fr (1004) that
correspond to translations of this module into "cn" and "fr"
respectively, the files A.cn/X.cn and A.fr/X.fr contain no reference to
this module (1003) but only references to the corresponding
translated module (1002 resp. 1004). --> HELP there is no A.cn file!!

The file A.en contains the following lines related to this module.

?? (...?2 is the ranking, why do we sometimes have ....?4 )
(ER : It is a weight -- see name of variable in modind.c -- giving more
importance to the title words : 4 if the word appears in the module
title, 2 otherwise)

2d:1003?2                           from description and description_it
algebra:1003?2                      from domain
bersaglio:1003?2                    from keywords_it
click:1003?2                        from description
combination:1003?2                  from description (_not_ from keywords)
combinazione:1003?2                 from description_it
combinazione lineare:1003?2         from keywords + wgrp.en
gang:1003?2                         from author
levelh4:1003?2                      from level=h4 (and so on)
levelh5:1003?2
levelh6:1003?2
levelu1:1003?2
levelu2:1003?2
linear:1003?2                       from description
linear algebra:1003?2               from keywords
linear combination:1003?2           from keywords
lineare:1003?2                      from description_it
shoot:1003?4                        from title
vector:1003?4                       from title + description
                                    (vectors --> vector because of
                                    directive "sr:r" in suffix.en)
vettore:1003?2                      from keywords_it
xiao:1003?2                         from author
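
To make the "word:serial?weight" layout of these lines explicit, here
is a small Python sketch that splits one such line into its three
parts. The function name parse_index_line is an assumption made for
this example; it only handles lines with a single module reference, as
shown in this listing (the real files list several modules per word,
in a format not reproduced here).

```python
def parse_index_line(line):
    """Split 'word:serial?weight' into (word, serial, weight)."""
    # The word may contain blanks ("linear algebra"), so split on the last ':'.
    word, _, rest = line.strip().rpartition(":")
    serial, _, weight = rest.partition("?")
    return word, int(serial), int(weight)


for line in ["shoot:1003?4", "linear algebra:1003?2"]:
    word, serial, weight = parse_index_line(line)
    origin = "title word" if weight == 4 else "other field"
    print(f"{word!r} -> module {serial}, weight {weight} ({origin})")
```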

The file A.it contains the following lines related to this module.

(NOTE: the only difference in keywords is that A.it contains the
keyword "vectors"; otherwise the only difference is in the list of
modules, which I omitted to keep this example clear)

2d:1003?2
algebra:1003?2
bersaglio:1003?2
click:1003?2
combination:1003?2
combinazione:1003?2
combinazione lineare:1003?2
gang:1003?2
levelh4:1003?2
levelh5:1003?2
levelh6:1003?2
levelu1:1003?2
levelu2:1003?2
linear:1003?2
linear algebra:1003?2
linear combination:1003?2
lineare:1003?2
shoot:1003?4
vector:1003?4
vectors:1003?2                      no counterpart in A.en because
                                    of the directive in suffix.en
vettore:1003?2
xiao:1003?2

NOTE: title_it is missing from the index: you cannot find the module
by searching for its Italian title

The files A.$lang for languages different from the above also contain
lines related to this module.

E.g. A.nl

2d:
algebraisch:                    directive "algebra:algebraisch" in words.nl
bersaglio:
clicking:                       directive "click:clicking" in words.nl
combinaison:                    "combination:combinaison" in words.nl
combinazione:
combinazione lineare:
gang:
levelh4:
levelh5:
levelh6:
levelu1:
levelu2:
lineare:
linearly:                       "linear:linearly" in words.nl
niet:                           "on:niet" in words.nl
ofwel:                          "of:ofwel"
shooting:                       "shoot:shooting"
vector:
vettore:
xiao:

The wgrp groups "linear algebra" and "linear combination" are missing
because of the directive "linear:linearly" in words.nl, which is
executed before wgrp (?? check).

Note: ?? words.nl contains both the line algebra:algebraisch and the
line algebraisch:algebra ?? (and more similar pairs)

E.g. A.de

Almost the same as A.en except for the lines "vectors" (suffix.en) and
"vector shoot" (WHY??). There is no "wgrp.de" file.

2d:
algebra:
bersaglio:
click:
combination:
combinazione:
combinazione lineare:
gang:
levelh4:
levelh5:
levelh6:
levelu1:
levelu2:
linear:
linear algebra:
linear combination:
lineare:
shoot:
vector:
vector shoot:                   WHY???
vectors:                        cfr. A.it
vettore:
xiao:


====================================

In popup.fr, I also changed the way the keywords are used, for an
analogous reason (I have not done it in popup.$lang for $lang != fr).

The file suffix.fr was also used by wcalc.fr; for compatibility
with popup on the external web pages, I keep it (copied into the
wcalc.fr modules).

Be careful (MC: I know, I hope it is better now with the example): "keywords" has two meanings here:
  - the perl script takes only the words in the variable keywords
  (so only these are in the completion list)
  - modind.c creates the files A.$lang etc., which are based on the words of keywords,
  title and description. Not all of them are in the "completion list",
  but they can all be typed and found by the search engine.


Technical things about modind.c (ER. just to avoid forgetting work in progress)
===============================

The tasks done are, in order:

- prep() : * replaces if possible the default language list (defined at top of file)
             by the list of languages installed on the server.
           * gets the list of all modules prepared by a previous script
           * opens files bases/site2/author|description|language|...

- modules() : for each language{for each module{extract information}}.

- clean() : closes files bases/site2/author|description|language|...

- sprep(),sheets() : the same for sheets.



Extracting information from one module for a given language (function onemodule):

- writes author, description, language, etc. information in each corresponding file
  bases/site2/author|description|language|...

- normalizes data (removes uppercase, accents, apostrophes, plurals)
  according to the dictionary, to get normalized author, description, title, etc.
  This is done in the loop for(i=0;i<trcnt;i++){...}

- transforms the (normalized) title into words (changes commas to spaces)
  and appends each word with weight 4 using the function appenditem.
  The arguments are the word itself, the language currently treated, the serial number of the module,
  the weight=4, and the module language.

- puts all information other than the title (description, keywords, foreign titles, author...)
  in a buffer, transforms it into words and appends them as above, except that weight=2.

  BUG ? : in this process, i_keywords_fr is used twice, probably the first one should be i_keywords_en, to be checked.

- the 2 preceding points (treatment of title and other info) are repeated with the difference
  that the transformation into words is replaced by a translation:
  the commas are kept, but some usual words are deleted.
  BUG ? : Another difference is that part of the "information other than title" is missing,
          for instance the foreign titles, require, author.

ER : I don't know why the process is repeated : should look at appenditem to see where it is appended, maybe the second time is somewhere else.
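
The weighting of title words against other words can be pictured with
the following Python sketch. It is not a transcription of modind.c: the
names to_words and index_module are assumptions made for this example,
the normalization by dictionaries and the second (translation) pass are
left out, and keeping the larger weight when a word occurs both in the
title and elsewhere is an assumption based on the "4 if the word
appears in the module title, 2 otherwise" remark.

```python
TITLE_WEIGHT = 4  # words of the module title
OTHER_WEIGHT = 2  # words of every other field


def to_words(text):
    """Lowercase the text and split it into words, treating commas as blanks."""
    return text.lower().replace(",", " ").split()


def index_module(serial, fields):
    """Return sorted 'word:serial?weight' lines for one module."""
    entries = {}
    for name, value in fields.items():
        weight = TITLE_WEIGHT if name == "title" else OTHER_WEIGHT
        for word in to_words(value):
            # Assumption: a word seen in the title keeps the title weight.
            entries[word] = max(entries.get(word, 0), weight)
    return [f"{word}:{serial}?{weight}" for word, weight in sorted(entries.items())]


# Hypothetical subset of the fields of an INDEX file.
fields = {
    "title": "Vector shoot",
    "keywords": "vector, linear combination",
    "author": "XIAO,Gang",
}
for line in index_module(1003, fields):
    print(line)
```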


===============================