[Systers-dev] GSoC 2012 project

Robin Jeffries robin at jeffries.org
Thu Apr 5 09:13:49 PDT 2012


Ahh, the missing links help a lot.  Let me echo Priya's request for better
spacing in your emails.


On Thu, Apr 5, 2012 at 8:38 AM, Danci Emanuel <danci_emanuel at yahoo.com> wrote:

>
>  Hello Robin!
>
> Firstly, I'm sorry for the confusing message, but I saw that the added
> links did not work properly only after posting the message. Here is the
> version with the working links:
>
>   After doing some further reading I reached some conclusions and proposals
> that I will post here. It's possible that Priya might have stumbled upon
> these solutions, and if this is so, please let me know. For improving the
> search engine I found three possible solutions:
>
>   1. Integrating the htdig search engine with Mailman -> Link -
> http://www.openinfo.co.uk/mm/patches/444884/ - (This is only a possible
> solution, because I do not know exactly what searching capabilities this
> patch could provide).
>
>   2. Replacing pipermail with MHonArc - http://www.mhonarc.org/ - and
> building a custom search engine on top of it. As it says here -
> http://www.mhonarc.org/MHonArc/doc/faq/usage.html#searching - other
> open-source search engines have been previously used with MHonArc. One of
> the compatible search engines is Lucene - http://lucene.apache.org/ - and
> the advantage of using it is that last year I did some research related to
> open-source search engines because I needed to create a search engine for a
> library application that stores information (using MySQL databases) for
> over 25,000 books (I know it's not a big number, but without a search
> engine it was working pretty slowly because I made the application work
> similarly to Google Instant). Moreover, open-source search engines like
> Lucene, Solr or Sphinx provide highly configurable options. Some good
> comparisons between some of the most used open-source search engines:
> http://stackoverflow.com/questions/1284083/choosing-a-stand-alone-full-text-search-server-sphinx-or-solr
> http://stackoverflow.com/questions/737275/comparison-of-full-text-search-engine-lucene-sphinx-postgresql-mysql
> http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/
>
>   3. Why not replace the pipermail archiver with a relational database
> (in this - http://marc.info/?q=about - article, at the paragraph 'About
> its present form' a similar idea is presented)? We could create a solid
> database design and thus we could sort the data by subject, by date or by
> any other field. Furthermore, we can use one of the open-source search
> engines and this way we could basically search for anything in the
> database. Currently I'm thinking about storing a thread id number with
> every message that belongs to a certain thread; then, when searching for a
> keyword in the database, it would be very easy to return the whole
> conversation (just select all the messages that have the same thread id as
> the message in which the word was found).
>
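
A minimal sketch of the thread-id idea in item 3, assuming a single SQLite
table. The table name `messages`, its columns, and both helper names are
hypothetical; none of them come from pipermail, Mailman or hyperkitty. A
keyword match on any message pulls back every message that shares its
thread id:

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a few made-up messages."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages ("
        " id INTEGER PRIMARY KEY,"
        " thread_id INTEGER NOT NULL,"
        " subject TEXT,"
        " body TEXT)"
    )
    rows = [
        (1, 10, "GSoC 2012 project", "Which archiver should we use?"),
        (2, 10, "Re: GSoC 2012 project", "Look at hyperkitty first."),
        (3, 20, "Unit tests", "We expect tests from everyone."),
    ]
    conn.executemany("INSERT INTO messages VALUES (?, ?, ?, ?)", rows)
    return conn


def conversations_containing(conn, keyword):
    """Return (id, thread_id, subject) for every message in any thread
    where at least one message body mentions the keyword."""
    sql = (
        "SELECT id, thread_id, subject FROM messages "
        "WHERE thread_id IN ("
        "  SELECT thread_id FROM messages WHERE body LIKE ?"
        ") ORDER BY thread_id, id"
    )
    return conn.execute(sql, ("%" + keyword + "%",)).fetchall()
```

The LIKE scan reads every message body, so this only illustrates the thread
lookup; it says nothing about how to make the keyword search itself fast.
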
>  For the dynamic lists project I would like to know if the current system
> used by Systers is stable and if my only task would be to fully integrate
> it with Mailman 3.0 or if my task will be to come up with a new proposal
> and implementation of the system. If my task consists of only integrating
> the existing system with Mailman 3.0, will all I have to do be to
> understand well how the code works, including the custom changes made by
> you, and find a way to successfully integrate it with Mailman 3.0?
>
>   For now I would like to know your opinion regarding the presented
> ideas, if they are good enough to be used and if I should start writing my
> application and focus only on extending the search capabilities and
> implementing the dynamic list with Mailman 3.0, leaving the UI part
> aside. I also thought about the possibility of creating a nice UI for
> Mailman using Django, but for now, as I said, I would like to know if the
> presented ideas are good enough to be accepted and if I should focus only
> on them. Your advice is very valuable because you can evaluate the work
> volume much better than I can, and I do not want to end up over-promising
> and under-delivering.
>
>   In regard to the timeline, I could definitely do some work during the
> bonding period in order to make up for the final exams period so that I
> would be fine at the mid point. Furthermore, I do not have any problems
> with re-planning if that becomes necessary.
>
>   I am looking forward to hearing your advice!
>
>
> Secondly, thanks for the clarifications you have provided. I was asking
> all these questions because I have started the application process pretty
> late this year and I wanted to make sure I understand the system and
> the requirements as clearly as possible. In regard to the possibility of
> searching for information in a large database, in a comment on one of the
> posts at this link -
> http://stackoverflow.com/questions/737275/comparison-of-full-text-search-engine-lucene-sphinx-postgresql-mysql -
> one of the users talks about performance, saying that if the index is
> correctly built, an advanced query over millions of records can be done in
> just a couple of milliseconds (one of the key factors being the data
> structures that they use - ternary trees, binary trees, tries etc).
> Although I think it would have been a lot of fun and interesting to work
> on the search project, given the fact that there is very little time left
> for writing the application, I will only apply for the project
> 'Integrating Dynamic Sub-Lists with Mailman 3.0'. The question that I have
> in regard to this project is:
>  1. When integrating the dlists with Mailman 3.0, will the PostgreSQL
> database keep the same structure, or do we have to make any changes to it?
> I am asking this question in order to know if I have to add a separate week
> in the timeline for working on the database.
>

My current assumption is that the database would not change, but if you
needed to add new fields or a new table, that would be OK. (We need to look
a bit more to see if there are other mailman features that might be
integrated with this.)  If you do change the database, you need to think
about how to convert all the existing lists (on systers.org).

> Here is a sketch for the timeline:
>  Week 1, 2, 3, 4  [May 21 - June 17] - Read the documentation and the
> code in order to get to know how Mailman works at a deeper level.
>

This seems more than generous to me.  I think a week, maybe 2 if there are
additional MM features that we would want to integrate dlists with.  And
most of this is supposed to happen during the community bonding period.


> Week 5, 6 [June 18 - July 1] - Read the documentation and the code for the
> changes that Systers have made for the dynamic sub-lists.
>

I was assuming this would go on in parallel with the first.  You do realize
that GSoC is like a full-time job -- I could read the systers code aloud as
a bedtime story to a child in about 2 hours, and another 30 minutes for the
documentation (which is a sad statement about our documentation).

> Week 7 [July 2 - July 8] - Create the design for the changes that have to
> be implemented in Mailman's structure in order for the dynamic lists to
> properly work with Mailman 3.0
>

Sounds good.

>  Week 8, 9, 10, 11 [July 9 - August 5] - Implement the changes + submit
> the project for the mid-term evaluation.
>

I believe the mid-term evaluation is around July 9.  So you have to have
something more than a design by then.

I don't see any mention of unit tests.  We are going to expect them of
everyone.

I hope that 4 weeks is generous for coding and writing tests; of course, I
don't know your coding skills and neither of us knows exactly what is
involved (I have not read the mailman 3.0 code at all).


> Week 12 [August 6 - August 12] - Additional testing and solving
> unexpected/related things that could come up.
>

Good to have.

You should also plan to
    - go through the process to get this submitted back into mailman core
    - provide patches for the 3 pieces to work in mailman 3.0 in case it
      doesn't make it into the core for the first round.  We will most likely
      accept someone else to create the mailman 2.0 patch process -- with luck
      you can work off her/his work and just have to create the actual patches,
      not do the surrounding work.


> Week 13 [August 13 - August 20] - Final project documentation and wrapping
> the code for the final evaluation.
>

Also a good idea.

>
> Does it look fine to you? Should I change/add/remove something?
>
> Thank you,
> Emanuel Danci
>    ------------------------------
> *From:* Robin Jeffries <robin at jeffries.org>
> *To:* Danci Emanuel <danci_emanuel at yahoo.com>
> *Cc:* "systers-dev+eligibility at systers.org" <
> systers-dev+eligibility at systers.org>
> *Sent:* Thursday, April 5, 2012 1:11 AM
>
> *Subject:* Re: [Systers-dev] GSoC 2012 project
>
> On Wed, Apr 4, 2012 at 2:19 PM, Danci Emanuel <danci_emanuel at yahoo.com
> >wrote:
>
> > Hello!
> >
> > After doing some further reading I reached some conclusions and proposals
> > that I will post here. It's possible that Priya might have stumbled upon
> > these solutions, and if this is so, please let me know. For improving the
> > search engine I found three possible solutions:
> > 1. Integrating the htdig search engine with Mailman -> Link (This is only
> > a possible solution, because I do not know exactly what searching
> > capabilities this patch could provide).
> >
>
> I didn't understand this
>
>
> > 2. Replacing pipermail with MHonArc and building a custom search engine
> > on top of it. As it says here other open-source search engines have been
> > previously used with MHonArc. One of the compatible search
> > engines is Lucene and the advantage of using it is that last year I did
> > some research related to open-source search engines because I needed to
> > create a search engine for a library application that stores information
> > (using MySQL databases) for over 25,000 books (I know it's not a big
> > number, but without a search engine it was working pretty slowly because
> > I made the application work similarly to Google Instant). Moreover,
> > open-source search engines like Lucene, Solr or Sphinx provide highly
> > configurable options (here, here and here are some good comparisons
> > between some of the most used open-source search engines).
> >
>
> You need to look at hyperkitty https://github.com/syst3mw0rm/HyperKitty,
> which is the core of the archive mailman plans to use.  I believe that one
> of the GSoC students in 2010 (I don't think it was Priya, but that was a
> long time ago....) looked at Lucene. Look at the student projects for
> 2010.  You might be able to add one of these to hyperkitty -- you would not
> need to commit to a specific one in your proposal, but talk about what you
> need to investigate to decide on one.
>
> > 3. Why not replace the pipermail archiver with a relational database (in
> > this article, at the paragraph 'About its present form' a similar idea is
> > presented)? We could create a solid database design and thus we could
> > sort the data by subject, by date or by any other field. Furthermore, we
> > can use one of the open-source search engines and this way we could
> > basically search for anything in the database. Currently I'm thinking
> > about storing a thread id number with every message that belongs to a
> > certain thread; then, when searching for a keyword in the database, it
> > would be very easy to return the whole conversation (just select all the
> > messages that have the same thread id as the message in which the word
> > was found).
> >
>
> Again, look at hyperkitty.  I believe that's what it does for the messages.
>
> I think that searching through full message text for individual words in
> a relational database is going to be very slow.  You probably need an
> auxiliary data structure to help you out.
>
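
One possible shape for such an auxiliary structure is an inverted index
that maps each word to the ids of the messages containing it. The sketch
below is an in-memory illustration only; the function names are
hypothetical, and engines such as Lucene use far more elaborate on-disk
versions of the same idea:

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(messages):
    """Build word -> set of message ids from a dict of id -> body text."""
    index = defaultdict(set)
    for msg_id, body in messages.items():
        for word in tokenize(body):
            index[word].add(msg_id)
    return index


def search(index, query):
    """Return the ids of messages that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

A lookup touches only the sets for the query words, not every message body,
which is the point of keeping the index alongside the stored messages.
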
>
> > For the dynamic lists project I would like to know if the current system
> > used by Systers is stable and if my only task would be to fully integrate
> > it with Mailman 3.0 or if my task will be to come up with a new proposal
> > and implementation of the system. If my task consists of only integrating
> > the existing system with Mailman 3.0, will all I have to do be to
> > understand well how the code works, including the custom changes made by
> > you, and find a way to successfully integrate it with Mailman 3.0?
> >
>
> I'm now confused -- you are talking about a completely different project,
> right?  Yes, the integration with mailman 3.0 is to take the existing
> system and make it work with mailman 3.0.  It is a smaller project
> (suitable for a junior student), but there is still a summer's worth of
> work here.
>
>
>
> > For now I would like to know your opinion regarding the presented ideas,
> > if they are good enough to be used and if I should start writing my
> > application and focus only on extending the search capabilities and
> > implementing the dynamic list with Mailman 3.0, leaving the UI part
> > aside. I also thought about the possibility of creating a nice UI for
> > Mailman using Django, but for now, as I said, I would like to know if the
> > presented ideas are good enough to be accepted and if I should focus only
> > on them. Your advice is very valuable because you can evaluate the work
> > volume much better than I can, and I do not want to end up over-promising
> > and under-delivering.
> > In regard to the timeline, I could definitely do some work during the
> > bonding period in order to make up for the final exams period so that I
> > would be fine at the mid point. Furthermore, I do not have any problems
> > with re-planning if that becomes necessary.
> >
>
> If you are going to focus on the archiver, especially search, you will need
> to present some sort of UI, but I think your goal should be to have a very
> solid backend, and as much of a UI prototype as you can get done in the
> summer.  The hyperkitty project may have a student working for mailman
> working on some UI ideas, so you may be able to fit your UI into those
> ideas, or you might work on just the part of the UI that relates to search
> (there is a lot more to an archiver UI than search -- check out some of the
> mock screenshots in hyperkitty to see where else this is going.)
>
> The archiver project is a big project for a first year undergraduate.  You
> seem to be willing to tackle an ambitious project, so this really depends
> on how likely you are to get "stuck".  If you can make regular progress
> with turnaround like I am giving you, perhaps slower, even if that progress
> is just to expose (and solve) one unexpected problem after another, we can
> make this into a successful project.  If that doesn't sound like fun, you
> may want to scale back to something more like the patches project or the
> porting to mailman3.0 project.
>
>
> > I am looking forward to hearing your advice!
> >
> > Thank you,
> > Emanuel DANCI
> >
> >
> > ________________________________
> >  From: Robin Jeffries <robin at jeffries.org>
> > To: Danci Emanuel <danci_emanuel at yahoo.com>
> > Cc: "systers-dev+eligibility at systers.org" <
> > systers-dev+eligibility at systers.org>
> > Sent: Wednesday, April 4, 2012 7:11 AM
> > Subject: Re: [Systers-dev] GSoC 2012 project
> >
> >
> > Some answers in line.  I think you are about to find your fit.
> >
> >
> >
> > On Tue, Apr 3, 2012 at 2:17 PM, Danci Emanuel <danci_emanuel at yahoo.com>
> > wrote:
> >
> > >Hello Robin!
> > >
> > >Thanks for the rapid response and I apologize for not responding right
> > >away but I had the mid-term exams and I was a little busy with them.
> > >Also, thanks for the provided guidelines in regard to which project to
> > >choose.
> > >Indeed, algorithms are a very interesting part of the computing world
> > >for me, but I am also interested in learning other valuable practices
> > >involved in the software development process.
> > >I took a look at the Archive access project and I also read the details
> > >about what the other students have done in order to improve Mailman
> > >archive Access/Searching. Furthermore, I took a look at the ideas that
> > >Mairin Duffy has for creating a richer web interface for the mailing
> > >lists. So far I think that the ideas that she proposed with the mock-ups
> > >presented on her website would be one of the best ways of tackling the
> > >problem, because it could solve three problems in just one shot:
> > >
> >
> > Good, we and Mailman are both interested in some of these ideas.
> >
> > >First of all, by creating a web interface similar to the one that she
> > >posted on her web site we could enhance the access experience and we
> > >could also expose to the users numerous options and statistics that
> > >currently are not available.
> > >Second of all, we could implement the dynamic sub-lists very easily this
> > >way.
> > >
> >
> > I'm not sure what this sentence means.  You probably need to explain it
> > more
> >
> > >Third of all, I think we could use the code that Priya Kuber has already
> > >written and we could add some features to it in order to extend its
> > >usability (e.g. as far as I have seen from the quick look that I took at
> > >her code and at the description, it does not have the capability of
> > >searching for keywords contained in the body of the messages, and I think
> > >this would be a nice feature to have).
> > >
> >
> > Yes, that's a good place to start.  Priya is still on this list (hi,
> > Priya) and I think we can arm twist her into giving you -- or any other
> > student chosen for this project -- some help getting started.
> >
> >
> > >Do you think that this is do-able in 12 weeks? The only time-related
> > >problem that could appear is the fact that during the first 3 weeks of
> > >June I have to take the final exams and during that period I will have
> > >to work at half-capacity, but I am sure that I can catch up along the
> > >way by working extra hours or during the weekends.
> > >
> >
> > If you have time to work during the bonding period (from the time you
> > are notified till mid-May), that could make up for it. And if you really
> > can put in 3 weeks of somewhat high quality work (meaning that you
> > understand the work to be done and are ready to start coding on at least
> > part of it, so that you can be productive) at half time during your
> > exams, we could make this work.  We are willing to be flexible with
> > students who have some conflicts, but we have to notify Google about the
> > work you have done at the mid point and at the end, and you will have to
> > think about whether we will be able to honestly say you have done 6 weeks
> > of full-time-equivalent good work by July 9.
> >
> > I know that several students want to work on the archive project, so you
> > will need to be flexible (in case we have the resources to take more
> > than 1 of you, you will have to replan), but you should write an
> > application assuming you are the only one working on archives.  I would
> > pick a small number of Mairin's ideas as the core functionality you want
> > to provide, propose a time line for that, and include the work it will
> > take to hook those ideas up to the necessary backend (for which you
> > should be looking at hyperkitty -- mentioned in another thread -- and at
> > Priya's work).
> >
> > Remember, if systers is going to support you in the archive project, we
> > want your proposal to include how you will support dynamic sublists, and
> > also that our main concern is the search aspect -- how do you find
> > something that was posted 2 years ago? How do you find the entire
> > conversation that it was posted in?  I strongly suggest you find a large
> > mailman list (one of the mailman developers lists would be ideal) and try
> > to find something you think would have been discussed there using the
> > current archives.  It will help you feel the pain of current users.
> >
> >
> > >I know that there is little time left, but I would like to get some
> > >feedback and some guidance, in order to clarify the direction in which
> > >to go and to be able to make a clear timeline for the application.
> > >Thank you very much!
> > >
> >
> > Good luck,
> >
> >
> > >Emanuel DANCI
> > >
> > >
> > >________________________________
> > >
> > >
> > >Well, even with this info, it's hard to tell.  Integrating dynamic
> > >sublists into Mailman 3.0 is critical to systers, while Other mailman
> > >extensions is closer to mailman and you would probably end up with at
> > >least 1 mentor from the mailman project.
> > >
> > >The dlists project will require you to understand mailman (and our
> > >changes) at a relatively deep level -- it's a large code base, and
> > >learning to understand such a system would be good for you.  It may also
> > >introduce you to python packages that you are not already familiar with.
> > >I think it will be straightforward and easy to make progress.  For
> > >someone with your experience, it might be useful to take on an additional
> > >project at the end of the summer, as, if you know python well, this might
> > >only take you a month or less.  You will learn some new algorithms, but
> > >probably won't create any.
> > >
> > >The other mailman extensions will require that you decide what
> > >extensions are valuable, how people will want to use them (so you'll get
> > >introduced to use cases, if you aren't already familiar with them).
> > >There will be lots of opportunities to work out new algorithms, but
> > >that's only a small part of the project.
> > >
> > >You might also look into the Archive access project -- it's about making
> > >the archives usefully searchable.  There is an active mailman project on
> > >this, which should get you started, and might make this accessible to an
> > >undergraduate. There is a use case component to this too, but there
> > >should be a lot of algorithm work too, if that is what you like. Look
> > >into it and see if it appeals to you.
> > >
> > >I've given some advice about the Other Mailman extensions project
> > >already.  You should search the archives for that.  For the others, do a
> > >little research and ask some questions.  That will enable us to give you
> > >more concrete advice.
> > >
> > >Robin
> > >
> > >On Mon, Mar 26, 2012 at 3:29 PM, Danci Emanuel <danci_emanuel at yahoo.com
> > >wrote:
> > >
> > >>
> > >>
> > >> Thank you for the prompt reply! This is great news! First of all, let
> > >> me introduce myself:
> > >> I am a 1st year undergraduate student pursuing a Computer Science
> > >> degree at the "Politehnica" University of Timisoara, Romania. I have a
> > >> keen interest in software development and in solving algorithmic and
> > >> mathematical problems, and up to this moment I have gained programming
> > >> experience by participating in and winning multiple regional and
> > >> national contests in algorithms and project-based software development
> > >> competitions and by creating several pet projects. The programming
> > >> languages that I have used so far and the associated level of
> > >> experience for each of them are: C, C#, Python, MySQL (intermediate),
> > >> C++ (beginner/intermediate), nsis scripting language (beginner).
> > >> Currently I am building a simulation tool for solar panels for one of
> > >> the professors from our university, and I am also participating in the
> > >> Competitive Programming Seminar at our college, where we train for
> > >> programming competitions like the ACM or Challenge24 by solving
> > >> problems from previous years and also by studying new algorithms (audio
> > >> and image processing algorithms, linear programming algorithms etc).
> > >> This is who I am, in a few words, and from what I have already read on
> > >> the mailing list, I consider this a great opportunity to contribute to
> > >> a great project by creating new features that will have a positive
> > >> impact on a large number of users. I have read the proposed project
> > >> ideas, and although I found these two projects very interesting, "Other
> > >> mailman extensions" and "Integrating Dynamic Sub-lists with Mailman
> > >> 3.0", I would like to get some advice in regard to which project to
> > >> choose depending on which one would be more useful for the Systers'
> > >> community and also which one would be more suitable for me.
> > >> Furthermore, it would be nice to get some guidance in order to make a
> > >> good application.
> > >>
> > >> Thank you,
> > >>
> > >> Emanuel DANCI
> > >>
> > >>
> > >> To unsubscribe from this conversation, send email to
> > >> <systers-dev+eligibility+unsubscribe at systers.org> or visit
> > >> <http://systers.org/mailman/options/systers-dev?override=180&preference=0>
> > >> To contribute to this conversation, use your mailer's reply-all or
> > >> reply-group command or send your message to
> > >> systers-dev+eligibility at systers.org
> > >> To start a new conversation, send email to <systers-dev+new at systers.org>
> > >> To unsubscribe entirely from systers-dev, send email to
> > >> <systers-dev-request at systers.org> with subject unsubscribe.
> > >>
> > >
> > >
> > >
> >
> >
> >
>
>
>
>
>
